ABSTRACT

The essence of our approach to categorization (see [Hayes et al., 1990] for a detailed description with examples) is to identify themes or concepts in a text by matching phrases related to those concepts. Phrases are specified as patterns of words built using arbitrary nestings of disjunction, negation, skip (up to n words), and optionality operators. An example might be the phrase “cold rolled bars” or the word “lead”, so long as it is not preceded by “to” or followed within 3 words by the word “manager”. Morphological equivalence sets (noun forms, verb forms), case (upper case, lower case, capitalized), punctuation, and wildcards may also be specified. Individual patterns are weighted by how strongly they indicate the concept, and the sum of the weights of the patterns that match gives the strength of occurrence of the concept in the text. Categorization decisions are made by if-then rules, which take into account which concepts are identified in the text, which part of the text they appear in, and the strength with which they occur. The patterns, concepts, and categorization rules are all application-specific. Their development for a particular application is a knowledge engineering task. The approach is appropriate for categorization tasks in which the categories can be defined in advance, have definitions that are specific and firm, and are directly related to the content of the text, rather than to the interest of the reader. Thus, corporate acquisitions would be an appropriate category, but events of political significance would not.
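
The matching and scoring steps described above can be illustrated with a minimal sketch. It is not the system's actual pattern language or implementation. The function names, the concept name "metals", the weights, and the threshold are all hypothetical, chosen only for illustration. The sketch also assumes that the word excluded before "lead" is "to", and that each pattern contributes its weight once, however many times it matches.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_phrase(tokens, phrase):
    """Return the start indices at which the word sequence `phrase` occurs."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def match_lead(tokens, skip=3):
    """Match 'lead' unless it is preceded by 'to' (assumed) or followed
    within `skip` words by 'manager'."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != "lead":
            continue
        if i > 0 and tokens[i - 1] == "to":
            continue  # negation: verb reading, "to lead"
        if "manager" in tokens[i + 1:i + 1 + skip]:
            continue  # negation with skip: "lead ... manager"
        hits.append(i)
    return hits

# Hypothetical concept definition: (matcher, weight) pairs.
# The weights are illustrative, not taken from the paper.
METALS = [
    (lambda t: match_phrase(t, ["cold", "rolled", "bars"]), 5),
    (match_lead, 2),
]

def concept_strength(tokens, patterns):
    """Sum the weights of the patterns that match at least once."""
    return sum(weight for matcher, weight in patterns if matcher(tokens))

def categorize(text, threshold=4):
    """If-then rule: assign 'metals' if the concept strength reaches the
    (hypothetical) threshold."""
    tokens = tokenize(text)
    return ["metals"] if concept_strength(tokens, METALS) >= threshold else []
```

In this sketch, a text mentioning both "cold rolled bars" and an unqualified "lead" scores 7 and is assigned the category, while "lead manager" or "to lead" contributes nothing. A fuller treatment would add the disjunction, optionality, morphological, and case operators, and would let rules inspect where in the text a concept occurs.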