ABSTRACT

To date, automatic language processing has been possible only within the tightly constrained context of a sublanguage. The generality of the sublanguage approach therefore depends on the ability to adapt (or port) existing systems to new domains (sublanguages). Successful portability requires the definition of a restricted set of semantic relations adequate for natural language processing and the rapid, cost-effective acquisition of these relations for each new domain. This chapter explores a specific approach that identifies a limited set of relations for language processing and discusses the techniques available for automating the acquisition of this information.

The approach is based on the Linguistic String Project system, which uses a domain-independent grammar augmented by a very limited set of domain-specific relations. This domain-specific information comprises a set of distributionally based semantic classes and the relations among those classes. These relations form the basis for stating the allowable subject-verb-object and host-modifier combinations of classes, which are used to eliminate incorrect parses. The basis for automating the discovery procedure is the distributional hypothesis: words carrying similar kinds of information appear in similar syntactic environments.
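The distributional hypothesis can be illustrated with a minimal sketch. The example below measures how similar two words' syntactic environments are, using invented (word, context) observations of the kind a parser might extract from a medical sublanguage corpus; all words, relation labels, and counts are hypothetical, and the chapter's actual discovery procedure is not reproduced here.

```python
from collections import Counter
from math import sqrt

# Hypothetical (word, syntactic-context) observations, as might be
# extracted from parsed sentences in a medical sublanguage corpus.
# Each context is a (relation, governing-word) pair; all data are
# invented for illustration.
observations = [
    ("penicillin", ("OBJECT-OF", "administer")),
    ("penicillin", ("OBJECT-OF", "prescribe")),
    ("ampicillin", ("OBJECT-OF", "administer")),
    ("ampicillin", ("OBJECT-OF", "prescribe")),
    ("fever",      ("OBJECT-OF", "exhibit")),
    ("fever",      ("SUBJECT-OF", "subside")),
    ("cough",      ("OBJECT-OF", "exhibit")),
    ("cough",      ("SUBJECT-OF", "subside")),
]

def context_vector(word):
    """Count the syntactic environments in which `word` occurs."""
    return Counter(ctx for w, ctx in observations if w == word)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = lambda c: sqrt(sum(n * n for n in c.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Words sharing environments (candidates for one semantic class)
# score high; words from different classes score low.
print(round(cosine(context_vector("penicillin"),
                   context_vector("ampicillin")), 6))  # close to 1.0
print(cosine(context_vector("penicillin"),
             context_vector("fever")))                 # 0.0
```

Words whose environment vectors are highly similar become candidates for membership in the same semantic class; a clustering step over such similarities would then propose the classes themselves.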

The chapter summarizes the results of a series of experiments on the automatic generation of sublanguage semantic classes and sublanguage semantic patterns. It discusses the problems posed by omitted material and by the phenomenon of phrasal attributes, and how both of these affect distributional data. It concludes with some possible ways to approach the circularity inherent in the distributional approach: obtaining good distributional data depends on a correct syntactic analysis, while a correct syntactic analysis in turn depends on having the semantic classes and class patterns for the particular domain already available.
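One half of this circularity, the dependence of parsing on class-based subject-verb-object patterns, can be sketched as follows. The word classes, pattern inventory, and candidate analyses below are invented for illustration; they stand in for, but do not reproduce, the patterns discussed in the chapter.

```python
# Hypothetical word-to-class assignments for a medical sublanguage.
word_class = {
    "patient": "PT", "physician": "PT",        # assumed person class
    "penicillin": "MED", "ampicillin": "MED",  # assumed medication class
    "administer": "V-TREAT", "exhibit": "V-SHOW",
    "fever": "SIGN",
}

# Allowable subject-verb-object patterns over classes, as might be
# derived from a previous round of distributional analysis.
allowed_svo = {("PT", "V-TREAT", "MED"), ("PT", "V-SHOW", "SIGN")}

def acceptable(parse):
    """Keep a candidate parse only if its S-V-O classes fit a pattern."""
    s, v, o = parse
    return (word_class[s], word_class[v], word_class[o]) in allowed_svo

# Two candidate analyses of an ambiguous sentence; the class patterns
# select the plausible one and eliminate the other.
candidates = [("physician", "administer", "penicillin"),
              ("penicillin", "administer", "physician")]
print([p for p in candidates if acceptable(p)])
# -> [('physician', 'administer', 'penicillin')]
```

The other half of the circle is that the class assignments and patterns used here must themselves be derived from distributional data over parses, which motivates the bootstrapping strategies the chapter considers.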