ABSTRACT

A sublanguage is characterized by distinctive specializations of syntax and the occurrence of domain-specific word subclasses in particular syntactic combinations. The Linguistic String Project of New York University has studied several sublanguages in detail over the past 15 years and developed computer methods for obtaining the relevant word classes and relations from samples of syntactically analyzed domain sentences. The methods are illustrated in application to articles in the lipoprotein literature. It has also proved possible to measure such features as the quantity, density, and complexity of information in the sentences of contrasting sublanguages.

The special word classes and relations of a particular sublanguage provide the basis for a variety of natural language processing applications that would not be practicable for the language as a whole. For example, it is possible (with difficulty) to process full texts in a sublanguage and convert the free-text information into a structured form suitable for fact retrieval and data summarization. The information structures arrived at in such processing are similar in certain respects to data models used in database management systems, and suggest the possibility of adapting such systems for the management of natural-language-derived databases.
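
The conversion of free text into a structured form can be pictured with a minimal sketch. It assumes that each sentence has already been reduced to a (subject, verb, object) triple and that every word has been assigned to a sublanguage word class. The class names, the toy triples, and the helpers `to_record` and `retrieve` are hypothetical illustrations, not the Linguistic String Project's actual classes or programs.

```python
# Hypothetical word classes for a toy lipoprotein sublanguage (assumed, hand-assigned).
WORD_CLASS = {
    "HDL": "LIPOPROTEIN",
    "LDL": "LIPOPROTEIN",
    "cholesterol": "LIPID",
    "triglyceride": "LIPID",
    "transports": "TRANSPORT_VERB",
    "carries": "TRANSPORT_VERB",
}

# Hypothetical output of syntactic analysis: one (subject, verb, object) triple per sentence.
TRIPLES = [
    ("HDL", "transports", "cholesterol"),
    ("LDL", "carries", "cholesterol"),
    ("LDL", "transports", "triglyceride"),
]


def to_record(triple):
    """Map a triple to a record whose columns are word-class names.

    Returns None if any word has no assigned class, so that
    unclassified material is never silently forced into a column.
    """
    record = {}
    for word in triple:
        word_class = WORD_CLASS.get(word)
        if word_class is None:
            return None
        record[word_class] = word
    return record


def retrieve(records, **criteria):
    """Return the records whose columns match every given column=value pair."""
    return [
        r for r in records
        if all(r.get(column) == value for column, value in criteria.items())
    ]


# Build the structured table, skipping triples that could not be classified.
RECORDS = [r for r in (to_record(t) for t in TRIPLES) if r is not None]

if __name__ == "__main__":
    # Fact retrieval: which lipoproteins are reported with cholesterol?
    for row in retrieve(RECORDS, LIPID="cholesterol"):
        print(row["LIPOPROTEIN"], row["TRANSPORT_VERB"], row["LIPID"])
```

Because each column holds words of a single class, the resulting records resemble rows of a relational table, which is the sense in which such structures invite comparison with database data models.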