ABSTRACT

This chapter presents the results of the first study of sublanguages carried out at the Linguistics Research Center of the University of Texas as part of the Machine Translation project. Our goal is to improve both the efficiency and the quality of automated grammatical analysis of texts. We believe that the issues of speed and quality are closely related, in ways that are explained later. Our approach here is to discover how texts within a single sublanguage resemble each other and how texts in different sublanguages differ. We then propose a means for (semi-)automatically identifying the sublanguage of a new text and optimizing a Natural Language Processing system for that text, so that overall performance may be improved.

The questions we most directly address, then, are these: Are there predictable characteristics of texts said to lie within a single sublanguage, and differences between texts said to be in different sublanguages (i.e., is there such a phenomenon as sublanguage)? If so, how can these characteristics be described, and can the sublanguage of a text be automatically identified? If the sublanguage of a text can be identified, how does one construct a system that can quickly, automatically, and on the fly optimize its performance for that text (sublanguage)?

We begin with a very brief overview of some of the relevant properties of the LRC Machine Translation system (Lehmann et al., 1981), so that our means of gathering data and our conclusions about how one might structure an adaptive system will be apparent to the reader. Afterwards, we describe the experimental setup in which we gathered our data, present and comment on the data, discuss the significance of our findings, and conclude with answers to the questions raised earlier, along with some commentary on the questions raised by the workshop organizers.