ABSTRACT

This book is concerned with the development and exploitation of computer corpora of spoken discourse (‘corpora’ is the plural of ‘corpus’, meaning a body of language samples). Condensed into a single phrase like this, the field appears technical and mystifying. But let us begin by reconstructing its development from a number of simpler and more familiar processes:

Someone, somewhere (let us call this Person A) decides to undertake the recording of natural spoken discourse, which may be everyday conversation, or some other variety of spoken language. The purposes of doing this are many: for example, they could be to study the language itself (particularly the spoken variety), to study the nature of conversation as a social activity, to help build better dictionaries, or to help build improved machines which will talk or understand speech.

Person B (who is perhaps the same as Person A) sets about transcribing the above recording. This person should ideally be highly trained in phonetics and linguistics, but often is not. Person B has to decide on a set of conventions for transcribing speech, which means deciding which sorts of information from the speech signal should be retained as important, and which should be disregarded. For instance, if the speaker makes a noise like ‘um’, should that be transcribed or ignored? And if it should be transcribed, how?

Person C (perhaps the same person as A and B) decides to computerize Person B’s transcriptions; that is, to convert them to an electronic form that can be read, stored, or processed by computer. At this stage, decisions have to be made about how to use the computer’s symbolization potential (the letters, numbers, and other symbols such as > and %) to represent some or all of the information in the transcription. (This is known as mark-up – see Edwards, Chapter 1.)
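To make the idea of mark-up concrete, here is a minimal Python sketch of how Person C might encode one transcribed utterance as a line of plain text. The conventions used are purely illustrative assumptions, not those of any particular corpus or of Edwards's chapter: a speaker label followed by `>`, a `%` prefix for noises such as ‘um’, and `<pause>` for a silence. The names `Event` and `encode` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One item in a transcription: a word, a pause, or a filler noise."""
    kind: str       # "word", "pause", or "filler"
    text: str = ""  # empty for pauses


def encode(speaker: str, events: list[Event]) -> str:
    """Render a transcribed utterance as one line of marked-up text.

    Assumed (hypothetical) conventions:
      SPEAKER>   introduces the utterance
      %xx        marks a filler noise such as 'um'
      <pause>    marks a silence
    """
    parts = []
    for ev in events:
        if ev.kind == "word":
            parts.append(ev.text)
        elif ev.kind == "filler":
            parts.append("%" + ev.text)
        elif ev.kind == "pause":
            parts.append("<pause>")
        else:
            raise ValueError(f"unknown event kind: {ev.kind!r}")
    return f"{speaker}> " + " ".join(parts)


utterance = [
    Event("word", "well"),
    Event("filler", "um"),
    Event("pause"),
    Event("word", "yes"),
]
print(encode("A", utterance))  # A> well %um <pause> yes
```

Whether the ‘um’ and the pause are kept at all is exactly the kind of decision Person B faces; the mark-up merely fixes how each retained item is symbolized.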

Person D decides to make this resource more useful by adding information which was not in the original transcription itself, but could be derived from it together with a knowledge of the language and the contexts of its use. This amounts to an enrichment of the corpus as a resource, by adding extra layers of information, for example, on the grammatical classes of words, or on the classes of speech acts which have taken place in the course of the transcribed speech. This process of adding enriching information may be termed coding or annotation. Once the corpus has been coded with useful information, the results of the coding can, again, be ‘ploughed back’ into the corpus for other users to benefit from.
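As a rough illustration of annotation, the following Python sketch adds one extra layer, the grammatical class of each word, to a transcribed utterance. Everything in it is assumed for the sake of the example: the toy lexicon, the class labels, the `word_TAG` output layout, and the function names `annotate` and `render`. Real annotation draws on far richer knowledge of the language and of context than a simple look-up table.

```python
# Hypothetical toy lexicon mapping words to grammatical classes.
# The labels are illustrative, not those of any particular corpus.
LEXICON = {
    "i": "PRON",
    "see": "VERB",
    "the": "DET",
    "dog": "NOUN",
}


def annotate(tokens: list[str], lexicon: dict[str, str] = LEXICON) -> list[tuple[str, str]]:
    """Pair each word with its grammatical class; unknown words get 'UNK'."""
    return [(tok, lexicon.get(tok.lower(), "UNK")) for tok in tokens]


def render(annotated: list[tuple[str, str]]) -> str:
    """Write the annotation back alongside the text, one word_TAG per word."""
    return " ".join(f"{word}_{tag}" for word, tag in annotated)


tokens = "I see the dog".split()
print(render(annotate(tokens)))  # I_PRON see_VERB the_DET dog_NOUN
```

Because the annotated form still contains the original words, it can be stored with the corpus and reused by later users, which is the sense in which coding is ‘ploughed back’.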

Person E decides to make use of the corpus (perhaps with various types of codings added) by applying it to a specialized area of research, with practical benefits in mind. The areas of application can be varied. For example, they may include social areas of application, such as investigating the nature of language disabilities, or technological ones, such as building better speaking and listening machines (i.e. speech synthesizers and speech recognizers).