ABSTRACT

The first four chapters in this section deal with theoretical and practical issues relating to the transcription and coding of spoken language in machine-readable form. Transcription is the process of representing spoken language in written form: how broad or narrow should that representation be? How can transcription be made useful to, and usable by, a wide range of users? How can we overcome the limitations of the written medium? Coding (also known as ‘tagging’ or ‘annotation’) relates to more abstract attributes of the text: you might want to label grammatical, semantic, pragmatic or discoursal categories (to indicate, for example, that a word is a proper noun, that its use is restricted in some way, that a particular utterance was said in a sarcastic manner, or that it was used to bring an interaction to a close). Chapters 5 and 6 focus on issues of mark-up, the process of making texts machine-readable in ways which facilitate the interchange of data between users. The final chapter is rather different in nature: it is an edited transcript of an unscripted talk delivered interactively at the Lancaster Workshop on Computerized Spoken Discourse, held in September 1993. In this chapter, John Sinclair responds to the issues raised in the previous chapters. If we were constructing corpora in an ideal world, the issues raised in the first six chapters regarding delicacy of transcription and coding and detailed mark-up might all be taken on board. However, Sinclair, speaking from his experience of many years working with large corpora of spoken language, discusses how, in practice, issues of cost and usability affect the transcription, coding and mark-up of very large corpora.