ABSTRACT

In the move towards building a comprehensive and fully functional computer-stored corpus of English, opinions inevitably differ about what the corpus should include over and above the transcribed text itself. Some sections of the research community have argued for the extreme position that corpora should contain nothing but the transcribed text, on the grounds that the transcribed text is the raw data, which, in order to serve the needs of all interested researchers, should be kept separate from any analytical material. This is not, however, a valid argument for a corpus of spoken English: in this case the raw data is clearly the original speech, together with the situational features in operation at the time of the encounter, so that any transcription is in fact a translation of the data of the speech event. Any transcribed corpus of spoken language is therefore a bank of pre-processed data which has already been structured and categorized in some way in order to present it in orthographic form.