Building a corpus to represent a variety of a language

doi:10.4324/9780203856949-9

ABSTRACT

This chapter takes a simple approach, first focusing on monolingual English language corpora, categorising them broadly into general, speech, parsed, historical and specialised, and then touching briefly on multimedia corpora and the concept of the web as a corpus. More specialised dialectal speech corpora include the following: the Newcastle Electronic Corpus of Tyneside English (NECTE), which contains dialect speech from Tyneside in Northeast England; the Limerick Corpus of Irish English (L-CIE), which contains speech from all parts of Ireland; and the Scottish Corpus of Texts and Speech (SCOTS), representing speech from across Scotland, including Scots Gaelic. A brief mention should be made of several corpora that have been parsed. Specialised corpora are usually smaller in scale than general language corpora precisely because of their narrower focus. A growing number of corpora are now fully multimedia in the sense of having transcripts that are aligned or synchronised with the original audio or video recordings.