ABSTRACT

This chapter introduces readers to data sources, known as corpora, that they can use to address a range of research topics. Corpora may be based on spoken or written language. Where spoken language is concerned, some corpora comprise only transcripts, while others contain both transcripts and audio, a goldmine for researchers interested in phonetic and phonological variation. Corpora of written language can contain newspaper and academic writing, correspondence, literature and online communications. Video corpora of signed languages are available too, as are many corpora of languages other than English, including multilingual corpora containing parallel translations of the same text. Contextual style is well known to affect sociolinguistic variation. A recent advance in corpus research is the development of software for the automatic time-alignment of speech. Starting with an orthographic transcription, these tools create a phonetic transcription and then automatically match every word and phoneme in these transcriptions to its precise point of occurrence in the audio.