ABSTRACT

This chapter considers key facets of corpus linguistics (CL) design and development in relation to the phonetic and phonological analysis of spoken interaction in all its variability, particularly when spontaneous or naturally occurring speech is involved. Human–computer interaction and a more refined awareness of the indexical capabilities of spoken language are driving both broad-sweeping and fine-grained analyses of spoken language, and CL provides an important tool for the phonetic and phonological research central to such investigations. However, there are gaps in the process of turning large, raw, ecologically valid speech recordings into segmented, phonetically transcribed and searchable data. Such hurdles remain significant for a well-documented and widely studied language such as English; they are even more challenging for under-resourced languages, which can have a fragmented digital presence. Some of the promising advances that could help automate aspects of the pre-processing work needed to produce transcripts of the spoken data and time-aligned segmentation are discussed here. CL and the study of phonetic/phonological variation can mutually benefit from addressing the complexities raised in this chapter, with a view to overcoming the challenges posed by transforming raw spoken data into a ready resource.