Challenges in Transcription, Part II – Who Said What?
This chapter explores a hitherto under-researched aspect of spoken corpus transcription: ‘speaker identification’. Speaker identification is the degree of confidence with which transcribers can identify the speaker responsible for each turn in a spoken corpus transcript. Using the new Spoken British National Corpus 2014 (Spoken BNC2014) as a detailed case study, this chapter reports on a series of investigations which aim to inform understanding of the nature of this challenge. The findings suggest that speaker identification can prove very difficult for transcribers – so difficult that, in circumstances such as recordings with several speakers, transcribers regularly get it wrong without realising it. It is estimated that up to a quarter of the texts in the Spoken BNC2014 could be affected by inaccurate speaker identification. Two solutions are proposed: giving users the option to exclude from any given analysis the utterances or transcripts most likely to have been affected by poor speaker identification; and visualising uncertain speaker identification in the CQPweb interface for the Spoken BNC2014. The analysis in this chapter has implications for other spoken corpora which, it is suggested, ought to be investigated for the same problem.