Challenges in Data Collection
In corpus linguistics, spoken corpora are notoriously difficult to compile. This is especially true of large national spoken corpora such as the new Spoken British National Corpus 2014 (Spoken BNC2014). Using the Spoken BNC2014 as a case study, this chapter explores the main practical components of spoken corpus data collection: speaker recruitment, metadata and audio data. For each component, the approach taken by the compilers of the original British National Corpus (BNC1994) is discussed first, followed by an account of how the Spoken BNC2014 team addressed the challenges involved. Speaker recruitment followed a contributory public participation in scientific research (PPSR) model, whereby contributors were recruited through channels of public communication including broadcast, print and social media. Improvements to the metadata collection procedure (compared to the BNC1994) made the Spoken BNC2014 rich in useful metadata about the speakers and the conversational context. Finally, audio data was gathered via an innovative use of contributors’ smartphones, which complemented the PPSR approach of the Spoken BNC2014.