Challenges in Corpus Processing and Dissemination
This chapter explores the final stages of spoken corpus compilation: processing the corpus transcripts and metadata, and disseminating the completed corpus to the public. The first step is to convert the transcripts into an appropriate format for corpus data interchange and archiving. In the case of the Spoken British National Corpus 2014 (Spoken BNC2014), the format of choice was ‘modest’ Extensible Markup Language (XML). The chapter discusses how the XML conversion process not only produced transcripts in this format but also served as a final check on transcription quality, allowing any remaining errors to be corrected. The Spoken BNC2014 was tagged for part-of-speech (POS) and lemmatised using the Constituent Likelihood Automatic Word-tagging System (CLAWS) with its spoken lexicon, with an estimated error rate of 2.5%. Finally, the chapter discusses the dissemination of the corpus: the Spoken BNC2014 was first made available to the public via CQPweb in 2017, and the XML files followed in 2018 as a free public download for users around the world.