Theoretical Challenges in Corpus Design
In corpus linguistics, one of the most challenging aspects of corpus design is representativeness. This chapter reviews the literature on corpus representativeness and evaluates the approach to representativeness taken by the compilers of the spoken component of the original British National Corpus (BNC) and of other spoken corpora. Existing spoken corpora are found to lie on a continuum between probability sampling and convenience sampling, with most sitting towards the convenience-sampling end. The chapter then describes the original design of the new Spoken British National Corpus 2014 (BNC2014), summarising it as “informal spoken British English, produced by L1 speakers of British English in the mid-2010s, whereby British English comprises four major varieties: English, Scottish, Welsh and Northern Irish English”.