ABSTRACT

In this chapter we explore the key considerations involved in compiling written corpora. We adopt a consciously broad view of what constitutes a “written” text for this purpose, taking it to include computer-mediated and web language in addition to texts that might be considered written in a more traditional sense (e.g. handwritten texts and printed newspaper articles). The chapter is structured around three key areas of concern in corpus compilation: 1) design (including authenticity, representativeness, balance and size); 2) ethics and copyright; and 3) text gathering and processing (including text collection, cleaning and encoding). Special attention is given to the theoretical and practical concerns associated with written texts that pose particular challenges to corpus builders, such as historical texts and computer-mediated language. However, many of the issues explored in this chapter are relevant to the compilation of corpora of any kind.