ABSTRACT

This chapter sets out the issues involved in building a written corpus. The discussions on size and sampling have necessarily touched on questions of representativeness and balance. In the Published Materials Corpus, which was designed to represent published Business English materials, balance and representativeness were achieved by surveying which books were most widely used in the general market, in order to provide an overview of those actually in use at the time. Publicly available data can be gathered from a variety of sources: newspapers, journals, magazines and a number of sites on the Internet. Although written corpora are purportedly easier to create than spoken corpora, largely because of the problems of transcribing spoken language, there is still a wide range of issues that need to be addressed at all stages of the process, from planning to data gathering and organisation.