ABSTRACT

This chapter discusses corpus design and text selection for general language corpora. The principal data source consulted is the purpose-built German-English Parallel Corpus of Literary Texts (GEPCOLT). Comparative data for English and German are supplied by the British National Corpus (BNC) and by the group of corpora referred to here as the Mannheim Corpora. As Sinclair's comments make clear, corpus compilation is the vital first step in any corpus-based study of language, and decisions taken at this stage have ramifications throughout the whole study. In general linguistic research, corpora have traditionally been designed to present a representative sample of the language at large at a specific point in time. Sampling theory is concerned with how knowledge of a whole, the target population, can be inferred from knowledge of a part, or sample, of that population. Situational and demographic criteria, for example, serve as selection criteria in the spoken part of the BNC.