ABSTRACT

Corpus linguistics uses large collections of natural spoken and written texts that are stored on computers. It provides a powerful tool for analyzing natural language and can yield insights into how language use varies across situations, such as speech versus writing or formal interaction versus casual conversation. One of the most important factors in corpus linguistics is the design of the corpus. A well-designed corpus should aim to be representative of the types of language it includes, but there are many different ways to conceive of and justify representativeness. Creating a corpus involves data collection: obtaining or creating electronic versions of the target texts, then storing and organizing them. For a written corpus, data collection most commonly means using a scanner and optical character recognition software to convert paper documents into electronic text files.