Corpus Linguistics

doi:10.4324/9780203783733-13

ABSTRACT

What is Corpus Linguistics? Recently, the area of study known as ‘corpus linguistics’ has enjoyed much greater popularity, both as a means to explore actual patterns of language use and as a tool for developing materials for classroom language instruction. Corpus linguistics uses large collections of both spoken and written natural texts (corpora or corpuses, singular corpus) that are stored on computers. By using a variety of computer-based tools, corpus linguists can explore different questions about language use. One of the major contributions of corpus linguistics is in the area of exploring patterns of language use. Corpus linguistics provides an extremely powerful tool for the analysis of natural language and can provide tremendous insights as to how language use varies in different situations, such as spoken versus written language, or formal interactions versus casual conversation.

Although corpus linguistics and the term ‘corpus’ in its present-day sense are pretty much synonymous with computerized corpora and methods, this was not always the case, and earlier corpora, of course, were often not computerized. Before the advent of computers, or at least before the proliferation of personal computers, many empirical linguists who were interested in function and use did essentially what we now call corpus linguistics. An empirical approach to linguistic analysis is one based on naturally occurring spoken or written data, as opposed to an approach that gives priority to introspection. Empirical approaches to issues in linguistics are now the accepted practice, partly as a result of computer tools and resources becoming more sophisticated and widespread. Advances in technology have led to a number of advantages for corpus linguists, including the collection of ever larger language samples, much faster and more efficient text processing and access, and the availability of easy-to-learn computer resources for linguistic analysis. As a result of these advances, there are typically four features that are seen as characteristic of corpus-based analyses of language:

• It is empirical, analysing the actual patterns of use in natural texts.
• It utilizes a large and principled collection of natural texts, known as a ‘corpus’, as the basis for analysis.