Searching a Corpus | 3 | Doing Corpus Linguistics

ABSTRACT

In this chapter, the authors use the Corpus of Contemporary American English (COCA) to illustrate the units of language that researchers most commonly use in their analyses: words, collocations, and n-grams (lexical bundles) for lexical patterns, and part-of-speech (POS) tags for grammatical patterns. The chapter shows how to identify these units through a series of tasks that give researchers practice in searching for and analyzing them. The COCA site provides ready-made n-gram lists, including bi-, tri-, four-, and five-grams, together with their frequencies in COCA. Collocates are always two-word combinations and are statistically determined; two-word combinations are also called 2-grams or bi-grams. All collocates are bi-grams, but not all bi-grams are collocates. Many other part-of-speech categories could also be of interest for a linguistic study.
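
The distinction between bi-grams and collocates can be made concrete with a small sketch. The Python below is illustrative only: the toy sentence, the function names `ngrams` and `pmi_collocates`, the `min_count` threshold, and the choice of pointwise mutual information (PMI) as the association statistic are all assumptions made for this example, not the chapter's procedure or the method COCA itself uses. It counts every adjacent word pair as a bi-gram and then keeps only the pairs that recur and can be scored statistically.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-token sequence as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_collocates(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Assumption: PMI stands in here for "statistically determined";
    other association measures would serve equally well.
    Pairs seen fewer than min_count times are skipped.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    total_tokens = len(tokens)
    total_bigrams = sum(bigram_counts.values())

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_w1 = unigram_counts[w1] / total_tokens
        p_w2 = unigram_counts[w2] / total_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy text, not taken from COCA.
toy_text = (
    "she drank strong tea and he drank strong tea "
    "while the rain fell on the tea house"
)
tokens = toy_text.lower().split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
scores = pmi_collocates(tokens)

print("distinct bi-grams:", len(bigram_counts))
print("candidate collocates:", sorted(scores, key=scores.get, reverse=True))
```

Every pair in `scores` also appears in `bigram_counts`, but most bi-grams (for example "the tea") never reach `scores`, which mirrors the statement that all collocates are bi-grams but not all bi-grams are collocates. With a real corpus, the threshold and the statistic would need to be chosen to suit the research question.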