ABSTRACT

The most basic type of corpus analysis is checking how frequently a given word or phrase occurs. This chapter describes the most important types of corpus analysis: frequency analysis and concordancing, wordlists, cluster (n-gram) analysis, and keyword analysis. As a quantitative approach to linguistic analysis, corpus linguistics relies on a number of statistical tests to determine whether differences between sets of data are statistically significant. Log-likelihood is a test used to compare frequency values across different sets of data. T-score and mutual information (MI) are corpus statistics commonly used to identify collocations. Rather than relying on raw frequency, which is "too unreliable a guide as to the strength of association between collocates", these tests indicate whether the co-occurrence of words is statistically significant or whether it can be attributed to chance.
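
As a rough illustration of the three statistics named above, the following Python sketch computes them from raw counts. It assumes the forms commonly used in corpus linguistics: a two-term log-likelihood comparing one word's frequency in two corpora, MI as the base-2 logarithm of observed over expected co-occurrence, and t-score as observed minus expected divided by the square root of observed. The function names, the omission of any collocation-span adjustment, and all counts in the usage note are assumptions made for illustration; they are not taken from the chapter.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-term log-likelihood for one word's frequency in two corpora.

    freq1, freq2: occurrences of the word in corpus 1 and corpus 2
    size1, size2: total number of tokens in each corpus
    """
    total = freq1 + freq2
    # Expected frequencies if the word were spread evenly across both corpora
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def expected_cooccurrence(freq_node, freq_collocate, corpus_size):
    """Co-occurrence count expected by chance (assumption: no span adjustment)."""
    return freq_node * freq_collocate / corpus_size


def mutual_information(observed, freq_node, freq_collocate, corpus_size):
    """MI score: log2 of observed over expected co-occurrence."""
    expected = expected_cooccurrence(freq_node, freq_collocate, corpus_size)
    return math.log2(observed / expected)


def t_score(observed, freq_node, freq_collocate, corpus_size):
    """T-score: (observed - expected) / sqrt(observed)."""
    expected = expected_cooccurrence(freq_node, freq_collocate, corpus_size)
    return (observed - expected) / math.sqrt(observed)
```

With hypothetical counts, `log_likelihood(20, 1000, 10, 1000)` compares a word occurring 20 times in one 1,000-token corpus and 10 times in another, while `mutual_information(8, 100, 100, 10000)` and `t_score(8, 100, 100, 10000)` score a word pair that co-occurs 8 times where chance alone would predict 1. Corpus tools differ in how they adjust the expected value for the collocation span, so scores from this sketch need not match those reported by a particular tool.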