How to use statistics in quantitative corpus analysis

doi:10.4324/9780367076399-13

ABSTRACT

This chapter surveys statistical applications in contemporary corpus linguistics from two angles. First, from a corpus-linguistic angle, I discuss the four main corpus-linguistic methods: frequency, dispersion, association/contingency, and concordances/context. For each method, I exemplify different measures that have been proposed and some of their central theoretical or applied uses. Second, from a statistical angle, I discuss the two main statistical approaches that are routinely applied to corpus data: regression/classification approaches (often involving hypothesis testing and/or machine learning methods such as regression modeling or tree-/forest-based approaches) and exploratory approaches (usually involving hypothesis-generating methods such as cluster or principal component analysis).