ABSTRACT

Corpora offer insight into linguistic practices, as they provide large quantities of real-world observations that span various registers, time periods, regions and social groups. To draw conclusions from such vast amounts of data, corpus linguistic research requires appropriate quantitative analysis. Careful use of statistical techniques can reveal trends and patterns that other forms of analysis cannot. There has been a long-standing trend in linguistics toward statistical analysis, moving away from intuition and selected examples toward descriptive and inferential statistics. Descriptive statistics such as sums, averages and standard deviations allow for objective analysis based on multiple observations. Inferential statistics, on the other hand, allow linguists to assess whether linguistic variation is statistically significant. Within any corpus, some linguistic variation is always expected. Inferential analysis allows us to distinguish whether that variation is simply due to chance or whether it reflects a true underlying difference that may apply beyond our particular data set to the population as a whole. Appropriately conducted inferential analyses also help linguists determine which variables best explain the variation, and to what degree.

This chapter is organized as follows: section 2 describes the workflow for performing a statistical analysis. Section 3 explores two types of regression model, linear regression and logistic regression. Section 4 considers ways of handling variables within statistical models, and section 5 presents additional methods beyond regression. This chapter will not cover the mathematical theory behind the statistical models or the computational techniques needed to carry out these analyses. For readers who wish to carry them out, however, we recommend R, an open-source programming language specialized in statistical analysis that is being widely adopted across academic disciplines.