ABSTRACT

Corpus linguistics is a discipline based on distributional data: ‘things’ – words, morphemes, semantic features – occur in corpora or not, they co-occur with other things or they do not, their frequencies of occurrence or co-occurrence are proportional to those of other ‘things’ or not, etc. Thus, as mentioned in Section 2.1, all one really obtains from corpora are frequencies of occurrence and co-occurrence. This has two important corollaries, the first of which was also mentioned above: Whatever a corpus linguist is interested in needs to be operationalized and interpreted in terms of frequencies. The second is just as important: If you have frequencies and other distributional data, then you need to employ the tools of the discipline that is concerned with frequencies and distributions, which is statistics. Thankfully, over the past ten or so years, corpus linguistics has evolved considerably in terms of the statistical tools that are being used, but (1) the obvious fact that a corpus linguist needs statistical expertise is still not as widely accepted as it should be – scholars in many other disciplines that deal with much more well-behaved data (i.e., smaller and more balanced data sets) accepted their statistical needs much earlier; and (2) while more statistical methods are used these days, they are often applied and/or reported on incorrectly. While I cannot provide a full-fledged introduction to statistics for (corpus) linguists in this book – for that, see Gries (2013), which, as mentioned above, is to some extent a companion volume to this one – in this chapter I will introduce some of the absolute basics of statistical thinking, analysis, and visualization that can aid corpus-linguistic research.