ABSTRACT

This chapter discusses the fundamentals of corpus analysis and introduces publicly available software tools through which researchers can upload and explore their own corpora and compare them with pre-existing ones. The chapter makes particular use of #LancsBox, a 'toolbox' of corpus analysis software programs developed at Lancaster University. To compare frequencies between corpora of different sizes, the data are 'normalised', which makes the significance of the raw numbers easier to interpret. Statistical measures constitute some of the parameters that researchers manage in their corpus analysis according to their research aims. Corpora of Web data allow researchers to determine the range and characteristics of Web registers. The process of segmenting a text into words is called tokenisation; it is an important step, since many subsequent forms of analysis rely on the accurate identification of tokens in the first instance.
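
As a brief illustration of the normalisation mentioned above: a raw frequency is scaled to a common basis, conventionally per million words, so that counts from corpora of different sizes become directly comparable. The Python sketch below is minimal and hypothetical; the function name and the example figures are not drawn from the chapter.

    # Minimal sketch of frequency normalisation: a raw count is scaled to a
    # common basis (per million words, by convention) so that frequencies
    # from corpora of different sizes can be compared directly.
    def normalised_frequency(raw_count: int, corpus_size: int,
                             basis: int = 1_000_000) -> float:
        """Return the frequency of an item per `basis` words of the corpus."""
        return raw_count / corpus_size * basis

    # Hypothetical example: 500 hits in a 2-million-word corpus versus
    # 120 hits in a 400,000-word corpus. The raw counts suggest the first
    # corpus uses the item more; the normalised figures show the opposite.
    print(normalised_frequency(500, 2_000_000))  # 250.0 per million words
    print(normalised_frequency(120, 400_000))    # 300.0 per million words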
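
Tokenisation can likewise be sketched in a few lines. The naive regex-based tokeniser below is purely illustrative and is not the method used by #LancsBox or any particular tool; real tokenisers handle punctuation, clitics, and hyphenation with far more care.

    import re

    # Minimal sketch of tokenisation: segmenting running text into word
    # tokens, the units on which subsequent corpus analysis depends.
    def tokenise(text: str) -> list[str]:
        """Split text into lowercase word tokens, keeping internal apostrophes."""
        return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

    print(tokenise("The corpus doesn't tokenise itself."))
    # ['the', 'corpus', "doesn't", 'tokenise', 'itself'] -- five tokens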