ABSTRACT

Researchers in discourse analysis and critical discourse analysis increasingly rely on corpora and corpus linguistic methods. Using corpora – large and representative samples of natural texts – and corpus linguistic methods in discourse analysis has many advantages. The massive size of many contemporary corpora offers a wealth of data, but it can also lead discourse analysts to lose touch with the individual texts in the corpus, even though the text is the fundamental unit of discourse. In computational linguistics, methods that treat a corpus as an undifferentiated collection of words are often referred to as 'bag of words' approaches: important discourse characteristics such as word order, grammar, cohesion/coherence and textual boundaries are entirely disregarded and replaced by simple frequency data. The word 'text' is a term of art in discourse analysis, and as such it must be clearly defined in this chapter to avoid confusion with more general uses of the word.