ABSTRACT

In the field of linguistics, a body of written or spoken documents is called a corpus. A wide range of corpora, although differing in nature and purpose, can be analyzed using the same exploratory multivariate statistical methods. A corpus needs to be divided into documents, which correspond to the statistical units in the analysis. As contextual variables are defined at the source document level, direct and aggregate analyses differ depending on the role the variables play in a given statistical method. Since the first wave in 1978, various open-ended questions have been introduced, leading Ludovic Lebart to develop an original statistical methodology to address this type of textual data. The choice of textual unit depends on the intended application, the goals of the study, and prior knowledge about the corpus, as well as on the availability of a morphosyntactic analyzer. A univariate description of contextual variables can complement the word and segment indexes.