ABSTRACT

The aim of clustering a set of documents is to divide them into groups, called clusters, that are not defined a priori, so that the documents’ lexical profiles are similar within a cluster and differ markedly from one cluster to another. The clustering must take the whole of the retained vocabulary into account, which calls for a multidimensional approach to cluster construction. Clustering in all its forms is exploratory in nature and makes no prior assumptions about the structure to be found. Whatever the clustering method, a measure of dissimilarity between statistical units must be defined. This dissimilarity may or may not be a distance in the mathematical sense of the term. In addition, any hierarchical clustering requires an aggregation method, that is, a rule for measuring the dissimilarity between groups of units. Although clustering methods are flexible and can work with a range of distances or dissimilarity measures, the chapter considers the case where the document points lie in a Euclidean space obtained from an earlier factorial analysis.