Constrained Partitional Clustering of Text Data: An Overview

doi:10.1201/9781420059458-15

ABSTRACT

Clustering is ubiquitously used in data mining as a method of discovering novel and actionable subsets within a set of data. Given a set of data X, the typical aim of partitional clustering is to form a k-block set partition Π_k of the data. The process of clustering is important since, being completely unsupervised, it allows the addition of structure to previously unstructured items such as free-form text documents. For example, Cohn et al. (12) discuss a problem faced by Yahoo!, namely that one is given very large corpora of text documents/papers/articles and asked to create a useful taxonomy so that similar documents are closer in the taxonomy. Once the taxonomy is formed, the documents can be efficiently browsed and accessed. Unconstrained clustering is ideal for this initial situation, since in this case little domain expertise exists to begin with. However, as data mining progresses into more demanding areas, the chance of finding actionable patterns consistent with background knowledge and expectation is limited.