ABSTRACT

The unavailability of labeled data is often a major problem in text mining. The labeling process is very demanding, and it is nearly impossible to label a sufficient amount of data in a reasonable time. Clustering is the most common form of unsupervised learning and enables automatic grouping of unlabeled documents into subsets called clusters. Such groups might then be used to assign labels to the unlabeled documents. Document similarity is the only endogenous information that is available in the clustering process. The chapter therefore discusses some common similarity measures and their computation. Different types of clustering algorithms (flat or partitional, hierarchical, and graph-based) are described together with some specific algorithms from these groups. Because there is usually no prior knowledge about the desired outcome and several alternative results are possible, evaluating clustering is not an easy task. The chapter thus also discusses some of the commonly used evaluation measures.