ABSTRACT

Abstract
27.1 Why We Need Distances between Partitions
27.2 The Confusion Matrix: Clusterings as Distributions
27.3 A Cornucopia of Clustering Comparison Criteria
    27.3.1 Comparing Clusterings by Counting Pairs
    27.3.2 Adjusted Indices
    27.3.3 Comparing Clusterings by Set Matching
    27.3.4 Comparing Clusterings by Information Theoretic Criteria
27.4 Comparison between Criteria
    27.4.1 Range and Normalization
    27.4.2 (In)dependence of the Sample Size n
    27.4.3 Effects of Refining and Coarsening the Partitions
27.5 Conclusion
References

A common question when evaluating a clustering is how it differs from the correct or optimal clustering for that data set. This chapter presents the principles and methods for comparing two partitions of a data set D. As will be seen, a variety of distances and indices for comparing partitions exist. This chapter therefore also describes some useful properties of such a distance or index and compares the existing criteria in light of these properties.