chapter  26
24 Pages

Method-Independent Indices for Cluster Validation and Estimating the Number of Clusters

ByMaria Halkidi, Michalis Vazirgiannis, Christian Hennig

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 26.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596 26.2 Euclidean and Dissimilarity-Based Indices for Crisp Partitions . . . . . . . . . . . . . 598

26.2.1 Within/Between Clusters Sum of Squares-Based Criteria . . . . . . . . . . . 599 26.2.2 The Davies-Bouldin Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602 26.2.3 The Dunn Family of Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602 26.2.4 Hubert’s 0 and Related Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603 26.2.5 The Average Silhouette Width Criterion . . . . . . . . . . . . . . . . . . . . . . . . 604 26.2.6 The CDbw-Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605 26.2.7 A Clustering Validation Index Based on Nearest Neighbors . . . . . . . . . 607 26.2.8 Specialized Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 26.2.9 A Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608

26.3 Validity Indices for Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611 26.3.1 Coverage of a Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612 26.3.2 Performance of a Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612 26.3.3 Modularization Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612

26.4 Fuzzy Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 26.4.1 Validity Indices Involving Only the Membership Values . . . . . . . . . . . . 613 26.4.2 Indices Involving the Membership Values and the Dataset . . . . . . . . . . 614

26.5 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617

Given a data set and a clustering algorithm running on it with different input parameter values, we obtain different partitionings of the data set into not necessarily meaningful clusters. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation of its validity. Evaluating the results of a clustering algorithm is known as “cluster validation.” The present chapter focuses on relative cluster validity criteria that are used to compare different clusterings and find the one that “best” fits the considered data. These criteria are implemented by validity indices that can be evaluated

of

from the data set and the given clustering alone without having access to a “true” clustering. This chapter aims at presenting an overview of the available relative cluster validity indices and at highlighting the differences between them and their implicit assumptions. Furthermore, we mention some software packages, we stress the requirements that are under-addressed by the recent approaches, and address new research directions.