ABSTRACT

Abstract
1.1 Introduction
1.2 Dimensions (Dichotomies) of Clustering
  1.2.1 By Type of Clustering: Hard vs. Soft
  1.2.2 By Type of Clustering: Flat vs. Hierarchical
  1.2.3 By Data Type or Format
  1.2.4 By Clustering Criterion: (Probabilistic) Model Based vs. Cost-Based
  1.2.5 By Regime: Parametric (K Is Input) vs. Nonparametric (Smoothness Parameter Is Input)
1.3 Types of Clusterings
1.4 Data Formats
1.5 Clustering Approaches
  1.5.1 Centroid-Based Clustering
  1.5.2 Agglomerative Hierarchical Methods
  1.5.3 Spectral Clustering
  1.5.4 Mixture Probability Models
  1.5.5 Density-Based Clustering
  1.5.6 Other Probability Models
  1.5.7 Further Clustering Approaches
1.6 Cluster Validation and Further Issues
  1.6.1 Approaches for Cluster Validation
  1.6.2 Number of Clusters
  1.6.3 Variable Selection, Dimension Reduction, Big Data Issues
  1.6.4 General Clustering Strategy and Choice of Method
  1.6.5 Clustering Is Different from Classification
References

This chapter gives an overview of the basic concepts of cluster analysis, including some references to aspects not covered in this Handbook. It introduces general definitions of a clustering, for example, partitions, hierarchies, and fuzzy clusterings. It distinguishes objects × variables data from dissimilarity data, and the parametric from the nonparametric clustering regime. A general overview of principles for clustering data is given, comprising centroid-based clustering, hierarchical methods, spectral clustering, mixture model and other probabilistic methods, density-based clustering, and further methods. The chapter then reviews methods for cluster validation, that is, assessing the quality of a clustering, which includes the decision about the number of clusters. It then briefly discusses variable selection, dimension reduction, and the general strategy of cluster analysis.