ABSTRACT

The last chapter looked at hierarchical clustering, the first form of cluster analysis developed, which continues to be used in many applications. However, hierarchical clustering methods do not scale well to large databases, since they require the calculation of the distance between every pair of records; this is feasible for up to a few thousand records but becomes problematic beyond that point. Partly to address this limitation of hierarchical clustering, as well as for a number of other reasons, researchers in several disciplines have developed other methods of cluster analysis since the pioneering work of MacQueen (1967). The clustering approach most commonly used in applied data mining is known as partitioning cluster analysis, in which the records are divided (partitioned) into the “best” K groups according to some criterion. Nearly all partitioning cluster analysis methods accomplish this by basing cluster membership on the proximity of each record to one of K points (or “centroids”) in the data. For a pre-specified number of clusters, these algorithms search for the centroid locations that optimize some criterion defined on the distances between each cluster's centroid and the points assigned to that cluster. In this chapter we describe how K-Centroids clustering algorithms work in general, present three variants of the algorithm (K-Means, K-Medians, and Neural Gas), and then provide a graphical illustration of how K-Means clustering works. The three K-Centroids variants differ from one another in the distance metric used to measure the distance between a cluster centroid and its member points, and in how the centroids themselves are defined.
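To make the general K-Centroids iteration concrete, the sketch below implements the K-Means variant (squared Euclidean distance, with each centroid defined as the mean of its member points) via the classic alternation of assignment and update steps. This is our own minimal illustration, not code from the chapter; the function name and parameters are ours, and practical concerns such as empty clusters and multiple restarts are not handled.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: alternate record assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize the K centroids by sampling k distinct records from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each record joins its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned records.
        # (Empty clusters are not handled in this sketch.)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # stop when centroids no longer move
            break
        centroids = new
    return centroids, labels
```

The K-Medians variant would replace the mean in the update step with the coordinate-wise median (and typically the Euclidean distance with the Manhattan distance), which is one way the variants discussed in the chapter differ.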
We next take a minor detour to discuss the related issue of the different types of clusters that can exist, and the relationship between these cluster types and the nature of the customer segments typically found in applied data mining projects. We then explore methods for determining both the “best” number of clusters and the “best” clustering method for a data set using two different cluster validation measures: the adjusted Rand index (Hubert and Arabie, 1985), which assesses the stability and reproducibility of a cluster solution, and the Calinski-Harabasz index (Calinski and Harabasz, 1974), which assesses within-cluster homogeneity relative to between-cluster separation.
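The two validation measures mentioned above can both be computed directly from their published definitions. The sketch below is our own illustrative implementation (function names are ours): the adjusted Rand index compares two partitions of the same records via a contingency table, correcting agreement for chance, while the Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985): chance-corrected
    agreement between two partitions of the same records; 1 = identical."""
    a_vals, a_inv = np.unique(labels_a, return_inverse=True)
    b_vals, b_inv = np.unique(labels_b, return_inverse=True)
    # Contingency table: n_ij = records in cluster i of A and cluster j of B.
    n_ij = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    for i, j in zip(a_inv, b_inv):
        n_ij[i, j] += 1
    sum_ij = sum(comb(int(n), 2) for n in n_ij.ravel())
    sum_a = sum(comb(int(n), 2) for n in n_ij.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in n_ij.sum(axis=0))
    expected = sum_a * sum_b / comb(len(a_inv), 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index (Calinski and Harabasz, 1974): between-cluster
    dispersion relative to within-cluster dispersion; higher is better."""
    n, k = len(X), len(np.unique(labels))
    grand_mean = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - grand_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))
```

In practice one would compute the adjusted Rand index between cluster solutions fit on resampled versions of the data (stability), and the Calinski-Harabasz index across candidate values of K, choosing the K that scores well on both.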