Chapter 5
Density-Based Clustering
Martin Ester

Many of the well-known clustering algorithms make, implicitly or explicitly, the assumption that data are generated from a probability distribution of a given type, e.g., from a mixture of k Gaussian distributions. This is the case in particular for EM (Expectation Maximization) clustering and for k-means. Due to this assumption, these algorithms produce spherical clusters and cannot deal well with datasets in which the actual clusters have nonspherical shapes. Nonspherical clusters occur naturally in spatial data, i.e., data with a reference to some two- or three-dimensional concrete space corresponding to our real world. Spatial data include points, lines, and polygons and support a broad range of applications. Clusters in spatial data may have arbitrary shape, i.e., they are often drawn-out, linear, elongated, etc., because of the constraints imposed by geographic entities such as mountains and rivers. In geo-marketing, one may want to find clusters of homes with a given characteristic, e.g., high-income homes, while in crime analysis one of the goals is to detect crime hot-spots, i.e., clusters of certain types of crimes.

Even in higher-dimensional data, the assumption of a certain number of clusters of a given shape is very strong and may often be violated. In this case, algorithms such as k-means will break up or merge the actual clusters, leading to inaccurate results. The objective of minimizing the average squared distance of points from their corresponding cluster center leads to a partitioning of the dataset that is equivalent to the Voronoi diagram of the cluster centers, irrespective of the shape of the actual clusters. Figure 5.1 illustrates this weakness on a small two-dimensional dataset, showing that k-means with k = 3 breaks up and merges the three horizontal clusters.
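
As a concrete illustration of this weakness, the following minimal sketch (not part of the original chapter) runs k-means on a small synthetic dataset of three elongated horizontal clusters, analogous to the situation shown in Figure 5.1. It assumes NumPy and scikit-learn are available; the data layout and parameter values are illustrative choices, not taken from the chapter.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three drawn-out horizontal clusters: wide in x, narrow in y,
# separated vertically (roughly the kind of shape shown in Figure 5.1).
clusters = [
    np.column_stack((rng.uniform(0, 10, 200), rng.normal(y0, 0.2, 200)))
    for y0 in (0.0, 2.0, 4.0)
]
X = np.vstack(clusters)
true_labels = np.repeat([0, 1, 2], 200)

# k-means minimizes the sum of squared distances to the nearest center,
# which induces a Voronoi partition of the data space.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Report how each true (horizontal) cluster is distributed over the
# k-means partition; the elongated clusters typically get broken up and merged.
for t in range(3):
    found = sorted(np.unique(kmeans_labels[true_labels == t]))
    print(f"true cluster {t} falls into k-means clusters {found}")

Because the horizontal extent of each cluster far exceeds the vertical separation between them, the partition with minimum squared error consists of roughly vertical slabs, so each horizontal cluster is spread over several Voronoi cells rather than recovered as a whole.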