Chapter 5

Looking for Similarity – Clustering

Clustering is primarily a way of exploring data. Its goal is understanding rather than direct decision making. Given a set of records, a cluster is a subset of the records that are similar to each other, with respect to some reasonable measure of similarity, and not so similar to the rest of the records. Both similarity and dissimilarity are therefore at the heart of clustering. We can learn from a clustering at two levels. First, the large-scale structure (how many clusters there are, what shapes they form, where they are located, and where the voids between them lie) provides information about the problem domain. Second, the location and shape of each individual cluster provide information about the records placed in it, from which we can infer an underlying reason why these records are similar. For example, retail stores often cluster their customers and try to learn what kind of customer each cluster represents. This knowledge can be used to influence the design and layout of stores, the kind of merchandise stocked, and even the opening hours. At the global level, a clustering can also show where clusters are absent, which may represent an opportunity for a new product or service.
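To make the definition of a cluster concrete, the following minimal sketch compares the average distance between records inside a group with the average distance between records in different groups. Everything in it is an assumption made for illustration: the customer records are invented, the two attributes (visits per month and average spend) are hypothetical, and Euclidean distance stands in for "some reasonable measure" of dissimilarity. In practice the attributes would usually be rescaled first so that no single one dominates the distance.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, available in Python 3.8+


def mean_within(cluster):
    """Average distance between all pairs of records in one group."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def mean_between(cluster_a, cluster_b):
    """Average distance between records drawn from two different groups."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical customer records: (visits per month, average spend per visit).
frequent_small = [(12.0, 15.0), (11.0, 18.0), (13.0, 14.0)]
rare_large = [(1.0, 120.0), (2.0, 110.0), (1.0, 130.0)]

within_a = mean_within(frequent_small)
within_b = mean_within(rare_large)
between = mean_between(frequent_small, rare_large)

# A plausible clustering has small within-group distances (similarity)
# and a large between-group distance (dissimilarity).
print(f"within frequent_small: {within_a:.1f}")
print(f"within rare_large:     {within_b:.1f}")
print(f"between the groups:    {between:.1f}")
```

If the two groups really are clusters in the sense described above, both within-group averages should be clearly smaller than the between-group average, as they are for this invented data.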