ABSTRACT

Clustering refers to a collection of methods designed to uncover natural groups, called clusters, in data. The idea is that objects within a group should be similar to each other, while the groups themselves should be as different from one another as possible. Several attempts have been made to quantify mathematically the degree of similarity and difference necessary for clusters to be distinct, but no consensus exists. For many clustering methods, a key choice the analyst must make is the metric used to measure dissimilarity, or distance, between observations. Over the years, a variety of clustering methods have been developed, and some of the earliest still enjoy tremendous popularity. These include partitioning methods, such as k-means, and hierarchical clustering methods. They remain popular because they are simple to understand and are implemented in most modern software packages for data analysis.