Finding Groups in Data

doi:10.1201/9780429470615-4

ABSTRACT

Data Mining algorithms try to address the problem of finding groups in data as scientifically as possible, defining methods and rules for assigning units of observation to classes that are not defined a priori and are supposed to reflect the structure of the entities that the data represent. Cluster Analysis is a classification technique that aims to divide individual cases into groups such that the cases in a cluster are very similar to one another and very different from the cases in other clusters. Homogeneity within each cluster and the degree of separation among clusters are measured with reference to a distance metric or a dissimilarity measure. The ordering of pairs of units is sensitive to the selected measure of distance or dissimilarity, so an important step in Cluster Analysis is the choice of the most appropriate measure, which in practice should be based on the type of data at hand: distance metrics for numerical variables, while for categorical variables dissimilarity measures should be preferred.
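
As a minimal sketch of why the choice of measure matters, the Python snippet below compares two common distance metrics on numerical data and computes a simple matching dissimilarity on categorical data. The function names, the toy points, and the particular measures chosen (Euclidean, Manhattan, simple matching) are illustrative assumptions and are not taken from the chapter.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two numerical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance between two numerical vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def simple_matching(x, y):
    """Share of categorical attributes on which two units disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Toy numerical units (hypothetical data): which of b or c is closer to a?
a = (0.0, 0.0)
candidates = {"b": (3.0, 3.0), "c": (0.0, 5.0)}

closest_euclidean = min(candidates, key=lambda k: euclidean(a, candidates[k]))
closest_manhattan = min(candidates, key=lambda k: manhattan(a, candidates[k]))

# Toy categorical units (hypothetical data): they differ on one attribute of three.
u = ("red", "small", "round")
v = ("red", "large", "round")
categorical_dissimilarity = simple_matching(u, v)

print(closest_euclidean, closest_manhattan, categorical_dissimilarity)
```

With these toy points the unit nearest to `a` is `b` under the Euclidean metric but `c` under the Manhattan metric, so the ordering of pairs, and potentially the resulting clusters, depends on the measure selected.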