ABSTRACT

Probabilistic data analysis assumes that data is the result of repeated, often independent, sampling from some population. The information contained in the data is used to infer some of the population's properties or even a description of it. This text deals with mixture and cluster analysis, also called partitioning or unsupervised classification. In these disciplines the data is assumed to emanate from mixtures consisting of unimodal subpopulations which correspond to different causes, sources, or classes and which all contribute to the data. Of course, there are data sets that emerge neither from a unimodal population nor from such a mixture. Self-similar data sets are one example. Consider the data set 1, 2, 4, 8, 16, ..., 2^k for some natural number k ≥ 10. This set is completely deterministic and does not show any subpopulation structure except, perhaps, one of k+1 singletons. Other data sets are chaotic, composed solely of outliers. Still others are geometric, such as a double helix; these belong to the realm of image processing rather than to data analysis. Therefore, it is appropriate to first say some words about the populations and models relevant to cluster analysis.