ABSTRACT

Clustering is often the first step in analyzing a new data set: when little is known about the data, it is useful to see whether they separate naturally into different groups. This chapter introduces clustering as a sorting process in which the criteria governing the sort are not known in advance. One application is finding news articles with similar content or relating to the same topic. The chapter builds from the heuristic K-means algorithm to mixture models in general and Gaussian mixture models in particular. The Expectation-Maximization technique is shown to be a general iterative technique for maximizing the data likelihood. The Dirichlet process, which extends the mixture model to an unknown, variable number of clusters, is explained in detail and illustrated by examples. Data generation under this process is likened to a hypothetical Chinese restaurant with an unlimited supply of tables, each of which can seat as many diners as necessary.
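
The restaurant analogy can be made concrete with a short simulation. The sketch below is illustrative only and is not taken from the chapter: it assumes the standard seating rule for the Chinese restaurant process, in which each arriving diner joins an occupied table with probability proportional to the number of diners already seated there, or opens a new table with probability proportional to a concentration parameter. The function name chinese_restaurant_process and the parameter names alpha and seed are hypothetical.

```python
import random


def chinese_restaurant_process(n_diners, alpha, seed=0):
    """Seat n_diners one at a time (assumed standard CRP seating rule).

    Returns (assignments, table_sizes), where assignments[i] is the table
    index of diner i and table_sizes[k] is the number of diners at table k.
    alpha is the concentration parameter and must be positive.
    """
    rng = random.Random(seed)
    table_sizes = []   # grows whenever a diner opens a new table
    assignments = []
    for _ in range(n_diners):
        # Occupied table k has weight table_sizes[k];
        # the next empty table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table (new cluster)
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, table_sizes = chinese_restaurant_process(100, alpha=1.0)
    print("tables used:", len(table_sizes))
    print("table sizes:", table_sizes)
```

The number of occupied tables is not fixed in advance; it emerges from the seating process, which is how the analogy conveys an unknown, variable number of clusters. In this sketch, larger values of alpha make new tables more likely.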