Least-Squares Data Recovery Clustering Models | 14 | v2

ABSTRACT

The data recovery approach is a cornerstone of contemporary thinking in statistics and data analysis. It is based on the assumption that the observed data reflect a regular structure in the phenomenon about which they inform. The regular structure A, if known, would produce data F(A) that should coincide with the observed data Y up to small residuals, which are due to possible flaws in any or all of the following three aspects: (a) sampling entities, (b) selecting features and tools for their measurement, and (c) modeling the phenomenon in question. Each of these can drastically affect results. However, so far only the simplest of the aspects, (a), has been addressed, by introducing probabilities to study the reliability of statistical inference in data analysis. In this text, we are not concerned with these issues. We are concerned with the underlying equation:

Observed data Y = recovered data F(A) + residuals E    (*)

The quality of the model A is assessed according to the level of residuals E: the smaller the residuals, the better the model. Since quantitative models involve unknown coefficients and parameters, this naturally leads to the idea of fitting these parameters to data in such a way that the residuals become as small as possible. To put this idea as a minimization problem, one needs to combine the multiple residuals in an aggregate criterion. In particular, the so-called principle of maximum likelihood has been developed in statistics. When the data can be modeled as a random sample from a multivariate Gaussian distribution, this principle leads to the so-called least-squares criterion, the sum of squared residuals to be minimized. In the data analysis or data mining framework, the data do not necessarily come from a probabilistic population. Moreover, analysis of the mechanism of data generation is not of primary concern here. One needs only to see if there are any patterns in the data as they are. In this case, the principle of maximum likelihood may not be applicable. Still, the sum of squared residuals criterion can be used in the context of data mining as a measure of the incompatibility between the data and the model. It provides nice geometric properties and leads to provably reasonable cluster solutions. It also leads to useful decompositions of the data scatter into the sum of explained and unexplained parts. To show the working of model (*) along with the least-squares principle, let us introduce four examples covering important methods in data analysis: (a) averaging, (b) linear regression, (c) principal component analysis, and (d) correspondence analysis [140]. These are also useful for introduction of data analysis concepts that are used throughout this text.
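As a rough illustration of the simplest of these examples, averaging, the sketch below treats the model A as a single constant a, so that F(A) = a for every entity. Under that assumption the least-squares criterion is minimized by the mean, and the data scatter (the sum of squared observations) splits into an explained part and an unexplained part. The data values and the names `sum_sq_residuals`, `a_hat`, and `best_on_grid` are invented here for illustration and do not come from the text.

```python
import numpy as np

# Hypothetical observations, made up purely for this illustration.
y = np.array([2.0, 3.0, 5.0, 6.0, 9.0])


def sum_sq_residuals(a, y):
    """Least-squares criterion for the constant model F(a) = a."""
    return float(np.sum((y - a) ** 2))


# For the constant model, the least-squares solution is the mean.
a_hat = float(y.mean())
residuals = y - a_hat

# Decomposition of the data scatter: sum(y^2) = N * a_hat^2 + sum(e^2).
scatter = float(np.sum(y ** 2))
explained = len(y) * a_hat ** 2
unexplained = float(np.sum(residuals ** 2))

# Brute-force check on a grid of candidate constants (assumed range and step).
grid = np.linspace(y.min(), y.max(), 701)
best_on_grid = float(grid[np.argmin([sum_sq_residuals(a, y) for a in grid])])

print("least-squares constant:", a_hat)
print("scatter:", scatter, "explained:", explained, "unexplained:", unexplained)
print("best constant found on the grid:", best_on_grid)
```

The grid search is only a sanity check on the closed-form answer. The decomposition holds because the residuals from the mean sum to zero, which removes the cross term when the squared observations are expanded.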