ABSTRACT

The perspective on sample reduction in this chapter differs from that of the previous ones. Up to this point in the book we have assumed that no data labels, groups or strata were recorded, and we have reduced the sample with unsupervised procedures that return as many groups as requested. In this chapter the number of groups k is known and, in addition, the group label of each observation has been recorded without uncertainty. Consequently, a matrix U of cluster labels is already available and can be used for sample reduction. The centroid matrix X̄ is readily available and does not need to be estimated, and the same holds for any other statistic that may be stratified by group. Discriminant analysis has a slightly different objective from cluster analysis: prediction, or classification, that is, assigning additional observations to one of the k groups. Before we proceed, let us underline that there are many methods for classification (see for instance Hastie et al. (2009)), a review of which is beyond the scope of this book. Discriminant analysis is only one of the possible options; it fits nicely into the framework of the book because of its many similarities with topics such as multivariate estimation of location and scatter, and cluster analysis. This chapter is intended as an introduction to the topic and to the area of robust classification. Discriminant analysis is limited by the fact that all predictors must be continuous and that a parametric Gaussian assumption should be formulated, possibly after transformation. Many other classification methods, on the other hand, also work well with a mix of continuous (without parametric assumptions) and categorical measurements.
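
As a minimal sketch of why the centroid matrix needs no estimation once the labels are known, the snippet below builds a membership matrix U from recorded labels and obtains the centroids as group-wise means. It assumes that U is the n x k binary indicator matrix (one row per observation, one column per group); the toy data, the variable names and the 0, ..., k-1 label coding are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations, p = 2 continuous predictors.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [10.0, 10.0],
              [12.0, 14.0],
              [-1.0, 0.0],
              [1.0, 0.0]])

# Known group labels (recorded without uncertainty), coded 0, ..., k-1.
labels = np.array([0, 0, 1, 1, 2, 2])
k = 3

# Assumed convention: U is the n x k binary membership matrix,
# with U[i, j] = 1 when observation i belongs to group j.
U = np.eye(k)[labels]

# U'U is diagonal with the group sizes on the diagonal, so
# Xbar = (U'U)^{-1} U'X reduces to the group-wise means.
group_sizes = U.sum(axis=0)
Xbar = (U.T @ X) / group_sizes[:, None]

print(Xbar)
```

Each row of `Xbar` is the centroid of one group. Note that this plain group mean is not robust to outliers; it only shows that, with U fixed, the centroids follow from a direct computation.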