Regularization Methods for Cluster Analysis and Principal Components Analysis

ABSTRACT

In this chapter, we focused on the application of regularization techniques, in particular the lasso, to cluster analysis (CA). Clustering is a widely used statistical technique for identifying subgroups within a larger population on the basis of multiple variables. Regularization can be applied to each of the two major types of CA, K-means and hierarchical cluster analysis (HCA). In addition, commonly used tools for determining the number of clusters, such as the Gap and Clest statistics, can be used with regularized CA to identify the optimal number of clusters to retain, and the Gap statistic can also be used to select the optimal regularization tuning parameter. As with other statistical methods that we consider in this book, regularization is particularly useful for researchers who need to conduct CA in the context of high-dimensional data.
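To make the role of the lasso in clustering concrete, the following is a minimal sketch of the feature-weight update used in one common formulation of regularized K-means (the sparse K-means of Witten and Tibshirani). That formulation is assumed here for illustration and is not taken from the chapter text. Given the between-cluster sum of squares for each variable, the update soft-thresholds those values so that the weights satisfy an L1 bound `s`, which plays the role of the tuning parameter. The function names, the bound `s`, and the example values are all hypothetical, and Python is used only to keep the sketch self-contained.

```python
import math


def soft_threshold(values, delta):
    """Shrink each non-negative value toward zero by delta, flooring at zero."""
    return [max(v - delta, 0.0) for v in values]


def l2_normalize(values):
    """Rescale to unit Euclidean length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)


def lasso_feature_weights(bcss, s, tol=1e-8):
    """Weights w maximizing sum(w_j * bcss_j) subject to
    ||w||_2 <= 1, ||w||_1 <= s, and w_j >= 0.

    bcss: non-negative per-variable between-cluster sums of squares.
    s:    L1 bound (tuning parameter), assumed to satisfy 1 <= s <= sqrt(p).
    Assumes the largest bcss value is not tied; ties are not handled here.
    """
    w = l2_normalize(bcss)
    if sum(w) <= s:
        return w  # the L1 constraint is inactive, so no shrinkage is needed

    # Bisect on the threshold delta until the L1 norm of the weights meets s.
    lo, hi = 0.0, max(bcss)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        w = l2_normalize(soft_threshold(bcss, mid))
        if sum(w) > s:
            lo = mid  # weights still too dense, shrink more
        else:
            hi = mid  # constraint met, try shrinking less
    return l2_normalize(soft_threshold(bcss, hi))


if __name__ == "__main__":
    # Hypothetical values: two informative variables and two noise variables.
    example_bcss = [10.0, 5.0, 0.5, 0.1]
    print(lasso_feature_weights(example_bcss, s=1.2))
```

With these hypothetical inputs and a tight bound, the two weakest variables receive a weight of exactly zero and drop out of the clustering. With the loosest bound (`s` equal to the square root of the number of variables), all four variables keep positive weights. In a full algorithm, this update would alternate with a K-means step run on the weighted variables.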

In the next chapter, we will turn our attention to regularization models for latent variables. In particular, our focus will be on both exploratory and confirmatory analysis and structural equation modeling. These methods are extensions of the univariate and multivariate models described in Chapters 3 and 5. We will begin these discussions with a review of the standard (non-regularized) models, followed by their regularized counterparts. As with the other methods described in this book, we will demonstrate how these methods can be applied using the R software environment.
