ABSTRACT

Unsupervised learning includes a host of exploratory techniques that can be used to gain insight into data, often when there is no outcome variable or clear target. The objective of principal components analysis is to replace the observed variables with a smaller number of uncorrelated variables that explain a sufficiently large amount of the variance in the data. The expectation-maximization algorithm is an iterative procedure for finding maximum likelihood estimates when data are incomplete or treated as such. In its expectation step, the expected value of the complete-data log-likelihood is computed conditional on the observed data and the current parameter estimates; in its maximization step, this expected value is maximized to update the estimates. The choice of the number of latent components is an important consideration in probabilistic principal components analysis. One approach is to choose the smallest number of latent components that captures a prespecified proportion of the variation in the data.
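
The proportion-of-variation rule mentioned above can be illustrated with a minimal sketch. It assumes that the variation is measured by the eigenvalues of the ordinary sample covariance matrix rather than by a fitted probabilistic model; the function name choose_n_components and the 0.9 default threshold are hypothetical choices made for illustration only.

```python
import numpy as np


def choose_n_components(X, target=0.9):
    """Return the smallest q whose leading components explain at least
    `target` of the total variance, plus the cumulative proportions.

    Hypothetical helper: uses sample-covariance eigenvalues (an assumption).
    """
    # Eigenvalues of the sample covariance matrix, sorted in descending order.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    # Guard against tiny negative eigenvalues caused by rounding.
    eigvals = np.clip(eigvals, 0.0, None)
    # Cumulative proportion of total variance explained by the first k components.
    cum = np.cumsum(eigvals) / eigvals.sum()
    # First index at which the cumulative proportion reaches the target.
    q = int(np.searchsorted(cum, target)) + 1
    # Cap at the number of variables in case rounding keeps cum[-1] just below target.
    return min(q, len(eigvals)), cum


if __name__ == "__main__":
    # Illustrative synthetic data: one dominant direction and three weaker ones.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 4)) * np.array([10.0, 1.0, 0.1, 0.1])
    q, cum = choose_n_components(X, target=0.9)
    print("components retained:", q)
    print("cumulative proportions:", np.round(cum, 4))
```

The threshold is a tuning choice, not something the rule itself supplies: a higher value retains more components, so it may be worth inspecting the full cumulative curve returned alongside q.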