Unsupervised learning | 11 | Modern Data Science with R

ABSTRACT

This chapter explores techniques in unsupervised learning, where there is no response variable y. To illustrate, consider the unsupervised learning task of identifying different types of cars. The United States Department of Energy maintains data on the characteristics of thousands of cars: miles per gallon, engine size, number of cylinders, number of gears, etc. In data like these, a variable often carries little information that is relevant to the task at hand. Even among variables that are informative, there can be redundancy or near duplication. Such irrelevant or redundant variables make it harder to learn from data. Irrelevant variables are simply noise that obscures actual patterns. Similarly, when two or more variables are redundant, the differences between them may represent only random noise. Furthermore, for some machine learning algorithms, a large number of variables p presents computational challenges. The mathematics of singular value decomposition, a technique for dimension reduction, draws on knowledge of matrix algebra, but the operation itself is accessible to anyone.