ABSTRACT

Data mining, or any form of statistical analysis for that matter, can be viewed as a set of methods for summarizing large amounts of data so that the data can be usefully interpreted. A common and useful way to collapse data is simply to group it. Cluster analysis and principal components are two broad classes of methods for grouping data. Since we will be using both in the tutorials that follow, a brief explanation of the difference is warranted. Consider a typical data table that has one row for each individual (a row is also called a “case” or “record”). Across the top are the names of the variables (or “fields”) that describe the individuals, and down the side is each individual’s identifier (a name or number).
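As a concrete picture of such a table, the following is a minimal sketch in Python using only the standard library. The field names, identifiers, and values are hypothetical and chosen purely for illustration; they do not come from the tutorials' data.

```python
# Hypothetical data table: one row per individual, one column per variable.
# All names and values below are illustrative assumptions.

# Variable ("field") names that run across the top of the table.
fields = ["height_cm", "weight_kg", "age_years"]

# Each key is an individual's identifier (down the side of the table);
# each value is that individual's row (a "case" or "record").
table = {
    "A01": [170.0, 65.0, 34],
    "A02": [182.0, 80.0, 41],
    "A03": [158.0, 52.0, 29],
}


def get_case(identifier):
    """Return one row (case/record) as a mapping from field name to value."""
    return dict(zip(fields, table[identifier]))


def get_field(name):
    """Return one column (variable/field) as a mapping from identifier to value."""
    j = fields.index(name)
    return {ident: row[j] for ident, row in table.items()}


if __name__ == "__main__":
    print(get_case("A02"))         # one individual, all variables
    print(get_field("age_years"))  # one variable, all individuals
```

Reading the table by row gives everything known about one individual; reading it by column gives one variable across all individuals. A table like this can be summarized along either direction.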