ABSTRACT

There is no reason why every variable of a data set should be useful for revealing a cluster structure. Often the structure is hidden in a subset of the variables, and in these cases only the variables that determine the clustering should be included in the analysis. This is particularly true when there are many variables. Often, some of them are just noise and others are redundant; that is, they contain no information, or only repeated information, on the structure to be detected. Adding noninformative variables to a clustered data set may strongly hamper the performance of clustering algorithms; see, for instance, Milligan [377]. Fowlkes et al. [168] analyzed a planar data set of five well-separated clusters of 15 data points each. They extended the dimension to five by taking the product with three independent standard normal variables. Although the first two variables show a clear cluster structure that can be detected by any reasonable cluster method, they found that the structure was completely upset when all five variables were used. Their context was hierarchical clustering with single and complete linkage and with Ward’s criterion.
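
The effect described by Fowlkes et al. can be illustrated with a small simulation. The following is only a sketch in the spirit of their experiment, not a reproduction of their data: the cluster centers, the within-cluster standard deviation of 0.03, the random seed, and the use of the adjusted Rand index as a measure of agreement are all assumptions made here for illustration. The function name fowlkes_style_demo is likewise hypothetical. Each dendrogram is cut into five clusters and compared with the true partition, once on the two informative variables and once after three standard normal noise variables have been appended.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.special import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index of two labelings (1.0 means identical partitions)."""
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    # Contingency table of the two partitions.
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=int)
    np.add.at(table, (ai, bi), 1)
    sum_cells = comb(table, 2).sum()
    sum_rows = comb(table.sum(axis=1), 2).sum()
    sum_cols = comb(table.sum(axis=0), 2).sum()
    expected = sum_rows * sum_cols / comb(len(a), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)


def fowlkes_style_demo(seed=0, sd=0.03):
    """Cluster 5 x 15 planar points with and without 3 added noise variables.

    The centers and `sd` are illustrative assumptions, not the original data.
    """
    rng = np.random.default_rng(seed)
    centers = np.array(
        [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    )
    truth = np.repeat(np.arange(5), 15)  # 5 clusters of 15 points each
    informative = centers[truth] + sd * rng.standard_normal((75, 2))
    noise = rng.standard_normal((75, 3))  # independent N(0, 1) variables
    full = np.hstack([informative, noise])

    scores = {}
    for method in ("single", "complete", "ward"):
        for name, data in (("2 variables", informative), ("5 variables", full)):
            tree = linkage(data, method=method)
            labels = fcluster(tree, t=5, criterion="maxclust")
            scores[(method, name)] = adjusted_rand_index(truth, labels)
    return scores


if __name__ == "__main__":
    for (method, name), ari in fowlkes_style_demo().items():
        print(f"{method:>8} linkage, {name}: ARI = {ari:.3f}")
```

With these assumed settings the noise variables have a much larger spread than the within-cluster variation, so Euclidean distances in five dimensions are dominated by the noninformative coordinates. The scale of the clusters relative to the noise is the decisive design choice; with far larger distances between the centers the added variables would matter much less.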