Chapter 9
Clustering High-Dimensional Data
By Arthur Zimek
Pages 30

Clustering is commonly defined as the task of finding a set of groups of similar objects within a data set while keeping dissimilar objects separated in different groups or in a group of noise. Although Estivill-Castro criticizes this definition for including a grouping criterion [47], this criterion (similarity) is exactly what is in question among the many different approaches. Especially in high-dimensional data, the meaning and definition of similarity lie right at the heart of the problem. In many cases, the similarity of objects is assessed within subspaces, e.g., using only a subset of the dimensions, or a combination of (a subset of) the dimensions. Algorithms that do so are called subspace clustering algorithms. Note that different clusters within one and the same clustering solution usually have different relevant subspaces. Subspace clustering algorithms therefore cannot be thought of as variations of ordinary clustering algorithms that merely use a different definition of similarity. Instead, the similarity measure and the clustering solution are usually derived simultaneously and depend on each other. In this overview, we focus on these methods. Subspace clustering is an emerging field that still raises many open questions, even though many methods have been proposed, each putting the emphasis on different aspects of the problem.
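The point that different clusters have different relevant subspaces can be illustrated with a small synthetic example. The following Python sketch is not a subspace clustering algorithm and is not taken from the chapter: it assumes the cluster memberships are already known, which a real method would have to discover together with the subspaces. The helper names `relevant_dimensions` and `subspace_distance`, the variance threshold, and the data are all hypothetical choices made for illustration.

```python
import numpy as np


def relevant_dimensions(points, max_variance):
    """Indices of the dimensions in which the given points vary little."""
    variances = points.var(axis=0)
    return [int(d) for d in np.flatnonzero(variances < max_variance)]


def subspace_distance(a, b, dims):
    """Euclidean distance computed only over the dimensions in dims."""
    diff = a[dims] - b[dims]
    return float(np.sqrt(np.sum(diff ** 2)))


rng = np.random.default_rng(0)
n = 200

# Cluster A is compact in dimensions 0 and 1 and spread out in dimension 2.
cluster_a = np.column_stack([
    rng.normal(0.2, 0.02, n),
    rng.normal(0.8, 0.02, n),
    rng.uniform(0.0, 1.0, n),
])

# Cluster B is compact in dimensions 1 and 2 and spread out in dimension 0.
cluster_b = np.column_stack([
    rng.uniform(0.0, 1.0, n),
    rng.normal(0.3, 0.02, n),
    rng.normal(0.7, 0.02, n),
])

# Assumed threshold: well above the compact variance, well below the uniform one.
dims_a = relevant_dimensions(cluster_a, max_variance=0.01)
dims_b = relevant_dimensions(cluster_b, max_variance=0.01)

# Two members of cluster A that lie far apart along the irrelevant dimension 2.
low = cluster_a[np.argmin(cluster_a[:, 2])]
high = cluster_a[np.argmax(cluster_a[:, 2])]
full_dist = subspace_distance(low, high, [0, 1, 2])
sub_dist = subspace_distance(low, high, dims_a)

print("relevant dimensions of A:", dims_a)
print("relevant dimensions of B:", dims_b)
print("full-space distance:", full_dist, "subspace distance:", sub_dist)
```

In the full space the two selected members of cluster A look dissimilar, while in the subspace relevant to cluster A they are close. A single global choice of dimensions could not serve both clusters, since each has its own relevant subspace.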