Clustering Categorical Data

doi:10.1201/9781315373515-12

ABSTRACT

This chapter explores important clustering applications for categorical datasets and explains the benefits and drawbacks of existing categorical clustering algorithms. It presents the goals of categorical clustering algorithms in general and gives an overview of similarity measures for categorical data. The chapter describes categorical clustering algorithms from the literature and discusses their scalability. Categorical clustering algorithms have various features, which make them suitable for applications with different requirements. A growing number of clustering algorithms for categorical data have been proposed, along with interesting applications, such as partitioning large software systems and protein interaction data. Hierarchical algorithms for categorical clustering take the approach of building a hierarchy representing a dataset’s entire underlying cluster structure. Hierarchical algorithms require few user-specified parameters and are insensitive to object ordering. The simplest categorical similarity measure is the overlap underlying the Hamming distance: two categorical data objects are compared attribute by attribute, and the number of matching attributes is counted (the Hamming distance itself is the number of mismatching attributes).
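
The attribute-by-attribute comparison described above can be sketched as follows. This is a minimal illustration, not code from the chapter: it assumes each data object is a fixed-length tuple of categorical values, and the function names and sample records are hypothetical. The overlap counts matching attributes, while the Hamming distance in its usual formulation counts mismatching ones, so the two always sum to the number of attributes.

```python
from typing import Hashable, Sequence


def overlap_similarity(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Count the attributes on which two categorical objects agree."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b))


def hamming_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Count the attributes on which two categorical objects differ."""
    return len(a) - overlap_similarity(a, b)


# Hypothetical records with three categorical attributes each
# (colour, size, shape); the values are made up for illustration.
p = ("red", "small", "round")
q = ("red", "large", "round")

matches = overlap_similarity(p, q)    # agree on colour and shape
mismatches = hamming_distance(p, q)   # differ on size only
```

Both functions treat every attribute as equally important and every mismatch as equally severe, which is what makes this kind of measure simple.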