ABSTRACT

Clustering, one of the most important unsupervised learning problems, is the task of dividing a set of objects into clusters such that objects within the same cluster are similar while objects in different clusters are distinct. It is widely used in fields such as text mining, image analysis, and bioinformatics. A clustering can be termed “valid” if its validation measure takes an unusually high or low value relative to a baseline distribution. The consistency between a pair of measures is defined as the similarity between their rankings of a series of clustering results, where similarity is measured by Kendall’s rank correlation. Data sets with varying densities are challenging for many clustering algorithms. It is therefore worth examining whether varying densities also affect the performance of internal validation measures. Subclusters are clusters that are close to each other.
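
As an illustration of the consistency definition, the following minimal sketch computes Kendall's rank correlation between the scores that two validation measures assign to the same series of clustering results. It is a sketch under stated assumptions, not the procedure used in this work: the function name `kendall_tau`, the tau-a variant (no tie correction), and the score values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two clustering results")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # The sign of the product shows whether both measures order
        # clustering results i and j the same way.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of two validation measures on five clustering results
# (illustrative values only, not taken from any experiment).
measure_a = [0.62, 0.71, 0.55, 0.80, 0.48]
measure_b = [0.58, 0.69, 0.60, 0.77, 0.41]

consistency = kendall_tau(measure_a, measure_b)
```

With these illustrative scores, nine of the ten pairs of clustering results are ordered the same way by both measures and one pair is not, so `consistency` evaluates to 0.8. A value near 1 indicates that the two measures rank the clustering results almost identically, while a value near -1 indicates opposite rankings. If one measure is better when larger and the other when smaller, its scores would need to be negated before the comparison.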