ABSTRACT

Clustering aims to find the inherent structure of unlabeled data by grouping the instances into clusters [Jain et al., 1999]. A good clustering produces high-quality clusters in which intra-cluster similarity is maximized and inter-cluster similarity is minimized. Clustering can be used as a stand-alone exploratory tool to gain insight into the nature of the data, and it can also be used as a preprocessing stage to facilitate subsequent learning tasks. Formally, given the data $D = \{x_1, x_2, \ldots, x_m\}$, where the $i$th instance $x_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$ is a $d$-dimensional feature vector, the task of clustering is to group $D$ into $k$ disjoint clusters $\{C_j \mid j = 1, \ldots, k\}$ with

$$\bigcup_{j=1}^{k} C_j = D \quad \text{and} \quad C_i \cap C_j = \emptyset \ \text{ for } i \neq j.$$

The clustering results returned by a clustering algorithm $L$ can be represented as a label vector $\lambda \in \mathbb{N}^m$, with the $i$th element $\lambda_i \in \{1, \ldots, k\}$ indicating the cluster assignment of $x_i$.
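
The relation between a label vector and the clusters it encodes can be made concrete with a minimal Python sketch. It is an illustration of the definition above, not part of any particular clustering algorithm. The function names `clusters_from_labels` and `is_partition` and the example label vector are hypothetical, and instances are identified by 0-based position in the code while the labels stay in {1, ..., k} as in the text.

```python
from itertools import combinations


def clusters_from_labels(labels, k):
    """Build the index sets C_1..C_k from a label vector with labels in 1..k."""
    clusters = [set() for _ in range(k)]
    for i, label in enumerate(labels):
        if not 1 <= label <= k:
            raise ValueError(f"label {label} at position {i} is outside 1..{k}")
        # Instance i (0-based position) is assigned to cluster C_label.
        clusters[label - 1].add(i)
    return clusters


def is_partition(clusters, m):
    """Check that the clusters cover all m instances and are pairwise disjoint."""
    covers_all = set().union(*clusters) == set(range(m))
    disjoint = all(a.isdisjoint(b) for a, b in combinations(clusters, 2))
    return covers_all and disjoint


# Hypothetical label vector for m = 6 instances and k = 3 clusters.
labels = [1, 3, 2, 1, 3, 3]
clusters = clusters_from_labels(labels, k=3)
print(clusters)
print(is_partition(clusters, m=len(labels)))
```

Any label vector with entries in {1, ..., k} satisfies both conditions by construction, since each instance receives exactly one label. The check is therefore mainly useful for cluster sets obtained some other way, for example two overlapping sets such as {0, 1} and {1, 2}, which fail the disjointness condition.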