ABSTRACT

Purely partitioning algorithms form a partition by iteratively adjusting a feasible or provisional partition, taking all of the data items into account. K-means starts this iteration from a set of K centroids, typically chosen at random. A related family of algorithms approaches partition formation by randomly choosing a single starting centroid and forming a cluster from all of the other data items that fall within a pre-defined similarity threshold. The remaining data items, those excluded from the cluster, are then treated in the same way, and the process repeats until every data item belongs to some cluster. The clusters are thus sampled more directly from the data, and no number of clusters is pre-selected as it is in, say, K-means. The primary benefit is efficiency. The leader algorithms need not store a proximity matrix or compute all pairwise (dis)similarities, but only O(KN) (dis)similarities, where K = |C| is the eventual number of clusters. The exclusion region algorithms need only store a sparse matrix of similarities, and their clustering step amounts to an O(KN) lookup of some K clusters. The assumption in terms of efficacy is that these algorithms sample the density of the data in the feature space and thereby identify the groups sufficiently well. The leader algorithms lend themselves, as a last resort, to the very largest data sets or to huge online data streams. The exclusion region algorithms tend to be more effective: they select clusters from clearly dense regions of the feature space, by choosing the largest near-neighbor entry in the near-neighbor table at a specific threshold, and so are less subject to the problems of random selection; but they suffer the cost of generating all pairwise similarities.
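To make the leader-style procedure described above concrete, the following is a minimal sketch in Python, not the paper's implementation: the function name, the random leader selection, and the use of Euclidean distance as the (dis)similarity are all illustrative assumptions.

    import numpy as np

    def leader_clustering(X, threshold, rng=None):
        """Sketch of a leader algorithm (illustrative, not the paper's code).

        A randomly chosen unassigned item becomes a leader; every remaining
        item within `threshold` of it joins that leader's cluster. The
        process repeats on the items still unassigned until none remain.
        """
        rng = np.random.default_rng(rng)
        n = len(X)
        labels = np.full(n, -1)        # -1 marks "not yet clustered"
        leaders = []
        unassigned = list(range(n))
        while unassigned:
            # Pick a random unassigned item as the next leader (centroid).
            i = unassigned[rng.integers(len(unassigned))]
            leaders.append(i)
            # Gather every unassigned item within the threshold; Euclidean
            # distance stands in for the (dis)similarity measure here.
            dists = np.linalg.norm(X[unassigned] - X[i], axis=1)
            for j, d in zip(unassigned, dists):
                if d <= threshold:
                    labels[j] = len(leaders) - 1
            unassigned = [j for j in unassigned if labels[j] == -1]
        return labels, leaders

Each pass computes (dis)similarities from one leader to the remaining items only, and there is one pass per cluster, so the sketch performs O(KN) (dis)similarity computations and stores no proximity matrix, matching the efficiency claim above.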