ABSTRACT

Many data clustering algorithms have a high intrinsic time complexity. This chapter reviews classic clustering algorithms which are designed for large-scale data sets. The key of Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is to introduce a new data structure called clustering feature (CF) as well as CF-tree. CF can be regarded as a concise summary of each cluster. CF-tree is a height-balanced tree which keeps track of the hierarchical clustering structure for the entire data set. In many clustering algorithms, it is mainly the pair-wise distance measure that determines the clustering structure. In addition to clustering, random projection itself has a broad applicability in various data mining tasks. The chapter examines parallel and distributed clustering algorithms to handle big data. In contrast to the typical single machine clustering, parallel and distributed algorithms use multiple machines to speed up the computation and increase the scalability.