ABSTRACT

In recent years several approaches to knowledge discovery and data mining, and in particular, clustering, have been developed, but only a few of them are designed using a decentralized approach. Clustering data is the process of grouping similar objects according to their distance, connectivity, or relative density in space [1]. There are a large number of algorithms for discovering natural clusters, if they exist, in a dataset, but they are usually studied as a centralized problem. These algorithms can be classified into partitioning methods [2], hierarchical methods [3], density-based methods [4], and grid-based methods [5]. Han et al.’s paper [6] is a good introduction to this subject. Many of these algorithms work on data contained in a file or database. In general, clustering algorithms focus on creating good compact representation of clusters and appropriate distance functions between data points. For this purpose, they generally need to be given one or two parameters by a user that indicate the types of clusters expected. Since they use a central representation where each point can be compared with each other point or cluster representation, points are never placed in a cluster with largely different members. Centralized clustering

CHAPMAN: “C4754_C022” — 2005/8/6 — 14:10 — page 342 — #2

is problematic if we have large data to explore or data is widely distributed. Parallel and distributed computing is expected to relieve current mining methods from the sequential bottleneck, providing the ability to scale to massive datasets, and improving the response time. Achieving good performance on today’s high performance systems is a nontrivial task. The main challenges include synchronization and communication minimization, workload balancing, finding good data decomposition, etc. Some existing centralized clustering algorithms have been parallelized and the results have been encouraging. Centralized schemes require high level of connectivity, impose a substantial computational burden, are typically more sensitive to failures than decentralized schemes, and are not scalable, which is a property that distributed computing systems are required to have.