High-Performance Distributed Data Mining

doi:10.1201/9781420085877-16

ABSTRACT

There is a subtle yet significant difference between algorithms designed for parallel systems and those designed for distributed systems. Generally, parallel data mining algorithms target tightly coupled, custom-made shared-memory systems or distributed-memory systems with fast interconnects. Distributed data mining generally targets cluster-like, loosely coupled systems connected over a slow Ethernet LAN or WAN. The main differences between parallel and distributed systems are scale, communication costs, interconnect speed, and data distribution. For example, the amount of communication feasible in a shared-memory parallel system can be large, whereas the same volume of communication might be impractical in a distributed cluster connected over the Internet. Large-scale real-world data mining systems typically combine parallel and distributed techniques. The parallel techniques are used to optimize mining at a local hub, whereas the distributed techniques are used to aggregate information across geographically distributed locations. In this chapter, we focus on distributed implementations of data mining techniques.