ABSTRACT

Volumes of data are exploding in both scientific and commercial domains. Data mining techniques that extract information from huge amounts of data have become popular in many applications. Algorithms are designed to analyze these volumes of data automatically and efficiently, so that users can grasp the knowledge latent in the data without manually sifting through the massive data themselves. However, the performance of computer systems is improving at a slower rate than the demand for data mining applications is growing. Recent trends suggest that system performance has been improving at a rate of 10-15% per year, whereas the volume of data collected nearly doubles every year. As data sizes increase from gigabytes to terabytes or beyond, sequential data mining algorithms may not deliver results in a reasonable amount of time. Worse still, because a single processor may not have enough main memory to hold all the data, many sequential algorithms either cannot handle large-scale problems at all or must process the data out of core, further slowing down the process.