ABSTRACT

Continual advances in computer-based technologies have enabled researchers and engineers to collect data at an increasingly fast pace. Business and scientific data from many fields, such as finance, genomics, and physics, are often measured in gigabytes (GB, 10^9 bytes), terabytes (TB, 10^12 bytes), and sometimes even petabytes (PB, 10^15 bytes). For instance, one of eBay’s data warehouses was reported to have reached 10 PB in 2010 and was expected to grow to 20 PB in 2011. Other business operators, such as Bank of America, WalMart, and Dell, have also reported data warehouses in the petabyte range. This proliferation of large-scale data sets brings new challenges to data mining techniques. Scalability and efficiency are two critical issues in large-scale applications [54, 31, 215, 66, 3, 139, 173, 193]. To address these challenges, existing data mining techniques need to be adapted and improved to handle large-scale data sets [16, 154, 101, 86, 194, 28, 61, 19].