ABSTRACT

Data mining is a multidisciplinary methodology for extracting nuggets of knowledge from data. It is an iterative process that builds predictive and descriptive models to uncover previously unknown trends and patterns by analyzing vast amounts of data from various sources. As a powerful tool, data mining has been used in a wide range of profiling practices, such as marketing, decision-making support, fraud detection, and scientific discovery. In the past 20 years, the dimensionality of the data sets involved in data mining applications has increased dramatically. Figure 1.1 plots the dimensionality of the data sets posted in the UC Irvine Machine Learning Repository [53] from 1987 to 2010. We can observe that in the 1980s, the maximal dimensionality of the data was only about 100; in the 1990s, this number increased to more than 1500; and in the 2000s, it further increased to about 3 million. The trend line in the figure is obtained by fitting an exponential function to the data. Since the y-axis is on a logarithmic scale, an exponential function appears there as a straight line, so the figure shows that the dimensionality of the data sets has grown exponentially.
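The trend-line argument can be illustrated with a minimal sketch. It assumes the exponential fit is done by ordinary least squares on the logarithm of the dimensionality, which may differ from the procedure actually used for Figure 1.1. The function name `fit_exponential_trend` and the numbers below are hypothetical; the values are synthetic and are not taken from the UC Irvine repository.

```python
import numpy as np


def fit_exponential_trend(years, dims):
    """Fit dims ~ a * exp(b * (year - years[0])) by least squares in log space.

    Taking logarithms gives log(dims) = log(a) + b * (year - years[0]),
    which is a straight line, so a degree-1 polynomial fit is enough.
    """
    t = np.asarray(years, dtype=float)
    log_d = np.log(np.asarray(dims, dtype=float))
    # np.polyfit returns coefficients from highest degree to lowest.
    b, log_a = np.polyfit(t - t[0], log_d, 1)
    return float(np.exp(log_a)), float(b)


# Synthetic illustration only: values generated from a known exponential,
# NOT the dimensionalities plotted in Figure 1.1.
years = np.arange(1987, 2011)
dims = 100.0 * np.exp(0.45 * (years - 1987))

a, b = fit_exponential_trend(years, dims)
print(f"a = {a:.3f}, growth rate b = {b:.3f} per year")
```

On a plot with a logarithmic y-axis, the fitted curve is the straight line with intercept log(a) and slope b, which is why a straight trend line on such a plot indicates exponential growth.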