Detecting outliers from large datasets | 19

ABSTRACT

With the success of database technologies and the constant reduction of hardware costs, many enterprises, commercial or otherwise, have amassed huge amounts of data over the years. These volumes of data represent a rich source for analysis and for improved understanding of the entities stored in the data. However, existing database technologies were designed and developed more for storage and archival purposes than for analysis. Thus, the past decade has witnessed the development of many so-called data mining tools, developed to ‘discover previously unknown knowledge embedded in the data’ (Piatetsky-Shapiro and Frawley 1991). In general, the discovered knowledge can be classified into four kinds. The first kind is to find dependencies, that is, how one part of the data depends on other parts of the data. Examples include association (Agrawal et al. 1993), correlation (Brin et al. 1997) and roll-up dependencies (Wijsen et al. 1999). The second kind is to identify classes, that is, which parts of the data can be grouped together because of their similarities. Clustering is a typical example (Jain and Dubes 1988; Kaufman and Rousseeuw 1990). The third kind is to explain classes, that is, what is common among members of the same class, and what is different between members of different classes. Classification is a typical example (Agrawal et al. 1992; Breiman et al. 1984). Last but not least, the fourth kind is to find outliers, that is, which parts of the data are exceptional or atypical.