ABSTRACT

Knowledge discovery can be modeled as a seven-step process: goal identification, target data creation, data preprocessing, data transformation, data mining, result interpretation and evaluation, and knowledge application. This chapter also introduces a second knowledge discovery model, the Cross Industry Standard Process for Data Mining (CRISP-DM). Creating a target data set often involves extracting data from a warehouse, a transactional database, or a distributed environment. Transactional databases are normalized to minimize redundant data, because they are designed to update and retrieve information quickly. Before a data mining tool is applied, the gathered data are preprocessed to remove noise. Missing data are of particular concern because many data mining algorithms cannot process instances with missing values. In addition to data preprocessing, data transformation techniques can be applied before data mining takes place. Transformation methods such as data normalization and attribute creation or elimination are often necessary to obtain the best results.
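
As a rough illustration of the preprocessing and transformation steps named above, the following Python sketch fills in missing values and then normalizes a numeric attribute. Mean imputation and min-max scaling are assumptions chosen for simplicity, not techniques the chapter prescribes; the function names and the sample `ages` attribute are hypothetical.

```python
from statistics import mean


def impute_missing(values):
    """Replace None entries with the mean of the observed values.

    Mean imputation is only one possible strategy (an assumption here).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute with one missing item.
ages = [20, None, 40, 60]
filled = impute_missing(ages)       # preprocessing: handle missing data
scaled = min_max_normalize(filled)  # transformation: normalization
print(filled)
print(scaled)
```

Imputation runs first because the normalization step needs a complete numeric column to compute its minimum and maximum.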