ABSTRACT

This chapter examines strategies for addressing the preprocessing problems often seen with real data. The first section introduces the knowledge discovery in data (KDD) process model. The sections that follow focus on the steps of this model that involve creating the initial target data (including an example of R interfacing with a relational database), data preprocessing, and data transformation. R scripts illustrate how to locate noise in data and how to detect outliers. Several methods for dealing with missing data are presented. Model evaluation methods such as cross-validation and bootstrapping are described. The chapter concludes with examples that use R to transform and sample data for training and testing purposes.