ABSTRACT

It is very rare for data to be ‘clean’ before it is passed to a machine learning model. In practice, data needs to be processed before it is used for model training; this step is known as preprocessing. Common data issues include missing values, categorical variables that some models cannot handle directly, skewed distributions, and class imbalance.

This chapter focuses on preprocessing data for supervised machine learning. It builds upon the technical introduction provided in Chapters 7 and 8 and demonstrates how to perform data cleaning and apply pipelines to common preprocessing tasks, such as encoding categorical features, imputing missing values, transforming features and targets, and extracting features. An adapted version of the Ames housing dataset is used as a running example, in which the task is to predict the sale prices of properties. The example showcases various aspects of preprocessing and how they can improve the performance of different machine learning algorithms, such as linear models, random forests, and gradient boosting. The chapter provides practical examples as well as recommendations on which preprocessing techniques to use.