ABSTRACT

Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data. An elementary approach to imputing missing values for a feature is to compute a descriptive statistic such as the mean, median, or mode of the observed values and use that value to replace NAs. K-nearest neighbor imputation identifies observations with missing values, finds the observations most similar to them based on the other available features, and uses the values from these nearest neighbors to fill in the missing entries. Zero and near-zero variance variables are low-hanging fruit to eliminate: a zero-variance variable, one containing only a single unique value, provides no useful information to a model. Numeric features can create a host of problems for certain models when their distributions are skewed, contain outliers, or span a wide range of magnitudes. There are exceptions; tree-based models, for example, naturally handle numeric or categorical features with these characteristics.
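The elementary imputation approach described above can be sketched in a few lines of pure Python. The function name and data below are illustrative, not from the source; `None` stands in for NA:

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None (NA) entries in a feature with a descriptive statistic
    computed from the observed values."""
    observed = [v for v in values if v is not None]
    stat = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [stat if v is None else v for v in values]

feature = [2.0, None, 4.0, 6.0, None]
print(impute(feature, "mean"))  # the two NAs are replaced by the mean, 4.0
```

The median variant is often preferred when the feature is skewed, since the mean is pulled toward outliers while the median is not.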
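K-nearest neighbor imputation, as summarized above, can be sketched directly: for each missing entry, rank the other observations by their distance on the mutually observed features and average the feature values of the k closest donors. This is a minimal illustration (Euclidean distance, mean aggregation, k=2 default are assumptions), not a production implementation:

```python
import math

def knn_impute(rows, k=2):
    """Fill each None entry with the mean of that feature across the k
    rows most similar on the remaining observed features."""
    def distance(a, b, skip):
        # Compare only dimensions observed in both rows, excluding the
        # feature currently being imputed.
        dims = [(x, y) for i, (x, y) in enumerate(zip(a, b))
                if i != skip and x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in dims))

    imputed = [row[:] for row in rows]
    for r, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors must have feature j observed.
                donors = [other for other in rows if other[j] is not None]
                donors.sort(key=lambda other: distance(row, other, j))
                neighbors = donors[:k]
                imputed[r][j] = sum(o[j] for o in neighbors) / len(neighbors)
    return imputed

rows = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [1.05, None]]
print(knn_impute(rows, k=2))  # last row's NA becomes 10.5 (mean of 10 and 11)
```

The two rows with first feature near 1.05 are the nearest neighbors, so the imputed value reflects local structure rather than the global mean, which the outlying row (50.0) would otherwise distort.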
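A sketch of the zero and near-zero variance check described above, for a single feature stored as a list. The near-zero heuristic and its default cutoffs (frequency ratio above 95/5 and unique values below 10% of observations) are modeled on the defaults of R's caret `nearZeroVar` and are an assumption here, not taken from the source:

```python
def near_zero_variance(column, freq_ratio_cut=19.0, unique_pct_cut=10.0):
    """Return True if the feature is zero variance (one unique value) or
    near-zero variance (dominant value vastly outnumbers the runner-up
    AND distinct values are rare relative to the sample size)."""
    counts = sorted((column.count(v) for v in set(column)), reverse=True)
    if len(counts) == 1:
        return True  # zero variance: only a single unique value
    freq_ratio = counts[0] / counts[1]
    unique_pct = 100.0 * len(counts) / len(column)
    return freq_ratio > freq_ratio_cut and unique_pct < unique_pct_cut

print(near_zero_variance([7] * 30))            # True: zero variance
print(near_zero_variance([0] * 39 + [1]))      # True: near-zero variance
print(near_zero_variance(list(range(10))))     # False: all values distinct
```

Dropping such features before modeling reduces dimensionality at essentially no cost, since a constant (or nearly constant) column cannot help a model discriminate between observations.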