ABSTRACT

Categorical, or nominal, predictors are those that contain qualitative data, and they can take a variety of forms in the data to be modeled. With the exception of tree-based models, models can use the information in categorical predictors only after the predictors have been converted to numeric representations. This chapter focuses on methods that encode categorical data as numeric values. Several methods, effect encoding among them, use the outcome data as a guide when converting a categorical predictor to numeric columns. These techniques are well suited to cases where the predictor has many possible values or where new levels appear after model training. One issue with effect encoding, independent of the estimation method, is that it increases the possibility of overfitting. In the car evaluation data, factor encodings showed no difference compared with polynomial contrasts but were superior to unordered dummy variables.