ABSTRACT

In the previous chapter we discussed fairly sophisticated regression models with different types of responses (continuous and binary) and different types of predictors (continuous, binary, nominal, and ordinal). We also noted that regression models may not always be linear: polynomial terms and interactions among the predictors can make a model very complicated. When there are hundreds of predictors, variable selection and the incorporation of interactions may not work out well, primarily because of the volume and complexity involved. In such situations, decision-rule-based models are used to split the sample into homogeneous groups and thereby arrive at a prediction. Tree-based models are also classification models, but the main difference between them and ordinary classification models is that the sample is split successively according to the answers to questions such as whether X1 ≥ a: all observations with X1 ≥ a are classified into one group, and the rest into a second group. Typically the split is binary, but there are procedures in which the split may be multiway. Tree-based models are easily interpretable, and they can take many variables into account automatically, using only those that are most important. They are also useful for identifying interactions, which may later be used in regression models.
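
The splitting idea described above can be illustrated with a minimal sketch. The snippet below is an assumption for illustration only (the chapter's own software is not specified here): it uses scikit-learn's DecisionTreeClassifier on synthetic data, with hypothetical predictor names X1 and X2, to show how each internal node asks a question of the form "X1 ≥ a?" and sends observations to one of two child groups.

```python
# A minimal sketch, assuming scikit-learn; the data and the names
# X1, X2 are purely illustrative, not from the chapter itself.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Two continuous predictors; the binary response depends mainly on X1.
X = rng.normal(size=(200, 2))
y = (X[:, 0] >= 0.5).astype(int)

# Each internal node applies a binary rule such as "X1 <= a",
# successively splitting the sample into more homogeneous groups.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Print the fitted tree as text, making the split rules visible.
print(export_text(tree, feature_names=["X1", "X2"]))
```

The printed tree shows only the predictor that actually matters (X1 here), reflecting the point that tree-based models automatically use the most important variables.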