ABSTRACT

Decision trees are among the most popular data analytic techniques, falling into two branches: classification trees for categorical outcome variables and regression trees for continuous outcome variables. After background on decision tree terminology and algorithms, a "Quick Start" classification tree example is presented, focusing on survival of passengers of various classes on the ill-fated Titanic voyage. A second "Quick Start" example illustrates the use of regression trees to analyze correlates of murder. Use of the popular "rpart" decision tree package is then presented for both classification and regression trees. Topics include printing tree rules, visualization of tree results with each of three visualization packages, and interpretation of the confusion matrix and other model performance metrics. Among the procedures presented are using node distribution plots, saving predictions and residuals, pruning decision trees, employing cross-validation, interpreting lift, gain, and precision-versus-recall plots, and understanding the CP table. In addition to the "rpart" coverage, use of the "tree" and "ctree" packages is explained.
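
As a brief taste of the workflow covered in the body of the article, the following is a minimal sketch (not the article's own code) of fitting an rpart classification tree to R's built-in, aggregated Titanic table; the rpart.plot package is assumed here for plotting.

# Expand the aggregated Titanic table to one row per passenger,
# then fit and inspect a classification tree with rpart.
library(rpart)
library(rpart.plot)                                    # assumed for tree plotting

titanic_df <- as.data.frame(Titanic)
titanic_df <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq), 1:4]

fit <- rpart(Survived ~ Class + Sex + Age,
             data = titanic_df, method = "class")      # "class" = classification tree

printcp(fit)                                           # CP table with cross-validated error
rpart.plot(fit)                                        # visualize the fitted tree

pred <- predict(fit, type = "class")                   # in-sample class predictions
table(Predicted = pred, Actual = titanic_df$Survived)  # confusion matrix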