ABSTRACT

Random forests (RFs) are essentially bagged tree ensembles with an added twist, and they tend to provide accuracy comparable to many state-of-the-art supervised learning algorithms on tabular data while being comparatively easier to tune. RFs also include many bells and whistles that data scientists can leverage for non-prediction tasks, such as detecting anomalies/outliers and imputing missing values. The twist addresses a limitation of ordinary bagging: bagging a set of correlated predictors (i.e., models producing similar predictions) will only reduce the variance to a certain point. The idea behind the twist is to limit the potential splitters at each node of a tree to a random subset of the available predictors, which often results in a much more diverse ensemble of trees. Many decision tree algorithms can also handle missing values natively; CART and CTree, for example, employ surrogate splits for this purpose.
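
As a rough illustration (not part of the original text), the following Python sketch contrasts plain bagged trees with the random-subset-of-splitters idea described above, using scikit-learn's RandomForestRegressor; the max_features argument controls how many predictors are considered at each node. The simulated data, parameter values, and variable names are illustrative assumptions, not the authors' setup.

```python
# Hedged sketch: contrast ordinary bagging with the RF "twist" of sampling a
# random subset of predictors at each split. Library choice (scikit-learn),
# the simulated dataset, and all parameter values are illustrative assumptions.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

# Simulated regression data with 10 predictors
X, y = make_friedman1(n_samples=500, n_features=10, random_state=0)

models = {
    # max_features=None: every predictor is a candidate splitter at every
    # node, so this reduces to ordinary bagged trees
    "bagged trees": RandomForestRegressor(
        n_estimators=300, max_features=None, oob_score=True, random_state=0
    ),
    # max_features="sqrt": only sqrt(p) randomly chosen predictors are
    # candidates at each node, decorrelating the trees (the RF "twist")
    "random forest": RandomForestRegressor(
        n_estimators=300, max_features="sqrt", oob_score=True, random_state=0
    ),
}

for name, model in models.items():
    model.fit(X, y)
    # Out-of-bag R^2 gives an internal estimate of generalization accuracy
    print(f"{name}: OOB R^2 = {model.oob_score_:.3f}")
```

The out-of-bag score used here is one of the "bells and whistles" that comes for free with bootstrap aggregation; it provides a rough estimate of generalization accuracy without a separate validation set.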