ABSTRACT

Variable selection, a key activity in regression modelling, is the process of choosing which variables to keep and which to exclude from the final model. The trade-off is clear: with more variables we can potentially explain more of the systematic variation, but we may also bring in more noise. Keeping only the relevant variables in the model is therefore a crucial step, with several potential goals: better estimation, better prediction, and better interpretation. These statistical goals are in line with a general principle in science, Occam's razor, which states that among competing models that fit the data equally well, we should choose the one with the fewest assumptions or parameters. When there are many potential predictors of equal status, i.e., no prior preferences among them, having as few predictors as possible in the model often aids interpretation. When there is a large number of potential predictors, overfitting also becomes a serious problem: it is far too easy to produce models that fit well on the training or discovery data set but poorly on the validation data set.
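
The overfitting point can be made concrete with a small simulation (a minimal sketch, not from the text; the sample sizes, the predictor count, and the use of NumPy and scikit-learn are illustrative assumptions). With many pure-noise predictors and a modest sample, ordinary least squares attains a high R² on the training data while showing essentially no predictive value on fresh validation data:

```python
# Illustration only: with many pure-noise predictors, OLS fits the
# training data deceptively well but fails on held-out validation data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

n_train, n_valid, p = 100, 100, 80           # many predictors, modest sample
X_train = rng.standard_normal((n_train, p))  # predictors are pure noise
X_valid = rng.standard_normal((n_valid, p))
y_train = rng.standard_normal(n_train)       # response unrelated to X
y_valid = rng.standard_normal(n_valid)

model = LinearRegression().fit(X_train, y_train)

print(f"training R^2:   {r2_score(y_train, model.predict(X_train)):.2f}")
print(f"validation R^2: {r2_score(y_valid, model.predict(X_valid)):.2f}")
```

On a typical run the training R² is high (roughly 0.8 here) while the validation R² is near zero or negative, despite the response being independent of every predictor.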