ABSTRACT

Variable selection in regression, the task of identifying the best subset of many candidate variables to include in a model, is arguably the hardest part of the model-building process. This chapter reviews five frequently used variable selection methods found in major statistical software packages. It then presents the part of Tukey's exploratory data analysis (EDA) that is relevant to variable selection: the natural seven-step cycle of statistical modeling and analysis, which serves as a notable solution to variable selection in regression. Classic statistics dictates that the statistician approach a given problem with a prespecified procedure designed for that problem. An ideal variable selection method for regression models would find one or more subsets of variables that produce an optimal model. The resultant models of such an ideal method would have the following properties: accuracy, stability, parsimony, interpretability, and lack of bias in drawing inferences.