ABSTRACT

Feature selection is fundamental in statistical modeling. When the number of predictors is large, it is crucial to identify a few important variables that explain the response well. A sparse model is far more interpretable than the full model using all predictors, and feature selection can often improve the prediction accuracy of the model. Traditional model selection methods combine best-subset selection with a model selection criterion such as AIC or BIC. This approach has two fundamental drawbacks. First, best-subset selection is not computationally feasible for high-dimensional data, since the number of candidate subsets grows exponentially with the number of predictors. Second, best-subset selection is unstable in the sense that a small perturbation of the data can yield a very different model [2]. Modern high-throughput technologies in biology, such as gene expression arrays, produce massive high-dimensional data that traditional variable selection approaches cannot handle.