ABSTRACT

We discuss the problem of supervised statistical learning in large data sets, where the number of explanatory variables may exceed the number of observations. In this situation, classical model selection criteria such as the Akaike Information Criterion or the Bayesian Information Criterion usually overestimate the number of important predictors, which in turn degrades the predictive properties of the estimated models. To address this problem, model size penalties should include a "multiple testing correction" that depends on the number of variables in the data set. We present the basic ideas in the context of estimating the vector of means of a multivariate normal distribution with independent coordinates. We report results illustrating the asymptotic optimality of the Bonferroni and Benjamini-Hochberg procedures in terms of estimation loss and Bayes risk. We then move to the regression setting and discuss a variety of estimation procedures whose penalties are related to multiple testing corrections and which enjoy certain optimal prediction and model selection properties. We cover classical information criteria as well as regularization techniques such as the LASSO and SLOPE. Finally, we describe the knockoff method for constructing control variables, which can be combined with any supervised learning algorithm and provably controls the False Discovery Rate.