ABSTRACT

One of the important objectives of statistical learning is to produce a model (or models) from the current data so that future outcomes can be predicted with high accuracy and low variability (that is, high precision) on new data. Variable selection is therefore critical for constructing a parsimonious model, particularly for big data with many variables. In this chapter, I describe some of the most commonly used methods for deriving such models, including ridge regression (which retains all variables in the data), the lasso, the group lasso, the adaptive lasso, the elastic net, and sure independence screening. Although these methods share similar underlying principles, each takes a somewhat different approach to deriving the model. Mathematical derivations, examples, and R code are provided to illustrate how data can be analyzed using these methodologies.