ABSTRACT

All statistical techniques introduced in the previous chapters are limited to data with a relatively small number of covariates. When the sample size n goes to infinity while the number of covariates p is kept fixed, standard likelihood- and estimating-equation-based methods can be used straightforwardly for estimation and inference. Since the late 1990s, advances in biomedical technologies have generated a large number of “large p, small n” data sets, in which the number of covariates is comparable to or even much larger than the sample size. For example, in a typical microarray gene expression study, the number of subjects n is usually no greater than 1000, while the number of genes profiled can be several thousand or more, and the expression values of all these genes are recorded as the p covariates. With such data, standard survival analysis techniques are no longer directly applicable. Suppose that we attempt to fit a Cox regression model with more covariates than subjects. Mathematically, it can be shown that multiple or even infinitely many maximizers of the partial likelihood function exist, with most or all of them being unreasonable. Some existing software packages would fit a model using only the first few covariates, while setting the estimated regression coefficients of the rest to zero. Such results are unreasonable in that they depend on the order in which the covariates are entered in the computing code.
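
The non-uniqueness claim can be checked numerically. The following is a minimal sketch, not the chapter's own method: it assumes simulated data, no tied event times, and a hypothetical helper named cox_partial_loglik written only for this illustration. Because the Cox partial likelihood depends on the coefficient vector only through the linear predictor X beta, and because a design matrix with p > n has a nontrivial null space, shifting beta along any null-space direction leaves the partial likelihood unchanged. Any maximizer that exists therefore cannot be unique.

```python
import numpy as np

def cox_partial_loglik(beta, X, time, event):
    """Cox partial log-likelihood (hypothetical helper; assumes no tied event times)."""
    eta = X @ beta  # linear predictor; the only way beta enters the likelihood
    ll = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]  # risk set at the i-th subject's event time
        ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return ll

# Simulated "large p, small n" data (assumed for illustration only).
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))
time = rng.exponential(size=n)   # continuous times, so ties have probability zero
event = rng.random(n) < 0.7      # True marks an observed event, False a censored time

beta = 0.1 * rng.standard_normal(p)

# With p > n, rank(X) <= n < p, so the trailing right singular vectors span
# the null space of X. Take one such direction d, for which X @ d is zero.
_, _, Vt = np.linalg.svd(X)      # full SVD: Vt has shape (p, p)
d = Vt[-1]

ll_original = cox_partial_loglik(beta, X, time, event)
ll_shifted = cox_partial_loglik(beta + 5.0 * d, X, time, event)

print("max |X d|            :", np.max(np.abs(X @ d)))
print("log-lik at beta      :", ll_original)
print("log-lik at beta + 5 d:", ll_shifted)
```

The two printed log-likelihood values agree up to floating-point error even though the two coefficient vectors differ, and the same holds for any multiple of d. This sketch shows only that the partial likelihood is flat along such directions; it does not by itself show that a finite maximizer exists.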