ABSTRACT

Gene expression microarray data are typically characterized by a large number of variables with unknown correlation structures [1,2]. This high dimensionality makes the data challenging to analyze, especially when the correlations among variables are complex. Including many variables in a standard statistical analysis can easily cause problems such as singularity and overfitting, and is sometimes not feasible at all. To manage this problem, the dimensionality of the data is often reduced as a first step. There are multiple ways to achieve this goal. One is to select a subset of genes, based on certain criteria, that is believed to best predict the outcome. This gene selection strategy is typically based on a univariate measure of association with the outcome, such as a t-test or a rank test [3,4]. Another strategy is to use a lower-dimensional set of weighted combinations of genes to represent the total variation in the data. Representative approaches are principal component analysis (PCA) [5] and partial least squares (PLS) [6-9]. Machine learning algorithms such as LASSO [10,11] and Random Forest [12] have a built-in capacity to select variables while simultaneously making predictions, and can be used to accommodate high-dimensional microarray data.
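
As an illustration of the univariate gene selection strategy described above, the following minimal sketch ranks genes by the absolute value of a two-sample Welch t-statistic and keeps the top k. It is not taken from the cited works: the function names (welch_t, select_top_genes), the synthetic data, and the choice of Welch's unequal-variance form of the t-test are all assumptions made for the example.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def select_top_genes(expr, labels, k):
    """Return the indices of the k genes with the largest |t| between classes.

    expr:   list of samples, each a list of per-gene expression values
    labels: list of 0/1 class labels, one per sample
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        group0 = [row[g] for row, y in zip(expr, labels) if y == 0]
        group1 = [row[g] for row, y in zip(expr, labels) if y == 1]
        scores.append(abs(welch_t(group0, group1)))
    # Rank genes from the strongest to the weakest univariate association.
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:k]


# Hypothetical synthetic data: 20 samples, 50 genes.
# Genes 0 and 1 are shifted upward in class 1; the rest are pure noise.
rng = random.Random(0)
labels = [0] * 10 + [1] * 10
expr = []
for y in labels:
    row = [rng.gauss(0.0, 1.0) for _ in range(50)]
    if y == 1:
        row[0] += 6.0
        row[1] += 6.0
    expr.append(row)

top = select_top_genes(expr, labels, k=2)
print(sorted(top))
```

Because each gene is scored independently, a filter of this kind ignores the correlations among genes. Projection methods such as PCA and PLS, and embedded methods such as LASSO and Random Forest, address that limitation.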