ABSTRACT

One challenging problem in many feature selection applications is the small-sample problem [143, 93, 42, 41, 159, 125], where the dimensionality of the data is extremely high while the sample size is very small. For instance, a typical cDNA microarray data set [88] used in modern genetic analysis usually contains more than 30,000 features (the oligonucleotide probes), but the sample size is often less than 100. With so few samples, many irrelevant features can easily appear statistically relevant purely by chance [159]. On data sets of this kind, most existing feature selection algorithms become unreliable and select many irrelevant features. For example, in cancer studies based on cDNA microarrays, researchers found that traditional feature selection algorithms offer limited or inaccurate selection of biological features [118, 159].

Fold change¹ is a popular method used in gene selection. To study its actual performance when the sample size is small, we obtain a microarray data set from the Gene Expression Omnibus (GEO) [11] with the reference ID GSE2403. We randomly partition the samples into positive and negative groups with 10 samples in each group. We then apply the fold change measurement to the split samples to identify significantly regulated genes. We repeat this process 10 times, and the number of significantly regulated genes identified each time is shown in Figure 6.1. On average, we identify 12.7 significantly regulated genes on each random split. We also apply the t-test [123] on the original split,² and identify 16 significantly regulated genes, which is only slightly larger than the average number obtained on the random splits. This example shows that when the sample size is small (20) and the number of features is very large (11,362), many features can be identified as significant on an arbitrary split of the samples. This implies that, on the original split, some of the significant features identified by fold change may gain their statistical significance by sheer randomness.
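
The random-split experiment described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used to produce Figure 6.1: it runs on synthetic Gaussian noise rather than GSE2403, it assumes the expression values are on a log2 scale, and it assumes a two-fold cutoff (an absolute log2 difference of at least 1) as the significance criterion. The function names `count_fold_change_hits` and `random_split_counts` are hypothetical.

```python
import numpy as np


def count_fold_change_hits(log_expr, group_a, group_b, log2_threshold=1.0):
    """Count features whose mean log2 expression differs between the two
    groups by at least log2_threshold (assumed cutoff: two-fold change)."""
    diff = log_expr[group_a].mean(axis=0) - log_expr[group_b].mean(axis=0)
    return int(np.sum(np.abs(diff) >= log2_threshold))


def random_split_counts(log_expr, n_repeats=10, seed=0, log2_threshold=1.0):
    """Repeatedly split the samples at random into two equal-sized groups
    and count the features flagged by fold change on each split."""
    rng = np.random.default_rng(seed)
    n_samples = log_expr.shape[0]
    half = n_samples // 2
    counts = []
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)  # random relabeling of the samples
        counts.append(
            count_fold_change_hits(
                log_expr, perm[:half], perm[half:half * 2], log2_threshold
            )
        )
    return counts


# Synthetic stand-in for a microarray: 20 samples, 11,362 features, pure noise.
# No feature is truly related to any grouping of the samples.
data_rng = np.random.default_rng(42)
log_expr = data_rng.normal(loc=8.0, scale=1.0, size=(20, 11362))

counts = random_split_counts(log_expr, n_repeats=10)
print("hits per random split:", counts)
print("mean hits:", float(np.mean(counts)))
```

Because the synthetic data contain no real signal, every feature flagged on a random split is a false positive by construction. The absolute counts depend on the assumed noise level and cutoff, so they are not comparable to the numbers reported for GSE2403; the sketch only illustrates the mechanism.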