ABSTRACT

The major challenge in high-throughput experiments, such as microarray data, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) data, or surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) data, is that the data are often high dimensional. When the number of dimensions reaches thousands or more, the running time of statistical analyses or pattern recognition algorithms can become unreasonably long. This is especially problematic when some of the features (markers or variables) are not discriminatory. Irrelevant features can also reduce the accuracy of some algorithms. For example, experiments with a decision tree classifier have shown that adding a random binary feature to standard datasets can degrade classification performance by 5-10%.1 Furthermore, in many pattern recognition tasks, the number of features determines the dimension of a search space; the more features there are, the larger the search space and the harder the problem. With today's technology, the dimensions of microarray and MALDI-TOF data can exceed 40,000 and 100,000 per experiment, respectively. A further challenge for traditional statistical approaches is that the number of subjects is typically far smaller than the number of features or variables in the dataset.
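The cited decision-tree effect can be reproduced in miniature. The sketch below (not part of the source) appends a single irrelevant random binary feature to a standard dataset and compares cross-validated decision tree accuracy before and after; the dataset, random seed, and tree settings are illustrative assumptions, and the size of the measured drop will vary with the data and classifier configuration.

```python
# Minimal sketch: measure the effect of one irrelevant random binary
# feature on decision tree accuracy. Dataset and settings are assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Baseline: 10-fold cross-validated accuracy on the original features.
clf = DecisionTreeClassifier(random_state=0)
base = cross_val_score(clf, X, y, cv=10).mean()

# Augmented: the same data with one random binary (noise) feature appended.
noise = rng.integers(0, 2, size=(X.shape[0], 1))
X_aug = np.hstack([X, noise])
aug = cross_val_score(clf, X_aug, y, cv=10).mean()

print(f"accuracy without noise feature: {base:.3f}")
print(f"accuracy with noise feature:    {aug:.3f}")
```

Because tree induction greedily selects split features, a spurious feature that happens to correlate with class labels in a training fold can be chosen for a split, which is one mechanism behind the degradation the abstract describes.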