ABSTRACT

Genomic big data typically come as “tall” data matrices, in which the number of variables far exceeds the number of cases. In other words, genomics applications usually involve small-sample, high-dimensional data sets. This creates unique challenges for classification techniques, which must be applied with care to avoid classifier overfitting, poor feature selection, and inaccurate classifier error estimation. In this chapter, we present a Bayesian approach to the accurate classification of small-sample, high-dimensional data sets from three applications of current interest: gene expression rank data, liquid chromatography-mass spectrometry (LC-MS) protein abundance data, and 16S rRNA metagenomic data. Models for each kind of data are presented, the Bayesian inference procedure is explained in detail, and experimental results compare the performance of the proposed algorithms with that of other state-of-the-art classification methods.