ABSTRACT

The term classification (also known as supervised learning) refers to a set of techniques that are designed to construct rules that allow units to be sorted into different groups. These rules are based on data and are designed to optimize some fixed criterion. In the context of microarray analysis, classification could potentially be useful for medical diagnosis. Here the goal is to determine if a subject is healthy or diseased, or to determine the disease state, based on a set of gene expression measurements. For example, one could run a microarray on a tissue or fluid sample from a subject and then determine if the subject has a certain form of cancer. Although this seems a straightforward manner of diagnosing patients, microarrays have yet to be routinely used in this fashion in clinical practice; the hope is that one day this will be an effective method for diagnosing illness. Currently classification is used in a less formal way to identify genes that differ between patient groups (for example, Golub, 1999). While one can use methods for testing for differential expression, such methods don’t directly use information on all genes simultaneously to detect differences in gene expression (although some methods, such as SAM and the empirical Bayes approach discussed in Section 13.4, do use some information from all genes to make a decision regarding each gene). There are a number of good texts available that cover this material in greater depth, such as Hastie, Tibshirani, and Friedman (2001).
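
To make the idea of a data-based rule that sorts subjects into groups more concrete, the following is a minimal sketch in Python with NumPy. It uses a nearest-centroid rule, which is only one simple choice of rule and is not necessarily a method advocated here. The function names (fit_nearest_centroid, classify), the group labels, and the toy expression values are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def fit_nearest_centroid(X, y):
    """Return the group labels and the mean expression profile of each group.

    X is an (arrays x genes) matrix; y holds one group label per array.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)  # sorted unique group labels
    centroids = np.vstack([X[y == g].mean(axis=0) for g in labels])
    return labels, centroids


def classify(x_new, labels, centroids):
    """Assign x_new to the group whose centroid is nearest (Euclidean distance)."""
    dists = np.linalg.norm(centroids - np.asarray(x_new, dtype=float), axis=1)
    return labels[np.argmin(dists)]


# Toy data (made up): 6 arrays (rows) by 3 genes (columns).
X = np.array([
    [1.0, 0.2, 3.1],
    [1.2, 0.1, 2.9],
    [0.9, 0.3, 3.0],
    [3.0, 0.2, 1.0],
    [3.2, 0.1, 1.1],
    [2.9, 0.3, 0.9],
])
y = np.array(["healthy", "healthy", "healthy",
              "diseased", "diseased", "diseased"])

labels, centroids = fit_nearest_centroid(X, y)

# Classify a new subject from its (hypothetical) expression measurements.
x_new = np.array([2.8, 0.2, 1.2])
predicted = classify(x_new, labels, centroids)
print(predicted)
```

In this sketch the rule is determined entirely by the group means of the training arrays, and a new subject is assigned to whichever group mean lies closest. Other rules would replace the distance calculation with a different criterion while keeping the same overall shape: fit on labeled arrays, then apply to new measurements.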

The data for the classification problem consist of a set of measurements x_i for each array i and an indicator y_i for each array indicating group membership. The problem is then to define, in some optimal fashion, a map C(x) (known as the classifier) whose domain is the space in which the x_i reside and whose range is the space in which the y_i reside. Once one has determined C(x), given a new set of measurements x_new for some subject, one classifies this subject as C(x_new). In the context of microarray analysis x_i is a set of gene expression measurements, so we will now assume that the x_i are p-dimensional vectors. Here p may be the number of genes on the microarray, but frequently only a subset of these genes is used for classification. The reason for using only a subset is that for any given tissue or fluid specimen only a subset of genes is actually expressed in the sample. Genes that aren’t expressed are not useful for classification as the data for those genes are just random noise. In addition, sometimes