ABSTRACT

CONTENTS

1.1 Introduction
1.2 Experimental Design
1.2.1 Data from Microarray Experiments
1.2.2 Sources of Variation
1.2.3 Principles of Experimental Design
1.2.4 Common Designs for Oligonucleotide Arrays
1.2.5 Power/Sample Size Considerations
1.2.6 Pooling
1.2.7 Designs for Dual-Channel Arrays
1.3 Normalization of Microarray Data
1.3.1 Normalization and Its Implications for Estimation of Variance Components
1.3.2 Normalization Methods
1.3.2.1 Method Based on Selected Invariant Genes
1.3.2.2 Methods Based on Global or Local Values
1.3.2.3 Local Regression Methods
1.3.2.4 Quantile-Based Methods
1.3.2.5 Methods Based on Linear Models
1.3.2.6 Probe Intensity Models
1.4 Clustering and Classification
1.4.1 Clustering
1.4.2 Classification
1.4.2.1 Dimensionality Reduction
1.4.2.2 Classification Algorithms
1.4.2.3 Accuracy of Classification
1.5 Detection of Differential Gene Expressions
1.5.1 Fold Change
1.5.2 The Two Sample t-Test and its Variants
1.5.3 Adjustments for Multiple Testing
1.5.4 False Discovery Rate

C5777: “c5777_c001” — 2007/10/27 — 13:02 — page 2 — #2

1.5.6 Empirical Bayes Methods
1.6 Networks and Pathways
1.7 Concluding Remarks
References

Microarray technology has quickly become one of the most commonly used high-throughput systems in modern biological and medical experiments over the past 8 years. For the most part, a single microarray records the expression levels of many genes in a tissue sample; this number often runs into the tens of thousands. The end result is a huge multivariate data set containing the gene expression profiles. A microarray experiment typically compares expression data under two or more treatments (e.g., cell lines, experimental conditions); additionally, there is often a time component in the experiment. Owing to the relatively high production cost of microarrays, very few replicates are often available for a given set of experimental conditions, which poses new challenges for statisticians analyzing these data sets.

Most of the early microarray experiments involved the so-called two-channel cDNA microarrays, where small amounts of genetic material (cDNA) are printed on a small glass slide with robotic print heads. The mRNA samples corresponding to two different treatments are tinted with two different fluorescent dyes (generally red and green) and allowed to hybridize (a technical term for the biological process by which an mRNA strand attaches to the complementary cDNA strand) on the same slide. At the end, the expression values of the samples under comparison are evaluated with specialized laser scanners. In more recent studies, oligonucleotide arrays, also known as Affymetrix GeneChips®, have become increasingly popular. These are factory-prepared arrays targeted at a particular genome (e.g., rat, human) that contain oligonucleotide material placed in multiple pairs, called a probe set (https://www.affymetrix.com/products/system.affx). One member of each pair contains the complementary base sequence for the targeted gene; the other has an incorrect base in the middle, created to measure nonspecific binding during hybridization, which can be used for background correction. Expression values are computed from the relative amounts of binding (perfect match versus mismatch).

Besides the above two microarray platforms, there exist many additional choices at present, including many custom arrays offered by various manufacturers; in addition, serial analysis of gene expression (SAGE), which is technically not a microarray-based technique, produces gene expression data as well. Unlike microarrays, SAGE is a sequencing-based gene expression profiling technique that does not require prior knowledge of the sequences to be considered. Another important difference between the two is that, with SAGE, one does not need a normalization procedure (see Section 1.3), since it measures abundance or expression in an absolute sense.

Calculating expression itself is an issue with most, if not all, microarray platforms; in addition, there are issues of normalization and correction for systematic biases and artifacts, some of which are discussed in Section 1.3. There have also been recent studies comparing multiple microarray platforms and the amount of agreement between them. The very latest set of results (see, e.g., Irizarry et al., 2005) contradicts earlier beliefs about the nonreproducibility of microarray gene expression calculations and concludes that the laboratories running the experiments have more effect on the final conclusions than the platforms. In other words, two laboratories following similar strict guidelines would get similar results (that are driven by biology) even if they use different technologies. On the other hand, the “best” and the “worst” laboratories in this study used the same microarray platform but got very different answers.

In this review, we present a brief overview of various broad topics of microarray data analysis. We are particularly interested in the statistical aspects of microarray-based bioinformatics. The selection of topics is by no means comprehensive, partly because new statistical problems are emerging every day in this fast-growing field. A number of monographs have come out in recent years (e.g., Causton et al., 2003; Speed, 2003; Lee, 2004; McLachlan et al., 2004; Wit and McClure, 2004), which can help an interested reader gain further familiarity and knowledge in this area.

The rest of the chapter is organized as follows. Some commonly employed statistical designs in microarray experiments are discussed in Section 1.2. Aspects of preprocessing of microarray data that are necessary for further statistical analysis are discussed in Section 1.3. Elements of statistical machine learning techniques that are useful for gaining insights into microarray data sets are discussed in Section 1.4. Hypothesis testing with microarray data is covered in Section 1.5. The chapter ends with a brief discussion of pathway analysis using microarrays as a data generation tool.
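
As a small illustration of the perfect match versus mismatch comparison mentioned earlier in this introduction for oligonucleotide arrays, the following minimal sketch averages the perfect match minus mismatch differences over the probe pairs of one probe set. The function name, the intensity values, and the choice of a plain average difference are assumptions made for illustration only; this is not presented as the algorithm used by any particular platform or software package.

```python
from statistics import mean


def average_difference(pm, mm):
    """Average of (PM - MM) over the probe pairs of a single probe set.

    pm, mm: equal-length sequences of perfect match and mismatch
    intensities (hypothetical inputs for this sketch).
    """
    if len(pm) != len(mm) or len(pm) == 0:
        raise ValueError("pm and mm must be non-empty and of equal length")
    # Subtracting the mismatch intensity acts as a crude correction
    # for nonspecific binding at each probe pair.
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for a probe set with four probe pairs.
pm_intensities = [1200.0, 950.0, 1100.0, 1010.0]
mm_intensities = [300.0, 250.0, 400.0, 310.0]
signal = average_difference(pm_intensities, mm_intensities)
```

A simple average is sensitive to outlying probe pairs, which is one reason the calculation of expression values is treated as a statistical problem in its own right (see Section 1.3).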