ABSTRACT

Given gene expression data from some set of experimental conditions, one of the

first questions that the analyst wants to answer is which genes differ across

conditions. At first glance, many of these questions can be dealt with most simply by

conducting an analysis on a gene by gene basis using some spot level summary such

as RMA. For the simplest sort of analysis, suppose we have a certain number of

biological replicates from 2 distinct classes and want to address the question of which

genes differ between the 2 classes. Anyone with some training in statistics would

recognize this as the classical 2 sample problem and would know how to proceed.

One would perhaps examine the distribution of the measurements within each class

using a histogram to assess if the data depart strongly from normality and consider

appropriate transformations of the data or perhaps even a non-parametric test (such

as the Wilcoxon test). One would perhaps assess if the variances were unequal and

perhaps correct for this by using Welch’s modified 2 sample t-test; otherwise, one could use the usual common variance 2 sample t-test. While there is some evidence that this is not optimal, such a procedure would be appropriate. As an example, consider

Figure 13.1. This figure displays a histogram of the p-values one obtains if one uses a standard 2 sample t-test to test for differences in gene expression between a set of HIV negative patients and a set of HIV positive patients for all genes on an

Affymetrix microarray. The tissue used in this application was taken from the lymph

nodes of the patients, and we expect large differences between HIV negative and HIV

positive patients in this organ because HIV attacks cells that are important for normal

functioning of the immune system. If no genes are differentially expressed, then we

expect this figure to look like a histogram of a uniform distribution. That does not

appear to be the case, since many genes have small p-values (for this example, over 12,000 genes have p-values that are less than 0.05).
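
The gene by gene procedure described above can be sketched in a few lines. What follows is an illustration under stated assumptions, not the analysis behind Figure 13.1: the expression values are simulated Gaussian noise with no true differences (the HIV data are not reproduced here), and the function names, the number of genes, and the number of replicates per class are all hypothetical choices. To stay within the Python standard library, the t distribution tail probability is computed from the regularized incomplete beta function; in practice one would more likely call a statistics package. The sketch applies Welch's 2 sample t-test to each gene and bins the resulting p-values, so one can see what "a histogram of a uniform distribution" looks like when no genes are differentially expressed.

```python
import math
import random
from statistics import mean, variance


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    delta = 0.0
    for m in range(1, max_iter + 1):
        # Even and odd coefficients of the continued fraction at step m.
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for coef in (even, odd):
            d = 1.0 + coef * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + coef / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b), the regularized incomplete beta function, for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def t_two_sided_p(t, df):
    """Two-sided p-value P(|T| > |t|) for a Student t variable with df degrees of freedom."""
    return regularized_incomplete_beta(df / 2.0, 0.5, df / (df + t * t))


def welch_t_test(x, y):
    """Welch's 2 sample t-test; returns (t statistic, approximate df, two-sided p-value)."""
    vx_n = variance(x) / len(x)
    vy_n = variance(y) / len(y)
    se2 = vx_n + vy_n  # assumed nonzero, i.e. the data are not constant
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / (vx_n ** 2 / (len(x) - 1) + vy_n ** 2 / (len(y) - 1))
    return t, df, t_two_sided_p(t, df)


def simulate_null_p_values(n_genes=2000, n_per_class=10, seed=0):
    """One p-value per simulated gene, with both classes drawn from the same distribution."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_genes):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        p_values.append(welch_t_test(x, y)[2])
    return p_values


def histogram(p_values, n_bins=20):
    """Counts of p-values in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


if __name__ == "__main__":
    p_values = simulate_null_p_values()
    print("bin counts:", histogram(p_values))
    print("fraction below 0.05:", sum(p < 0.05 for p in p_values) / len(p_values))
```

With no differential expression, the bin counts should be roughly equal and about 5% of the p-values should fall below 0.05. A pronounced excess of p-values in the leftmost bin, as reported for the lymph node data, is what suggests that many genes differ between the 2 classes.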