ABSTRACT

One of the most widely used experimental designs for microarray studies involves obtaining gene expression data on samples of cases from two or more (J ≥ 2) populations that differ with respect to some characteristic. Thus, analysis of variance (ANOVA) models are typically used to test, for each gene, whether expression levels differ across the populations. The one-way ANOVA model for the kth gene can be expressed as the following linear model:

$$Y_{ij}(k) = \mu_{*}(k) + \beta_j(k) + \varepsilon_{ij}(k) \qquad (13.1)$$

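where j = 1, ..., J indexes the conditions, groups, or populations (i.e., the between-subjects factor); i = 1, ..., $n_j$ indexes the samples nested within the jth group; and k = 1, ..., K indexes the genes. $Y_{ij}(k)$ is the expression level of the kth gene for the ith sample in the jth group, $\mu_{*}(k)$ is the overall mean expression level for the kth gene, $\beta_j(k)$ is the differential expression effect of the jth group for the kth gene, $\varepsilon_{ij}(k)$ is a random error component for the kth gene, and $N = \sum_{j=1}^{J} n_j$ is the total number of subjects. A log transformation of the expression levels is often used so that $\beta_j(k)$ reflects a differential expression ratio, because a difference on the log scale corresponds to a ratio on the original scale. An F-ratio, $F(k)$, can be used to test whether there are statistically significant differences in expression for the kth gene: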

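$$F(k) = \frac{\sum_{j=1}^{J} n_j \left(\bar{Y}_{j}(k) - \bar{Y}_{*}(k)\right)^2 / (J - 1)}{\sum_{j=1}^{J} \sum_{i=1}^{n_j} \left(Y_{ij}(k) - \bar{Y}_{j}(k)\right)^2 / (N - J)} \qquad (13.2)$$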

The F-ratio is distributed as F[(J−1),(N−J)] under the null hypothesis:

$$H_0(k): \beta_j(k) = 0 \quad \text{for all } j \text{ and for the } k\text{th gene} \qquad (13.3)$$
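As an illustration, the F-ratio of Equation 13.2 can be computed gene by gene in a few lines. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `f_ratio`, the gene labels, and the expression values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from statistics import fmean


def f_ratio(groups):
    """Return (F, df_between, df_within) for a single gene.

    `groups` is a list of J lists; groups[j] holds the n_j expression
    values (e.g., log-transformed) observed in the jth group.
    """
    J = len(groups)
    N = sum(len(g) for g in groups)
    if J < 2 or N <= J:
        raise ValueError("need J >= 2 groups and N > J observations")

    group_means = [fmean(g) for g in groups]
    grand_mean = fmean([y for g in groups for y in g])

    # Numerator of Eq. 13.2: between-group sum of squares, weighted by n_j.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Denominator of Eq. 13.2: pooled within-group sum of squares.
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)

    df_between, df_within = J - 1, N - J
    # Raises ZeroDivisionError if every group has zero within-group variance.
    F = (ss_between / df_between) / (ss_within / df_within)
    return F, df_between, df_within


# Hypothetical data: two genes, J = 3 groups, n_j = 3 samples per group.
genes = {
    "gene_a": [[1, 2, 3], [2, 3, 4], [6, 7, 8]],   # group means differ
    "gene_b": [[5, 6, 7], [6, 5, 7], [7, 6, 5]],   # group means identical
}

for name, groups in genes.items():
    F, df1, df2 = f_ratio(groups)
    print(f"{name}: F = {F:.2f} on ({df1}, {df2}) df")
# gene_a: F = 21.00 on (2, 6) df
# gene_b: F = 0.00 on (2, 6) df
```

Each F value would then be referred to the F distribution with (J − 1, N − J) degrees of freedom to obtain a p-value for the corresponding gene; that step is omitted here because it requires a numerical F distribution routine.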

In using the parametric F-ratio, the random error components ($\varepsilon_{ij}(k)$) for the kth gene are assumed to be independent and identically distributed with a mean of zero, a homoscedastic (constant) variance ($\sigma^2_{\varepsilon}(k)$), and a normal shape for each group (i.e., NID[0, $\sigma^2_{\varepsilon}(k)$]). Requiring identical error distributions ensures that a rejection of the null hypothesis in Equation 13.3 is due to shifts (differences) among location parameters. Furthermore, when the error distributions are normal, means as estimates of location, together with the parametric F-ratio, yield the maximum statistical power for rejecting Equation 13.3.
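The link between the error assumptions and location shifts can be made concrete through the expected mean squares of the two terms in the F-ratio. The following is a standard one-way ANOVA result rather than a formula taken from this chapter; the symbols $MS_{\text{between}}$, $MS_{\text{within}}$, $\mu_j(k)$, and $\bar{\mu}(k)$ are introduced here only for illustration, and the result assumes independent errors with mean zero and common variance.

```latex
\begin{aligned}
E\left[MS_{\text{within}}(k)\right]  &= \sigma^2_{\varepsilon}(k), \\
E\left[MS_{\text{between}}(k)\right] &= \sigma^2_{\varepsilon}(k)
  + \frac{1}{J - 1} \sum_{j=1}^{J} n_j \left(\mu_j(k) - \bar{\mu}(k)\right)^2, \\
\text{where}\quad \mu_j(k) &= \mu_{*}(k) + \beta_j(k), \qquad
\bar{\mu}(k) = \frac{1}{N} \sum_{j=1}^{J} n_j\, \mu_j(k).
\end{aligned}
```

Under the null hypothesis all $\beta_j(k)$ are zero, the second term vanishes, and the numerator and denominator of the F-ratio estimate the same variance. With identical error distributions across groups, the only way the numerator can be inflated in expectation is through differences among the group locations $\mu_j(k)$.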