ABSTRACT

We begin with an example that will be used throughout the chapter. The data come from Sørlie et al. (2001). The goal of that article was to “classify breast carcinomas based on variations in gene expression derived from complementary deoxyribonucleic acid (cDNA) microarrays and to correlate tumor characteristics to clinical outcome.” The data consist of log fluorescence values for 456 cDNA clones measured on 85 tissue samples. Of the 85 samples, 4 are normal tissue samples, 78 are carcinomas, and 3 are fibroadenomas. Three of the four normal tissue samples were pooled normal breast samples from multiple individuals. Sørlie et al. (2001) selected the 456 genes from an initial set of 8102 genes so as to optimally identify the intrinsic characteristics of breast tumors.

In Figure 4.1 and Figure 4.2, the data are plotted as heat maps.∗ This representation assigns a color to every matrix entry, with negative (underexpressed) values shown in green and positive (overexpressed) values in red. The data presented in these plots were preprocessed by Sørlie et al. (2001), who adjusted rows and columns to have median zero. This preprocessing was applied before the subset of 456 genes was selected, so the column (i.e., sample) medians are not precisely zero. Heat maps are used to look for similarities between genes and between samples. They are most effective if rows and columns are ordered so that these patterns can be identified. Clustering is often used to provide this ordering, by identifying groups of samples (or genes) and then arranging the groups so that the closest groups are adjacent. This is illustrated in Figure 4.1, where rows and columns are arranged according to separate hierarchical clusterings. Sørlie et al. (2001) used a similar graphic to identify interesting groups of genes and tumor subtypes. In Figure 4.2, five interesting gene subgroups are given. These are similar to those identified by Sørlie et al. (2001).
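The preprocessing and ordering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by Sørlie et al. (2001) or by the Cluster and Treeview packages. The toy matrix, the function names `median_center` and `cluster_order`, the alternating row/column sweep, and the choice of Euclidean distance with average linkage are all assumptions made for the example. The sketch also keeps each cluster contiguous in the ordering but makes no attempt to place the closest groups side by side.

```python
from statistics import median

def median_center(matrix, passes=10):
    """Alternately subtract row medians and column medians.

    Assumption: a simple alternating sweep; the source does not say
    exactly how rows and columns were adjusted to median zero.
    """
    m = [row[:] for row in matrix]  # work on a copy
    for _ in range(passes):
        for row in m:
            r = median(row)
            for j in range(len(row)):
                row[j] -= r
        for j in range(len(m[0])):
            c = median(row[j] for row in m)
            for row in m:
                row[j] -= c
    return m

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def cluster_order(vectors):
    """Return a leaf order from average-linkage agglomerative clustering.

    Each merge concatenates the two closest clusters, so members of
    every cluster end up next to each other in the returned order.
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(a, b):
        total = sum(euclidean(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        _, p, q = min(
            (avg_dist(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0]

# Toy data (invented for illustration): 5 "genes" (rows) by 4 "samples" (columns).
data = [
    [2.0, 2.1, -1.0, -1.2],
    [-1.5, -1.4, 1.0, 1.1],
    [1.9, 2.2, -0.9, -1.1],
    [-1.6, -1.3, 0.9, 1.2],
    [0.1, 0.0, 0.2, -0.1],
]

centered = median_center(data)

# Selecting a subset of rows after centering generally shifts the column
# medians, which is why they need not be exactly zero in the final data.
subset = centered[:3]
subset_col_medians = [median(col) for col in zip(*subset)]

# Separate clusterings for rows (genes) and columns (samples).
row_order = cluster_order(centered)
col_order = cluster_order([list(col) for col in zip(*centered)])

# Matrix rearranged as it would be drawn in the heat map.
heatmap_matrix = [[centered[i][j] for j in col_order] for i in row_order]
```

In practice one would more likely rely on an established implementation such as SciPy's `scipy.cluster.hierarchy` module, and a correlation-based similarity is a common alternative to Euclidean distance for expression data.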
The gene groups in Figure 4.2 were selected because of unusually high or low expression levels among some of the tumors (columns), and they are used to characterize the different tumor subtypes. The six tumor subtypes (indicated by color from left to right of the dendrogram in Figure 4.2) are Basal-like (red),

∗ These plots were generated using Michael Eisen’s Cluster and Treeview packages.