ABSTRACT

Chapter 3 identified the problems involved when working with full-size complementary deoxyribonucleic acid (cDNA) microarray slides. These problems primarily revolved around the high signal variability caused by various image artifacts. The chapter detailed the benefits of using a multi-view process on the images to emphasize various aspects of their surfaces. An enhancement of this nature allows for the partial removal of an image’s artifacts and thus refocuses the image on the more likely regions of interest (the gene spots).

The Pyramidic Contextual Clustering (PCC) algorithm presented in this
chapter extends this enhancement idea to the next level. The PCC process is designed to emphasize the probable gene spot regions via a different approach to that of the Image Transformation Engine (ITE) technique of Chapter 3. Although the gene spot positional information rendered by PCC is partial in some areas of the newly generated image (as with the ITE), these partial gene spots are still of benefit to the underlying task of gene spot identification. When used in combination, the overlapping ITE and PCC processes should yield better image knowledge than either process would individually.

Current research work in the field of large datasets with clustering approaches relates to the processing of large database structures, examples of which can be seen in the development of algorithms such as CLARANS [151], BIRCH [230] and CLIQUE [3]. Importantly, note that although these individual papers refer to large datasets as containing many gigabytes and terabytes of information, in practice they process only a fraction of this. CLARA [118] is an adaptation of the original partitioning around medoids (PAM) clustering algorithm designed specifically to process larger datasets. CLARANS is a further modification of the CLARA algorithm and is based upon a randomized search process that uses parameters to limit the number of neighboring points and local minima that are explored. The authors of the BIRCH algorithm highlighted that, by limiting its search process in this way, CLARANS may not discover a real local minimum. Therefore, the BIRCH authors presented an algorithm designed to minimize the I/O costs while working with limited memory and thus attempted to show that BIRCH
is consistently superior to the above methods. The CLIQUE algorithm was designed to answer several special requirements in data mining that its authors felt were not adequately addressed.

Clustering [109, 110, 133] as applied to microarray data is typically focused on
generating a clustering of the resultant gene spot intensity values themselves rather than as a means of finding the gene spots per se. A good example of this can be seen with Balasubramaniyan [18], who searched for co-regulated genes. The most common technique employed for such analysis work is hierarchical clustering [147, 213]. Eisen [70] described a hierarchical clustering algorithm that used a greedy heuristic based on average linkage methods. Leach [128] presented a comparative study of clustering techniques and metrics for gene expression levels. Lashkari [126] described an experimental setup for human T cell analysis, while Friedman [90] proposed a Bayesian network approach that described the interactions between the genes in a microarray. Jansen [113] investigated techniques relating gene expression data to protein-protein interactions.

However, with a typical microarray image consisting of over twenty million pixels, current clustering methods simply cannot be directly applied to the image data as a whole. Early research work in this field can be seen in papers [32, 34-36, 149, 227], where the authors looked at the use of clustering as applied to microarray imagery specifically. To overcome the computational expense associated with clustering the full image, Bozinov and Rahnenführer [32] proposed an abstraction of the k-means [139] technique whereby pre-defined centroids were chosen for both foreground and background domains, to which all pixel intensities could be assigned. That is to say, one centroid was chosen to represent a gene spot signal and another to represent the noise or artifact. Unfortunately, whereas traditional k-means is able to choose centroids according to the dataset’s characteristics, this fixed-centroid approach is inherently biased towards outlying values (saturated pixels, for example) rather than the true region of interest (the foreground pixels). In Nagarajan [149], the issue of clustering the full slide was not addressed specifically, as the authors were interested in the effects of clustering the individual gene spots. Yang [52, 227] presented a general review of this area, detailed other manual methods and proposed a system called SPOT that offered some improvements over previous methods.

Other methods present ideas along similar themes. For example, the application of wavelets [209, 210] and Markov random fields [115-117] shows great promise; however, at this time they have only been used on what would be classified as “good slides,” and even then not on realistically dimensioned data. If these techniques fail to determine the location of just one spot, the system fails and must fall back on user intervention in order to recover. Indeed, for the processing of large-scale datasets, Berkhin [24] classified the current solutions into three groups: incremental mining [212], data squashing [64, 65], and reliable sampling [143] methods. The main drawback with these implementations is that, by reducing the number of elements in the
datasets, important data may be lost, so there is a need to develop techniques that scale directly to these large-scale image problems.

With the completion of the Data Services stage of the Copasetic Microarray
Analysis (CMA) framework, the investigation shifts its focus to finding possible gene spots within the image. The Pyramidic Contextual Clustering (PCC) algorithm examines the full surface area of the microarray image at the pixel level. These surface pixels are compared with one another (according to a proximity criterion) and assigned to one of two groups, which represent either signal (foreground) or noise (background). The proximity criterion is scaled up through successive iterations to give the algorithm as much “pixel spread” information as possible. Once the proximity criterion is greater than the dimensions of the raw image, a consensus result is generated. This consensus image and the PCC “time slices” are used throughout the CMA framework as the building blocks for acquiring image knowledge.

This chapter aims to explain how the PCC technique of Figure 4.1 is able
to process a full microarray image (something not possible with traditional clustering techniques), and to describe the mechanisms involved in separating the pixels into their foreground (the gene spot) and background groups.
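
The PCC process outlined above can be illustrated with a small sketch. The Python fragment below is an illustration under stated assumptions and should not be read as the implementation used in this chapter. It assumes that the proximity criterion is a square window whose side doubles at every iteration, that each window is split into two groups by a one-dimensional two-means on pixel intensity, and that the consensus is a simple majority vote over the stored layers. The names pcc, two_means_threshold and start_size are hypothetical.

```python
def two_means_threshold(values, max_iters=20):
    """Return an intensity threshold splitting values into two groups
    (1-D two-means), or None if every value is identical."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return None  # flat window: nothing to separate
    for _ in range(max_iters):
        mid = (lo + hi) / 2.0
        low = [v for v in values if v <= mid]
        high = [v for v in values if v > mid]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids have stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def pcc(image, start_size=2):
    """Sketch of a pyramidic clustering pass (assumed details, see text).

    image is a list of rows of pixel intensities. Returns the consensus
    mask and the per-layer masks ("time slices"); 1 = foreground.
    """
    rows, cols = len(image), len(image[0])
    slices = []
    size = start_size
    while True:
        layer = [[0] * cols for _ in range(rows)]
        # Cluster each window of the current size independently.
        for r0 in range(0, rows, size):
            for c0 in range(0, cols, size):
                cells = [(r, c)
                         for r in range(r0, min(r0 + size, rows))
                         for c in range(c0, min(c0 + size, cols))]
                t = two_means_threshold([image[r][c] for r, c in cells])
                if t is None:
                    continue  # flat window stays background
                for r, c in cells:
                    layer[r][c] = 1 if image[r][c] > t else 0
        slices.append(layer)
        if size >= max(rows, cols):
            break  # the window now covers the whole image
        size *= 2  # assumed scaling rule for the proximity criterion
    # Assumed consensus rule: strict majority vote across all layers.
    consensus = [[1 if 2 * sum(s[r][c] for s in slices) > len(slices) else 0
                  for c in range(cols)]
                 for r in range(rows)]
    return consensus, slices


# Toy image: a bright 2 x 2 "gene spot" on a dark background.
image = [
    [10, 10, 10, 10],
    [10, 200, 210, 10],
    [10, 190, 205, 10],
    [10, 10, 10, 10],
]
consensus, slices = pcc(image)
for row in consensus:
    print(row)
```

On this toy image, two layers are produced (windows of side 2 and 4), and the consensus marks only the central 2 x 2 block as foreground. Because each window is clustered independently, the working set at any one time is limited to a single window rather than the whole image.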