ABSTRACT

Breast cancer is one of the leading causes of death among women in many parts of the world, so diagnosing it at an early stage may save lives. Gene expression profiling of breast cancer tissues is a genomic approach that focuses mainly on characterizing differences among cancer tissues. Gene expression data generated from a microarray experiment are represented as a real-valued expression matrix. Genes exhibiting similar expression patterns are often functionally related.

Mining such biological data is an important area of study in bioinformatics, and the accuracy of a data-mining task can be improved by hybridizing it with optimization algorithms. Clustering and classification are popular methods for extracting useful patterns from gene expression data. In this work, K-means clustering hybridized with differential evolution and ant colony optimization is used to cluster breast cancer gene expression data, and a K-nearest neighbour classifier hybridized with particle swarm optimization (PSO) is used to classify the same data. The MapReduce programming model is applied to parallelize the computationally intensive tasks in K-means clustering and the K-nearest neighbour classifier. Evaluated on different numbers of processors, the proposed approach exhibits good scalability and accuracy.
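
To illustrate how the distance computations in K-means can be expressed in the MapReduce model, the following minimal sketch phrases one K-means iteration as a map step (assign each expression profile to its nearest centroid) and a reduce step (average the profiles assigned to each centroid). It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names (map_assign, reduce_mean, kmeans_iteration) are hypothetical, the grouping loop stands in for the shuffle phase of a MapReduce framework, and the hybridization with differential evolution and ant colony optimization is omitted.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(profile, centroids):
    """Map step (hypothetical name): emit (nearest centroid index, profile)."""
    nearest = min(
        range(len(centroids)),
        key=lambda k: squared_distance(profile, centroids[k]),
    )
    return nearest, profile


def reduce_mean(profiles):
    """Reduce step (hypothetical name): average all profiles sharing one key."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def kmeans_iteration(profiles, centroids):
    """One K-means iteration written as map, group-by-key, reduce.

    The grouping loop below is a serial stand-in for the shuffle phase;
    in a real MapReduce job the map calls would run in parallel.
    """
    groups = defaultdict(list)
    for profile in profiles:
        key, value = map_assign(profile, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned profiles is left unchanged.
    return [
        reduce_mean(groups[k]) if k in groups else list(centroids[k])
        for k in range(len(centroids))
    ]


if __name__ == "__main__":
    # Toy data: four two-dimensional "expression profiles", two centroids.
    profiles = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
    centroids = [[1.0, 1.0], [9.0, 9.0]]
    print(kmeans_iteration(profiles, centroids))
```

Because each map call depends only on one profile and the current centroids, the assignment step can be distributed across processors without coordination; only the per-centroid averages need to be combined.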