ABSTRACT

One of the major applications of DNA methylation microarrays is to identify epigenetic markers for disease diagnosis. Other applications include classifying diseased samples into distinct subtypes. Classification algorithms are widely used for pattern recognition, one of the main subjects in machine learning. A familiar example is a spam filter that classifies incoming e-mails as spam or nonspam. High-density microarrays measure methylation at CpG island (CGI) locations across the entire genome. Not all of these locations are informative for classification, just as not all the words in an e-mail are useful for telling spam from nonspam. The first step in building a classifier is therefore to select the informative loci. This step is as critical as selecting discriminatory keywords is to filtering e-mails successfully. Building a classifier relies on a training dataset in which the disease status of every sample is known. The parameters of the classifier are then tuned to minimize the classification error. Simple classifiers are usually found to perform well in comparison with sophisticated ones. We introduce two simple classifiers and illustrate how the performance of a classifier can be assessed with receiver operating characteristic (ROC) curves.