ABSTRACT

CONTENTS
2.1 Introduction
2.2 Molecular Biology
2.3 Machine Learning
2.4 Machine Learning in Practice
    2.4.1 Problem Representation
    2.4.2 Training and Testing Data Selection
    2.4.3 Model Validation
    2.4.4 Feature Selection and Aggregation
    2.4.5 Supervised versus Unsupervised Learning
    2.4.6 Model Complexity
2.5 Examples of Modern Machine and Statistical Learning Techniques
    2.5.1 Linear Discriminant Analysis and Support Vector Machines
    2.5.2 Linear Regression and Support Vector Regression
    2.5.3 Neural Networks for Classification and Regression
    2.5.4 Hidden Markov Models
2.6 Applications of Machine Learning to Structural Bioinformatics
    2.6.1 Secondary Structure Prediction
    2.6.2 Solvent Accessibility Prediction
    2.6.3 Structural Predictions for Membrane Proteins
    2.6.4 Computational Protocols for the Recognition of Protein-Protein Interaction Sites
    2.6.5 Phosphorylation as a Crucial Signal Transduction Mechanism



    and Structural Consequences of Mutations
2.7 Computational Gene Identification
2.8 Biomarkers, Drug Design, and QSAR Studies
2.9 Conclusions
References

This chapter presents an overview of applications of machine and statistical learning techniques to problems arising in molecular biology and medicine. As such, the methods and applications discussed here fall into the general area of bioinformatics, which is concerned with the computational analysis and interpretation of data regarding biological systems and processes. This is an active and still relatively young field of research with great potential to advance both basic and applied biomedical research.

The Human Genome Project (Venter et al., 2001), which was completed recently, and its extensions, such as the HapMap project (International HapMap Consortium, 2003) dealing with genetic variability in human populations, have triggered an enormous growth of data and of research aimed at elucidating fundamental questions in medicine, biochemistry, genetics, and molecular biology. In particular, the availability of DNA sequence information has enabled the large-scale analysis of correlations between genetic variations and, for example, differences in susceptibility to diseases or other medically relevant outcomes. Machine learning-based approaches are capable of capturing complex correlations between relevant descriptors (or “features”), such as genetic mutations, and observed outcomes, such as cancer survival time. Capturing and characterizing such correlations can lead to successful prediction of various aspects of molecular systems. For a general overview of applications of machine learning in bioinformatics, see, for example, Baldi et al. (2000) and Mjolsness and DeCoste (2001).

We start this chapter with a very brief overview of central problems, data sources, and measurement techniques used in molecular biology and genomics. This is followed by a discussion of machine learning approaches and of some aspects of general importance for their application to problems arising in molecular biology, such as data representation, model selection and validation, alternative learning algorithms, and their interplay with hypothesis generation and further experimental studies. We are necessarily brief and selective; rather than providing a comprehensive overview of the field, we discuss what we believe to be crucial elements of successful applications of machine learning and data mining techniques in bioinformatics.