Chapter 3

Probabilistic and Model-Based Learning

We begin this chapter by addressing the question: “Why are probabilistic and model-based learning relevant in the context of biological systems?” This is a pertinent question because, after all, probability theory deals with uncertainty, and probabilistic models are a way of quantifying uncertainty. When one thinks about the kinds of objects that constitute biological data (e.g., nucleotide sequences in DNA, amino acid sequences in peptides, the molecular components of carbohydrates and lipids, the metabolites participating in a certain metabolic pathway, etc.), there is a great deal of predictability about them, in the sense that one will always find the same nucleotide sequence at a specific position on a specific chromosome of a specific organism (except for rarely occurring events called mutations) or the same amino acid sequence in a specific protein molecule. So, where does the uncertainty come from? It primarily results from the inadequacy of our present state of knowledge compared to what remains to be discovered. Many aspects of a biological system are only partially known, and currently accepted ‘facts’ often turn out to be wrong in the light of newly discovered knowledge. There is also a constant need to extrapolate what we know about a smaller or simpler organism to a larger and more complex one that is still unexplored. In other words, researchers in bioinformatics are constantly faced with the need to use inductive reasoning and to draw inferences.

There are three different concepts of knowledge in this world. The philosopher’s view is that all knowledge is correct and that the distinction between right and wrong depends only on the observer’s viewpoint. The scientist’s view is that all knowledge is wrong unless it can be experimentally verified by independent observers. To put it another way, a scientist such as a physicist or a chemist uses deduction to add to existing knowledge (i.e., if A implies B and B can be experimentally shown to be false, then A must be false). The probabilist’s or statistician’s view of knowledge, on the other hand, is based on the principle of induction: if A implies B and B is observed to happen, then A becomes more likely to be true. Probability and statistics enable us to quantify, for example, how much more likely A becomes if B is observed k times (k = 1, 2, 3, . . .).
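As a minimal, hypothetical illustration of this kind of quantification (the probabilities below are assumptions chosen for the example, not taken from any real data), suppose that A implies B, so that P(B | A) = 1, while B can also occur without A with some probability q = P(B | not A). Bayes’ rule then gives the posterior probability of A after B has been observed k independent times:

```python
# Hypothetical sketch: how much more likely does A become after observing B k times?
# Assumptions (illustrative only): A implies B, so P(B | A) = 1;
# B can still occur without A, with probability q = P(B | not A).

def posterior_prob_A(prior_A, q, k):
    """Posterior P(A | B observed k independent times), by Bayes' rule."""
    likelihood_A = 1.0 ** k      # P(k observations of B | A) = 1, since A guarantees B
    likelihood_not_A = q ** k    # P(k observations of B | not A) = q^k
    numerator = likelihood_A * prior_A
    denominator = numerator + likelihood_not_A * (1.0 - prior_A)
    return numerator / denominator

# Example: a sceptical prior P(A) = 0.1 and q = 0.5
for k in range(6):
    print(k, round(posterior_prob_A(0.1, 0.5, k), 4))
# The posterior rises towards 1 as k grows: each observation of B makes A more likely.
```

Under these assumptions the posterior odds of A equal the prior odds divided by q^k, so every additional observation of B multiplies the odds in favour of A by a factor of 1/q.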

Often, a classical statistician’s approach is to start with a set of competing hypotheses about an unknown quantity or object, choose a probability model describing the relation between the unknown and the observable (i.e., the data), and finally reach a decision regarding the hypotheses based on the evidence from the data, along with an assessment of the uncertainty involved in that decision. This is called hypothesis testing. At other times, he/she would try to find the most likely value(s) of the unknown by maximizing the joint probability model for the data vector (called the likelihood function) with respect to the unknown. This is known as maximum likelihood estimation. These two are the central themes of frequentist inference; there are many variations on these themes. There is, however, another parallel approach called the Bayesian approach. A Bayesian would start with some a priori assumptions about the unknown, usually in the form of a probability distribution (called a prior distribution) that can be based entirely on his/her personal belief, on already existing information, or on some kind of ‘expert opinion.’ The Bayesian would then combine this prior distribution with a probability model relating the unknown to the observed data (i.e., the likelihood function) to obtain a joint distribution and, from it, using the Bayes principle, ultimately derive the conditional distribution of the unknown given the observed data (the posterior distribution). To a Bayesian, therefore, acquiring new knowledge basically means updating the prior information about the unknown in the light of the observed data.
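To make the two modes of inference concrete, here is a minimal sketch (the data and prior below are illustrative assumptions, not taken from the text) contrasting them on a simple problem: estimating the unknown G+C proportion of a DNA sequence from observed base counts. The maximum likelihood estimate maximizes the binomial likelihood, while the Bayesian route combines a Beta prior with the same likelihood to obtain a Beta posterior.

```python
# Illustrative sketch (assumed example): estimating an unknown G+C proportion p
# from n observed bases, of which x are G or C.

# Hypothetical data
n, x = 100, 62           # 62 of 100 bases are G or C

# Frequentist route: maximum likelihood estimation.
# The binomial likelihood L(p) = C(n, x) * p^x * (1 - p)^(n - x) is maximized at x / n.
p_mle = x / n

# Bayesian route: a Beta(a, b) prior combined with the binomial likelihood
# gives a Beta(a + x, b + n - x) posterior (Bayes' principle, conjugate prior).
a, b = 2.0, 2.0          # a mildly informative prior centred at 0.5
post_a, post_b = a + x, b + (n - x)
p_posterior_mean = post_a / (post_a + post_b)

print("MLE of p:            ", p_mle)                         # 0.62
print("Posterior mean of p: ", round(p_posterior_mean, 4))    # pulled slightly towards the prior
```

With a flat prior (a = b = 1) the posterior mean becomes (x + 1)/(n + 2), and as n grows the two estimates converge: the data eventually overwhelm the prior.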