ABSTRACT

Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, USA

CONTENTS

12.1 Introduction
12.2 Independent sites models and summary statistics
    12.2.1 Likelihood inference
    12.2.2 The EM algorithm
    12.2.3 Bayesian inference
    12.2.4 Conditional means on a phylogeny
    12.2.5 Endpoint-conditioned summary statistics from uniformization
    Example
12.3 Dependent-site models and Markov chain Monte Carlo
    12.3.1 Gibbs sampling with context dependence
    12.3.2 Path sampling
        12.3.2.1 Rejection sampling
        12.3.2.2 Direct sampling
        12.3.2.3 Uniformization
        12.3.2.4 Bisectioning
    12.3.3 Metropolis-Hastings algorithm with dependence
12.4 Future directions for sequence paths with dependence models
Acknowledgments


While some probabilistic models of DNA or protein sequence change are not based on an instantaneous rate matrix (e.g., Barry and Hartigan, 1987), most are. An instantaneous rate matrix offers the opportunity to go beyond the sequences that begin and end a branch on a phylogenetic tree: inferences can be made about the sequence changes that happened between these endpoints. At the most detailed level, inferences would concern which changes transformed the beginning sequence into the ending one and exactly when these changes occurred. At a less detailed level, various summary statistics about the evolutionary trajectory from the beginning to the end of the branch might be of interest. A variety of techniques are available for making inferences about evolutionary trajectories conditional upon the endpoints of a branch, and one objective of this chapter is to introduce them. By analogy with the “Brownian bridge” that results when the endpoints of a Brownian motion process are conditioned upon, an endpoint-conditioned Markov process is known as a Markov bridge (Al-Hussaini and Elliot, 1989). This chapter is not intended to be comprehensive regarding inference techniques for Markov bridges. Instead, the focus is on endpoint-conditioning with Markov models for molecular sequence evolution.
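To make the idea of an endpoint-conditioned trajectory concrete, the following is a minimal sketch of one way to draw a path from a Markov bridge: simulate the unconditioned process forward from the beginning state and keep only a path that happens to finish in the required ending state. The four-state rate matrix `Q` (equal rates among nucleotides), the function names, and the `max_tries` cutoff are assumptions made for illustration and are not taken from the chapter.

```python
import random

# Hypothetical 4-state nucleotide model with equal rates (illustrative only).
STATES = "ACGT"
RATE = 1.0 / 3.0  # assumed rate of change to each of the three other states
Q = [[-1.0 if i == j else RATE for j in range(4)] for i in range(4)]


def simulate_path(q, start, duration, rng):
    """Forward-simulate a continuous-time Markov chain on [0, duration].

    Returns a list of (time, state) pairs, beginning with (0.0, start).
    """
    path = [(0.0, start)]
    t, state = 0.0, start
    while True:
        exit_rate = -q[state][state]
        if exit_rate <= 0.0:  # absorbing state: no further changes
            return path
        t += rng.expovariate(exit_rate)  # waiting time until the next change
        if t >= duration:
            return path
        # Choose the next state in proportion to the off-diagonal rates.
        others = [j for j in range(len(q)) if j != state]
        weights = [q[state][j] for j in others]
        state = rng.choices(others, weights=weights)[0]
        path.append((t, state))


def sample_bridge(q, start, end, duration, rng, max_tries=100_000):
    """Rejection sampling: return the first simulated path that ends in `end`."""
    for _ in range(max_tries):
        path = simulate_path(q, start, duration, rng)
        if path[-1][1] == end:
            return path
    raise RuntimeError("no accepted path; the endpoint may be too unlikely")


if __name__ == "__main__":
    rng = random.Random(0)
    bridge = sample_bridge(Q, STATES.index("A"), STATES.index("G"), 0.5, rng)
    for time, state in bridge:
        print(f"{time:.4f}  {STATES[state]}")
```

Because accepted paths are exactly the forward simulations that satisfy the endpoint condition, they are draws from the endpoint-conditioned process. The cost is that many simulations may be discarded when the ending state is improbable given the beginning state and the branch length, which is why more efficient samplers are of interest.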