ABSTRACT

Department of Methods and Models for Economy, Territory and Finance, Sapienza Università di Roma, Roma, Italy

Luca Tardella

Department of Statistical Sciences, Sapienza Università di Roma, Roma, Italy

CONTENTS

3.1 Introduction
3.2 Substitution models: a brief overview
    3.2.1 Bayesian inference for substitution models
3.3 Bayesian model choice
3.4 Computational tools for Bayesian model evidence
    3.4.1 Harmonic mean estimators
    3.4.2 IDR: inflated density ratio estimator
        3.4.2.1 IDR: numerical examples
        3.4.2.2 IDR for substitution models
    3.4.3 Thermodynamic integration estimators
        3.4.3.1 Generalized stepping-stone estimator (GSS)
        3.4.3.2 Comparative performance: the linear model
3.5 Marginal likelihood for phylogenetic data
    3.5.1 Hadamard data: marginal likelihood computation
    3.5.2 Green plant rbcL example
3.6 Discussion

It is widely accepted that species diversified in a tree-like pattern from a common ancestor and that this diversification is mainly due to changes in the species' genetic material that accumulated over time. The main aim of phylogenetics is to investigate the evolutionary relationships among species by studying similarities and differences in aligned genomic sequences. From a statistical point of view, the problem of analyzing phylogenetic sequences is often formalized as follows: given a set of DNA sequences from different species, we aim to infer the tree that best represents their evolutionary relationships. Alternative tree estimation methods, such as parsimony methods (Felsenstein, 2004, chapter 7) and distance methods (Fitch and Margoliash, 1967; Cavalli-Sforza and Edwards, 1967), have been proposed. We consider stochastic models for substitution rates in a fully Bayesian framework. We focus on model choice, reviewing several procedures for estimating the Bayesian model evidence and proposing alternative ones.