ABSTRACT

The availability of whole-genome sequence data from multiple organisms has provided a rich resource for investigating several biological, medical and pharmaceutical problems and applications. Yet, along with the insights and promises these data are providing, genome-wide data have given rise to more complex problems and have challenged traditional biological paradigms. One of these paradigms is the inference of a species phylogeny. Traditionally, a biologist would proceed by obtaining the molecular sequence of a single locus, or gene, in a set of species, inferring the phylogeny (or evolutionary history) of this locus, and taking it to be an accurate representation of the species' pattern of divergence. Although this approach may work for several groups of organisms, particularly when extra caution is taken in selecting the locus, the availability of sequence data for multiple loci from a variety of organisms and populations has highlighted the deficiencies and inaccuracies of this traditional approach. Different loci in a group of organisms may have different gene tree topologies. In this case, there is no single gene tree topology to declare as the species tree. Further, in the presence of reticulate evolutionary events, such as horizontal gene transfer, the species phylogeny may not be a tree; instead, a network of relationships is the more appropriate model.

In a seminal paper, Maddison (1997) discussed the issue of species/gene tree incongruence, the implications it has for the inference of a species tree, and the processes that can cause such incongruence and that must be modeled explicitly for accurate inference. The three main processes discussed were lineage sorting, gene duplication and loss, and reticulate evolution.

Lineage sorting occurs because of the random contribution of genetic material from each individual in a population to the next generation. Some individuals fail to have offspring, while others happen to have multiple offspring.
In population genetics, this process was first modeled by R. A. Fisher and S. Wright; in their model, each gene of the population at a particular generation is chosen independently from the gene pool of the previous generation, regardless of whether the genes are in the same individual or in different individuals. Under the Wright-Fisher model, the coalescent considers the process backward in time (Kingman, 1982; Hudson, 1983a; Tajima, 1983). That is, the ancestral lineages of genes of interest are traced from offspring to parents. A coalescent event occurs when two (or sometimes more) genes “merge” at the same parent, called the most recent common ancestor (MRCA) of the two genes. In certain cases, two genes coalesce at a branch in the species tree that is deeper than the branch of the MRCA of the species from which they were sampled. When this happens, the coalescence patterns may result in gene trees that do not reflect the divergence patterns of the species. Evidence of extensive lineage sorting has been reported in several groups of organisms (see, e.g., Rokas et al., 2003; Syring et al., 2005; Pollard et al., 2006; Than et al., 2008c; Kuo et al., 2008).

Gene duplication is considered a major mechanism of evolution, particularly in generating new genes and biological functions (Ohno, 1970; Graur and Li, 2000). Duplication events result in multiple gene copies, which, when transmitted to descendant organisms, produce complex gene genealogies. As some of these genes may go extinct (Olson, 1999), inferring the gene tree from only those copies present in the organisms can result in a topology that disagrees with that of the species tree. A similar effect can arise as an artifact of sampling some, but not all, of the gene copies in an organism’s genome.

The third process discussed by Maddison (1997) is reticulate evolution. For example, evidence shows that bacteria may obtain a large proportion of their genetic diversity through the acquisition of sequences from distantly related organisms, via horizontal gene transfer (HGT; Ochman et al., 2000; Doolittle, 1999a,b; Kurland et al., 2003; Hao and Golding, 2004; Nakamura et al., 2004).
There is also recent evidence of widespread HGT in plants (Bergthorsson et al., 2003, 2004; Mower et al., 2004). Interspecific recombination is believed to be ubiquitous among viruses (Posada et al., 2002; Posada and Crandall, 2002), and hybrid speciation is a major evolutionary mechanism in plants and in groups of fish and frogs (Ellstrand et al., 1996; Rieseberg and Carney, 1998; Rieseberg et al., 2000; Linder and Rieseberg, 2004; Mallet, 2005; Noor and Feder, 2006; Mallet, 2007).

In terms of the reconciliation outcome, there is a major difference between lineage sorting and gene duplication/loss on the one hand and reticulate evolutionary events on the other. Gene trees may disagree with each other, as well as with the species phylogeny, due to lineage sorting or gene duplication/loss events. In this case, their reconciliation yields a tree topology, with the deep coalescences, duplications, and losses taking place within the species tree branches. However, when horizontal gene transfer or hybrid speciation occurs, the evolutionary history of the genomes can no longer be adequately modeled by a tree; instead, a phylogenetic network is a more appropriate model.

Incorporating these processes into computational methods for phylogenetic inference will have significant implications for reconstructing accurate evolutionary histories of genomes and for better understanding their diversification. Biologists have long acknowledged the presence of these processes, their significance, and their effects. The computational research community has responded in recent years, proposing a plethora of methods for reconstructing complex evolutionary histories by reconciling incongruent gene trees.
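
To make the Wright-Fisher sampling and the backward-in-time view of the coalescent described above more concrete, the following is a minimal illustrative sketch in Python. It is not taken from any of the cited methods. It assumes a single haploid population of constant size, and all function and parameter names (time_to_coalescence, mean_coalescence_time, prob_no_coalescence, pop_size) are hypothetical. Two gene lineages are traced from offspring to parents until they pick the same parent; the last function gives the probability that two lineages fail to coalesce within a given number of generations, which is the situation in which a deep coalescence can arise on a species tree branch of that length.

```python
import random


def time_to_coalescence(pop_size, rng):
    """Trace two gene lineages backward in time under Wright-Fisher sampling.

    In each generation, each lineage picks its parent uniformly at random
    from the pop_size gene copies of the previous generation. Returns the
    number of generations until both lineages pick the same parent.
    """
    generations = 0
    while True:
        generations += 1
        parent_a = rng.randrange(pop_size)
        parent_b = rng.randrange(pop_size)
        if parent_a == parent_b:  # the two lineages merge (coalesce)
            return generations


def mean_coalescence_time(pop_size, replicates, seed=0):
    """Average coalescence time of two lineages over independent replicates."""
    rng = random.Random(seed)
    total = sum(time_to_coalescence(pop_size, rng) for _ in range(replicates))
    return total / replicates


def prob_no_coalescence(pop_size, generations):
    """Probability that two lineages have not coalesced after the given
    number of generations (each generation they merge with chance 1/pop_size)."""
    return (1.0 - 1.0 / pop_size) ** generations


if __name__ == "__main__":
    # Assumed toy values, chosen only for illustration.
    print(mean_coalescence_time(pop_size=50, replicates=20000))
    print(prob_no_coalescence(pop_size=50, generations=25))
```

Under these assumptions the waiting time for two lineages is geometric with mean equal to the number of gene copies, so the shorter a species tree branch is relative to the population size, the more likely the two lineages are to pass through it without coalescing.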