Taxon Sampling versus Computational Complexity and Their Impact on Obtaining the Tree of Life
The scope of phylogenetic analysis has increased greatly in the last decade, with analyses of hundreds, if not thousands, of taxa becoming increasingly common in our efforts to reconstruct the tree of life and to study large, species-rich taxa. Through simulation, we investigated the potential to reconstruct ever larger portions of the tree of life using a variety of methods
(maximum parsimony, neighbour joining, maximum likelihood, and maximum likelihood with a divide-and-conquer search algorithm). For problem sizes of 4, 8, 16 … 1,024, 2,048, and 4,096 taxa sampled from a model tree of 4,096 taxa, we examined the ability of the different methods to reconstruct the model tree and the running times of the analyses. Accuracy was generally good, with all methods returning a tree sharing on average more than 85% of its clades with the model tree, regardless of problem size. Unsurprisingly, analysis times increased greatly with tree size. Only neighbour joining, by far the fastest of the methods examined, was able to solve the largest problems in under 12 hours; however, the trees it produced were the least accurate at all tree sizes. Instead, the strategy used to sample the taxa had a larger impact on both accuracy and, somewhat unexpectedly, analysis times. Except at the largest problem sizes, analyses using taxa that formed a clade were generally both more accurate and faster than those using taxa selected at random. As such, these results support recent suggestions that taxon number in and of itself might not be the primary factor constraining phylogenetic accuracy, and they also provide important clues for the further development of divide-and-conquer strategies for solving very large phylogenetic problems.
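The clade-sharing accuracy reported above can be sketched as a comparison of clade sets between an inferred tree and the model tree. The snippet below is a minimal illustration, not the study's actual scoring procedure: it assumes trees are already reduced to sets of non-trivial clades (each clade a set of taxon labels), whereas a real analysis would compute these from tree files with a phylogenetics library.

```python
# Hypothetical sketch: scoring topological accuracy as the fraction of an
# inferred tree's non-trivial clades that also appear in the model (true) tree.
# Each clade is represented as a frozenset of taxon labels.

def clade_accuracy(inferred, model):
    """Fraction of the inferred tree's clades that are present in the model tree."""
    if not inferred:
        return 0.0
    return len(inferred & model) / len(inferred)

# Model tree ((A,B),(C,D)) has two non-trivial clades: {A,B} and {C,D}.
model = {frozenset("AB"), frozenset("CD")}

# An incorrect inferred tree ((A,C),(B,D)) shares only... nothing; a partly
# correct one ((A,B),(A,C)-style conflict shown here) recovers half its clades.
inferred = {frozenset("AB"), frozenset("AC")}

print(clade_accuracy(inferred, model))  # → 0.5
```

A symmetric variant of this count (clades unique to each tree) underlies the widely used Robinson-Foulds distance between tree topologies.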