The root of a phylogenetic tree is fundamental to its biological interpretation, but standard substitution models do not provide any information on its position. Here, we describe two recently developed models that relax the usual assumptions of stationarity and reversibility, thereby facilitating root inference without the need for an outgroup. We compare the performance of these models on a classic test case for phylogenetic methods, before considering two highly topical questions in evolutionary biology: the deep structure of the tree of life and the root of the archaeal radiation. We show that all three alignments contain meaningful rooting information that can be harnessed by these new models, thus complementing and extending previous work based on outgroup rooting. In particular, our analyses exclude the root of the tree of life from the eukaryotes or Archaea, placing it on the bacterial stem or within the Bacteria. They also exclude the root of the archaeal radiation from several major clades, consistent with analyses using other rooting methods. Overall, our results demonstrate the utility of non-reversible and non-stationary models for rooting phylogenetic trees, and identify areas where further progress can be made.
The root of a phylogenetic tree is fundamental to its biological interpretation. For example, the eocyte hypothesis of Lake , in which the eukaryotic host cell emerges from within the archaeal radiation, depends critically on a root for the tree of life outside the eukaryotes or the relevant archaeal groups. And yet, phylogenies of the broadly conserved genes usually used to test hypotheses of eukaryotic origins are typically unrooted, because they are inferred under stationary, reversible substitution models in which the likelihood of the tree does not depend on the position of the root .
To interpret unrooted trees, biologists typically make use of external information . A common strategy is to include an outgroup, or set of taxa that are known to branch outside the clade of interest (the in-group). The root can then be placed on the branch connecting the outgroup to the rest of the tree. Although widely used, outgroup rooting has the potential to interfere with the inference of in-group relationships, particularly when the outgroup is distantly related to the in-group or differs substantially in nucleotide or amino acid composition [4,5]. In such cases, outgroup rooting can lead to a phylogenetic artefact called ‘long-branch attraction’ (LBA) , in which fast-evolving or compositionally biased sequences are drawn towards the outgroup, appearing as basal (early diverging) members of the in-group in the inferred tree . LBA is commonly invoked to explain the basal placement of the Microsporidia, a group of fast-evolving Fungi, in early eukaryotic trees , and it may also have played a role in the inference of the ‘three domains' tree of life, in which the eukaryotes branch as the sister group to a monophyletic Archaea [9,10].
Another limitation of standard outgroup rooting is that it cannot be used to root the tree of life, for which no obvious outgroup is known. An ingenious solution to this problem is to use pairs of paralogous genes that were already present in the last universal common ancestor (LUCA); each paralogue can then act as an outgroup to root the other. Analyses of this type have supported a root on the branch separating the Bacteria from all other cellular life [11–13], but are nonetheless fraught with difficulty —as highlighted by Gouy and colleagues elsewhere in this issue . In practice, it is difficult to trace gene duplications back to LUCA, so the number of genes that can be analysed is low, and the long basal branches in the resulting trees may be particularly sensitive to phylogenetic error .
There is therefore a clear need for alternative rooting approaches, both to corroborate results from outgroup rooting and for use in cases where outgroup rooting is problematic. Potential alternatives include the use of molecular clocks , gene tree parsimony  and, most recently, probabilistic gene tree/species tree reconciliation . Here, we consider another approach to rooting phylogenetic trees: the use of non-reversible or non-stationary substitution models, in which the likelihood of the tree depends on the position of the root. These models allow the tree to be rooted as an integral part of the analysis, without the need for outgroups. Despite the potential of these approaches for addressing questions in deep phylogeny, there has been surprisingly little work on the subject. Barry & Hartigan  developed a non-reversible and non-stationary substitution model that was implemented by Jayaswal et al.  and has been applied to the inference of rooted trees ; however, the large number of parameters involved has limited the general applicability of the model. Yang & Roberts  proposed a non-stationary model which allowed a change in the composition vector at each speciation event. They fitted their model to small-subunit ribosomal RNA (rRNA) sequences from across the tree of life, and recovered an eocyte tree in which the root was placed between E. coli—the single bacterial representative—and the remaining sequences. The NDCH model of Foster  and the BP model of Blanquart & Lartillot [24,25] are similar except a fixed, but unknown, number of composition vectors apply to different parts of the tree. While these models have the potential to offer a more parsimonious description of the data, their unknown dimension makes model-fitting computationally difficult. Finally, Huelsenbeck et al.  investigated the ability of a non-reversible model to correctly identify the root position on simulated data and found that the approach worked best when the data contained a high degree of non-reversibility.
One reason for the lack of subsequent interest in these models may be the additional model complexity which can result from relaxing the standard assumptions of reversibility and stationarity, and the resultant increase in the computational cost of model fitting (though see ). Here, we describe recent advances in Bayesian modelling of non-reversible  and non-stationary  substitution processes that ameliorate some of these difficulties. We then apply these new models to root trees relating to three classical problems in evolutionary biology: the relationship between the extremophilic Bacteria Thermus and Deinococcus; the relationship of the eukaryotes to the Archaea in the ribosomal RNA tree of life and the root of the archaeal radiation.
2. Two models for learning about the root of a phylogenetic tree
Consider evolution at one site of a nucleotide sequence, along one branch of a phylogenetic tree. Most phylogenetic models assume that substitutions can be modelled using a continuous time Markov process (CTMP). The defining assumption which underpins these models is the Markov property which asserts that, conditional on the current state of the process (i.e. the current nucleotide), the future state depends only on this current state and not on its past. The CTMP is characterized by an instantaneous rate matrix that governs the transition probabilities along the branch. Standard models assume that the CTMP is time-reversible and in its stationary distribution. This pair of assumptions affords some mathematical simplification and allows the rate matrix to be decomposed into the probabilities of the theoretical stationary distribution and a set of exchangeability parameters . The latter determine the general propensity for change between the different pairs of nucleotides. The most general form of this model, with six exchangeability parameters, is the GTR model . Assumptions of equality among these parameters give rise to simpler models, such as the HKY85 model , which has only two: one governing the rate at which transitions occur, and the other the rate of transversions. In standard models, a rate matrix of the same form is assumed to apply to every branch of the tree. We call this assumption across-branch homogeneity. Sites are then assumed to be independent of each other and evolving according to the same model, but with a site-specific rate. The variation between rates is generally modelled using a gamma distribution or one of its discretized approximations .
The assumptions of stationarity, reversibility and across-branch homogeneity are largely made for mathematical convenience. However, these assumptions come at an inferential cost, with stationary and reversible models yielding likelihood functions that do not depend on the position of the root. As such, topological inference is limited to unrooted trees. We consider two recently developed Bayesian hierarchical models that relax some of these standard assumptions, and thereby allow the models to be used to make inference about root positions.
The first of these models, NR , is branch-homogeneous and stationary, but non-reversible. In a non-reversible model, the direction of time is important. The structure of the Bayesian hierarchical model is based on the simple HKY85 model, but allows the parameters of the instantaneous rate matrix to depart from this structure through two perturbations: the first to allow a more general GTR structure, and the second to allow the most general non-reversible form. The size of the perturbations is controlled by standard deviation parameters σR and σN, respectively, whose values are unknown. By fitting the model to data, we learn which values are more or less plausible, with large values of σR providing evidence of reversible departures from the HKY85 model, and large values of σN providing evidence of non-reversibility. Indeed, it is this evidence of non-reversibility that drives inference about the root position.
The second model, HB , is branch-heterogeneous. Evolution on each branch proceeds according to the GTR model, but the composition vector, i.e. the theoretical stationary distribution, is allowed to take a different value on each branch that is not connected to the root. A further distinct composition vector applies at the root and on the two branches on either side. This creates a model of fixed dimension that is easier to fit than the related NDCH  or BP [24,25] models. To avoid overparametrization, the Bayesian hierarchical model is structured to allow information to be shared between branches, with a greater exchange of information between neighbouring than distant edges. The resulting model is non-stationary and can account for discrepancies in sequence composition among taxa. It also yields a likelihood function that is informative about the root position. In this case, the information arises from evidence of non-stationarity in the data.
Both models are fitted in a Bayesian framework, using Markov chain Monte Carlo (MCMC) methods to sample the posterior distribution. Further details of the MCMC algorithms can be found in Heaps et al.  and Cherlin et al. , although note that here we use a standard Metropolis Hastings algorithm for the HB model, rather than the data augmentation scheme described in the original paper. We also use a slightly revised version of the HB model, with a common composition vector at the root and on the two branches on either side, and GTR, rather than HKY85, exchangeabilities. Our experience suggests that this revised model provides more robustness against implausible root splits on pendant branches of the underlying unrooted tree.
3. The relationship between Thermus and Deinococcus
Deinococcus radiodurans and Thermus aquaticus are related extremophilic Bacteria that are adapted to two different sets of extreme conditions: ionizing radiation and desiccation in the case of Deinococcus , and high temperatures for Thermus . Although their common ancestry is attested by a range of independent analyses , the Thermus–Deinococcus relationship is a classic test case for new phylogenetic models because standard approaches often fail to recover the correct tree from analyses of rRNA [23,37]. To compare the behaviour of the NR and HB models, we inferred trees under both models using an alignment  of the 16S rRNA genes of Thermus, Deinococcus and three outgroup taxa: Bacillus, Aquifex and Thermotoga. The majority rule consensus tree under the NR model represents an incorrect tree (figure 1a), with Bacillus as the closest relative of Deinococcus. This tree recapitulates previous results in which the rRNA sequences cluster according to GC content rather than evolutionary history: Thermus, Aquifex and Thermotoga all have GC-rich rRNA genes, perhaps as an adaptation to life at high temperatures , while both Deinococcus and Bacillus are mesophiles whose rRNA genes show a more moderate GC content. The consensus tree inferred under the HB model correctly recovered the Thermus/Deinococcus clade (figure 1b), probably because the HB model allows composition to vary across the branches of the tree. The numbered ranking of branches in order of decreasing GC content in figure 1b demonstrates that placing Thermus and Deinococcus as sister taxa requires a switch to high GC-content in Thermus following the divergence of the two lineages, which is not possible under stationary models such as NR.
Although the consensus trees in figure 1 disagree on the position of the root, it is important to note that the root position in a consensus tree does not necessarily represent the root split which received the most posterior support. This is because the majority rule consensus tree contains precisely the clades that have posterior support of more than 0.5 . It is therefore a conditional summary, computed recursively from the leaves to the root, which depends upon the plausibility of sub-clades. By contrast, the posterior over root splits is a marginal summary which averages over the relationships expressed elsewhere in the tree (see  for further details or  for related comments on summarizing posteriors for unrooted topologies). Therefore, in spite of the differences between the consensus trees in figure 1, it is interesting that the root split which received the most support was the same in both cases, and separated Aquifex from the other species (electronic supplementary material, table S1, PP = 0.21 under NR, PP = 0.70 under HB). This result is consistent with the conclusions of the Genomic Encyclopedia of Bacteria and Archaea project, which performed the most comprehensive phylogenomic survey of Bacteria to date . Thus, despite the differences between the NR and HB models in the way that rooting information is extracted from the data, both models agree with recent biological opinion in this case.
4. Application to the ribosomal RNA tree of life
Comparisons of rRNA sequences have been central to the debates over the deep structure of the tree of life, and in particular the relationship of eukaryotes to the Archaea . Many early analyses favoured a ‘three domains’ tree, in which the Bacteria, Archaea and eukaryotes were each monophyletic domains. By contrast, more recent analyses taking advantage of an improved sampling of archaeal sequence diversity and using better-fitting substitution models have instead favoured the ‘eocyte’ tree of Lake , in which the eukaryotic rRNA sequences—taken to represent the host cell lineage for the mitochondrial endosymbiont—emerge from within the archaeal radiation [42–45]. We analysed a previously published 16-species concatenated rRNA alignment containing 761 sites from the large subunit rRNA gene and 720 sites from the small subunit . Sequences were aligned with MUSCLE , MAFFT , ProbCons  and Kalign , and a consensus alignment inferred using M-Coffee . Poorly aligning sites were identified and removed using BMGE  with the default parameters. Analysis of this alignment under the NR model recovered the classic ‘three domains’ topology, in which the eukaryotes emerge as the sister group to a monophyletic Archaea with strong posterior support (figure 2a, PP = 0.93 for archaeal monophyly). Based on recently published analyses of rRNA and protein-coding genes, this tree is currently thought to be incorrect [10,42,44,52,53], although it has historically received support from simpler stationary models (reviewed in ). This result suggested that, while the NR model can provide useful rooting information, it is subject to many of the same limitations as other stationary models. Inference under the HB model recovered an eocyte tree, with the eukaryotes emerging as the sister group to the ‘TACK’ superphylum of Archaea (figure 2b, PP = 0.89 for the eukaryote/TACK clade), consistent with recent phylogenomic analyses [42–45]. As in the case of the Thermus dataset, mapping posterior inferences of the most GC-rich branches onto the consensus tree provides an intuitive explanation for the differences in results between the NR and HB models. The branches leading to the common ancestor of the Archaea, and to each of the major archaeal clades, are among the most GC-rich in the phylogeny (ranked first (0.756), sixth (0.639) and eighth (0.621) by posterior mean GC content; see figure 2b and electronic supplementary material, figure S2), but the long branch leading to the common ancestor of the eukaryotes has a much more moderate GC content (ranked 20th overall; 0.444). Thus, the eocyte tree requires the placement of a moderate GC branch inside a high GC clade: this is biologically plausible, because we know that sequence composition can change over evolutionary time, but not possible under NR and other stationary models. This result provides some insight into why early analyses with simpler phylogenetic methods often recovered the three domains tree and provides support for the suggestion of Tourasse & Gouy  that the eocyte tree might be intrinsically more difficult to recover than the three domains tree.
For these rRNA genes, the posterior distributions for root splits support the placement of the roots on the consensus trees, showing disagreement between the NR and HB models (see figure 2 and electronic supplementary material, table S2). The NR model favours a root on the branch separating the Bacteria from all other cells, in agreement with traditional paralogue rooting approaches [11–13] and analyses of genome networks . By contrast, the HB model places the root within the Bacteria (figure 2b) with posterior support equal to 1. Although the root is unresolved on the consensus tree, the root split with the greatest posterior support (electronic supplementary material, table S2, PP = 0.34) groups all the Bacteria except Rhodopirellula on one side of the root. Some authors have argued for a root within the Bacteria on the basis of polarized indels or other rare genomic changes [55,56], although neither of these proposals unites the planctomycetes (here represented by Rhodopirellula) with the Archaea and eukaryotes. While resolution of this issue will clearly require analyses with a greatly improved sampling of Bacteria, we also sought to investigate the reason for the different root inferences under the NR and HB models. Recall that, in the case of the NR model, the σR parameter provides a measure of reversible departures from the HKY85 model while the σN parameter provides a measure of non-reversibility. Plots showing the weight of evidence in the data for different values of σN and σR for the Thermus and tree of life datasets showed markedly different behaviour (figure 3): while both datasets revealed evidence of non-zero values for σR, providing support for GTR-like over HKY85 exchangeabilities, posterior support for non-zero values of σN is clearly greater in the tree of life. Thus, the tree of life dataset shows substantial evidence of non-reversibility in the substitution process within branches, which is not accounted for in the HB model. This observation may provide a partial explanation for the failure of the HB model to recover the most widely accepted root position on this dataset. It also suggests that, beyond the compositional heterogeneity that is increasingly recognized as an important and pervasive feature of real sequence data, some alignments may also contain significant evidence of non-reversibility in the substitution process. This finding agrees with the work of Squartini & Arndt , who presented evidence for non-reversibility in the evolution of the Drosophila and human lineages, and motivates the development of phylogenetic models that can account for both non-stationarity and non-reversibility, as these may both be salient features of real sequence data.
5. The root of the archaeal radiation
If the root of the tree of life lies between the Bacteria and Archaea, then the divergence of these two lineages represents the deepest split in the cellular world. Rooting the archaeal tree would establish a phylogenetic framework for reconstructing the gene content of the first archaeon and for constraining hypotheses about the earliest archaeal metabolisms. The models we introduce here are particularly well suited to addressing this question because they obviate the need to include a bacterial outgroup and the long branch that results, potentially improving the robustness of our inferences against long-branch attraction. We considered a concatenated alignment of 16S and 23S rRNA sequences sampled from across the known diversity of Archaea, including the Euryarchaeota, ‘TACK’ superphylum , and recently described ‘DPANN’ lineages . The archaeal alignment showed substantially more heterogeneity across taxa in its empirical composition than the Thermus or tree of life datasets. For example, the standard deviation of the proportion of guanine was 0.064, compared with at most 0.027 for the other two alignments (electronic supplementary material, table S4). The stationary NR model cannot account for such compositional heterogeneity and so we chose to fit the HB model only to this dataset. The root in the consensus tree separates the ‘TACK’ superphylum from a clade containing the Euryarchaeota and ‘DPANN’ Archaea (figure 4), although posterior support for the monophyly of the clades on either side of this root was low (PP = 0.52 and 0.56, respectively), largely due to uncertainty about the position of Korarchaeum on one side of the root, and of Nanoarchaeum and some early diverging methanogenic Euryarchaeota on the other. This uncertainty is reflected in the posterior for root splits, which offers support to a variety of basal arrangements of these groups around the root; see the electronic supplementary material, table S3. It is also interesting to note that the most GC-rich branches in the consensus tree (those leading to Nanoarchaeum on the one hand, and the TACK superphylum on the other) are close to the inferred root, consistent with our analysis of the rRNA tree of life (figure 2b). Based on the observation that the GC content of rRNA genes increases with optimal growth temperature , these results are compatible with the idea that the archaeal common ancestor was a thermophile . Intriguingly, the analysis robustly excluded the root from within several major archaeal groups, including the Crenarchaeota (PP = 0.92), Thaumarchaeota/Aigarchaeota (PP = 0.99), and a clade containing some Euryarchaeota and all of the ‘DPANN’ Archaea except Nanoarchaeum (PP = 0.98).
A key benefit of the HB model is that it allows inference about the root of the tree without the use of an outgroup while directly modelling the variation of sequence composition over time. It is therefore interesting to note that our rooted archaeal tree shares some key features with a recent phylogenomic analysis of the Archaea that made use of proteins for which the distance separating bacterial and archaeal sequences was shorter than in traditional ribosomal protein trees, an approach which should also reduce the impact of long-branch attraction on the in-group phylogeny . In both trees, the highly reduced ‘DPANN’ Archaea emerge polyphyletically from within the Euryarchaeota, and the root is placed between TACK (called ‘Proteoarchaeota’ in ) and this Euryarchaeota/DPANN clade. Taken together, these results suggest that the basal position of a monophyletic DPANN clade in recent analyses may, in part, be attributable to attraction to the long bacterial outgroup. This result illustrates how the approaches described here can complement and extend analyses using traditional outgroup rooting.
6. Prospects for non-reversible and non-stationary substitution models
Despite their potential for addressing key questions in early cellular evolution, non-reversible and non-stationary substitution models are still an under-explored area of phylogenetics. In this article, we have explored two recently developed models for inferring rooted trees from nucleotide data on modest but reasonable numbers of taxa—up to 30 in the case of the archaeal dataset. These models make root inferences that are consistent with independent phylogenomic analyses and anciently duplicated genes and may provide a useful alternative to outgroup rooting. Our results therefore show that real sequence alignments can contain useful information about the position of the root that is implicit both in changes in sequence composition over time as well as—in at least some cases—evidence for directionality in the process of substitution.
Inferring deep phylogenies is challenging, and our analyses also revealed limitations in the models we have developed so far, helping to identify some important points on which progress can be made. Compositional heterogeneity is a pervasive feature of real sequence data, but at least some alignments also show evidence for non-reversibility within branches of the tree (that is, σN > 0). Joint modelling of these features is desirable, but inference under such models is not yet computationally tractable. We have focused on relaxing modelling assumptions so as to make the likelihood dependent on the root of the tree, but we know that other model properties are also important for the accuracy of the inferred topology—in particular, across-site compositional variation (as accommodated by the CAT model ). Future work will focus on incorporating these and other important features into our models, in order to improve the accuracy and robustness of rooted phylogenetic trees.
The sequence alignments analysed in the paper have been deposited at FigShare: http://dx.doi.org/10.6084/m9.figshare.1468407.
T.A.W. and S.E.H. conceived and performed the analyses and wrote the paper. S.E.H., T.M.W.N. and R.J.B. devised and implemented the HB model; S.C., S.E.H., T.M.W.N. and R.J.B. devised and implemented the NR model. T.M.E. provided guidance and contributed significant editing of the paper. All authors commented on and approved the final version of the paper.
We have no competing interests.
T.A.W., S.E.H., S.C. and T.M.E. were supported by a European Research Council Advanced Investigator Programme grant (ERC-2010-AdG-268701), and a Programme Grant from the Wellcome Trust (no. 045404) to T.M.E.
Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
Tarrío R, Rodríguez-Trelles F, Ayala FJ. 2000Tree rooting with outgroups when they differ in their nucleotide composition from the ingroup: the Drosophila saltans and willistoni groups, a case study. Mol. Phylogenet. Evol. 16, 344–349. (doi:10.1006/mpev.2000.0813) Crossref, PubMed, ISI, Google Scholar
Tavaré S. 1986Some probabilistic and statistical problems in the analysis of DNA sequences. In Lectures on mathematics in the life sciences, vol. 17, pp. 57–86. Providence, RI: American Mathematical Society. Google Scholar
Omelchenko MV, Wolf YI, Gaidamakova EK, Matrosova VY, Vasilenko A, Zhai M, Daly MJ, Koonin EV, Makarova KS. 2005Comparative genomics of Thermus thermophilus and Deinococcus radiodurans: divergent routes of adaptation to thermophily and radiation resistance. BMC Evol. Biol. 5, 57. (doi:10.1186/1471-2148-5-57) Crossref, PubMed, ISI, Google Scholar