The molecular signal for the adaptation to cold temperature during early life on Earth

Several lines of evidence such as the basal location of thermophilic lineages in large-scale phylogenetic trees and the ancestral sequence reconstruction of single enzymes or large protein concatenations support the conclusion that the ancestors of the bacterial and archaeal domains were thermophilic organisms which were adapted to hot environments during the early stages of the Earth. A parsimonious reasoning would therefore suggest that the last universal common ancestor (LUCA) was also thermophilic. Various authors have used branch-wise non-homogeneous evolutionary models that better capture the variation of molecular compositions among lineages to accurately reconstruct the ancestral G + C contents of ribosomal RNAs and the ancestral amino acid composition of highly conserved proteins. They confirmed the thermophilic nature of the ancestors of Bacteria and Archaea but concluded that LUCA, their last common ancestor, was a mesophilic organism having a moderate optimal growth temperature. In this letter, we investigate the unknown nature of the phylogenetic signal that informs ancestral sequence reconstruction to support this non-parsimonious scenario. We find that rate variation across sites of molecular sequences provides information at different time scales by recording the oldest adaptation to temperature in slow-evolving regions and subsequent adaptations in fast-evolving ones.


Introduction
Several lines of evidence support the hypothesis that, during early stages of the evolution, life was adapted to high temperatures that may have prevailed on the surface of the early Earth. For instance, previous studies discovered that the deepest branching lineages within the bacterial and archaeal domains are thermophilic [1]. This scenario is also supported by the reconstruction and synthesis of ancestral translation elongation factor Tu sequences that appear more and more thermostable when going back in time [2] and by an estimation of the amino acid composition of ancestral proteomes that appear more similar to the composition of extant thermophiles than that of mesophiles [3].
A tight relation exists between either the G þ C content in ribosomal RNAs or the amino acid contents in proteins and the optimal growth temperature (OGT) of Bacteria and Archaea. Such correlations between molecular composition and temperature may be explained by structural adaptation increasing RNA and protein thermostability [4,5] and are likely to remain constant over evolutionary time. They allow the construction of molecular thermometers [6] that ancestral base and amino acid compositions through ancestral sequence reconstruction. Using such an approach, Boussau et al. [7] concluded that molecular sequence data confirm the hypothesis of high-temperature adaptation during the early stages of life, namely for the ancestors of the bacterial and archaeal domains. However, these authors reported strong evidence for a non-parsimonious scenario in which the last universal common ancestor (LUCA) itself, living at a still earlier stage of the history of life, was a mesophilic organism.
Through a number of control experiments, Boussau et al. [7] have shown that the use of non-homogeneous substitution models, which are capable of capturing the variation of composition among lineages, are key to accurately estimate ancestral base and amino acid compositions, and therefore ancestral temperatures. But these authors have not identified the specific molecular properties present in extant sequences that inform non-homogeneous models to support such a non-parsimonious scenario. In this letter, we aim to address this issue.

Material and methods (a) Datasets and non-homogeneous models
Boussau et al. [7] built a concatenate of small-and largesubunit rRNAs from 456 organisms (2239 sites) and used the sites restricted to stem regions (1043 sites) to infer the ancestral G þ C contents over the tree of life. From these alignments, we selected 125 species covering a broad taxonomic diversity without redundancy in the taxonomic sampling. Regarding the concatenation of proteins, the 56 gene families and 30 species considered in Boussau et al. [7] were used here, and increased to 38 species, with the addition of Archaea species in particular, which were poorly represented in the first set of species. We reconstructed ML phylogenetic trees for rRNAs (on the 2239 sites dataset) and proteins with PHYML [8]. A three-domain tree was obtained and the root was placed on the branch between the ancestors of Bacteria and Archaea/Eukaryotes. As in [9] and [7], the branch-wise equilibrium frequencies were estimated along these universal phylogenetic trees. The stem dataset was analysed with the BPPML program [10] assuming a discrete gamma distribution with eight categories to model rate variation among sites and the non-homogeneous Galtier & Gouy (GG) substitution model [11]. The GG model specifies branch-wise equilibrium G þ C contents, as well as an independent G þ C content at the root. For proteins, we used a new branch-wise non-homogeneous model implemented in the maximum-likelihood (ML) framework, named COaLA [12] that we recently designed. See the electronic supplementary material for a description of the COaLA model and an evaluation of the fit to data of the non-homogeneous models in comparison with homogeneous models.

(b) Molecular thermometers
OGT highly correlates with the G þ C content of the stem regions of rRNAs (r ¼ 0.76, p-value , 0.001; see the electronic supplementary material, figure S2) and with the second axis of the COA computed on amino acid compositions of the protein dataset restricted to prokaryotic species (r ¼ 0.88, p-value , 0.001; see the electronic supplementary material, figure S3). We controlled for phylogenetic inertia with the phylogenetic independent contrast approach [13] using the R package APE [14] and observed that those correlations were still strongly significant. Linear regressions between OGTs and compositions were then computed to obtain the molecular thermometers.

(c) Inference of ancestral compositions and optimal growth temperatures
The ancestral sequences were inferred with BPPANCESTOR [10] using the evolutionary parameters estimated by BPPML. For each node of the tree, 100 ancestral sequences were generated by drawing amino acids from the posterior distributions of probabilities. The average composition of these ancestral sequences was calculated and the corresponding ancestral temperatures were deduced from the molecular thermometers (see the electronic supplementary material for the confidence intervals computation and the caution required when interpreting ancestral temperatures).

Results and discussion
We first confirm results obtained in [7] with the present rRNA and protein datasets and the non-homogeneous GG [11] and COaLA [12] substitution models in ML. Figures 1 and 2 0 show that LUCA is estimated to have lived in colder environments than the ancestors of Bacteria and Archaea (Wilcoxon test, p-value , 0.001), which were hyperthermophiles. Electronic supplementary material, figure S1 shows that this pattern is also recovered when an alternative tree topology is used, in which Eukaryotes branch within Archaea (Eocyte hypothesis [15]) but is less pronounced with the homogeneous LG model, which infers a thermophilic LUCA. The phylogenetic signal that informs a non-hyperthermophilic LUCA and yet two hyperthermophilic descendants is currently unknown. However, several points suggest that the variation in evolutionary rate among sites plays a role. First, Fournier & Gogarten [16] highlighted that amino acids that are found in higher proportions in hyperthermophilic species are rarer at slow-evolving sites. Such amino acids notably include charged residues [6]. Second, the signal for a parallel adaptation to high temperatures is partially lost when COaLA is employed without a gamma distribution to model the variation in rate among sites (see the electronic supplementary material, figure S1).
To highlight the influence of rate variation among sites in the differential recording of ancestral compositions, we partitioned the rRNA and protein datasets according to the site evolutionary rates. Figure 2a,b shows that with slow-evolving sites, all ancestors are inferred to be mesophilic organisms, LUCA being adapted to lower temperatures than its two descendants. The ancestral compositions of fast-evolving sites tend to favour hotter ancestral environments, even for LUCA with proteins. But LUCA is still inferred to live at lower temperatures than the ancestors of Bacteria and Archaea. As expected, the quantitative estimates of past temperatures inferred by both slow-and fast-evolving sites are different from those obtained with the complete dataset. Indeed, although slow-evolving sites conserved reliable signals for ancestral compositions, they carry less phylogenetic information for the early parallel adaptation to high temperature, which explains why this pattern is less pronounced than that with the complete dataset. However, both the G þ C content (rRNAs) and the third axis of a correspondence analysis (proteins) computed from the slow-evolving sites of extant sequences correlate with OGT (r ¼ 0.72, p-value , 0.001 and r ¼ 0.43, p-value , 0.05, respectively), adding support to the idea that slow-evolving sites can respond to temperature and can represent accurate fossils of ancestral adaptation to temperature. Fast-evolving sites contain a stronger signal for this parallel adaptation but necessarily less reliable information for ancestral compositions, and so ancestral temperatures.
All these results suggest the presence of a genuine signal in molecular sequences indicating a mesophilic LUCA. This signal was recorded thanks to a combination of compositional variation in time and rate variation in site such that slowevolving sites more accurately reflect older temperatures, while fast-evolving sites partially erased this oldest signal in favour of subsequent adaptations to higher temperatures.
Gowri-Shankar & Rattray [17] showed that there is an intrinsic correlation between evolutionary rates across sites and base composition in rRNAs. Therefore, nucleotide composition varies across the sites of an rRNA alignment. These authors showed that branch-wise non-homogeneous models, which account for the variation of composition in time but assume across-site homogeneity, may infer biased ancestral sequence compositions for sequences generated by a time-homogeneous process in which evolutionary rate and base compositions are correlated. The inference bias is directed towards the composition of slow-evolving sites which are, in the case of full-length rRNAs including both stem and loop regions, GC-poor. One could therefore wonder whether such an inference bias would be responsible for the low G þ C content inferred for LUCA compared with the higher G þ C contents of its first descendants. We reject this bias with two points. First, as in this study, Boussau et al. [7] applied the molecular thermometers on rRNAs to only the stem regions of the molecule. Electronic supplementary material, figure S4 shows that, for these regions, the correlation found by Gowri-Shankar & Rattray [17] is in the opposite direction, although non-significant, with G þ Cenriched slow-evolving sites. Second, we simulated data in a context where the bias would apply, assuming only rsbl.royalsocietypublishing.org Biol Lett 9: 20130608 heterogeneity among sites and no heterogeneity among branches, and verified whether the correlation between site evolutionary rates and site compositions incorrectly informs the non-homogeneous model to estimate a lower G þ C content of LUCA than for its descendants. We partitioned rRNA alignment sites in eight categories according to their evolutionary rate. For each rate-specific category, we simulated DNA sequences with a homogeneous Tamura92 model and the G þ C equilibrium frequency fixed to the observed G þ C frequency of the category, and then concatenated the eight simulated sets. We repeated this procedure 100 times and reconstructed ancestral G þ C contents with the nonhomogeneous GG model on each concatenated simulated alignment. Electronic supplementary material, figure S5 shows that the pattern of parallel increase in G þ C content from LUCA found from real data is not recovered. Instead, LUCA has a higher G þ C content than its two descendants. As slow-evolving sites of stem regions have globally higher G þ C contents than fast-evolving ones (see the electronic supplementary material, figure S4), this simulation result is in agreement with the bias of Gowri-Shankar & Rattray [17]. It further suggests that, if the non-homogeneous model applied to real data is affected by the bias as it is when applied to simulations, the true G þ C content of LUCA may so far have been overestimated.
All these results indicate that non-homogeneous models can capture a genuine timewise variation in composition and that the pattern of parallel increase to high temperatures does not result from a bias owing to a correlation between site-specific rates and site-specific compositions [17] but emerges in spite of this bias.