Paleophenotype reconstruction as a window into historic biological states

The conditions for life in the deep past differed in many important ways from those on Earth today. Simple sequence-based comparative methods of evolutionary reconstruction, applied to single genes or proteins, can therefore give an incomplete or misleading picture of ancestral life forms. One way to augment sequence-based, single-gene methods and obtain a richer and more reliable picture of the deep past is to resurrect inferred ancestral protein sequences in living organisms, where their phenotypes can be exposed in a complex molecular-systems context, and then to link the consequences of those phenotypes to biosignatures preserved in the independent historical repository of the geological record. Good candidates for such ‘revenant gene’ studies are the genes for enzymes involved in carbon fixation or other core metabolic pathways. Resurrecting ancestral DNA by using synthetic-biology methods to engineer modern host bacteria is only beginning to be used as a systematic method in evolutionary biology. It has great potential to refine our understanding of the historical era already probed by phylogenetic methods, and even to suggest the forces governing the assembly of living systems reaching back into the “pre-historic” past, before those sequence divergences that have left descendants into the modern era. However, good design of revenant gene studies introduces new and interesting problems in the selection of genes, biosignatures, and modern host organisms, which should be understood as part of the next step in the advancement of evolutionary methods.


Main Text
The history of life on Earth has left two main repositories of evidence from which we can try to reconstruct it: the geological record, and the extant genetic diversity of organisms. Contemporary organisms, however, can be complex and cryptic vehicles for information about their history [1,2].
Many genetic signatures in known life have been largely overwritten due to changing conditions of natural selection, evolutionary convergences, or simply genetic drift [3,4]. However, functional links between the evolution of a protein (or a metabolic network) and biosignatures preserved in the geological record may provide clues to the timing and origin of major phylogenetic groups of organisms, including clades that went extinct and are no longer accessible for direct comparative study [5][6][7][8][9].
One way to connect the geological and genomic data sets is through ancestral sequence reconstruction [10][11][12]. One first infers ancestral sequences of biological molecules by phylogenetic reconstruction methods, and then uses these proposed sequences to synthesize models of paleoenzymes either computationally or experimentally [13][14][15][16]. In some cases the enzymes may be used to replace their modern counterparts in living organisms, being brought back to life as 'revenant genes', to obtain in situ expressions of "paleo-phenotypes" of the organisms that once harboured them [17,18].
Efforts to reconstruct paleophenotypes require careful design, to avoid misinterpreting artifacts of reconstruction bias, or using host organisms that may not faithfully reproduce ancestral phenotypes because too many of their systems have since adapted to other conditions. We present here a paleophenotype reconstruction approach that builds on prior efforts in paleoenzymology, extending the use of inferred ancestral gene/enzyme sequences engineered within modern organisms. Our functional framework builds on applying paleophenotypes to complex biology and on experimentally testing historical geobiological models and hypotheses. We begin by outlining the logical motivation for paleophenotype reconstruction and describe the criteria that should be addressed as a basis for selecting an enzymatic system for paleophenotype reconstruction at the systems level. We then use ancestral sequence reconstruction to determine the evolutionary history of a critical component of the photosynthetic CO2 fixation pathway, the beta-carbonic anhydrase protein, and critically evaluate the selection criteria for candidate revenant genes suitable for paleophenotype reconstruction studies.
Our three linked goals are: 1) to learn (by solving concrete cases) when one must look beyond single-gene phylogeny to reconstruct entire functioning molecular systems, in order to correctly link enzyme properties to geological signatures; 2) by studying cases such as Calvin-cycle carbon fixation, where an isotope signature is inherently linked to a functional criterion such as molecular selectivity and an environmental property such as oxygen activity, to demonstrate a consistent multi-factor reconstruction of an organism's phenotype in its environmental context; and 3) to search for features in which ancient proteins may have been truly more primitive than any proteins that have survived in extant organisms, to understand the evolutionary progression from the first (we may suppose fitful) invasions of new modes of cellular life, metabolism, or bioenergetics, and the refined forms of modern organisms that have made it difficult to infer the paths through which these emergences could have taken place.

THE LOGICAL MOTIVATION FOR PALEOPHENOTYPE RECONSTRUCTION
To understand why, and for which systems, experimental paleophenotype reconstruction is likely to be an important scientific advance, it is helpful to reflect on the limited forms of information about evolutionary processes that are actually employed in conventional methods of sequence-based phylogenetic reconstruction. Efforts to map out more of the genotype-phenotype correspondence, whether through modelling or by in vivo expression, and to correlate these with independent evidence carried geologically, may be understood as a way to bring in other dimensions of information about evolution that can contribute to historical reconstruction.

Phenotype Information Can Augment Relatively Simple Sequence Substitution Models
The field of phylogenetic inference, after nearly a half century of dedicated work, has addressed most problems of consistent sampling and error estimation [19][20][21][22][23][24][25][26][27]. Yet the probability models that are the workhorses of most phylogenetic inference are disconnected from their context: they typically are site-local insertion, deletion, and substitution models with no semantics of the functioning objects produced. Often it is necessary to restrict probability models to evaluating substitution events independently at each site, to keep computations affordable, especially for large data sets. However, such models are by construction incapable of reflecting interaction properties that can cause some joint variations to have a very different likelihood of producing viable organisms than the marginal variations do independently.
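The site-independence restriction described above can be written out explicitly. The following is only a sketch in notation introduced here for illustration (it is not taken from the cited models): D is an alignment with columns D_1, ..., D_n, T is a tree, and θ collects the substitution parameters.

```latex
P(D \mid T, \theta) = \prod_{i=1}^{n} P(D_i \mid T, \theta)
\quad\Longrightarrow\quad
P(D_i, D_j \mid T, \theta) = P(D_i \mid T, \theta)\, P(D_j \mid T, \theta)
\qquad (i \neq j)
```

Under this factorization, a pair of substitutions at two interacting sites i and j is scored as the product of two independent events, even in cases where only the joint change would yield a viable protein.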
Information about interaction effects can often be obtained from models of protein domain structure, folding, and function [30][31][32][33][34][35][36][37]. Learning how to use functional protein models to identify the most important non-local interactions and represent their effects on viability and fitness is one goal that can be pursued as more ancestral enzymes are reconstructed. Substitution probabilities must also be estimated jointly with alignments, and systematically biased alignment estimates can lead to mis-specified substitution models [38][39][40][41]. Information about folding and function can be particularly informative for ambiguous alignments, as substitutions or crossovers that preserve domain structures should yield viable organisms more often than those that would be incompatible with maintaining functional domains.

Synthetic-Biology Methods Offer Ways to Test the Internal Consistency of Reconstructions
Much current-generation phylogenetic inference, because of the big-data survey nature of its questions [42], yields independently-derived proposals for the presence or absence of genes in ancient genomes, along with putative sequences for ancient proteins [43][44][45]. However, the probability models generating these claims at present include no information (as part of the Monte Carlo generate-and-test cycle itself) about the consistency of the physiologies they predict. The use of synthetic biology methods to insert reconstructed genes into living organisms, or to test proposed molecular systems either in vitro or with modified genomes in vivo provides ways to test historical models at the system level. It can help bridge the gap between ancestral sequences inferred with algorithmically sophisticated but information-poor probability methods, and proposals for how they might have co-occurred in ancient cells.

PROPOSED FRAMEWORK FOR REBUILDING PALEOPHENOTYPES IN THE LABORATORY
We outline here three criteria in choosing enzyme systems for which a paleophenotype reconstruction and systems-engineering approach may be feasible and may yield interesting insights beyond those delivered by simple sequence-phylogenetic methods alone. They are directed both at properties of the enzymes and properties of the clades and environments in which these occurred over time (see Figure 1).

i) Geology: Is the problem geochronologically constrained? Does the protein system of interest mediate a biosignature that is recoverable from the rock record? Is there temporal structure in that biosignature that can be correlated with important evolutionary transitions in either enzyme context or function? Do major changes in enzyme function correspond to events of phylogenetic divergence, which can then be calibrated against geochronology?
ii) Phenotype: Can information be provided by in vivo resurrection of the protein that resolves important ambiguities in the usual methods of ancestral sequence inference, or that shows important errors in the assumptions usually made about sequence inference? This is information we think of as being reported by the phenotype of the protein, whether it is revealed by resurrection or by computational modelling.
iii) Ancestry: Do we have a current organism that is similar enough to the host organisms for the ancestral proteins that expressing them in our current organism will reveal the phenotypic characters that governed their function in the past? Is the proposed host a well-studied model organism? Are other essential components of metabolic pathways present in contemporary organisms also remnants of ancient life [46], and can their major evolutionary innovations also be inferred where these are significant to system functions?
Through reconstructing and examining the evolutionary history of contemporary components and then tying their phenotypes into biosignatures in mineral form, we can provide insight into innovations that are grounded in the rock record and thus in the geological and ecological context.
Enzymes involved in carbon fixation, such as Ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) proteins, are thought to be one of the main causes of a distinct biosignature preserved in the rock record. This biosignature is revealed through comparison of 13C-isotope measurements between carbonate originally derived from atmospheric CO2 and organic carbon sequestered from biomass. 13C-isotope fractionation differences are the oldest record of living organisms, extending to at least ~3.5 billion years in the past [47][48][49][50]. Rubisco is the distinctive catalyst and putative isotopic bottleneck for the Calvin-Benson-Bassham cycle [51,52], the predominant photosynthetic carbon-fixation pathway by volume, and is therefore at the heart of many fundamental questions about the co-evolution of early life and the development of biogeochemical cycles of the planet [53].
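For reference, the carbonate-versus-organic comparison described above is conventionally expressed in delta notation. The expressions below are a sketch based on standard geochemical definitions, not formulas given in this paper; the symbols R, δ13C and ε are introduced here for illustration, with values reported in per mil.

```latex
\delta^{13}\mathrm{C} = \left( \frac{R_{\mathrm{sample}}}{R_{\mathrm{standard}}} - 1 \right) \times 10^{3},
\qquad
R = \frac{{}^{13}\mathrm{C}}{{}^{12}\mathrm{C}}

\varepsilon \approx \delta^{13}\mathrm{C}_{\mathrm{carbonate}} - \delta^{13}\mathrm{C}_{\mathrm{organic}}
```

Because enzymatic carboxylation favours 12C, organic carbon is depleted in 13C relative to coeval carbonate, so ε is positive under this sign convention.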
Indeed, while we do not know whether ancient Rubisco proteins exhibited paleophenotypic properties comparable to those of contemporary Rubisco, or how efficiently the ancestral Rubisco proteins functioned under ancient environmental conditions, Rubisco plays a pivotal role in biogeochemical interpretation of the C-isotope fractionation patterns in deep time [47][48][49][50][54][55][56]. Characterizing ancient Rubisco at the key nodes of its phylogeny, to elucidate the steps of biochemical adaptation and the resulting protein and organismal behaviour, is therefore well suited to a paleophenotype reconstruction approach: the reconstructed enzymes can be used for isotope fractionation measurements, and even for phenotypic resurrection of isotope fractionation by engineering these ancient genes inside modern cyanobacteria [57][58][59].
Rubisco proteins do not function in isolation in a cellular system. In bacteria, carbonic anhydrase proteins support Rubisco activity by mediating efficient CO2 transport into and around the cell [60].
Carbonic anhydrase converts bicarbonate to carbon dioxide in the carboxysomes where Rubisco is localized. Carboxysomes are organelles thought to have evolved as a consequence of the increase in atmospheric oxygen concentration on the ancient Earth [61][62][63][64]. This conversion alleviates the stringency required of Rubisco for carboxylase over oxygenase activity and reduces the energy and carbon loss that result from photorespiration. Additionally, carbonic anhydrases have essential roles in facilitating the transport of carbon dioxide and protons in the intracellular space, across biological membranes and in the layers of the extracellular space [65].
To understand the later-diverging innovations of photosynthetic systems associated with the rise of oxygen, the drawdown of atmospheric and oceanic CO2, and the colonization of land, it may even become essential to jointly reconstruct innovations in Rubisco and carbonic anhydrase with other metabolic and compartmental systems that serve as carbon-concentrating mechanisms [66,67]. The joint evolution of pathways associated with photorespiration may also provide evidence about the O2/CO2 discriminatory capabilities of ancestral enzymes and aid the interpretation of the ancient isotope signals.
As the first step beyond single-molecule reconstruction to the study of functional molecular systems, we present here the phylogenetic history of carbonic anhydrase enzymes and the ancestral sequence for the beta-carbonic anhydrase protein. We assess whether/how carbonic anhydrase proteins meet our selection criteria for paleophenotype reconstruction, and demonstrate that events of horizontal gene transfer in an evolutionary tree for a given gene need to be recognized prior to a laboratory paleophenotype reconstruction.

A UNIVERSAL CARBON SHUTTLE IN PHOTOSYNTHESIS AND BEYOND: CASE STUDY OF THE RECONSTRUCTION OF ANCIENT CARBONIC ANHYDRASE PROTEINS
Carbonic anhydrase is found in metabolically diverse species representing all three domains of life [68][69][70]. The three main classes of carbonic anhydrase (alpha, beta and gamma) are not homologous and are thought to be a result of convergent evolution [71,72]. Although molecular dates based only on sequence comparison should be regarded with caution, it has been suggested that both the gamma and the beta classes are ancient enzymes, which existed before the split between archaeal and bacterial domains [73,74].
In a coarse assessment, carbonic anhydrase meets our paleophenotype selection criteria. It carries out an essential and ancient function in the carbon concentration machinery. While no particular study (to our knowledge) has attributed a specific biosignature to the activity of bacterial carbonic anhydrases, this enzyme mediates CO2 efflux in the carboxysome, potentially impacting the interpretation of Rubisco kinetic isotope selectivity, which is correlated with molecular CO2/O2 discrimination and turnover rate, in terms of the ambient CO2 and O2 activities in the cellular environment. The root of the gamma-class is inferred to have extended to approximately 4.2 billion years ago [73]. Moreover, the presence of carbonic anhydrase in thermophilic chemolithoautotrophs suggests that other ancient CO2-fixation pathways besides the Calvin cycle also depended on carbonic anhydrase function for efficient C-fixation [75,76].
In this study, we focus on the beta-carbonic anhydrase, also called the prokaryotic carbonic anhydrase (although it has been found in eukaryotes as well). Beta-carbonic anhydrase is an ancient enzyme that is widely represented in prokaryotes. Despite its critical role for Earth's biosphere, few studies to date have focused on the molecular evolution of beta-carbonic anhydrases [77,78].
Beta-carbonic anhydrases have been subdivided into four main clades, A to D. One group of enzymes belonging to the B clade of the beta-carbonic anhydrase is a probable example of neofunctionalization: these enzymes take CS2 (and not CO2) as a substrate [79]. The previously reconstructed phylogenetic history of the Rubisco proteins, an essential partner of carbonic anhydrase in the carboxysome, yields a highly supported phylogenetic tree that recapitulates the organismal phylogeny [55,81]. In contrast, here we show that the evolutionary history of the carbonic anhydrase is more complex: several enzymes with no common ancestor (the different classes of carbonic anhydrases) catalyse the same reaction. Even within classes of carbonic anhydrases, duplications and horizontal gene transfers seem frequent. This is an expected outcome of the greater generality and modular function of carbonic anhydrase compared to the specialized role of Rubisco: modular components of a metabolic system are much more readily transferred or re-evolved through convergent evolution. The interpretation that carbonic anhydrase function is general and modular is further suggested by its redundant presence in many organisms: the number of beta-carbonic anhydrase homologs ranges from none to as many as six genes per genome in cyanobacteria, with most genomes having more than one gene. This suggests that the cost of exchanging (by horizontal gene transfer) or losing one copy of carbonic anhydrase (either completely or by neofunctionalization, as in the case of the CS2 hydrolase [79]) is small. Nevertheless, ancestral reconstruction, in parts of the tree where a phylogenetic signal can be established with significant confidence, as in the cyanobacterial group in the B clade of the beta-carbonic anhydrase, shows that the conserved parts of the alignment are even more conserved in the ancestors of the clade than in the extant species (Figure 3).
None of the highly conserved residues differs in the ancestors of the group (Node 67, depicted with an asterisk in Figure 2, and its descendants, but not the extant species), suggesting that the function of the enzyme has not significantly changed inside the group.
We further analysed the reliability of the ancestral carbonic anhydrase (Node 67) in the last cyanobacterial common ancestor by examining the posterior probabilities for each reconstructed residue (Figure 4). Multiple alignment of the reconstructed ancestral sequence with the sequences from all of the known extant species shows that the sequence of the large domain (located on the N-terminal side) can be confidently established (Figure 4). This fragment spans residues 40 to 240 and covers the functionally important zinc-binding core [83,84]. The other parts of the reconstructed protein sequence are less reliable, as evidenced by the fact that the posterior probabilities of the second and third most probable amino acids are closer to that of the most probable one (Figure 4).

SIGNATURES OF PRIMITIVENESS: EXTENDING LESSONS LEARNED
Relevant signatures of primitiveness can vary across protein families and ancestral functions. In some cases, it may be expected that proteins which are now sub-functionalized to specific substrates were once multifunctional, a change that we would expect to see [85][86][87] in a deep past when error rates in genome replication and also in translation should have been higher [88], favouring fewer and shorter genes, at the cost that each gene may have been required to catalyse multiple reactions in order for pathway formation to be possible (see footnote 3). The very existence of carboxylase/oxygenase discrimination in Rubisco has the character of an emerging but stalled sub-functionalization. Molecular oxygen was by all evidence a minor component of the environments in which Rubisco emerged, and the enzyme mechanism was selected on the only relevant substrate: CO2. The ability of cyanobacteria to perform oxygenic photosynthesis is thought to have converted the early reducing atmosphere into an oxidizing one, which dramatically changed the composition of life forms on Earth by simultaneously enabling new life forms tolerant to oxygen and leading to the near-extinction of the existing ones [91,92].
Footnote 3: Such an interpretation has been advanced for homologies in proteins of the rTCA cycle in Aquificaceae.
By enabling the massive proliferation of oxygenic photosynthesizers, Rubisco introduced the need for a substrate discrimination that had not existed when it arose, potentially creating conditions for its own failure. The reaction mechanism to which the whole enzyme structure is committed is one for which discrimination is costly and only partially successful, even under intense selection pressure. The result was displacement of selective pressure from Rubisco onto other enzymes such as carbonic anhydrase, and onto cellular ultrastructure in forming the carboxysome.
Due to the correlation between isotope selectivity and substrate discrimination in Rubisco, a further signature of primitiveness is suggested, which can readily be tested empirically. All modern Rubiscos fall along a rather tight linear regression between turnover rate and CO2/O2 discrimination, with a less tightly correlated isotope shift [51,52]. The apparently bright horizon, beyond which no Rubiscos are found, is the reason for the interpretation of an inherent trade-off in the mechanism that fixes CO2 using only the free energy of hydrolysis, which forces turnover to be sacrificed as the price of discrimination. The absence of dispersion on the low-performance side of the regression has been interpreted as evidence of evolutionary optimization: that turnover is always maximized against the futile cycle of photorespiration in the CO2/O2 environment of the enzyme. By studying turnover versus discrimination in ancestral Rubiscos, we can test whether they seem to reflect the same optimization horizon as modern Rubiscos. If not, one possibility is that the enzymes were more primitive; another is that naive sequence reconstruction methods miss essential information needed to identify the true ancestral form for this protein.
Coupling an optimality analysis with functional measures of carbonic anhydrase will then allow us to compare the implied CO2/O2 environment to ambient conditions in eras suggested by the phylogenies of Rubisco and other molecular clocks or ancient biosignatures.
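The comparison of an ancestral Rubisco against the modern turnover/discrimination trend can be sketched as a least-squares fit followed by a residual check. This is a minimal illustration, not the authors' analysis: the function names, the choice of discrimination as the x-axis and turnover as the y-axis, and the idea of reporting a signed residual are all assumptions made here, and no measured values are included.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept (hypothetical helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_from_trend(point, slope, intercept):
    """Signed distance (in y units) of one enzyme from the fitted modern trend.

    A strongly negative value would place the enzyme below the trend line,
    i.e. on the low-performance side under the axis convention assumed here.
    """
    x, y = point
    return y - (slope * x + intercept)
```

In use, `fit_line` would be given paired discrimination and turnover measurements for modern Rubiscos, and `residual_from_trend` would then be evaluated for each resurrected ancestral enzyme; how large a residual counts as leaving the modern trend is a judgment this sketch does not make.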

EXTRAPOLATING BIOLOGY BACK TO EARLIEST- OR PRE-BIOLOGICAL CONDITIONS
Comparative sequence reconstruction, even augmented by geochronology, can directly reveal historical evidence only within the era from which sequence divergences have been preserved to the present. The geochronological evidence may extend to earlier times, as is potentially the case for organic carbon signatures (although these become sparse near the time when life emerged on Earth about 4 billion years ago [93][94][95]), and it is plausible that complex cellular life also existed and evolved within these eras; but to understand them we will need interpretive methods beyond simple sequence comparison.
Phylogenetic reconstruction should move beyond sequence reconstruction and the functional deductions based on phenotypes that are heavily impacted by the sequence reconstruction quality.
Indeed, early examples in the young field of paleoenzymology attempted to make inferences about the temperature of ancient environments that no longer exist, by interpreting ancestral protein sequences from organisms long-since extinct [96][97][98][99]. Its predictions should come to integrate enzymatic structure and folding information as well as systems-level effects on protein network interactions and physiology -hence the emphasis on "paleophenotype" rather than "paleosequence" or "paleostructure".
Engineering ancient pathways whose behaviour could recapitulate certain past phenotypes and innovations has the potential to reconcile ambiguities in phylogenetic reconstruction. This may be realized by focusing on assessing those particular phenotypes that facilitate effective comparison to an independent historical record of component or organismal phenotype contained in the rock record. The same dynamical effects (error-prone replication, horizontal gene transfer [88][100][101][102][103]) that tend to erase memory about genes and genomes in the deepest eras of life also remove one of the aspects of biology that interferes most with modelling from first principles: the capacity for historically-contingent features to contribute essential context for function. If we can use a combination of deep sequence reconstruction, functional modelling, and geological constraints on phenotype, we may be able to identify the "rules of assembly" for very early living systems. These are the rules that, as we extend to epochs in which memory was less robust, should have governed

CONCLUSION
The ability of cyanobacteria to perform oxygenic photosynthesis is thought to have converted the early reducing atmosphere into an oxidizing one, which dramatically changed the composition of life forms on Earth by simultaneously enabling new life forms tolerant to oxygen and leading to the near-extinction of species acclimated to anoxic conditions. Much of this information is derived from the geologic record: evidence of carbon cycling (and biological activity) can be inferred from carbon isotopes, which lie at the interface between enzymatic activity, organismal phenotype and the formation of sedimentary rocks [104]. As a corollary to these interpretive schemes, it is assumed that the controlled fixation of inorganic carbon to organic carbon is a precondition for the emergence of living systems. While selectivity in the carbon isotope composition of biologically-produced organic matter is evidence that is preserved in the rock record that reflects the metabolic activity of ancient

Genome selection
Representative genomes for different taxonomic groups were selected using the software phyloSkeleton (manuscript in revision; available at https://bitbucket.org/lionelguy/phyloskeleton) as follows: one for each species of Streptococcus, one for each family in Proteobacteria and Cyanobacteria, and one for each order otherwise. After manual curation of poorly classified genomes, 388 genomes were retained (Supp. Table 1).

Carbonic anhydrase homologs identification
Genomes were searched with HMMer v3.1b2 [105], using the PFAM profile for the so-called prokaryotic carbonic anhydrase (http://pfam.xfam.org/family/PF00484), which contains representatives of the beta-carbonic anhydrase clades A, B and C (but not D), as defined in [79]. All homologs with an E-value < 1e-10 were retained. In the 388 selected genomes, between 0 and 6 homologs of CA per genome were found, with most genomes having 0 (102 genomes), 1 (176 genomes) or 2 (72 genomes) copies. In total, 457 sequences were retrieved. The sequence length varied from 118 to 867 amino acids, with eukaryotic homologs being the longest. Most sequences were between 200 and 250 amino acids long.
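The retention and counting step described above can be sketched as follows. This is a minimal illustration under stated assumptions: hits are assumed to have already been parsed from the search output into (genome, E-value) pairs, and the function name and input layout are hypothetical rather than part of HMMer or phyloSkeleton.

```python
from collections import Counter


def copy_number_histogram(hits, genomes, max_evalue=1e-10):
    """Count retained homologs per genome and tabulate the copy numbers.

    hits    : iterable of (genome_id, evalue) pairs (assumed pre-parsed)
    genomes : all genome ids that were searched, so genomes without hits count as 0
    Returns a Counter mapping copy number -> number of genomes.
    """
    copies = {genome: 0 for genome in genomes}
    for genome, evalue in hits:
        # Keep only hits strictly below the E-value threshold used in the text.
        if evalue < max_evalue:
            copies[genome] = copies.get(genome, 0) + 1
    return Counter(copies.values())
```

Summing copy number times genome count over the returned histogram gives the total number of retained sequences, which offers a quick consistency check against the totals reported above.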

Sequence alignment
The 457 CA homologs found by phyloSkeleton and the 35 sequences that could be retrieved from Smeulders et al. (2011) were aligned with mafft-linsi v7.215 [106], and the resulting alignment was filtered with trimal [107] to remove positions that had >50% gaps. The filtered alignment contained 200 positions and was visually inspected for any obviously misaligned regions.
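The gap-filtering rule described above (dropping alignment columns in which more than half of the sequences have a gap) can be sketched in a few lines. This is an illustration of the rule only, not of how trimal implements it; the function name and the use of "-" as the sole gap character are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment : list of equal-length aligned sequences (strings), '-' marks a gap
    Returns the filtered sequences in the same order.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    keep = [
        col
        for col in range(length)
        if sum(seq[col] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]
```

A column with exactly 50% gaps is kept under this reading of ">50% gaps"; whether the boundary case is kept or dropped is a detail that should be checked against the tool actually used.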

Phylogenetics
RAxML 8.2.8 [108] with the PROTCATLG model was used to infer the phylogenetic tree depicted in Figure 2.
A hundred parametric bootstraps were drawn to estimate branch reliability.
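Branch reliability from a set of bootstrap replicates can be summarized by counting how many replicate trees contain each branch of the reference tree. The sketch below assumes each tree has already been reduced to a list of clades (sets of taxon names on one side of a branch); this representation and the function names are hypothetical, and the code does not reproduce RAxML's own procedure or depend on how the replicates were generated.

```python
def canonical_split(clade, all_taxa):
    """Represent a branch as the side that excludes the alphabetically first taxon.

    A branch splits the taxa in two; either side describes the same branch,
    so one side is chosen consistently to make splits comparable.
    """
    clade = frozenset(clade)
    complement = frozenset(all_taxa) - clade
    anchor = min(all_taxa)
    return complement if anchor in clade else clade


def bootstrap_support(reference_clades, replicate_trees, all_taxa):
    """Count, for each reference branch, the replicate trees that contain it."""
    reference = [canonical_split(c, all_taxa) for c in reference_clades]
    counts = {split: 0 for split in reference}
    for tree in replicate_trees:
        splits = {canonical_split(c, all_taxa) for c in tree}
        for split in reference:
            if split in splits:
                counts[split] += 1
    return counts
```

With one hundred replicates, the returned counts can be read directly as percentages, which matches the way support is reported for branches found in more than 50 bootstrap trees.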

Ancestral reconstruction
Ancestral reconstruction of the clade B carbonic anhydrase in cyanobacteria was done by first collecting all homologs of the carbonic anhydrase in cyanobacteria, choosing one representative genome per genus in cyanobacteria with phyloSkeleton, and adding genomes branching close to cyanobacteria, both clade A and clade B (Sorangium cellulosum, Sandaracinus amylolyticus, and Beijerinckia indica) in Figure 2, resulting in 83 genomes (Supp. Table 2). Sequences belonging to clade B were then extracted and uploaded to Phylobot [109].
Sequences were analyzed with muscle and msaprobs, and trees were drawn under the PROTCATLG and PROTGAMMALG models. The complete results are available at Phylobot: http://phylobot.com/38899544/.
The result of the muscle alignment and the tree drawn under PROTCATLG were further analyzed and visualized in Jalview 2.10.1 [110].
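The per-residue reliability assessment of a reconstructed ancestor, in which a site is considered less reliable when the second most probable amino acid comes close to the most probable one, can be sketched as below. The input layout (one dictionary of posterior probabilities per site), the function names, and the 0.5 margin threshold are assumptions made for illustration; they are not Phylobot's output format or a cutoff used in this study.

```python
def site_confidence(posteriors):
    """For each site, return (best residue, its posterior, margin over the runner-up).

    posteriors : list of dicts mapping amino acid -> posterior probability
    """
    summary = []
    for site in posteriors:
        ranked = sorted(site.items(), key=lambda item: item[1], reverse=True)
        best_residue, best_p = ranked[0]
        second_p = ranked[1][1] if len(ranked) > 1 else 0.0
        summary.append((best_residue, best_p, best_p - second_p))
    return summary


def ambiguous_sites(posteriors, min_margin=0.5):
    """1-based positions where the best residue leads by less than min_margin."""
    return [
        position
        for position, (_, _, margin) in enumerate(site_confidence(posteriors), start=1)
        if margin < min_margin
    ]
```

Positions returned by `ambiguous_sites` are natural candidates for testing alternative ancestral variants experimentally, rather than relying on the single most probable sequence.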

Authors' Contributions
All authors analyzed the data, contributed to writing the final manuscript and gave final approval for publication.

Figure 1
Criteria for paleophenotype reconstruction in the laboratory by generating hybrid ancient-modern bacterial systems

Figure 2
Maximum-likelihood phylogenetic reconstruction of the carbonic anhydrase (CA). The 492 CA homologs retrieved from 388 representative genomes were aligned with mafft-linsi and a maximum-likelihood phylogeny was inferred from the alignment with RAxML, using the PROTCATLG model. Bootstrap support is displayed for branches supported by > 50 bootstrap trees.
Clades were collapsed to provide a more readable tree. The number of members of major taxonomic groups is presented in parentheses next to each collapsed clade. Cyanobacteria are depicted in blue.