Resurrecting ancestral genes in bacteria to interpret ancient biosignatures

Two datasets, the geologic record and the genetic content of extant organisms, provide complementary insights into the history of how key molecular components have shaped or driven global environmental and macroevolutionary trends. Changes in global physico-chemical modes over time are thought to be a consistent feature of this relationship between Earth and life, as life is thought to have been optimizing protein functions for the entirety of its approximately 3.8 billion years of history on the Earth. Organismal survival depends on how well critical genetic and metabolic components can adapt to their environments, reflecting an ability to optimize efficiently to changing conditions. The geologic record provides an array of biologically independent indicators of macroscale atmospheric and oceanic composition, but provides little in the way of the exact behaviour of the molecular components that influenced the compositions of these reservoirs. By reconstructing sequences of proteins that might have been present in ancient organisms, we can downselect to a subset of possible sequences that may have been optimized to these ancient environmental conditions. How can one use modern life to reconstruct ancestral behaviours? Configurations of ancient sequences can be inferred from the diversity of extant sequences, and then resurrected in the laboratory to ascertain their biochemical attributes. One way to augment sequence-based, single-gene methods to obtain a richer and more reliable picture of the deep past, is to resurrect inferred ancestral protein sequences in living organisms, where their phenotypes can be exposed in a complex molecular-systems context, and then to link consequences of those phenotypes to biosignatures that were preserved in the independent historical repository of the geological record. As a first step beyond single-molecule reconstruction to the study of functional molecular systems, we present here the ancestral sequence reconstruction of the beta-carbonic anhydrase protein. We assess how carbonic anhydrase proteins meet our selection criteria for reconstructing ancient biosignatures in the laboratory, which we term palaeophenotype reconstruction. This article is part of the themed issue ‘Reconceptualizing the origins of life’.

Seattle, WA 98105, USA BK, 0000-0002-0482-2357 Two datasets, the geologic record and the genetic content of extant organisms, provide complementary insights into the history of how key molecular components have shaped or driven global environmental and macroevolutionary trends. Changes in global physico-chemical modes over time are thought to be a consistent feature of this relationship between Earth and life, as life is thought to have been optimizing protein functions for the entirety of its approximately 3.8 billion years of history on the Earth. Organismal survival depends on how well critical genetic and metabolic components can adapt to their environments, reflecting an ability to optimize efficiently to changing conditions. The geologic record provides an array of biologically independent indicators of macroscale atmospheric and oceanic composition, but provides little in the way of the exact behaviour of the molecular components that influenced the compositions of these reservoirs. By reconstructing sequences of proteins that might have been present in ancient organisms, we can downselect to a subset of possible sequences that may have been optimized to these ancient environmental conditions. How can one 2017 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/ by/4.0/, which permits unrestricted use, provided the original author and source are credited.

Introduction
The history of life on Earth has left two main repositories of evidence from which we can try to reconstruct it: the geological record and the extant genetic diversity of organisms. Contemporary organisms, however, can be complex and cryptic vehicles for information about their history [1,2]. Many genetic signatures in known life have been largely overwritten due to changing conditions of natural selection, evolutionary convergences or simply genetic drift [3,4]. However, functional links between the evolution of a protein (or a metabolic network) and biosignatures preserved in the geological record may provide clues to the timing and origin of major phylogenetic groups of organisms, including clades that went extinct and are no longer accessible for direct comparative study [5][6][7][8][9].
One way to connect the geological and genomic datasets is through ancestral sequence reconstruction [10][11]. One first infers ancestral sequences of biological molecules by phylogenetic reconstruction methods, and then uses these proposed sequences to synthesize models of palaeoenzymes either computationally or experimentally [12][13][14][15]. In some cases, the enzymes may be used to replace their modern counterparts in living organisms, being brought back to life as 'revenant genes', to obtain in situ expressions of 'palaeo-phenotypes' of the organisms that once harboured them [16,17].
Efforts to reconstruct palaeophenotypes require careful design, to avoid misinterpreting artefacts of reconstruction bias, or using host organisms that may not faithfully reproduce ancestral phenotypes because too many of their systems have since adapted to other conditions. We present here a palaeophenotype reconstruction approach that builds on prior efforts in palaeoenzymology, extending the utilization of inferred ancestral gene/enzyme sequences engineered within modern organisms. Our functional framework builds on applying palaeophenotypes to complex biology and on experimentally testing historical geobiological models and hypotheses. We begin by outlining the logical motivation for palaeophenotype reconstruction and describe the criteria that should be addressed as a basis for selecting an enzymatic system for palaeophenotype reconstruction at the systems level. We then use ancestral sequence reconstruction to determine the evolutionary history of a critical component of the photosynthetic CO 2 -fixation pathway-the beta-carbonic anhydrase protein-and critically evaluate the selection criteria for candidate revenant genes suitable for palaeophenotype reconstruction studies.
Our three linked goals are: (i) to learn (by solving concrete cases) when one must look beyond single-gene phylogeny to reconstruct entire functioning molecular systems, in order to correctly link enzyme properties to geological signatures; (ii) by studying cases such as Calvin-cycle carbon fixation, where an isotope signature is inherently linked to a functional criterion such as molecular selectivity and an environmental property such as oxygen activity, to demonstrate a consistent multi-factor reconstruction of an organism's phenotype in its environmental context; and (iii) to search for features in which ancient proteins may have been truly more primitive than any proteins that have survived in extant organisms, to understand the evolutionary progression from the first (we may suppose fitful) invasions of new modes of cellular life, metabolism or bioenergetics, and the refined forms of modern organisms that have made it difficult to infer the paths through which these emergences could have taken place.

The logical motivation for palaeophenotype reconstruction
To understand why-and for which systems-experimental palaeophenotype reconstruction is likely to be an important scientific advance, it is helpful to reflect on the limited forms of information about evolutionary processes that are actually employed in conventional methods of sequence-based phylogenetic reconstruction. Efforts to map out more of the genotype-phenotype correspondence, whether through modelling or by in vivo expression, and to correlate these with independent evidence carried geologically, may be understood as a way to bring in other dimensions of information about evolution that can contribute to historical reconstruction.

(a) Phenotype information can augment relatively simple sequence substitution models
The field of phylogenetic inference, after nearly a half century of dedicated work, has addressed most problems of consistent sampling and error estimation [18][19][20][21][22][23][24][25][26]. Yet the probability models that are the workhorses of most phylogenetic inference are disconnected from their context: they typically are site-local insertion, deletion and substitution models with no semantics of the functioning objects produced. 1 Often it is necessary to restrict probability models to evaluating substitution events independently at each site, to keep computations affordable especially for large datasets. However, such models are by construction incapable of reflecting interaction properties that can cause some joint variations to have very different likelihoods to produce viable organisms than the marginal variations do independently.
Information about interaction effects can often be obtained from models of protein domain structure, folding and function [29][30][31][32][33][34][35][36]. Learning how to use functional protein models to identify the most important non-local interactions and represent their effects on viability and fitness is one goal that can be pursued as more ancestral enzymes are reconstructed. Substitution probabilities must also be estimated jointly with alignments, and systematically biased alignment estimates can lead to mis-specified substitution models [37][38][39][40]. Information about folding and function can be particularly informative for ambiguous alignments, as substitutions or crossovers that preserve domain structures should yield viable organisms more often than those that would be incompatible with maintaining functional domains.

(b) Synthetic-biology methods offer ways to test the internal consistency of reconstructions
Much current-generation phylogenetic inference, because of the big-data survey nature of its questions [41], yields independently derived proposals for the presence or the absence of genes in ancient genomes, along with putative sequences for ancient proteins [42][43][44]. However, the probability models generating these claims at present include no information (as part of the Monte Carlo generate-and-test cycle itself) about the consistency of the physiologies they predict. Does the selected gene/protein system map with the phylogeny? Figure 1. Criteria for palaeophenotype reconstruction in the laboratory by generating hybrid ancient-modern bacterial systems.
ways to test historical models at the system level. It can help bridge the gap between ancestral sequences inferred with algorithmically sophisticated but information-poor probability methods, and proposals for how they might have co-occurred in ancient cells.

Proposed framework for rebuilding palaeophenotypes in the laboratory
We outline here three criteria in choosing enzyme systems for which a palaeophenotype reconstruction and systems engineering approach may be feasible and may yield interesting insights beyond those delivered by simple sequence-phylogenetic methods alone. They are directed both at properties of the enzymes and at properties of the clades and environments in which these occurred over time ( figure 1).
(i) Geology. Is the problem geochronologically constrained? Does the protein system of interest mediate a biosignature that is recoverable from the rock record? Is there temporal structure in that biosignature that can be correlated with important evolutionary transitions in either enzyme context or function? Do major changes in enzyme function correspond to events of phylogenetic divergence, which can then be calibrated against geochronology? (ii) Phenotype. Can information be provided by in vivo resurrection of the protein that resolves important ambiguities in the usual methods of ancestral sequence inference, or that shows important errors in the assumptions usually made about sequence inference? This is information we think of as being reported by the phenotype of the protein, whether it is revealed by resurrection or by computational modelling. (iii) Ancestry. Do we have a current organism that is similar enough to the host organisms for the ancestral proteins that expressing them in our current organism will reveal the phenotypic characters that governed their function in the past? Is the proposed host a well-studied model organism? Are other essential components of metabolic pathways present in contemporary organisms also remnants of ancient life [45], and can their major evolutionary innovations also be inferred where these are significant to system functions?
Through reconstructing and examining the evolutionary history of contemporary components and then tying their phenotypes into biosignatures in mineral form, we can provide insight into innovations that are grounded in the rock record and thus in the geological and ecological context. Enzymes involved in carbon fixation such as ribulose-1,5-bisphosphate carboxylase/ oxygenase (Rubisco) proteins are thought to be one of the main causes of a distinct biosignature preserved in the rock record. This biosignature is revealed through comparison of 13 C-isotope measurements between carbonate originally derived from atmospheric CO 2 and organic carbon sequestered from biomass. 13  is the distinctive catalyst and putative isotopic bottleneck for the Calvin-Benson-Bassham cycle [50,51]-the predominant photosynthetic carbon-fixation pathway by volume-and is therefore at the heart of many fundamental questions about the coevolution of early life and the development of biogeochemical cycles of the planet [52]. Indeed, while we do not know whether ancient Rubisco proteins exhibited palaeophenotypic properties that are comparable to those produced by contemporary Rubisco, or how efficiently the ancestral Rubisco proteins functioned under ancient environmental conditions, Rubisco plays a pivotal role in biogeochemical interpretation of the C-isotope fractionation patterns in deep time [46][47][48][49][53][54][55]. Undoubtedly, characterization of ancient Rubisco as a means of elucidating steps of biochemical adaptation and resulting protein biochemical and organismal behaviour at the key nodes of phylogeny would be crucial and would be applicable for a palaeophenotype reconstruction approach, suitable for subsequent isotope fractionation measurement and even phenotypic resurrection of isotope fractionation through engineering these ancient genes inside modern cyanobacteria [56][57][58].
Rubisco proteins do not function in isolation in a cellular system. In bacteria, carbonic anhydrase proteins support Rubisco activity, by mediating efficient CO 2 transport into and around the cell [59]. Carbonic anhydrase converts bicarbonate to carbon dioxide in the carboxysomes where Rubisco is localized-organelles thought to have evolved as a consequence of the increase in atmospheric oxygen concentration in the ancient Earth [60][61][62][63]-thereby alleviating the stringency required of Rubisco for carboxylase over oxygenase activity and reducing the energy and carbon losses that result from photorespiration. Additionally, carbonic anhydrases have essential roles in facilitating the transport of carbon dioxide and protons in the intracellular space, across biological membranes and in the layers of the extracellular space [64].
To understand the later-diverging innovations of photosynthetic systems associated with the rise of oxygen, the drawdown of atmospheric and oceanic CO 2 , and the colonization of land, it may even become essential to jointly reconstruct innovations in Rubisco and carbonic anhydrase with other metabolic and compartmental systems that serve as carbon-concentrating mechanisms [65,66]. The joint evolution of pathways associated with photorespiration may also provide evidence about O 2 /CO 2 discriminatory capabilities of ancestral enzymes as well as the interpretation of the ancient isotope signals.
As the first step beyond single-molecule reconstruction to the study of functional molecular systems, we present here the phylogenetic history of carbonic anhydrase enzymes and the ancestral sequence for the beta-carbonic anhydrase protein. We assess whether/how carbonic anhydrase proteins meet our selection criteria for palaeophenotype reconstruction, and demonstrate that events of horizontal gene transfer in an evolutionary tree for a given gene need to be recognized prior to a laboratory palaeophenotype reconstruction.

A universal carbon shuttle in photosynthesis and beyond: case study of the reconstruction of ancient carbonic anhydrase proteins
Carbonic anhydrase is found in metabolically diverse species representing all three domains of life [67][68][69]. The three main classes of carbonic anhydrase (alpha, beta and gamma) are not homologous and are thought to be a result of convergent evolution [70,71]. Although molecular dates based only on sequence comparison should be regarded with caution, it has been suggested that both the gamma and the beta classes are ancient enzymes, which existed before the split between archaeal and bacterial domains [72,73].
In a coarse assessment, carbonic anhydrase meets our palaeophenotype selection criteria. It carries out an essential and ancient function in the carbon concentration machinery. While no particular study (to our knowledge) has attributed a specific biosignature to the activity of bacterial carbonic anhydrases, this enzyme mediates CO 2 efflux in the carboxysome, potentially impacting the interpretation of Rubisco kinetic isotope selectivity, which is correlated with molecular CO 2 /O 2 discrimination and turnover rate, in terms of the ambient CO 2  to approximately 4.2 billion years ago [72]. Moreover, the presence of carbonic anhydrase in thermophilic chemolithoautotrophs suggests that other ancient CO 2 -fixation pathways besides the Calvin cycle also depended on carbonic anhydrase function for efficient C-fixation [74,75].
In this study, we focus on the beta-carbonic anhydrase, also called the prokaryotic carbonic anhydrase (although it has been found in eukaryotes as well). Beta-carbonic anhydrase is an ancient enzyme, and is widely represented in prokaryotes. Despite its critical role for Earth's biosphere, to date, not many studies have focused on the molecular evolution of beta-carbonic anhydrases [76,77]. Beta-carbonic anhydrases have been subdivided into four main clades, A-D. One group of enzymes belonging to the B clade of the beta-carbonic anhydrase is a probable example of neofunctionalization: these enzymes take CS 2 (and not CO 2 ) as a substrate [78].
The alignment of the 457 beta-carbonic anhydrase proteins revealed a strong conservation of about 200 amino acids, of which a handful are identical in all or almost all sequences. The phylogenetic reconstruction of the carbonic anhydrase from 388 genomes confirms the existence of four relatively well-supported clades (figure 2 and see electronic supplementary material, figure S1). Clade D seems to be the most distant one, and based on this we have chosen to root the tree from this group. The presence of archaeal homologues close to the root of three out of the four clades gives some credence to the hypothesis that duplication of the beta-carbonic anhydrase is very ancient, perhaps occurring before the archaea/bacteria separation. Except for the main clades and a few other shallower groups, the bootstrap values are altogether low, an ambiguity often observed when reconstructing deep phylogenies based on short protein alignments, due to the small amount of genetic information available [79]. This is further emphasized by the fact that the main bacterial phyla-with the notable exception of cyanobacteria-were rarely reconstructed as monophyletic, even inside each clade. The incongruence between the carbonic anhydrase tree and the generally accepted tree of life can be explained by either one, or a combination of two, main factors: (i) the amount of phylogenetic information contained in this 200-site alignment might be too low to reliably approximate the true tree or (ii) the carbonic anhydrase has been horizontally transferred several times. Among the few well-supported clades, the two that almost solely comprise sequences from the same phylum are two groups of cyanobacteria, one belonging to clade B (bootstrap support: 98) and one to clade C (low bootstrap support) ( figure 2). In addition, sequences from cyanobacteria are found almost exclusively in these two clades. Despite the poor support values of the tree, even inside the group encompassing the cyanobacterial sequences of clade B, a likely scenario for the evolution of the carbonic anhydrase in cyanobacteria can be drafted. We hypothesize that the last common cyanobacterial ancestor had at least two copies of the gene, one clade B-like and the other clade C-like, and that one or the other copy was lost (and perhaps further copies gained) following different patterns in different cyanobacterial descendants. In consequence, we performed an ancestral reconstruction only for the well-supported, monophyletic cyanobacterial group in clade B (see electronic supplementary material, figure S2).
Previously reconstructed phylogenetic history of the Rubisco proteins, an essential partner of carbonic anhydrase in the carboxysome, displays a highly supported phylogenetic tree which recapitulates the organismal phylogeny [54,80]. By contrast, here we show that the evolutionary history of the carbonic anhydrase is more complex: several enzymes with no common ancestor (the different classes of carbonic anhydrases) catalyse the same reaction. Even within classes of carbonic anhydrases, duplications and horizontal gene transfers seem frequent. This is an expected outcome of the greater generality and modular function of carbonic anhydrase compared to the specialized role of Rubisco: modular components of a metabolic system are much more readily transferred 2 or re-evolved through convergent evolution. The interpretation that carbonic anhydrase function is general and modular is further suggested by its redundant presence in many organisms: the number of homologues of the beta-carbonic anhydrase ranged Proteobacteria (4) Proteobacteria (13), Actinobacteria (1), Acidobacteria (1) Actinobacteria (1), Deltaproteobacteria (1) Actinobacteria (7), Deltaproteobacteria (4), misc (4) plants (7) 100 Deltaproteobacteria (3), misc (1) misc (5) misc (5) Euryarchaeota (2) Zetaproteobacteria Alphaproteobacteria (3) Gammaproteobacteria (12) Bacteroidetes (2) misc (5) Eukaryotes (3) Alphaproteobacteria (12) misc (6) Deltaproteobacteria (4) misc (2) misc (6) misc (4) Euryarchaeota (2) Euryarchaeota (3) Streptococcus (52) Crenarchaeota (1) Actinobacteria (5) Firmicutes (7), misc (6) Euryarchaeota (3) Chloroflexi (2) Actinobacteria (17), misc (2) Proteobacteria (9), Cyanobacteria (1), misc (1) Proteobacteria (15) Deltaproteobacteria (5) Proteobacteria (15), misc (8) Actinobacteria (10), Chloroflexi (2) Proteobacteria (8), misc (2) Gamma-(29), Beta-(18), Alphaproteobacteria (6), misc (9) Deltaproteobacteria (2) Alphaproteobacteria (20), misc (2) Epsilonproteobacteria (6) cyanobacterial group in the B clade of the beta-carbonic anhydrase, shows that the conserved parts of the alignment are even more conserved in the ancestors of the clade than in the extant species ( figure 3). None of the highly conserved residues differs in the ancestors of the group (node 67, depicted with an asterisk in figure 2, and its descendants, but not the extant species), suggesting that the function of the enzyme has not significantly changed inside the group. We further analysed the reliability of the ancestral carbonic anhydrase (node 67) in the last cyanobacterial common ancestor by examining the posterior probabilities for each reconstructed residue (figure 4). Multiple alignment of the reconstructed ancestor sequence with the sequences from all of the known extant species shows that the sequence of the large domain (located on the N-terminus side) can be confidently established ( figure 4). This fragment ranges from residues 40 to 240, which covers the functionally important zinc-binding core [82,83]. On the other hand, the other parts of the reconstructed protein sequence are less reliable, as evidenced by the fact that the second-and third-most probable amino acid are closer to the most probable one (figure 4).

Signatures of primitiveness: extending lessons learned from phylogenetic reconstruction to more general principles of molecular evolution
From joint phylogenetic/geochronological reconstructions such as our reconstruction of Rubisco [56] and carbonic anhydrase, we may attempt to identify more general principles of molecular evolution, either by finding signatures of primitiveness-characters in which the first invaders of a new niche or new functionality still reflect what they were before, and are not yet well adapted to their new mode of life-or by studying the dynamic by which forces of selection become displaced from one molecular system to another-when some primitive, inherited molecular function is incapable of responding to selective criteria from the new environment that have become irresistible. Relevant signatures of primitiveness can vary across protein families and ancestral functions. In some cases, it may be expected that proteins which are now sub-functionalized to specific substrates were once multifunctional, a change that we would expect to see [84][85][86] in a deep past when error rates in genome replication and also in translation should have been higher [87],  . Posterior probabilities of maximum-likelihood sequence residues, per position for the best (blue), second (orange) and third (grey) best residue, as reconstructed by phylobot.
favouring fewer and shorter genes, at the cost that each gene may have been required to catalyse multiple reactions in order for pathway formation to be possible. 3 The very existence of carboxylase/oxygenase discrimination in Rubisco has the character of an emerging but stalled sub-functionalization. Molecular oxygen was by all evidence a minor component of the environments in which Rubisco emerged, and the enzyme mechanism was selected on the only relevant substrate: CO 2 . The ability of cyanobacteria to perform oxygenic photosynthesis is thought to have converted the early reducing atmosphere into an oxidizing one, which dramatically changed the composition of life forms on Earth by simultaneously enabling new life forms tolerant to oxygen and leading to the near-extinction of the existing ones [90,91]. By enabling the massive proliferation of oxygenic photosynthesizers, Rubisco introduced the need for a substrate discrimination that had not existed when it arose, potentially creating conditions for its own failure. The reaction mechanism to which the whole enzyme structure is committed is one for which discrimination is costly and only partially successful, even under intense selection pressure. The result was displacement of selective pressure from Rubisco onto other enzymes such as carbonic anhydrase, and onto cellular ultrastructure in forming the carboxysome.
Owing to the correlation between isotope selectivity and substrate discrimination in Rubisco, a further signature of primitiveness is suggested, which can readily be empirically tested. All modern Rubiscos fall along a rather tight linear regression between turnover rate and CO 2 /O 2 discrimination, with a less tightly correlated isotope shift [50,51]. The apparently bright horizon, beyond which no Rubiscos are found, is the reason for the interpretation of an inherent trade-off in the mechanism that fixes CO 2 using only the free energy of hydrolysis, which forces turnover to be sacrificed as the price of discrimination. The absence of dispersion on the low-performance side of the regression has been interpreted as evidence of evolutionary optimization: that turnover is always maximized against the futile cycle of photorespiration in the CO 2 /O 2 environment of the enzyme. By studying turnover versus discrimination in ancestral Rubiscos, we can test whether they seem to reflect the same optimization horizon as modern Rubiscos. If not, one possibility is that the enzymes were more primitive; another is that naive sequence reconstruction methods miss essential information needed to identify the true ancestral form for this protein. Coupling an optimality analysis with functional measures of carbonic anhydrase will then allow us to compare the implied CO 2 /O 2 environment to ambient conditions in eras suggested by the phylogenies of Rubisco and other molecular clocks or ancient biosignatures.

Extrapolating biology back to earliest or pre-biological conditions
Comparative sequence reconstruction, even augmented by geochronology, can only reveal directly historical evidence within the era from which sequence divergences have been preserved to the present. The geochronological evidence may extend to earlier times-as is potentially the case for organic carbon signatures, although these become sparse near the time when life emerged on Earth about 4 billion years ago [92][93][94]-and it is plausible that complex cellular life also existed and evolved within these eras; but to understand them we will need interpretive methods beyond simple sequence comparison.
Phylogenetic reconstruction should move beyond sequence reconstruction and the functional deductions based on phenotypes that are heavily impacted by the sequence reconstruction quality. Indeed, early examples in the young field of palaeoenzymology attempted to make inferences about the temperature of ancient environments that no longer exist, by interpreting ancestral protein sequences from organisms long-since extinct [95][96][97][98]. Its predictions should come to integrate enzymatic structure and folding information as well as system-level effects on protein network interactions and physiology-hence the emphasis on 'palaeophenotype' rather than 'palaeosequence' or 'palaeostructure'.
Engineering ancient pathways whose behaviour could recapitulate certain past phenotypes and innovations has the potential to reconcile ambiguities in phylogenetic reconstruction. This may be realized by focusing on assessing those particular phenotypes that facilitate effective comparison to an independent historical record of component or organismal phenotype contained in the rock record. The same dynamical effects-error-prone replication, horizontal gene transfer [87,[99][100][101][102]-that tend to erase memory about genes and genomes in the deepest eras of life, also remove one of the aspects of biology that interferes most with modelling from first principles: the capacity for historically contingent features to contribute essential context for function. If we can use a combination of deep sequence reconstruction, functional and structural modelling and geological constraints on phenotype, we may be able to identify the 'rules of assembly' for very early living systems. These are the rules that, as we extend to epochs in which memory was less robust, should have governed strong evolutionary convergences, and, for the earliest molecular systems that were bound in the most detail to their geological environments, they may permit a limited amount of prediction from first principles about what those systems could have been. The reconstruction of carbonic anhydrase demonstrates the potential for integrative investigation. Though the carbonic anhydrase tree shows that horizontal gene transfer is a barrier for deep-time reconstructions of some enzymes using single-gene trees, carbonic anhydrase's intimate interactive relationship with other enzymes of the carbon uptake and carbon-concentrating mechanism apparatus leaves open the possibility of isolating its functional behaviour in conjunction with enzymes with more clearly resolved genetic histories such as Rubisco.

Conclusion
The ability of cyanobacteria to perform oxygenic photosynthesis is thought to have converted the early reducing atmosphere into an oxidizing one, which dramatically changed the composition of life forms on the Earth by simultaneously enabling new life forms tolerant to oxygen and leading to the near-extinction of species acclimated to anoxic conditions. Much of this information is derived from the geologic record-evidence of carbon cycling (and biological activity) can be inferred from carbon isotopes, which lie at the interfaces between enzymatic activity, organismal phenotype and the formation of sedimentary rocks [103]. As a corollary to these interpretive schemes, it is assumed that the controlled fixation of inorganic carbon to organic carbon is a precondition for the emergence of living systems. While selectivity in the carbon isotope composition of biologically produced organic matter is evidence that is preserved in the rock record that reflects the metabolic activity of ancient organisms, this record is increasingly problematic investigating life's early evolution and origins that complements the loss of a robust geologic record may prove insightful.
Characterizing palaeophenotypes in the laboratory is now feasible with evolutionary techniques already available and new techniques currently under development. However, the best chance of succeeding is in the integration of ancestral sequence reconstruction, genetic engineering and laboratory evolution. Current efforts attempt to establish the extent to which biochemical properties of palaeoenzymes may be correlated with biosignatures retrievable from the rock record, such as stable isotope ratios [56]. This behavioural information is critical to understanding ancient biological innovations and their effect upon (and shaping by) the Earth's environment. It is likely that the specific genotype/phenotype relationship in present-day organisms has evolved from organisms with very different metabolisms. If so, when and where did specific extant phenotypes evolve? How conserved are the genes of interest through time? How frequently were these genes transferred to other organisms? What can palaeophenotypes tell us about the origins of critical metabolic pathways? Reprogramming contemporary organisms by engineering their genomes with ancestral DNA is a fundamentally new methodology with which to extrapolate back to the origins of life. The technique is part of our broader approach that seeks to reconstruct macroevolutionary phenotypic trends across geologic time to ultimately infer conditions that approach the Last Universal Common Ancestor and possibly life's origins.

Methods (a) Genome selection
Representative genomes for different taxonomic groups were selected using the software PHYLOSKELETON [104] (available at https://bitbucket.org/lionelguy/phyloskeleton) as follows: one for each species of Streptococcus, one for each family in Proteobacteria and Cyanobacteria, and one for each order otherwise. After manual curation of poorly classified genomes, 388 genomes were retained (see electronic supplementary material, table S1).

(b) Carbonic anhydrase (CA) homologues identification
Genomes were searched with HMMer v. 3.1b2 [105], using the PFAM profile for the socalled prokaryotic carbonic anhydrase (http://pfam.xfam.org/family/PF00484), which contains representatives of the beta-carbonic anhydrase clades A, B and C (but not D), as defined in [78]. All homologues with an E-value < 1 × 10 −10 were retained. In the 388 selected genomes, between 0 and 6 homologues of CA per genome were found, with most genomes having 0 (102 genomes), 1 (176 genomes) or 2 (72 genomes) copies. In total, 457 sequences were retrieved. The sequence length varied from 118 to 867 amino acids, with eukaryotic homologues being the longest. Most sequences were between 200 and 250 amino acids.

(c) Sequence alignment
The 457 CA homologues found by PHYLOSKELETON and the 35 sequences that could be retrieved from Smeulders et al. [78] were aligned with MAFFT-LINSi v. 7.215 [106] and the resulting alignments were filtered to remove positions that had greater than 50% gaps, using TRIMAL [107]. The filtered alignment counted 200 positions and was visually inspected for any obvious misaligned regions.

(d) Phylogenetics
RAxML v. 8.2.8 [108] with the PROTCATLG model was used to infer the phylogenetic tree depicted in figure 2. A hundred parametric bootstraps were drawn to estimate branch reliability.

(e) Ancestral reconstruction
Ancestral reconstruction of the clade B carbonic anhydrase in cyanobacteria was done by first collecting all homologues of the carbonic anhydrase in cyanobacteria, choosing one representative genome per genus in cyanobacteria with PHYLOSKELETON, and adding genomes branching close to cyanobacteria, both clade A and clade B (Sorangium cellulosum, Sandaracinus amyloliticus and Beijerinckia indica) in figure 2, resulting in 83 genomes (see electronic supplementary material, table S2). Sequences belonging to clade B were then extracted and uploaded to PHYLOBOT [109]. Sequences were analysed with MUSCLE and MSAPROBS, and trees were drawn under the PROTCATLG and PROTGAMMALG models. The complete results are available at PHYLOBOT: http://phylobot.com/38899544/. The result of the MUSCLE alignment and the tree drawn under PROTCATLG were further analysed and visualized in JALVIEW v. 2.10.1 [110].
Authors' contributions. All authors analysed the data, contributed to writing the final manuscript and gave final approval for publication.