Autocatalytic chemical networks at the origin of metabolism

Modern cells embody metabolic networks containing thousands of elements and form autocatalytic sets of molecules that produce copies of themselves. How the first self-sustaining metabolic networks arose at life's origin is a major open question. Autocatalytic sets smaller than metabolic networks were proposed as transitory intermediates at the origin of life, but evidence for their role in prebiotic evolution is lacking. Here, we identify reflexively autocatalytic food-generated networks (RAFs)—self-sustaining networks that collectively catalyse all their reactions—embedded within microbial metabolism. RAFs in the metabolism of ancient anaerobic autotrophs that live from H2 and CO2 provided with small-molecule catalysts generate acetyl-CoA as well as amino acids and bases, the monomeric components of protein and RNA, but amino acids and bases without organic catalysts do not generate metabolic RAFs. This suggests that RAFs identify attributes of biochemical origins conserved in metabolic networks. RAFs are consistent with an autotrophic origin of metabolism and furthermore indicate that autocatalytic chemical networks preceded proteins and RNA in evolution. RAFs uncover intermediate stages in the emergence of metabolic networks, narrowing the gaps between early Earth chemistry and life.

were added to the network as additional parameters, and 66% of the final set of catalyzed reactions was assigned at least one small catalyst. The subsets for Met and Ace were obtained by crossing the genomic annotation of Moorella thermoacetica and Methanococcus maripaludis in KEGG with the previously built network, and with the addition of missing reactions that were present in corresponding manually-curated models [3,4]. The pipeline for the full procedure is shown in Supplementary Material, Fig. S1.

Detection of maxRAFs
All networks described above were tested for whether they contained maxRAFs with different food sets, which are described in the main text and available in Supplementary Material, Table S1. The fictional catalysts "Spontaneous" and "Pooling" were added in all tests, allowing for spontaneous reactions to always occur and synonymous cofactors to be equated. Pooling reactions that were part of the maxRAF were not accounted for in maxRAF sizes.

RAF sets
We define a chemical reaction system (CRS) as a tuple = { , , , }, where: } is a set of molecule types.
• ⊆ × is a set of catalysis assignments. A catalysis assignment is a pair ( , ) with ∈ and ∈ , denoting that molecule type can catalyse reaction .
• ⊂ is a food set (i.e., molecule types that can be assumed to be available from the environment).
Given a CRS , a subset ′ of , and a subset ′ of , we define the closure of ′ relative to ′, denoted @A ( ′), to be the (unique) minimal subset of that contains ′ and that satisfies the condition that, for each reaction = ( , ) in ′, Informally, @A ( ′) is ′ together with all molecules that can be constructed from ′ by the repeated application of reactions from ′. In other words, a subset of reactions ′ is a RAF set if, for each of its reactions, at least one catalyst and all the reactants are in the closure of the food set relative to ′ [5].

RAF algorithms
Given a CRS = { , , , }, an efficient (polynomial-time) algorithm exists for deciding whether contains a RAF set or not. It is presented formally in Algorithm 1.
In plain words, starting with the full set of reactions , the algorithm repeatedly calculates the closure of the food set relative to the current reaction set ′ and then removes all reactions from ′ that have none of their catalysts or not all of their reactants in this closure. This is repeated until no more reactions can be removed. If, upon termination of the algorithm, ′ is non-empty, then ′ is the unique maximal RAF set (maxRAF) contained in (i.e., a RAF that contains every other RAF in as a subset) [5]. If ′ is empty, then does not contain a RAF set.
Computing the closure of the food set relative to the current reaction set ′ is computationally the most expensive step in the RAF algorithm. It is presented formally in Algorithm 2. This closure computation algorithm, introduced in [5], is equivalent to the "network expansion" algorithm [6] used in [7].
A naive computational complexity analysis of the RAF algorithm gives a worst-case running time of (| || | N ). However, with some additional book-keeping (such as keeping track of all reactions that each molecule is involved in), this can be reduced. In fact, the average running time on a simple polymer-based model of CRSs turns out to be subquadratic [5].

LUCA enrichment analysis
The genetic families identified in [8] were mapped to KEGG orthologues, the corresponding EC numbers were retrieved and the reactions performed by these were listed and compared with the lists of reactions comprising the different networks, namely the global O2-independent prokaryotic network; the maxRAF obtained with this network; maxRAFs obtained with the Ace and Met subsets; and the intersection of these.

Statistical Analysis
For pathway and cofactor enrichment analysis (Fig. 3, 5A-B), a contingency table was built for each comparison between a smaller network and the global network. The p-value refers to the probability of having at least as many reactions as seen (in pathway X or catalyzed by cofactor X) in a smaller network if we were to select a random pick of reactions the same size of that smaller network from the global network. For this, one-tailed Fisher tests (with Benjamini-Hochberg multiple test corrections) were used, and significance was considered for corrected p-values smaller than 0.05. A similar one-tailed Fisher test was also used for calculation of enrichment in LUCA genes (Fig. 5C) and significance was considered for p-values smaller than 0.0001. All statistical analysis were performed in Python ver. 3.6.6 with the package scipy.stats. Network properties were calculated and visualizations were produced with Cytoscape [9] ver. 3.7.1. Fig. S1. Pipeline for reconstructing catalysis-annotated metabolic networks. Steps in grey include metabolic data only, steps in brown include catalysis rules, and steps in greens represent the inclusion of curated data from metabolic models of Moorella thermoacetica and Methanococcus maripaludis.      The network represents the maxRAF obtained with the full prokaryote O2-independent network with inorganic catalysts, abiotic compounds, all amino acids and bases but no organic cofactors added to the food set (only metabolic interconversions are depicted; catalysis arcs are omitted for clarity). Node size is scaled according to the degree, with food molecules highlighted in green. 'Acceptor' and 'Reduced Acceptor' are abstract redox molecules as represented in KEGG metabolism.

Legends for Supplementary Data
Dataset S1 (separate file). Metabolic networks annotated with catalysis rules. (A) Prokaryotic, O2-independent global metabolic network (B) subset network of Moorella thermoacetica (C) subset network of Methanococcus maripaludis.
Dataset S2 (separate file). Lists of reactions in all maxRAFs predicted for all networks in all food sets.
Dataset S3 (separate file). List and degree of metabolites in the primordial network shown in Fig.4 of the main text.
Dataset S4 (separate file). Input file with the global prokaryotic O2-independent network used to run the maxRAF algorithm. Food set with all small molecules, abiotic carbon and organic cofactors.