Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences

    Abstract

    All known life forms are based upon a hierarchy of interwoven feedback loops, operating over a cascade of space, time and energy scales. Among the most basic loops are those connecting DNA and proteins. For example, in genetic networks, DNA genes are expressed as proteins, which may bind near the same genes and thereby control their own expression. In this molecular type of self-reference, information is mapped from the DNA sequence to the protein and back to DNA. There is a variety of dynamic DNA–protein self-reference loops, and the purpose of this remark is to discuss certain geometrical and physical aspects related to the back and forth mapping between DNA and proteins. The mappings are examined as dimensional reductions and expansions between high- and low-dimensional manifolds in molecular spaces. The discussion raises basic questions regarding the nature of DNA and proteins as self-referring matter, which are examined in a simple toy model.

    1. Introduction

    To self-reproduce, living matter must refer to itself. This was lucidly depicted in von Neumann's treatment of self-reproducing automata [1], where he described a cellular automaton whose grid cells are identical finite-state machines. The cells may take several states or ‘phenotypes’, and their spatial organization forms a machine with means of storing, reading, transmitting and processing digital information. In each ‘generation’, the machine refers to a ‘tape’ of such cells, which stores a blueprint of actions to be performed by the machine in order to reproduce a complete copy of itself.

    In von Neumann's automaton, both roles of information storage and its logical processing are performed by the same kind of cells via their action as finite-state machines. In the biological realm, this is similar to the ability of certain RNA molecules to self-replicate [2]. This led some to suggest a primordial RNA world occupied by evolving ensembles of auto-catalytic and self-reproducing RNAs. However, all organisms that exist or are known to have existed share a dichotomy between the blueprint drawn in DNA molecules and the machine-like proteins that execute the blueprint. RNAs still carry out essential roles of catalysis (ribosome), information transfer (mRNA) and control (small RNAs), which mediate between the biochemical worlds of DNA and protein.

    The schism into two styles of biochemistry allows DNA and proteins to excel in their specialized tasks. DNAs are linear, easy-to-manipulate hetero-polymers, whose inertness secures the digital information written in the four-letter alphabet of the nucleic bases. Proteins, on the other hand, are folded chains of amino acids that evolved to perform catalysis, signal transduction and other highly specific functions. The possible configurations of proteins may be specified by the three-dimensional position of each amino acid, and their biochemical functions are quantified by parameters such as affinities and catalysis rates. These biochemical and configurational degrees of freedom reside in high-dimensional analogue (continuous) spaces.

    The dual worlds of protein and DNA are interconnected by numerous biochemical pathways, and this back and forth exchange generates feedback loops of various length and complexity. For example, a gene written in DNA is expressed, via transcription and translation, as a protein, which in turn may bind the DNA as a transcription factor, thereby controlling its own expression or that of other proteins in the cell (figure 1). Such autoregulation loops are prevalent in genetic networks. Negative control stabilizes expression and leads to homeostasis, whereas positive loops may act as switches. The biochemical and evolutionary aspects of DNA–protein loops of various architectures were extensively studied in systems biology (e.g. [3–7]). Here, we focus on certain physical and geometrical aspects of the mapping between DNA and proteins and their implications for living systems.

    Figure 1. A gene autoregulation loop crosses the DNA–protein boundary twice. A gene written in DNA is transcribed and translated to make a transcription factor protein (rectangle). In the case of autoregulation, the transcription factor binds the regulatory sequence adjacent to its own gene, thereby controlling its own expression. The dashed line denotes the boundary between the DNA and protein worlds. Crossing this boundary requires translation or mapping.

    2. The loop and the representation problem

    The loop described above (figure 1) maps back and forth between the dual molecular worlds of DNA and proteins. The elements or ‘atoms’ of these two worlds are the nucleic bases along the DNA, and the amino acids along the protein chain. On this elementary ‘atomic’ level, it is the genetic code that maps DNA triplets called codons into distinct amino acids. We previously discussed this map in detail [8,9], and here we mention briefly its prominent geometric features.

    The codon space can be described as a graph whose vertices are the $4^3 = 64$ codons (figure 2). Two vertices are connected by an edge if the two codons are likely to be confused by the translation machinery. Possible errors include misreading of the RNA codon in the ribosome, mischarging of the tRNA by the wrong amino acid and point mutations of the DNA. In the resulting structure, called a Hamming graph, each codon is connected by nine edges to those nine other codons that differ by just one letter. In this graph, the length of the shortest path between two codons equals the number of positions at which they differ, i.e. the Hamming distance.
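
    The counts quoted for the codon graph can be checked directly. The following is a minimal sketch, not part of the original analysis; the variable and function names are illustrative choices.

```python
from itertools import product

BASES = "ACGU"

def hamming(a, b):
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))

# All 4^3 = 64 codons are the vertices of the graph.
codons = ["".join(t) for t in product(BASES, repeat=3)]

# Edges connect codons that differ by exactly one letter.
neighbours = {c: [d for d in codons if hamming(c, d) == 1] for c in codons}

# Count codons at each Hamming distance from AAA (the example codon of figure 2).
shells = {d: sum(hamming("AAA", c) == d for c in codons) for d in range(4)}
print(len(codons), len(neighbours["AAA"]), shells)
# prints: 64 9 {0: 1, 1: 9, 2: 27, 3: 27}
```

    The shell of 27 codons at distance two corresponds to the next-nearest neighbours mentioned in the caption of figure 2.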

    Figure 2. The map of the genetic code. Left: a 19-codon sub-graph of the 64-codon graph showing the codon c=AAA with its nine nearest neighbours, and nine of its 27 next-nearest neighbours. The whole 64-codon graph is highly interconnected, and cannot be drawn on a plane without many edge intersections. Right: the genetic code maps each codon c to an amino acid α(c), denoted by a three letter code, or a stop signal. The amino acids reside in a space whose axes correspond to relevant chemical properties such as size, polarity, hydrophobicity, etc. The map deforms the graph. Neighbouring synonymous codons, which are mapped to the same amino acid, reside at the same position and the corresponding connecting edge shrinks, whereas amino acids of different chemical properties are far from each other with long connecting edges.

    The genetic code maps each codon c into an amino acid α(c), which resides in the space of chemical properties. The map α(c) is known to be rather smooth, in the sense that neighbouring codons differing by one base tend to be mapped to the same or chemically similar amino acids. Smoothness increases the tolerance of the genetic code to errors in this information channel, for example due to misreading of a codon at the ribosome. The map α(c) embeds the codon graph in the Euclidean space of chemical properties.
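
    One crude way to probe this smoothness is to count how often a single-letter change leaves the amino acid unchanged. The sketch below uses the standard genetic code table (DNA alphabet, bases ordered T, C, A, G). It measures synonymy only, which is a much coarser criterion than the chemical similarity discussed in the text, and the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"  # conventional ordering of genetic code tables
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(t) for t in product(BASES, repeat=3)]
code = dict(zip(codons, AMINO))

def one_letter_neighbours(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for b in BASES:
            if b != codon[pos]:
                yield codon[:pos] + b + codon[pos + 1:]

# Consider only pairs of neighbouring sense (non-stop) codons.
sense = [c for c in codons if code[c] != "*"]
pairs = [(c, d) for c in sense for d in one_letter_neighbours(c) if code[d] != "*"]
synonymous = sum(code[c] == code[d] for c, d in pairs)
fraction = synonymous / len(pairs)

print(len(sense), "sense codons,", len(set(code.values()) - {"*"}), "amino acids")
print("fraction of synonymous single-letter neighbours:", round(fraction, 3))
```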

    The genetic code is one of the most basic examples of representation or mapping in biology. The digital DNA triplets represent the information about the corresponding amino acids in the space of chemical properties. Whenever information is translated or mapped from one type of chemical or physiological language to another one, it is being represented in the new language. For example, the retina represents visual information as neural spikes, and signalling molecules represent information regarding the presence of neighbouring bacteria in a colony.

    On the ‘atomic’ level of the genetic code, several scenarios and models were suggested to explain the origin of the code and its salient features (e.g. [10,11] and many others). While many questions remain open, the basic geometry is clear: the genetic code map is an embedding or a representation of the Hamming graph into a high-dimensional metric space of amino acid chemistry. The smoothness of the code sets an upper bound on the dimension of the chemical property space and thereby on the number of amino acids the code can embed without overly compromising its reliability [8,9].

    3. The representation of proteins

    Let us now illustrate in slightly more detail the autoregulation loop (figure 3). On the level of a whole protein, the map from DNA to amino acids is much more elaborate, and much less is known about the fundamental properties of this representation. The representation includes a linear stage in which the gene, a sequence of codons c=(c1, c2…cn), is translated at the ribosome into a chain of corresponding amino acids, α(c)=(α(c1), α(c2)…α(cn)), termed a polypeptide. The chain then folds into a three-dimensional configuration of the protein pc, which is formally denoted as pc=s○α(c); the letter s hints at ‘structural’ degrees of freedom, and ○ denotes the composition of functions, s○α(c)=s(α(c)). This stage is highly nonlinear, since the configuration pc is determined by interactions of the amino acids among themselves and with the surrounding medium, such as hydration.

    Figure 3. Following the maps along the autoregulation loop. (i) A gene c is a sequence of codons c=(c1, c2…cn), which is translated at the ribosome into a chain of corresponding amino acids α(c)=(α(c1), α(c2)…α(cn)), according to the genetic code. The genetic code map α(c) is linear. (ii) Next, the amino acid hetero-polymer is folded into a protein pc. This is formally denoted by the highly nonlinear map s. (iii) The function of the protein pc as a transcription factor is facilitated by its binding site (dark amino acids). The structure and the chemical properties of the binding site bc depend on the protein configuration s. This map is formally denoted by f. (iv) The binding of the protein to the DNA is described by the map β, which depends on the binding site bc and the regulatory sequence cr. (v) The binding or unbinding of the transcription factor regulates the expression of proteins, as described by the regulation map r. This stage closes the autoregulation loop. The maps by the genetic code α and by the binding β are the ones that cross back and forth from DNA to proteins.

    In a simplified view, pc can be considered as a vector of the positions of each amino acid (or each atom on a higher resolution) in the protein's native state. Real proteins, however, fluctuate among an ensemble of possible configurations, and intrinsically disordered proteins lack even an average three-dimensional structure [12,13]. On top of that, it is not clear whether one can, even in principle, deduce a configuration, pc=s○α(c), from the sequence c alone (Anfinsen's dogma [14]). Nevertheless, on the simplified abstract level considered here, pc can be taken as a vector whose entries correspond to the structural degrees of freedom which characterize the protein molecule. In the traditional structural view of proteins, those may be the amino acid positions, while on a more stochastic view one may consider the probability distribution of structures. A convenient representation may be chosen according to the biophysical context. For example, one may expand the deviations from an equilibrium structure as the modes of an elastic network [15].

    Describing the detailed structure of a protein requires numerous degrees of freedom, whose number increases with the desired resolution. A dynamical description would require even more degrees of freedom to specify the force fields in the protein, for example the spring constants of an elastic network model. However, it seems that not all these parameters are necessary to understand the function of the protein. In our simple example of an autoregulation loop, the functionality of the protein is characterized by its binding site (in figure 3, the cluster of dark ‘amino acids’).

    The function of the transcription factor in binding the DNA is governed by the interaction of the DNA with this rather small cluster of amino acids bc. This does not at all imply that the rest of the protein does not matter. On the contrary, the structure and the stability of the binding site depend on its interactions with the ‘rest’ of the amino acids. Certain binding sites physically interact with distant amino acids, thereby affecting their functional properties. This mechanism, known as allostery, is central to protein function [16]. The map f formally denotes the process in which certain functional degrees of freedom bc emerge from the physical interactions of a given protein configuration s, bc=f○s○α(c). As already mentioned, in general a small set of degrees of freedom is governed by the whole protein. In this sense, the map f is similar to the phenomenon of renormalization in physical systems. In broad terms, renormalization may be thought of as a recursive coarse-graining process in which one obtains a small number of effective interaction parameters that account for the original, microscopic system parameters [17].

    It is noteworthy that while in physical systems it is clear how to obtain the renormalized degrees of freedom, in the present context of proteins the concept of renormalization is at this point more of a metaphor than a realistic model. The essence of the map f is the drastic dimensional reduction in the space of possible configurations. In the following, we present a toy model that aims to capture this essence. While the physical mechanisms which result in dimensional reduction are yet to be elucidated, proteins exhibit properties which are consistent with this concept. The biochemical function of proteins is effectively described by a small number of parameters; for example, many enzymes are characterized by their binding affinity and catalysis rate (the Michaelis–Menten model) [18]. The dynamics that occur during binding and catalysis appear to involve mostly a few low-frequency, large-scale modes [19,20]. Low modes in the correlations of amino acid substitution during evolution were found to be related to the function and stability of several protein families [21]. It is not clear, however, whether these two types of low modes, evolutionary and dynamic, are related.

    4. Closing the loop

    The interaction of the regulatory sequence cr with the binding site bc is formally denoted by the function β. A simple model for β may specify the probability that cr is bound to a transcription factor pc, while more detailed models also add the binding and unbinding probabilities (a two-state Markov model), up to a full molecular dynamics simulation. In general, the process of stochastic binding involves diffusion and docking, which depends on the interaction of the binding surfaces, and the induced conformational changes [22,23]. Yet, in many cases, β can be described as the sigmoidal binding probability of a two-state model, which depends on the binding energy EB and the chemical potential μ [24]. While the shape of β is nonlinear, the binding energy itself can be reasonably approximated as a linear sum of the binding energies of each base pair in the regulatory sequence. In this respect, the binding function β is linear in the DNA sequence cr just as the translation α is linear in the gene sequence c. Usually, there is a non-negligible probability that the transcription factor binds to sequences close enough to the ‘consensus’ sequence cr.
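
    To make the two-state picture concrete, the sketch below combines an additive binding energy with a sigmoidal occupancy. Several choices here are assumptions and are not taken from the text: energies are measured in units of the thermal energy, lower energy is taken to mean stronger binding, each mismatch with the consensus adds the same penalty, and the consensus sequence and numerical values are arbitrary examples.

```python
import math

def binding_energy(site, consensus, mismatch_penalty=2.0):
    """Additive energy: every mismatched position adds a fixed penalty (assumed value)."""
    if len(site) != len(consensus):
        raise ValueError("sequences must have equal length")
    return sum(mismatch_penalty for a, b in zip(site, consensus) if a != b)

def binding_probability(energy, mu):
    """Two-state occupancy; assumed convention: lower energy binds more strongly."""
    return 1.0 / (1.0 + math.exp(energy - mu))

consensus = "TTGACA"  # hypothetical consensus sequence, for illustration only
for site in ("TTGACA", "TTGACT", "ATGTCT"):
    e = binding_energy(site, consensus)
    print(site, e, round(binding_probability(e, mu=3.0), 3))
```

    With these conventions the occupancy falls off smoothly as mismatches accumulate, so sequences near the consensus retain a non-negligible binding probability.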

    The autoregulation loop is closed by the regulatory effect of the transcription factor pc on the expression of its own gene c. The regulation is described by the function r, which often depends on the local concentration of the transcription factor. The functional dependence exhibits many forms, positive, negative, linear and nonlinear, all according to the biological function of the loop, such as stabilization or switching. Formally, the dynamics of the loop is a composition of the functions described above:

    [Display formula (4.1)]
    The above equation describes deterministic dynamics for the concentration of the transcription factor [pc], but it could as well be a stochastic equation for the protein number. A similar formal description also applies to longer loops, such as a switch made of two proteins controlling the expression of each other (figure 4):
    [Display formula (4.2)]
    When the two regulatory control functions, r1 and r2, are negative, the above equation describes a switch.
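
    The switching behaviour of two mutually repressing genes can be illustrated numerically. The sketch below is not the equation of the text: it assumes a Hill-type repression function, linear degradation with unit rate, identical parameters for both genes and simple Euler integration, all chosen only for illustration.

```python
def hill_repression(x, a=4.0, h=2.0):
    """Assumed negative regulation: expression rate decreases with repressor level x."""
    return a / (1.0 + x ** h)

def simulate(p1, p2, dt=0.01, steps=5000):
    """Euler integration of two proteins repressing each other's expression."""
    for _ in range(steps):
        d1 = hill_repression(p2) - p1   # production of protein 1 minus degradation
        d2 = hill_repression(p1) - p2   # production of protein 2 minus degradation
        p1, p2 = p1 + dt * d1, p2 + dt * d2
    return p1, p2

# Two nearby initial conditions end in opposite states of the switch.
print(simulate(2.0, 1.0))
print(simulate(1.0, 2.0))
```

    Whichever protein starts ahead suppresses the other and stays high, which is the sense in which the negative pair acts as a switch.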

    Figure 4. A two-gene switch. Two genes, c1 and c2, express proteins which bind to each other's regulatory sequence. With negative regulation functions, r1 and r2, the resulting loop acts as a switch (equation (4.2)).

    To conclude, we followed schematically the flow of information from the DNA to the protein and back in the autoregulation loop, where a gene controls its own expression. The pathway composes five maps or functions, α, s, f, β and r, which operate on the DNA sequences, the gene c and the regulatory binding site cr. One may mix the maps of several genes to create larger loopy networks, such as that of the above equation. In the following, we discuss the dimensionality along this series of maps and construct a toy model with similar characteristics.

    5. Dimensional reduction and expansion: hats and bow ties

    To retrace the mapping along the loop, let us start again from the gene c (figure 5). A gene c of length n codons resides in a discrete space of approximately $64^n$ possible configurations. The actual number is somewhat smaller, around $61^n$, as there are three stop codons that can appear only as a punctuation signal at the end of the gene. The dimension of the genetic space is proportional to n, and its structure can be described as the Cartesian graph product of 3n tetrahedra, each corresponding to one base pair [8,9].

    Figure 5. Tracing the dimensionality along the DNA–protein loop (see text). The height represents the effective dimension which expands and shrinks in the protein section of the loop (arrows). The map α slightly reduces the dimension due to the redundancy of the genetic code, as manifested in a small dip at α(c). The map s into the space of structures and dynamical trajectories significantly expands the effective dimension. The following map f filters the functionally relevant information, the structure and biophysical properties of the binding site, thereby drastically reducing the effective dimension. The binding interaction is approximately linear, with an effective dimension that scales like the length of the regulatory sequence.

    The genetic code map α slightly reduces this dimensionality due to the redundancy of the code: there are only 20 amino acids which are encoded by 61 codons. Hence, most amino acids are encoded by two or more codons, all synonyms for the same amino acids. The resulting number of possible polypeptides is about $20^n$. However, even synonymous codons yield different mRNAs, which interact differently with the tRNA reservoir and non-coding RNAs in the cell [25]. Therefore, a map which considers also RNA interactions would be of a higher dimension. Unlike the inherently digital space of DNA sequences, the amino acids reside in a space of chemical characteristics, such as polarity, size and hydrophobicity (figure 2), and the number of relevant characteristics determines the effective dimension. At any rate, both spaces of DNA genes c and amino acid sequences α(c) are linear (in other words, they are product spaces) and their dimension is proportional to n.
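
    A simple way to see the linear scaling is to take the logarithm of the number of sequences, which grows in proportion to n for each alphabet. The sketch below does this for the three counts quoted above; the gene length is a hypothetical example and the function name is illustrative.

```python
import math

def log2_count(alphabet_size, n):
    """log2 of the number of length-n sequences, i.e. the capacity in bits."""
    return n * math.log2(alphabet_size)

n = 300  # hypothetical gene length in codons
for label, size in (("all codons", 64), ("sense codons", 61), ("amino acids", 20)):
    print(label, round(log2_count(size, n), 1), "bits")
```

    The reduction from codons to amino acids changes the prefactor but not the proportionality to n.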

    In principle, one may specify the three-dimensional folding of a protein by listing the coordinates of each amino acid. Moreover, since the amino acids are concatenated into a polypeptide chain, it suffices to list the two torsion angles along the backbone of the polypeptide (Ramachandran plot [26]). However, this would only provide a static structure averaged over an ensemble of protein configurations and disordered proteins lack even this static average [12,13]. Proteins are dynamic objects which are strongly coupled to their biochemical surrounding. Therefore, understanding protein function requires much more information regarding the force fields, interaction with the solvent and the potential conformational changes. Even a coarse-grained description in terms of an elastic network, which is valid only for small deformations, requires knowledge of the effective spring constants and the connectivity of the network. The output from the elastic network is a list of n vibrational modes, each of which describes the deviations of the n amino acids from their equilibrium position, and the effective dimension therefore scales like $n^2$.

    Characterization of large conformational changes, which are essential to protein function, requires tracing the trajectories in protein configuration space. Such knowledge is far beyond the standard crystal structure and is much harder to obtain [27,28]. This difficulty is reflected in the struggle to advance molecular dynamics modelling of proteins beyond the microsecond regime [29–31]. Even an ad hoc, low-resolution description in terms of the amino acid two-body correlation functions has a dimension which scales like $n^2$, similar to the scaling of the elastic modes. More elaborate dynamical models, which include trajectories, would require a much higher dimension to represent the rich dynamics.

    Similar dimensional expansion is observed in the dynamics of spatial networks of Boolean functions or neural networks. The static description in terms of an interacting spatial network is linear in the number of points if the range of interaction is limited and thereby the connectivity of each point is bounded. However, their dynamics in state space is much more complicated and requires a super-linear number of parameters. We conclude that the corresponding map s considerably expands the dimension of the space of possible states (figure 5).

    The drastic dimensional expansion by the map s is followed by a dimensional reduction of similar magnitude via the map f. In the present example, the function of the protein as a transcription factor is governed by its rather small binding site (figure 3), which can be characterized by a relatively small set of parameters, such as the binding affinity, catalysis rate and the elastic response. In principle, however, this small set of numbers depends on distant amino acids, as evident from mutation studies. Further evidence of long-range correlations is the phenomenon of allostery, in which binding of a ligand to one site is modified by binding of another ligand to a distant site (which perhaps may be thought of as a ‘viscoelastic transistor’). As mentioned above, the map f involves a coarse-graining or renormalization-like process, in which the physical properties of the whole protein are integrated into a small number of effective parameters. There is certain evidence that these effective coordinates correspond to the slow, large-amplitude modes of the protein motion [19,32,33].

    In the next stage, the binding site of the transcription factor bc interacts with the regulatory sequence. Often this interaction is reasonably approximated by a linear energy, in which the interaction with each base pair is independent. At this level, the effective dimension after the binding map β is proportional to the length of the sequence cr. The specific binding is a recognition process, which provides information regarding the base pairs along the regulatory sequence. Following the linear energetics, the information content is also a sum of independent contributions from each base pair [34].

    The last stage involves the genetic regulatory function r, which is typically described as a simple relation between the transcription factor concentration and the expression rate of the corresponding gene [3,4]. The shape of this function is characterized by a few parameters, such as saturation level and nonlinearity (Hill's coefficient). At any rate, this requires only a handful of numbers and the corresponding dimensionality is therefore low.
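
    As an illustration of how few numbers such a regulatory function needs, the sketch below uses the common Hill form. The specific functional form and the parameter names (vmax for the saturation level, K for the half-saturation concentration, h for the Hill coefficient) are assumptions made for this example.

```python
def hill(x, vmax=1.0, K=1.0, h=2.0, repress=False):
    """Expression rate as a function of transcription factor concentration x."""
    if x < 0:
        raise ValueError("concentration must be non-negative")
    a = (x / K) ** h
    if repress:
        return vmax / (1.0 + a)       # negative regulation: rate falls with x
    return vmax * a / (1.0 + a)       # positive regulation: rate saturates at vmax

for x in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(x, round(hill(x), 3), round(hill(x, repress=True), 3))
```

    Three parameters and a sign fully specify the curve, in line with the low dimensionality attributed to the map r.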

    Dimensional compression and expansion are common to many biological systems. It has been recently suggested that a ‘bow tie’ structure, in which low-dimensional intermediate stages lie between inflated input and output spaces, emerges when an evolutionary task can be compressed [35]. The back and forth trajectory between DNA and proteins (figure 6) may appear more like a ‘hat’ (or a boa constrictor digesting an elephant) than a bow tie. But this distinction is quite arbitrary in this case, since in principle one could start the loop from the protein to make it appear like a bow tie, although starting at the gene seems more natural. Anyway, the basic feature is a chain of inflationary bubbles of high dimension separated by deflationary low-dimensional bottlenecks. Compression and expansion are typical in other types of information channels and artificial learning systems.

    Figure 6. Hats and bow ties. Depending on where one starts to traverse the autoregulation loop, it may look like a ‘hat’ (top) or a ‘bow tie’. However, such a distinction is arbitrary, just like the choice of starting point.

    6. A toy model

    At this point, we can recapitulate the basic geometric aspects of mapping between genes and proteins:

    • (a) The genetic information for synthesizing a protein is encoded in a linear sequence of n codons, which are practically independent. The resulting configuration space of all possible sequences is therefore a product space of dimension n.

    • (b) The genetic code preserves this linear scaling of the dimension, with a slight reduction due to the redundancy of the code (only 20 amino acids are encoded by 61 non-stop codons).

    • (c) The dynamics of a protein is governed by the three-dimensional spatial configuration of its amino acids. The short-range interactions among the amino acids yield long-range correlations.

    • (d) The space of all possible dynamic trajectories scales super-linearly with n: specifying all infinitesimal perturbations or two-body correlations requires $n^2$ numbers, and full trajectories would require even higher dimension.

    • (e) The functionality of a protein is determined by a few effective parameters and a few slow dynamical modes.

    • (f) The interaction of protein and DNA is roughly linear in the size of the regulatory site.

    In the following, we describe a toy model that mimics these basic features of the DNA–protein system. The toy model is intentionally constructed to bear merely the abstract similarity via features (a–f), while one should abandon the illusion that it simulates the physics of real proteins (figure 7). As a protein-like structure, we consider a network of n vertices (the ‘amino acids’) ordered in a square lattice of side $n^{1/2}$ with periodic boundary conditions on the left and right sides (each layer is a ring in a cylinder). Each vertex may be connected by directed edges to any of its nearest eight neighbours, and this connectivity determines the spatial architecture of the network. It can be written in terms of an adjacency matrix A, where Aij=1 if an edge points from vertex j to vertex i and Aij=0 otherwise.

    Figure 7. A toy model. The ‘protein’ is made of n vertices (‘amino acids’) ordered as a two-dimensional square lattice with periodic boundaries (cylindrical geometry). Left: each vertex may be connected by directed edges to any of its nearest eight neighbours. Input arrows can come only from five neighbours, the three below and the two on the sides. Similarly, output arrows can point to the three neighbours upwards and to the two on the sides. The network can be represented as an adjacency matrix A, in which Aij=1 if an edge points from vertex j to vertex i and Aij=0 otherwise. Right: each vertex i can be either ‘on’ (Si=1, denoted by a large black circle) or ‘off’ (Si=0). The states of the vertices are updated according to a ‘quorum’ rule: if a quorum of at least k vertices that point to i are ‘on’, then i is also ‘on’ (equation (6.1)). The steady state is shown for k=2. Due to the directed connectivity, the firing front propagates upward from the low ‘input’ row to the top ‘output’ row (large grey circles).

    Each vertex i can be in one of two states Si, ‘on’ (Si=1) and ‘off’ (Si=0). The dynamics is defined by the following update rule: count the number of ‘on’ vertices that point to vertex i; if there are k or more ‘on’ vertices, then i turns on. In terms of the adjacency matrix, this rule takes the form:

    [Display formula (6.1)]
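    One way to write the verbal rule above in terms of the adjacency matrix is given below. It is a reconstruction under assumptions rather than the published form of equation (6.1): Θ is taken to be a step function, and the maximum expresses that a vertex which is already ‘on’ remains ‘on’.

```latex
S_i(t+1) = \max\!\Big\{ S_i(t),\; \Theta\Big( \sum_j A_{ij}\, S_j(t) - k \Big) \Big\},
\qquad
\Theta(x) = \begin{cases} 1, & x \ge 0, \\ 0, & x < 0. \end{cases}
```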
    At k=1, the dynamics is that of standard percolation, and the requirement for k>1 active inputs leads to collective dynamics called bootstrap or quorum percolation [36], which was recently applied to living neural networks [37,38]. The edges can point only in the five upward directions (figure 7), and information therefore propagates from bottom to top. Hence, the state of the bottom layer defines the ‘input’ and the top layer is the ‘output’ of the protein. The transmission of information through the ‘protein’ is akin to an allosteric effect which couples distant sites in real proteins.

    At t=0, vertices in the bottom layer start to fire according to the input pattern. The bootstrap dynamics is monotonic: a vertex that turned ‘on’ will stay ‘on’, and the state configuration in the ‘protein’ Si will reach a steady state. The configuration of the top layer at steady state is the ‘output’ of the protein, and the input–output relation defines the function of the network. In the following, we construct a ‘DNA–protein’ system that exhibits the generic features (a–f) of real autoregulation loops.
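
    The dynamics described above can be sketched in a few lines. The data layout is an illustrative choice, not taken from the text: the lattice has m rows and m columns (so that n is m squared), row 0 is the bottom ‘input’ layer, and each vertex stores five booleans marking which of its five potential input neighbours point to it. Synchronous updating is also an assumption.

```python
import random

# Offsets (d_row, d_col) of the five potential input neighbours: three below, two sideways.
INPUT_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1)]

def random_edges(m, p, rng):
    """edges[r][c][q] is True if the q-th input neighbour points to vertex (r, c)."""
    return [[[rng.random() < p for _ in INPUT_OFFSETS] for _ in range(m)] for _ in range(m)]

def run(edges, input_row, k=2):
    """Iterate the quorum rule to steady state; return all rows of 0/1 states."""
    m = len(edges)
    state = [[0] * m for _ in range(m)]
    state[0] = list(input_row)            # bottom layer fires according to the input
    changed = True
    while changed:
        changed = False
        new = [row[:] for row in state]
        for r in range(m):
            for c in range(m):
                if state[r][c]:
                    continue              # monotonic: an 'on' vertex stays 'on'
                active = 0
                for q, (dr, dc) in enumerate(INPUT_OFFSETS):
                    rr = r + dr
                    if rr < 0 or not edges[r][c][q]:
                        continue          # no row below the bottom layer, or no edge
                    active += state[rr][(c + dc) % m]   # columns are periodic
                if active >= k:
                    new[r][c] = 1
                    changed = True
        state = new
    return state

rng = random.Random(0)
final = run(random_edges(6, 0.7, rng), [1, 1, 0, 1, 1, 0], k=2)
print("output layer:", final[-1])
```

    Because vertices never switch off, the loop terminates after at most n updates that change the state.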

    The DNA-like structure is a linear string of 0's and 1's, composed of a ‘gene’ c and a regulatory sequence cr (figure 8). The ‘gene’ encodes (i) the ‘input’ pattern of the ‘protein’ at its bottom layer, i.e. which of the bottom vertices is ‘on’, and (ii) the connectivity of the ‘protein’ network. Since the inward edges are restricted to the five forward and sideways directions (figure 7), one can encode the whole network in a linear sequence of 0's and 1's, where each vertex i has a quintet of Aij corresponding to the five potential inward edges. The overall length of the ‘gene’ c is therefore $5n + n^{1/2}$.
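
    The encoding can be made explicit with a pair of helper functions. The ordering chosen here (input bits first, then the quintets row by row) is an assumption for illustration; the text fixes only the total length.

```python
def encode_gene(input_row, edges):
    """Concatenate the input bits with five edge bits per vertex, row by row."""
    bits = [int(b) for b in input_row]
    for row in edges:
        for quintet in row:
            bits.extend(int(e) for e in quintet)
    return bits

def decode_gene(bits, m):
    """Inverse of encode_gene for an m x m lattice (n = m * m vertices)."""
    if len(bits) != m + 5 * m * m:
        raise ValueError("gene length must equal 5n + sqrt(n)")
    input_row = bits[:m]
    body = bits[m:]
    edges = [[body[5 * (r * m + c): 5 * (r * m + c) + 5] for c in range(m)]
             for r in range(m)]
    return input_row, edges

m = 3
example_edges = [[[(r + c + q) % 2 for q in range(5)] for c in range(m)] for r in range(m)]
gene = encode_gene([1, 0, 1], example_edges)
print(len(gene))  # 5 * 9 + 3 = 48
```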

    Figure 8. A toy autoregulation loop. The ‘gene’ c encodes in 0's and 1's (light and dark squares) the ‘input’ (firing pattern of the bottom layer, black circles), and the edge configuration (grey lines in the ‘protein’). For each vertex i, the gene encodes the five potential inward edges Aij. The map from ‘gene’ to the spatial organization of the ‘protein’ combines the maps s and α. The map f represents the firing dynamics, which determines the binding site bc (‘output’) as a function of the structure and the input for k=2. The regulatory sequence cr is the upper part of the ‘DNA’. The binding site bc interacts with cr according to the binding map β. Finally, the regulatory map r controls the expression of the gene as a function of the bc–cr interaction, thereby closing the autoregulation loop.

    In this example, the map from the linear ‘gene’ c to the architecture of the two-dimensional ‘protein’ pc combines the maps s and α. One may think of α as a non-degenerate genetic code, which is simply a one-to-one binary representation of the edges and the input. The function of the protein is defined by the pattern of its binding site bc (‘output’), which is the outcome of the collective dynamics f described by equation (6.1). The map f reduces the effective dimension from that of ‘protein’ dynamics to $n^{1/2}$, the number of vertices in the ‘output’ layer. The binding site bc is indeed a small $n^{-1/2}$ fraction of the vertices, but it is the outcome of the dynamics inside the whole ‘protein’. In this sense, the binding site's degrees of freedom ‘renormalize’ the degrees of freedom in the rest of the protein.

    The binding map β defines the interaction of the binding site b_c with the regulatory sequence c_r. A simple expression for the steady-state binding probability is the sigmoidal curve:

    $$\beta(b_c, c_r) = \frac{1}{1 + \exp[-(E_B - \mu)]} \qquad (6.2)$$
    with a linear binding energy, E_B, which counts the number of matches between b_c and c_r, and a chemical potential μ. The regulatory map r uses the information from the β interaction to modify the expression of the ‘gene’. This closes the autoregulation loop.
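
    As a minimal numerical sketch of the binding map, the following Python code computes a sigmoidal binding probability from a match count. Several choices here are assumptions rather than statements from the text: a ‘match’ is taken to mean equal bits at the same position, energies are measured in units where the thermal energy is 1, and the sign convention is chosen so that more matches increase the binding probability. The function names are hypothetical.

```python
import math


def binding_energy(b_c, c_r):
    """Number of positions at which the binding site b_c and the
    regulatory sequence c_r carry the same bit (assumed match rule)."""
    if len(b_c) != len(c_r):
        raise ValueError("b_c and c_r must have the same length")
    return sum(1 for b, r in zip(b_c, c_r) if b == r)


def binding_probability(b_c, c_r, mu):
    """Sigmoidal steady-state binding probability.

    Assumed convention: the probability rises from 0 to 1 as the
    binding energy E_B crosses the chemical potential mu.
    """
    e_b = binding_energy(b_c, c_r)
    return 1.0 / (1.0 + math.exp(-(e_b - mu)))
```

    With b_c = [1, 0, 1, 1] and c_r = [1, 1, 1, 0], two positions match, so the probability is exactly 1/2 when μ = 2 and grows monotonically as further positions are brought into agreement.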

    The performance of the loop can be tweaked by varying its DNA representation: the input pattern and the network connectivity A_ij, encoded in the ‘gene’ c, which together determine the output at the binding site b_c, or the regulatory sequence c_r, which affects the binding and thereby the expression of the gene through the regulatory map r. In principle, the maps themselves can also be tweaked by altering, for example, the genetic code α, and the physical interaction of folding s, or binding β. However, such changes are expected to be much slower due to their global impact on many DNA–protein pathways. We demonstrate such ‘evolutionary’ dynamics, where the network A_ij adapts towards an optimal input–output relation via random breaking and forming of edges in the ‘protein’ (figure 9). A more detailed and quantitative approach will be reported elsewhere.

    Figure 9. Evolution of a toy protein. Simulation of evolutionary adaptation of the ‘protein’ in the autoregulation loop. Left: the performance of the protein in the loop is governed by the interaction of its binding site b_c with the regulatory sequence c_r. Optimal performance is achieved when the b_c–c_r matching is maximal. Right: by mutating the network A_ij, one edge at a time, the ‘protein’ evolves from an initial random configuration towards a high-performance configuration, via intermediate configurations (shown are two configurations out of a few thousand). In the simulation, edges are mutated randomly, keeping mutations that increase the target function, the binding energy E_B. The final configuration realizes the optimal input–output relation.
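
    The adaptation described above can be illustrated with a short hill-climbing sketch in Python. It is not the simulation used for figure 9, and it rests on several assumptions that are not taken from the text: the firing rule (a vertex is ‘on’ when at least k of its connected neighbours were ‘on’ in the previous synchronous step), the geometry of the five inward edges (three from the layer below, two sideways), the fixed number of update steps, and the acceptance of neutral mutations in addition to improving ones. All names are hypothetical.

```python
import random

# Assumed (layer, column) offsets of the five potential inward edges.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1)]


def fire(edges, input_pattern, k=2, steps=None):
    """Top-layer pattern after synchronous threshold updates.

    edges[r][c] holds five 0/1 flags, one per entry of OFFSETS.
    The bottom layer (r = 0) is clamped to the input pattern.
    """
    side = len(input_pattern)
    steps = 2 * side if steps is None else steps
    state = [[0] * side for _ in range(side)]
    state[0] = list(input_pattern)
    for _ in range(steps):
        new = [row[:] for row in state]
        for r in range(1, side):
            for c in range(side):
                total = 0
                for flag, (dr, dc) in zip(edges[r][c], OFFSETS):
                    cc = c + dc
                    if flag and 0 <= cc < side:
                        total += state[r + dr][cc]
                new[r][c] = 1 if total >= k else 0
        state = new
    return state[side - 1]


def matches(b_c, c_r):
    """Binding energy: number of positions where b_c and c_r agree."""
    return sum(1 for b, t in zip(b_c, c_r) if b == t)


def evolve(input_pattern, c_r, k=2, n_mutations=2000, seed=0):
    """Flip one random edge at a time; keep flips that do not lower E_B."""
    rng = random.Random(seed)
    side = len(input_pattern)
    edges = [[[rng.randint(0, 1) for _ in OFFSETS] for _ in range(side)]
             for _ in range(side)]
    energy = matches(fire(edges, input_pattern, k), c_r)
    history = [energy]
    for _ in range(n_mutations):
        r = rng.randrange(1, side)  # bottom-layer edges are unused here
        c, e = rng.randrange(side), rng.randrange(len(OFFSETS))
        edges[r][c][e] ^= 1  # break or form one edge
        trial = matches(fire(edges, input_pattern, k), c_r)
        if trial >= energy:
            energy = trial
        else:
            edges[r][c][e] ^= 1  # revert a deleterious mutation
        history.append(energy)
    return edges, history
```

    Calling evolve([1, 1, 0, 1, 1], [1, 0, 1, 0, 1]) returns the final edge configuration together with the history of E_B, which is non-decreasing by construction. Because sideways edges make the dynamics recurrent, the output is read after a fixed number of steps rather than at a guaranteed fixed point.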

    7. Discussion: what is a protein?

    The essence of the protein is in the nonlinear mapping from the digital information of the gene to the analogue realm of the protein: the spatial configuration of the amino acids, their collective interaction, and the resulting biochemical function, binding to the DNA. Within the abstract simplified view presented here, the DNA–protein mapping was divided into two consecutive maps. The first is the ‘structural’ map s, which corresponds to the folding, the ensemble of protein conformations and the resulting interactions of the amino acids, among themselves and with the surrounding solvent. The second map, f, represents the emergence of functionally relevant dynamical modes in the protein, which is similar in spirit to the standard analysis of physical systems in terms of low-energy, large-amplitude modes [39].

    The division into ‘structural’ and ‘functional’ maps may appear analogous to the traditional view of structural biology that the function of the protein relies on its three-dimensional structure. For example, enzymes will bind to specific substrates if their shapes match according to the ‘key and lock’ principle. However, already the ‘induced fit’ model [22] demonstrated the critical role of large conformational changes in the function of many proteins [23]. Furthermore, many proteins are known to be functional despite their inherent disorder and the lack of ‘structure’ in the traditional sense [40,41]. All this suggests that, more than the averaged static conformation, it is the dynamical trajectories in the space of protein conformations that determine the function, and the functional trajectories seem to be constrained to low-dimensional manifolds, which correspond to low-energy modes [42]. Hence, one may abandon the ‘structure’ as an intermediate stage and apply a phenomenological approach in which the composed map π = f∘s is a single morphism from the amino acid chain to the dynamical modes. This view is in line with studies that demonstrate the functional relevance of correlations in amino acid substitutions [43–45] without the need to invoke a structural model.

    The protein map π = f∘s starts by dimensional expansion from the linear amino acid chain to the space of conformations and their dynamical trajectories. It is followed by a dimensional reduction to a small number of relevant modes, which govern the function of the protein. An open question is whether the relevant dynamical modes are sensitive to the detailed structure of the folded protein, or whether they are coarse-grained features, which are the outcome of ‘low modes’ in the sequence. The toy model realizes a ‘protein’ as a spatial network of ‘amino acids’ whose interaction is similar to that of a neural network or a Boolean network with short-range connectivity. This puts forward the notion of a network of logic operators with spatio-temporal dynamics that is encoded in genes. These metric logic networks represent the information in the DNA in terms of functional modes, which in turn operate on the DNA via numerous feedback loops. In this sense, one can think of the protein–DNA system as a molecular embodiment of the idea of self-reference.

    Competing interests

    I declare I have no competing interests.

    Funding

    This work was supported by the Institute for Basic Science, IBS-R020-D1. T.T. is the Helen and Martin Chooljian Founders’ Circle Member in the Simons Center for Systems Biology at the Institute for Advanced Study, Princeton.

    Acknowledgements

    The author thanks Albert Libchaber, Stanislas Leibler and Jean-Pierre Eckmann for deep discussions and comments.

    Footnotes

    1 The DNA also encodes analogue biophysical properties, such as the affinities of the corresponding mRNA to certain proteins and small RNAs, and the interaction with nucleosomes. These properties can be interpreted as secondary codes.

    One contribution of 21 to a theme issue ‘DNA as information’.

    Published by the Royal Society. All rights reserved.