Information theory, predictability, and the emergence of complex life

Despite the obvious advantage of simple life forms capable of fast replication, different levels of cognitive complexity have been achieved by living systems in terms of their potential to cope with environmental uncertainty. Against the inevitable cost associated to detecting environmental cues and responding to them in adaptive ways, we conjecture that the potential for predicting the environment can overcome the expenses associated to maintaining costly, complex structures. We present a minimal formal model grounded in information theory and selection, in which successive generations of agents are mapped into transmitters and receivers of a coded message. Our agents are guessing machines and their capacity to deal with environments of different complexity defines the conditions to sustain more complex agents.


Introduction
Simple life forms dominate our biosphere [1] and define a lower bound of embodied, self-replicating systems. But life displays an enormously broad range of complexity levels, affecting many different traits of living entities, from their body size to their cognitive abilities [2]. This creates somewhat a paradox: if larger, more complex organisms are more costly to grow and maintain, why is not all life single-celled? Several arguments help provide a rationale for the emergence and persistence of complex life forms. As an instance, Gould [1] proposes that complexity is not a trait explicitly favoured by evolution. A review of fossil records convinces Gould that, across genera, phyla and the whole biosphere, we observe the expected random fluctuations around the more successful adaptation to life. In this big picture, (a) An agent G interacts with an external environment E that is modelled as a string of random bits. These bits take value 0 with probability p and value 1 otherwise. The agent tries to guess a sequence of n bits at some cost, with a reward bestowed for each correctly guessed bit. The persistence and replication of the agent can only be granted if the balance between reward and cost is positive (ρ G E > 0). (b) For a machine attempting to guess n bits, an algorithmic description of its behaviour is shown as a flow graph. Each loop in the computation involves scanning a random subset of the environment B = (b 1 , . . . , b n ) ⊂ E by comparing each b i ∈ B to a proposed guess w i . (c) A mean field approach to a certain kind of 1-guesser (modelled in the text through equations (1.1)-(1.3)) in environments of infinite size renders a boundary between survival (ρ G E > 0) and death (ρ G E < 0) as a function of the cost-reward ratio (α) and of relevant parameters for the 1-guesser model (p in this case).
bacteria are the leading life form and the complexity of every other living system is the product of a random drift. Complex life would never be explicitly favoured, but a complexity wall exists right below bacteria: simpler forms fail to subsist. Hence, a random fluctuation is more likely to produce more complex forms, falsely suggesting that evolution promotes complexity.
Major innovations in evolution involve the appearance of new types of agents displaying cooperation while limiting conflict [3,4]. A specially important innovation involved the rise of cognitive agents, namely those capable of sensing their environments and reacting to their changes in a highly adaptable way [5]. These agents were capable of dealing with more complex, non-genetic forms of information.
The advantages of such cognitive complexity become clear when considering their potential to better predict the environment, thus reducing the average hazards of unexpected fluctuations. As pointed out by Francois Jacob, an organism is 'a sort of machine for predicting the future-an automatic forecasting apparatus' ( [6, p. 9]; see also [7,8]). The main message is that foreseeing the future is a crucial ability to cope with uncertainty. If the advantages of prediction overcome the problem of maintaining and replicating the costly structures needed for inference, more complex information-processing mechanisms might be favoured under the appropriate circumstances.
This kind of problem has been addressed within ecological and evolutionary perspectives. One particularly interesting problem concerns the potential of some types of organisms to develop cognitive potential for prediction. Are all living systems capable to develop such feature? What is the limit of predictive power for a given group, and how is it affected by the lifestyle? Plants, for example, have no nervous system but exhibit some interesting capacities for decision making, self/non-self discrimination, or error correction [9]. Studies involving the evolution of prediction in simulated plants reveal that increased predictability of available resources was achieved by a proper assessment of environmental variability [10]. Some of the molecular mechanisms that pervade plant responses seem to deal with switch-like changes triggered by genetic networks [11]. In this case, growth, seed production or germination correlate with the degree of environmental fluctuations. A key conclusion of relevance to our paper is that some reproduction strategies are not selected in given environments due to a lack of predictability.
Here we aim at providing a minimal model that captures these trade-offs. In doing so, we characterize thoroughly an evolutionary driver that can push towards evermore complex life forms. We adopt an information theory perspective in which agents are inference devices interacting with a Boolean environment. For convenience, this environment is represented by a tape with ones and zeros, akin to non-empty inputs of a Turing machine (figure 1a). The agent G locates itself in a given position and tries to predict each bit of a given sequence of length n-hence it is dubbed an n-guesser. Each attempt to predict a bit involves some cost c, while a reward r is received for each successful prediction. 1-guessers are simple and assume that all bits are uncorrelated, while (n > 1)-guessers find correlations and can get a larger benefit if some structure happens to be present in the environment. A whole n-bit prediction cycle can be described as a program ( figure 1b) reward and prediction cost. They get replicated and pass on their inference abilities. Otherwise, the agent fails to replicate and eventually dies.
As a simple illustration of our approach, consider a 1-guesser living in an infinitely large environment E where uncorrelated bits take value 1 with probability p, and 0 with probability 1 − p. The average performance of a guesser G when trying to infer bits from E is given byp G E ; this is, the likelihood of emitting a correct guess:p where p G (k) is the frequency with which the guesser emits the bit value k ∈ {0, 1}. The best strategy possible is to emit always the most abundant bit in the environment. Then Without loss of generality, let's assume 1 is the most common bit. Then the average reward minus cost obtained by such a guesser reads: This curve trivially dictates the average survival or extinction of the optimal 1-guessers in infinite, unstructured environments as a function of the cost-reward ratio α ≡ c/r (grey area subtended by the solid diagonal line in figure 1c). Note that this parameter α encodes the severity of the environmenti.e. how much does a reward pay off given the investment needed to obtain it. Further, note that any more complex guessers (like the ones described in successive sections) would always fare worst in this case: they would potentially pay a larger cost to infer some structure where there is none. This results in narrower survival areas qualitatively represented by shades of grey subtended by the discontinuous lines in figure 1c. Again, these reduced niches for more complex bit-guessers would come along because there is not structure to be inferred; but that can change if correlations across environmental bits appear, as will be seen below. The idea of autonomy and the fact that predicting the future implies performing some sort of computation suggests that a parsimonious theory of life's complexity needs to incorporate reproducing individuals (and eventually populations) and information (they must be capable of predicting future environmental states). These two components define a conflict and an evolutionary trade-off. Being too simple means that the external world is perceived as a source of noise. Unexpected fluctuations can be harmful, and useful structure cannot be harnessed in your benefit. Becoming more complex (hence able to infer larger structures, if they exist) implies a risk of not being able to gather enough energy to support and replicate the mechanisms for inference. As will be shown below, it is possible to derive the critical conditions to survive as a function of the agent's complexity and to connect these conditions to information theory. As advanced above, this allows us to characterize mathematically a scenario in which a guesser's complexity is explicitly selected for. Actual living beings will embody the necessary inference mechanisms in their morphology, or in their genetic or neural networks. Instead of developing specific models for each of these alternative implementations, we resort to mathematical abstractions based on bit-strings, whose conclusions will be general and apply broadly to any chosen strategy.
A satisfactory connection between natural selection and information theory can be obtained by mapping our survival function ρ into Shannon's transmitter-receiver scheme. To do so, we consider replicators at an arbitrary generation T attempting to 'send' a message to (i.e. getting replicated into) a later generation T + 1. Hence, the older generation acts as a transmitter, the newer one becomes a receiver, and the environment and its contingencies constitute the channel through which the embodied message must be conveyed (figure 2a). From a more biological perspective, we can think of a genotype as rsos.royalsocietypublishing.org R. Soc. open sci a generative model (the instructions in an algorithm) that produces a message that must be transmitted. That message would be embodied by a phenotype and it includes every physical process and structure dictated by the generative model. As discussed by von Neuman & Burks [50], any replicating machine must pass on a physically embodied copy of its instructions-hence the phenotype must also include a physical realization of the algorithm encoded by the genotype. 1 Finally, any evolutionary pressure (including the interaction with other replicating signals) can be included as contrivances of the channel. Following a similar idea of messages being passed from one generation to the next one, Maynard-Smith [18] proposes that the replicated genetic message carries meaningful information that must be protected against the channel contingencies. Let us instead depart from a replicating message devoid of meaning. We realize that the channel itself would convey more reliably those messages embodied by a phenotype that better deals with the environmental (i.e. channel) conditions. Dysfunctional messages are removed due to natural selection. Efficient signals get more space in successive generations (figure 2b). Through this process, meaningful bits of environmental information are pumped into the replicating signals, such that the information in future messages will anticipate those channel contingencies. In our picture, meaningful information is not protected against the channel conditions (including noise), but emerges naturally from them.

Messages, channels and bit-guessers
Let us first introduce our implementation of environments (channels), messages and the replicating agents. The latter will be dubbed bit-guessers because efficient transmission will be equivalent to accurately predicting channel conditions-i.e. to correctly guessing as many bits about the environment as possible. The notation that follows may seem arid, so it is good to retain a central picture (   Figure 3. From a generative model to inference about the world. A diagrammatic representation of the algorithmic logic of the bit guessing machine. Our n-guesser contains a generative model (represented by a pool of words) from which it draws guesses about the environment. If a bit is successfully inferred, the chosen conjecture is pursued further by comparing a new bit. Otherwise, the inference is reset.
E. Both these messages and the environments are modelled as strings of bits. What follows is a rigorous mathematical characterization of how the different bit sequences are produced.
Let us consider m-environments, strings made up of m sorted random bits. We might consider one single such m-environment-i.e. one realization E of m sorted random bits (e i ∈ E, i = 1, . . . , m; e i ∈ {0, 1}). Alternatively, we might work with the ensemble E m of all m-environments-i.e. all possible environments of the same size (e i,l ∈ E l , i = 1, . . . , m; where E l ∈ E m , l = 1, . . . , 2 m )-or we might work with a sampleÊ m of this ensemble (E l ∈Ê m , l = 1, . . . , ||Ê m ||; whereÊ m ⊂ E m ). We might evaluate the performance of our bit-guessers in single m-environments, in a whole ensemble, or in a sample of it.
These m-environments model the channels of our information theory approach. Attempting to transmit a message through this channel will be implemented by trying to guess n-sized words from within the corresponding m-environment. More precisely, given an n-bit message W (with n < m) which an agent tries to transmit, we extract an n-sized word (B ⊂ E) from the corresponding m-environment. Therefore, we choose a bit at a random position in E and the successive n − 1 bits. These make up the b i ∈ B, which are compared with the w i ∈ W. Each w i is successfully transmitted through the channel if w i = b i . Hence attempting to transmit messages effectively becomes an inference task: if a guesser can anticipate the bits that follow, it has a greater chance of sending messages through. Messages transmitted equal bits copied into a later generation, hence increasing the fitness of the agent.
In this paper, we allow bit-guessers a minimal ability to react to the environment. Hence, instead of attempting to transmit a fixed word W, they are endowed with a generative model Γ G . This mechanism (explained below) builds the message W as a function of the broadcast history: Hence, the fitness of a generative model is rather based on the ensemble of messages that it can produce. 2 To evaluate this, our guessers attempt to transmit n-bit words many (N g ) times through a same channel. For each one of these broadcasts, a new n-sized word B j ⊂ E (with b j i ∈ B j for j = 1, . . . , N g and i = 1, . . . , n) is extracted from the same m-environment; and the corresponding W j are generated, each based on the broadcast history as dictated by the generative model (see below).
We can calculate different frequencies with which the guessers or the environments present bits with value k, k ∈ {0, 1}: and with δ(x, y) being Dirac's delta. Note that p G (k; i) has a subtle dependency on the environment (because G may react to it) and thatp G E indicates the average probability that guesser G successfully transmits a bit through channel E.
Thanks to these equations we can connect with the cost and reward functions introduced before. For every bit that attempts to be transmitted, a cost c is paid. A reward r = c/α is cashed in only if that bit is successfully received. α is a parameter that controls the pay-off. The survival function reads andp G E can be read from equation (2.5). As a rule of thumb, ifp G E > α the given guesser fares well enough in the proposed environment.
It is useful to quantify the entropy per bit of the messages produced by G: and the mutual information between the messages and the environment: To evaluate the performance of a guesser over an ensembleÊ m of environments (instead of over single environments), we attempt N g broadcasts over each of N e different environments (E l ∈Ê m , l = 1, . . . , N e ≡ Ê m ) of a given size. For simplicity, instead of labeling b j i,l , we stack together all N g × N e nsized words W j and B j . This way b j i ∈ B j and w j i ∈ W j for i = 1, . . . , n and j = 1, . . . , N g N e . We have , only with j running through j = 1, . . . , N g N e . Also as before, we average the pay-off across environments to determine whether a guesser's messages get successfully transmitted or not given α and the length m of the environments in the ensemble Note that We use · Êm to indicate averages across environments of an ensembleÊ m . Finally, we discuss the generative models at the core of our bit-guessers. These are mechanisms that produce n-sized strings of bits, partly as a reaction to contingencies of the environment. Such messagegenerating processes Γ G could be implemented in different ways, including artificial neural networks (ANNs) [51], spiking neurons [52], Bayesian networks [53,54], Turing machines [55], Markovian chains [56], -machines [57], random Boolean networks (RBNs) [58], among others. These devices elaborate their guesses through a series of algorithms (e.g. back-propagation, message passing or Hebbian learning) provided they have access to a sample of their environment.
In the real world, trial and error and evolution through natural selection would be the algorithm wiring the Γ G (or, in a more biological language, a genotype) into our agents. The dynamics of such evolutionary process are very interesting. However, in this paper, we aim at understanding the limits imposed by a channel's complexity and the cost of inference, not the dynamics of how those limits may be reached. Therefore, we assume that our agents perform an almost perfect inference given the environment where they live. This best inference will be hard-wired in the guesser's generative model Γ G as explained right ahead.
A guesser's generative model usually depends on the environment where it is deployed, so we note figure 3) and a series of rules dictating how to emit those bits: either in a predetermined order or as a response to the channel's changing conditions. Whenever we pick up an environment E = {e i , i = 1, . . . , m}, the best first guess possible will be the bit (0 or 1) that shows up with more frequency. Hence If both 0 and 1 appear equally often we choose 1 without loss of generality. If the agent succeeds in its first guess, its safest next bet is to emit the bit (0 or 1) that more frequently follows g 1 in the environment. We proceed similarly if the first two bits have been correctly guessed, if the first three bits have been correctly guessed, etc. We define p B|Γ (k; i) as the probability of finding k = {0, 1} at the ith position of the B j word extracted from the environment, provided that the guess so far is correct: (2.13) The index j, in this case, labels all n-sized words within the environment (b j i ∈ B j ) ⊂ E and Z(i) is a normalization constant that depends on how many words in the environment match Γ G E up to the (i − 1)th bit: (2.14) It follows Note that the pool of bits in Γ G E consists of an n-sized word, which is what they try to emit through (i.e. it constitutes the guess about) the channel. If a guesser would not be able to react to environmental conditions, the word W that is actually generated at every emission would be the same in every case and w j i = g i always; but we allow our guessers a minimal reaction if one of the bits fails to get through (i.e. if one of the guesses is not correct). This minimal reaction capacity by our guessers results in: Altogether, our guesser consists of a generative model Γ G that contains a pool of bits and a simple conditional instruction. This is reflected in the flow chart in figure 3.
We have made a series of choices regarding how to implement environmental conditions. These choices affect how some randomness enters the model (reflected in the fact that, given an environment E, a guesser might come across different words B j ⊂ E) and also how we implement our guessers (including their minimal adaptability to wrong guesses). We came up with a scheme that codes guessers, environments (or channels), and messages as bit-strings. This allows us a direct measurement of information-theoretical features which are suitable for our discussion, but the conclusions at which we arrive should be general. Survival will depend on an agent's ability to embody meaningful information about its environment. This will ultimately be controlled by the underlying cost-efficiency trade-off.
Because of the minimal implementation discussed, all bit-guessers of the same size are equal. Environmental ensembles of a given size are also considered equivalent. Hence, the notation is not affected if we identify guessers and environments by their sizes. Accordingly, in the following we substitute the labels G and E by the more informative ones n and m respectively. Hence ρ G E m (α) becomes ρ n m (α),p G E becomesp n m , etc.

Results
The question that motivates this paper relates to the trade-off between fast replication versus the cost of complex inference mechanisms. To tackle this we report a series of numerical experiments. Some of them deal with guessers in environment ensembles of fixed size, others allow guessers to switch between environment sizes to find a place where to thrive. Our core finding is that the complexity of the guessers that can populate a given environment is determined by the complexity of the latter. (In information-theoretical terms, the complexity of the most efficiently replicated message follows from the predictability of the channel.) Back to the fast replication versus complexity question, we find environments for which simple guessers die off, but in which more complex life flourishes-thus offering a quantifiable model for real-life excursions in biological complexity.
Besides verifying mathematically that the conditions for complex life exist, our model allows us to explore and quantify when and how guessers may be pushed to m-environments of one size or another. We expect to use this model to investigate this question in future papers. As neat examples, at the end of this paper we report (i) the evolutionary dynamics established when guessers are forced to compete with each other and (ii) how the fast replication versus complexity trade-off is altered when resources can be exhausted. These are two possible evolutionary drivers of complex life, as our numerical experiments show. Figure 4 showsp n m , the average probability that n-guessers correctly guess 1 bit in m-environments. The 1-guesser (that lives off maximally decorrelated bits given the environment) establishes a lower bound. More complex machines will guess more bits on average, except for infinite environment size m → ∞, at which point all guessers have equivalent predictive power.

Numerical limits of guesser complexity
As m grows, environments get less and less predictable. Importantly, the predictability of shorter words decays faster than that of larger ones, thus enabling guessers with larger n to survive where others would perish. There are 2 n possible n-words, of which m are realized in each m-environment. When m 2 n , the environment implements an efficient, ergodic sampling of all n-words-thus making them maximally unpredictable. When n m < 2 n the sampling of n-sized words is far from ergodic and a non-trivial structure is induced in the environment because the symmetry between n-sized words is broken-they cannot be equally represented due to finite size sampling effects.
This allows that complex guessers (those with the ability to contemplate larger words, keep them in memory and make choices regarding information encoded in larger strings) can guess more bits, on average, than simpler agents. In terms of messages crossing the channel, while shorter words are meaningless and basically get transmitted (i.e. are correctly guessed) by chance alone, larger words might contain meaningful, non-trivial information that get successfully transmitted because they cope with the environment in an adequate way.
Note that this symmetry breaking to favour predictability of larger words is just a mechanism that allows us to introduce correlations in a controlled and measurable way. In the real world, this mechanism In the inset, the data have been smoothed and compared with a given value of α (represented by a horizontal line). At the intersection between this line andp n m we findm n (α), the environment size at which n-sized agents guess just enough bits to survive given α. Note that n-guessers are evaluated only in environments of size m ≥ n. might correspond to asymmetries between dynamical systems in temporal or spatial scales. Although our implementation is rather ad hoc (suitable to our computational and conceptual needs), we propose that similar mechanisms might play important roles in shaping life and endowing the universe with meaningful information. Indeed, it might be extremely rare to find a kind of environment in which words of all sizes become non-informative simultaneously.
The mutual information between a guesser's response and the environment (i.e. between broadcast messages and channel conditions) further characterizes the advantages of more complex replicators. Figure 5a shows I(G : E m ) and I(G : E) E m . As we noted above, these quantities are not the same. Let us focus on 1-guessers for a moment to clarify what these quantities encode.
Given an m-environment, 1-guessers have got just one bit that they try to emit repeatedly. They do not react to the environment-there is not room for any reaction within one bit, so their guess is persistently the same. The mutual information between the emitted bit and the arbitrary words B ⊂ E that 1-guessers come across is precisely zero, as shown in the inset of figure 5a. Hence, I(G : E) E m captures the mutual information due to the slight reaction capabilities of guessers to the environmental conditions. While the bits emitted by 1-guessers do not correlate with B ⊂ E, they do correlate with each given E as they represent the most frequent bit in the environment. Accordingly, the mutual information between a 1-guesser and the aggregated environments (reflected by I(G : E m )) is different from zero (main panel of figure 5a). To this quantity contribute both the reaction capability of guessers and the fact that they have hard-wired a near-optimal guess in Γ G E , as explained in §2.1. We take the size of a guesser n as a crude characterization of its complexity. This is justified because larger guessers can store more complex patterns. H(G) E m indicates that more complex guessers look more entropic than less complex ones (figure 5b). Larger guessers come closer to the entropy level of the environment (black thick line in figure 5b), which itself tends rapidly to log(2) per bit. Better performing guessers appear more disordered to an external observer even if they are better predictors when considered within their context. Note that H(G) E m is built based on the bits actually emitted by the guessers. In biological terms, this would mean that this quantity correlates with the complexity of the phenotype. For guessers of fixed size n, we observe a slight decay of H(G) > E m as we proceed to larger environments.
The key question is whether the pay-off may be favourable for more complex guessers provided that they need a more costly machinery in order to get successfully reproduced. As discussed above, if we would use e.g. ANN or Bayesian inference graphs to model our guessers, a cost could be introduced for  The entropy of a guesser's message given its environment seems roughly constant in these experiments despite the growing environment size. This suggests an intrinsic measure of complexity for guessers. Larger guessers look more random even if they might carry more meaningful information about their environment. The thick black line represents the average entropy of the environments (which approaches log(2)) against which the entropy of the guessers can be compared. the number of units, nodes or hidden variables. These questions might be worth studying somewhere else. Here we are interested in the mathematical existence of such favourable trade-off for more complex life. To keep the discussion simple, bit-guessers incur only in a cost proportional to the number of bits that they try to transmit. Note that we do not lose generality, because such limit cost shall always exist. Equation (2.9) captures all the forces involved: the cost of transmitting longer messages versus the reward of a successful transmission.
Guessers of a given size survive in an environment ensemble if, on average, they can guess enough bits of the environment or, using the information theory picture, if they can convey enough bits through the channel (in any case, they survive ifp n m > α, which implies ρ n m > 0). Setting a fixed value of α we find out graphicallym n (α), the largest environment at which n-guessers survive ( figure 4, inset). Because m-environments look more predictable to more complex guessers we have thatm n (α) >m n (α) if n > n . This guarantees that for α > 0. 5  This is the result that we sought. The current model allows us to illustrate mathematically that limit conditions exist under which more complex and costly inference abilities can overcome the pressure for fast and cheaper replication. Also, the model allows for explicit, information-theoretically based quantification of such a limit.

Evolutionary drivers
Despite its laborious mathematical formulation, we think that our bit-guesser model is very simple and versatile. We think that it can easily capture fundamental information-theoretical aspects of biological systems. In future papers, we intend to use it to further explore relationships between guessers and environments, within ecological communities, or in more simple symbiotic or parasitic situations. To illustrate how this could work out, we present now some minimal examples.
Let us first explore some dynamics in which guessers are encouraged to explore more complex environments, but this same complexity can become a burden. As before, let us evaluate an n-guesser N g · N e times in a sample of the m-environment ensemble. Let us also look atρ n m (α, N g , N e ), the accumulated reward after these N g · N e evaluations-note thatρ n m is an empirical random variable now. Ifρ n m (α, N g , N e ) > 0, the n-guesser fares well enough in this m-environment and it is encouraged to explore a more complex one. As a consequence, the guesser is promoted to an (m + 1)-environment, where it is evaluated again. Ifρ n m (α, N g , N e ) < 0, this m-environment is excessively challenging for this n-guesser, and it is demoted to an (m − 1)-environment. Note that the n-guesser itself remains with a fixed size throughout. It is the complexity of the environment that changes depending on the reward accumulated.
As we repeatedly evaluate the n-guesser, some dynamics are established which let the guesser explore more or less complex environments. The steady state of these dynamics is characterized by a distribution P n (m, α). This tells us the frequency with which n-guessers are found in environments of a given size (figure 6a). Each n-guesser has its own distribution that captures the environmental complexity that the guesser deals more comfortably with. The overlaps and gaps between P n (m, α) for different n suggest that: (i) some guessers would engage in harsh competition if they needed to share environments of a given kind and (ii) there is room for different guessers to get segregated into environments of increasing complexity.
The averagem n (α) = m mP n (m, α) (3.1) should converge tom n (α) m n (α) under the appropriate limit. This is, if we evaluate the guessers numerically enough times, the empirical valuem n (α) should converge to the mean field valuem n (α) shown in the inset of figure 4. Figure 6b shows dynamically derived averagesm n (α) and some deviations around them as a function of α.
It is easily justified that guessers drop to simpler environments if they cannot cope with a large complexity. It is less clear why they should seek more complicated environments if they thrive in a given one. This might happen if some external force drives them; for example, if simpler guessers (which might be more efficient in simpler environments) have already crowded the place. Let us remind, from figure 4, how given an environment size more complex guessers can always accumulate a larger reward. This might suggest that complex guessers always pay off, but the additional complexity might become a burden in energetic terms-consider, e.g. the exaggerated metabolic cost of mammal brains. It is nontrivial how competition dynamics between guessers of different size can play out. Let us gain some insights by looking at a simple model. n-guessers with n = 0, 1, 2, 3 and 4 were randomly distributed occupying 100 environments, all of them with fixed size m. These guessers were assigned an initialρ i (t = 0) = nρ 0 . Here, i = 1, . . . , 100 labels each one of the 100 available guessers. Larger guessers start out with largerρ i (t = 0) representing that they come into being with a larger metabolic load satisfied. A 0-guesser represents an unoccupied environment. New empty environments might appear only if actual (n = 0) guessers die, as we explain below. We tracked the population using (a) P n (m, α) tells us how often we find n-guessers in m-environments when they are allowed to roam constrained only by their survival function ρ n m . The central valuem n of P n (m, α) must converge tom n (α) and oscillations around it depend (through N g and N e ) on how often we evaluate the guessers in each environment. (b) Averagem n for n = 1, 2, 5, 10 and standard deviation of P n (m, α) for n = 1, 10. Deviations are not presented for n = 2, 5 for clarity. The inset represents a zoom-in into the main plot.
At each iteration, a guesser (say the ith one) was chosen randomly and evaluated with respect to its environment. Then the wasted environment was replaced by a new, random one with the same size. We ensured that every guesser attempts to guess the same amount of bits on average. This means e.g. that 1-guessers are tested twice as often as 2-guessers, etc. If after the evaluation we found thatρ i (t + t) < 0, then the guesser died and it was substituted by a new one. The n of the new guesser was chosen randomly after the current distribution P m (n, t). Ifρ i (t + t) > 2nρ 0 , the guesser got replicated and shared itsρ i with its daughter, who overrode another randomly chosen guesser. This replication at 2nρ 0 represents that, before creating a similar agent, parents must satisfy a metabolic load that grows with their size. There is a range (0 <ρ i < 2nρ 0 ) within which guessers are alive but do not replicate.
Of course, this minimal model is just a proxy and softer constraints could be placed. These could allow e.g. for random replication depending on the accumulatedρ i (t + t), or for larger progeny ifρ i (t + t) >> 2nρ 0 . These are interesting variations that might be worth exploring. There are also some insights to be gained from the simple set-up considered here. We expect that more complex models will largely inherit the exploratory results that follow.  These plots show how guessers naturally segregate in environments depending on their complexity, with simpler guessers crowding simpler environments as suggested above. In such simple environments, the extra reward earned by more complex guessers does not suffice to overcome their energetic cost and they lose in this direct competition. They are, hence, pushed to more complex environments where their costly inference machinery pays off.
After 10 000 iterations, we also observe cases in which different guessers coexist. This means that the mathematical limits imposed by this naive model do not imply an immediate, absolute dominance of the fittest guesser. Interesting temporal dynamics might arise and offer the possibility to model complex ecological interactions.
So far, our guessers only interacted with the environment in a passive way, by receiving the reward that the corresponding m-environment dictates. But living systems also shape their niche in return. Such interplay can become very complicated and we think that our model offers a powerful exploratory tool. Let us study a very simple case in which the actions of the guessers (i.e. their correctly guessing a bit or not) affect the reward that an environment can offer.
To do so, we rethink the bits in an environment as resources that can not only be exhausted if correctly guessed, but also replenished after enough time has elapsed. Alternatively, thinking from the message broadcasting perspective, a spot on the channel might appear crowded if it is engaged in a successful transmission. Assume that every time that a bit is correctly guessed it gets exhausted (or gets crowded)  figure 5), we can also conceive individual bits as valuable, finite resources that get exhausted whenever they are correctly guessed. Then a successful replicator can spoil its own environment and new conditions might apply to where life is possible. (a) Average reward obtained by 1-, 2-, 5-and 10-guessers in environments of different sizes when bits get exhausted with efficiency β = 1 whenever they are correctly guessed. (b) Given α = 0.575 and α = 0.59, 1-and 2-guessers can survive within upper and lower environment sizes. If the environment is too small, resources get consumed quickly and cannot sustain the replicators. In message transmission language, the guessers crowd their own channel. If the environment is too large, unpredictability takes over for these simple replicators and they perish.
with an efficiency β so that on average each bit cannot contribute any reward β(p n m /m) of the time. The average reward extracted by a guesser from an ensemble becomes which is plotted for 1-, 2-, 5-and 10-guessers and β = 1 in figure 8. Smaller guessers living in very small environments quickly crowd their channels (alternatively, exhaust the resources they depend on). In figure 8b (still with β = 1) given some α, 1-and 2-guessers can only survive within some lower and upper limits. Furthermore, the slope of the curves around these limits also tell us important information. If these guessers dwell in environments around the lower limit (i.e. near the smallest m-environment where they can persist), then moving to larger environments will always report larger rewards. But if they dwell close to the upper limit, moving to larger environments will always be detrimental. In other words, dynamics such as the one introduced at the beginning of this section (illustrated in figure 6a)  life. This does not intend to be an exhaustive nor a definitive model, just an illustration of the versatility of the bit-guessers and environments introduced in this paper.

Discussion
In this paper, we have considered a fundamental question related to the emergence of complexity in living systems. The problem being addressed here is whether the mathematical conditions exist such that more complex organisms can overcome the cost of their complexity by developing a higher potential to predict the external environment. As suggested by several authors [6][7][8], the behavioural plasticity provided by the exploratory behaviour of living systems can be understood in terms of their ability to deal with environmental information [59].
Our models make an explicit approach by considering a replication-predictability trade-off under very general assumptions, namely: (i) more complex environments look more unpredictable to simpler replicators and (ii) agents that can keep a larger memory and make inferences based on more elaborated information can extract enough valuable bits from the environment so as to survive in those more challenging situations. Despite the inevitable cost inherent to the cognitive machinery, a selection process towards more complex life is shown to exist. This paves the way for explicit evolutionary pressures towards more complex life forms.
In our study we identify a transmitter (replicators at a given generation), a receiver (replicators at the next generation) and a channel (any environmental conditions) through which a message (ideally instructions about how to build newer replicators) is passed on. Darwinian evolution follows naturally as effective replicators transit a channel faster and more reliably thus getting more and more space in successive generations. The inference task is implicit as the environment itself codes for meaningful bits of information that, if picked up by the replicators, boost the fitness of the phenotypes embodied by the successful messages.
This view is directly inspired by a qualitative earlier picture introduced by Maynard-Smith [18]. That metaphor assigned to the DNA some external meaning that had to be preserved against the environmental noise. Contrary to this, we propose that, as messages attempt to travel from a generation to the next one, all channel conditions (including noise) pump relevant bits into the transmitted stringshence there is no need to protect meaning against the channel because, indeed, meaningful information emerges out of the replicator's interaction with such channel contingencies.
The way that we introduce correlations in our scheme (through a symmetry breaking between the information borne by short and larger words due to finite size effects) is compatible with this view. However, interestingly, it also suggests that meaningful information might arise naturally even in highly unstructured environments when different spatial and temporal scales play a relevant role. Note that our findings imply that environmental complexity is a driver of life complexity, but a question shall remain: 'where did all that environmental complexity arise from in the first place?' The way in which we link complexity and environmental size suggests an answer: that real living systems have an option to wander in ever larger environments (which, by sheer size, will be more complex than smaller ones). This is similar in spirit to the simulations illustrated in figure 6. Another possibility is that living systems themselves shall modify their environment.
This way of integrating information theory and Darwinian evolution is convenient to analyse the questions at hand that concern the emergence of complex life forms. But it also suggests further research lines. As discussed at the beginning of the paper, guessers and their transmissible messages might and should shape the transmission channel (e.g. by crowding it, as explored briefly in §3.2). What possible co-evolutionary dynamics between guessers and channels can be established? Are there stable ones, others leading to extinction, etc.? Do some of them, perhaps, imply open-ended evolution? Which ones? These are questions that relate tightly to the phenomenon of niche construction, and that hinge on the question posed in the previous paragraph regarding the many possible origins of environmental complexity. We propose that they can be easily modelled within the bit-guesser paradigm that we introduced in this paper. Further exploring the versatility of the model, a guesser's transmitted message might be considered an environment in itself; thus opening the door to ecosystem modelling based on bare information theory. It also suggests the exploration of different symbiotic relationships from this perspective and how they might affect coevolution.
An important question was left aside that concerns the memory versus adaptability trade-off of bitguessers. Here we studied guessers with a minimal adaptability to focus on the emerging hierarchy of complexity. Adaptability at faster (e.g. at behavioural) temporal scales is linked to more complex inferences with richer dynamics. This brings in new dilemmas as to how to weight the different building blocks of complex inference-e.g. how do we compare memory and if-else or while instructions? These and other questions are left for exploration in future research.
Finally, it is interesting to contextualize our results along with two recent papers published after this work was concluded. On the one hand, Boyd et al. [60] discuss how a system can behave as a thermodynamic engine that produces work with environmental correlations as fuel. This is, we believe, a very relevant discussion for the thermodynamic limits of biophysical systems that could bring explicit meaning to our abstract costs and rewards. On the other hand, Marzen & DeDeo [34] study the relationship between an environment and the resources devoted to sensory perception. They use a utilitarian approach to discover two regimes: one in which the cost of sensory perception grows with environmental complexity, and another one in which this cost remains broadly independent of the complexity of the environment. The authors say then that lossy compression allows living systems to survive without keeping exhaustive track of all the information in the environment. In these two papers, replication and the pressure of Darwinian selection are not so explicitly discussed as in our research, but both works bring in interesting elements that can enrich the bit-guesser framework.
Data accessibility. This theoretical paper does not rely on empirical data. All the plots can be generated following the equations and instructions provided in the paper.