Algorithmically probable mutations reproduce aspects of evolution, such as convergence rate, genetic memory and modularity

Natural selection explains how life has evolved over millions of years from more primitive forms. The speed at which this happens, however, has sometimes defied formal explanations when based on random (uniformly distributed) mutations. Here, we investigate the application of a simplicity bias based on a natural but algorithmic distribution of mutations (no recombination) in various examples, particularly binary matrices, in order to compare evolutionary convergence rates. Results both on synthetic and on small biological examples indicate an accelerated rate when mutations are not statistically uniform but algorithmically uniform. We show that algorithmic distributions can evolve modularity and genetic memory by preservation of structures when they first occur sometimes leading to an accelerated production of diversity but also to population extinctions, possibly explaining naturally occurring phenomena such as diversity explosions (e.g. the Cambrian) and massive extinctions (e.g. the End Triassic) whose causes are currently a cause for debate. The natural approach introduced here appears to be a better approximation to biological evolution than models based exclusively upon random uniform mutations, and it also approaches a formal version of open-ended evolution based on previous formal results. These results validate some suggestions in the direction that computation may be an equally important driver of evolution. We also show that inducing the method on problems of optimization, such as genetic algorithms, has the potential to accelerate convergence of artificial evolutionary algorithms.

Natural selection explains how life has evolved over millions of years from more primitive forms. The speed at which this happens, however, has sometimes defied formal explanations when based on random (uniformly distributed) mutations. Here, we investigate the application of a simplicity bias based on a natural but algorithmic distribution of mutations (no recombination) in various examples, particularly binary matrices, in order to compare evolutionary convergence rates. Results both on synthetic and on small biological examples indicate an accelerated rate when mutations are not statistically uniform but algorithmically uniform. We show that algorithmic distributions can evolve modularity and genetic memory by preservation of structures when they first occur sometimes leading to an accelerated production of diversity but also to population extinctions, possibly explaining naturally occurring phenomena such as diversity explosions (e.g. the Cambrian) and massive extinctions (e.g. the End Triassic) whose causes are currently a cause for debate. The natural approach introduced here appears to be a better approximation to biological evolution than models based exclusively upon random uniform mutations, and it also approaches a formal version of open-ended evolution based on previous formal results. These results validate some suggestions in the direction that computation may be an equally important driver of evolution. We also show that inducing the method on problems of optimization, such as genetic algorithms, has the potential to accelerate convergence of artificial evolutionary algorithms.

Introduction
Central to modern synthesis and general evolutionary theory is the understanding that evolution is gradual and is explained by small genetic changes in populations over time [1]. Genetic variation in populations can arise by chance through mutation, with these small changes leading to major evolutionary changes over time. Of interest in connection to the possible links between the theory of biological evolution and the theory of information is the place and role of randomness in the process that provides the variety necessary to allow organisms to change and adapt over time.
On the one hand, while there are known sources of non-uniform random mutations, for example as a function of environment, gender and age in plants and animals, when all conditions are the same mutations are traditionally considered to be uniformly distributed across coding and non-coding regions. Non-coding DNA regions are subject to different mutation rates throughout the genome, because they are subject to less selective pressure than coding regions. This is the same for the so-called microsatellites, repetitive DNA segments which are mostly non-coding, where the mutation rate increases as a function of number of repetitions. However, beyond physical properties in which the probability of a given nucleotide mutating also depends on its weaker or stronger chemo-and thermodynamic bonds, other departures from non-uniformity are less well understood, and seem to be the result of a process rather than being related to or driven by direct physical or chemical interactions.
On the other hand, random mutation implies no evidence for a directing force, and in artificial genetic algorithms mutation has traditionally been uniform even if other strategies are subject to continuous investigation and have been introduced as a function of, for example, time or data size.
More recently, it has been suggested [2][3][4][5] that the deeply informational and computational nature of biological organisms makes them amenable to being studied or considered as computer programs following (algorithmic) random walks in software space, that is, the space of all possible-and validcomputer programs. Here, we numerically test this hypothesis and explore the consequences vis-à-vis our understanding of the biological aspects of life and natural evolution by natural selection, as well as for applications to optimization problems in areas such as evolutionary programming.
We found that the simple assumption of introducing computation in the model of random mutation had some interesting ramifications that echo some genetic and evolutionary phenomenology.

Chaitin's evolutionary model
In the context of his Metabiology programme, Gregory Chaitin, a founder of the theory of algorithmic information, introduced a theoretical computational model that evolves 'organisms' relative to their environment considerably faster than classical random mutation [2,6,7]. While theoretically sound, the ideas had not been tested and further advancements were needed for their actual implementation. Here we follow an experimental approach heavily based on the theory that Chaitin himself helped found. We apply his ideas on evolution operating in software space on synthetic and biological examples, and even if further investigation is needed this work represents the first step towards testing and advancing a sound algorithmic framework for biological evolution.
Starting with an empty binary string, Chaitin's example approximates his V number, defined as V ¼ P p[HP 2 Àjpj , where HP is the set of all halting programs [8], in an expected time of O(t 2 (logt) (1þO(1)) ), which is significantly faster than the exponential time that the process would take if random mutations from a uniform distribution were applied. This speed-up is obtained by drawing mutations according to the universal distribution (UD) [9,10], a distribution that results from the operation of computer programs that we will explain in detail in the following section.
In a previous result [11], we have shown that Chaitin's model exhibits open-ended evolution (OEE) [12] according to a formal definition of OEE as defined [11] in accordance with the general intuition about OEE, and that no decidable system with computable dynamics can achieve OEE under such computational definition. Here we will introduce a system that, by following the UD, optimally approaches OEE.
formally defined as where p is a random computer program in binary (whose bits were chosen at random) running on a so-called prefix-free (in order to constrain the number of valid programs as would happen in physical systems) universal Turing machine U that outputs s and halts. Algorithmic probability connects the algorithmic likelihood of s to the intrinsic algorithmic s. The less algorithmically complex s (like p), the more frequently it will be produced on U by running a random computer program p. If K(s) is the descriptive algorithmic complexity of s (also known as Kolmogorov-Chaitin complexity [16,17]), we have it that P(s) ¼ 1/(2 K(s)þO(1) ).
The distribution induced by P(s) over all strings is called the UD or Levin's semi-measure [9,10,18], because the measure is semi-computable and can only be approximated from below and its sum does not add up to 1 to be a full measure.
The mainstream practice in the consideration and application of mutation is that mutations happen according to a uniform distribution based on, for example, the length of a genomic sequence and independent of the fitness function. What we will show here is that all other things being equal and without making considerations of other genetic operations (e.g. sexual versus asexual) beyond the scope of this paper, our results indicate that the operation of random mutation based on algorithmic probability and UD makes 'organisms' converge faster and has interesting phenomenological implications such as modularity. Evidently, this claim is not completely independent of fitness function. If a fitness function assigns, for example, a higher fitness to organisms whose description maximizes algorithmic randomness, then the application of mutations based on algorithmic probability and the UD will fail and will do so optimally as it would be pushing exactly in the opposite direction. But we will show that as long as the fitness function maximizes some nonalgorithmic random structure, as would be expected from organisms living in a structured world [4], then mutations based on the UD will converge faster.

Classical versus algorithmic probability
To illustrate the difference between one and the other, the classical probability of producing the first n digits of a mathematical constant such as p in binary by chance by e.g. randomly typing on a typewriter is exponentially unlikely as a function of the number of digits to be produced. However, because p is not random, in the sense that it has a short description that can generate an arbitrary number of digits of p with the same (short) formula, the algorithmic likelihood of p to be generated by a random program is much higher than its classical probability. This is because the (classical) probability of producing a short computer program encoding a short mathematical formula is more likely than typing the digits of p themselves one by one. This probability based on generating computer programs rather than generating the objects that such computer programs may generate is called algorithmic probability. A p-generating formula can thus be written as a computer program in no more than N bits having a probability of occurring by chance divergent from the 1/2 n probability given by classical probability.

Motivation and theoretical considerations of algorithmic evolution
In Chaitin's evolutionary model [2,6,7], a successful mutation is defined as a computable function m, chosen according to the probabilities stated by the UD [9,10], that changes the current state of the system (as an input of the function) to a better approximation of the constant V [17]. In order to be able to simulate this system, we would need to compute the UD and the fitness function. However, both the UD and the fitness function of the system require the solution of the halting problem [8], which is uncomputable. Nevertheless, as with V itself, this solution can be approximated [19,20]. Here we are proposing a model that, to the best of our knowledge, is the first computable approximation to Chaitin's proposal.
For this first approximation, we have made four important initial concessions: one with respect to the real computing time of the system, and three with respect to Chaitin's model: rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180399 -We assume that building the probability distributions for each instance of the evolution takes no computational time, while in the real computation this is the single most resource-intensive step. -The goal of our system is to approximate objects of bounded information content: binary matrices of a set size. -We use the block decomposition method (BDM) and Shannon's entropy as approximations for the algorithmic information complexity K. -We are not approximating the algorithmic probability of the mutation functions, but that of their outputs.
We justify the first concession in a similar fashion as Chaitin: if we assume that the interactions and mechanics of the natural world are computable, then the probability of a decidable event 1 occurring is given by the UD. The third one is a necessity, as the algorithmic probability of an object is uncomputable (it requires a solution for HP too). In an upcoming section, we will show that Shannon's entropy is not as good as BDM for our purposes. Finally, note that given the UD and a fixed input, the probability of a mutation is in inverse proportion to the descriptive complexity of its output, up to a constant error. In other words, it is highly probable that a mutation may reduce the information content of the input but improbable that it may increase the information content. Therefore, the last concession yields an adequate approximation, since a low information mutation can reduce the descriptive complexity of the input but not increase it in a meaningful way.

Our expectations
It is important to note that, when compared to Chaitin's metabiology model [2], we changed the goal of our system therefore we must also change the expectations we had for its behaviour.
Chaitin's evolution model [2] is faster than regular random models despite targeting a highly random object, thanks to the fact that positive mutations have low algorithmic information complexity and hence a (relatively) high probability of being stochastically chosen under the UD. The universally low algorithmic complexity of these positive mutations relies on the fact that, when assuming an oracle for HP, we are also implying a constant algorithmic complexity for its evaluation function and target, since we can write a program that verifies if a change on a given approximation of V is a positive one without needing a codification of V itself.
By contrast, we expected our model to be sensitive with respect to the algorithmic complexity of the target matrix, obtaining high speed-up for structured target matrices that decreases as the algorithmic complexity of the target grows. However, this change of behaviour remains congruent with the main argument of metabiology [2] and our assertion that, contrary to regular random mutations, algorithmic probability driven evolution tends to produce structured novelty at a faster rate, which we hope to prove in the upcoming set of experiments.
In summary, we expect that when using an approximation to the UD: -convergence will be reached in fewer total mutations than when using the uniform distribution for structured target matrices; and -the stated difference will decrease in relation to the algorithmic complexity of the target matrix.
We also aimed to explore the effect of the number of allowed shifts (mutations) on the expected behaviour.

The unsuitability of Shannon's entropy
As shown in [20,21], when compared to BDM we can think of Shannon's entropy alone as a less accurate approximation to the algorithmic complexity of an object (if its underlying probability distribution is not updated by a method equivalent to BDM, as it would not be by the typical uninformed observer). Therefore, we expect the entropy-induced speed-up to be consistently outperformed by BDM when the target matrix moves away from algorithmic randomness and has thus some structure. Furthermore, as random matrices are expected to have a balanced number of 0s and 1s, we anticipated the performance of single bit entropy to be nearly identical to the uniform distribution on unstructured (random) matrices. For block entropy [22,23], that is, the entropy computed over submatrices rather than single bits, the probability of having repeated blocks is in inverse proportion 1 An event is decidable if it can be decided by a Turing machine. to their size, while blocks of smaller sizes approximate single bit entropy, again yielding similar results to the uniform distribution. The results support our assumptions and claims.

Evolutionary model
Broadly speaking, our evolutionary model is a tuple hS,S,M 0 ,f,t,ai, where: -S is the state space (see §2.7); -M 0 , with M 0 [ S, is the initial state of the system; f: S 7 ! R þ is a function, called the fitness or aptitude function, which goes from the state space to the positive real numbers; t is a positive integer called the extinction threshold; a is a real number called the convergence parameter; and -S: A successful evolution is the sequence M 0 , (M 1 , t 1 ) ¼ S(M 0 , f, t), . . ., (`, t n )) and P t i is the total evolution time. We say that the evolution failed, or that we got an extinction, if instead we finish the process by (?, t n ), with P t i being the extinction time. The evolution is undetermined otherwise. Finally, we will call each element (M i , t i ) an instance of the evolution.

Experimental set-up: a Max One problem instance
For this experiment, our phase state is the set of all binary matrices of sizes n Â n, our fitness function is defined as the Hamming distance f (M) ¼ H(M t , M ), where M t is the target matrix, and our convergence parameter is a ¼ 0. In other words, the evolution converges when we produce the target matrix, guided only by the Hamming distance to it, which is defined as the number of different bits between the input matrix and the target matrix.
The stated set-up was chosen since it allows us to easily define and control the descriptive complexity of the fitness function by controlling the target matrix and, therefore, also control the complexity of the evolutionary system itself. Is important to note that our set-up can be seen as a generalization of the Max One problem [24], where the initial state is a binary 'initial gene' and the target matrix is the 'target gene'; when we obtain a Hamming distance of 0 we have obtained the gene equality.

Evolution dynamics
The main goal of this project is to contrast the speed of the evolution when choosing between two approaches to determining the probability of mutations: -when the probability of a given set of mutations has a uniform distribution, that is, all possible mutations have the same probability of occurrence, even if under certain constraints; and -when the probability of a given mutation occurring is given by an approximation to the UD [10,14].
As the UD is non-computable, we will approximate it by approximating the algorithmic complexity K [16,17] by means of the BDM (with no overlapping) [20] based on the coding theorem method (CTM) [25][26][27] (see Methods). -We will also investigate the results by running the same experiments using Shannon entropy instead of BDM to approximate K.
Each evolution instance was computed by iterating over the same dynamic. We start by defining the set of possible mutations as those that are within a fixed n number of bits from the input matrix. In other words, for a given input matrix M, the set of possible mutations in a single For implementation purposes, we used a minor variation to the entropy probability distribution to be used and compared to BDM. The probability distributions for the set of possible mutations using entropy were built using two heuristics: let M 0 be a possible mutation of M, then the probability of obtaining M 0 as a mutation is defined as either b 0 /(h(M 0 ) þ e) or b 00 /2 h(M 0 ) . The first definition assigns a linearly higher probability to mutations with lower entropy. The second definition is consistent with our use of BDM in the rest of the experiments. The constant e is an arbitrary small value that was included to avoid undefined (infinite) probabilities. For the presented experiments e was set at 1 210 .
Once the probability distribution is computed, we set the number of steps as 0 and then, using a (pseudo)random number generator (RNG), we proceed to stochastically draw a matrix from the stated probability distributions and evaluate its fitness with the function f, adding 1 to the number of steps. If the resultant matrix does not show an improvement in fitness, we draw another matrix and add another 1 to the number of steps, not stopping the process until we obtain a matrix with better fitness or reach the extinction threshold. We can either replace the drawn matrix or leave it out of the pool for the next iterations. A visualization of the stated work flow for a 2 Â 2 matrix is shown in figure 1. To produce a complete evolution sequence, we iterate the stated process until either convergence or extinction is reached. As stated before, we could choose to not replace an evaluated matrix from the set of possible mutations in each instance, but we chose to not keep track of evaluated matrices after an instance was complete. This was done in order to keep open the possibility of dynamic fitness functions in future experiments.
In this case, the evolution time is defined as the sum of the number of steps (or draws) it took the initial matrix to reach equality with the target matrix. When computing the evolution dynamics by one of the different probability distribution schemes we will denote it by uniform strategy, BDM strategy or h strategy, respectively. That is, the uniform distribution, the distribution for the algorithmic probability estimation by BDM and the distribution by Shannon entropy.

The speed-up quotient
We will measure how fast (or slow) a strategy is compared to the uniform by the speed-up quotient, which we will define as follows.
Definition 2.1. The speed-up quotient, or simply speed-up, between the uniform strategy and a given strategy f is defined as where S u is the average number of steps it takes a sample (a set of initial state matrices) to reach convergence under the uniform strategy and S f is the average number of steps it takes under the f strategy.

Cases of negative speed-up
In order to better explain the choices we have made to our experimental set-up, first we will present a series of cases where we obtained no speed-up or slow-down. Although these cases were expected, they shed important light on the behaviour of the system.

Entropy versus uniform on random matrices
For the following experiments, we generated 200 random matrices separated into two sets: initial matrices and target matrices. After pairing them based on their generation order, we evolved them using 10 strategies: the uniform distribution, block Shannon's entropy for blocks of size 4 Â 4, denoted below by h b , entropy for single bits denoted by h, and their variants where we divide by h 2 and h 2 b , respectively. The strategies were repeated for 1-and 2-bit shifts (mutations).
The results obtained are summarized in table 1, which lays out the strategy used for each experiment, the number of shifts/mutations allowed, the average number of steps it took to reach convergence, as well as the standard error of the sample mean. As we can see, the differences in the number of steps required to reach convergence are not statistically significant, validating our assertion that, for random matrices, entropy evolution is not much different than the uniform evolution.
Because the algorithmic complexity of a network makes sense only in its unlabelled version in general, and in most of the cases, in previous works [20,27,28], we showed, both theoretically and numerically, that approximations of algorithmic complexity of adjacency matrices of labelled graphs are a good approximation (up to a logarithmic term or the numerical precision of the algorithm) of the algorithmic complexity of the unlabelled graphs. This means that we can consider any adjacency matrix of a network a good representation of the network disregarding graph isomorphisms.

Entropy versus uniform on a highly structured matrix
For this set of experiments, we took the same set of 100 8 Â 8 initial matrices and evolved them into a highly structured matrix, which is the adjacency matrix of the star with eight nodes. For this matrix, rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180399 we expected entropy to be unable to capture its structure, and the results obtained accorded with our expectations. The results are shown in table 2.
As we can see from the results, entropy was unable to show a statistically significant speed-up compared to the uniform distribution. Over the next sections, we show that we have obtained a statistically significant speed-up by using the BDM approximation to algorithmic probability distributions, which is expected because BDM manages to better capture the algorithmic structures of a matrix rather than just the distribution of the bits which entropy measures. Based on the previous experiments, we conclude that entropy is thus not a good approximation for K, and we will omit its use in the rest of the article.

Randomly generated graphs
For this set of experiments, we generated 200 random 8 Â 8 matrices and 600 16 Â 16 matrices, both sets separated into initial and target matrices. We then proceeded to evolve the initial matrix into the corresponding target by the following strategies: uniform and BDM within 2-bit and 3-bit shifts (mutations) for the 8 Â 8 matrices and only 2-bit shifts for the 16 Â 16 matrices due to computing time. The results obtained are shown in figure 2. In all cases, we did not replace drawn matrices and the extinction threshold was set at 2500.
From the results, we can see two important behaviours for the 8 Â 8 matrices. The matrices generated are of high BDM complexity and evolving the system using the uniform strategy tends to be faster than using BDM for these highly random matrices. Secondly, although increasing the number of possible shifts by 1 seems, at a first glance, a small change in our set-up, it has a big impact on our results: the number of extinctions has gone from 0 for both methods to 92 for the uniform strategy and 100 for BDM. This means that most evolutions will rise above our threshold of 2500 drafts for a single successful evolutionary step, leading to an extinction. As for the 16 Â 16 matrices, we can see a formation of two easily separable clusters that coincide perfectly with the uniform and BDM distributions, respectively.   Given the values discussed, we have chosen to set the extinction threshold at 2500 and the number of shifts at 2 for 8 Â 8 matrices, as allowing just 64 possible mutations for each stage is a number too small for showing a significant difference in the evolutionary time between the uniform and BDM strategies, while requiring evolutionary steps of approximately 41 664 for an evolutionary stage is too computationally costly. The threshold of 2500 is close to the number of possible mutations and has been shown to consume a high amount of computational resources. For 16 Â 16 matrices, we performed 1-bit shifts only, and occasionally 2-bit shifts when computationally possible.

The BDM strategy, extinctions and persistent structures
The interesting case is the BDM strategy. As we can see clearly in figure 3 for the 8 Â 8 3-bit case, the overall number of steps needed to reach each extinction is often significantly higher than 2500 under the BDM strategy. This behaviour cannot be explained by the analysis done for the uniform distribution, which predicts the sharp drop observed in the blue curve.
After analysing the set of matrices drawn during failed mutations (all the matrices drawn during a single failed evolutionary stage), we found that most of these matrices have in common highly regular structures. We will call these structures persistent structures. Formally, regular structures can be defined as follows. The previous inequality is a consequence of the fact that the possible mutations are finite and only a small number of them, if any, can have a smaller algorithmic complexity than the mutations that contain G; otherwise, we contradict the existence of C. In other words, as G has relatively low complexity, the structures that contain G tend to also have low algorithmic complexity, and hence a higher probability of being chosen. Finally, as shown in §3.2, we can expect the number of mutations with persistent structures to increase in factorial order with the number of possible mutations and in polynomial order with respect to the size of the matrices that compose the state space. Proposition 3.2. As a direct consequence of the last statement, we have it that, for systems evolving as described in §2.7 under the UD: once a structure with low descriptive complexity is developed, it is exponentially hard to get rid of it; the probability of finding a mutation without the structure decreases in factorial order with respect to the set of possible mutations; evolving towards random matrices is hard (improbable); and evolving from and to unrelated regular structures is also hard.
Given the fourth point, we will always choose random initial matrices from now on, as the probability of drawing a mutation other than an empty matrix (of zeroes), when one is present in the set of possible mutations, is extremely low (below 9 Â 10 26 for 8 Â 8 matrices with two shifts).

Positive speed-up instances
In the previous section, we established that the BDM strategy yields a negative speed-up when targeting randomly generated matrices, which are expected to be of high algorithmic information content or unstructured. However, as stated in §2.4, that behaviour is within our expectations. In the next section, we will show instances of positive speed-up, including cases where previously entropy failed to show statistically significant speed-up or was outperformed by BDM.

Synthetic matrices
For the following set of experiments, we manually built three 8 Â 8 matrices that encode the adjacency matrices of three undirected non-random graphs with eight nodes that are intuitively structured: the complete graph, the star graph and a grid. The matrices used are shown in figure 4.
After evolving the same set of 100 randomly generated matrices for the three stated matrices, we can report that we found varying degrees of positive speed-up, that correspond to their respective descriptive complexities as approximated by their BDM values. The complete graph, along with the empty graph, is the graph that has the lowest approximated descriptive complexity with a BDM value of just 24.01. As expected, we get the best speed-up quotient in this case. After the complete graph, the star intuitively seems to be one of the less complex graphs we can draw. However, its BDM value (105.434) is higher than the grid (83.503). Accordingly, the speed-up obtained is lower. The results are shown in figure 5.
As we can see from figure 5, a positive speed-up quotient was consistently found within 2-bit shifts without replacements. We have one instance of negative speed-up with one shift with replacements for the grid, and negative speed-up for all but the complete graph with two shifts.
However, it is important to say that almost all the instances of negative speed-up are not statistically significant, as we have a very high extinction rate of over 90%, and the difference between the averages is lower than two standard errors of the mean. The one exception is the grid at 1-bit shift, which had 45 extinctions for the BDM strategy. The complete tables are presented in appendix A.

Mutation memory
The cause of the extinctions found in the grid are what we will call maladaptive persistent structures (definition 3.1), as they occur at a significantly higher rate under the BDM distribution. Also, as the results suggest, a strategy to avoid this problem is adding memory to the evolution. In our case, we will not replace matrices already drawn from the set of possible mutations.
We do not believe this change to be contradictory to the stated goals, since another way to see this behaviour is that the UD dooms (with very high probability) populations with certain mutations to extinction, and evolution must find strategies to eliminate these mutations fast from the population. This argument also implies that extinction is faster under the UD than regular random evolution when a persistent maladaptive mutation is present, which can be seen as a form of mutation memory. This requirement has the potential to explain evolutionary phenomena such as the Cambrian explosion, as well as mass extinctions: once a positively structured mutation is developed, further algorithmic mutations will keep it (with a high probability), and the same applies to negatively structured mutations. This can also explain the recurring structures found in the natural world. Degradation of a structure is still possible, but will be relatively slow. In other words, evolution will remember positive and negative mutations (up to a point) when they are structured.
From now on, we will assume that our system has memory and that mutations are not replaced when drawn from the distribution.

The speed-up distribution
Having explored various cases, and found several conditions where negative and positive speed-up are present, the aim of the following experiment was to offer a broader view of the distribution of speed-up instances as functions of their algorithmic complexity.
For the 8 Â 8 case, we generated 28 matrices by starting with the undirected complete graph with eight nodes, represented by its adjacency matrix, and then we removed one edge at a time until the empty graph (the diagonal matrix) was left, obtaining our target matrix set. It is important to note that the resultant matrices are always symmetrical. The process was repeated for the 16 Â 16 matrices, obtaining a total of 120 target matrices.
For each target matrix in the first target matrix set, we generated 50 random initial matrices and evolved the population until convergence was reached using the two stated strategies, uniform and BDM, both without replacements. We saved the number of steps it took for each of the 2800 evolutions to reach convergence and computed the average speed-up quotient for each target matrix. The stated process was repeated for the second target matrix set, but by generating 20 random matrices for each of the 120 target matrices to conserve computational resources. The experiment was repeated for shifts of 1 and 2 bits and the extinction thresholds used were 2500 for 8 Â 8 and 10 000 for 16 Â 16 matrices.
As we can see from the results in figure 6, the average number of steps required to reach convergence is lower when using the BDM distribution for matrices with low algorithmic complexity, and this difference drops along with the complexity of the matrices but never crosses the extinction threshold. This suggests that symmetry over the diagonal is enough to guarantee a degree of structure that can be captured by BDM. It is important to report that we found no extinction case for the 8 Â 8 matrices, 13 in the 16 Â 16 matrices with 1-bit shifts, all for the BDM distribution, and 1794 with 2-bit shifts, mostly for the uniform distribution. This last experiment was computationally very expensive. Computing the data required for the 16 Â 16, 2-bit shifts sequence took 12 days, 6 h and 22 min on a single core of an i5-4570 PC with 8GB of RAM. Repeating this experiment for 3-bit shifts is unfeasible with our current set-up, as it would take us roughly two months shy of 3 years. Now, by combining the data obtained for the previous sequence and the random matrices used in §3.1.3, we can approximate the positive speed-up distribution. Given the nature of the data, this approximation (figure 7) is given as two curves, each representing the expected evolution time from a random initial matrix as a function of the algorithmic information complexity of the target matrix for both strategies, uniform and BDM, respectively. The positive speed-up instances are those where the BDM curve is below the uniform curve.
The first result we get from figure 7 is a confirmation of an expected one: unlike the uniform strategy, the BDM strategy is highly sensitive to the algorithmic information content of the target matrix. In other words, 'it makes no difference for a uniform probability mutation space whether the solution is structured or not, while an algorithmic probability driven mutation will naturally converge faster to structured solutions'.
The results obtained expand upon the theoretical development presented in §3.2.1. As the set of possible mutations grows, so do the instances of persistent structures and the slow-down itself. This behaviour is evident given that, when we increase the dimension of the matrices, we obtain a wider gap within the intersection point of the two curves and the expected BDM value, which corresponds to the expected algorithmic complexity of randomly generated matrices. However, we also increase the number of structured matrices, ultimately producing a richer and more interesting evolution space.

.1. A biological case
We now set as target the adjacency matrix of a biological network corresponding to the topology of an ERBB signalling network [29]. The network is involved in responses ranging from cell division, death, motility and adhesion, and when dysregulated it has been found to be strongly related to cancer [30,31]. As one of our main hypotheses is that algorithmic probability is a better model for explaining biological diversity, it is important to explore whether naturally occurring structures are more likely to be produced under the BDM strategy than the uniform strategy, which is equivalent to showing them evolving faster.
The binary target matrix is shown in figure 8, and it has a BDM of 349.91 bits. For the first experiment, we generated 50 random matrices that were evolved using 1-bit shift mutations for the uniform and BDM distributions, without repetitions. The BDM of the matrix is at the right of the intersection point inferred by the cubic models shown in figure 7. Therefore, we predict a slow-down. The results obtained are shown in table 3.
As the results show, we obtained a slow-down of 0.71, without extinctions. However, as mentioned above, the BDM of the target matrix is relatively high, so this result is consistent with our previous experiments. However, the strategy can be improved.

Evolutionary networks
An evolutionary network N is a tensor of dimension 4 of nodes M i which are networks themselves with edges drawn if M k evolves into M k 0 and weight corresponding to the number of times that a network M k has evolved into M k 0 . Figure 9 shows a subnetwork of the full network for each evolutionary strategy from 50 (pseudo-)randomly generated networks with the biological ERBB signalling network as target.
Mutations and overexpression of ERB receptors (ERBB2 and ERBB3 in this network) have been strongly associated with more than 10 types of tissue-specific cancers and they can be seen at the highest level regulating most of the acyclic network.
We name forward mutations as mutations that led to the target network, and backward mutations as mutations that get away from the target network through the same evolutionary paths induced by  One of the mutations by BDM involves the breaking of the only network cycle of size 6: EGFR ! ERBB3, ERBB3 ! IGFR, IGFR ! ESR, ESR ! MYC, MYC ! EGFR by deletion of the interaction MYC ! EGFR, with probability 0.05 among the possible mutations in the BDM immediate neighbourhood of the target. The cycle involves ERBB3 which has been found to be related to many types of cancer when overexpressed [30].
For the local BDM strategy, the following were the top five forward mutations: EGFR ! ERBB2, 0.32; EGFR ! ERBB3, 0.107; IGFR ! CCNE, 0.0714; ERBB3 ! ERBB2, 0.0714; EGFR ! ESR, 0.0714. ERBB2 and ERBB3 were heavily involved in three of the top five possible mutations with added probability 0.49 and thus more likely than any other pair and interaction of proteins in the network.
Under a hypothesis that mutations can be reversals to states in past evolutionary pathways, then mutations to such interactions may be the most likely backward mutations to occur.

The case for localized mutations and modularity
As previously mentioned in proposition 3.2, the main causes of slow-down under the BDM distribution are maladaptive persistent structures. These structures will negatively impact the evolution speed in factorial order relative to the size of the state space. One direct way to reduce the size set of possible  mutations is to reduce the size of the matrices we are evolving. However, doing so will reduce the number of interesting objects we can evolve towards too. Another way to accomplish the objective while using the same heuristic is to rely on localized (or modular) mutations. That is, we force the mutation to take place on a submatrix of the input matrix. The way we implement the stated change is by adding a single step in our evolution dynamics: at each iteration, we will randomly draw, with uniform probability, one submatrix of size 4 Â 4 out of the set of adjacent submatrices that compose the input matrix, with no overlap, and force the mutation to be there by computing the probability distribution over all the matrices that contain the bit-shift only at the chosen place. We will call this method the local BDM method.
It is important to note that, within 1-bit shifts ( point mutations), the space of total possible mutations remains the same when compared with the uniform and BDM strategies. Furthermore, the behaviour of the uniform strategy would remain unchanged if the extra step is applied using the uniform distribution.
We repeated the experiment shown in table 3 with the addition of the local BDM strategy and the same 50 random initial matrices. Its results are shown in table 4. As we can see from the results obtained, local BDM obtains a statistically significant speed-up of 1.25 when compared with the uniform.
One potential explanation of why we failed to obtain speed-up for the network with the BDM strategy is that, as an approximation to K, the model depends on finding global algorithmic BDM-based mutation (approximation to universal distribution) total number of edges: 6309 >1 -graph: local BDM-based mutation (modularity induced) total number of edges: 6327 >1 -graph: uniform random mutation total number of edges: 6339 >1 -graph: Figure 9. Evolutionary convergence in evolutionary subnetworks closest to the target matrix with edges shown only if they were used more than once. The darker the edge colour the more times that an evolutionary path was used to reach the target; the highest number is 7 for the BDM-based mutation (top) and the lower 2 (e.g. all those in the uniform random distribution). BDM-based is the only disconnected graph meaning that it produced a case of convergent evolution even before reaching the target.
rsos.royalsocietypublishing.org R. Soc. open sci. 5: 180399 structures, while the sample is based on a substructure which might not have enough information about the underlying structures that we hypothesize govern the natural world and allow scientific models and predictions. However, biology evolves modular systems [32], such as genes and cells, that in turn build building blocks such as proteins and tissues. Therefore, local algorithmic mutation is a better model. This is a good place to recall that local BDM was devised as a natural solution to the problem presented by maladaptive persistent structures in global algorithmic mutation. This also means that this type of modularity can be evolved by itself given that it provides an evolutionary advantage, as our results demonstrate. This is compatible with the biological phenomenon of non-point mutations in contrast to point mutations, which affect only a single nucleotide. For example, in microsatellites mutations may lead to the gain or loss of the entire repeated unit, and sometimes several repeats simultaneously.
We will further explore the relationship between BDM and local BDM within the context of global structures in the next section. Our current set-up is not optimal for further experimentation in biological and local structured matrices, as the computational resources required to build the probability distribution for each instance grows in quadratic order relative to matrix size, though these computational resources are not needed in the real world (cf. Conclusion).

Chasing synthetic evolving networks
The aim of the next set of experiments was to follow, or chase, the evolution of a moving target using our evolutionary strategies. In this case, we chased four different dynamical networks: the ZK graphs [21], Kary trees, an evolving N-star graph and a star-to-path graph dynamic transition artificially created for this project (see appendix A for code). These dynamical networks are families of directed labelled graphs that evolve over time using a deterministic algorithm, some of which display interesting graph-theoretic and entropy-fooling properties [21]. As the evolution dynamics of these graphs are fully deterministic, we expected BDM to be (statistically) significantly faster than the other two evolutionary strategies, uniform probability and local BDM.
We chased these dynamics in the following way. Let S 0 , S 1 , . . ., S n , . . . be the stages of the system we are chasing. Then the initial state S 0 was represented by a random matrix and, for each evolution S i 7 ! S iþ1 , the input was defined as the adjacency matrix corresponding to S i , while the target was set as the adjacency matrix for S iþ1 . In order to normalize the matrix size, we defined the networks as always containing the same number of nodes (16 for 16 Â 16 matrices). We followed each dynamic until the corresponding stage could not be defined in 16 nodes.
The results which were obtained, starting from 100 random graphs and 100 different evolution paths at each stage, are shown in figure 10. It is important to note that, since the graphs were directed, the matrices used were non-symmetrical.
From the results, we can see that local BDM consistently outperformed the uniform probability evolution, but the BDM strategy was the faster by a significant margin. The results are as expected and confirm our hypothesis: uniform evolution cannot detect any underlying algorithmic cause of evolution, while BDM can, inducing a faster overall evolution. Local BDM can only detect local regularities, which is good enough to outrun uniform evolution in these cases. However, as the algorithmic regularities are global, local BDM is slower than (global) BDM.

Discussion and conclusion
The results of our numeric experiments are statistically significant and, as shown in figures 6 and 7, the speed-up quotient increases in relationship to the ratio between the algorithmic complexity of the target matrix and the expected random matrix, confirming our theoretical expectations. The obtained speed-up can be considered low when the stated quotient is sufficiently close to 1, but on a rich evolution space we expect this difference to be significant: for a rough estimate, the human genome can potentially store 700 megabytes of data, while the biggest matrix used in our experiments represents a space limited to objects of 16 Â 16 ¼ 256 bits; therefore we expect the effects of speed-up (and slow-down) to be significantly higher in natural evolution than in these experiments. On the one hand, classical mechanics establishes that random events are only apparent and not fundamental. This means that mutations are not truly random but the result of interacting deterministic systems that may distribute differently than random. A distribution representing causal determinism is that suggested by algorithmic probability and the UD because of its theoretical stability under changes of formalism and description language [9,10]. Its relevancy, even for non-Turing universal models of computation, has also been proven [18], able to explain up to more than 50% of a bias towards simplicity.
On the other hand, the mathematical mechanisms of biological information, from Mendelian inheritance to Darwin's evolution and the discovery of the digital nature of the genetic code together with the mechanistic nature of the mechanisms of translation, transcription and other intercellular processes, suggests a strong algorithmic basis underlying fundamental biological processes. By taking it to the next consequence, these ideas indicate that evolution by natural selection may not be (very) different to, and can thus be regarded and studied as, evolving programs in software space as suggested by Chaitin [3,6,7].
Our findings demonstrate that computation can thus be a powerful driver of evolution that can better explain key aspects of life. Effectively, algorithmic probability reduces the space of possible mutations. By abandoning the uniform distribution assumption, questions ranging from the apparition of sudden major stages of evolution, the emergence of 'subroutines' in the form of modular persistent structures and the need of an evolving memory carrying information organized in such modules that drive evolution by selection may be explained.
The algorithmic distribution emerges naturally from the interaction of deterministic systems [10,18]. In other words, we are simulating the conditions of an algorithmic/procedural world, and there is no reason to believe that it requires greater real-world (thus highly parallel) computation than is required by the assumption of the uniform distribution given the highly parallel computing nature of physical laws. The UD can thus be considered as natural, or in some way, even more natural, than the uniform distribution.
The interplay of the evolvability of organisms from the persistence of such structures also explains two opposed phenomena, recurrent explosions of diversity and mass extinctions, phenomena which have occurred during the history of life on Earth that have not been satisfactorily explained under the uniform mutation assumption. The results suggest that extinction may be an intrinsic mechanism of biological evolution. In summary, taking the informational and computational aspects of life based on modern synthesis to the ultimate and natural consequences, the present approach based on weak assumptions of deterministic dynamic systems offers a novel framework of algorithmic evolution within which to study both biological and artificial evolution.

Approximations to algorithmic complexity
The algorithmic complexity of a string K(s) (also known as Kolmogorov-Chaitin complexity [16,17]) is defined as the length of the smallest program that produces s as an output and halts. This measure of complexity is invariant-up to a constant value-with respect to the choice of reference universal Turing machine. Finding the exact value of K(s) for any s is a lower semi-computable problem. This means that there is no general effective method to find K(s) for any given string, but upper bounds can be estimated.
Among the computable methods used to set an upper bound are the CTM [25,26,33] and the BDM [20,27]. The CTM relies upon approximating the algorithmic probability of an object by running every possible machine in a large set of small Turing machines, generating an empirical probability distribution for the produced strings by counting the number of small Turing machines that produce each string and halt. The algorithm can only be decided for a small number of Turing machines and for those that can be decided it runs in exponential time, therefore only approximations of K(s) for small strings are feasible. However, this computation only needs to be done once to populate a lookup table that allows its application in linear (constant in exchange of memory) time.
BDM is an extension of CTM defined as where each s i corresponds to a substring of s for which its CTM value is known and n i is the number of times the string s i appears in s. A thorough discussion of BDM is found in Zenil et al. [20].

Recursively generated graphs
To test the speed of algorithmic evolution on recursive dynamic networks, we generated three other low algorithmic graphs different from the ZK graph as defined in Zenil et al. [21] that is also of low algorithmic complexity. We needed graphs that evolved over time in a low algorithmic complexity fashion from and to low algorithmic complexity graphs. The three graphs were canonically labelled using the positive natural numbers up to n by maximizing the number of nodes with consecutive numbers, then a rule was applied from lowest to highest number until the transformation was complete. The Wolfram Language code used to generate these recursively (hence of low algorithmic complexity/randomness) evolving graphs are the following. For the ZK graph [21]