Major evolutionary transitions as Bayesian structure learning

The complexity of life forms on Earth has increased tremendously, driven primarily by successive evolutionary transitions in individuality: a mechanism in which units formerly capable of independent replication combine to form higher-level evolutionary units. Although this process has been likened to the recursive combination of pre-adapted subsolutions in the framework of learning theory, no general mathematical formalization of this analogy has been provided yet. Here we show, building on former results connecting replicator dynamics and Bayesian update, that (i) the evolution of a hierarchical population under multilevel selection is equivalent to Bayesian inference in hierarchical Bayesian models, and (ii) evolutionary transitions in individuality, driven by synergistic fitness interactions, are equivalent to learning the structure of hierarchical models via Bayesian model comparison. These correspondences support a learning-theoretic narrative of evolutionary complexification: the complexity and depth of the hierarchical structure of individuality mirrors the amount and complexity of the data about the environment that has been integrated over the course of evolutionary history.

of evolutionary transitions in individuality. In this paper, we argue that they do. We first provide a mapping between multilevel selection, modeled by discrete-time replicator dynamics, and Bayesian inference in belief networks (i.e., directed graphical models), which shows that the underlying mathematical structures are isomorphic. The two key ingredients are (i) the already known equivalence between univariate Bayesian update and single-level replicator dynamics [11,12] and (ii) a possible correspondence between properties of a hierarchical population composition and multivariate probability theory. We then show that this isomorphism allows for a natural interpretation of evolutionary transitions in individuality as learning the structure [13,14] of the belief network. Indeed, following adaptive paths on the fitness landscape over possible hierarchical population compositions is equivalent to a well-known method for selecting the optimal model structure, namely Bayesian model comparison.

This connection between evolution and probabilistic generative models complements recent efforts to find algorithmic analogies between emergent evolutionary phenomena and neural-network-based learning models [15,16]. These include correspondences between evolutionary-ecological dynamics and autoassociative networks [17], as well as links between the evolution of developmental organization and learning in artificial neural networks [18]. Whereas such connectionist models account for how global self-organizing learning behavior might emerge from simple local rules (e.g., weight updates), our approach aims at providing a common global framework for modeling both evolutionary and learning dynamics.

In the following, we provide a brief introduction to the elementary building blocks of both frameworks. In Bayesian inference, a prior distribution P(I_i) over hypotheses I_i is updated according to the likelihood that the actual data e = e(t) is generated by hypothesis I_i, given by P(e(t)|I_i). Mathematically, the fitted distribution P(I_i|e(t)), called the posterior, is simply proportional to both the prior P(I_i) and the likelihood P(e(t)|I_i):

P(I_i|e(t)) = P(e(t)|I_i) P(I_i) / Σ_j P(e(t)|I_j) P(I_j).

On the other hand, the discrete replicator equation [20], which accounts for the change in relative abundance f(I_i) of types of replicating individuals I_i in the population driven by their fitness values w(I_i), reads as

f_{t+1}(I_i) = f_t(I_i) w(I_i) / w̄_t,  where w̄_t = Σ_j f_t(I_j) w(I_j) is the mean fitness of the population.
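To see this equivalence at work, consider the following minimal Python sketch (toy numbers of our own choosing, not taken from any dataset): one Bayesian update and one replicator step, computed independently, coincide exactly once the likelihood P(e|I_i) is identified with the fitness w(I_i) and the prior with the relative abundances f_t(I_i).

```python
# A minimal sketch (toy values): one Bayesian update and one discrete
# replicator step are the same operation under the identifications
# likelihood <-> fitness and prior <-> relative abundance.
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # P(I_i)   <-> relative abundances f_t(I_i)
likelihood = np.array([0.9, 0.4, 0.1])   # P(e|I_i) <-> fitness w(I_i)

# Bayesian update: posterior proportional to prior * likelihood.
posterior = prior * likelihood / np.sum(prior * likelihood)

# Replicator step: f_{t+1}(I_i) = f_t(I_i) * w(I_i) / mean fitness.
mean_fitness = np.sum(prior * likelihood)
f_next = prior * likelihood / mean_fitness

assert np.allclose(posterior, f_next)    # the two update rules are identical
```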
In a belief network, missing links encode conditional independence relations, corresponding to indirect (as opposed to direct) dependencies between variables.

In the following, we build up an algebraic isomorphism between discrete-time multilevel replicator dynamics and iterated Bayesian inference in belief networks on a step-by-step basis. The key identified quantities are summarized in Table 1.

A multilevel, hierarchical organization of the population is exactly what our model accounts for mathematically, also incorporating the effect of a stochastically varying environment.

A key assumption that enables the machinery of multivariate probability theory to work is that the abundance of collectives is measured in terms of the abundance of the individuals they contain. Indeed, by identifying the relative abundance of individuals of type I_i that are part of collectives of type C^1_j, which are themselves part of collectives of type C^2_k, with the joint probability P(I_i, C^1_j, C^2_k), the hierarchical composition of the population maps onto a multivariate probability distribution.

(Figure 1 caption, fragment: ... This is, in turn, equivalent to successive Bayesian inference of the hidden variables I, C^1 and C^2 based on the observation of the current environmental parameters e. Since these environmental parameters are sampled and observed multiple times (i.e., at every timestep t = 1, 2, 3, ...), the corresponding node of the belief network is conventionally placed on a plate. Note also that the deletion of links between nodes of the belief network corresponds to conditional independence relations between variables in the Bayesian setting, and to specific structural properties of selection and population composition in the evolutionary setting; see text for details.)
• relative abundances of units at a given level (e.g., of collectives at level C^1) ↔ probabilities (e.g., P(C^1_j));
• units at a given level (e.g., individuals) "freeze", i.e., their fitness is completely determined by the collective(s) they belong to and is the same for all i ↔ conditional independence of the observed variable e and a latent variable (e.g., I): P(e|I, C^1, C^2, ...) = P(e|C^1, C^2, ...);
• the composition of units at level C^1 is independent of what units they belong to at level C^2 ↔ conditional independence between two latent variables (e.g., I and C^2): P(I|C^1, C^2, ...) = P(I|C^1, ...);
• evolutionary transitions in individuality ↔ Bayesian structure learning;
• difference of average fitness of those units that are participating in the transition in individuality, causing the M_a → M_b change in population structure ↔ difference of the corresponding model evidences.

Table 1: Identified quantities of evolution and learning.

In particular:
• marginal distributions, such as P(C^1_j) = Σ_{i,k,...} P(I_i, C^1_j, C^2_k, ...), translate to the abundance distribution of types at the corresponding level (here, of collectives at level C^1);
• conditioning one variable on another corresponds to a directed link between the two in the belief network.
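As a concrete illustration of this dictionary, the following sketch builds the joint distribution of a hypothetical two-level hierarchy (the counts are arbitrary toy values) and reads off the marginal and conditional quantities listed above.

```python
# A minimal sketch (toy counts) of the abundance <-> probability dictionary,
# for a two-level hierarchy: individual types I_i inside collective types C1_j.
import numpy as np

# n[i, j] = number of individuals of type I_i living in collectives of type C1_j.
n = np.array([[30., 10.],
              [ 5., 40.],
              [15.,  0.]])

joint = n / n.sum()                    # P(I_i, C1_j): normalized abundances
marginal_C1 = joint.sum(axis=0)        # P(C1_j): abundance distribution at level C1
cond_I_given_C1 = joint / marginal_C1  # P(I_i | C1_j): composition of each collective type

print(marginal_C1)                # [0.5 0.5]
print(cond_I_given_C1[:, 0])      # [0.6 0.1 0.3]: composition of collectives of type C1_0
```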

Since P(e, I, C^1, C^2) can always be written as P(e|I, C^1, C^2) P(I|C^1, C^2) P(C^1|C^2) P(C^2) in terms of conditional probabilities, the corresponding belief network is the one illustrated in Figure 1. The route to simplifying the structure of the distribution, and correspondingly of the network, is to impose conditional independence relations among the variables.

When comparing model structures, one has to take into consideration not only how well the best parameter combination fits the data, but also how hard it is to find such a parameter combination. This trade-off is captured by the model evidence, the prior-weighted sum of likelihoods over all parameter combinations θ of a model structure M:

P(e|M) = Σ_θ P(e|θ, M) P(θ|M).
The first term in the sum describes the likelihood of the current parameters (i.e., their ability to fit the data), whereas the second term weights these likelihoods according to the prior probabilities of the parameters.
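A toy numerical illustration of this trade-off (the likelihoods and priors below are hypothetical, chosen only to make the point): a model with a lower best-case fit can nevertheless win on evidence if its good fits are easy to find.

```python
# A hedged sketch of Bayesian model comparison via the evidence:
# P(e|M) = sum_theta P(e|theta, M) * P(theta|M), with toy numbers.
import numpy as np

def evidence(likelihood_per_param, prior_per_param):
    """Prior-weighted sum of likelihoods over parameter combinations."""
    return np.sum(likelihood_per_param * prior_per_param)

# Model M_a: many parameter combinations, few of which fit the data well.
lik_a = np.array([0.95, 0.05, 0.05, 0.05])
prior_a = np.full(4, 0.25)

# Model M_b: fewer combinations, all fitting reasonably well.
lik_b = np.array([0.7, 0.6])
prior_b = np.full(2, 0.5)

# M_b wins despite a worse best-case fit: evidence penalizes hard-to-find optima.
print(evidence(lik_a, prior_a), evidence(lik_b, prior_b))  # 0.275 vs 0.65
```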
The corresponding quantity in the evolutionary setting is the average fitness of the population under a given hierarchical structure M,

w̄_M(e) = Σ_{i,j,...} w(e|I_i, C^1_j, ...) f_M(I_i, C^1_j, ...),

in which the first term in the sum corresponds to the fitnesses of individuals according to what collectives they belong to, and the second term weights these fitnesses according to their relative abundances.

(Figure caption, fragment: ... is represented as a new node in the Bayesian belief network. Then, another new collective emerges at level C^1 (the circles); therefore, the variable C^1 is relabeled, as its possible values now include the circle as well. Finally, new collectives emerge at an even higher level (the rectangle and the ellipse at level C^2), and correspondingly, a new node is added to the network again. Note that the evolution of parameters (i.e., of the population composition within a fixed structure) is not illustrated here for simplicity.)

For this construction to be complete, the dependence of fitness on the environment has to be specified.

A natural way to do so is to pre-define a family of basis functions (e.g., Gaussians) on the space of possible environments e, parametrized by a set of parameters (e.g., the mean and covariance of the Gaussian). Then, each type at each level is assigned one member of the family through its parameters. What determines the fitness of a given type at time t is then the value of the basis function assigned to that type at e(t).
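The following sketch implements this construction under simplifying assumptions of our own (a one-dimensional environment, three individual types, a single level of selection, and arbitrary Gaussian parameters); iterating the replicator step is then, by the correspondence above, iterated Bayesian updating.

```python
# A minimal sketch, assuming 1-D environments and Gaussian basis functions:
# each type I_i is assigned a Gaussian through its parameters (mean, variance),
# and its fitness at time t is the value of that Gaussian at e(t).
import numpy as np

def gaussian(e, mean, var):
    return np.exp(-(e - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical (mean, variance) parameters, one pair per individual type I_i.
params = [(-1.0, 0.5), (0.0, 1.0), (2.0, 0.8)]

def fitness(e_t):
    """w(I_i) at the current environment e(t): basis function evaluated at e(t)."""
    return np.array([gaussian(e_t, m, v) for m, v in params])

f = np.full(3, 1.0 / 3.0)                                   # initial abundances f(I_i)
env = np.random.default_rng(1).normal(0.0, 1.0, size=200)   # stochastic e(t) samples
for e_t in env:
    w = fitness(e_t)
    f = f * w / np.sum(f * w)           # replicator step == Bayesian update

print(f)  # mass concentrates on types whose Gaussian matches the environment
```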

The first key ingredient of the isomorphism is thus the identification of relative abundances and fitnesses with the probabilities and conditional probabilities, respectively, of multivariate discrete probability distributions. Another key ingredient is that the stochastic environment determines the fitness of both individuals and collectives in a multilevel selection process. These two pillars are united by the already known algebraic equivalence between Bayesian update and discrete replicator dynamics: as the environment e is successively observed, the distribution over the latent variables I, C^1, C^2, ..., corresponding to the hierarchical population composition, is successively updated according to Bayes' rule.
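To combine the two pillars in one toy example (again with hypothetical parameters of our own choosing), the sketch below updates a joint prior over a two-level hierarchy with environment-dependent fitnesses; marginalizing and conditioning the result recovers collective-level abundances and compositions.

```python
# A minimal sketch of iterated hierarchical Bayes as multilevel selection:
# a joint distribution over (I, C1) is repeatedly updated by fitnesses that
# depend on the stochastic environment e(t). All parameters are toy values.
import numpy as np

rng = np.random.default_rng(2)

# f[i, j]: relative abundance of individuals I_i in collectives C1_j,
# identified with the joint prior P(I_i, C1_j).
f = np.full((3, 2), 1.0 / 6.0)

means = np.array([[-1.0, 0.5],    # hypothetical environmental optima, one
                  [ 0.0, 1.5],    # per (individual type, collective type) pair
                  [ 1.0, 2.0]])

for e_t in rng.normal(0.0, 1.0, size=300):   # stochastic environment e(t)
    w = np.exp(-(e_t - means) ** 2)          # fitness w(e | I_i, C1_j)
    f = f * w / np.sum(f * w)                # joint replicator/Bayes step

print(f.sum(axis=0))       # P(C1_j): abundances at the collective level
print(f / f.sum(axis=0))   # P(I_i | C1_j): composition within collective types
```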

Having identified this analogy, one might ask how the structure of the belief network (i.e., not just the parameters of a fixed network) itself evolves. In learning theory, differ-