The emergence of division of labour through decentralized social sanctioning
Abstract
Human ecological success relies on our characteristic ability to flexibly self-organize into cooperative social groups, the most successful of which employ substantial specialization and division of labour. Unlike most other animals, humans learn by trial and error during their lives what role to take on. However, when some critical roles are more attractive than others, and individuals are self-interested, then there is a social dilemma: each individual would prefer others take on the critical but unremunerative roles so they may remain free to take one that pays better. But disaster occurs if all act thus and a critical role goes unfilled. In such situations learning an optimum role distribution may not be possible. Consequently, a fundamental question is: how can division of labour emerge in groups of self-interested lifetime-learning individuals? Here, we show that by introducing a model of social norms, which we regard as emergent patterns of decentralized social sanctioning, it becomes possible for groups of self-interested individuals to learn a productive division of labour involving all critical roles. Such social norms work by redistributing rewards within the population to disincentivize antisocial roles while incentivizing prosocial roles that do not intrinsically pay as well as others.
1. Introduction
Human societies depend on division of labour. However, an individual’s role is not specified in their genes. Rather, human roles are learned during individual lifetimes. This is one of the reasons why human groups can achieve collective welfare more quickly than would be possible with purely genetic evolution. However, many computational models of lifetime-learning formulate it as a process of maximization of individual payoffs [1–4]. Such a formulation cannot, on its own, account for the learning of division of labour. When individuals are driven by self-interest they cannot learn to perform roles that do not pay as well as other roles, yet such roles are often necessary for the group to function and achieve a high overall welfare. Consequently, a fundamental question arises: how can division of labour emerge in groups of self-interested lifetime-learning individuals?
Here, we hypothesize that social norms, which we take to be patterns of social sanctioning, are sufficient to incentivize individuals in groups to select prosocial role choices, thereby enabling group-level division of labour to emerge from self-interested lifetime-learning. To test this hypothesis, we propose a model where lifetime-learning is shaped by social sanctioning. The specific social sanctioning mechanism presented here was inspired by studies that investigated the role of reward and punishment in laboratory-based social dilemmas such as the public goods game [5–9].
In these experiments, participants start with an initial endowment of tokens and can contribute to a public good by investing tokens in it. It is best for all if all individuals invest all their tokens, yet as individuals, all face an incentive to free-ride: holding back their own investment while still benefiting from the contributions of others. For instance, Fehr & Gächter [8] designed experiments consisting of two stages. First, participants chose how much to contribute to the public good. In a second stage, they received information about the choices of others, and then decided on that basis whether or not to pay a cost to punish others for their behaviour. They found that participants were willing to engage in altruistic punishment: they would pay to punish free-riders. Furthermore, the possibility of altruistic punishment in the case of repeated interactions led to an increase in the average cooperation level of the participants in the group.
Figure 1 illustrates the learning process used in our model. In Stage I (figure 1a), a group of self-interested lifetime-learners learn their roles. This process is modelled as a K-armed bandit problem where individuals learn the role (arm) that provides the highest payoff (optimum) from a finite set of roles. Each individual's lifetime-learning of their role is modelled by an ε-greedy algorithm: individuals select the role they judge to provide the maximum average payoff, as estimated empirically from the payoffs they receive, and explore other roles with a small probability [4]. Stage II (figure 1b) is the social sanctioning stage, where individuals can monitor others and impose sanctions based on their role choices. Subsequently, the payoffs received in Stage I are updated by the social sanctions imposed in Stage II.
Figure 1. The decision process of the individuals consists of two stages: in (a) Stage I, individuals make a role selection based on the values of the roles (Q) (see equation (2.1)). The rewards received depend on a function (F) of the roles selected by all individuals in the group (see equation (2.3)). In (b) Stage II, examples of social sanctioning rules are shown as statements in the box outlined with dashed lines. With k roles, it is possible to generate k × k rules. For computational efficiency, we represent these rules as a matrix where each cell encodes the amount of encouragement/discouragement imposed by the roles indicated in the row and column headers. Social sanctions can be positive (in the case of reward), negative (in the case of punishment) or zero (if no social sanction is imposed). Social sanctions are imposed in a decentralized fashion to shape the rewards of others so as to encourage the division of labour. The rewards are updated based on algorithm 1 (see electronic supplementary material). Such a pattern of decentralized social sanctioning constitutes a social norm (S). The learning process consists of two levels: lifetime-learning and social norm evolution. (c) In lifetime-learning, groups of individuals perform Stage I and Stage II iteratively to learn their roles. Through social norm evolution, the values of the social sanctioning matrix are optimized to produce optimum lifetime-learning of the social roles.
In our model, social sanctioning occurs because individuals endorse normative rules of the form 'X ought to encourage/discourage Y', for example 'farmers ought to encourage hunters'. Encouragement is not possible when individuals have no resources, so individuals with no recent rewards do not encourage, regardless of what their society deems normative for their role. In this model, costly norms cannot be implemented by impoverished individuals; however, the rules themselves are not forgotten and may reactivate once individuals with the roles they name become wealthier. Likewise, systems of norms retain rules for social roles that do not currently exist: rules for how farmers ought to relate to soldiers can persist in the cultural memory even if there are currently no soldiers.
We model social norms as patterns of social sanctioning adopted by all individuals in a group. That is, every individual in the group applies the same sanctioning scheme. This is consistent with conceptions of what it is for a collective behaviour pattern to be a social norm that rely on most agents in the group conforming [10]. This kind of sanctioning mechanism is said to be decentralized since it does not require a centralized enforcement mechanism like a government-run police force. Rather, the individuals in the group perform all the sanctioning actions themselves. However, their participation is not voluntary. They do not decide on their own whether to sanction or not. At each point in time, whether they sanction another individual, and if so, how much, is entirely determined by the group's social sanctioning matrix, i.e. the social norm itself. This may be justified by assuming the existence of a metanorm that demands individuals sanction in accord with their group's overall pattern, an assumption also shared by other models where norms are seen as public goods and metanorms are consequently necessary to evade the second-order free-rider problem [11,12]. The existence of the requisite metanorm to stabilize patterns of decentralized social sanctioning is supported by laboratory experiments [13–16] and ethnographic evidence [17,18]. In our model, one implication of involuntary social sanctioning is that, given the right norm, it is possible to shape learning in any direction. This works for the same reason that reward-shaping techniques are effective both in behaviourist psychology [19] and in algorithmic reinforcement learning [20]: by sanctioning, groups can induce individuals to select any behaviour. This is consistent with other models in the evolution of cooperation literature where sanctioning may stabilize behaviours regardless of whether or not they are adaptive [6].
As illustrated in figure 1b, social sanctions are determined by rules that are computationally encoded in the form of a real-valued matrix of shape k × k (where k is the number of roles). Entries in the matrix correspond to normative rules: each entry defines how much reward or punishment an individual with role i ought to impose on an individual with role j. Under our terminology, each such specific matrix corresponds to a distinct social norm. Sanctioning in this model is role-wise: the amount of sanctioning applied by one individual to another at time t is a function of both of their roles at that time. There is a rich, cross-disciplinary tradition of theories of social structure featuring this kind of intimate connection between 'role psychology' and normative behaviour. Intuitively, norms are deeply entangled with social roles. For instance, it would be inappropriate (i.e. sanctionable) for a student to behave like a teacher or for a judge to behave like a legislator [21–23]. The errant student would likely face discipline from an academic administrator, another role, and one for which such sanctioning is part of the job description. Note also that role-wise sanctioning, unlike individual-targeted sanctioning, allows the implementation of impartial norms, as some human norms are [24,25]. It need not merely produce the self-serving, resentment-driven sanctioning motivations that are sometimes observed in non-human primates [26].
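For illustration, such a norm can be encoded directly as a matrix; the following minimal Python sketch uses hypothetical role names and sanctioning values, not ones taken from our experiments.

```python
import numpy as np

# A social norm encoded as a k x k sanctioning matrix (illustrative values).
# Rows index the role of the sanctioning individual; columns index the role
# of the sanctioned individual.
roles = ["farmer", "hunter", "soldier"]   # hypothetical roles
k = len(roles)

S = np.zeros((k, k))
S[roles.index("farmer"), roles.index("hunter")] = 0.5    # 'farmers ought to encourage hunters'
S[roles.index("hunter"), roles.index("soldier")] = -0.3  # 'hunters ought to discourage soldiers'

def sanction(sanctioner_role: str, target_role: str) -> float:
    """Reward (+) or punishment (-) imposed, looked up role-wise."""
    return float(S[roles.index(sanctioner_role), roles.index(target_role)])

print(sanction("farmer", "hunter"))  # 0.5
```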
Depending on the environment, a social norm may either facilitate high social welfare or court disaster. For instance, a norm that encourages many farmers to become soldiers could be important for survival in a hostile environment where soldiers are truly needed for defence. However, in a more benign environment with no external threats for soldiers to defend against, allocating too many individuals to the soldier role would be suboptimal; it would be better to have more farmers instead. Typically, the environment is a major factor in determining which specific norms are effective.
One implication is that norm evolution may grow more and more precisely optimized to local conditions over time. It is generally possible to achieve a better and better fit to the precise environmental conditions a community finds itself in by evolving more and more complex norms. However, we think there are at least two strong brakes on this process that prevent the evolution of ever greater norm complexity. First, for cultural evolution to happen, norms must somehow be transmitted between generations. The constraints of such transmission are thought to give rise to a simplicity bias (complexity regularization [27]) in the context of direct and indirect reciprocation rules for cooperation [28,29] and in language learning [30]. We propose that the same mechanisms also constrain norm acquisition: if norm learners (children or immigrants) find it easier to acquire simpler norms, then there will be an overall bias toward simplicity throughout the evolutionary process. Second, very high norm complexity may be problematic in itself if we assume that it implies a growing overhead cost. Tainter [31] argued that the tendency for states to accumulate complexity, with large attendant costs, has been a major causal force in the collapse of ancient empires. If true, this logic would also put a brake on the benefits of increasing norm complexity.
In our model, social norms are optimized through a cultural evolution process (given in algorithm 3 in the electronic supplementary material). We interpret its mechanism of norm change as cultural group selection [32–34]. As such, our model depends critically on all the assumptions necessary for the strength of group selection to outweigh that of individual selection, e.g. sufficient separation between groups [32]. Note, though, that in the present work we do not explicitly model other groups beyond the focal group. Formally, our model considers only a single representative group, where the effect of evaluating and evolving social norms can be computed iteratively and independently. Each social norm evolution experiment starts from a randomly initialized norm, obtained by sampling the values of the sanctioning matrix from a uniform distribution over a certain range; this initial condition incorporates no prior knowledge of the task at hand. In each iteration, a new variant of the social norm is generated by perturbing its constituent rules. The newly generated norm is then evaluated, and its success is measured by the average group payoff achieved on a task demanding the learning of a division-of-labour arrangement. In addition, we introduce complexity regularization [27,35], which imposes a penalty based on the complexity of the norms. If the new norm achieves a higher regularized average group payoff than the incumbent, it replaces the status quo.
Overall, we find that the social norms that emerge from evolution involve redistribution mechanisms whereby individuals periodically pay others in order to incentivize them to perform roles that benefit the group but that they would not otherwise select. These mechanisms allow groups of self-interested individuals to discover effective division-of-labour arrangements through lifetime-learning. Specifically, we show that the proposed method of simulating social norm evolution leads to higher collective payoffs than those achieved by groups of self-interested or altruistic individuals. Moreover, complexity regularization promotes convergent evolution to simpler social norms across independent evolutionary processes.
2. The model
(a) Lifetime-learning of individuals’ roles
In our model, groups consist of N individuals. Individuals select and re-select roles for themselves throughout their lifetime. Individuals learn from this experience which of their role choices are the most rewarding and usually select the role they expect will provide the most reward. That is, individuals face a K-armed bandit problem, where K is the number of roles. They select a new role on each step, so it is best to think of the simulation steps as corresponding to some substantial period of time like a week or a month.
The lifetime-learning process is illustrated in figure 1a. Individuals change their behaviour over time via value-based reinforcement learning [4]. On each step t, each individual i selects its role $a_i^t$ using its estimated role values $Q_i^t$ with the ε-greedy rule

$$a_i^t = \begin{cases} \arg\max_a Q_i^t(a) & \text{with probability } 1 - \epsilon,\\ \text{a role chosen uniformly at random} & \text{with probability } \epsilon, \end{cases} \qquad (2.1)$$

where $Q_i^t(a)$ is the empirical average of the payoffs individual i has received when selecting role a.
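A minimal sketch of this learning rule, assuming a sample-average estimate of the role values (the full procedure is algorithm 1 in the electronic supplementary material):

```python
import random

class EpsilonGreedyLearner:
    """One individual's K-armed bandit over roles (sketch)."""

    def __init__(self, n_roles: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.values = [0.0] * n_roles  # Q: estimated average payoff per role
        self.counts = [0] * n_roles    # number of times each role was selected

    def select_role(self) -> int:
        if random.random() < self.epsilon:                 # explore
            return random.randrange(len(self.values))
        return max(range(len(self.values)),                # exploit
                   key=lambda a: self.values[a])

    def update(self, role: int, reward: float) -> None:
        # Incremental sample-average update of the selected role's value.
        self.counts[role] += 1
        self.values[role] += (reward - self.values[role]) / self.counts[role]
```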
In this model, individuals learn to select roles in order to maximize their personal reward. However, this could lead them to select roles containing selfish behaviours that gain personal reward at the expense of the wider group. Individualistic reinforcement learning has the effect of discouraging agents from ‘taking one for the team’, resulting in lower joint performance in environments that require such cooperation.
(b) Incentivizing the lifetime-learning of division of labour via social norms
We consider a reward function F that maps the roles selected by all individuals in the group to the payoff received by each individual,

$$r_i^t = F(a_i^t \mid a_1^t, \ldots, a_N^t), \qquad (2.3)$$

so that the reward an individual receives depends not only on its own role choice but also on the role choices of the others in the group.
Social norms in our model are regarded as decentralized patterns of social sanctioning. In our model, sanctions are rewards and punishments imposed by one individual on another individual. Amounts of sanctioning are a function of the roles of the sanctioning and the sanctioned player. The social sanctioning stage (Stage II as shown in figure 1b) is applied every time step after Stage I (shown in figure 1a). Here, the individuals in the group monitor other individuals (based on a neighbourhood function that defines the connectivity of their social network) and impose sanctions. The amount of the sanction imposed by an individual taking role k on an individual taking role ℓ is $s_{k,\ell}$ (figure 1b). The rewards received after the role selection (Stage I) are then updated based on the social sanctioning scheme applied in Stage II. As shown in figure 1c, Stages I and II are performed consecutively for T iterations.
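In outline, Stage II can be sketched as follows (the names are ours; the exact procedure, including the resource bookkeeping described later in §2(d), is algorithm 2 in the electronic supplementary material):

```python
def apply_sanctions(rewards, roles, S, neighbours):
    """Sketch of Stage II: each individual i imposes the sanction
    S[roles[i]][roles[j]] on every neighbour j. The zero-sum resource
    bookkeeping described in section 2(d) is omitted here."""
    updated = list(rewards)
    for i, role_i in enumerate(roles):
        for j in neighbours[i]:
            updated[j] += S[role_i][roles[j]]
    return updated
```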
(c) Evolution of social norms for incentivizing division of labour
We formalize the process of social norm evolution with an optimization algorithm that models cultural evolution from the cultural group selection point of view (see algorithm 3 in the electronic supplementary material). The goal of the algorithm is to find a particular set of social sanctioning rules, represented as matrix S*, that can maximize the regularized objective

$$F_t(S) = R(S) - \lambda \lVert S \rVert_0,$$

where R(S) is the average group payoff achieved when lifetime-learning is shaped by S (see equation (2.4)), $\lVert S \rVert_0$ (the L0 norm) counts the number of non-zero sanctioning rules in S, and λ ≥ 0 controls the strength of the complexity regularization.
Other forms of complexity measure can be considered. For instance, the sum of the absolute values of the social sanctioning values (the L1 norm) could be a good measure for minimizing the total amount of sanctioning. On the other hand, this may not reduce the number of rules that constitute a social norm, since a large number of rules with very small sanctioning values could still emerge. Another option is the number of role types involved in the rules; in this case, social norms consisting of a large number of rules but a small number of role types could emerge. The L0 norm is therefore the best-fitting measure among these alternatives for modelling the complexity of social norms, as it facilitates the emergence of norms that are simpler in terms of their size.
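Concretely, the two candidate complexity measures differ as follows (a minimal sketch; S is the sanctioning matrix):

```python
import numpy as np

S = np.array([[0.0, 0.6],
              [-1.45, 0.0]])        # hypothetical two-role norm

l0 = np.count_nonzero(S)            # L0 norm: number of rules (used here)
l1 = float(np.abs(S).sum())         # L1 norm: total amount of sanctioning
print(l0, l1)                       # 2 2.05
```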
Initially, the group is assigned a social sanctioning matrix in which the amounts of the sanctions are randomly sampled within a certain range (uniformly in [−6, 6]). It is assumed that all individuals in the group use the same social sanctioning matrix. The group can then sample a new social norm by applying three possible operators to the existing one: (i) Gaussian perturbation: the values of non-zero cells are perturbed by noise drawn from $\mathcal{N}(0, \sigma^2)$; (ii) addition of a new rule: a randomly selected zero-valued cell is initialized with a new sanctioning value; and (iii) deletion of an existing rule: the value of a randomly selected non-zero cell is replaced by 0. Note that the addition and deletion operators change the number of sanctioning rules, whereas Gaussian perturbation changes only the sanctioning values of existing rules. In addition, we keep track of the usage frequency of each rule (i.e. how many times each rule is implemented by the individuals in a group), and if there are rules that are not used, we remove them by setting their value to 0 in the social sanctioning matrix.
We introduce four parameters to adjust the amount of change these operators can produce: mutation probability (mp) and mutation rate (mr) for Gaussian perturbation, rule addition probability (rap) for rule addition, and rule deletion probability (rdp) for rule deletion. Gaussian perturbation is applied with probability mp and rate mr, where σ = mr; the rule addition and rule deletion operators are applied with probabilities rap and rdp, respectively, as sketched below. Sensitivity analysis for these parameters is provided in the electronic supplementary material.
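A sketch of the three operators (the initialization distribution for newly added rules is an assumption here; we reuse the perturbation scale):

```python
import numpy as np

def mutate_norm(S, mp, mr, rap, rdp, rng=np.random.default_rng()):
    """Sketch of the variation operators applied to a sanctioning matrix S."""
    S = S.copy()
    nonzero = list(zip(*np.nonzero(S)))
    zero = list(zip(*np.nonzero(S == 0)))
    # (i) Gaussian perturbation of existing rules, with sigma = mr
    for idx in nonzero:
        if rng.random() < mp:
            S[idx] += rng.normal(0.0, mr)
    # (ii) addition of a new rule (initialization distribution assumed)
    if zero and rng.random() < rap:
        S[zero[rng.integers(len(zero))]] = rng.normal(0.0, mr)
    # (iii) deletion of an existing rule
    if nonzero and rng.random() < rdp:
        S[nonzero[rng.integers(len(nonzero))]] = 0.0
    return S
```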
After the perturbation step, if the perturbed social sanctioning rules S′ provide better performance (as measured by $F_t$), they are selected by the group and replace the social sanctioning rules currently in use. After a sufficient number of iterations of this process, we expect to find social sanctioning rules that are better at incentivizing division of labour.
The average reward of the social sanctioning rules, R, is found as

$$R = \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} r_i^t, \qquad (2.4)$$

i.e. the payoff averaged over all N individuals and all T lifetime-learning iterations.
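Putting the pieces together, the norm-evolution loop is a simple hill climb on the regularized objective. In the sketch below, `evaluate_group_payoff` is a hypothetical stand-in for running Stages I and II for T lifetime-learning steps and returning R; the parameter values passed to `mutate_norm` are likewise illustrative.

```python
import numpy as np

def evolve_norm(n_iterations, k, lam, evaluate_group_payoff,
                rng=np.random.default_rng()):
    """Sketch of the cultural evolution of a social norm (cf. algorithm 3):
    keep a candidate norm S and replace it whenever a perturbed variant
    achieves a higher regularized average group payoff."""
    S = rng.uniform(-6.0, 6.0, size=(k, k))   # randomly initialized norm
    best = evaluate_group_payoff(S) - lam * np.count_nonzero(S)
    for _ in range(n_iterations):
        S_new = mutate_norm(S, mp=0.1, mr=0.5, rap=0.1, rdp=0.1)  # illustrative parameters
        f_new = evaluate_group_payoff(S_new) - lam * np.count_nonzero(S_new)
        if f_new > best:                      # the new norm replaces the status quo
            S, best = S_new, f_new
    return S
```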
(d) Spatial games
We investigate the evolution of cooperative norms in two spatially structured games, settlement maintenance and common pasture. In both games, each individual occupies a position in space and directly interacts only with their neighbours (figure 2a). Each individual selects a role to perform and receives a payoff as a result. Some of the roles do not provide the individuals who select them with any reward at all, yet it may still be necessary that some agents perform them for the group as a whole to be successful.
Figure 2. (a) Types of roles, colour coded for visualization, can be performed by the individuals, which are spatially distributed on a two-dimensional toroidal grid environment and can impose social sanctions on the individuals within their neighbourhood. The goal is to learn a role distribution (a role in each location) that maximizes the average group payoff. The payoffs of the roles are shown for the (b) settlement maintenance and (c) common pasture games. Settlement maintenance has four role choices and common pasture has three.
We indicate the locations (cells) to which individuals are assigned by (i, j) coordinates. Individuals are constrained to interact with their closest eight neighbours (i.e. their Moore neighbourhood). Furthermore, the environment is toroidal: individuals located in the first and last columns/rows interact with individuals located in the last and first columns/rows, respectively, as sketched below.
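For concreteness, the toroidal Moore neighbourhood can be computed with modular arithmetic (a minimal sketch):

```python
def moore_neighbourhood(i, j, rows, cols):
    """The eight neighbours of cell (i, j) on a toroidal grid."""
    return [((i + di) % rows, (j + dj) % cols)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]
```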
The settlement maintenance game was inspired by research on ant colonies but is interpreted here as a model for small-scale human societies [36,37]. Payoffs associated with each role are shown in figure 2b. In this game, individuals can obtain payoffs by performing foraging and hunting roles for food. Hunting provides a higher payoff but requires coordination with a sufficient number of other hunters. In addition, there are cleaner and soldier roles that do not directly provide any payoffs but are still required for maintaining and protecting the settlement, respectively. In each time step, there is a constant rate of waste accumulation in each cell of the environment which negatively affects the rewards received by the individuals. However, if an individual performs the cleaner role in the neighbourhood then the adverse effect of waste accumulation is mitigated. Furthermore, there is an adversarial attack probability for each cell that causes the reward of an individual to be stolen (reduced to 0). However, when individuals take on the soldier role they then provide protection against adversarial attack within their neighbourhood.
The payoffs of the roles in the common pasture game are provided in figure 2c. This game models situations of resource appropriation where a bad outcome known as the tragedy of the commons may occur [38,39]. Here, the tragedy is when resources in the environment become depleted as a result of individuals pursuing their own selfish interests without regard to mounting social cost. To avoid the tragedy of the commons, a critical fraction of individuals must learn to act sustainably, restraining their use of resources. To simulate this behaviour, we model a pasture environment with a certain amount of starting resources in each location (e.g. grass for the herders’ herds). These resources can be appropriated by the herders either greedily (G) or considerately (C). The resources regenerate over time unless fully depleted. In addition, we introduce another role: worker (W), which can increase the regeneration speed of the resources (e.g. speed of grass growth). However, if the resource in a cell is fully depleted, it cannot be recovered within the present lifetime. Therefore, the individuals cannot receive any reward when the resources are depleted. The environment resets in terms of its resources and recovery ability between generations.
This model of social sanctioning is broadly compatible with other redistributive (zero-sum) sanctioning mechanisms in the reinforcement learning literature [40–42]. It differs from the purely destructive negative-sum social sanctioning schemes considered by Köster et al. [43] and Vinitsky et al. [44]. In the present work, the payoffs received in Stage I stand in for physical objects such as resources like the food received as a result of foraging behaviour. This kind of sanctioning is called zero-sum because it respects a conservation law: resources are neither created nor destroyed by sanctioning. Consequently, the function of sanctioning can be considered as a decentralized resource redistribution scheme. Losing resources is punishing while gaining resources is rewarding. One may ‘gift’ resources to another, reducing one's own reward to increase the other's (positively sanctioning them). Symmetrically, one may take resources away from another agent, gaining them oneself (negatively sanctioning them). Importantly, once reward has been reduced to zero it cannot be reduced any further since one cannot take away resources that do not exist. One implication is that individuals can only positively sanction others when they have enough resources to do so. The sanctioning process is shown in detail in algorithm 2 (see electronic supplementary material).
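A minimal sketch of a single zero-sum transfer with the conservation and non-negativity constraints described above (the function name and signature are ours):

```python
def transfer(rewards, sanctioner, target, amount):
    """Zero-sum sanction: a positive amount gifts resources from the
    sanctioner to the target; a negative amount takes resources from the
    target. Resources are conserved and never drop below zero."""
    if amount >= 0:  # positive sanction, limited by the sanctioner's resources
        amount = min(amount, rewards[sanctioner])
        rewards[sanctioner] -= amount
        rewards[target] += amount
    else:            # negative sanction, limited by the target's resources
        amount = min(-amount, rewards[target])
        rewards[target] -= amount
        rewards[sanctioner] += amount
    return rewards
```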
(i) Settlement maintenance
In the settlement maintenance game, individuals can select one of the roles in each time step t: $a_{i,j}^t \in \{\text{forager}, \text{hunter}, \text{cleaner}, \text{soldier}\}$. We assume waste accumulates in each cell location (i, j) at a constant rate τ, up to a maximum level. The ideal living conditions of the individuals are affected by this accumulation: their payoffs are reduced according to the level of waste accumulated at their location.
In addition, we assume that the environment is subject to random adversarial attacks that reduce the payoffs received by the affected individuals. The frequency of these attacks is controlled by the attack probability $P_{i,j}$, which is assumed to be the same for all cells. If there is an attack on a cell location, the payoff of the individual there is reduced to 0. On the other hand, if there is a soldier in the neighbourhood, the attacks are repelled and payoffs are protected.
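As a sketch of how the two hazards modify a payoff (the proportional waste penalty is an assumed functional form, not our exact update):

```python
import random

def adjust_payoff(payoff, waste, max_waste, p_attack, has_cleaner, has_soldier):
    """Sketch of the settlement maintenance hazards: waste scales the payoff
    down unless a cleaner is in the neighbourhood; an attack zeroes it
    unless a soldier is in the neighbourhood."""
    if not has_cleaner:
        payoff *= max(0.0, 1.0 - waste / max_waste)  # assumed proportional penalty
    if not has_soldier and random.random() < p_attack:
        payoff = 0.0                                 # reward stolen by the attack
    return payoff
```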
(ii) Common pasture
In common pasture, individuals can select one of the roles $a_{i,j}^t \in \{\text{considerate herder (C)}, \text{greedy herder (G)}, \text{worker (W)}\}$. The amount of resources in each cell is represented as $D_{(i,j)}$. At the beginning of the game, a certain amount of resources is allocated to each cell. Herders consume the resources in their neighbourhood either considerately or greedily, as shown in figure 2c (the payoffs they receive are the resources they consume from the environment). They consume available resources from randomly selected cells in their neighbourhood.
The resources replenish with a constant natural growth rate c. In addition, the growth rate is increased by w if a cell is occupied by a worker (otherwise w = 0). Consequently, the growth of the resource in each cell is modelled by the logistic growth model as

$$D_{(i,j)}^{t+1} = D_{(i,j)}^{t} + (c + w)\, D_{(i,j)}^{t} \left(1 - \frac{D_{(i,j)}^{t}}{D_{\max}}\right),$$

where $D_{\max}$ is the maximum amount of resources a cell can hold.
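A direct translation of this update (a sketch; d_max denotes the cell's maximum resource level):

```python
def regrow(d, c, w, d_max):
    """One logistic regeneration step for a cell's resource level d.
    w > 0 only when the cell is occupied by a worker; a fully depleted
    cell (d == 0) never recovers, as required."""
    return d + (c + w) * d * (1.0 - d / d_max)
```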
3. Results
(a) Social norms enable lifetime-learning of division of labour
In both games, two environmental parameters are varied to test the emergence of learned role distributions in five environmental variants. In the settlement maintenance game (figure 3a), these parameters are the waste accumulation and the adversarial attack probability, initialized as (wa, p) = {(0, 0), (6, 0), (0, 1), (3.2, 0.5), (6, 1)}. In the common pasture game, these parameters are the worker rate and the natural growth rate, initialized as (w, c) = {(0, 0), (0.5, 0), (0, 0.5), (0.27, 0.27), (0.5, 0.5)}. For the two environmental parameters in both games, we chose value ranges and initialized five parameter assignments: four at the limits and one in the middle of the ranges. We ran the evolutionary processes independently for each of these environmental instances multiple times (30 runs each) with complexity regularization λ = 0.2 to find social norms that could provide optimum learning in each one. We observed that, owing to the complexity regularization, multiple runs of the evolutionary processes converged to similar social norms that provide (near-)optimum group welfare. Without complexity regularization, we observed the emergence of a variety of social norms. The analysis of the effect of complexity regularization is provided in §3b, and examples of the social norms and the spatial role distributions they converged on are provided in the electronic supplementary material.
Figure 3. Social sanctions facilitate learning of role distributions in self-interested lifetime-learning individuals in various environmental instances in settlement maintenance and common pasture games. The results shown in each row correspond to distinct environmental parameter settings specified at the top of each column, wa (waste accumulation) and p (adversarial attack probability) in the settlement maintenance, and w (worker growth rate) and c (natural growth rate) in the common pasture game. In (a) and (b), different role distributions are learned locally (enforced by the decentralized social sanctions) to maximize group payoffs. (c) The change in the average group payoff and other environmental properties during the learning processes with social sanctions, and four other compared approaches.
Figure 3a shows examples of learned spatial role distributions in the five environmental variants of both games. The individuals converged on different role distributions depending on the environmental conditions. Interestingly, the emergent role distributions show clear neighbourhood patterns: roles that benefit their neighbours are spread across different neighbourhoods so as to minimize their overlap.
In the settlement maintenance game, there is no need for the cleaner and soldier roles when there is no waste accumulation and no adversarial attack (wa = 0, p = 0), so all individuals converged to the hunter role to achieve the maximum payoff. When the waste accumulation or the adversarial attack probability is increased, some individuals converge to cleaner or soldier roles, respectively, to mitigate the adverse effects of these environments. When both waste accumulation and adversarial attack probability are increased, we see the emergence of both cleaner and soldier roles.
In the common pasture game, when the natural growth and worker rates are low, there is no possibility of sustaining the resources in the environment. Therefore, we see random role selection in this case (when w = 0 and c = 0). When one or both of these factors increase, various role distributions can be learned to collect as many resources as possible while sustaining the environment. For instance, when the natural growth rate c is high, the number of (considerate) herders increases. Interestingly, an increase in the number of worker roles can also be observed even though they have no function when w = 0. This is because workers act as empty placeholder cells, which prevents additional herders from entering the environment and therefore avoids excess use of resources. When the natural growth rate is at its maximum, we can observe the appearance of some greedy herders.
In both games, self-interested lifetime-learning individuals without social sanctions, referred to as 'selfish' from now on, do not obtain good performance. In the settlement maintenance game, all the individuals converge to the forager role, since it is easier to learn because it does not depend on coordination with others. The hunter role, which can provide higher rewards, depends on coordination within the neighbourhood, so it is more difficult to discover by chance: at least five individuals would have to select it simultaneously. In the common pasture game, selfish individuals learn the greedy herder role because it has the highest immediate payoff. However, when they do, they deplete all the resources of the environment quite quickly. Both games have critical roles that do not provide any payoff on their own: cleaner and soldier in settlement maintenance, and worker in common pasture. Selfish individuals will not select these roles since they do not provide any payoff. Thus, groups of selfish individuals cannot establish a socially advantageous division of labour.
The learning performance of self-interested lifetime-learning individuals with social sanctions, referred to as 'social sanctions' from now on, and of selfish individuals was compared with three additional approaches: global, altruists and selfish/altruists. Briefly, in the global approach, the role distributions of the groups are directly optimized by evolutionary algorithms: the role of each individual in each location is defined at initialization and remains fixed throughout the individual's lifetime. In selfish/altruists, individuals aim to maximize the average payoffs of both themselves and their neighbours. Altruists aim to maximize the average payoffs of their neighbours, not including themselves. The details of these approaches are provided in the electronic supplementary material.
Figure 3c compares all five approaches with one another. In both games, the global approach provides the upper bound. This is expected, since it exploits global knowledge of the problem, which allows improving the roles at some locations independently of the others by holding the rest constant. In addition, this approach does not incur the costs of lifetime-learning that arise from trial and error. Results from the social sanctions approach were the closest to this upper bound. The selfish individuals learn to perform the role that maximizes their own payoff; however, without the other regulatory roles in the groups, the environment degrades quickly, leading to the worst performance. Altruists and selfish/altruists improve over selfish individuals but do not perform as well as social sanctions.
(b) Complexity regularization promotes convergent evolution of simpler social norms
Figure 4 shows the relationship between the average group payoffs and the complexity of the norms for different complexity regularization coefficients (λ). To facilitate comparison, the average group payoff shown is only R (see equation (2.4)); it does not include the complexity regularization term ($\lambda \lVert S \rVert_0$). The complexity of the norms is assessed by their size: social norms composed of a larger number of rules are more complex.
Figure 4. Complexity regularization promotes convergent evolution of simpler social norms. We observe that larger values of λ lead to the emergence of social norms that are less complex. This is indicated by the results of the average group payoff versus their complexity in five different environments in settlement maintenance and common pasture problems depending on different λ values. Each point shows the average of the result of 30 independent processes. Error bars show the standard error from the mean.
We observe that a larger λ value leads to the emergence of simpler rules in all environments. Furthermore, there is no statistical difference (based on the Wilcoxon rank-sum test [45] at the α = 0.05 significance level) between the average payoffs achieved with different λ values in each environment, except in Environments 3 and 4 in common pasture. In these exceptional cases, cultural evolution processes with complexity regularization (λ > 0) produced norms that successfully promote the division of labour in only a limited number of independent processes (4 and 7 out of 30 in Environments 3 and 4, respectively). Therefore, on average, the results are lower in those cases. If only the cultural evolution processes that led to a successful division of labour are considered, the results are statistically similar to the case without complexity regularization (λ = 0).
Overall, complexity regularization promotes the emergence of norms that achieve similar average group payoffs. Moreover, it promoted convergent evolution, producing similar norms at the end of independent cultural evolution processes (see the electronic supplementary material for examples of norms that emerged from independent cultural evolution processes for λ values of 0 and 0.2). These results are further elaborated in the next two sections.
(i) Settlement maintenance
In Environment 1, independent of the λ value, social norms allow groups to achieve the maximum average payoff of 6 (achieved when all individuals learn to choose the hunter role). When no weight is given to complexity regularization (λ = 0), the average number of rules in evolved social norms is more than 8. Only a few of these evolved norms converged to a similar rule set; the rest each converged to a unique and complex set of rules. On the other hand, when complexity is penalized, all the norms converge to one of two types consisting of a single rule: either 'hunter ought to discourage forager by amount X' or 'forager ought to encourage hunter by amount Y'. Here, X and Y take values from 0.03 to 2.8, with discouragement carrying a negative sign. This is quite intuitive: encouragement/discouragement between these two roles allows individuals to learn to select hunter and receive a higher payoff by cooperating with others.
Environments 2 to 5 are more complex in that they require additional roles in the groups to achieve a higher average payoff: a cleaner in Environment 2, a soldier in Environment 3, and both cleaner and soldier roles in Environments 4 and 5. In these environments, increasing λ leads to the emergence of similar rules in multiple independent cultural evolutionary processes, and we note the emergence of rules concerned with these roles in addition to the baseline norm that emerged in Environment 1. An example is a norm that emerged for Environment 2 when λ = 0.2: '(i) hunter ought to discourage forager by −1.45, and (ii) hunter ought to encourage cleaner by 0.60'. The first rule is the same as in the norms that emerged for Environment 1; the second emerged to encourage uptake of the cleaner role in the group. Another example, for Environment 4: '(i) hunter ought to discourage forager by −2.28, (ii) hunter ought to encourage cleaner by 0.57, and (iii) hunter ought to encourage soldier by 0.42'. Here, we observe the appearance of two rules regarding the cleaner and soldier roles in addition to the norm that emerged in Environment 1. Similar patterns can be observed in the results for the other environments, provided in detail in the electronic supplementary material.
Without complexity regularization, as the difficulty of the environments increases, the complexity of the emergent norms also increases. For example, complexity remains around 8 in Environment 1, showing that the size of the norms remains stable through the cultural evolution processes (the expected norm size at initialization is 8). In the other environments, however, the complexity of the norms increases throughout the processes, indicating that a larger number of rules may be required to perform well there. Nevertheless, when complexity regularization is introduced, especially when λ = 0.2, the average norm size is reduced to around 2.5 in Environments 2 and 3, and to around 3 and 3.5 in Environments 4 and 5, respectively.
In Environment 2, the simplest norms consist of two rules. Interestingly, these norms are composed of the same rule that emerged in Environment 1 to promote cooperation among hunters, plus an additional rule that promotes learning the cleaner role. A similar outcome is observed in Environment 3, where soldier roles are needed as well as cooperation among hunters to achieve the optimum average payoff. Finally, in Environments 4 and 5, the sizes of the emergent norms range from 3 to 5 rules. Here too, we observe the conjunction of the norms that emerged in Environments 1, 2 and 3, since both cleaner and soldier roles are needed in these environments.
(ii) Common pasture
It is not possible to achieve any payoff in Environment 1 because the environmental parameters allow neither natural nor worker-driven regrowth of the environmental resources. Therefore, the norms found at the end of the cultural evolutionary processes are random. However, when complexity regularization is introduced, the number of rules converges to 0: if no norm is helpful, having no norm is best.
In Environment 2, we observe convergent evolution of the same type of social norm, composed of two rules: greedy (herders) encourage considerate (herders), and considerate (herders) encourage considerate (herders). This norm facilitates the emergence of role distributions consisting mainly of considerate herders with some greedy herders spread around. This pattern allows the greedy herders to exploit the environmental resources without disrupting sustainability.
In Environment 3, only four independent cultural evolution processes produced social norms that led to sustainable use of environmental resources. These social norms are composed of partially different rules, although they have similar sizes and produce similar division-of-labour patterns except in one case (provided in the electronic supplementary material). Similarly, in Environment 4, only seven of the processes produced social norms that promote sustainable use of resources; these norms partially share the same rules. In Environment 5, in many cases, the norms are composed of two rules and converged to the same rules.
4. Discussion
Human ecological success has been underpinned by our species’ ability to overcome the challenges of group collaboration. In these situations, the individuals who compose a group often need to divide labour and perform various specialized roles collaboratively. However, in groups consisting of self-interested individuals that aim to maximize their own payoff, establishing such collaboration may not be straightforward. In this work, we focused on the problem of learning such a division of labour in groups of self-interested lifetime-learning individuals. We modelled social norms as decentralized social sanctioning patterns and studied how they can encourage individuals to cooperate. We also proposed a regularized cultural group evolution model to study how such norms that enable the learning of prosocial roles can be established. We demonstrated this learning problem, and how our model of social norm evolution resolves it, for two different spatial games. In both cases the emergent social norms discovered by our cultural group evolution model were successful in incentivizing individuals to learn to collaborate with one another.
There is an extensive literature on the evolution of cooperative strategies in populations of self-interested individuals [11,46–48]. Some authors, like us, studied the emergence of cooperation in spatial game settings [49–51]. However, this line of work usually does not model lifetime-learning. We modelled individual-level lifetime-learning in conjunction with group-level norm evolution. In our model, the two levels were linked because group-level norms guide the lifetime-learning of individuals. At the same time, whether a newly innovated norm undergoing testing makes the cut and becomes established depends on its guiding the learning of all individuals to a more socially advantageous outcome. Wang et al. [52] took an analogous two-level strategy, integrating individual reinforcement learning with an evolutionary process, which in their case determined the individuals’ reward functions. They showed that group selection was needed to get reward functions that incentivize altruistic behaviour to evolve, a result that supports a key assumption of our model: that social norms evolve on the level of groups.
Most evolutionary game theory-based models share a common assumption that individuals make one choice and stick with it throughout their lives [53]. Others model the evolution and spread of successful strategies [54,55] through social learning over a population of individuals. By contrast, the cultural evolutionary process in our model operates on the social norm level. Additionally, individuals are endowed with the capability of changing their behaviour based on their own experience. Therefore, we can assume that individuals can re-select what role to play over and over again throughout their lives. As a result, our model is more about deciding what one wants to do today than it is about deciding what to do with one's life. It should correspond best to situations where roles may be revisited each day. There are likely many such roles in a foraging society. For instance, one could hunt one day and gather the next day. By contrast, the evolutionary game theory approaches where role definition is specified genetically (since this can make it irrevocable) may be better suited to modelling situations faced by modern people who must decide which of several highly specialized careers to pursue, e.g. doctor or lawyer. In these situations individuals find it difficult to switch roles after investing considerable time into their training. This is why results from studies of games ostensibly similar to those we study, where a surplus may be obtained if all roles are filled but some roles pay better than others, have been interpreted in the evolutionary game theory context as speaking to the emergence of social stratification and inequality between social classes or castes [56]. However, even though we allow individuals in our model to change roles at every time step, we still find that reinforcement learning converges to a stable choice of roles. So our model could still be used to study social stratification and inequality. On the other hand, it is worth pointing out that, unlike models such as [56], we do not require individuals to engage in social learning, nor do we need to posit accumulating differences in social capital between groups as assumed by other models (e.g. [57]). In our model, agents only learn from their own direct experience in different roles and from normative sanctioning, but this is still sufficient for stratification to emerge at convergence.
The game-theoretic literature on social norms has pursued two broad categories of mechanisms by which norms may affect cooperation. They are (A) transforming the payoffs (via sanctioning or equivalent) into a new effective game with different Nash equilibria, e.g. making mutual cooperation an equilibrium in a transformed game derived from the Prisoners’ Dilemma [58,59] and (B) equilibrium selection. When social norms are regarded as equilibrium selection devices, the assumption is that many Nash equilibria are possible and would be stable if achieved, and that some of them are quite bad (e.g. tragedy of the commons situations). Society must navigate its way to the good equilibria while avoiding the bad ones. In this line of work, the norm is regarded as a piece of public knowledge that individuals may condition their behaviour on in order to rationally coordinate with one another [60–63]. For instance, a focal point is a ‘default choice’ from which it would be irrational to deviate when one knows others are trying to coordinate [64]. Many models incorporate aspects of both mechanisms (e.g. [21,65]). Our approach resonates more strongly with approach A. The two games we studied, settlement maintenance and common pasture, are social dilemmas. That is, they contain socially deficient Nash equilibria, and purely egoistic, utility-maximizing agents do not learn to cooperate; they never take on the unpaid-but-critical roles. In our model, social norms are patterns of social sanctioning that affect the rewards individuals experience. We can see them as transforming the game into a different one where it is profitable to select the unpaid-but-critical roles because other individuals positively sanction that choice, effectively paying the agents who choose them for the service they provide to the group. It is thus critical in our model that the norms themselves do not evolve within individuals, and individuals cannot unilaterally decide whether to sanction or not. Norms must evolve on the level of the cultural group, using a ‘fitness’ that incorporates the well-being of all the group’s individuals; otherwise they could not incentivize individuals to choose dominated strategies, as they must to achieve a stable and socially advantageous division of labour [32].
Ethics
This work did not require ethical approval from a human subject or animal welfare committee.
Data accessibility
Data are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.dv41ns24r [66].
Supplementary material is available online [67].
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors' contributions
A.Y.: conceptualization, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft, writing—review and editing; J.Z.L.: conceptualization, investigation, methodology, validation, writing—original draft, writing—review and editing; G.I.: methodology, validation, writing—review and editing; S.W.L.: conceptualization, formal analysis, funding acquisition, investigation, methodology, validation, writing—original draft, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed herein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
S.W.L. was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korean government (MSIT) (no. RS-2023-00233251, System3 Reinforcement learning with high-level brain functions, and no. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), and by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2019M3E5D2A01066267).