Evolution of reciprocity with limited payoff memory

Direct reciprocity is a mechanism for the evolution of cooperation in repeated social interactions. According to the literature, individuals naturally learn to adopt conditionally cooperative strategies if they have multiple encounters with their partner. Corresponding models have greatly facilitated our understanding of cooperation, yet they often make strong assumptions on how individuals remember and process payoff information. For example, when strategies are updated through social learning, it is commonly assumed that individuals compare their average payoffs. This would require them to compute (or remember) their payoffs against everyone else in the population. To understand how more realistic constraints influence direct reciprocity, we consider the evolution of conditional behaviours when individuals learn based on more recent experiences. Even in the most extreme case that they only take into account their very last interaction, we find that cooperation can still evolve. However, such individuals adopt less generous strategies, and they cooperate less often than in the classical setup with average payoffs. Interestingly, once individuals remember the payoffs of two or three recent interactions, cooperation rates quickly approach the classical limit. These findings contribute to a literature that explores which kind of cognitive capabilities are required for reciprocal cooperation. While our results suggest that some rudimentary form of payoff memory is necessary, it suffices to remember a few interactions.


Introduction
Evolutionary game theory describes the dynamics of populations when an individual's fitness depends on the traits or strategies of other population members (1-4). This theory can be used to describe the dynamics of animal conflict (5), of cancer cells (6), and of cooperation (7). Respective models translate strategic interactions into games (8). These games specify how individuals (players) interact, which strategies individuals can choose, and what fitness consequences (or payoffs) the different strategies have. In addition, these models also specify the mode by which successful strategies spread over time. In models of biological evolution, individuals with a high fitness produce more offspring; in models of cultural evolution, such individuals are imitated more often. Although biological and cultural evolution are sometimes treated as equivalent, there can be important differences (9-11). For example, models of biological evolution do not require individuals to have any particular cognitive abilities. Here, it is the evolutionary process itself that biases the population towards strategies with higher fitness. In contrast, in models of cultural evolution, individuals need to be aware of the different strategies present in the population, and they need to identify those strategies with a higher payoff. As a consequence, evolutionary outcomes may depend on how easily different behaviors can be learned (12), and on how easy payoff comparisons are.
These difficulties in learning strategies by social imitation are particularly pronounced in models of direct reciprocity. This literature follows Trivers' insight that individuals have more of an incentive to cooperate in social dilemmas when they interact repeatedly (13). In repeated interactions, individuals can condition their behavior on their past experiences with their interaction partner. They may use strategies such as Tit-for-Tat (14, 15) or Generous Tit-for-Tat (16, 17) to preferentially cooperate with other cooperators. Such conditional strategies approximate human behavior fairly well (18-22), and they have also been documented in several other species (23-25), although direct reciprocity is generally more difficult to demonstrate in animals (26-28). However, at the outset, it is not clear how easy it is to learn reciprocal strategies by social imitation. As one obstacle, even if others' strategies are perfectly observable, individuals might find it difficult to identify which ones have the highest payoff. After all, the payoff of a strategy of direct reciprocity is not determined by the outcome of any single round. Rather, it is determined by how well this strategy fares over an entire sequence of rounds, against many different population members. In practice, such information might be both difficult to obtain and to process.
Most models of direct reciprocity abstract from these difficulties (29-46). They simply assume individuals can easily copy the strategies of others. Similarly, they assume that updating decisions are based on the strategies' average (or expected) payoffs, which are based on all rounds and all interactions. These assumptions create a curious inconsistency in how models represent an individual's cognitive abilities. On the one hand, when playing the game, individuals are often assumed to have restricted memory. Respective studies typically assume that individuals make their decisions each round based on the outcome of the last round only (with only a few exceptions, see Refs. 47-51). Yet when learning new strategies, individuals are assumed to remember (or compute) each other's precise average payoff across many rounds and many interaction partners. Herein, we explore whether this latter assumption is actually necessary for the evolution of reciprocity through social imitation. We ask whether individuals can learn to adopt reciprocal strategies even when learning is based on payoff information from a limited number of rounds.
To explore that question, we theoretically study imitation dynamics in the repeated prisoner's dilemma, using two extreme scenarios. The first scenario is the usual modeling approach. Here, individuals update their strategies based on their expected payoffs. We contrast this model with an alternative scenario where individuals update their strategies based on the very last (one-shot) payoff they obtained. We find that individuals with limited payoff memory tend to adopt less generous strategies. Yet moderate levels of cooperation can still evolve. Moreover, as we increase the individuals' payoff memory to include the last two or three one-shot payoffs, cooperation rates quickly approach the rates observed in the classical baseline case.
Overall, these findings suggest that while memory is important, already minimal payoff information may suffice for the evolution of direct reciprocity based on social learning. They also suggest that the classical model of reciprocity (based on expected payoffs) can often be interpreted as a useful approximation to more realistic models that include cognitive constraints.

Model and Methods
To explore the impact of limited payoff memory, we adapt existing models of the evolution of direct reciprocity. These models involve two different time scales. The short time scale describes the game dynamics. Here, individuals with fixed strategies are randomly matched to interact with each other in repeated social dilemmas. The long time scale describes the evolutionary dynamics. Here, individuals can update their repeated-game strategies based on the payoffs they yield. In the following, we introduce the basic setup of our model; all details and derivations are described in the electronic supplementary material.
Description of the game dynamics. We consider a well-mixed population consisting of N players. Players are randomly matched in pairs to participate in a repeated donation game (52) with their respective co-player. Each round, they can either cooperate (C) or defect (D). A cooperating player provides a benefit b to the other player at their own cost c, with 0 < c < b. A defecting player provides no benefit and pays no cost. Thus, the players' payoffs in a single round are given by the matrix

          C     D
    C ( b − c   −c )
    D (   b      0 )

In particular, payoffs take the form of a prisoner's dilemma: mutual cooperation yields a better payoff than mutual defection (b − c > 0), but each player individually prefers to defect independent of the co-player's action (b > b − c and 0 > −c). To incorporate that individuals interact repeatedly, we assume that after each round, there is a constant continuation probability δ of interacting for another round. For δ = 0, we recover the case of a conventional (one-shot) prisoner's dilemma. Here, mutual defection is the only equilibrium. As δ increases, the game turns into a repeated game. Here, additional equilibria emerge, with some of them allowing for full cooperation (53-56).
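For concreteness, the one-shot payoffs of the donation game can be written out as a short sketch. This is our own illustration, not code from the study; the function name and the example values b = 3, c = 1 (the values used later in the simulations) are our choices.

```python
def donation_payoff(my_move, other_move, b=3.0, c=1.0):
    """One-shot donation-game payoff for the focal player.

    Cooperating ('C') pays cost c; a cooperating co-player confers benefit b.
    """
    payoff = 0.0
    if my_move == "C":
        payoff -= c          # cooperation is costly for the actor
    if other_move == "C":
        payoff += b          # benefit comes only from the co-player
    return payoff

# The four one-shot outcomes of the prisoner's dilemma (b = 3, c = 1):
print(donation_payoff("C", "C"))  # mutual cooperation: b - c = 2.0
print(donation_payoff("C", "D"))  # cooperating against a defector: -c = -1.0
print(donation_payoff("D", "C"))  # defecting against a cooperator: b = 3.0
print(donation_payoff("D", "D"))  # mutual defection: 0.0
```

The dilemma structure is visible directly: whatever the co-player does, defecting improves the actor's own payoff by c, yet mutual cooperation beats mutual defection by b − c.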
In a one-shot donation game, players can only choose among two pure strategies (they can either cooperate or defect). In the repeated game, strategies can become arbitrarily complex. Here, strategies are contingent rules, telling players what to do depending on the outcome of all previous rounds. For simplicity, in the following we assume individuals use reactive strategies (17). A reactive strategy only depends on the other player's action in the last round. Such strategies can be written as a three-dimensional tuple s = (y, p, q). The first entry y is the probability that the player opens with cooperation in the first round. The two other entries are the probabilities that the player cooperates in all subsequent rounds, depending on whether the co-player cooperated (p) or defected (q) in the previous round. The set of reactive strategies is simple enough to facilitate an explicit mathematical analysis (1). Yet it is rich enough to capture several important strategies of repeated games. For example, it contains ALLD = (0, 0, 0), the strategy that always defects. Similarly, it contains Tit-for-Tat, TFT = (1, 1, 0), the strategy that copies the co-player's previous action (and that cooperates in the first round). Finally, it contains Generous Tit-for-Tat, GTFT = (1, 1, q), where q > 0 reflects a player's generosity in response to a co-player's defection (16, 17).
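The game dynamics described above can be sketched in a few lines. This is a minimal illustration under our own naming conventions (not the authors' simulation code): two reactive strategies s = (y, p, q) play a repeated donation game whose length is drawn via the continuation probability δ.

```python
import random

def play_repeated_game(s1, s2, delta=0.99, rng=random):
    """Simulate one repeated game between two reactive strategies
    s = (y, p, q); returns the list of (move1, move2) pairs per round."""
    (y1, p1, q1), (y2, p2, q2) = s1, s2
    # opening round: cooperate with probability y
    m1 = "C" if rng.random() < y1 else "D"
    m2 = "C" if rng.random() < y2 else "D"
    history = [(m1, m2)]
    while rng.random() < delta:  # another round with probability delta
        # each player reacts only to the co-player's previous move
        m1_next = "C" if rng.random() < (p1 if m2 == "C" else q1) else "D"
        m2_next = "C" if rng.random() < (p2 if m1 == "C" else q2) else "D"
        m1, m2 = m1_next, m2_next
        history.append((m1, m2))
    return history

# Two TFT players that both open with cooperation cooperate in every round:
TFT = (1.0, 1.0, 0.0)
history = play_repeated_game(TFT, TFT)
assert all(pair == ("C", "C") for pair in history)
```

Because TFT's entries are all 0 or 1, this check is deterministic; with interior probabilities (e.g., GTFT's q) the same function produces the stochastic play assumed in the model.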
In the short run, the players' strategies are taken to be fixed. Players use their strategies to decide whether to cooperate in a series of repeated games against all other population members. In the long run, however, the players' strategies may change depending on the payoffs they yield, as we describe in the following.
Description of the evolutionary dynamics. Herein, we assume population members update their strategies based on social learning. To model these strategy updates, we consider a pairwise comparison process (57).
This process assumes that at regular time intervals, one population member is randomly selected and given the chance to revise its strategy. We refer to this player as the 'learner'. With probability µ (reflecting a mutation rate), the learner simply adopts a random strategy (all reactive strategies have the same probability to be chosen). With the converse probability 1 − µ, the learner randomly picks a 'role model' from the population. The learner then compares its own payoff π_L from the repeated game to the role model's payoff π_RM. The learner adopts the role model's strategy with a probability φ described by a Fermi function (58, 59),

    φ = 1 / (1 + exp(−β (π_RM − π_L))).     (2)

The selection strength parameter β ≥ 0 indicates how sensitive players are to payoff differences. For β = 0, payoff differences are irrelevant, and the learner simply adopts the role model's strategy with probability one half. As the selection strength β increases, players are increasingly biased to imitate the role model only if it has the higher payoff.

We deviate from previous models in how we interpret the payoffs π_L and π_RM, which form the basis of the pairwise comparisons in Eq. (2). In previous work, these payoffs are taken to be the respective players' expected payoffs. We interpret that setup as a model with perfect payoff memory. There, the payoffs π_L and π_RM represent an average over all possible repeated games the two individuals have played with all population members (Figure 1, upper left panel). The use of expected payoffs is mathematically convenient, because explicit formulas for these payoffs are available (1). Herein, we compare this model of perfect payoff memory to a model with limited payoff memory. In that model, the players' payoffs π_L and π_RM are taken to be the payoffs that each player received in their very last round prior to making social comparisons. That is, players only consider the very last repeated game they participated in, and there they only take into account the outcome of the very last round (Figure 1, lower left panel). This assumption could reflect, for example, a strong recency bias in how individuals evaluate payoffs. In addition to this extreme case of limited payoff memory, later on we also explore cases in which players take into account the outcome of two, three, four, or more recent rounds.
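The Fermi imitation rule of Eq. (2) is simple to state in code. The sketch below is our own illustration of the standard pairwise comparison rule; the limiting behaviors for β = 0 and large β match the properties described in the text.

```python
import math

def imitation_probability(pi_learner, pi_role_model, beta=1.0):
    """Fermi rule, Eq. (2): probability that the learner copies the
    role model, given both payoffs and selection strength beta >= 0."""
    return 1.0 / (1.0 + math.exp(-beta * (pi_role_model - pi_learner)))

# beta = 0: payoffs are irrelevant, the learner copies with probability 1/2
assert imitation_probability(2.0, 0.0, beta=0.0) == 0.5
# strong selection: imitation becomes almost deterministic
assert imitation_probability(0.0, 2.0, beta=100.0) > 0.999  # role model better
assert imitation_probability(2.0, 0.0, beta=100.0) < 0.001  # learner better
```

Note that the rule is never fully deterministic for finite β: a learner occasionally copies a worse-performing role model, which is what allows the drift dynamics discussed in the Results.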
Both in the case of perfect and limited memory, we iterate the elementary strategy-updating step described above for many time steps. This gives rise to a stochastic process that describes which strategies players adopt over time. We explore the dynamics of this process mathematically and with computer simulations. For the results presented in the following, we assume that mutations are rare (µ → 0). This assumption is fairly common in evolutionary game theory, because it makes some computations more efficient (60-62), and because the results can be interpreted more easily. However, in Section 3 of the electronic supplementary material we show that our main results continue to hold for strictly positive mutation rates.

Results
Stability of cooperative populations. To get some intuition for the differences between perfect and limited payoff memory, we first analyze when cooperation is stable in either scenario. To this end, we consider a resident population in which all players but one adopt a strategy of Generous Tit-for-Tat, GTFT = (1, 1, q). The remaining mutant player adopts ALLD. We say cooperation is stochastically stable if the single mutant is more likely to imitate the residents than vice versa. For simplicity, we consider a large population (N → ∞) and strong selection (β → ∞). More general results are derived in the electronic supplementary material.
In the case of perfect payoff memory, it is straightforward to characterize when cooperation is stochastically stable. Here, we simply need to compute the players' expected payoffs. Because the population mostly consists of residents, and because residents mutually cooperate with each other, their expected payoff is π_GTFT = b − c. On the other hand, the defecting mutant only interacts with residents. Given the residents' strategy, the mutant receives a benefit in the first round, and in every subsequent round with probability q. As a result, the mutant's expected payoff is π_ALLD = (1 − δ + δq)b. For perfect payoff memory, the requirement for cooperation to be stochastically stable reduces to the condition π_GTFT > π_ALLD. This yields

    q < 1 − c/(δb).     (3)

In particular, we recover the previous observation that q = 1 − c/(δb) is the maximum generosity that cooperators should have (16, 17, 63). Because q ≥ 0, we also conclude that cooperation can only be stable if δ > c/b. Again, this condition for the feasibility of direct reciprocity is the condition found in the literature (7).
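The algebra behind condition (3) can be checked numerically. The sketch below (our own helper names) encodes the two expected payoffs from the text and verifies that they coincide exactly at the generosity threshold q = 1 − c/(δb).

```python
def perfect_memory_threshold(b, c, delta):
    """Maximum generosity q for which GTFT resists an ALLD mutant
    under perfect payoff memory, condition (3): q < 1 - c/(delta*b)."""
    return 1.0 - c / (delta * b)

def expected_payoffs(b, c, delta, q):
    """Expected payoffs in a GTFT = (1, 1, q) resident population
    containing a single ALLD mutant."""
    pi_gtft = b - c                          # residents mutually cooperate
    pi_alld = (1.0 - delta + delta * q) * b  # mutant exploits q-generosity
    return pi_gtft, pi_alld

# At the threshold, resident and mutant expected payoffs coincide:
b, c, delta = 3.0, 1.0, 0.99
q_star = perfect_memory_threshold(b, c, delta)
pi_gtft, pi_alld = expected_payoffs(b, c, delta, q_star)
assert abs(pi_gtft - pi_alld) < 1e-12
# Below the threshold, residents do strictly better:
assert expected_payoffs(b, c, delta, 0.5 * q_star)[0] > \
       expected_payoffs(b, c, delta, 0.5 * q_star)[1]
```

Substituting q* = 1 − c/(δb) into π_ALLD = (1 − δ + δq*)b indeed gives b − c, which is the equality the assertion tests.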
The logic of the case with limited payoff memory is somewhat different. Here we need to compute how likely each player is to obtain each of the four possible payoffs {b − c, −c, b, 0} in the very last round of a game, before they make social comparisons. Because residents almost always interact with other residents, their last one-shot payoff is π_GTFT = b − c almost surely. For the defecting mutant, there are two possibilities. (i) If the mutant's co-player happens to cooperate in the last round, the mutant receives π_ALLD = b. This case occurs with probability 1 − δ + δq. (ii) If the co-player defects in the last round, the mutant receives π_ALLD = 0.
This occurs with the converse probability δ(1 − q). Because b − c < b, residents tend to imitate the mutant in the first case. Because b − c > 0, mutants tend to imitate the resident in the second case. Cooperation is stochastically stable if the first case is less likely than the second. This yields the condition

    1 − δ + δq < δ(1 − q), or equivalently, q < 1 − 1/(2δ).     (4)

Interestingly, this condition no longer depends on the exact payoff values c and b. This independence arises because of our assumption of strong selection, in which case only the payoff ordering b > c > 0 matters. Because q is non-negative, condition (4) can only be satisfied if δ > 1/2. That is, players need to interact for more than two rounds in expectation.
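The probability 1 − δ + δq that the mutant's last-round co-player cooperates, which drives condition (4), can be verified by Monte Carlo. The sketch below (our own illustration, with assumed parameter values) simulates an ALLD mutant against a GTFT = (1, 1, q) resident and records whether the resident cooperated in the final round.

```python
import random

def mutant_last_round_payoff(q, delta, b=3.0, rng=random):
    """Last-round payoff of an ALLD mutant in one repeated game
    against a GTFT = (1, 1, q) resident."""
    coplayer_cooperates = True  # GTFT opens with cooperation (y = 1)
    while rng.random() < delta:
        # in every later round, GTFT answers the mutant's defection
        # with cooperation only with probability q
        coplayer_cooperates = rng.random() < q
    return b if coplayer_cooperates else 0.0

rng = random.Random(1)
q, delta, trials = 0.3, 0.9, 200_000
freq_b = sum(mutant_last_round_payoff(q, delta, rng=rng) > 0
             for _ in range(trials)) / trials
# Analytical probability that the mutant ends on the high payoff b:
assert abs(freq_b - (1 - delta + delta * q)) < 0.02
```

For these values the analytical probability is 0.37, still above the resident-favoring threshold of 1/2, so a mutant with q = 0.3 and δ = 0.9 would be resisted, consistent with q < 1 − 1/(2δ) ≈ 0.44.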
By comparing the two cases, we find that payoff memory affects whether a conditionally cooperative strategy (1, 1, q) is viable. With perfect memory, the maximum generosity q needs to satisfy Eq. (3). In particular, this generosity can become arbitrarily close to one, provided the game's benefit-to-cost ratio b/c and the continuation probability δ are sufficiently large. In contrast, with limited payoff memory, the maximum generosity is bounded by one half, and it is independent of the benefit-to-cost ratio.
Evolutionary dynamics of reciprocity. To explore whether the previous static observations describe the dynamics of evolving populations, we turn to simulations. We have run separate simulations for perfect and limited payoff memory. In each case, we consider both a low and a high benefit of cooperation (b/c = 3 and b/c = 10, respectively). For each simulation, we record which strategies (y, p, q) the players adopt over time.
Figure 1 depicts the conditional cooperation probabilities p and q (we omit the opening move y because we use a continuation probability δ close to one, such that first-round behavior is largely irrelevant). In all simulations, we find that the players' strategies cluster in two regions of the strategy space. The first region corresponds to a neighborhood of ALLD, with (p, q) ≈ (0, 0). The second region corresponds to a thin strip of cooperative strategies, with (p, q) ≈ (1, q). Within this strip, we observe that most strategies satisfy the constraints on q imposed by the inequalities (3) and (4). That is, with perfect memory, most evolving strategies have q < 1 − c/b, whereas with limited payoff memory, most strategies have q < 1/2. In particular, for limited payoff memory, changes in the benefit parameter have no effect on the qualitative distribution of strategies.
In each case, the evolutionary dynamics follow a similar cyclic pattern (as described in Refs. 17, 33): Resident populations of defectors are most likely invaded by strategies close to TFT. Once the population adopts conditionally cooperative strategies (1, 1, q), neutral drift may introduce larger values of the generosity q. If the residents' generosity q violates conditions (3) and (4), defectors can re-invade, and the cycle starts again. The relative time spent near ALLD and near the strip of conditionally cooperative strategies depends on the considered memory setting (Supplementary Figure 3, depicting the case of high benefits). For perfect memory, we find that ALLD is replaced relatively quickly by more cooperative strategies. Here, it takes on average 159 invading mutants until ALLD is successfully replaced. In contrast, for limited memory, ALLD is more robust, resisting on average 798 mutant strategies. This picture reverses when we consider an initial population that adopts GTFT. Such populations are much more robust under perfect memory than they are under limited memory. Overall, we find that the impact of memory on the population's average cooperation rate is substantial. For perfect memory, this rate is 52% for low benefits, and 98% for high benefits. For limited payoff memory, the evolving cooperation rates are smaller but still strictly positive, with 37% cooperation for low benefits and 51% cooperation for high benefits (Figure 1).
To further investigate the influence of different parameters, we have systematically varied the benefit b and the selection strength β in Figure 2. According to Figure 2a, perfect memory consistently results in a higher cooperation rate, and this relative advantage further increases with an increasing benefit b. Interestingly, for limited payoff memory, the cooperation rate remains stable at approximately 50% once b ≥ 5. This again reflects our earlier observation that the feasibility of cooperation in this scenario is largely independent of the exact values of b and c, as described by Eq. (4). With respect to the effect of different selection strengths, Figure 2b suggests that both perfect and limited payoff memory yield similar cooperation rates for weak selection (β < 1). Beyond weak selection, increasing selection strength has a positive effect under perfect payoff memory, but a negative effect under limited payoff memory.
Beyond reactive strategies. While the results presented in the main text focus on reactive strategies, the patterns we observe do not seem to depend on the considered strategy space. To illustrate this point in more detail, in the electronic supplementary material we consider the dynamics among memory-1 strategies. Here, players take into account both their co-player's and their own last move, see Refs. (64, 65). Also in that case, we observe that perfect memory leads to systematically higher cooperation rates (Supplementary Figures 5, 6). Again, this advantage of perfect memory is particularly pronounced for strong selection, or when there is a high benefit of cooperation (Supplementary Figure 7).
The effect of increasing individual payoff memory. So far, we have taken a rather extreme interpretation of limited payoff memory. In the respective scenario, we assumed that individuals update their strategies based on their experience in a single round of the prisoner's dilemma, against a single co-player. The limited payoff memory framework can be expanded in various ways. In particular, individuals may recall a larger number of rounds, they may recall their interactions with several co-players, or both. To gain further insights into the impact of payoff memory, we explore four additional scenarios. In the first scenario, players recall the payoffs they obtained in the last two rounds against a single co-player. In the second scenario, players recall their last-round payoffs against two co-players. In the third scenario, they recall the two last rounds against two co-players. Finally, in the last scenario, players update based on the average payoff they receive over all rounds with a single co-player (further extensions are possible, but we do not explore them here).
For most scenarios, we can again derive an analytical condition for when cooperation is stochastically stable. As before, we assume populations are large and that selection is strong. For simplicity, we also assume that the game continues almost certainly after each round (i.e., δ approaches one). The details of this analysis can be found in the electronic supplementary material. In the first two scenarios, we interestingly find that for b > 2c, cooperation is stochastically stable when q < √2/2 ≈ 0.707. Comparing this condition with the more stringent condition in Eq. (4) suggests that there are now more conditionally cooperative strategies that can sustain cooperation. Hence, cooperation should evolve more easily. In the last scenario, we find that cooperation is stochastically stable when q < 1 − c/b, which is the same condition as in Eq. (3), even though only a single co-player is considered instead of the whole population.
We complement these analytical results with additional simulations, see Figure 3. We observe that a minimal increase in the players' payoff memory (compared to the baseline case with a single recalled round) can promote cooperation considerably. Specifically, in all four scenarios with extended memory, we see similar cooperation rates, and they approach the rates observed under perfect memory. These results suggest that while it takes some payoff memory to sustain substantial cooperation rates, the requirements on memory seem to be rather modest. Already remembering a few interactions, either with the same co-player or across different co-players, may provide players with enough information to adopt reciprocal strategies.

Discussion
In economics, if payoff is measured in terms of money and a decision is to be made at time t in the future, then currency accumulated early on is weighted more heavily in that decision because it has more time to accumulate interest. Such a model discounts the future relative to the past. It also does not necessarily require "memory", because rewards are accumulated into a factor used in decision-making; the specific time stamps of rewards do not themselves provide better information beyond their effects on total payoff. In a similar fashion, the probabilistic interpretation of discounting as a continuation probability (15) also effectively discounts the future relative to the past. A foraging animal deciding between two behaviors might tend to choose the one that yields a moderate reward sooner relative to a larger reward later, since earlier rewards (e.g., food) contribute to immediate survival, and there is no guarantee that later rewards will happen at all (66).
Within the context of a single repeated game, the model we consider here is, in some ways, dual to the classical model of temporally discounted rewards in repeated games. Instead of making decisions based on expected rewards in the future, we consider individuals who make decisions based on actual rewards in the past. Thus, the ability to estimate the future payoff of a strategy is replaced by the memory of how this strategy previously fared against others. This involves two time scales: interaction partners, and rounds within those interactions. As a result, we are dealing with a model that discounts the past rather than the future.
Intriguingly, treating payoffs in this manner is reminiscent of the reward-smoothing technique of "eligibility traces" in reinforcement learning (67), which uses past rewards (discounted appropriately) to shape the presently perceived payoff. There is a sound basis for this method in neuroscience, where rewards and (temporal-difference) learning are associated with dopaminergic neurons (68) and spike-timing-dependent plasticity (69). This suggests a more biologically encoded interpretation of memory, which is equally applicable to models of direct reciprocity where rewards have a neurological basis.
Of course, the precise nature of "memory" also depends on what payoffs in a game represent, which should be taken into account when applying game-theoretic models. For example, a payoff stream of monetary currency might truly accumulate and not require memory on the part of agents. Even in the context of money, however, not all of what was obtained in the past is necessarily available at the time a decision is made, which brings memory into play. The serial position effect in human psychology shows that in an ordered list of items (e.g., words), humans tend to have difficulty remembering entire sequences, demonstrating moderate recall for items coming earlier (the primacy effect), substantial recall for those coming later (the recency effect), and lower recall for those in between (70). It is therefore reasonable that when presented with a stream of payoffs, whether on the timescale of pairings for repeated interactions or in a stream of one-shot games, players might be able to effectively incorporate only the most recent payoffs.
In fact, even beyond specific psychological considerations, a curious interpretation of payoffs arises from the formula commonly used for expected payoffs in repeated games. If δ ∈ [0, 1) is the probability of continuing to another round in the game, then the expected payoff to an agent is (1 − δ) Σ_{t=0}^∞ δ^t u_t, where u_t is the reward the agent receives in the stage game at time t. Here, additional stochasticity arises due to uncertainty in the game length, and an agent might not be able to compute his or her expected payoff for use in decision-making. The probability that the game terminates after the interaction at time T is δ^T (1 − δ), in which case (1 − δ) Σ_{t=0}^∞ δ^t u_t is exactly the expected payoff the agent receives at time T, i.e., in the last round. As an unbiased estimator of this expectation, the agent might thus use u_T as a proxy for "success" when evaluating his or her behavior. This gives a purely model-driven justification for why considering the payoff in the last round of the game can result in more realistic extensions of traditional models.
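The unbiasedness claim is easy to check numerically: averaging the last-round payoff u_T over many geometrically distributed game lengths T recovers the normalized discounted sum. The sketch below is our own illustration, using an arbitrary (truncated) payoff stream; the truncation error is of order δ^200 and thus negligible here.

```python
import random

def normalized_discounted_payoff(u, delta):
    """(1 - delta) * sum_t delta^t * u_t for a finite payoff stream u."""
    return (1.0 - delta) * sum(delta**t * u_t for t, u_t in enumerate(u))

def sample_last_round_payoff(u, delta, rng):
    """Draw a geometric game length T (capped at len(u)-1) and
    return the last-round payoff u_T."""
    t = 0
    while rng.random() < delta and t < len(u) - 1:
        t += 1
    return u[t]

rng = random.Random(0)
u = [3.0, -1.0, 2.0, 0.0, 2.0] * 40   # arbitrary stage-game rewards
delta = 0.9
estimate = sum(sample_last_round_payoff(u, delta, rng)
               for _ in range(200_000)) / 200_000
# The Monte Carlo mean of u_T matches (1 - delta) * sum delta^t * u_t:
assert abs(estimate - normalized_discounted_payoff(u, delta)) < 0.05
```

This is exactly the sense in which a player with a one-round payoff memory uses an unbiased, if noisy, estimate of the classical expected payoff.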
We note that the expected total payoff in the game, E[u_0 + u_1 + ··· + u_T], is given by Σ_{t=0}^∞ δ^t u_t. This version of "expected payoff" appears less common in the literature on direct reciprocity than its normalization, (1 − δ) Σ_{t=0}^∞ δ^t u_t, likely owing to the fact that non-normalized payoffs can grow arbitrarily large with sufficiently long time horizons (δ → 1−). Non-normalized payoffs also interfere with the selection intensity β in models of social imitation, which is presumably another reason they appear less frequently in the literature. On that point, we note that many of the differences between realized and expected payoffs disappear in the limit of weak selection (71). Non-weak selection, in contrast, can introduce substantial differences between models with realized and expected payoffs (72), which is especially important to understand in models of social systems with cultural transmission (73).
Our main contribution is an application of these ideas to direct reciprocity, which is one of the key mechanisms to explain why unrelated individuals might cooperate (7). According to this mechanism, cooperation pays if it makes the interaction partner more cooperative in the future. To describe which strategies are most effective, the previous theoretical literature assumes that the evolutionary dynamics are driven by the players' expected payoffs (29-44). To the extent that strategies are learned (not inherited), this assumption imposes rather stringent requirements on the individuals' cognitive abilities. In the most extreme case, it would require individuals to remember (or compute) their payoffs against all population members, for all possible ways in which their repeated games may unfold. This assumption introduces a curious inconsistency in how these models represent an individual's cognitive abilities. For playing their games, individuals are often assumed to only recall the outcome of the very last round. Yet to update their strategies, individuals are implicitly assumed to have a record of the outcomes of all rounds, across all interaction partners.
It is natural to ask, then, to what extent perfect payoff memory is in fact required for the evolution of reciprocity. To this end, we consider a model in which individuals only remember the payoff of their very last interaction, or the payoffs of the last few interactions. By only considering an individual's most recent experiences, the evolutionary process is subject to additional stochasticity. Strategies that perform well on average (across an entire repeated game and across many interaction partners) may still be replaced if the respective player happened to yield an inferior payoff in the very last round. A similar element of stochasticity has been previously explored in the context of one-shot (non-repeated) games (74-78). This literature studies which strategies are selected for when individuals only interact with a finite sample of population members. In the respective models, individuals can only choose among two strategies. They can either cooperate or defect, and stochastic sampling affects which of these two strategies is favored. In repeated games, by contrast, players have access to a large set of strategies (in our case, all reactive strategies; 79). Here, stochastic sampling does not only affect whether cooperative or non-cooperative strategies are favored; it also affects which conditionally cooperative strategies are favored.
To address these questions, we combine analytical methods and computer simulations. In the most extreme case, we consider individuals who update their strategies based on only one piece of information: the last round of a single repeated game. For that case, we find that individuals are less generous, and they tend to be less cooperative overall (Figure 1). However, once individuals update their strategies based on two or more recent experiences, overall cooperation rates quickly approach the levels observed under perfect payoff memory (Figure 3). These findings suggest that models based on expected payoffs can serve as a useful approximation to more realistic models with limited payoff memory. Our findings also contribute to a wider literature that explores which kinds of cognitive capacities are required for reciprocal altruism to be feasible (e.g., 80, 81). While more payoff memory is always favorable, reciprocal cooperation can already be sustained if individuals have a record of two or three past outcomes. We believe that this kind of result, derived entirely within a theoretical model, is crucial for making model-informed deductions about reciprocity in natural systems.

Figure 2: Average cooperation rates and generosity for varying model parameters. In each case, we record the resulting average cooperation rate over the entire simulation (upper panels). In addition, we record the individuals' average generosity. Here, we only take into account those residents with p ≈ 1, and we compute the average of their cooperation probability q. These simulations suggest that perfect payoff memory consistently leads to more cooperation and more generosity. Unless explicitly varied, the parameters of the simulation are N = 100, b = 3, c = 1, β = 1, δ = 0.99. Simulations are run for T = 5 × 10^7 time steps for each parameter combination.
Figure 3: Average cooperation rates for different payoff memories. We vary how much information individuals take into account when updating their strategies. From left to right, we consider the following cases: (i) updating occurs based on expected payoffs (perfect memory), (ii) based on the last round of one interaction (limited memory), (iii) based on the last round of two interactions, (iv) based on the last two rounds of one interaction, (v) based on the last two rounds of two interactions, and (vi) based on the average payoff of one interaction. Again, simulations are run either for a comparably low benefit of cooperation (b/c = 3) or for a high benefit (b/c = 10). We observe that perfect memory always yields the highest cooperation rate. However, when individuals take into account at least two past interactions (cases (iii) to (vi)), evolving cooperation rates are close to this optimum. Baseline parameters are the same as in Figure 2.

Nikoleta E. Glynatsi, Alex McAvoy, Christian Hilbe
This document provides further details on our methods and derivations, and it contains additional simulation results. Section 1 summarizes the model. In particular, we provide further details on our implementation of the evolutionary dynamics and our use of the rare-mutation limit. In Section 2, we derive analytical results for the various settings we consider. These settings differ in what kind of payoff information individuals take into account when updating their strategies. In the perfect-memory setting, individuals take into account all their interactions against all co-players. In the limited-memory setting, they only consider the very last round of their very last interaction. In addition, we describe several model extensions in which the amount of information taken into account is in between these two extremes. Finally, Section 3 presents further simulation results. In particular, we confirm that our main results continue to hold (i) when mutations are no longer rare, and (ii) when players use memory-one strategies instead of reactive strategies.

Description of the model
Summary of the model. As described in the main text, we study cooperative behavior in a population of size N, with N being even. The dynamics unfold on two time scales. The short time scale describes the game dynamics. Here, the N individuals are randomly matched to form N/2 pairs to interact in a repeated prisoner's dilemma. Each round, individuals can choose whether to cooperate (C) or defect (D). In the most general setting, the resulting one-shot payoffs can be summarized by the payoff matrix

          C        D
    C  (R, R)  (S, T)
    D  (T, S)  (P, P)                                              (1)

Here, R is interpreted as the reward for mutual cooperation, S is the sucker's payoff, T is the temptation, and P is the punishment payoff [1]. Throughout this work, we parametrize these payoffs as R = b−c, S = −c, T = b, and P = 0, where b and c are the benefit and cost of cooperation, respectively, with b > c > 0. After each round, players learn their co-player's previous action. Then the game continues for another round with probability δ. Players make their decisions whether to cooperate in any given round based on their reactive strategies s = (y, p, q). The entry y determines a player's first-round cooperation probability. The other entries p and q determine the player's cooperation probability in all subsequent rounds. The probability is p if the co-player cooperated in the previous round, and it is q if the co-player defected. On this short time scale, the players' strategies are fixed, and players are consecutively matched to play repeated games with all other population members.
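The short time scale can be made concrete with a minimal simulation sketch. The function name and the default parameter choices (b = 3, c = 1) below are ours, chosen to match the baseline parameters used in the figures:

```python
import random

def play_repeated_game(s1, s2, delta, rng=random.Random(1)):
    """Simulate one repeated donation game between two reactive strategies.

    Each strategy is a tuple (y, p, q): y is the first-round cooperation
    probability; p (resp. q) is the cooperation probability after the
    co-player cooperated (resp. defected). Returns both players' per-round
    payoff histories, using b = 3, c = 1 (so R = 2, S = -1, T = 3, P = 0).
    """
    b, c = 3.0, 1.0
    (y1, p1, q1), (y2, p2, q2) = s1, s2
    a1 = rng.random() < y1          # True = cooperate
    a2 = rng.random() < y2
    hist1, hist2 = [], []
    while True:
        hist1.append(b * a2 - c * a1)   # payoff to player 1 this round
        hist2.append(b * a1 - c * a2)   # payoff to player 2 this round
        if rng.random() >= delta:       # the game ends with probability 1 - delta
            break
        # both players react to the co-player's previous action
        a1, a2 = (rng.random() < (p1 if a2 else q1),
                  rng.random() < (p2 if a1 else q2))
    return hist1, hist2
```

For example, an unconditional cooperator (1, 1, 1) matched with ALLD = (0, 0, 0) receives the sucker's payoff −c = −1 in every round, while the defector receives T = b = 3.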
The long time scale describes the evolutionary dynamics. Here, players are allowed to update their strategies based on the payoffs they yield. We model these strategy updates with a pairwise comparison process [2]. This process assumes that at regular time intervals, one player is randomly selected from the population. We refer to this player as the 'learner' (L). The learner is given an opportunity to update its strategy. There are two possibilities for how this update may occur. With probability µ, the player's strategy mutates randomly. In that case, the player's new strategy is drawn uniformly from the space of all reactive strategies [0, 1]^3. With probability 1−µ, the player compares itself to another population member. To this end, the player randomly picks another individual from the population (referred to as the 'role model', RM).
The learner adopts the role model's strategy with a probability ϕ, given by

    ϕ = 1 / (1 + exp[−β(π_RM − π_L)]).                             (2)

The parameter β ≥ 0 is the selection strength. It determines how important payoff differences are for the learner's decision to imitate the role model. The variables π_RM and π_L refer to the payoffs of the role model and the learner, respectively. The exact values of these payoffs depend on the players' memory. We say players have perfect memory when π_RM and π_L are given by the players' expected payoffs (across all rounds and across all possible co-players). We say players have limited memory when π_RM and π_L are given by the players' realized payoffs in the very last round of the game with their very last interaction partner. In addition, we consider several model extensions in which individuals have memory capacities in between these two extremes. We provide a detailed description of these different settings and the resulting payoffs in Section 2.
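The switching probability in Eq. (2) is the Fermi rule commonly used in pairwise comparison processes. As a quick sketch (the function name is ours):

```python
import math

def imitation_probability(payoff_rm, payoff_l, beta):
    """Fermi rule: probability that the learner adopts the role model's
    strategy, phi = 1 / (1 + exp(-beta * (pi_RM - pi_L)))."""
    return 1.0 / (1.0 + math.exp(-beta * (payoff_rm - payoff_l)))
```

For β = 0 imitation is random (ϕ = 1/2); for large β the learner almost surely imitates whenever the role model's payoff is higher, and almost surely does not when it is lower.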
Evolutionary simulations for the rare-mutation limit. To simulate the evolutionary dynamics of the pairwise comparison process, it is sometimes useful to assume that mutations are rare, µ → 0. In that case, whenever a mutant strategy appears, it either fixes in the population or goes extinct before the next mutant appears. As a result, at any given time there are at most two different strategies present in the population [3-5]. This assumption makes computations more efficient, and it makes some of the results easier to interpret.
In the following, we describe our implementation of the process in the rare-mutation limit in more detail.
Initially, the process starts with a population where all members use the same strategy (referred to as the resident strategy, R). Then one individual adopts a mutant strategy selected uniformly at random from the set of feasible strategies.

Algorithm 1: Evolutionary process in the limit of rare mutations
    N ← population size
    resident ← starting resident
    while t < maximum number of steps do
        mutant ← random strategy
        compute the fixation probability ρ_M
        if ρ_M > i, with i drawn uniformly from [0, 1], then
            resident ← mutant
        end
    end

The fixation probability ρ_M of the mutant strategy can be calculated explicitly [6],

    ρ_M = 1 / (1 + Σ_{j=1}^{N−1} Π_{k=1}^{j} λ_k^− / λ_k^+).       (3)

Here, the index k corresponds to the current number of players with the mutant strategy (mutants). The variables λ_k^−, λ_k^+ are the probabilities that the number of mutants decreases or increases within a single updating step. These probabilities depend on the probability that a mutant and a resident are chosen as the learner and the role model, respectively. In addition, they depend on the respective switching probability ϕ, as described by Eq. (2). We specify the exact values of λ_k^−, λ_k^+ for each memory setting in the next section. Depending on the fixation probability ρ_M, the mutant strategy either fixes (becomes the new resident) or goes extinct. Afterwards, another random mutant strategy is introduced into the population. We iterate this elementary population updating process for a large number of mutant strategies. At each step, we record the current resident strategy and the resulting average cooperation rate.
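Given the ratios λ_k^−/λ_k^+, the fixation probability in Eq. (3) can be evaluated with a simple loop. A sketch (the function name is ours; `ratio` is any callable returning λ_k^−/λ_k^+ for k mutants):

```python
def fixation_probability(ratio, N):
    """Fixation probability of a single mutant in a population of size N:
    rho_M = 1 / (1 + sum_{j=1}^{N-1} prod_{k=1}^{j} lambda_k^- / lambda_k^+).
    """
    total, prod = 1.0, 1.0
    for j in range(1, N):
        prod *= ratio(j)   # running product of lambda_k^-/lambda_k^+ up to k = j
        total += prod
    return 1.0 / total
```

As a sanity check, a neutral mutant (all ratios equal to one) fixes with the well-known probability 1/N.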
We consider this rare-mutation limit throughout the main text. The respective process is summarised by Algorithm 1. In Section 3, we present additional simulation results to show that our qualitative results continue to hold when the mutation rate is strictly bounded away from zero.

Analytical results
In the following, we discuss our different memory settings in more detail. We discuss six cases explicitly.
In these cases, updating occurs (i) based on average payoffs across all interactions (perfect memory), (ii) based on the last round of one interaction (limited memory), (iii) based on the last round of two interactions, (iv) based on the last two rounds of one interaction, (v) based on the last two rounds of two interactions, and (vi) based on the average payoff of one interaction. In each case, we assume there are only two strategies present in the population (a resident and a mutant strategy). We first derive how likely it is that a learner (of any type) assigns a given payoff π_L to itself, and a payoff of π_RM to the role model. This allows us to derive explicit expressions for λ_k^−/λ_k^+, and hence for the mutant's fixation probability according to Eq. (3). Based on these expressions, we can characterize under which conditions cooperation is stochastically stable.

Perfect payoff memory
Computing the players' expected payoffs. The case of perfect payoff memory corresponds to the classical case considered in the previous literature. Here, individuals update their strategies based on their expected payoffs, taking into account all rounds and all possible interaction partners. When players use reactive strategies (or more generally, strategies with finite memory), these expected payoffs can be computed explicitly, based on a Markov chain approach [7]. To this end, consider two players with strategies s_1 = (y_1, p_1, q_1) and s_2 = (y_2, p_2, q_2), respectively. In each round t of the game, player 1 may get one of the four possible payoffs R, S, T, or P, as described by the general payoff matrix (1). Let v(t) = (v_R(t), v_S(t), v_T(t), v_P(t)) denote the respective probability distribution of observing one of these four outcomes. This probability distribution can be computed recursively. Using the shortcut notation z̄ = 1−z for any z ∈ [0, 1], we get for the initial round

    v(1) = (y_1 y_2, y_1 ȳ_2, ȳ_1 y_2, ȳ_1 ȳ_2).                  (4)

Given v(t), we can compute v(t+1) as

    v(t+1) = v(t) · M,                                             (5)

where M is the transition matrix of the process,

    M = ( p_1 p_2   p_1 p̄_2   p̄_1 p_2   p̄_1 p̄_2 )
        ( q_1 p_2   q_1 p̄_2   q̄_1 p_2   q̄_1 p̄_2 )                 (6)
        ( p_1 q_2   p_1 q̄_2   p̄_1 q_2   p̄_1 q̄_2 )
        ( q_1 q_2   q_1 q̄_2   q̄_1 q_2   q̄_1 q̄_2 )

Here, rows and columns correspond to player 1's payoffs R, S, T, P in the previous and in the next round, respectively. Based on this recursion, we can compute how often player 1 receives one of the four payoffs R, S, T, P on average (across all possible realizations of games among the two players). This average distribution v is

    v = (1−δ) · v(1) · (I_4 − δM)^{−1},                            (7)

where I_4 is the 4 × 4 identity matrix. Based on this general formula, the four entries of v = (v_R, v_S, v_T, v_P) can be computed explicitly, using the auxiliary notation r_i := p_i − q_i. Using this distribution v, we compute the first player's expected payoff as the weighted average

    π_1 = v_R · R + v_S · S + v_T · T + v_P · P.                   (9)

The second player's payoff can be computed analogously.
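This Markov chain computation is compact in code. The sketch below follows the recursion and the matrix-inversion formula for v directly (function names and the choice b = 3, c = 1 are ours):

```python
import numpy as np

def average_payoff_distribution(s1, s2, delta):
    """Average distribution v over player 1's outcomes (R, S, T, P),
    v = (1 - delta) * v(1) * (I - delta * M)^(-1), for reactive strategies
    s1 = (y1, p1, q1) and s2 = (y2, p2, q2)."""
    y1, p1, q1 = s1
    y2, p2, q2 = s2
    v1 = np.array([y1*y2, y1*(1-y2), (1-y1)*y2, (1-y1)*(1-y2)])
    M = np.array([
        [p1*p2, p1*(1-p2), (1-p1)*p2, (1-p1)*(1-p2)],   # after R = (C, C)
        [q1*p2, q1*(1-p2), (1-q1)*p2, (1-q1)*(1-p2)],   # after S = (C, D)
        [p1*q2, p1*(1-q2), (1-p1)*q2, (1-p1)*(1-q2)],   # after T = (D, C)
        [q1*q2, q1*(1-q2), (1-q1)*q2, (1-q1)*(1-q2)],   # after P = (D, D)
    ])
    return (1 - delta) * v1 @ np.linalg.inv(np.eye(4) - delta * M)

def expected_payoff(s1, s2, delta, b=3.0, c=1.0):
    """Player 1's expected payoff, the weighted average over (R, S, T, P)."""
    v = average_payoff_distribution(s1, s2, delta)
    return v @ np.array([b - c, -c, b, 0.0])
```

Two unconditional cooperators earn the mutual-cooperation payoff b − c, and an unconditional cooperator facing ALLD earns −c, as expected.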
Computing the ratio λ_k^−/λ_k^+. After these preparations, consider now a population with k mutants and N−k residents, whose strategies we denote by s_M = (y_M, p_M, q_M) and s_R = (y_R, p_R, q_R), respectively. Assuming that population members are matched randomly (or equivalently, that they interact with all other population members), the resulting expected payoffs of residents and mutants are

    π_R = [ (N−k−1) · π(s_R, s_R) + k · π(s_R, s_M) ] / (N−1),
    π_M = [ (N−k) · π(s_M, s_R) + (k−1) · π(s_M, s_M) ] / (N−1).   (10)

The number of mutants in the population decreases in a single time step if a mutant is chosen to be the learner and adopts the strategy of a resident. Similarly, it increases if a resident is the learner and adopts the strategy of a mutant. The respective transition probabilities are

    λ_k^− = [(N−k)k/N^2] · 1 / (1 + exp[−β(π_R − π_M)]),
    λ_k^+ = [(N−k)k/N^2] · 1 / (1 + exp[−β(π_M − π_R)]).

For ϕ as defined by Eq. (2), the ratio of these two transition probabilities simplifies to

    λ_k^− / λ_k^+ = exp[β(π_R − π_M)].                             (11)

Based on these ratios for each k, we can compute the mutant's fixation probability by Eq. (3).
Stochastic stability of cooperation. As an application of this formalism, we can compute when cooperation is stochastically stable in the perfect-memory setting. To this end, suppose there is a single mutant, k = 1, with strategy ALLD, in a resident population with the conditionally cooperative strategy GTFT = (1, 1, q).
Based on Eq. (10), we can compute the strategies' expected payoffs as

    π_R = [ (N−2)(b−c) − c(1−δ+δq) ] / (N−1),
    π_M = b(1−δ+δq).

As a consequence, we can calculate the corresponding ratio of transition probabilities according to Eq. (11),

    λ_1^− / λ_1^+ = exp[β(π_R − π_M)].

By definition, cooperation is stochastically stable if this ratio exceeds one, which is equivalent to

    q < (1/δ) · [ (N−2)(b−c) / ((N−1)b + c) − (1−δ) ].             (12)

For such a strategy to be feasible we require q > 0, which implies δ > (b + (N−1)c) / ((N−1)b + c). In particular, in the limit of large populations N → ∞, we obtain that cooperation is stochastically stable if q < 1 − c/(δb). The minimum continuation probability for such a strategy to exist is δ > c/b. In this way, we recover the classical conditions for cooperation to be feasible under direct reciprocity [8-10].
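The large-population condition q < 1 − c/(δb) can be double-checked with elementary arithmetic. The sketch below encodes the mutant's payoff against GTFT residents (the helper name and parameter values are our own choices for illustration):

```python
def alld_payoff_vs_gtft(q, delta, b=3.0, c=1.0):
    """Expected per-round payoff of an ALLD mutant against a GTFT = (1, 1, q)
    resident: the resident cooperates in the first round and with probability
    q thereafter, so the mutant earns b * ((1 - delta) + delta * q) on average."""
    return b * ((1 - delta) + delta * q)
```

In the large-population limit, a GTFT resident earns b − c against its (mostly resident) co-players; the mutant's payoff falls below this value exactly when q < 1 − c/(δb).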

Limited payoff memory
Computing the distribution of last-round payoffs. The case of perfect payoff memory is straightforward to handle; here, every player gets the expected payoff with certainty. In comparison, computing transition probabilities for the case of limited payoff memory is more elaborate. Here, we need to consider the different possible outcomes that both the learner and the role model may have experienced in their very last interaction.
A further complication arises when the learner's last interaction partner happens to be the role model. In that case, the learner's and the role model's last payoffs are correlated (e.g., if the learner got the sucker's payoff S, the role model's payoff is T with certainty). To treat the case of limited memory analytically, let U = {R, S, T, P} be the set of possible one-shot payoffs. Since the game ends after round τ with probability δ^τ(1−δ), the distribution of the final-round payoff is the δ-weighted average of the per-round distributions v(τ). Hence, if two players use reactive strategies s_1 = (y_1, p_1, q_1) and s_2 = (y_2, p_2, q_2), the probability that the first player receives a payoff of u ∈ U in the final round of the game is given by Eq. (8).
To make the strategies explicit, we denote this probability by v_u(s_1, s_2).
Computing the ratio λ_k^−/λ_k^+. After these preparations, let us again consider the corresponding population setup, with N−k residents with strategy s_R = (y_R, p_R, q_R) and k mutants with strategy s_M = (y_M, p_M, q_M).
At each step of the evolutionary process, we choose a learner and a role model. The learner evaluates the performance of its strategy by comparing its own last one-shot payoff with the last one-shot payoff of the role model. In the following, we assume that either the learner or the role model is a resident, and that the other player is a mutant (otherwise the number of mutants does not change). There are two major cases to consider.
1. The learner and the role model have their last respective interaction with each other. This happens with probability 1/(N−1). In that case, there are four possible cases for their joint final payoffs, (u_R, u_M) ∈ U_F := {(R, R), (S, T), (T, S), (P, P)}. By Eq. (8), these four outcomes follow the distribution v(s_R, s_M).
2. The learner's last interaction was not with the role model, which happens with probability (N−2)/(N−1). In this case, there are four different subcases, depending on whether the resident's last interaction partner was a mutant or a resident, and depending on whether the mutant's last interaction partner was a mutant or a resident. As a result, the resident's last one-shot payoff is distributed according to v(s_R, s_M) or v(s_R, s_R); the mutant's last payoff is distributed according to v(s_M, s_M) or v(s_M, s_R), respectively. Let x(u_R, u_M) denote the probability that the resident and the mutant received the payoffs u_R and u_M in their respective last interactions. By taking into account the above two cases, we can compute this probability as

    x(u_R, u_M) = [1/(N−1)] · 1_{U_F}(u_R, u_M) · v_{u_R}(s_R, s_M)
        + [(N−2)/(N−1)] · [ (N−k−1)/(N−2) · v_{u_R}(s_R, s_R) + (k−1)/(N−2) · v_{u_R}(s_R, s_M) ]
                        · [ (N−k−1)/(N−2) · v_{u_M}(s_M, s_R) + (k−1)/(N−2) · v_{u_M}(s_M, s_M) ].    (13)

The first line on the right side corresponds to the case that the learner and the role model happened to be matched directly for their last interaction. In that case, only those payoff pairs can occur that are feasible in a direct interaction. That is, it needs to be the case that (u_R, u_M) ∈ U_F, as represented by the respective indicator function 1_{U_F}. The second and third lines of (13) summarize the four possible subcases that can occur when the learner and the role model were not directly matched for their last interaction.
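The structure of Eq. (13) translates directly into code. The sketch below takes the joint distribution for a direct match and the two players' marginal last-round distributions as inputs; the function name and the dictionary-based interface are ours:

```python
# payoff labels and the pairs feasible in a direct resident-mutant interaction
U = ["R", "S", "T", "P"]
FEASIBLE = {("R", "R"), ("S", "T"), ("T", "S"), ("P", "P")}

def last_round_joint_distribution(v_pair, v_res, v_mut, N):
    """Joint distribution x(u_R, u_M) of the resident's and the mutant's
    last one-shot payoffs, following the structure of Eq. (13).

    v_pair[(u_R, u_M)]: joint outcome distribution when resident and mutant
        played each other (supported on FEASIBLE);
    v_res[u], v_mut[u]: marginal last-round payoff distributions of the
        resident and the mutant when matched with some other co-player
        (each already a mixture over the co-player's possible types).
    """
    x = {}
    for uR in U:
        for uM in U:
            # direct match contributes only for feasible payoff pairs
            direct = v_pair.get((uR, uM), 0.0) if (uR, uM) in FEASIBLE else 0.0
            x[(uR, uM)] = direct / (N - 1) + (N - 2) / (N - 1) * v_res[uR] * v_mut[uM]
    return x
```

Since both the direct and the indirect contributions are normalized, the returned dictionary always sums to one.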
Overall, we now obtain the following expressions for the probability that the number of mutants increases or decreases by one,

    λ_k^+ = [(N−k)k/N^2] · Σ_{u_R, u_M ∈ U} x(u_R, u_M) · 1 / (1 + exp[−β(u_M − u_R)]),
    λ_k^− = [(N−k)k/N^2] · Σ_{u_R, u_M ∈ U} x(u_R, u_M) · 1 / (1 + exp[−β(u_R − u_M)]).

In this expression, the prefactor (N−k)k/N^2 gives the probability that the two players (learner and role model) have different strategies. The sum corresponds to the total probability that the learner adopts the role model's strategy, by summing over all possible payoffs u_R and u_M that the two players may have received in their respective last rounds.
Stochastic stability of cooperation. To illustrate this formalism, we again use it to characterize the stability of cooperation. There is k = 1 mutant with strategy ALLD. The remaining players use the resident strategy GTFT = (1, 1, q). When two residents interact, it follows from Eq. (8) that the outcome of the final round is distributed according to

    (v_R, v_S, v_T, v_P)(GTFT, GTFT) = (1, 0, 0, 0).

Based on these, we compute the probability x(u_R, u_M) that the payoff of a randomly chosen GTFT player is u_R and that the payoff of the ALLD player is u_M, with u_R, u_M ∈ U. Writing s := (1−δ) + δq for the probability that a GTFT player cooperates in the final round against ALLD, Eq. (13) yields

    x(R, T) = [(N−2)/(N−1)] · s,        x(R, P) = [(N−2)/(N−1)] · (1−s),
    x(S, T) = s/(N−1),                  x(P, P) = (1−s)/(N−1).

We obtain x(u_R, u_M) = 0 for all other payoff pairs (u_R, u_M).
Based on these expressions, we calculate the ratio of transition probabilities λ_1^−/λ_1^+. Cooperation is stochastically stable if λ_1^−/λ_1^+ > 1. While one can solve this inequality for q, the resulting condition is somewhat lengthy. To obtain a more interpretable condition, we consider the limit of strong selection β → ∞ and large populations N → ∞. In that case, because b > c > 0, the above ratio simplifies to

    λ_1^− / λ_1^+ = δ(1−q) / (1−δ+δq),

where 1−δ+δq is the probability that a GTFT resident cooperates in the final round against the ALLD mutant. This ratio exceeds one if q < 1 − 1/(2δ). For such a strategy to be feasible, we require q > 0, which in turn implies δ > 1/2. Moreover, in the special case that games are infinitely repeated, δ → 1, we conclude that cooperation is stochastically stable if q < 1/2. (For q = 1/2, the payoff of the ALLD player is T > R for half of the time, and it is P < R for the other half. The probability that the number of mutants increases by one then equals the probability that it decreases by one.)

Updates based on the final round of two repeated games
Motivation. In the limited-memory setting, individuals take into account the outcome of one round of one interaction. This setting can be generalized such that an individual considers m rounds and n interactions. Here, we discuss the case that the update depends on the last round of n interactions.
At each step of the evolutionary process, we consider the role model's and the learner's last n matches.
We need to define the probability that, for each of these matches, they are paired with a mutant, with a resident, or with each other. We assume that each pair is unique, such that the learner and the role model can be matched with each other at most once; this is a reasonable assumption in large populations. The case n = 1 corresponds to the previous setting of limited memory. There, we have seen that there are five possible combinations to consider. As we increase n, the number of possible combinations increases non-linearly; for a graphical illustration, see Supplementary Figure 1. In the following, we study the case n = 2, such that the learner takes into account the players' final payoffs of two interactions.
Supplementary Figure 1: Possible pairs when the learner takes into account n interactions against different co-players. In this diagram, (s_i, s_j) represents a possible pairing of two players, lines indicate possible cases, and the fractions represent the respective probabilities for each case. We consider n stages in which the population members are consecutively paired with each other to play a repeated game. In the first stage, we need to consider five possible cases, as in Section 2.2. One case arises if the learner and the role model are matched directly. This happens with probability 1/(N−1). If they are not matched directly, both can be paired with a mutant, with a resident, or one is paired with a mutant whilst the other is paired with a resident. The second stage is similar. However, here we need to take into account that the learner and the role model can only be matched directly if they have not already interacted during the previous stage. The process continues until the n-th stage.

Computing the ratio λ_k^−/λ_k^+
As before, we assume that either the learner or the role model is a resident, and that the other player involved in the pairwise comparison is a mutant. For n = 2, there are 24 possible combinations to consider. At the first stage, there are five possible combinations. These are the same as in the previous setting: either the learner interacts directly with the role model, or we need to take into account whether the resident interacts with a resident or a mutant, and whether the mutant interacts with a resident or a mutant. In case the learner and the role model did not directly interact with each other during the first stage, there are again five possible combinations in the second stage. Otherwise, if there already was a direct interaction, there are four combinations. Hence, there are 4·5 + 4 = 24 combinations in total.
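This counting argument can be verified with a small recursion (a sketch; the function name and the `matched` flag are ours):

```python
def n_combinations(n, matched=False):
    """Number of pairing combinations across n stages. At each stage there
    are 5 cases if the learner and role model have not yet been matched
    directly (1 direct match + 4 indirect pairings), and only 4 cases once
    they have already been matched (direct matching can occur at most once)."""
    if n == 0:
        return 1
    if matched:
        return 4 * n_combinations(n - 1, True)
    # one direct match (then 'matched' for the rest) + four indirect pairings
    return n_combinations(n - 1, True) + 4 * n_combinations(n - 1, False)
```

This reproduces the five cases for n = 1 and the 4·5 + 4 = 24 cases for n = 2, and it shows how quickly the count grows with n.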
We assume the resident receives the payoff u_R1 in their first interaction and u_R2 in their second.
Similarly, the mutant receives payoffs u_M1 and u_M2. Let x(u_R1 u_R2, u_M1 u_M2) denote the probability that the resident and the mutant received payoffs u_R1, u_R2 and u_M1, u_M2 in the last round of their respective last two interactions.

Computing the distribution of outcomes in the penultimate and in the final round of a game. To compute the ratio of transition probabilities, we again need to compute how likely it is that the resident and the mutant received any possible combination of two payoffs during the last two rounds. The probability that the game lasts for at least two rounds is δ, since that is the probability of continuing from round zero (the initial round) to round one. The distribution of actions in the penultimate round, conditioned on there being at least two rounds in the game, is then identical to the expression for v in Eq. (7). Thus, the probability that the first player receives the payoff u ∈ U in the penultimate round of the game, conditional on the event that the game lasts at least two rounds, is again identical to the player's average probability v_u of receiving payoff u across all rounds of the game, as specified by Eq. (8).
We can now derive a probability distribution for the first player's payoffs in the last two rounds of a repeated game. To this end, consider the 4 × 4 transition matrix M(s_1, s_2) = (m_{u,u′}) according to Eq. (6). Instead of the usual indexing of the four rows by numbers i ∈ {1, 2, 3, 4}, here we label the four rows of this matrix by the first player's payoff in the previous round, u ∈ {R, S, T, P}. Similarly, we label the four columns of this matrix by the resulting payoff to player 1 in the next round. For example, m_{ST} corresponds to the second row and third column of matrix M. So by Eq. (6), m_{ST} = q̄_1 p_2 = (1−q_1)p_2. Using this notation, we can describe the probability w_{uu′}(s_1, s_2) that the first player receives the payoffs u and u′ in the last two rounds.
By combining the probability that player 1 obtains a payoff of u in the penultimate round (probability v_u) and then a payoff of u′ in the last round (conditional probability m_{u,u′}), we obtain w_{uu′}(s_1, s_2) = v_u · m_{u,u′}.
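The two-round distribution w_{uu′} = v_u · m_{u,u′} can be assembled from the same Markov chain ingredients. A self-contained numpy sketch (the function name is ours):

```python
import numpy as np

def last_two_rounds_distribution(s1, s2, delta):
    """Probability w[u, u'] that player 1 receives payoff u in the penultimate
    round and u' in the final round, computed as w[u, u'] = v_u * m_{u,u'}.
    Rows/columns are indexed in the order (R, S, T, P)."""
    y1, p1, q1 = s1
    y2, p2, q2 = s2
    v1 = np.array([y1*y2, y1*(1-y2), (1-y1)*y2, (1-y1)*(1-y2)])
    M = np.array([
        [p1*p2, p1*(1-p2), (1-p1)*p2, (1-p1)*(1-p2)],   # after R = (C, C)
        [q1*p2, q1*(1-p2), (1-q1)*p2, (1-q1)*(1-p2)],   # after S = (C, D)
        [p1*q2, p1*(1-q2), (1-p1)*q2, (1-p1)*(1-q2)],   # after T = (D, C)
        [q1*q2, q1*(1-q2), (1-q1)*q2, (1-q1)*(1-q2)],   # after P = (D, D)
    ])
    # average penultimate-round distribution v, as in Eq. (7)
    v = (1 - delta) * v1 @ np.linalg.inv(np.eye(4) - delta * M)
    return v[:, None] * M   # w[u, u'] = v_u * m_{u, u'}
```

For two GTFT players, this recovers w_{RR} = 1: both rounds yield mutual cooperation with certainty.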
Computing the ratio λ_k^−/λ_k^+. To compute the ratio of transition probabilities, let us again take a population perspective. We consider k mutants with strategy s_M = (y_M, p_M, q_M) and N−k residents with strategy s_R = (y_R, p_R, q_R). Without loss of generality, we assume that either the learner or the role model is a resident, and that the respective other player is a mutant. Let x(u_R u_R′, u_M u_M′) be the probability that the two players received the payoffs u_R, u_R′ and u_M, u_M′, respectively, in the last two rounds of their last repeated games. We can compute this probability based on the same logic as in the limited-memory setting (Section 2.2); this yields an expression analogous to Eq. (13), with the one-round distributions v_u replaced by the two-round distributions w_{uu′}. Overall, we obtain the transition probabilities λ_k^± by summing over all possible payoff combinations, weighting each combination with the respective imitation probability ϕ, in analogy to the limited-memory case.

Stochastic stability of cooperation. We once again calculate how easily a single ALLD mutant can invade a resident population of GTFT players. When two residents interact, we obtain the following probability distribution for the outcome of the last two rounds: w_{RR}(GTFT, GTFT) = 1, and w_{uu′}(GTFT, GTFT) = 0 for all other u, u′ ∈ U.
Updates based on the average payoff of one repeated game

Stochastic stability of cooperation. We once again calculate how easily a single ALLD mutant can invade a resident population of GTFT players. Depending on whether a resident interacts with another resident or with a mutant, the resident's average payoff based on Eq. (9) is b−c or −c(1−δ+δq), respectively. The mutant can only interact with a resident, yielding the payoff b(1−δ+δq). Based on these average payoffs, we compute the ratio of transition probabilities. In the limit of large populations N → ∞, this ratio simplifies to

    λ_1^− / λ_1^+ = [1 + e^{−β(b(δq−δ)+c)}] / [1 + e^{β(b(δq−δ)+c)}] = e^{−β(b(δq−δ)+c)}.
For stochastic stability of cooperation, we require this ratio to be larger than one, which is equivalent to q < 1 − c/(δb). Interestingly, this condition is the same as condition (12) for the case of perfect memory when N → ∞.

Figure 1: Evolutionary dynamics under perfect and limited payoff memory. The leftmost panels give a schematic overview of the two main scenarios we compare. The two scenarios differ in how many past interactions individuals take into account when updating their strategy. In the scenario with perfect payoff memory, individuals consider all their past interactions (against all population members, and taking every turn of each repeated game into account). In the scenario with limited payoff memory, individuals only consider their very last interaction (against one specific population member, taking into account only one round of the repeated game). The four panels on the right side depict the outcome of evolutionary simulations for repeated games with either a low or a high benefit of cooperation. Colors represent how often the respective region of the strategy space is visited over time. In all four panels, two regions are visited particularly often. One region corresponds to a neighborhood of ALLD with p ≈ q ≈ 0 (lower left corner). The other region corresponds to a strip of conditionally cooperative strategies with p ≈ 1 and q satisfying the constraints (3) and (4), respectively (lower right corner). The resulting average cooperation rate depends on which of these two neighborhoods is visited more often. Simulations are run for T = 10^7 time steps, using a cost c = 1, a continuation probability of δ = 0.999, and a selection strength of β = 1, in a population of size N = 100.
