Pathogen evolution: slow and steady spreads the best

The theory of life history evolution provides a powerful framework to understand the evolutionary dynamics of pathogens in both epidemic and endemic situations. This framework, however, relies on the assumption that pathogen populations are very large and that one can neglect the effects of demographic stochasticity. Here we expand the theory of life history evolution to account for the effects of finite population size on the evolution of pathogen virulence. We show that demographic stochasticity introduces additional evolutionary forces that can qualitatively affect the dynamics and the evolutionary outcome. We discuss the importance of the shape of pathogen fitness landscape and host heterogeneity on the balance between mutation, selection and genetic drift. In particular, we discuss scenarios where finite population size can dramatically affect classical predictions of deterministic models. This analysis reconciles Adaptive Dynamics with population genetics in finite populations and thus provides a new theoretical toolbox to study life-history evolution in realistic ecological scenarios.


Introduction
Why are some pathogens virulent and harm their hosts while others have no effect on host fitness? Our ability to understand and predict the evolutionary dynamics of pathogen virulence has considerable implications for public-health management (Dieckmann et al., 2005;Bull and Lauring, 2014;Gandon et al., 2016). A classical explanation for pathogen virulence involves trade-offs with other pathogen life-history traits. If certain components of pathogen fitness, such as a high transmission rate or a long duration of transmission, necessarily require that the pathogen incidentally harm its host then virulence is expected to evolve (Frank, 1996). A now classical way to develop specific predictions from this hypothesis is to adopt the Adaptive Dynamics formalism (Geritz et al., 1998;Dieckmann et al., 2005;Frank, 1996). This approach relies on the assumption that the mutation rate is small so that the epidemiological dynamics occur on a faster timescale than the evolutionary dynamics (Anderson et al., 1992;Frank, 1996;Alizon et al., 2009;Cressler et al., 2016). Under simple epidemiological assumptions (no co-infections with different genotypes) the evolutionarily stable level of virulence maximizes the basic reproduction ratio R 0 of the pathogen (but see e.g., Nowak and May (1994); van Baalen and Sabelis (1995) for more complex epidemiological scenarios).
Adaptive Dynamics models allow one to predict long-term evolution but they tell us little about what we should expect to observe if epidemiological and evolutionary processes occur on a similar timescale. Novel theoretical approaches have therefore been developed to address this issue (Lenski and May, 1994;Frank, 1996;Day and Proulx, 2004;Day and Gandon, 2006;Bull and Ebert, 2008). These studies have revealed that, in addition to tradeoffs, the nature of the epidemiological dynamics (e.g., epidemic spread versus endemic disease) can also dictate the type of pathogen that will evolve. For example, pathogens with relatively high virulence can be selected for during epidemic disease spread whereas pathogens with a lower virulence may outcompete such high virulence strains in endemic diseases (Lenski and May, 1994;Frank, 1996;Day and Proulx, 2004;Berngruber et al., 2013).
The above-mentioned theory allows one to determine the level of virulence expected to evolve under a broad range of epidemiological scenarios but it still suffers from the fundamental shortcoming of being a deterministic theory. Pathogen population size, however, can be very small (e.g. at the onset of an epidemic or after a vaccination campaign) and demographic stochasticity is likely to affect both the epidemiological and evolutionary dynamics of the disease. If all that such stochasticity did was to introduce random noise then the predictions of deterministic theory would likely suffice. However, several recent studies have demonstrated that this is not the case. For example, Kogan et al. (2014) and Humplik et al. (2014) each used different theoretical approaches to demonstrate that finite population size tends to select for lower virulence and transmission. Likewise, Read and Keeling (2007) analyzed the effect of finite population size in a complex epidemiological model with unstable epidemiological dynamics and showed that finite population size could induce an evolutionary instability that may either lead to selection for very high or very low transmission.
Taken together, the existing literature presents a complex picture of the factors that drive virulence evolution and it remains unclear how all of these factors are related to one another and how they might interact. In this paper we develop a very general theory of pathogen evolution that can be used to examine virulence evolution when all of the above-mentioned factors are at play. First, we use an individual based description of the epidemiological process to derive a stochastic description of the evolutionary epidemiology dynamics of the pathogen. This theoretical framework is used to pinpoint the effect of finite population size on the interplay between epidemiology and evolution. Second, we analyze this model under the realistic assumption that the rate of mutation is small so that pathogen evolution can be approximated by a sequence of mutation fixations. We derive the probability of fixation of a mutant pathogen under both weak and strong selection regimes, and for different epidemiological scenarios. Third, we use this theoretical framework to derive the stationary distribution of pathogen virulence resulting from the balance between mutation, selection and genetic drift. This yields new predictions regarding the effect of the shape of pathogen fitness landscape, the size of the population and sources of host heterogeneities on long-term evolution of the pathogen. As the question of virulence evolution can be viewed as a specific example of the more general notion of life history evolution (Stearns, 1992;Roff, 2002) our results should be directly applicable to other life history traits and other organisms as well.

Model
We use a classical SIR epidemiological model where hosts can either be susceptible, infected or recovered. The number of each of these types of hosts is denoted by N S , N I , and N R respectively. Because we are interested in the effect of demographic stochasticity the model is derived from a microscopic description of all the events that may occur in a finite host population of total size N T = N S + N I + N R (the derivation of the model is detailed in Supplementary Information). It will be useful to explicitly specify the size of the habitat in which the population lives (e.g., the area of the habitat) and so we denote this by the parameter n.
We use λ to denote the rate at which new susceptible hosts enter the population per unit area and therefore the total rate is given by λn. We focus on the case of frequency-dependent transmission; i.e., new infections occur at rate β N T N S N I where β is a constant quantifying the combined effects of contact rate among individuals and the probability of pathogen transmission given an appropriate contact occurs. Note, however, that other forms of transmission (e.g., density dependent transmission, (McCallum et al., 2001)) yield qualitatively similar results . For simplicity we also assume that already infected hosts cannot be reinfected by another pathogen strain (i.e., no co-infections). All hosts are assumed to suffer a constant per capita death rate of δ and infected hosts die at per capita rate α and they recover at per capita rate γ. Finally, to study pathogen evolution we need to introduce genetic variation in the parasite population. Therefore we consider d pathogen strains which differ in transmission rate β i and virulence α i ,with i ∈ {1, ..., d}. Likewise we use the subscripted variable N I i to denote the number of hosts infected with strain i.
The dynamical system resulting from the above assumptions is a continuous-time Markov process tracking the number of individuals of each type of host. To progress in the analysis we use a diffusion approximation and work with host densities defined as S = N S /n, I i = N I i /n and N = N T /n and we define the total density of infected hosts as I = d i=1 I i . When n is sufficiently large these variables can be approximated using a continuous state space and so this model can be described by a system of stochastic differential equations (see Supplementary Information, §3).

Deterministic evolution
In the limit where the habitat size (and thus the host population size) gets large, demographic stochasticity becomes unimportant and the epidemiological dynamics are given by the following system of ordinary differential equations: The bars above α, β and γ refer to the mean of the transmission and the virulence distributions of the pathogen population (i.e.ᾱ = d i=1 α i I i I ). In the absence of the pathogen the density of hosts equilibrates at S 0 = λ δ . A monomorphic pathogen population (m = 1,β = β andᾱ = α) is able to invade this equilibrium if its basic reproduction ratio is R 0 = β δ+α+γ > 1. If this condition is fulfilled the system reaches an endemic equilibrium where and N eq = λ(δ+γ) δ(β−α) R 0 . When several strains are present in the population the evolutionary dynamics of the pathogen can be tracked with Gandon, 2006, 2007): where p i = I i I is the frequency of pathogen i. The quantity r i = β i S N − (δ + α i + γ) is the instantaneous per capita growth rate of strain i andr = d i=1 p i r i is the average per capita growth rate of the pathogen population. When m = 2 only two strains are competing (a wild-type, strain 1, and a mutant, strain 2) the change in frequency p 2 of the mutant strain is given by: where ∆β = β 2 −β 1 and ∆α = α 2 −α 1 are the effects of the mutation on transmission and virulence, respectively. The above formalization can be used to understand the evolution of pathogen life-history under different scenarios. First, under the classical Adaptive Dynamics assumption that mutation rate is very small one may use a separation of time scales where the epidemiological dynamics reach an endemic equilibrium (set by the resident pathogen, strain 1) before the introduction of a new variant (strain 2) by mutation. In this case evolution favours the strain with the highest basic reproduction ratio: R (0,i) = β i δ+α i +γ . In other words, evolution favours strains with higher transmission rates and lower virulence. According to the tradeoff hypothesis, however, transmission and virulence cannot evolve independently. For example, the within-host growth rate of pathogens is likely to affect both traits and result in a functional trade-off between transmission and virulence (Anderson et al., 1992;Frank, 1996;Alizon et al., 2009;Cressler et al., 2016). Under this assumption equation (3) can be used to predict the evolutionary stable virulence strategy (Figure 1). The above model can also be used to predict virulence evolution when the evolutionary and epidemiological dynamics occur on a similar time scale Gandon, 2006, 2007;Gandon and Day, 2007). For instance, these models can be used to understand virulence evolution during an epidemic (Lenski and May, 1994;Frank, 1996;Day and Proulx, 2004;Berngruber et al., 2013). In this case, a pathogen strain i with a lower R 0 may outcompete other strains if its instantaneous growth rate, r i , is higher.

Stochastic evolution
Finite population size introduces demographic stochasticity and the epidemiological dynamics can 0 − − Figure 1: Schematic representation of the effect of finite population size on the evolution of pathogen virulence. The grey line in the top figure represents the effect of pathogen virulence, α, on R 0 (for an asymmetric fitness function). The grey line in the bottom figure represents the effect of pathogen virulence, α, on pathogen transmission, β. In the deterministic version of our model the marginal value theorem can be used to find the evolutionary stable (ES) pathogen virulence, α 0 (dashed black arrow). In this model ES virulence maximizes R 0 in the absence of demographic stochasticity. Finite population size modifies selection and favours pathogen strategies with lower virulence (see equation (10)). The mode of the stationary distribution of pathogen virulence is indicated by a dashed red arrow, α mode (see equation (13)). This geometrical construction indicates that finite population size is expected to favour slower strains even if they have a lower R 0 . be described by the following system of (Itô) stochastic differential equations: where B 1 , . . . , B 5 are independent Brownian motions. As expected, when n → ∞ this set of stochastic differential equations reduces to the deterministic equations in (1).
In finite populations the pathogen, and indeed the host population itself, are destined to extinction with probability 1. The time it takes for this to occur, however, depends critically on the parameter values. For example, in a monomorphic pathogen population (i.e., , m = 1), if R 0 is larger than one the size of the pathogen population reaches a quasi-stationary distribution which is approximately normal. The mean of this distribution is of order n and its standard deviation of order √ n (Nåsell, 2001(Nåsell, , 2007. The extinction time from the quasi-stationary distribution increases exponentially with n (Barbour, 1976;Nåsell, 2001Nåsell, , 2007 and so, in the remainder of the paper we will assume that n is large enough so that we can focus on the dynamics conditional on non-extinction. As in the deterministic case, one can study evolutionary dynamics by focusing on the change in strain frequencies. We obtain a stochastic differential equation analogous to (2) (see Supplementary Information, §4): where v i = β i S N + (δ + α i + γ) is the variance in the growth rate of strain i (while r i is the mean) andv = d i=1 p i v i is the average variance in growth rate of the pathogen population. The first term in equation (5) is analogous to (2). The second term shows that finite population size (i.e., when pathogen population size, as measured by the total density of infected hosts, nI is not too large) can affect the direction of evolution. In contrast with the deterministic model, the evolutionary dynamics are not driven exclusively by the expected growth rate r i but also by a minimization of the variance. This effect is akin to bet-hedging theory stating that a mutant strategy with lower variance in reproduction may outcompete a resident strategy with a higher average instantaneous growth rate (Gillespie, 1974;Frank and Slatkin, 1990). To better understand this effect it is particularly insightful to examine the case m = 2 when only two strains are competing and the change in frequency p 2 of the mutant strain is given by: The first term (the drift term) in equation (6) is similar to (3) except for the 1 nI terms. Those terms are due to the fact that a transmission (or a death) event of the mutant is associated with a change in the number of mutants as well as an increase (decrease) of the total pathogen population size by one individual. This concomitant variation of pathogen population size affects the effective change of the mutant frequency (relative to the change expected under the deterministic model where population size are assumed to be infinite). This effect decreases the benefit associated with higher transmission and increases the cost of virulence. In the long-term this effect (the drift term in (5)) is thus expected to select for lower virulence. But this long term evolutionary outcome cannot be described by an evolutionary stable state because demographic stochasticity is also expected to generate noise (the diffusion term in (5)). Indeed, this stochasticity (i.e., genetic drift) may lead to the invasion and fixation of strains with lower per capita growth rates. In the following we fully characterize this complex evolutionary outcome with the stationary distribution of pathogen virulence under different epidemiological scenarios.

Results
The above theoretical framework embodied by the stochastic differential equations (4) and (5) subsume the deterministic model and can be used to study the interplay of all the relevant factors affecting virulence evolution. In the following we will assume that pathogen mutation is rare so that evolution can be described, as in classical Adaptive Dynamics, as a chain of fixation of new pathogen mutations. In contrast with Adaptive Dynamics, however, demographic stochasticity may allow deleterious mutations to go to fixation. The analysis of the effect of finite population size requires specific ways to quantify the stochastic fate of a genotype (Proulx and Day, 2002). To determine the fate of a new mutation we need to compute the probability of fixation of a mutant pathogen in a resident population. In the absence of selection the fixation probability of a mutant allele depends only on the demography of the population. When the size of the population is fixed and equal to N the fixation probability of a neutral allele is 1/N . When the fixation probability of a mutant is higher than neutral it indicates that the mutant is selectively favoured. This is particularly useful in many complex situations where the interplay between selection and genetic drift are difficult to disentangle like time varying demography Lambert, 2006) or spatial structure (Rousset, 2004). In our model, the difficulty arises from (i) the stochastic demography of the pathogen population and (ii) the fact that pathogen life-history traits feed-back on the epidemiological dynamics and thus on the intensity of genetic drift.

Stationary distribution of pathogen virulence at equilibrium
Here we assume, as in the Adaptive Dynamics framework, that the pathogen mutation rate µ is so low that the mutant pathogen (strain 2) arises when the resident population (strain 1) has reached a stationary equilibrium nI eq (i.e., close to the endemic equilibrium derived in the deterministic model). The R 0 of the two strains may be written in the following way: R 0,2 = R 0,1 (1 + s) where s measures the magnitude of selection.
When selection is strong (i.e., s 1 n ) the probability of fixation of the mutant when N I 2 (0) mutants are introduced into a resident population at equilibrium is (see Supplementary Information, §5.2): which may be obtained by approximating the invading strain by a branching process (see Supplementary Information, §7.2 for a rigorous justification). When the mutant and the resident have similar values of R 0 (i.e., s is of order 1 n ) selection is weak, and the derivation of the probability of fixation is a much more difficult problem. The classical population genetics approach under the assumption that population size is fixed (or is characterized by a deterministic trajectory independent of mutant frequency) is to use the diffusion equation of mutant frequency to derive the probability of fixation Lambert, 2006). But in our model, equation (3) is not autonomous and is coupled with the epidemiological dynamics. To derive the probability of fixation we use a separation of time scale argument to reduce the dimension of the system (see  for a discussion of the approach). Indeed, if selection is weak the deterministic component of the model sends the system rapidly to the endemic equilibrium. At this point, it is possible to approximate the change in frequency of the mutant by tracking the dynamics of the projection of the mutant frequency on this manifold (see Supplementary Information,  §5.3). This one dimensional system can then be used to derive the probability of fixation under weak selection. A first order approximation in s and σ is: where p = I 2 (0)/I eq and σ = β 1 −β 2 β 2 . The first term in (8) is the probability of fixation of a single neutral mutation introduced in a pathogen population at the endemic equilibrium nI eq . The second term takes into account the effect due to selection. First, selection may be driven by differences in R 0 . Second, even if strains have identical R 0 (i.e., s = 0) selection may be driven by σ which measures the difference in transmission rate. Note, however, that the effect of s rapidly overwhelms the effect of σ as pathogen population size nI eq becomes large (unless s is of order 1 n ). The probability of fixation given in (8) confirms that evolution tends to push towards higher basic reproductive ratio but when the population size is small other forces may affect the evolutionary outcome. In particular, when nI eq is small, strains with lower R 0 can reach fixation. Figure 2 shows the result of stochastic simulations that confirm the approximations (7) and (8) under different epidemiological scenarios.
Even though the probability of fixation helps understand the interplay between selection and genetic drift it does not account for any differences in the time to fixation and it is often difficult to measure this probability experimentally as well (but see Gifford et al. (2012)). What may be more accessible is a characterization of the phenotypic state of the population across different points in time (or in space among replicate populations) -that is, the stationary distribution of the virulence phenotype of the pathogen under the action of mutation, selection and genetic drift Lehmann, 2012;Debarre and Otto, 2016) (Figure 3).
To derive the stationary distribution of pathogen virulence we first need to impose a trade-off between virulence and contact rate, setting β = β(α), and introduce the mutation kernel K(α m , α), the probability distribution of mutants with strategy α m from a monomorphic population with strategy α. Here we assume that this distribution is Gaussian with a mean equal to the current resident trait value and variance ν. Under the assumption that the mutation rate µ remains small, pathogen polymorphism is limited to the transient period between the introduction of a mutant and a fixation. The probability of fixation (8) accurately describes the direction of evolution and the evolution of pathogen virulence can then be described by the following Fokker-Planck diffusion equation (see Supplementary Information, §6): where ψ(α, t) is the distribution of pathogen virulence and indicates the derivative with respect to α. The drift term of the above equation indicates that deterministic evolution tends to maximize the basic reproduction ratio while finite population size tends to select for lower transmission. Under the classical assumption that pathogen transmission and pathogen virulence are linked by a genetic trade-off one can ask what the level of pathogen virulence is where the drift coefficient is zero. This trait value corresponds to the mode of the stationary distribution of pathogen virulence and is given by the following condition (see Supplementary Information Equation S.45): When the pathogen population is very large (i.e., n → ∞) we recover the marginal value theorem while finite population size increases the slope β (α) and reduces the mode of the stationary distribution (see Figure 1). Thus, for a broad range of transmission-virulence trade-off functions, finite population size is expected to decrease virulence and transmission rates. In other words, pathogen avirulence may be viewed as a bet-hedging strategy because even if it reduces the instantaneous growth rate r i , the reduced variance in growth rate v i is adaptive in finite population size.
Let us now consider the limiting case when all the pathogen strains have the same R 0 . This corresponds to a very special case where the fitness landscape is flat. The deterministic model predicts that pathogen life-history variation is neutral near the endemic equilibrium (see (2)). The probability of fixation (8) shows, however, that selection is quasi-neutral and favours pathogens with lower transmission and virulence rates (Parsons and Quince, 2007;Kogan et al., 2014;Humplik et al., 2014). The stationary distribution results from the balance between selection (pushing towards lower values of pathogen traits) and mutation (reintroducing variation). If we focus on virulence and allow variation between a minimal value α min and a maximal value α max the stationary distribution is (see Supplementary Information Equation S.39): It is worth noting that this distribution is independent of the pathogen population size. Indeed, near the endemic equilibrium and when pathogens have the same R 0 the probability of fixation (8) is independent of pathogen population size. So this prediction holds even in very large pathogen populations. The time to fixation may, however, be considerably longer in large populations and the assumption that polymorphism is always reduced to the resident and a single mutant may not always hold as pathogen population increases. Yet, stochastic simulations confirm that (11) correctly predicts the stationary distribution, which is relatively insensitive to pathogen population size but varies with δ + γ ( Figure 4A). Second, we consider a general fitness landscape with a single maximum. It is possible to derive a good approximation for the stationary distribution (see S.44 in the Supplementary Information): where α 0 is the virulence that maximizes R 0 , I eq (α 0 ) is the expected number of infected individuals at the endemic equilibrium when the virulence is α 0 and N (α 0 , σ 2 ) is the Gaussian distribution with mean α 0 and variance σ 2 = 1 nIeq(α 0 )|R 0 (α 0 )|/(R 0 (α 0 )) . We thus see the effect of the demography is to bias the Gaussian, putting more weight on values of the virulence below α 0 ; this becomes more clear when we consider the mode and mean of the (true) stationary distribution (see §6.3 in the Supplementary Information): and respectively. These results indicate that, as expected from the simple optimization approach used above in (10) and illustrated in Figures 3 and 4, lower pathogen population size tends to decrease pathogen virulence. Yet, the above derivation of the stationary distribution goes far beyond this optimization criterion. First, it predicts accurately the mode of the stationary distribution. In particular it shows that the shape of the fitness landscape may affect the mode of the stationary distribution. The skew of the fitness landscape can have huge effects on the stationary distribution ( Figure 3). A positive skew leads to a higher mean virulence and may thus counteract the effect of small pathogen population. In other words, whether demographic stochasticity favours lower of higher virulence depends also on the shape of the fitness landscape. Second, our analysis predicts the amount of variation one may expect to see around this mode. Unlike the criteria used to derive a single optimal strategy our approach predicts accurately the expected variation around this mode (Figures 3 and 4). Note that the population remains monomorphic most of the time (because mutation is assumed to be small) but the variance of the stationary distribution refers to the distribution of phenotypes explored through time (or through space if stochastic evolution is taking place in multiple isolated populations).

Pathogen evolution after vaccination
The above analysis relies on the assumption that the epidemiological dynamics are much faster than evolutionary dynamics so that the pathogen population is always at its endemic equilibrium. The present framework can also be used to explore the fate of a mutant pathogen away from this endemic equilibrium. For instance, right after the start of a vaccination campaign the availability of susceptible hosts is going to drop rapidly if the vaccination coverage f is high and if the efficacy of the vaccine is large. This epidemiological perturbation has major consequences on the probability of fixation of a pathogen mutant. Looking at invasion out of endemic equilibrium, we show that a larger vaccination coverage f decreases the probability of fixation of all pathogen mutants (see Supplementary Information §5.2.3) but the probability of fixation of strains with low virulence and low transmission rates are less affected than more virulent and transmissible strains. In other words, strains with low turn over rates (slower strains) are less likely to be driven to extinction during the drop of the pathogen population size. This extends classical results of populations genetics Lambert, 2006) to situations where the mutations are acting on life history traits and feed back on population dynamics. Faster strains are selected for during epidemics while slower strains are favoured when the pathogen population is reduced (e.g. after a public health intervention). This is also consistent with the analysis of pathogen evolution based on deterministic models which showed that the direction of selection depends on the epidemiological state of the population (Lenski and May, 1994;Frank, 1996;Day and Proulx, 2004;Lambert, 2006;Gandon, 2006, 2007;Berngruber et al., 2013). Vaccination is also meant to induce long-term modifications of the host population. In particular, artificial immunization introduces a heterogeneity between vaccinated and unvaccinated hosts. If the vaccine is perfect, vaccination will act on the epidemiology and will reduce the endemic equilibrium. In a deterministic version of this model, such a perfect vaccine is expected to have no consequences on long-term pathogen evolution (Gandon et al., 2001;Gandon and Day, 2007). In contrast, when host population size is finite, vaccination is expected to magnify the influence of demographic stochasticity and, as discussed above, to select for lower pathogen virulence and to increase the variance of the stationary distribution ( Figure 5). It is also interesting to consider an alternative scenario where vaccinated hosts can be infected but cannot transmit the pathogen. In this case, two types of infected hosts are coexisting: good-quality (naïve) hosts and bad-quality (vaccinated) hosts. In this situation, the amount of demographic stochasticity is not governed by the whole pathogen population size but by the size of the population of infected hosts that actually contribute to transmission. In fact all the results derived above apply in this scenario provided that the equilibrium pathogen population size is replaced by the effective pathogen size: nI e = (1 − f )nI eq . This example illustrates that even for large pathogen population sizes the effect of demographic stochasticity can be important if the effective pathogen size is small. When there is variation in reproductive value among individuals the effective population size that governs the amount of genetic drift may be substantially lower than the actual size of the population (Crow and Kimura, 1970). In the context of pathogen evolution this heterogeneity in reproductive value may be driven by variations in infectiousness among hosts. This variation can be induced by publichealth interventions (e.g. transmission-blocking vaccines) but it emerges naturally from complex behavioural and/or physiological differences among hosts. For instance, evidence for superspreading events where certain individuals can infect unusually large numbers of secondary cases have been found in many human pathogens (Woolhouse et al., 1997;Lloyd-Smith et al., 2005). This heterogeneity is expected to reduce the effective population size and to magnify the effects of demographic stochasticity discussed above. But other factors like temporal fluctuations in population size are also known to modulate the intensity of genetic drift and may affect effective population size (Crow and Kimura, 1970).

Discussion
Evolutionary theory has led to the development of different mathematical tools for studying phenotypic evolution in a broad diversity of ecological scenarios (Parker and Smith, 1990;Roff, 1993;. For instance, Adaptive Dynamics is a powerful theoretical framework to study life-history evolution when mutation is assumed to be rare so that demographic and evolutionary processes can be decoupled (Geritz et al., 1998;. This analysis yields evolutionarily stable life-history strategies and captures the ultimate outcome of evolution. But this approach relies on the assumption that population size is infinite and that evolution is deterministic. Finite population size, however, can also affect evolutionary trajectories. In particular, even the fittest genotype can be invaded by a deleterious mutant when population size is reduced. This leads to the collapse of the concept of evolutionarily stable strategy. Here we develop and apply Stochastic Adaptive Dynamics (SAD) Debarre and Otto, 2016), a new theoretical framework where the evolutionary outcome of life history evolution is studied through the derivation of the stationary distribution of the phenotype under mutationselection-drift equilibrium. Under the assumption that mutation rate is small, the equilibrium distribution can be derived from a diffusion approximation. In contrast with previous population genetics models, the present framework also allows life-history evolution to affect population size and, consequently, the amount of demographic stochasticity. In other words, this framework retains key features of Adaptive Dynamics but relaxes a major assumption by allowing genetic drift to affect the evolutionary outcome (see also Waxman andGavrilets, 2005, p.1149). As such, our SAD framework is an important step towards a better integration between Adaptive Dynamics and classical population genetics. We show that finite population size induces a selective pressure towards strains with lower variance in growth rate (but see also Gillespie (1974); Lambert (2006)). A simple way to understand this effect is to compare the fate of two strains with the same R 0 but with different life-history strategies. The fast strain is very transmissible but has a short duration of infection (e.g., because of high virulence or high clearance rate). The slow strain has a long duration of infection but has a small transmission rate. Since the two strains have the same R 0 Adaptive Dynamics predicts that these two strains should coexist. With finite population size, however, the fast strain has a higher probability to go extinct simply because more events happen per unit of time. As in Aesop's Fable "Slow and steady wins the race" because the fast strain will reach extinction sooner than the slow strain. Previous studies Kogan et al., 2014;Humplik et al., 2014) pointed out the influence of finite population size on the direction of virulence evolution but they focused mainly on the quasi-neutral case where all the strains have the same R 0 . Humplik et al. (2014) did look at scenarios where strains have different R 0 but without a derivation of the stationary distribution at mutation-selection-drift equilibrium. We believe that this stationary distribution is key to explore the interaction between finite population size and phenotypic evolution. This distribution yields testable predictions on the mean as well as other moments of the phenotypic distribution.
The approximation (12) shows that this distribution is moulded by two main parameters: (i) the pathogen fitness landscape, and (ii) the effective size of the pathogen population. First, the fitness landscape at the endemic equilibrium can be derived from (5) and depends mainly on the way R 0 varies with pathogen life history traits. Under the classical transmission-virulence assumption R 0 is maximized for some intermediate virulence. But the shape of the trade-off also affects the shape of the fitness landscape and in particular its symmetry. Second, the effective size nI e of the pathogen population size depends mainly on the pathogen population size nI eq at the endemic equilibrium but other factors may reduce the effective pathogen population size as well. For instance, variance in transmission among infected hosts is likely to reduce nI e below nI eq . One source of heterogeneity in transmissibility may be induced by public-health interventions (e.g., vaccination, drug treatments), but intrinsic behavioural or immunological heterogeneities among hosts may induce superspreading transmission routes as well (Woolhouse et al., 1997;Lloyd-Smith et al., 2005).
When the fitness landscape of the pathogen is symmetric, reducing the effective population size increases the variance of the stationary distribution but decreases also the mean (and the mode) of this distribution. This effect results from the selection for a reduction of the variance identified in (5). This is the effect that emerges in the quasi-neutral case. When the fitness landscape is flat this may lead to an important bias towards lower virulence ( Figure 4). When the fitness landscape of the pathogen is asymmetric the skewness of the fitness landscape can affect the mean of the stationary distribution when the effective population size nI e of the pathogen is reduced. More specifically negative (positive) skewness reduces (increases) the mean of the stationary distribution. It is interesting to note that classical functions used to model the trade-off between virulence and transmission tend to generate positive skewness in the fitness landscape (van Baalen and Sabelis, 1995;Frank, 1996;Alizon et al., 2009). The asymmetry of these fitness functions may thus counteract the effects of stochasticity per se identified in symmetric fitness landscapes. In other words, predictions on the stochastic evolutionary outcome are sensitive to the shape on genetic constraints acting on different pathogen life-history traits. This result is very similar to the deterministic effects discussed in (Urban et al., 2013) on the influence of asymmetric fitness landscapes on phenotypic evolution. Note, however, that the effect analyzed by Urban et al (2013) is driven by environmental effects on phenotypes. In our model, we did not assume any environmental effects and a given genotype is assumed to produce a single phenotype.
We focused our analysis on the stationary distribution at the endemic equilibrium of this classical SIR model. But we also explore the effect of demographic stochasticity on the transient evolutionary dynamics away from the endemic equilibrium. For instance we recover a classical population genetics result Lambert, 2006) that the probability of fixation of adaptive mutations (i.e., with s 1 n ) is increased during epidemics. Beyond the effect of s (i.e., differences between the R 0 of the two competing strains) differences in life-history traits matter away from the endemic equilibrium. In particular, faster strains have higher probabilities of fixation when the pathogen population is growing (during epidemics) and, conversely, slower strains have higher probabilities of fixation when the pathogen population is crashing. The analysis of scenarios where the epidemiological dynamics is unstable and leads to recurrent epidemics is more challenging but may lead to unexpected evolutionary dynamics (Read and Keeling, 2007).
We analyzed the effects of demographic stochasticity induced by finite population size but environmental stochasticity may also affect evolution (Frank and Slatkin, 1990;Starrfelt and Kokko, 2012;Schreiber et al., 2015). Environmental factors are known to have dramatic impacts on pathogen transmission and it would thus be particularly relevant to expand the current framework to account for the effects of random perturbations of the environment on pathogen evolution (Nguyen et al., 2015).
Another possible extension of this model would be to analyze the effect of demographic stochasticity on the multi-locus dynamics of pathogens. Indeed, the interaction between genetic drift and selection is known to yield complex evolutionary dynamics resulting in the build up of negative linkage disequilibrium between loci. But the analysis of this so-called Hill-Robertson effect is often restricted to population genetics models with fixed population size. The build up of linkage disequilibrium in some epidemiological models has been discussed in some simulation models (Althaus and Bonhoeffer, 2005;Fraser, 2005). Our model provides a theoretical framework to explore the effect of finite population size on multi-locus dynamics of pathogens and to generate more accurate predictions on the evolution of drug resistance (Day and Gandon, 2012).
Finally, although we have presented our results in the context of pathogen evolution, it is hopefully clear that a very similar theoretical framework could be used to study other examples of life history evolution in the context of demographic stochasticity. Current general life history theory largely neglects the evolutionary consequences of stochasticity arising from small population sizes. Our results suggest that it would be profitable to determine what sorts of insights might be gained for life history evolution more generally by using the type of theoretical framework developed here.  (8)) is indicated with a green line, the strong selection approximation is indicated with a red line (equation (7)). Parameter values of the resident population: n = 100, R 0 = 4, δ = 1, α = 3, γ = 1, λ = 2, β 1 = 20. For the simulation a single mutant (an individual host infected with a mutant pathogen) is introduced at the endemic equilibrium set by the resident pathogen: S eq = 24 and I eq = 35. 10 6 simulations are realized for each parameter values and we plot the proportion of the simulations where the mutant goes to fixation.   Supplementary Information Pathogen evolution: slow and steady spreads the best Glossary of Notation

Probability of fixation
density of individuals infected with strain i at time nt R (n) (t) density of recovered individuals at time nt N (n) (t) total density of individuals at time nt asymptotic density of susceptible individuals at time t I i (t) asymptotic density of individuals infected with strain i at time t R(t) asymptotic density of recovered individuals at time t N (t) total asymptotic density of individuals at time t E ,i endemic equilibrium with resident strain i asymptotic density of susceptible individuals in slow time limit I i (t) asymptotic density of individuals infected with strain i in slow time limit R(t) asymptotic density of recovered individuals in slow time limit N (t) asymptotic total density of individuals in slow time limit asymptotic frequency of individuals infected with strain i in slow time limit

Introduction
In this SI, we derive the results in the main text. Where suitable references exist in the literature, we keep the discussion informal, sketching how the results are obtained and referring to the appropriate references for rigorous proofs. Where they do not, we first give a heuristic derivation for a broader audience, whilst deferring the proofs to the end.

A Stochastic Epidemiological Model with Multiple Pathogen Strains
We consider a family of random processes S (n) (t), I indexed by a parameter n, the "system size" (van Kampen, 1992), which plays a role similar to the census population size in population genetics (see e.g., Ewens (1979); Durrett (2009); Etheridge (2011)). Similarly to those fixed-population models, we will consider the asymptotic behaviour of our model when n is large.
, and R (n) (t) are the number of susceptible individuals, individuals infected with strain i = 1, . . . , d, and recovered individuals, respectively. We will write N (n) (t) for the total population size at time t, so that Using this notation, our compartmental model for the epidemic is represented graphically in Figure 1. Equivalently, we may describe our model as a continuous-time Markov chain E taking values in N d+2 with transition rates given in Table 2. When a transition occurs at time t, we will distinguish between the value E(t−) of the Markov chain before the transition and its value E(t) after the transition. All parameters given in Table 2 may depend on n, but are assumed to have a constant value to first approximation in n which approaches an equilibrium value of λ (n) δ (n) n as t → ∞. Thus, to first approximation, the total population size is proportional to n.
If one knows the values of S (n) (t) and , then given one of R (n) (t) or N (n) (t) one can determine the other. For our purposes, it is more convenient to track the total population size, and consider the epidemic In what follows, rather than working with E (n) (t) we will focus on the rescaled process E (n) (t) has the advantage of being a density dependent population process (Kurtz, 1970(Kurtz, , 1971(Kurtz, , 1978(Kurtz, , 1981 as generalized in Pollett (1990): the transition rates in (2) depend only on the densities S (n) (t),Ī (n) (t),N (n) (t) and not on the absolute numbers of individuals. As we discuss below, density dependent population processes have a number of nice features, including a law of large numbers and central limit theorems.
Remark 1. To simplify our subsequent use of subscripts, we will considerĒ (n) (t) as a process taking values in R d+2 , the space of points and useS (n) (t) andĒ

Stochastic Differential Equation Formulation
Here, we introduce a very convenient way of writing our Markov chain as the solution to a stochastic integral equation with the help of simple Poisson processes. A Poisson process P is a Markov process making jumps of +1 exclusively, and such that P (0) = 0. A Poisson process P is a called a simple Poisson process if it jumps at constant rate 1. In this case, (P (at)) is a Poisson process with rate a. This can be generalized by noting that (P ( t 0 a(s) ds)) is a time-inhomogeneous Poisson process which jumps at rate a(t) at time t. Similarly, there is a unique continuous-time Markov chain X satisfying Then it is not difficult to extend this (see Chapter 6, §4 in Ethier and Kurtz (1986) for details) to our Markov process as follows: where all the processes P l (t) are independent, simple Poisson processes, indexed by the corresponding jump l of the Markov process (E (n) (t)) and e i is the element of R d+2 with zeros everywhere except a 1 at row i.
Changing variables, we get This formalism is useful because it will allow us to write each r.h.s. as the sum of a deterministic trend and of a stochastic term with zero expectation.
Recall that the marginal value P (t) of a simple Poisson process at time t is a Poisson random variable with parameter t. In particular, P (t) − t has mean 0 and variance t. So if we writẽ we are writing P (t) as the sum of a deterministic trend t and of a stochastic termP (t) with mean 0. If we come back to the example of the Markov process X jumping at rate f (X), we can write where we have set In addition, since the incrementsP (t + s) −P (t) are independent of the past before t, have mean 0 and variance s, we can write the last equation in differential form ds and equals 0 otherwise. In particular, conditional on X(t−) = x, dM (t) has mean 0 and variance f (x) dt. Thus, we also recover the infinitesimal variation of X as the sum of an infinitesimal trend in the dynamics and of a stochastic fluctuation term with zero expectation. Now let us return to our initial process. We adopt the same notation as previously, for examplẽ Proposition 1. The infinitesimal variation ofĒ (n) (t) can be written as the sum of an infinitesimal deterministic trend and of a stochastic fluctuation term with zero expectation: (t), are independent infinitesimal noise terms with mean zero and respective infinitesimal variances nρ the result in the proposition can be written more compactly as d+1 (x)). Thus, the function F (n) (x) describes the infinitesimal trend in the dynamics, whereas the terms M (n) l (t) capture the de-trended fluctuations corresponding to each type of possible event. This equation is analogous to an Itô SDE, only now the driving noise is the discontinuous M (n) (t), rather than the more familiar Brownian motion.
We note that for i = 1, . . . , d, two facts that will prove useful in what follows. We define similarly and for j = 1, . . . , d ρ In particular, it makes sense to define a (n) as the infinitesimal variance-covariance matrix of Let us compute a (n) . Because all distinct terms in the definition (S.2) of M (n) are independent, all cross terms vanish. For example, for any . We should not be surprised by the fact that this infinitesimal covariance is negative. Each event where a susceptible is infected by strain i has the effect of simultaneously decreasing the number of susceptibles and increasing the number of individuals infected by strain i.
By similar calculations, it is easy to see that In other words, for any 0 Equivalently, where the sum is over all possible jumps l of the Markov process (E (n) (t)).

Obtaining Equation (3)
Note that similarly to Brownian motion, the infinitesimal mean of 1 √ n M (n) l (t) during the time interval dt is zero and its infinitesimal variance is ρ (n) l Ẽ (n) (t) dt., whereas the jump size 1 √ n tends to 0 as n → ∞ (and thus, 1 is approximately continuous for large n), we see that for large values of n, this noise is approximately equal to a Brownian motion with the same variance: This allows us to rewrite the results in Proposition 1 in the form of the following diffusion approximation for n large, Kurtz (1978) for a rigorous statement). SettingĪ and combining independent Brownian motions, we obtain equation (3) in the main text (n.b., to simplify notation in the main text, we use X, Y and Z in lieu ofS (n) (t),Ī (n) (t) andN (n) (t)). We shall not use this diffusion approximation in the sequel, where we continue to consider the process with discrete jumps, (S.3).

Itô's Formula and Derivation of Equation (4)
As a first application of the SDE representation, we apply Itô's formula with jumps to our process to obtain an SDE for the proportion of each strain.
To motivate this, suppose we had a deterministic differential equatioṅ and we let X(t) be a deterministic real function of Y (t), say where g : R d+2 → R is assumed to be continuously differentiable.
Then, applying the chain rule, we derive a differential equation satisfied by X(t): or equivalently The analogue of the chain rule in the fully stochastic case is the Meyer-Itô's formula (see e.g., Protter (2004)).
where the sum is over the times s of discontinuity ofĒ (n) . At a time t of discontinuity, denotes the magnitude of the jump inĒ (n) j at time t. The term ε (n) (t) correcting for discontinuities distinguishes the more general Meyer-Itô formula from the familiar Itô's formula for diffusions. In Section 7.1, we show that ε (n) (t) = O 1/n 2 .
Using this, we can derive Equation (4) from the main text. Let is the proportion of the population infected with strain i. Since P (n) i (t) is a deterministic function ofĒ (n) (t), we can use Itô's formula (S.8) for jump processes. The following statement will be proved rigorously in Section 7.1.
Proposition 2. The fraction of the population infected by strain i satisfies is the infinitesimal variance associated with the growth of strain i. In addition, for any To obtain Equation (4) in the main text, we omit the lower order error term ε (n) gives the total number of infectives, and observe that, similarly to the previous section,

Probability of Fixation of a Mutant Pathogen
In this section, we will be interested in the long time behaviour of our multi-strain stochastic epidemics. In particular, we tackle the problem of predicting which strains will be outcompeted and which strains will fix.

A Deterministic Limit and its Asymptotic Analysis
We begin this section with a result stating the convergence to a deterministic dynamical system as n → ∞.
Note that the result in the previous proposition can be written more compactly aṡ While we continue to work with the finite n fully stochastic process, the bifurcation structure of the deterministic system (S.12) will guide our analysis of the stochastic model. In particular, the steady states of this model, together with the degenerate case that arises when stability is exchanged between fixed points, give rise to two regimes that correspond to strong and weak selection in classical population genetics. To be explicit, let be the basic reproduction number of strain i. R 0,i is the expected total number of new infections caused by a single infected individual, assuming an unlimited supply of susceptibles. If R 0,i = R 0,j for all 1 ≤ i = j ≤ d, the equations (S.12) have d + 1 fixed points, one at 0 and one at the d equilibria where the population is infected by a single strain When d = 1, it is shown in Vargas-De-León (2011) when δ > α 1 that: if R 0,1 > 1 then unique endemic equilibrium of the strain,Ē ,1 is globally asymptotically stable, whereas if R 0,i ≤ 1, the disease-free equilibrium 0 is globally asymptotically stable. The stability of fixed points is slightly more subtle when there is more than one strain.
Definition 1. We distinguish between two regimes of selection.
Proposition 4. The long term behavior of the deterministic system (S.12) differs according to the selection regime.
(i) In the strong selection case, the equilibrium stateĒ ,1 with strain 1 endemic and all other strains extinct is globally asymptotically stable from any initial condition for which I 1 (0) > 0.
(ii) In the weak selection case, we arrive at a degenerate situation in which deterministic coexistence of strains 1, . . . , d is possible. Strains m + 1, . . . , d will eventually disappear, whereas all points are fixed points for the system (S.12). The set Ω of such points is globally attracting, but no point in Ω is an attracting fixed point.
Proof. Point (i) follows by a direct adaptation of the result in Bremermann and Thieme (1989): rearranging (S.12b), we see that and, recalling that I 1 (t) is bounded for all t > 0, we see that for all i > 1 as t → ∞.
The same argument shows that in the weak selection case, strains m + 1, . . . , d will eventually disappear, whereas all points in Ω are fixed points. Moreover, all vectors u tangent to Ω, i.e., such that m i=1 (β i − α i )u i = 0, u m+1 = · · · = u d = 0, and u d+1 = R 0 u 0 , are all eigenvectors to the Jacobian of F -evaluated at any x ∈ Ω -corresponding to the eigenvalue 0. Thus, while Ω is a globally attracting set, no point in Ω is an attracting fixed point.
We now turn to the computation of the fixation probability of a novel strain in the fully stochastic system. Informed by the previous statement, we will consider two cases, strong and weak selection, where the dynamics of the process -and thus our approach to the fixation probabilities -are qualitatively different. We will then show, despite the difference in the approaches, and in the expressions for the fixation probability thereby obtained, that our two results for the fixation probability agree on all intermediate scalings, and may thus be combined (heuristically) via the method of matched asymptotic expansions, to obtain a single expression valid across all scales.

The Strong Selection Case
We begin by recalling that for the deterministic approximation (S.12) to apply, we required that , that a non-trivial portion of the population is already infected with strain i.
For any strain with I (n) i (0) → 0, and thus I i (t) ≡ 0 for all t ≤ T , for any fixed T > 0: until O(n) individuals are infected, strain i is effectively invisible to the deterministic approximation on any finite time interval. This is not to say that the strain is absent, but rather, if we sample individuals from the population uniformly at random, the probability of sampling an individual infected with strain i is zero.
We will consider the case when a fixed number k of strain 2 individuals invade an established resident population. For our purposes, a strain i is established if it is initially present in macroscopic numbers, i.e.,Ī (n) In light of the results in 5.1, we will assume that there is only a single resident strain, strain 1. We will first consider the case when the resident strain is in endemic equilibrium, and then generalise to the case when the resident strain, whilst still present in macroscopic numbers, is initially away from equilibrium.
Should the invading strain, strain 2, exceed εn individuals, for any ε > 0, we arrive again in the domain of applicability of the deterministic approximation (Ī (n)

Invasion at the Resident Endemic Equilibrium
If we start at the endemic equilibrium of the resident strain 1, E ,1 , then until I (n) 2 exceeds εn, the epidemic process E (n) (t) will remain close to that point. We thus have and to first approximation, strain 2 has per-host transmission and clearance/mortality rates of β 2 R 0,1 and δ + α 2 + γ 2 .
This latter is a birth and death process (see e.g., Bartlett (1955)) which will go extinct with probability if R 0,2 > R 0,1 , and with probability q = 1 otherwise. This probability of extinction q is for a single initial individual infected with strain 2, and becomes q k for k initial individuals. Now, a birth and death process either goes extinct or grows arbitrarily large, so with probability 1 − R 0,2 R 0,1 it will eventually exceed εn.
Similarly, when we have reached a neighbourhood of E ,2 the transmission and clearance/mortality rates of strain 1 are approximately β 1 R 0,2 and δ + α 1 + γ 1 .
Since R 0,2 > R 0,1 , this is a subcritical birth-death process which goes extinct with probability 1. Thus, invasion implies replacement, where for our purposes, the process invades if it exceeds εn individuals for some fixed ε > 0.
To summarise, we have heuristically derived Proposition 5 (Strong Selection). Consider a population infected with 2 strains such that R 0,2 > R 0,1 .
(ii) [Novel strain in small number of copies, resident at endemic equilibrium] Suppose thatĪ In other words, the event that strain 1 becomes extinct asymptotically coincides with the event that strain 2 invades, which happens with a probability asymptotically equal to the probability of survival 1 − R 0,1 R 0,2 k of a time-homogeneous birth-death process; on this event, the system reaches in finite time the deterministic equilibrium (S.13) with only strain 2 endemic.
We give a rigorous proof of this result in Section 7.2, and compare (S.15) to fixation probabilities estimated from simulated epidemics in Figure 6b.

Invasion Away From the Resident Endemic Equilibrium
In this section, we consider exactly the same setting as previously when a novel strain in small number of copies appears when the resident strain is at macroscopic initial frequency (i.e., I (n) 2 (0) = k, I (n) 1 (0) →Ī 1 (0) > 0) but we relax the assumption that the resident strain is at endemic equilibrium. As previously, we use a branching process approximation to determine the probability that I (n) 2 (t) > εn, for some small ε > 0, for some finite t > 0. However, now, this branching process will no longer be time-homogeneous, since it will evolve within the changing environment imposed by the deterministic dynamics of the resident strain. Indeed, rather than assume thatS (n) (t) N (n) (t) ≈ 1 R 0,1 , we will use the law of large numbers to conclude that where S(t) and N (t) are determined via the reduced (deterministic) systeṁ We thus approximate the number of individuals infected with strain 2 by replacing the stochastic quantitiesS (n) (t) andN (n) (t) by their deterministic approximations, and the number of infectives with the novel strain 2 by a time-inhomogeneous birth and death process with per-host transmission and clearance/mortality rates of β 2 S(t) N (t) and δ + α 2 + γ 2 .
As before, the probability this branching process reaches εn is exactly 1 less the probability of extinction for this time-inhomogeneous branching process, which is a classical result: Theorem 1 (Kendall (1948)). Let Z(t) be a continuous time linear birth-death process with timevarying birth rate λ(t) and death rate µ(t), i.e., such that Corollary 1. Consider a single individual infected with strain 2 entering a population where strain 1 is endemic but not necessarily at equilibrium. Then, as n → ∞, the probability strain 2 dies out is More generally, if strain 2 is initially in k copies, the probability strain 2 fixes is asymptotic to The argument is, mutatis mutandis, that of Section 7.2; for details, we refer the reader to .
While we can not evaluate J analytically, we can evaluate it numerically. More generally, we will compute U (t) = P {Z(t) > 0} numerically, and observe that it rapidly converges to equilibrium. Rather than evaluate the integral directly, we find it more convenient to use the ordinary differential equations used in Kendall (1948) to derive (S.17). Using the notation of Theorem 1 (and Kendall (1948)), this is obtained via a system of two equations, Remark 3. Whilst we will not use it in the sequel, we note that the auxiliary function V (t) allows one to fully characterise the branching process Z(t): the number of individuals alive at time t is given by a modified geometric distribution with parameter 1 − V (t): For our epidemic model, this gives us the systeṁ which is easily computed numerically.
In Figure 2, we show the consequences of choosing initial conditions away from equilibrium under two scenarios. We fix the values R 0,1 = 4 and R 0,2 = 6, where the increase in R 0,2 is achieved either by increasing the contact rate β 2 while holding the virulence, α 2 , fixed (solid curves) or by reducing the virulence α 2 while holding the contact rate, β 2 , fixed (dashed curves). In both cases, we see that the change in the fixation probability (the asymptotic value of U (t)) is most visible when the number of individuals is infected with the resident strain is varied (Figure 2a), whereas it is less pronounced when the number of susceptibles is varied (Figure 2b), and relatively small when the population size is varied (Figure 2c). We note that the strain that achieves the higher value R 0,2 via an elevated virulence has a higher probability of fixation than the strain with a higher contact rate when there is a surplus of resident-strain (green curves, 2a) and a lower fixation probability when there are fewer (blue curves, 2a). This relation is inverted when the number of susceptibles is varied: when there are more susceptibles, the less virulent strain has a lower fixation probability than the strain with higher contact rate (green curves, 2b) and higher fixation probability when susceptible hosts are limited (blue curves, 2b). We will quantify these changes for small perturbations away from the endemic equilibrium in the next section. 0,1 , so that the mutant is under strong favourable selection. We implement this in two ways: first, by setting β 2 = 3 2 β 1 (solid curves) and secondly, by setting α 2 = β 2 R (n) 0,2 − δ − γ 2 (dashed curves). For all curves, λ = 2, δ = 1, β 1 = 20, α 1 = 3, and γ 1 = γ 2 = 1. When β 2 is varied, α 2 = α 1 = 3; when α 2 is varied, β 2 = β 1 = 20. For these parameters, (green). The probability of fixation for single mutant at the endemic equilibrium is 1 11 , which is shown in all panels by the azure line.

Invasion Near Endemic Equilibrium
One way to study the effect of the initial state of the resident population on the invasion of a novel strain is to consider small perturbations near the endemic equilibrium of the resident, More specifically we consider S(0) = S + εs(0), I 1 (0) = I 1 + εi 1 (0), and N (0) = N + εn (0) for some small dimensionless constant ε > 0 (independent of n -we consider perturbations that are a positive fraction of the population), and compute an expansion in ε of the probability of invasion 1 − q of strain 2 starting from one single infected with this novel strain. The proof of the next statement can be found in Section .
Proposition 6. Set Λ := β 2 1 R 0,1 − 1 R 0,2 . Then the expansion of the invasion probability 1 − q in the initial deviation ε to endemic equilibrium of strain 1 is given by To first order, as expected, we obtain the probability 1 − R 0,1 R 0,2 of fixation of strain 2 invading the strain 1 initially at equilibrium. The higher order terms allow us to study the effect of the change in population size. When the novel strain carries beneficial mutations (i.e. R 0,2 > R 0,1 ) demographic perturbations that result in initial population growth of the pathogen population (i.e. s(0) > 0, i 1 (0) < 0, n(0) < 0) increase the probability of fixation of the novel strain. This effect is related to classical population genetics results (see e.g. Ewens (1967); ) on probability of fixation in populations of changing size.
Note, however, that away from equilibrium, for given values of R 0,1 and R 0,2 , the fixation probability depends also on the transmission β 2 of the novel strain. In other words, the life history traits of the novel strain can also affect the evolutionary outcome. If the novel strain is more transmissible than the resident (i.e. β 2 > β 1 ) its probability of fixation is favoured when the perturbation leads to an initial increase of the pathogen population. For instance, a high number of susceptible hosts (i.e. s(0) > 0) increases the probability of fixation of a transmissible strain. This effect results from a form of r vs. K selection ( (Pianka, 1970(Pianka, , 1972Reznick et al., 2002)), where fast-growing (high transmission and high virulence) strains are favoured in undersaturated environments, whereas long-lived (low virulence and low transmission) strains are favoured in oversaturated environments. The effect of epidemiology on evolution is very apparent in the analysis of the deterministic dynamics (see equation (**) in the main text). The above (S.19) provides a stochastic treatment of the influence of epidemiology on the ultimate evolutionary outcome. Furthermore, we note that Thus, if R 0,1 and R 0,2 are close, we see that the difference in the fixation probability from that at equilibrium is, to lowest order, proportional to i 1 (0), and is positive if i 1 (0) < 0, and negative if i 1 (0) > 0. Moreover, the difference is proportional to β 2 2 , so that if i 1 (0) < 0, there is an advantage to larger values of β 2 , whereas if i 1 (0) > 0, the fixation probability is maximized by minimizing β 2 .
Example 1 (Vaccination). Suppose that starting at time t = 0, a fraction f of the incoming susceptibles to a population at the endemic equilibrium are immunized. Then, the rate of incoming susceptibles is reduced from λ to λ(1 − f ). Recalling the values of S , I 1 , and R above, we see that post-vaccination, the population has a new equilibrium, In particular, the initial condition differs from the new equilibrium by a perturbation of magnitude f with , i 1 (0) = λR 0,1 β 1 − α 1 , and n(0) = − α 1 λR 0,1 δ(β 1 − α 1 ) .
Moreover, (S.19) is otherwise independent of λ, and thus f , so that when f is not too large, we may use it to determine the fixation probability of a new strain that emerges at approximately the time when vaccination commences. In particular, if we assume if R 0,1 and R 0,2 are close, we can use (S.20) to see that the fixation probability of an invading strain is , which is lower than the fixation probability at equilibrium, and decreasing with increasing contact rate. Thus, vaccination reduces the probability of new strains arising, in particular more virulent strains.

The Weak Selection Case
Recall that weak selection corresponds to the case when R 0,i = R 0,j = R 0 for 1 ≤ i, j ≤ m. We hasten to clarify, however, that 0,j are allowed to differ by O 1 n terms; as we shall see below, this is analogous to the weak selection limit of classical population genetics, and the values r i will appear as selection coefficients in a diffusion approximation.
In this case, we have a separation of timescales: there is a fast time-scale, in which (S.11) tells us that the stochastic process approximately follows the trajectories of (S.12) arbitrarily closely to an arbitrarily small neighbourhood of Ω. Then, as we discuss below, there is a slow-time scale, in which, having arrived at Ω, the stochastic process remains near this critical manifold. LetŜ (n) (t) :=S (n) (nt) = 1 n S (n) (nt), where, as before, we letÎ (n) (t) = Î (n) 1 (t), . . . ,Î Here, rescaling time by n is analogous to the passage to so-called "coalescent time" or "generation time", which is used to derive the diffusion limit of the Wright-Fisher model in classical population genetics.

Recalling (S.3), the SDE forÊ (n) (t) is then
Thus, in the slow time scale, the drift is accelerated by a factor of n, causing the process to move rapidly to the critical manifold Ω; as n → ∞, this movement becomes instantaneous, and the process immediately jumps to Ω at time t = 0. Moreover, stochastic fluctuations away from Ω are restored instantaneously, so the process becomes "trapped" on Ω as n → ∞. The following statement formalizes this idea using the projection π defined as follows. Let E(t, x) = (S(t, x), I(t, x), N (t, x)) be the solution to (S.12) with initial conditions S(0) = x 0 , I i (0) = x i , and N (0) = x d+1 and let π(x) := lim t→∞ E(t, x), i.e., π(x) is the point on Ω at which the trajectory of (S.12) starting from x meets Ω.
Proposition 7. As n → ∞, (Ê (n) (t)) ⇒ (Ê(t)) 1 where the latter is a diffusion on the manifold Ω, solution to the system of stochastic differential equations where the (B k ) denote D = 3(m + 1) independent standard Brownian motions, In other words, for each i where Dπ i denotes the gradient vector of π i , Hπ i its Hessian matrix, and Tr denotes the trace operator.
This diffusion can be understood as the result of stochastic fluctuations around Ω immediately followed by a strong deterministic drift towards Ω.
As can be seen from Proposition 3, the drift pushes the process very rapidly onto Ω, so that in the limit, the process lives permanently in Ω. Now to understand the interplay between the deterministic dynamics towards Ω and the stochastic fluctuations around Ω, it is useful to think of the dynamics in two steps. Suppose that starting from a pointÊ(t−) ∈ Ω, the process E has a jump l. Then, the rescaled process (Ê (n) (t)) has a jump 1 n l. In a second step, it is immediately projected back to the manifold by the drift at the new location, so: for all f ∈ C(S); the values E[f (X)] completely characterise the distribution of X. Weak convergence is denoted by Here, S is the Skorokhod space D R d+2 [0, ∞) of right-continuous functions from [0, ∞) to R d+2 with left limits; the interested reader is referred to Billingsley (1968) for a very readable account of weak convergence on D.
Thus, expanding the i-th component of the r.h.s. of the last equation and recalling that π Ê (t−) = To determine E[dÊ where, because we have rescaled time, d(nt) replaces dt in probability of a jump at t. Recalling F (Ê(t)) = 0 and (S.7), this yields the first term of the previous equation yields the first term of Eq (S.21). A picture (Figure 3) more immediately explains the emergence of the variance induced drift: unless the flow lines are parallel, jumps of identical magnitude and direction will be returned to the manifold Ω at different distances from the initial point, as one moves along the manifold: Of course the rigorous way of obtaining the result is to use Itô's formula as done in the proof.
To characterise the limitÊ, we shall make use of π. Unfortunately, π(x) is impossible to compute analytically, but we can still use it to obtain an SDE forÊ(t). We first observe that if F is twice-continuously differentiable, then π is as well Hirsch and Smale (1974). The continuity of π then tells us that π(Ê (n) (t)) ⇒ π(Ê(t)) as well. Since π has first and second derivatives, we may apply Itô's formula (see Section 4) to π(Ê (n) (t)): where, as before, a  The mutant strain has a higher virulence than the resident. The deterministic trajectories are shown as grey arrows that point towards the manifold Ω (the black line). The light red ellipsoid has axes proportional to the infinitesimal variance of the jumps that displace each strain from a given point on the manifold (black dot). The combination of the effect of stochasticity and the fast deterministic return to the manifold generates a drift (red arrow) that favours the strain with the lower virulence. Parameter values of the resident: β 1 = 10, α 1 = 2, δ = 0.05, γ = 0.5. Virulence of the mutant: α 2 = 1.25, 2, and 2.75 in A, B and C, respectively.
On first inspection, it might appear that the drift term, which is multiplied by n, explodes as n → ∞; however, from the definition of π, we see that π(E(t, x)) = π(x), and thus, and the terms of order O(n) vanish identically. We can thus replace F (n) by F (n) − F in (S.23), leaving which remains bounded, as our assumptions (S.1) guarantee that where, for example, The latter is a stochastic process with jumps of order 1 n and variance Thus, as n → ∞, 1 n M (n) −e 0 −e d+1 (nt) approaches a continuous stochastic process with variance t 0 δÊ 0 (s) ds.
The martingale central limit theorem (see e.g., Ethier and Kurtz (1986)) tells us that the only stochastic process with these properties is a Brownian motion with the same variance, (i.e., B −e 0 −e d+1 (t) is a standard Brownian motion with mean 0 and variance t).
Proceeding similarly, in the limit, we may replace all the terms M (n) l with integrals of independent Brownian motions, so that as n → ∞, 1 n M (n) (nt) aproaches is an ordered list of the D Brownian motions corresponding to the D noises M (n) l (t) and σ(x) is as in the statement. Taking the limit as n → ∞ on both sides of (S.23) and recalling that π(Ê(t)) =Ê(t), we obtain (S.21).
While the drift terms seem rather mysterious, they may be interpreted geometrically. We first observe that Proposition 8. (Dπ)(x) is the projection onto the tangent space to Ω at x, T x Ω.
If, moreover, x ∈ Ω, we also have π(π(x)) = x, so taking derivatives on left and right, using the chain rule, we have that (Dπ)(π(x))(Dπ)(x) = I, where I denotes the identity matrix. Now, since x ∈ Ω, the right hand side is equal to so we have that (Dπ)(x) is a projection. It remains to see that it is a projection onto the tangent space. We will do so by showing it's image contains, and is contained by, the tangent space.
For the former, we recall that a vector X is in the tangent space to Ω if and only if there exists a parametric curve σ x,X (t) such that We then have π(σ x,X (t)) = σ x,X , and thus and thus T x Ω is in the image of (Dπ)(x). On the other hand, since π(x) ∈ Ω, we have F (π(x)) = 0, and again, taking derivatives using the chain rule, we have (DF )(π(x))(Dπ)(x) = 0, so that if x ∈ Ω, we have (DF )(x)(Dπ)(x) = 0 and thus (DF )(x)(Dπ)(x)X = 0 i.e., the image of (Dπ)(x) is contained in the kernel of (DF )(x), which we have already observed is T x Ω. Thus, Im ((Dπ)(x)) = T x Ω.
Thus, the drift vector (Dπ)(x)f (x) from (S.22) is the projection of the vector f (x) onto the tangent space to Ω. This is an immediate consequence of the strong drift: in the absence of constraints, the process would move (on average) in the direction of this vector, whose components are the relative fitness of each strain, multiplied by the density of that strain. However, density limitation prevents unlimited growth, confining the process to the manifold Ω, and thus the direction of motion to the tangent plane, and the strains experience a drift that is the best approximating vector to their unconstrained growth rates.

Computing the derivatives of π
To complete our derivation of the equations for the limiting processÊ(t), we must compute the derivatives of the π i .
Proposition 9. Let x ∈ Ω and 0 ≤ i ≤ d + 1. The first partial derivatives of π i at x are given by The second partial derivatives π i at x are given for any k, n both different from 0 and d + 1, by: Proof. We recall that under the weak selection hypothesis, We can solve this to obtain 1 β i ln and, taking the limit as t → ∞, Taking derivatives, we then have Moreover, using (S.14) we have that Together, these equations give us systems of linear equations that may be solved for the various derivatives of π i (x). To illustrate, consider ∂π i ∂x 0 ; from the above, we have that whence ∂π 1 ∂x 0 = 0, and thus ∂π i ∂x 0 = 0 for all i. Proceeding in the same manner, we find ∂π i ∂x d+1 = 0 as well, and thus that all second derivatives of π i (x) involving x 0 or x d+1 vanish identically, whilst We shall only need to evaluate these for x ∈ Ω, whereÊ(t) is trapped. For such x, the first derivatives simplify to the expression given in the statement, since π(x) = x for x ∈ Ω. Similar calculations lead to the second partial derivatives.

Reduced Diffusion
We can use the results of the previous section to provide semi-explicit expressions for the SDE satisfied byÊ and displayed in Proposition 7.
Proposition 10. Unlike the full stochastic SIR model, the weak selection limitÊ can be completely characterised by a system of equations that depend only on the variablesÎ 1 , . . . ,Î d : and dB 1 (t), . . . , dB m (t) are independent Brownian motions.
Proof. In the previous section, we observed that ∂π i ∂x 0 = ∂π i ∂x d+1 = 0, and thus any second partial derivative with respect to x 0 or x d+1 vanishes as well. Moreover, for i = 1, . . . , d, Moreover, for x ∈ Ω, we have so that, from (S.6), we obtain in the limit n → ∞ Similarly, for 1 ≤ j, k ≤ d, σ jk (x) depends on x 0 or x d+1 only via the ratio x 0 x d+1 which is identically equal to 1 R 0 on x ∈ Ω. Substituting these and the first derivatives into (S.21) allows us to complete the description of the weak limitÊ(t), exploiting the fact that the triples of Brownian motions B −e 0 +e i , B −e i −e d+1 and B −e i and B S,j , B j,− and B j,R are independent for i = j to combine each triple into a single Brownian motion.

Frequency Process
Repeating the argument of Section 4, we can use the functions Π i to finding an equation for the frequency of strain i, where, because the limiting process is a diffusion, the standard Itô formula applies. We omit the lengthy calculations this entails, and present simply the result.
For our process P (t), we find that for s(x) as defined by (S.26) and (S.27), and where, if p ∈ ∆ d corresponds to the point x ∈ Ω, i.e., Writing , and recalling that we see that we can explicitly express I e as a function of p: Remark 4. The notation above has been deliberately chosen to recall the Wright-Fisher diffusion in population genetics, with s i (p) and I e (p) a frequency-dependent selection coefficient and an effective population size, respectively. To understand the motivation for the notation I e , which is meant to recall the effective population sizes used in population genetics, we note that so that I e (x) is approximately the total density of infected individuals when the diffusion limit is at the point x ∈ Ω, or, equivalently, when the frequencies of the various strains is p. Remark 5. As before, vector b(p) may be interpreted geometrically as the projection of the vector    s 1 (I e (p)p)p 1 . . .
onto the simplex ∆ d .

Results for d = 2
If we have d = 2 strains, then, since P 1 (t) + P 2 (t) = 1, it is sufficient to consider the frequency of the invading strain, strain 2. Writing P (t) := P 2 (t), the results of the previous section tell us that the generator of P (t) 2 is The generator allows us to compute many quantities of interest for the process P (t). In particular, if h(p) is the probability of fixation of strain 1 given P (0) = p, then h(p) satisfies the boundary problem The generator of P (t) is the operator on the space of continuous functions on the d-simplex We recall that if the diffusion process P (t) has SDE dP (t) = b(P (t))) dt + ς(P (t))) dB(t), then is the variance-covariance matrix for dP (t), and the probability density function for P (t), say f (t, p), satisfies the Kolmogorov backward equation (see e.g., Ewens (1979); Durrett (2009); Etheridge (2011)). This may be solved to give Leth(p) be the numerator of this fraction. Substituting the expressions for a(p) and b(p) and some simplification yields .
For ease of notation, we will write We can evaluate this expression numerically, but we will be particularly interested in a number of special cases, when we can obtain analytical approximations toh(p).
(i) When r 2 = r 1 (or, more generally, when R 0,i = R 0 1 + o 1 n ) we can give an explicit closed form forh(p), and thus h(p): This expression is not, however, especially illuminating.
Unfortunately, this does not yield an estimate of the normalising constant,h(1). To obtain this, we consider the case when β 2 − β 1 and α 2 − α 1 are small. While this is a restrictive assumption, it will allow us to consider the long-term evolution in the framework of adaptive dynamics, where mutational changes are assumed to be very small. To this end, we introduce σ and θ such that β 2 = β 1 (1 + σ) and α 2 = α 1 (1 + θ).
We then have that where O(2) is used to denote terms of order O σ 2 , O θ 2 , or O(σθ).
Then, to order p 2 , the fixation probability is which may be written informally in terms of the original parameters as to lowest order. Here we have used In practice, we are most interested in the case when a single individual carries the mutant strain, so p = 1 nIe(0) = 1 I (n) (0) . While our proofs -which assume that p is independent of n, and thus that the number of invading individuals is proportional to n -do not justify taking this value for p, we find that the expression for the fixation probability obtained by taking p = 1 I (n) (0) , which to lowest order is agrees extremely well with simulations ( Figure 6c) -another example of the so-called "unreasonable effectiveness of mathematics' (Wigner, 1960) -and will use it to investigate the long term evolution of the virulence in Section 6.
Remark 6. We briefly note that this simple approach becomes invalid when I (n) i (0) = o(n) for some n (and thus (x, y, z) ∈ ∂R 2 + ); in this case, stochastic effects will lead to the rapid extinction of the rare strain. Unfortunately, our previous branching process approach does not translate directly: the heuristic approximation gives an asymptotically critical branching process, whereas our upper and lower bounds become supercritical and subcritical branching processes respectively, so that the resulting "sandwiching" of the actual process is uninformative. We are currently considering a refined approach for this case.

Reconciling the Strong and Weak Selection Results
On first inspection, our expressions for the strong and weak selection limits have little in common. In Section 5.2, we saw that if R (n) 0,i → R 0,i and R 0,2 = R 0,1 , then, if I (n) 2 (0) → I 2 (0), the fixation probability was if R 0,2 > R 0,1 , and 0 otherwise On the other hand, in Section 5.3, we assume that 2 (0) → I 2 (0), and derive a quite different appearing expression for the fixation probability.
We can find a large n asymptotic expression for this probability using a pair of lemmas: Lemma 1. Suppose that φ(x) and g(x) are an increasing continuously differentiable function and a continuous function on [a, b] (−∞ < a < b < ∞) respectively, and that φ (a) = 0. Then, Proof. Fix ε > 0 such that φ (a) > ε. Using Taylor's theorem, we may write as M → ∞, and similarly for the lower bound.
Since ε > 0 can be chosen arbitrarily small, the result follows.
Lemma 2. Let φ(x), g(x), and [a, b] be as above. Then, Proof. By direct computation, we have The result follows.
Thus, applying Lemma 2, we havẽ and the probability of fixation obtained from the weak selection expression is again asymptotic to 1 − e −(r 2 −r 1 )ι .
While this is not a rigorous proof, it does demonstrate heuristically that the weak and strong selection expressions for the fixation probability agree to first order when applied across the intermediate selective regimes. In particular, we can use the method of matched asymptotic expansions (see e.g., Hinch (1991); Kevorkian and Cole (1996)) to combine our two solutions into a single expression valid across all scales, by summing the expressions for strong and weak selection and subtracting their common limit, where all are expressed in the unscaled (i.e., strong selection parameters): where [x] + = max{x, 0} and we have used that We illustrate how these approximations compare to a simulated epidemic in Figure 6.
0,1 = R 0 = 4, λ = 2, δ = 1, β 1 = 20, and α 1 = 3. (b) shows strong selection approximation (S.15) (green curve), (c) shows the weak selection approximation (S.29) (red curve) and its second order approximation (S.29) (blue line), (d) shows the matched asymptotic approximation (S.34), whereas (a) overlays all the approximations for comparison. As would be expected, the strong and weak approximations do well in their corresponding parameter regimes, but poorly elsewhere, whereas the matched asymptotic provides a compromise, performing worse than the weak or strong approximations at the respective extremes, but interpolating between them for intermediate values of s.

Adaptive Dynamics
Using the expressions for the fixation probability derived above, we can use the framework of adaptive dynamics to investigate the long-term evolution of strains. In what follows, we give an informal discussion of the derivation of the canonical diffusion for the process, a generalisation of the canonical equation of adaptive dynamics which allows us to consider the influence of random drift on phenotypic evolution. We refer the reader to  for a more extensive discussion aimed at a biological audience and to  for a mathematically rigorous derivation of the canonical equation.
We briefly recall the assumptions of adaptive dynamics in the context of our epidemic models; throughout, we assume that a novel mutant strain with virulence α invades a population which is at the endemic equilibrium with a resident strain α.
We assume a tradeoff between transmissibility and virulence, so that the contact rate of a strain depends on its virulence according to some fixed function β(α). For our numerical investigations, we take where w is a parameter that determines the "flatness" of the fitness landscape. The reproductive number is a function of the virulence, We will assume that there is a value, α 0 , for the virulence that maximises R 0 (α). Under these assumptions, the density of ndividuals infected with the resident strain at the endemic equilibrium is is non-zero on a range (α min , α max ); outside of this range, the pathogen goes extinct. We then have, using (S.32), that the fixation probability of a mutant strain of virulence α arising in a single individual in a population in which a strain of virulence α , is To introduce the evolutionary dynamics, we assume that mutations occur in individuals with virulence α at a per-capita rate η(α), where > 0 is a dimensionless parameter that we will take to 0. This will ensure that, with high probability, fixation occurs before a second novel mutation can arise. The population is thus assumed to be monomorphic (i.e., all individuals have the same virulence) between invasion events.
Finally, we assume mutations have small effects, and are unbiased in direction, so that a mutation in a strain of virulence α gives rise to a new strain of virulence α ≈ α. We will assume that given a mutation occurs in an individual α, the offspring has virulence α with probability K(α, α ); we will further assume that αmax α min where ε is a dimensionless parameter which we will take to 0; this limit of small mutational effects allows us to ignore terms of order O |α − α | 2 in the fixation probability. We now pass from the individual based model to the trait substitution sequence Dieckmann and Law, 1996): we have seen that whenever a new strain arises, either the mutant or resident strain will rapidly go extinct. Until a new mutant arises, the population will be composed entirely of individuals of the surviving strain. Let A (t) be a random variable giving the virulence of the strain that survived the last competition event prior to time t. The population is thus entirely composed of the strain A (t) except for times t in the short intervals when two strains are competing. If we pass to a "mutational time scale", t , as → 0 the duration of these intervals shrinks to 0, and we are left with a process in which novel mutations either fix or disappear instantly, so that the population is only observed with a single strain at equilibrium.
Formally, as → 0, A ( t ) ⇒ A(t), a continuous time Markov chain that jumps from virulence α to α when a strain of virulence α successfully invades a population with resident virulence α. The process A(t) has generator (recall, the generator is the operator L defined by Lf (α) : Now, consider the time rescaled processÂ ε := A t ε ; this has generator Taylor expanding s(α, α ) (f (α ) − f (α)) in α about α, this is equal to Thus, as ε → 0,Â ε (t) converges to a limiting diffusionÂ(t) with advective coefficient and diffusion coefficient and generatorL The processÂ(t) is our canonical diffusion.

Stationary Distribution
Using (S.35) and (S.36), we may compute the stationary distribution ψ ofÂ(t); this stationary distribution describes the long-term behaviour of the virulence, after any "memory" of the initial state has been lost. Given any subset A ⊂ (α min , α max ), ψ(A) gives the proportion of time that the virulence is in the set A, or equivalently, the probability that at some random sampling time t, the virulence takes a value in A. ψ is characterised by the relation αmax α minL f (α)ψ(dα) = 0.
In particular, if ψ(dα) has a density, which, in a slight abuse of notation, we write as ψ(α), is the adjoint operator toL (and thus, is the Fokker-Planck equation for the probability density ofÂ(t), f (α, t)). Thus, dα dα is a normalising constant.
From the previous, we have Unfortunately, we can only compute its integral analytically in the case of a "flat landscape", when R 0 (α) ≡ R 0 , independently of α. In this case, we have β(α) = R 0 (δ + α + γ), β (α) = R 0 , and 2µ(α) In particular, in the case when η(α) and ν(α) (and thus σ 2 (α)) are constants independent of α, then we can integrate this to obtain Z and thus a closed expression for the stationary distribution: In the next section we will show how one may obtain an analytical approximation in the large n limit.
On the other hand, we recall that R 0 (α) = β(α) δ+α+γ , so that (S.42) and thus also R 0 (α 0 ) = β (α 0 ). We next observe that which depends on the choice of tradeoff function β(α). We note briefly that if we assume β(α) is increasing, then this is a local maximum if and only if β (α 0 ) < 0. In particular, if we assume that R 0 (α) has a unique global maximum, then it must occur at α 0 . We will henceforth make this assumption (n.b., we don't have to assume this in general, but to apply Laplace's method, we require that α 0 be a global maximum).
We then have We then have Next, recalling that φ (α 0 ) = 0, a Taylor expansion gives Now, for α close to α 0 , (α − α 0 ) 3 will be quite small, so we can locally approximate our full stationary distribution by a process that is almost a Gaussian with mean α 0 and variance 1 nI eq (α 0 ) |R 0 (α 0 )| R 0 (α 0 ) , except for a pre-factor of σ 2 (α 0 )β(α 0 ) σ 2 (α)β(α) , which skews the distribution: Again assuming that η(α) and ν(α) are constants independent of α, this simplifies to method to obtain the O 1 n terms in both the integral above and in Z; we omit the calculations, but remark that the mean can then be shown to be which simplifies to − 1 δ+α 0 +γ in the case when η and ν are independent of α, and We note that this order O 1 n term is proportional to the variance of the best-fit Gaussian of the previous section.

Estimating the Mode
We begin by observing the density function of the stationary distribution (S.37) may be written as e 2µ(α) σ 2 (α) dα−ln σ 2 (α) Z dα and thus has its maximum where or equivalently, for the value of α such that where φ(α) is given by (S.41). When σ 2 (α) is constant (i.e., the variance of the mutation kernel is independent of the resident variance) then this reduces to 2µ(α ) σ 2 = 0, and thus µ(α ) = 0. Recalling (S.35), this tells us that and thus On the other hand, (S.42) tells us that since β(α) is increasing. In particular, since R 0 (α) is maximized at α 0 , we see immediately that α < α 0 . Even in this special case, we cannot solve for α exactly. Instead we will seek a perturbative solution to in the general case. We already know that φ (α 0 ) = 0; we thus seek a solution of the form Substituting this into (S.46) and Taylor expanding right and left, we find that n which may be expanded using the expression for g (α 0 ) g(α 0 ) given in the previous sections. In particular, when when η(α) and ν(α) (and thus σ 2 (α)) are constants independent of α, we have that (S.47) so that to first order, the modal virulence is the virulence maximizing R 0 (α) less the product of the variance of the best-fit Gaussian and the expected infectious period when the virulence is α 0 .

Proof of Proposition 2
Substituting f with Π i in Itô's formula (S.8) for jump processes yields i (t) can be expressed thanks to Equation (S.9) as

Now some elementary computations yield
where 1 {i=j} is equal to 1 if i = j and 0 otherwise. Substituting these and (S.6) into (S.48) yields after some simplification Equation (S.10) in Proposition 2. Now it remains to prove that n 2 ε (n) i (t) is uniformly bounded with high probability. Taylor's theorem tells us that where the functions g ij (x, y) satisfy lim x→y g ij (x, y) = 0 uniformly on compact sets. Now, recallinḡ we see that ∆Ī (n) i (s) is non-zero only at the jump-times of the Poisson processes and are always of magnitude 1 n . In particular, since f i (x) is smooth outside of a neighbourhood of 0, we can conclude that g ij is bounded above by a constant multiple of x − y ; this allows us to conclude that |g ij (Ī (n) (s),Ī (n) (s−))| ≤ C n and |ε (n) k (s)| is non-zero if some pair of processes P j,· and P k,· jump simultaneously; if j = k, the Poisson processes are independent, and probability of such an event in an interval [t, t + ∆t) is O ∆t 2 , and thus tends to 0 if as ∆t → 0 i.e., |∆Ī (n) j (s)||∆Ī (n) k (s)| = 0 if and only if j = k. Moreover, the processes P S,j , P j,− and P −e i are also independent, and thus cannot jump simultaneously, so that |∆Ī We seek an upper bound on this quantity. To that end, we begin by observing thatS (n) (t) and eachĪ (n) i (t) is bounded above byN (n) (t), and that N (n) (t) ≤N (n) (0) + 1 n P e 0 +e d+1 (nλ (n) t), and, since this Poisson process is increasing in t, we have that for t ≤ T , and, applying Chebyshev's inequality, we see that for any C > 0, as n → ∞. Thus, for any fixed T > 0,N (n) (t) is bounded above and below on [0, T ] by e.g.,N (n) (0)+ λ (n) T ± 1, with probability that approaches one as n tends to infinity.
Thus, for example, and we may proceed exactly as above to conclude that fo t ≤ T , is bounded above with probability approaching 1 as n → ∞, and similarly for the other Poisson processes, from which we conclude that there exists some constant C such that with high probability.

Proof of Proposition 5
In this section, we will make the heuristic argument of Section 5.2 rigorous using the technique of coupling (see Ball (1995) for a very good introduction): we start by constructing birth and death processes that bound I (n) 1 (t) above and below providedS (n) (t) andN (n) (t) remain within ε of the endemic equilibrium, and such that the upper and lower bounds approach one another as ε → 0. Finally, we show that the probability thatS (n) (t) andN (n) (t) depart a ε-neighbourhood of the endemic equilibrium before strain 2 has either successfully invaded or gone extinct goes to 0 as n → ∞. Since ε is arbitrary, we recover the naïve branching process result.

Macroscopic Initial Frequencies
In this section we prove the following extinction of part (i) of Proposition 5, where the population is infected with d ≥ 2 strains.
for some constant 0 < B < 1 that will be determined later. Let where we adopt the convention that τ Next, fix η > 0 sufficiently small that For our underlying probability space, we assume sequences of independent rate 1 exponential random variables E k and independent uniformly distributed random variables U k on [0, 1], for k = 1, 2, . . .. Suppose that I (n) i (t), Z + i (t) and Z − i (t) have been constructed up to τ k (trivially true for k = 0). Note that is an upper bound on the combined rate of all transitions for all three processes. Set so τ k+1 is a rate ρ k+1 exponential random variable. Next, for t < τ k+1 we set Finally, we set It is readily verified that the resulting processes have the correct jump rates.
Remark 8. Note that given ε > ε 1 > 0, choosing η > 0 as before and η 1 > 0 analogously, we can similarly construct processes Z +,1 i (t) and Z −,1 i (t) such that etc. We shall apply this with a decreasing sequence of values ε n > 0 below. Now, our choice of η ensures that Z + i (t) is subcritical for all i > 1. In particular, setting and, for any sequence t n > 1 |µ + i | ln n, for all i > 1, we have, using Markov's inequality P Z + i (t n ) ≥ 1 ≤ E Z + i (t n ) ≤ Bεne µ + i tn → 0 as n → ∞. Thus, if we show that P τ (n) ε > t n → 1 as n → ∞, we can conclude that all strains i > 1 vanish after before t n with high probability.
To this end, we start by defining τ (n) ε,i := inf t : |Ī ε,d+1 similarly, replacingĪ i byS orN respectively in the above definition (again, τ (n) ε,i = ∞ should the respective process never exceed εn). We then have We continue with a classical result for birth and death processes: for i > 1, let Then, a simple calculation shows that ε,i }.
We saw above that τ (n) 0,i and thus τ (n) i are with high probability bounded above by any sequence t n > 1 min i>1 |µ + i | ln n. Now, , which converges to 0 as n → ∞ for any B < 1. We thus have P τ (n) ε,i > t n → 1 as n → ∞.
For the remaining three values τ (n) ε,i , i = 0, 1, d + 1, we take a different approach, as the values S ,1 ,Ī ,1 1 , andN ,1 are all non-zero and a branching process approach is no longer appropriate. Instead, we recall the SDE representation of our process (Proposition 1).
then for any α > 0, any constant C, any sequence t n n, and τ (n) ε as above, we have P C n sup t≤τ (n) ε ∧tn t 0 e −α(t−s) dM (n) (s) > R → 0 as n → ∞, for any fixed R > 0.

Novel Strain in Small Number of Copies
We now consider the possibility that a new strain invades an established population. From the results of the previous section, we see that generically, the population will eventually be in an εneighbourhood of the fixed pointĒ ,1 for arbitrary ε > 0, and that I i (t) ≡ 0 for all i > 1. If the new strain has reproductive number less than R 0,1 , then the arguments of the preceding section apply directly, and we can conclude that the novel strain will very rapidly go extinct.
If, however, it has higher reproductive number, the invading strain now has a non-zero probability of establishing itself and replacing the resident strain. In this section, we adapt the techniques above to deal with this case (by the above, we may take d = 2).
We start by defining the quantities τ (n) ε,i and τ (n) ε as before, with the understanding that τ (n) ε,1 = ∞ if the new strain goes extinct before hitting εn.
Classical results for birth and death processes (e.g., Athreya and Ney (1972)) tell us that Z + (t) and Z − (t) will hit 0 in finite time with probability q Z + (0) 2 and q 2 respectively, and will grow indefinitely otherwise, and, moreover, that there exist random variables W and W taking values on [0, ∞) such that P{W = 0} = q Z + (0) 2 and P{W = 0} = q Z − (0) 2 and e −µ + 2 t Z + (t) → W and e −µ − 2 t Z − (t) → W (S.50) both almost surely and in L 2 . Now, as before, fix ln n t n n. Taking logarithms in (S.50), we see that for almost all ω ∈ {W = 0}, we have ln Z − (ω, t n ) t n → µ − 2 .
Now, if Z − (t n ) ≤ εn then the right hand side converges to 0. Moreover, if I In particular, if we can establish that with high probability τ (n) ε = τ (n) ε,2 , then this gives a lower bound on the probability that the novel strain invades, as the latter is the probability that τ (n) ε is finite. Now, ε,2 ; τ (n) ε < t n + P τ (n) ε < τ (n) ε,2 ; τ (n) ε > t n .
We have already established that the latter is bounded above by q Z − (0) 1 as n → ∞. The former follows almost exactly as the proof that P τ (n) ε < t n → 0 of the previous section;Ē ,1 is now a hyperbolic fixed point rather than a stable fixed point, and A has a positive eigenvalue, but the stable manifold ofĒ ,1 coincides with the subset of R d+2 with x 1 = 0. In particular, we may decompose where E S = {x 1 = 0} and E X are A invariant subspaces, corresponding to the sum of the generalised eigenspaces for eigenvalues with negative and positive real parts respectively. We will write P S and P X for the corresponding projections (i.e., P S has image E S and kernel E X , and oppositely for P X ), and note that both P S and P X commute with A. Proceeding exactly as previously, we define Ξ (n) (t) :=Ē (n) (t) −Ē ,1 and let Ξ (n) S (t) = P S Ξ (n) (t) denote it's projection onto the stable subspace. Then, if Applying the arguments of the previous section, using the equation for Ξ (n) S (t) obtained by letting P S act on both sides of (S.49), one obtains almost identically that as n → ∞. Now, note that for any ω ∈ {W = 0}, we must have Z + (ω, t) < εn for all t, for some sufficiently large n, so ω ∈ {τ (n) ε < τ (n) ε,2 }. Thus we have Finally, we notice that as ε → 0 (and thus η → 0 also,) both q 2 and q 2 approach Since the choice of ε was arbitrary in our definition of invasion, we conclude that the probability of successful invasion of the new strain 1 is 1 − R 0,1 R 0,2 I 2 (0) .

(S.51)
Once this has happened, as we note above, Kurtz's deterministic approximation is applicable, and with high probability, the system will approach any arbitrarily small neighbourhood of the endemic fixed point for the new strain 1, at which point, by the argument above, the former resident strain, strain 2, goes extinct with probability approaching 1 as n → ∞.
On the other hand, we can write (S.18) as where we note that the inverse exists, as Λ, which is positive, is not an eigenvalue of A. Evaluating the latter yields (S.19).