A dimensionless number for understanding the evolutionary dynamics of antigenically variable RNA viruses

Antigenically variable RNA viruses are significant contributors to the burden of infectious disease worldwide. One reason for their ubiquity is their ability to escape herd immunity through rapid antigenic evolution and thereby to reinfect previously infected hosts. However, the ways in which these viruses evolve antigenically are highly diverse. Some have only limited diversity in the long-run, with every emergence of a new antigenic variant coupled with a replacement of the older variant. Other viruses rapidly accumulate antigenic diversity over time. Others still exhibit dynamics that can be considered evolutionary intermediates between these two extremes. Here, we present a theoretical framework that aims to understand these differences in evolutionary patterns by considering a virus's epidemiological dynamics in a given host population. Our framework, based on a dimensionless number, probabilistically anticipates patterns of viral antigenic diversification and thereby quantifies a virus's evolutionary potential. It is therefore similar in spirit to the basic reproduction number, the well-known dimensionless number which quantifies a pathogen's reproductive potential. We further outline how our theoretical framework can be applied to empirical viral systems, using influenza A/H3N2 as a case study. We end with predictions of our framework and work that remains to be done to further integrate viral evolutionary dynamics with disease ecology.


INTRODUCTION
Dramatic increases in the amount of viral sequence data have enabled researchers to describe the evolutionary dynamics of specific viral pathogens in ever greater detail over the last several decades. What we are still largely lacking, however, is a synthetic understanding of the diversity of viral evolutionary patterns that are empirically observed [1]. Here, we seek to address this gap by developing a theoretical framework that relates ecological factors to the dynamics of viral antigenic evolution. We specifically focus on antigenically variable RNA viruses, whose surface proteins can exhibit starkly different evolutionary dynamics, as is evidenced by the topological differences in their phylogenies (e.g. figure 1a -c). An understanding of the causes of these evolutionary differences is necessary to anticipate how these viruses, and the ones that will emerge in the future, will evolve. Moreover, this understanding could lead to novel control strategies that are not simply reactive, but instead aim to shape the evolutionary dynamics of viral pathogens to allow for more effective control.
At the heart of our theoretical framework is a dimensionless number that links herd immunity-driven epidemiological dynamics to an expected rate of antigenic diversification. We base our framework on a dimensionless number because these numbers (e.g. the Reynolds number [2] and the basic reproduction number R 0 [3]) have been shown to be useful in understanding and predicting a system's dynamics, regardless of the specific scale at which the system operates. The dimensionless number k that we propose here for viral evolutionary dynamics is similar to both of these numbers in that its value depends on the properties of two players (the fluid and the pipe in the case of the Reynolds number, and the host and the virus in the case of R 0 and k). It is further similar to R 0 in that its value depends on epidemiological parameters and, ideally, in that it could be used to inform control strategies. The critical difference between R 0 and k is that R 0 quantifies a pathogen's reproductive potential, while k quantifies a virus's antigenic diversification potential.
Our approach is novel for two reasons. First, although classifications of viral phylogenies based on ecological factors such as the degree of immune selection have previously been introduced [4], these have been qualitative in nature, with notable exceptions [5]. Second, previous analyses considering either within-host [6] or populationlevel [4,5] evolutionary dynamics have classified viral phylogenies into one of the two general types: 'cactus-like' (or 'ladder-like') phylogenies, with low levels of genetic diversity circulating at any point in time and rapid lineage turnover, and 'acacia-like' phylogenies, with growth in genetic diversity over time. Examples of the former include influenza A/H3N2's haemagglutinin (HA) (figure 1a) [7,8] and norovirus's capsid protein [9,10], while examples of the latter include HIV's gp120 protein (figure 1b) and the HA of influenza A/H5N1 circulating among avian hosts. In contrast to these previous analyses, our k framework considers a spectrum between these diversification patterns. This allows us to quantitatively consider phylogenetic patterns that do not readily fall into one of these two extremes, such as the evolutionary dynamics of influenza B's HA (figure 1c).
We organize this paper by first developing our theoretical framework in §2 and then by illustrating the ways in which this framework can be applied to empirical hostvirus systems in §3. In §4, we conclude with a discussion of the limitations of our current framework and ways forward towards a better understanding of viral evolutionary dynamics.

THE k FRAMEWORK
In §2a, we first describe the structure of the epidemiological model that underlies our theoretical framework and develop our dimensionless number k from this model. In §2b, we then show how k can be used to probabilistically describe long-term patterns of antigenic evolution.
(a) The epidemiological model We assume that the dynamics of antigenically variable RNA viruses are governed by a status-based multi-strain model [11] of the form: where subscript i denotes a single antigenic variant, m is the birth rate and death rate, b is the transmission rate, n is the recovery rate and s ij is the degree of cross-immunity between antigenic variants i and j. The term h W (t 2 t i ) represents a per capita antigenic emergence rate: the rate at which an individual infected with antigenic variant i changes into being infected with a new and yet-unseen antigenic variant [12]. While most evolutionary multistrain models assume that this rate is constant and equivalent to the mutation rate (or a factor thereof ), we here more generally allow h W to be a function of an antigenic variant's age a, given by the difference between the current time t and the time of the variant's emergence t i . Specifically, we parametrize h W using a Weibull hazard rate with shape parameter k W and scale parameter l W : ð2:3Þ With k W ¼ 1, h W is a constant. With k W . 1, h W increases with a variant's age. This can happen if phenotypically neutral mutations accruing over time increase the probability that additional mutations result in significant antigenic change (e.g. [13]) or if a certain number of 'permissive' mutations are necessary for phenotypic change to occur (e.g. [14]). The scale parameter l W can be thought of as quantifying the extent of phenotypic robustness to genetic change, with higher values of l W being indicative of higher robustness. As written, equations (2.1)-(2.3) explicitly consider immune selection on viral antigenic phenotypes, rather than on individual genotypes. We use this approach because it yields a computationally simpler model, while remaining consistent with a range of previously published multi-strain model formulations and a range of empirical observations (e.g. [15]). Along with equations (2.1)-(2.3), we assume that a single antigenic variant comprises genotypes that evolve neutrally.
To derive our dimensionless number k, we consider a simplified, stochastic version of this epidemiological model. We start with a single endemic antigenic variant X circulating in a host population, giving rise to new variants ('offspring variants') over time. We assume that the degree of cross-immunity between variant X and an offspring variant is given by s, that there is no crossimmunity between offspring variants, and that offspring variants do not give rise to new antigenic variants. The epidemiological dynamics of variant X are governed by classic single-strain susceptible-infected-recovered (SIR) dynamics until an offspring variant emerges that is not stochastically lost from the population. We call this first successful variant Y. As a result of variant Y's frequencydependent selective advantage, and the competition between variants X and Y for susceptible hosts, the invasion of variant Y results in a decrease in the number of individuals infected with variant X, until, under a wide range of model parameter values, variant X is lost from the host population. Once this happens, the only variants remaining in the population are variant Y and any secondary successful offspring variants that may have been generated by variant X prior to its exclusion. The dimensionless number k we propose is given by the expected number of these secondary variants ('excess variants'). Setting the time of emergence of variant X to t X ¼ 0, k is given by: where I X (t) is the number of individuals infected with variant X at time t, u(t) is the probability that a variant given rise to at time t is not stochastically lost from the population and g Y (t Y ) is the probability density function for the time to variant Y's emergence. A conceptually similar quantity was outlined by Sasaki & Haraguchi [6] when considering viral dynamics within chronically infected hosts. Two critical differences, however, exist between the formulation of their dimensionless number and the one we propose here. First, their within-host model assumes that a variant will be driven to increasingly low numbers and ultimately to extinction from a persistent, variant-specific B-cell response. In contrast, an antigenic variant in our population-level model reaches a quasi-stationary endemic steady state when alone in the host population, as long as R 0 . 1. Variant X will therefore always generate at least one new successful antigenic variant, variant Y, and k quantifies the expected number of excess variants that variant X generates over its lifetime. Second, Sasaki & Haraguchi's within-host model defines a variant genetically, such that the production of new variants within a host occurs at a constant mutation rate. As described above, our model defines a variant antigenically.
To evaluate k for a given set of epidemiological parameters, we can derive analytical expressions to approximate I X (t), u(t) and g Y (t Y ). To derive I X (t), we assume that the selective advantage s of variant Y is constant and given by its initial selective advantage at the time of its emergence t Y . This selective advantage is appropriately defined using variants Y's and X's net reproductive rates. These can be evaluated at time t Y by analytically solving equations (2.1) -(2.2) for equilibrium when only one variant (variant X ) is present in the population: where S Y and S X are the number of hosts susceptible to variants Y and X, respectively, and R 0 ¼ b/(m þ n). The relative reproductive rate of variant Y in units of calendar time is therefore given by s(m þ n), where 1/(m þ n) is an individual's expected duration of infection. Further assuming that the total number of infected individuals (I X þ I Y ) stays constant at the equilibrium number of infected individuals for a single-strain SIR system (I ¼ mN(R 0 -1)/b), variant Y's population dynamics following its emergence at time t Y can be written deterministically according to the logistic equation: The analytical solution to this equation is: ). Our final analytical approximation to I X (t) is therefore given by the difference between I and I Y (t): ð2:8Þ In figure 2a, the dynamics of variants X and Y from a representative stochastic simulation are shown alongside the analytical expression for I X (t) given by equation (2.8).
To derive u(t), we rely on the well-known result that the probability of stochastic loss of an invading virus is given by the inverse of its net reproductive rate [16]. Assuming that the number of individuals susceptible to infection at time t by an offspring variant that has not yet emerged can be approximated by the number of individuals susceptible to this variant prior to variant Y's emergence; u becomes where we have dropped the notation of time dependence. In figure 2b, our constant approximation of u is shown alongside the dynamics of u(t) computed from the epidemiological simulation shown in figure 2a.
Finally, we derive an analytical approximation to g Y (t). This density function can be written as: ð2:10Þ where f(t) is the probability density function for the time it takes a single individual infected with variant X to generate an antigenic variant that is not stochastically lost, and F(t) is its cumulative distribution function. f(t) is given by: where u is the probability given by equation (2.9). Substituting equation (2.3) into equation (2.11) and writing F(t) in terms of f(t), equation (2.10) simplifies to: Taking expectations, we have the mean time to the generation of variant Y: ð2:13Þ In figure 2c,d, we compare our approximation for g Y (t) given by equation (2.12)  have been specified. As long as I and the product uI are large, an approximation to this double-integral expression yields: The derivation of this approximation is detailed in the electronic supplemental material. It is clear from equation (2.14) that the epidemiological model parameters affect the final value of k directly and/or indirectly through I, s and u. The sensitivity of k to each of these parameters is illustrated in the electronic supplementary material, figure S1. k increases with population size N, with the duration of infection (1/n), with the degree of cross-immunity s and decreases with host lifespan (1/m). Interestingly, the effect of the basic reproduction number R 0 on k depends on the values of the other parameters. Finally, k increases with a decrease in the antigenic emergence rate's scale parameter l W and with a decrease in its shape parameter k W , most dramatically as k W approaches one.
To determine whether k is a good approximation to the mean number of excess variants generated by variant X prior to extinction, we simulated the simplified 15 000 42. The number of excess variants for a given simulation was calculated by subtracting one from the number of variants cocirculating in the population 5 years after the extinction of variant X. We used a likelihood-ratio test to determine whether the null hypothesis (that the number of excess variants comes from a Poisson distribution with mean k) could be rejected at the 95% confidence level in any of the three cases (e-g) above. In none of the three cases could this null hypothesis be rejected. epidemiological model under three different k parametrizations (figure 2e-g). These simulations show that k, despite the numerous approximations made in its derivation, closely matches the mean number of excess variants computed from simulations. Furthermore, these simulations indicate that the number of excess variants generated by a single endemic variant can be considered a Poisson random variable with mean k.
(b) The link between k and long-term patterns of viral antigenic evolution To relate patterns of viral antigenic evolution to k, we first define some useful terminology. Let a generation be defined as the mean time between an endemic variant's emergence time and the emergence time of this variant's first successful antigenic variant offspring. In terms of the simplified model, a generation is given by E(t Y ). Furthermore, let n k be defined as the number of antigenic variants co-circulating in a host population in generation k. k can be used to project the distribution of n k over successive generations and thereby to describe long-term patterns of viral antigenic evolution under three assumptions: (i) that the time between the emergence of the offspring variant(s) and the extinction of the variant that gave rise to it/them is negligible when compared with the generation time, (ii) that successful antigenic variants immediately equilibrate to their endemic levels, given by I, and (iii) that the number of excess variants generated by a single endemic variant is a Poisson random variable with mean k, which we have already seen to be a reasonable assumption (figure 2e-g).
Given these assumptions, we have a forward model for simulating antigenic evolution. Starting with only a single endemic antigenic variant (n 0 ¼ 1), n 1 ¼ 1 þ i 0 , where i 0 is a random variable drawn from a Poisson distribution with mean k. Continuing, the next generation then has n 2 ¼ n 1 þ i 1 , where i 1 is the sum of a collection of n 1 independent random variables all drawn from a Poisson distribution with mean k. Continuing further, this process results in a synchronized accumulation of antigenic diversity over viral generations. Figure 3a illustrates a stochastic realization of these viral diversification dynamics for k ¼ 0.2. Given our model's specification that viral genotypes belonging to a single antigenic phenotype evolve neutrally, we can sample a representative phylogeny arising from these diversification dynamics using coalescent theory (figure 3b).
The benefit of k being dimensionless now becomes evident. Although the epidemiological parameters and generation times of two host-virus systems may differ from one another, these systems will be governed by the same diversification rates if they have the same k value. Epidemiologically distinct host-virus systems may therefore exhibit what is known as 'dynamic similarity', with similar patterns of antigenic evolution arising from a quantitative agreement of their dimensionless numbers.
From the structure of our forward model, we can also compute the probability mass function for the number of co-circulating variants in a given generation using only k: g k ðn k ; kÞ ¼ X n k m¼1 g kÀ1 ðm; kÞlðn k À m; kmÞ ½ ; ð2:15Þ where l(i;x) is the probability that there are exactly i variants from a Poisson distribution with mean x, g 0 (n 0 ¼ 1; k) ¼ 1 and g 0 (n 0 ; k) ¼ 0 for all n 0 = 1.
It is clear from this expression and from the structure of the forward model that a large number of generations may pass in systems with small k before a substantial amount of viral diversity accrues. However, the mean of n k is given by (1 þ k) k , and its standard deviation grows similarly with the number of generations k. The number of co-circulating variants will therefore always end up growing exponentially. According to this model, all viral populations will therefore eventually have many co-circulating variants, and the level of antigenic diversity will not saturate.
The relationship between k and the number of cocirculating antigenic variants in a given generation rests on the three assumptions listed above. To determine our framework's ability to anticipate patterns of antigenic  figure S2). These simulations also show that a range of viral diversification patterns (cactus-like, acacia-like and intermediate phylogenetic tree shapes) can be reproduced under this framework. In the electronic supplementary material, we also further detail how the k framework can be modified for a broader set of strain interactions.

APPLYING THE k FRAMEWORK TO EMPIRICAL HOST -VIRUS SYSTEMS
Here, we detail two ways in which the k framework can be used to understand and predict the evolutionary dynamics of viruses in empirical systems, using influenza A/H3N2 circulating in humans as a case study.
(a) Estimating k from a phylogeny with antigenic classification of viral isolates The forward model of antigenic evolution assumes that the number of excess variants generated by an endemic viral variant prior to its extinction is Poisson-distributed with mean k. We can use this assumption to compute the probability of observing a given number of excess variants arising from a single circulating antigenic variant, and multiply these probabilities to compute the overall likelihood of the dimensionless number k from the distribution of observed excess variants. Figure 4a,b provides an example, with a phylogeny simulated as in figure 3.
The antigenically typed phylogeny shows that 60 out of the 72 endemic variants yielded zero excess variants, while the remaining 12 yielded a single excess variant. The likelihood function is therefore given by:  The mean cluster persistence time is 3.51 years. Black curve shows the g Y (t) for the best estimate of k W and l W , calculated as follows: I was first computed from estimates of the annual attack rate, N and n, u was then computed using equation (2.9), given estimates of R 0 and s. Using these values of I and u, and for given values of k W and l W , the probability of observing each of the estimated cluster persistence times can be calculated using equation (2.12). The maximum-likelihood estimates of k W and l W were found by searching (k W and l W ) parameter space for the maximum product of these probabilities, yielding k W ¼ 2.26 and l W ¼ 1162. ( f ) The probability of observing a given number of co-circulating antigenic variants in 2011. These probabilities were computed using equation (2.15), knowing that there was only one circulating variant in 2003 (FU02) and that a generation lasts approximately 3.5 years, and assuming that k ¼ 0.11. Evaluating this function for k values between 0 and 0.5 clearly shows that our maximum-likelihood estimate of k ¼ 0.167 falls close to the true value of k ¼ 0.18 that was used in the simulation of the phylogeny.
Applying this same approach to influenza A/H3N2 in humans, we see that eight antigenic variants between 1968 and 2003 yielded zero excess variants, and only a single variant, SI87, yielded one excess variant, BE92 (figure 4c). (SI87's first successful offspring variant BE89 was lost from the population within a few years.) Our maximum-likelihood estimate of k from this distribution of excess variants is 0.11 (figure 4d ). Given a mean generation time of approximately 3.5 years (figure 4e), we can now use this estimate to probabilistically predict the number of co-circulating antigenic variants in three viral generations, bringing us roughly to today (figure 4f ). Our framework predicts with a probability of 64 per cent that only a single antigenic variant will be circulating today. This is indeed the case, with antigenic variant Perth/09 currently circulating, having evolved from variant Brisbane/07 [17].
(b) Estimating k directly from epidemiological parameters To review, the seven fundamental parameters that are needed to estimate k are N, n, m, R 0 (or b), s and the antigenic emergence rate's parameters k W and l W . Host-population size N, however, only plays a role in determining the equilibrium number of infected individuals I, and host lifespan similarly only plays this role if (m þ n) % n. An estimate of I, along with estimates of n, R 0 , s, k W and l W , are therefore only required to compute k. Some of these parameters are well-known or routinely estimated. For example, for influenza A/H3N2, 1/n is roughly 3 days, R 0 is roughly 2 [18], s is roughly 0.75 [19,20] and I can be estimated by assuming an annual attack rate of 6 per cent and a population size of six billion. This leaves k W and l W , which can be estimated together from the distribution of cluster persistence times and estimates of I, R 0 and s (figure 4e). Given these parameter estimates, we compute k ¼ 0.50, at the very top end of the 95% confidence interval (CI) using the method discussed above (figure 4d ). Clearly, this value of k is an overestimate, and we return to this disparity below.

DISCUSSION
Here, we proposed a dimensionless number k for understanding and predicting the evolutionary dynamics of antigenically variable viruses. We derived this number from our knowledge of multi-strain epidemiological dynamics and the frequency-dependent selection pressures that arise from these dynamics. The dynamics depend only on a handful of parameters, many of which are known or can be estimated in a straightforward way from empirical data.
The k framework presented here makes several predictions. One set of predictions is provided by the expression of k, given by equation (2.4), approximated by equation (2.14), and depicted graphically in electronic supplementary material, figure S1: with all else equal, we expect a virus to exhibit more rapid antigenic diversification when circulating in a larger host population, when circulating in a shorter-lived host population and when resulting in longer durations of infection. We also expect viruses with lower cross-immunity between antigenic variants to have lower rates of diversification. Finally, we expect the rate of antigenic diversification to strongly depend on the antigenic emergence rate's k W parameter, controlling how the rate of antigenic emergence depends on a variant's age. As the effect of 'ageing' diminishes, k W approaches one, yielding higher k values. This relationship can therefore explain why epidemiological models that do not include this effect (such that k W ¼ 1) result in explosive genetic and antigenic diversity in the absence of other factors such as generalized immunity [21], whereas models that include this ageing effect can reproduce cactus-like evolutionary dynamics without invoking other factors [12,13,22].
In our application to influenza, we outlined two approaches that can be used to estimate k. The first was based on an antigenically typed phylogeny and yielded reasonable k estimates. The second was based on independent parameter estimates; in this case our calculation of k was a considerable overestimate. This could have been owing to several reasons. First, our framework assumed a homogeneous, well-mixed host population and thereby ignored geographic subdivision and networks of interaction, which have clearly been shown to play a role in shaping viral phylogenies [23 -25] and impacting antigenic diversity [26]. Second, we assumed no seasonality or other factors that could lead to significant variation in the number of infected individuals over time. These factors are likely to reduce the degree of antigenic diversity by increasing the probability of stochastic loss. Finally, viral strains may differ phenotypically in ways unrelated to their antigenicity, and these differences may also have impacted influenza A/H3N2's HA phylogeny. Future work should therefore focus on extending this minimal k framework to improve our ability to predict patterns of antigenic diversification from independent parameter estimates.
Despite these limitations, the k framework presented here was shown to reasonably anticipate the viral evolutionary dynamics of influenza A/H3N2's HA when k was estimated from a phylogeny. Our estimate of k, however, assumed a status-based multi-strain model, which has recently been criticized for its biological implausibility. Although this may seem like a major limitation, alternative multi-strain models (e.g. [27]) can easily be accommodated by modifying subcomponents of k (electronic supplementary material).
From the framework we presented here, we can start to understand why viral phylogenies from certain host-virus systems differ topologically from those in other systems, or why they may have remarkable evolutionary similarities. Through this improvement in our understanding, we hope that k may result in more effective disease prediction and control. For endemic viruses, the dependence of k on modifiable epidemiological parameters already points us towards control strategies that may drive a given virus into an evolutionary regime that can be more effectively controlled. For example, the lengthening of livestock lifespans is expected to significantly lower the rates of antigenic diversification. In conjunction with vaccines that would be more effective and long-lasting, this could therefore lead to dramatic reductions in infection levels. For emerging viral zoonoses, k could be estimated by combining what is Viral dimensionless number K. Koelle et al. 3729 known about a naive host population (e.g. its population size) with what is known about the virus from its circulation in a reservoir population (e.g. its duration of infection). This could lead to predictions of how a virus will evolve in a host population prior to its establishment and help us focus our efforts on the viruses that would be difficult to control antigenically if establishment occurred. In sum, the minimal k framework presented here, open to many extensions and modifications, could prove increasingly useful for understanding and controlling the evolutionary dynamics of antigenically variable RNA viruses.