Real-time characterization of the molecular epidemiology of an influenza pandemic

Early characterization of the epidemiology and evolution of a pandemic is essential for determining the most appropriate interventions. During the 2009 H1N1 influenza A pandemic, public databases facilitated widespread sharing of genetic sequence data from the outset. We use Bayesian phylogenetics to simulate real-time estimates of the evolutionary rate, date of emergence and intrinsic growth rate (r0) of the pandemic from whole-genome sequences. We investigate the effects of temporal range of sampling and dataset size on the precision and accuracy of parameter estimation. Parameters can be accurately estimated as early as two months after the first reported case, from 100 genomes and the choice of growth model is important for accurate estimation of r0. This demonstrates the utility of simple coalescent models to rapidly inform intervention strategies during a pandemic.


Introduction
When the swine-origin influenza A virus (A(H1N1)pdm09) was detected in April 2009, rapid characterization of its transmission potential and pathogenicity was urgently required for determination of appropriate interventions [1]. Early estimates of its emergence and transmission using phylogenetic analysis of genetic sequence data were reported within just three months of detection [2,3]. Such analyses are possible owing to rapid accumulation of genetic variation within the virus population, enabling its evolution to be modelled on an epidemiological timescale [4].
Here, we determine the efficiency with which Bayesian phylogenetics based on coalescent processes can estimate the evolutionary rate, date of emergence and intrinsic growth rate, r 0 , of A(H1N1)pdm09 using whole-genomes. The evolutionary rate provides an indication of the adaptive potential of a virus in a new host population. Accurate estimation is required for inferring divergence times and population size changes. The time of the most recent common ancestor (TMRCA) of a random sample of viruses provides an upper bound to the date of emergence of an epidemic. We use simple parametric growth models to estimate r 0 as a measure of the relative ease with which A(H1N1)pdm09 spread through a host population.

Material and methods
We downloaded all available A(H1N1)pdm09 whole-genome sequences sampled April -December 2009 from the EpiFlu database hosted by the Global Initiative on Sharing All Influenza Data (GISAID; platform.gisaid.org) on 26 April 2010 (see the electronic supplementary material, table S1). We analysed whole-genomes (by concatenating segments) to maximize genetic variation in the dataset and only included North American samples to limit spatial heterogeneity in viral & 2013 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0/, which permits unrestricted use, provided the original author and source are credited. population structure. We removed all isolates sampled from a non-human host, missing an exact sampling date, or with sequencing coverage less than 80% for any genome segment. To minimize the effect of epidemiologically linked cases, which may confound assumptions of the coalescent, we subsampled one isolate/location/day [5,6], resulting in a dataset of 328 sequences. After aligning, we trimmed sequences to 13 158 bp.
We carried out Bayesian phylogenetic analysis of the entire dataset in BEAST v. 1.7.4 [7][8][9], using the GTRþG nucleotide substitution model and uncorrelated lognormal relaxed molecular clock, which had greater Bayes factor support than a strict clock (BF ¼ 3.65) [10,11]. The clock used a gamma-distributed prior on the mean evolutionary rate, with a mean of 1 substitution/site/ year (k ¼ 0.001, u ¼ 1000) and exponentially distributed prior on the standard deviation (m ¼ 0.333). To model the demographic history of the virus population, we used the non-parametric Gaussian Markov random fields Bayesian skyride model [12], which specifies the prior on the TMRCA. We performed four independent Markov chain Monte Carlo runs of 100 million steps to achieve good mixing, sampling trees every 10 000 steps and combining runs after removing 10% burn-in.
To investigate how accurately and precisely Bayesian phylogenetics can estimate the evolutionary rate, date of emergence and r 0 throughout the pandemic, we extracted nine subsets of sequences, each with an increasingly longer temporal range and size. This sampling strategy is akin to carrying out phylogenetic analyses using all genome data available at the end of each month between April and December 2009. Given the increasing capacity with which sequencing can be performed, we included all data available from GISAID on 26 April 2010 to estimate parameters from the maximum amount of sequence data that could have potentially been available if samples were sequenced immediately. We used the same evolutionary models as above but replaced the skyride model with either an exponential or logistic growth model [13]. Here, we use the TMRCA to represent the date of emergence of the virus into the larger human population, assuming a single initial case.
We quantified the relative fit of both growth models by comparing their marginal likelihoods as Bayes factors. The marginal-likelihood measures the average fit of a model to the data and we estimated this using a recently described path sampling procedure [11].

Results
The skyride plot in figure 1 shows that the reconstructed past population dynamics of A(H1N1)pdm09 closely follows the number of newly confirmed A(H1N1)pdm09 cases per week (accessed via FluNet; http://who.int/influenza/gisrs_laboratory/flunet/), used here as a measure of incidence rate. This plot captures the exponential growth phase of the first pandemic wave, the plateau in genetic diversity and the growth phase during the second pandemic wave.
The evolutionary rate and date of emergence estimated from the first 34 sampled genomes have wide 95% Bayesian credible intervals (BCI) under both growth models, representing high uncertainty associated with the small sample size (figure 2). Precision increases when the dataset size is increased threefold with the addition of sequences sampled during May, from which the date of emergence is estimated to be 2 February 2009 (95% BCI: 12 January 2009, 2 March 2009) and evolutionary rate 3.93 Â 10 23 substitutions/site/year (95% BCI: 2.99, 5.53 Â 10 23 ) with a standard deviation of 0.24 (95% BCI: 8.9 Â 10 26 , 4.5 Â 10 21 ) under the exponential growth model. The date of emergence remains roughly consistent at later time-points and any further increase in precision is limited by the lack of alternative independent loci. Conversely, the mean evolutionary rate estimates tend to decrease with the addition of data over time, suggesting that many early deleterious/neutral mutations may have later been purged from the population through purifying selection [3,5,14].  In contrast to either of the other parameters, the choice of growth model has a considerable effect on r 0 estimation (figure 2). By the end of April, both growth models fail to estimate r 0 with sufficient precision to discriminate between slow and rapid epidemic growth because of the small number of sequences sampled. However, uncertainty rapidly reduces by approximately 50% under the exponential growth model with the addition of 66 sequences in May. In comparison, precision remains low during the first three months under the logistic growth model, representing over-parametrization of the model with smaller datasets. The exponential growth model consistently estimates r 0 with greater precision than the logistic model early in the pandemic, once data sampled after May are included in the analysis. Genetic diversity plateaus around June and the exponential growth model inappropriately adjusts for this by lowering the r 0 estimate (figure 1b). As the logistic growth model accommodates for this plateau, the accuracy of r 0 estimates remains largely unaffected by the inclusion of data from the second wave ( figure 1c). A Bayes factor test favours the exponential over the logistic model until June, when support switches and increases as data sampled throughout the following months are included in the analysis (table 1).
Here, r 0 can be used to estimate the basic reproductive ratio (R 0 ), which describes the average number of secondary infections arising from a primary infection [15]. For example, assuming a gamma-distributed generation time [16,17] with m ¼ 2.6 and s ¼ 1.3 (estimates from household data in the USA [18]), and r 0 estimated from the first two months of data (100 sequences) under an exponential growth model, we estimate an R 0 of 1.12 (95% BCI: 1.07, 1.16). This supports previous estimates from phylogenetic analyses of A(H1N1)pdm09 but is towards the lower end of estimates from incidence data sampled over similar temporal and spatial scales [2,19,20].
We investigated the effect of sample size on parameter estimation by constraining each cumulative dataset to 100 randomly selected genomes (see electronic supplementary material, figure S1). Although variation exists between the means of estimates from different random subsamples at each time-point, their 95% BCIs overlap with one another and those from analysis of the complete dataset. Additionally,

Discussion
Widespread genome sequencing and rapid sharing of data during the 2009 H1N1 pandemic enabled real-time characterization of an influenza pandemic for the first time [2,3,5]. Within approximately two months of the first cases (100 genomes), estimates of evolutionary rate, date of emergence and r 0 from sequence data were in agreement with those from analyses of incidence data, where comparison is available [1,2,19,20]. Over a longer sampling period, parameter estimates from 100 genomes maintain similar accuracy and precision to estimates from more intensively sampled datasets. We discuss potential reasons for the general decrease in evolutionary rate observed over time, although the difference in evolutionary rates is not significant and the datasets are not independent so this result should be interpreted with caution. The exponential growth model accurately estimates all three parameters during the exponential growth phase, although precision was low with less than 100 sequences. Once growth begins to plateau, this model should be replaced by the logistic growth model to avoid severely underestimating r 0 . A demographic reconstruction using a non-parametric coalescent model, such as the skyline or skyride model, can be used to reveal when exponential growth ceases [12,21]. However, these models are unable to estimate the change in relative genetic diversity between the most recent coalescent event and the youngest sample. If this time is large, the demographic plot may appear to flatten prematurely [6]. The exponential growth model is unaffected by an absence of recent coalescence events, estimating r 0 from the density of early coalescent events.
Simple parametric coalescent models are powerful tools for early characterization of an epidemic, even while growth remains exponential. More complex phylogenetic models have been developed to estimate epidemiological parameters that cannot be achieved with parametric coalescent models alone [22][23][24][25]. However, the over-parametrization of early A(H1N1)pdm09 data under the logistic growth model highlights the disadvantages of using highly parametrized models during the initial stages of an epidemic. With the increasing capacity of sequencing technologies, the lag between sampling and sequencing viral genomes is expected to decrease, making earlier parameter estimation feasible in future epidemics and before alternative types of data become available. Table 1. Log-marginal likelihoods of both growth models used to analyse the nine subsets of sequences with increasing temporal ranges. The preferred model for each dataset (values in italics) was determined using a Bayes factor test, in which the exponential growth model was the null model.