Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to Ebola

As an emerging infectious disease outbreak unfolds, the public health response relies on information about key epidemiological quantities, such as transmission potential and serial interval. Increasingly, transmission models fit to incidence data are used to estimate these parameters and guide policy. Some widely used modelling practices lead to potentially large errors in parameter estimates and, consequently, errors in model-based forecasts. Even more worryingly, in such situations, confidence in parameter estimates and forecasts can itself be far overestimated, leading to the potential for large errors that mask their own presence. Fortunately, straightforward and computationally inexpensive alternatives exist that avoid these problems. Here, we first use a simulation study to demonstrate potential pitfalls of the standard practice of fitting deterministic models to cumulative incidence data. Next, we demonstrate an alternative based on stochastic models fit to raw data from an early phase of the 2014 West African Ebola virus disease outbreak. We show not only that bias is thereby reduced, but that uncertainty in estimates and forecasts is better quantified and that, critically, lack of model fit is more readily diagnosed. We conclude with a short list of principles to guide the modelling response to future infectious disease outbreaks.


Appendix A. Simulation study
To demonstrate the differences between fitting to raw incidence vs. cumulative incidence data, we performed a simulation study in which we fit the deterministic model variant to both types of data at three different levels of observation overdispersion: k ∈ {0, 0.2, 0.5}. For each overdispersion treatment, 500 simulated 39-week time series were generated from the stochastic model variant. The basic reproduction number was set to R0 = 1.4; the incubation and infectious periods were fixed as in Table B1; the assumed population size was taken to be that of the Republic of Guinea. We assumed a reporting probability of ρ = 0.2 and that, at outbreak initiation, 10 individuals were infected. This set of parameter values yields a sample mean simulation visually comparable to the WHO data from Guinea, which display initially slow growth in the number of cases followed by later acceleration.
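The observation model above can be sketched in a few lines of R. This is a minimal illustration, not the authors' code: reported cases are drawn from a negative binomial with mean ρ times the true incidence and overdispersion k, with k = 0 taken as the Poisson limit. The function name `simulate_reports` and the toy incidence series are assumptions for illustration only.

```r
## Sketch of the negative-binomial reporting model (illustrative, not
## the paper's code). With size = 1/k, the variance is mu + k*mu^2,
## so k = 0 corresponds to the Poisson limit.
simulate_reports <- function(true_inc, rho = 0.2, k = 0.2) {
  mu <- rho * true_inc                          # expected reported cases
  if (k == 0) {
    rpois(length(mu), lambda = mu)              # Poisson limit as k -> 0
  } else {
    rnbinom(length(mu), mu = mu, size = 1 / k)  # overdispersed counts
  }
}

set.seed(1)
true_inc <- round(10 * exp(0.08 * (1:39)))  # hypothetical growing incidence
reports  <- simulate_reports(true_inc, rho = 0.2, k = 0.2)
```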
For each simulated data set, we estimated the basic reproduction number, R0, the reporting probability, ρ, and the negative binomial overdispersion parameter, k. All other model parameters were fixed at their true values. Parameter estimation was accomplished using the trajectory-matching algorithm (traj.match) from the R package pomp (King et al., 2010). We constructed likelihood profiles over R0 and, from these, obtained maximum likelihood point estimates and likelihood-ratio confidence intervals. The full process of obtaining likelihood profiles on model parameters by trajectory matching took approximately 1.2 hr on a 40-cpu cluster.
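The likelihood-ratio confidence intervals can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `r0_grid` and `loglik` are hypothetical stand-ins for a profile computed by trajectory matching, and the quadratic profile below is purely illustrative. A 99% interval contains all R0 values whose profile log-likelihood lies within qchisq(0.99, 1)/2 of the maximum.

```r
## Sketch: likelihood-ratio CI from a precomputed likelihood profile.
## The cutoff follows from Wilks' theorem for one profiled parameter.
profile_ci <- function(r0_grid, loglik, level = 0.99) {
  cutoff <- max(loglik) - qchisq(level, df = 1) / 2
  range(r0_grid[loglik >= cutoff])
}

## illustrative quadratic profile peaked at the true value R0 = 1.4
r0_grid <- seq(1.0, 2.0, by = 0.01)
loglik  <- -200 - 150 * (r0_grid - 1.4)^2
ci <- profile_ci(r0_grid, loglik)   # an interval bracketing 1.4
```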
A second simulation study was performed, in which the deterministic variant of the model was fit to cumulative incidence data by ordinary least squares. This common procedure in effect assumes that measurement errors are independent and identically normally distributed. Results of this exercise are shown in Fig. A2 in a form comparable to that of Fig. 1. As in the results shown in the main text, confidence interval widths are erroneously under-estimated, with the result that achieved coverage is far smaller than its nominal value.

Figure A1: Twelve randomly selected simulated datasets from the 1500 used in the simulation study. Four simulations are shown for each of the three values of the negative binomial overdispersion parameter, k.

Figure A2: Results from the simulation study fitting the deterministic model to cumulative incidence data using the method of least squares. The model was fit to both raw (blue) and accumulated (red) simulated incidence data. The same 1500 simulated data sets of length 39 wk used in Fig. 1 were used here. (A) Estimates of R0. The true value used in generating the data is shown by the dashed line. (B) Estimates of the reporting probability, ρ. The dashed line shows the value used to generate the data. (C) Widths of nominal 99% profile-likelihood confidence intervals (CI) for R0. (D) Actual coverage of the CI, i.e., the probability that the true value of R0 lay within the CI. Ideally, actual coverage would agree with nominal coverage (99%, dashed line).

Appendix B. Model-based inference

Trajectory matching
Model parameters were initially estimated using trajectory matching. As in the simulation study, we initially fitted R0, ρ, k, and the initial conditions. However, profile likelihoods over ρ were flat, indicating a lack of identifiability in the reporting probability due to a trade-off between this parameter and the initial conditions. Accordingly, we fixed ρ = 0.2. The flatness of the likelihood profiles indicates that this assumption has no effect on the quality of fit. All other model parameters were fixed at the known values given in Table B1.
Trajectory matching was used to compute likelihood profiles over R0 and k. For each point in the profile, the other parameters and initial conditions were initialized at 40 points according to a Sobol' low-discrepancy design. In all, the trajectory-matching calculations required approximately 21 cpu hr of computation. Full details of the trajectory-matching codes are provided in the Supplementary Material.

Iterated filtering
Model parameters were estimated using the iterated filtering algorithm (IF2) (Ionides et al., 2015), implemented as mif in the R package pomp (King et al., 2010). For each country and each type of data, the parameter estimates along the trajectory-matching profiles were used to initialize the IF2 runs. From each initial point, we performed 60 IF2 iterations using 2 × 10^3 particles, hyperbolic cooling, and a random-walk standard deviation (on the log scale) of 0.02 for all parameters and 1 for the initial conditions. For the parameters estimated in each IF2 run, the log-likelihood was computed as the log of the mean of the likelihoods of 10 replicate particle filters, each with 5 × 10^3 particles. Approximate confidence intervals were then computed using the profile log-likelihood (Raue et al., 2009). All details of these computations are provided in the Supplementary Material. Computing each of the profile likelihoods in Fig. 2 using iterated filtering took approximately 34 cpu hr of computation; all profile computations were accomplished in roughly 3.6 hr on a 100-cpu cluster. Table B2 shows the maximum likelihood parameter estimates (MLE). Fig. B1 shows a comparison of various summary statistics ("probes") computed both on the data and on model simulations.
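The likelihood-averaging step above can be sketched as follows. Naively computing log(mean(exp(ll))) underflows when the replicate log-likelihoods are large negative numbers, so a numerically stable log-mean-exp is used; pomp provides this as logmeanexp, and the version below simply re-derives it for illustration. The replicate values are invented for the example.

```r
## Stable log of the mean of likelihoods, given replicate
## log-likelihoods ll (re-derivation of pomp's logmeanexp):
## subtracting the max before exponentiating avoids underflow.
log_mean_exp <- function(ll) {
  m <- max(ll)
  m + log(mean(exp(ll - m)))
}

## ten hypothetical replicate particle-filter log-likelihoods
ll  <- c(-1043.2, -1042.8, -1044.1, -1043.5, -1042.9,
         -1043.0, -1043.8, -1042.7, -1043.3, -1043.6)
est <- log_mean_exp(ll)   # lies between max(ll) - log(10) and max(ll)
```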

Results
To tease apart the consequences of failing to account for stochasticity from those of improperly fitting to cumulative data, we fit both deterministic and stochastic models to both actual incidence and accumulated incidence data. It is important to recognize that the exercise of fitting the stochastic model to accumulated data is not something one would ever actually do. Indeed, the incompatibility of the model assumptions with the data becomes evident at the outset. To see this, let H_t be the true incidence (i.e., the actual number of new infections) in reporting interval t and C_t be the number of reported cases in that interval. Because of measurement error, C_t = H_t + ε_t, where ε_t is the error. Let h_t = Σ_{s=1}^{t} H_s and c_t = Σ_{s=1}^{t} C_s be the accumulated true and reported incidence, respectively. Because c_t = Σ_{s=1}^{t} (H_s + ε_s), the errors c_t − h_t = Σ_{s=1}^{t} ε_s are not independent across t; this is the fundamental problem associated with fitting to cumulative incidence data, irrespective of whether the model for H_t is deterministic or stochastic. If one attempts to fit a stochastic model to c_t by modeling c_t = h_t + ξ_t, where the ξ_t are independent measurement errors, one is confronted with the fact that, even though the accumulated data, c_t, and simulations of h_t are guaranteed to increase with time, simulations of c_t under this model will not in general be monotonically increasing.
Nevertheless, one naturally wonders about the relative importance of the choice to use a deterministic or stochastic model vs. using raw or accumulated incidence data. Although the answer will certainly depend on both model and data, and therefore vary from situation to situation, we present the comparison in the present case to partially satisfy this natural curiosity. For the SEIR model fit to the Sierra Leone outbreak data, Fig. B2 shows likelihood profiles for the four model-data combinations and Fig. B3 shows the corresponding forecasts.

Figure B1: Additional summary statistics, or probes, computed on both stochastic model simulations and the data. In each panel, the probability density of the probes on the simulated data is shown in grey; the blue line indicates the value of the probe on the data. Probes include the autocorrelation at lag 1 (ACF), the standard deviation (SD) of the log-transformed data, the 90th percentile, the autocorrelations at lags 1, 2, and 3 wk after removal of an exponential trend (DACF), and the exponential growth rate as obtained by log-linear regression (TREND).

Figure B2: R0 likelihood profiles for the four model-data combinations for the SEIR model fit to the Sierra Leone outbreak.

Figure B3: Forecast uncertainty for the Sierra Leone EBVD outbreak as a function of the model used and the data to which the model was fit. The ribbons show the median and 95% envelope of model simulations for the various models fit to raw and cumulative incidence data from the Sierra Leone outbreak. The data used in model fitting are shown using black triangles.
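The forecast ribbons of Fig. B3 summarize an ensemble of simulations by pointwise quantiles. A minimal sketch of that summary step follows; the matrix `sims` of simulated weekly case counts is a hypothetical stand-in for simulations of the fitted model, and `forecast_envelope` is an illustrative helper, not a function from pomp.

```r
## Sketch: pointwise median and 95% envelope across an ensemble of
## simulated trajectories, as displayed in a forecast ribbon.
## sims: replicates x weeks matrix of simulated case counts.
forecast_envelope <- function(sims) {
  apply(sims, 2, quantile, probs = c(0.025, 0.5, 0.975))
}

set.seed(3)
sims <- matrix(rpois(200 * 10, lambda = 20), nrow = 200)  # toy ensemble
env  <- forecast_envelope(sims)   # 3 x 10: lower, median, upper per week
```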

Appendix C. Data and codes
This appendix describes and displays the data and codes needed to fully reproduce all the results of the paper. All referenced files are provided on datadryad.org: doi:10.5061/dryad.r5f30.

R packages
The following installs the necessary packages, if they are not already installed.
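The installation step referred to above can be sketched as follows. The helper name `install_if_missing` is an illustrative assumption; of the packages the scripts require, pomp is the only one named explicitly in the text, so the list shown should be extended to match the original scripts.

```r
## Sketch of a conditional package-installation step: install only
## those packages not already present.
install_if_missing <- function(pkgs) {
  new <- setdiff(pkgs, rownames(installed.packages()))
  if (length(new) > 0) install.packages(new)
  invisible(new)
}

## e.g., at minimum (extend as the scripts require):
## install_if_missing(c("pomp"))
```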

MPI
Most of the codes described below are designed to be run on a cluster using OpenMPI. It will be assumed that an MPI hostfile, hosts, exists, which specifies the hosts to be used. An example hosts file has contents

localhost slots=1 max-slots=1
node1 slots=0 max-slots=10
node2 slots=0 max-slots=10
node3 slots=0 max-slots=10
node4 slots=0 max-slots=10

Here, node1, node2, node3, and node4 are the compute nodes in this small cluster. Each has at least 10 cores.

Model implementation
The file ebola.R, shown below, will be sourced by each of the codes below. It defines a function ebolaModel that constructs a pomp object holding the Ebola data and the model codes.

Forecasting
The codes in the file forecasts.R perform all the forecasting computations. In a directory containing ebola.R, profiles.rds, and hosts, a command like

mpirun -hostfile hosts -np 101 Rscript --vanilla forecasts.R

will result in the execution of these computations.