Estimation in emerging epidemics: biases and remedies

When analysing new emerging infectious disease outbreaks one typically has observational data over a limited period of time and several parameters to estimate, such as growth rate, R0, serial or generation interval distribution, latent and incubation times or case fatality rates. Also parameters describing the temporal relations between appearance of symptoms, notification, death and recovery/discharge will be of interest. These parameters form the basis for predicting the future course of the outbreak, planning preventive measures and monitoring the progress of the disease. We study the problem of making inference during the emerging phase of an outbreak and point out potential sources of bias related to contact tracing, replacing generation times by serial intervals, multiple potential infectors or truncation effects amplified by exponential growth. These biases directly affect the estimation of e.g. the generation time distribution and the case fatality rate, but can then propagate to other estimates, e.g. of R0 and growth rate. Many of the traditionally used estimation methods in disease epidemiology may suffer from these biases when applied to the emerging disease outbreak situation. We show how to avoid these biases through proper statistical modelling. We illustrate the theory by numerical examples and simulations based on the recent 2014-15 Ebola outbreak to quantify possible estimation biases, which may amount to up to 20% underestimation of R0 if the epidemic growth rate is fitted to observed data or, conversely, up to 62% overestimation of the growth rate if the correct R0 is used in conjunction with the Euler-Lotka equation.


The simulation model and some interesting results
Model structure The simulated epidemic has been constructed to be close to the parameters reported for the recent Ebola epidemic by the WHO Ebola Response Team [1], with the exception of the incubation period, which has been allowed to differ slightly from the latent period. Each infected individual follows a stochastic SEIR model with all time periods following Gamma distributions, the time unit being 1 day. The latent period E is assumed to be Γ(2, 1/5) (mean = 10, sd = 7.1) and the infectious period Γ(1, 1/5) (mean = 5, sd = 5). After the infectious period, the individual may either recover, with probability 30%, or die, with probability 70%. The time until recovery is assumed to be Γ(4, 1/3) (mean = 12, sd = 6) and the time to death Γ(4/9, 1/9) (mean = 4, sd = 6). Furthermore, each individual has an incubation period (time until symptoms) which is assumed to be similar to the latent period, but with some variation: the incubation period is given by E times a uniformly distributed variable on the interval [0.8, 1.2]. It is assumed that a case is reported when symptoms arise. Finally, during the infectious period, new cases are produced at rate 0.34/day, resulting in R0 = 0.34 · 5 = 1.7.
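The per-individual model above can be sketched in a few lines. The following Python fragment is an illustrative stand-in (the paper's actual simulation was written in C); note that Python's random.gammavariate takes shape and scale, so the rates above are inverted:

```python
import random

random.seed(1)

def simulate_individual():
    """One individual's course under the model. Gamma(shape, rate) is
    parameterised as in the text; random.gammavariate takes (shape, SCALE),
    with scale = 1/rate."""
    latent = random.gammavariate(2, 5)            # Gamma(2, 1/5): mean 10
    infectious = random.gammavariate(1, 5)        # Gamma(1, 1/5): mean 5
    incubation = latent * random.uniform(0.8, 1.2)
    dies = random.random() < 0.7                  # death with probability 70%
    delay = random.gammavariate(4 / 9, 9) if dies else random.gammavariate(4, 3)
    offspring = 0                                 # new cases at rate 0.34/day
    t = random.expovariate(0.34)
    while t < infectious:
        offspring += 1
        t += random.expovariate(0.34)
    return incubation, dies, delay, offspring

samples = [simulate_individual() for _ in range(100000)]
mean_offspring = sum(s[3] for s in samples) / len(samples)
print(round(mean_offspring, 2))   # close to R0 = 0.34 * 5 = 1.7
```

Averaging the offspring count over many draws recovers R0 = 0.34 · 5 = 1.7, since the mean infectious period is 5 days.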
The stochastic process as such is a Crump-Mode-Jagers branching process, in which the expected incidence of infections b(t) satisfies (see [2]) the renewal equation

b(t) = R0 ∫_0^t b(t − u) f_G(u) du,

where f_G denotes the generation time density (an additional term accounts for the initial cases). The solution to this equation quickly approaches exponential growth ∼ Ce^{rt}, where r is the so-called Malthusian parameter. The same exponential growth, although with different constants, will also apply to the total number of infected, reported, recovered, dead, etc., individuals.
Duration of simulations Only "exploding" trajectories, corresponding to "big" outbreaks, are kept. Outbreaks start with 1 infected individual and are rejected if they do not reach 4500 reported cases. At the time of reaching this level, some statistics are collected and then the simulation is continued for 6 weeks further. The purpose of this continuation is that a prediction of the final level 6 weeks later is attempted, based on the first 4500 cases. Statistics are based on 1000 accepted trajectories.
Programming details The simulation program was written in standard C, because of the need to keep links (pointers) between infectors and infectees in order to simulate contact tracing, and was executed on a desktop computer. Results from the simulations were then processed using the software R to produce the R0 and r estimates, the 6-week projections and statistical summaries. Random number generation for Gamma distributions with non-integer shape parameter used the algorithm of Phillips & Beightler [4].
Time from introduction of first case to 4500 notified The average and median are both approximately 200 days (sd = 33), but with a range of [123, 358]. A simple deterministic estimate would be ln(4500)/r ≈ 217 days, where r is the exponential growth rate. The slight difference might be due to the conditioning on non-extinct trajectories. Furthermore, the largest part of the variability lies in the first part of the epidemic. If one divides the time period into a first part, until the first 100 (cumulative) cases are reached, and a second part, until level 4500 is reached, one finds that the first part has average length 102 days, with sd = 28 and 95% central range [63, 174], and that the second part has mean 98 days, sd = 10 and 95% range [83, 119].
Stability of various ratios when 4500 cases have been notified At the time when 4500 cases have been notified (in total), the numbers of individuals infected, dead or recovered were recorded. It should be noted that the total number of infected is not realistically observable, but is of interest to estimate. It should also be noted that all the numbers above are examples of delayed observations, from infection to notification and from notification to final outcome. One should therefore expect the ratios to stabilize around values given by formula (8) in Section 6 of the main paper. The ratio notified/infected was, on average, 0.70, with 95% of values between 0.68 and 0.72, essentially identical to the theoretical prediction based on "knowing" the incubation time distribution, which was assumed to also be the distribution of the time from infection to notification. The narrow range of the observed ratios indicates that this is a viable method for predicting the actual size of the outbreak from the number of notified cases, given that good estimates of the distribution of the time to notification and of the exponential growth rate are available. An analogous result holds for the ratios of infected to dead or recovered individuals. With the specific parameters of the simulations, at 4500 notifications, on average 3350 of these had either recovered or died, and the remaining 1150 were between notification and final outcome. At the time of 4500 notifications, an average of 1900 additional individuals had been infected but not yet notified.
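The theoretical ratio can be checked directly: formula (8) amounts to E(e^{−rD}), where D is the delay from infection to notification, here the incubation time. A small Monte Carlo sketch (Python, for illustration) with r = 0.0387 reproduces the 0.70:

```python
import math, random

random.seed(2)
r = 0.0387                         # growth rate of the simulated epidemic

# pi = E(exp(-r * D)), where D = incubation time, i.e. a Gamma(2, rate 1/5)
# latent period times an independent Uniform(0.8, 1.2) factor
n = 200000
pi = sum(math.exp(-r * random.gammavariate(2, 5) * random.uniform(0.8, 1.2))
         for _ in range(n)) / n
print(round(pi, 2))                # close to the observed ratio 0.70
```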
Observed generation times and serial intervals In each simulation run, 500 generation times and 500 serial intervals were sampled from the first 4500 notified individuals by systematically taking every ninth individual until the sample was complete. The times and intervals were ascertained "backwards", i.e. the infector of the chosen individual was identified and the time between the respective infection or symptom times was recorded. The distributions of sample means are reasonably concentrated around the respective central values, 12.5 for serial intervals and 12.6 for generation times. It should be remembered that theory predicts that both should have the same expected value, which should be less than the true expected generation time of 15. Theory further predicts that the backward generation time should have mean 12.57, which is close to what is observed. The variance of the true generation time is 75 (sd = 8.7) and both variance estimates from the simulation samples tend to be much smaller, somewhat above 50 (sd = 7.1). This also leads to the useful conclusion that serial intervals are affected by the same "contraction" as generation times when ascertained "backwards", at least in the chosen parameter setting.
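The quoted values follow from the closed form for the backward distribution (see the E-L sections below): if G ∼ Γ(α, λ), the backward version is Γ(α, λ + r). A two-line check (Python, illustrative):

```python
# Backward generation-time distribution: if G ~ Gamma(alpha, rate lam),
# the backward version is Gamma(alpha, rate lam + r).
alpha, lam, r = 3, 0.2, 0.0387
mean_backward = alpha / (lam + r)            # about 12.57, as quoted
var_backward = alpha / (lam + r) ** 2        # about 52.7, the contracted variance
print(round(mean_backward, 2), round(var_backward, 1))
```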
Predicting the size of the outbreak at a later time Finally, we study the performance of the renewal equation (Section 2, Eq. (5), main paper) approach proposed in [1]. This approach is intended to allow estimation of R t (in our simulation, R t is kept constant = R 0 all the time and the method is adapted accordingly) and to further allow prediction of cases 6 weeks (42 days) after the last observed datum. The results, using probabilities derived from observing backward serial intervals, thus biased with respect to the true generation time distribution, indicate that this method seriously underestimates R 0 but has good predictive power anyway. The 95% range of the ratio predicted/true values at 6 weeks after the level 4500 notified has been reached is [−10%, +12%], which is slightly better than just using a growth rate based estimate (i.e. just multiplying with exp(42r), using some good estimate of the growth rate r) derived from the first 4500 cases. This is perhaps natural, since the method amounts to an adaptive regression method which has fitted all the observed data up to level 4500 as well as possible, and then extrapolated this fit. As predicted by theory, the estimate of R 0 that results from this method underestimates the true value, in fact the true value 1.7 is never reached in 1000 simulations. The average estimate of R 0 is 1.57, with 95% range [1.51, 1.63], compared to the true value 1.7, thus having an average bias of −8%.

Delayed observations
Suppose that events occur with expected intensity λ(s) (cumulative intensity Λ(s)) on the time interval [0, T]. Assume also that each event is observed after a delay which is distributed according to the density h(s) with cdf H(s). Then the expected number of observed events on [0, T] is

∫_0^T λ(s)H(T − s) ds,

i.e. a fraction

π(T) = ∫_0^T λ(s)H(T − s) ds / Λ(T)

of the expected number of events on the interval. If the intensity λ(s) is constant or even polynomially growing, it can be shown that π(T) → 1 as T grows large, while, if the intensity grows exponentially, i.e. λ(s) ∼ e^{rs}, then this fraction quickly approaches (after some simple integration steps)

π(∞) = ∫_0^∞ e^{−rs} h(s) ds

as T grows large, which is equivalent to calculating the expected value E(e^{−rD}), with D having a probability distribution with density h. If D has distribution Γ(α, λ), then

π(∞) = (λ/(λ + r))^α.

It is interesting to note that this is a decreasing function of α, for fixed r and E(D). Thus the case α = 1, i.e. the Exponential distribution, yields the largest possible value among Gamma distributions with given mean.
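The monotonicity claim can be verified numerically by holding E(D) = α/λ fixed while increasing α (Python, illustrative):

```python
# pi(infinity) = E(exp(-r D)) for D ~ Gamma(alpha, rate lam) equals
# (lam / (lam + r))**alpha.  Holding E(D) = alpha/lam fixed, this is
# decreasing in alpha, so alpha = 1 (Exponential) maximises it.
def pi_inf(alpha, mean_d, r):
    lam = alpha / mean_d                  # rate giving the requested mean
    return (lam / (lam + r)) ** alpha

r, mean_d = 0.0387, 10.0
vals = [pi_inf(a, mean_d, r) for a in (1, 2, 4, 8)]
print([round(v, 4) for v in vals])
assert all(vals[i] > vals[i + 1] for i in range(len(vals) - 1))
```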

"Backward" observation of generation times
Observing generation times, i.e. the time between the infection of one individual and that of another individual infected by the first one, has been discussed by several authors, e.g. [3,5,6]. In the exponentially increasing phase of a homogeneously mixing model, the distribution of times observed "backwards", starting from a randomly chosen newly infected individual, will have density

f_B(t) = R0 e^{−r_G t} f_G(t),

where f_G denotes the generation time distribution, R0 the basic reproduction number of the disease and r_G the related Malthusian parameter (this result is approximate in the sense that the truncation effect of the time origin is disregarded). The parameter r_G satisfies the Euler-Lotka (E-L) equation

R0 ∫_0^∞ e^{−r_G t} f_G(t) dt = 1.

If one solves the E-L equation for the Malthusian parameter r_B using the density f_B(t) instead of f_G(t), but with the true R0, one obtains

R0 ∫_0^∞ e^{−r_B t} f_B(t) dt = R0² ∫_0^∞ e^{−(r_B + r_G)t} f_G(t) dt = 1.

This equation can be rewritten as

E(e^{−r_B T} e^{−r_G T}) = 1/R0²,

where T has the generation time distribution f_G. Because both functions of T in the expectation are monotone decreasing, their covariance is positive and thus

E(e^{−r_B T}) E(e^{−r_G T}) ≤ E(e^{−r_B T} e^{−r_G T}) = 1/R0².

Since E(e^{−rT}) is a decreasing function of r and E(e^{−r_G T}) = 1/R0, and since the above inequality translates to E(e^{−r_B T}) ≤ 1/R0, we have that r_B ≥ r_G.
Conversely, one may think of solving for the basic reproduction number R_B using the backward generation time density f_B but the correct exponential increase rate r_G in the E-L equation:

1 = R_B ∫_0^∞ e^{−r_G t} f_B(t) dt = R_B R_G E(e^{−2r_G T}) ≥ R_B R_G (E(e^{−r_G T}))² = R_B / R_G,

thus again using the covariance argument (here in the form E(X²) ≥ (E(X))² with X = e^{−r_G T}). Thus R_B ≤ R_G. By the same kind of argument, denoting a time with distribution f_B by T_B and one with distribution f_G by T_G, one can show that, in general, T_B is stochastically smaller than T_G, in the sense that

P(T_B ≤ t) ≥ P(T_G ≤ t) for all t ≥ 0.

This is verified by rewriting the wanted conclusion as

R_G E(e^{−r_G T} 1{T ≤ t}) ≥ E(1{T ≤ t}),

which can be restated as (expectations according to f_G)

E(e^{−r_G T} 1{T ≤ t}) ≥ E(e^{−r_G T}) E(1{T ≤ t}),

which is true, again because the two functions of T on the LHS are both non-increasing in T. This ordering, for instance, implies that E(T_B) ≤ E(T_G).

The E-L equation and the Gamma distribution
If the probability density f in the E-L equation, written in the form

R ∫_0^∞ e^{−rt} f(t) dt = 1,

is a Γ(α, λ) distribution, the relations in the preceding section can be made explicit. The equation becomes

(λ/(λ + r))^α = 1/R

and thus r = λ(R^{1/α} − 1) or R = (1 + r/λ)^α.
Thus, if the parameters of the Gamma distribution are known (alternatively, its mean and variance), the exponential growth rate can be calculated if R0 is known or, vice versa, R0 calculated if the exponential growth rate r is known.
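With the simulation model's values, G ∼ Γ(3, 0.2) and R0 = 1.7, these closed forms reproduce the growth rate used throughout this supplement (Python, illustrative):

```python
# Euler-Lotka for a Gamma(alpha, rate lam) generation time:
#   (lam / (lam + r))**alpha = 1/R   =>   r = lam * (R**(1/alpha) - 1)
alpha, lam, R0 = 3, 0.2, 1.7
r = lam * (R0 ** (1 / alpha) - 1)
print(round(r, 5))              # 0.0387, the model's Malthusian parameter
R_back = (1 + r / lam) ** alpha
print(round(R_back, 2))         # inverting the formula recovers R0 = 1.7
```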
If the generation time distribution is Γ(α, λ) and, denoting the corresponding R0- and r-values by R_G and r_G, the "backwards" generation time distribution will be Γ(α, λ + r_G) (see [3]). If this density is used to solve for the exponential growth rate r = r_B, assuming R_G as R-value, or to solve for R = R_B, assuming r_G as r-value, in the general E-L equation above, then, after some simplification,

r_B = (λ + r_G)(R_G^{1/α} − 1) = r_G(1 + r_G/λ) ≥ r_G and R_B = (1 + r_G/(λ + r_G))^α ≤ R_G.

Finally, if the generation time G has a Γ(α, λ) distribution and another time S has a Gamma distribution with the same mean but larger variance, so that the coefficient of variation of S is larger, by a factor c > 1, than that of G, then S will have a Γ(α/c², λ/c²) distribution (this situation relates to the problem treated in Sections 4 and 7.2 in the main paper). Straightforward calculations, using this density to solve for the exponential growth rate r = r_S, assuming R_G as R-value, or for R = R_S, assuming r_G as r-value, in the general E-L equation above, yield, after some simplification,

r_S = (λ/c²)(R_G^{c²/α} − 1) and R_S = (1 + c² r_G/λ)^{α/c²}.

It is again straightforward to show that, all other parameters fixed, r_S is an increasing function of c for c ≥ 1 and that R_S is a decreasing function of c for c ≥ 1. Thus, use of a Gamma distribution with larger variance than the generation time distribution but the same mean always leads to overestimation of the exponential growth rate and underestimation of the basic reproduction number, given the true value of the other parameter, since c = 1 corresponds to the values generated by the true generation time distribution.
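Numerically, with the simulation model's parameters, these formulas quantify both biases; R_B below comes out close to the average renewal-equation estimate of about 1.57 reported above (Python, illustrative):

```python
# Effects of using the wrong density in the Euler-Lotka equation,
# for G ~ Gamma(alpha, rate lam) with true values R_G and r_G.
alpha, lam, R_G = 3, 0.2, 1.7
r_G = lam * (R_G ** (1 / alpha) - 1)

# (i) Backward distribution Gamma(alpha, lam + r_G):
r_B = (lam + r_G) * (R_G ** (1 / alpha) - 1)    # = r_G * (1 + r_G/lam) > r_G
R_B = (1 + r_G / (lam + r_G)) ** alpha          # < R_G, close to 1.57
print(round(r_B, 4), round(R_B, 3))

# (ii) Same mean, coefficient of variation inflated by factor c:
# S ~ Gamma(alpha/c**2, rate lam/c**2)
c = 1.026
r_S = (lam / c ** 2) * (R_G ** (c ** 2 / alpha) - 1)
R_S = (1 + c ** 2 * r_G / lam) ** (alpha / c ** 2)
assert r_S > r_G and R_S < R_G                  # over- and underestimation
```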

Multiple exposures and estimating the incubation period
As discussed in Section 5 of the main paper, in order to estimate the parameters of the incubation time distribution and the infection probability per contact p, once a parametric model g with parameters θ has been assigned, it should be possible to use maximum-likelihood techniques on the relevant part of the model likelihood, conditioning on the number k and times e_1, ..., e_k of exposures, with s being the time of symptoms, i.e. on a contribution of the form

Σ_{i=1}^k p(1 − p)^{i−1} g_θ(s − e_i).

It is also possible to find moment relations that allow estimation of p and of the mean and variance of the incubation time distribution, under some further assumptions on the exposure process.
Suppose that the exposure process is modelled as a Poisson process with constant rate µ, with t = 0 (= e_1) at the first contact with an infective, that the probability of infection per contact is p, independently at each contact, and that the time from infection to symptoms is denoted by T, with mean E(T) and variance Var(T).
Then the index I of the infecting contact will be Geom(p), i.e. P(I = k) = p(1 − p)^{k−1}, k = 1, 2, . . .. After infection, the number N of further contacts before symptoms appear will, given T, have a Poisson distribution with mean µT. The number of observed contacts will be C = I + N, with the summands independent. The relations E(C) = 1/p + µE(T) and Var(C) = (1 − p)/p² + µE(T) + µ²Var(T) will then hold.
Denote by S the time from the first contact to symptoms. Then S is the sum of the time until infection and the incubation time. The time until infection is, given I, Gamma distributed with parameters I − 1 and µ. Thus

E(S) = (1 − p)/(pµ) + E(T) and Var(S) = (1 − p²)/(p²µ²) + Var(T).

Since E(C), Var(C), E(S) and Var(S) can be estimated from data, the four equations can be used for moment estimation of p, µ, E(T) and Var(T).
Both ML estimation and moment estimation have been implemented in a small simulation experiment, with 1000 replicates of estimation based on 500 individuals, each having a number of contacts C and a time S between first contact and symptoms. In the simulation, it was assumed that p = 1/2, that the incubation period had mean 11.4 (sd = 8.1) and that contacts happened according to a Poisson process with intensity 0.0725 (mean interval between contacts = 13.8 days). To test the stability of the estimation methods, one simulation run was performed with Gamma-distributed incubation times, so that the Gamma target distribution assumed in the ML method was correctly specified, and another run used log-normally distributed times with the same mean and variance, while still using the Gamma distribution as target in the ML estimation. From Table 1 one may conclude that the ML method works well if the correct distribution is assumed, but less well in case of misspecification, while the moment method seems quite stable, perhaps with a hint of downward bias for the log-normally distributed data. However, further research is needed on the best moment expressions to use and on possible other approaches to this estimation problem.
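A moment-based estimation run along these lines can be sketched as follows (Python, illustrative). The closed-form solution of the four moment equations, first µ = (E(C) − 1)/E(S) and then p, E(T) and Var(T), is our own algebra derived from the stated relations, not taken from the paper; a sample far larger than the paper's 500 individuals is used to make the check tight:

```python
import math, random

random.seed(3)
p_true, mu_true = 0.5, 0.0725
mean_T, sd_T = 11.4, 8.1
shape, scale = (mean_T / sd_T) ** 2, sd_T ** 2 / mean_T   # Gamma for T

def poisson(lam):
    """Knuth's method; adequate for the small means occurring here."""
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

C, S = [], []
for _ in range(100000):
    i = 1                                   # I ~ Geometric(p) on 1, 2, ...
    while random.random() > p_true:
        i += 1
    t_inf = sum(random.expovariate(mu_true) for _ in range(i - 1))
    T = random.gammavariate(shape, scale)   # incubation time
    C.append(i + poisson(mu_true * T))      # observed number of contacts
    S.append(t_inf + T)                     # time first contact -> symptoms

def mean(x):
    return sum(x) / len(x)

def var(x):
    m0 = mean(x)
    return sum((xi - m0) ** 2 for xi in x) / (len(x) - 1)

a, b, m, v = mean(C), var(C), mean(S), var(S)
mu_hat = (a - 1) / m
p_hat = 2 / (a + 1 + mu_hat ** 2 * v - b)
ET_hat = m - (1 - p_hat) / (p_hat * mu_hat)
VT_hat = v - (1 - p_hat ** 2) / (p_hat ** 2 * mu_hat ** 2)
print(round(p_hat, 2), round(mu_hat, 4), round(ET_hat, 1))
```

The estimates should land close to the true values p = 0.5, µ = 0.0725 and E(T) = 11.4.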

Generation times and serial intervals
About the distributions of G and S In Section 4 of the main paper, the possibility that the distributions of generation times G and serial intervals S be identical is investigated.
According to the representations in that Section, if the end of the latent period/start of the infectious period exactly coincides with the appearance of symptoms (s_0 = ℓ_0, where ℓ_0 denotes the latent period of the infector), one then has G = s_0 + g* and S = s_1 + g*, where s_0 and s_1 are the incubation periods of infector and infectee (equal to the latent periods under this assumption), assumed independent and identically distributed, and g* is a part of the infectious period i_0 of the infector, thus potentially dependent on s_0. It is then evident that independence of s_0 and g* (or i_0, depending on how g* is constructed) implies equality of the distributions of G and S. The (theoretical) question remains, however, whether this condition is also necessary. If it is not, a presumably weaker sufficient condition would suffice. In the abstract, the question can be answered: it is possible to have three random variables (A, B, C) (representing respectively s_0, s_1 and g*) such that A is independent of B and B is independent of C and the distributions of A + C and B + C are the same, but A and C are not independent. Probably the simplest construction showing this is a suitable distribution on a "cube" of binary coordinates, chosen so that the distributions of A + C and B + C coincide and, finally, so that A is independent of B and B independent of C but A is not independent of C.
In Section 4 of the main paper, it is shown that Cov(s_0, ℓ_0 + g* − s_0) = 0, which under the assumption s_0 = ℓ_0 reduces to Cov(s_0, g*) = 0, is necessary for the distributions of G and S to be equal, but we have not been able to formulate a condition that is both necessary and sufficient.

On the simulation model and results
In Section 4 of the main paper it is shown that E(G) = E(S) but that Var(S) = Var(G) − 2Cov(s_0, ℓ_0 + g* − s_0), where ℓ_0 denotes the latent period of the infector (it should be noted that the distribution of the infectious period, and thus of g*, depends on a size-biased version of the model infectious period (see e.g. [5]), but that this does not affect the present argument). Since the simulation model assumes independence between the latent period ℓ_0 and the infectious period, upon which g* depends, and that s_0 = uℓ_0, where u ∼ Uniform[0.8, 1.2] independently, the difference between Var(S) and Var(G) can be written as 2(Var(uℓ_0) − Cov(uℓ_0, ℓ_0)). Since E(u) = 1, this reduces to 2E(ℓ_0²)(E(u²) − 1). Since E(ℓ_0²) = 150 and E(u²) = 1.0133, we should have Var(S) − Var(G) = 3.99. Furthermore, in the model we have G ∼ Γ(3, 1/5) and thus Var(G) = 75. In the simulation results, we find the estimates Var(G) ≈ 52.4 and Var(S) ≈ 54.6. However, the generation times are observed "backwards", in which case theory predicts that the observed distribution should change from Γ(3, 0.2) to Γ(3, 0.2387), leading to Var(G) = 52.7, which is in good agreement with the simulations. It thus appears that the "backward" observation shortening also affects the observed serial intervals and the difference between the observed generation time and serial interval distributions. In this case, in order to theoretically predict the observed difference in variances, one would have to calculate the effect of "backward" observation of a generation time on the marginal distributions of the corresponding latent and incubation periods.
Finally, it may be interesting to investigate how large the quotient between the coefficient of variation of S and that of G (the parameter c in a previous Section) is as a consequence of the model assumptions. Since the means of G and S are the same, we will have c² = Var(S)/Var(G) and thus c² = 78.99/75 = 1.053 according to the above calculations. Thus c = 1.026 in the simulation model.
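The variance bookkeeping above can be verified in a few lines (Python, illustrative; the exact arithmetic gives 4.0, the 3.99 in the text coming from rounding E(u²) to 1.0133):

```python
# Variance bookkeeping for G and S in the simulation model.
E_l2 = 150.0                 # E(l0^2) for latent period l0 ~ Gamma(2, rate 1/5)
E_u2 = 1 + 0.4 ** 2 / 12     # E(u^2) for u ~ Uniform(0.8, 1.2)
diff = 2 * E_l2 * (E_u2 - 1)
print(round(diff, 2))        # Var(S) - Var(G), about 4.0

var_G = 3 / 0.2 ** 2         # G ~ Gamma(3, rate 0.2): Var(G) = 75
c2 = (var_G + diff) / var_G
print(round(c2 ** 0.5, 3))   # c = 1.026, as in the text
```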

Estimating the growth rate
Estimating the growth rate of the outbreak, its doubling time or other equivalent measures (under the assumption of exponential growth) is interesting per se and is useful for predicting the future size of the outbreak if it is assumed that the current growth rate will not change. There are several possible methods but, unfortunately, it seems difficult to evaluate them theoretically on finite samples, and there seems to be a lack of systematic evaluations adapted to infectious disease spread data. A sensible approach is then to evaluate the performance of the chosen estimation method in simulations as close as possible to the data-generating situation. We have therefore evaluated various methods, estimating the growth rate directly from case data, in our simulation model. The compared methods are:
- a) linear regression on logarithms of cumulative numbers of notified cases,
- b) linear regression on logarithms of daily numbers of notified cases,
- c) taking the mean of daily ratios of successive cumulative numbers,
- d) estimating exp(r), the daily multiplication factor, with a branching process type estimator of the form (n(2) + · · · + n(K))/(n(1) + · · · + n(K − 1)), where n(i) is the number of cases notified on day i, and reporting the logarithm,
- e) fitting the discretized renewal equation described in Section 2 of the main paper to observed incidence data, using the generation time distribution estimated from backward times. This method produces an estimate of R0 and not of the growth rate r, but can anyway extrapolate values of future incidence.
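As an idealized, noise-free illustration of methods a)-d) (a Python stand-in for the analysis the paper performed in R; the 42-day window and the level 4500 follow the setup above), all four recover r exactly on purely exponential data:

```python
import math

# Last 6 weeks of a long-established epidemic: cumulative totals growing
# as exp(r t), ending at 4500 notified cases.
r_true = 0.0387
cum = [4500 * math.exp(r_true * (t - 42)) for t in range(43)]
daily = [cum[t] - cum[t - 1] for t in range(1, 43)]   # 42 daily counts n(i)

def slope(y):
    """Least-squares slope of y against day number 0, 1, ..., len(y)-1."""
    n = len(y)
    tbar, ybar = (n - 1) / 2, sum(y) / n
    num = sum((t - tbar) * (yt - ybar) for t, yt in enumerate(y))
    den = sum((t - tbar) ** 2 for t in range(n))
    return num / den

r_a = slope([math.log(c) for c in cum])               # a) log cumulative
r_b = slope([math.log(d) for d in daily])             # b) log daily
ratios = [cum[t] / cum[t - 1] for t in range(1, 43)]
r_c = math.log(sum(ratios) / len(ratios))             # c) mean daily ratio
r_d = math.log(sum(daily[1:]) / sum(daily[:-1]))      # d) branching-type
print([round(x, 4) for x in (r_a, r_b, r_c, r_d)])    # all equal to 0.0387
```

On stochastic data the estimators differ, which is what the simulation comparison below quantifies.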
Using the 1000 simulated epidemic trajectories, the estimators a)-e) of the exponential growth rate r and their usefulness in predicting the epidemic size 6 weeks after reaching 4500 notified cases were tested. The methods a)-d) estimate r using data from the last 6 weeks before reaching level 4500, while method e), based on the discretized renewal equation (Eq. (5), Section 2 of the main paper), estimates R0, using regression weights derived from the estimated generation time distribution based on observed backward serial intervals, as suggested in [1].
In the simulation model, the true value of r is 0.03870 and that of exp(r) is 1.03946. It should be noted that what is then used in the prediction of the situation 6 weeks later is the factor exp(42r) (with true value = 5.07973), which is also studied, both in isolation and when used as a predictor in combination with the last observed datum.
All four estimators of r show values reasonably concentrated around the true value 0.0387, as shown in Table 2. However, all overestimate r slightly: the 95% confidence intervals do not contain the true value, although the bias is less than 1% on average. If one considers the prediction factor exp(42r), all four estimators again overestimate a little, on average. Estimator a) is the best, with mean 5.14729, but the sd of the distribution is now 0.48000 and a 95% prediction interval ranges from 4.2914 to 6.1641, i.e. [−16%, +21%] relative to the true value.
However, if one implements the prediction by multiplying the last cumulative datum by the estimate of exp(42r) and divides the result by the true cumulative datum 42 days later, there is still overestimation, by about 1%, but the prediction interval for method a) has shrunk a little, to [−13%, +18%]. There is no big difference between the methods, but a) seems to have a slight advantage.
Finally, we study the performance of the renewal equation approach proposed in [1]. This approach is intended to allow estimation of Rt (in our simulation, Rt is kept constant = R0 all the time and the method is adapted accordingly) and further to allow prediction of cases 6 weeks after the last datum. There are many small details to decide when using this method. We have made the following assumptions and considerations: -the time series of reported cases starts with day 1, when the first case is reported (= becomes symptomatic in our simulation), and goes on until the day the total of 4500 is reached. The length of this series is thus different from the one counted from the introduction of the first infective.
-the method uses the serial interval distribution, assumed Gamma, as estimated from data. However, this distribution has to be discretized to daily probabilities. We know from previous discussions that this distribution is a biased estimate of the true generation time distribution.
-assuming the auto-regressive Poisson model for the time series of daily cases, the estimator of R0 can be explicitly deduced.
-with this estimated R 0 , the time series can be brought forward until the desired prediction date is reached.
The results indicate that this method underestimates R0 but nevertheless has good predictive power, as follows: -while the true value of R0 in the simulation is 1.7, the mean of the estimates obtained is 1.566, with 95% confidence limits 1.564 and 1.568; the minimum value in 1000 simulations was 1.467 and the maximum 1.683; -using the quotient predicted/true observed value 6 weeks later as an indicator of predictive accuracy, the mean was 1.0040 with 95% confidence limits 1.0006 and 1.0074, with minimum value 0.856 and maximum 1.213.
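A minimal sketch of the estimator in question (Python, illustrative; the simple scheme of discretizing the Gamma density to daily weights is our assumption, not necessarily the paper's exact choice) shows both the method and the bias from using the contracted "backward" distribution:

```python
import math

def discretize_gamma(alpha, rate, smax):
    """Daily weights w[1..smax]: Gamma density at integer days, renormalised.
    (Deliberately simple discretization, assumed for illustration.)"""
    pdf = [rate ** alpha * s ** (alpha - 1) * math.exp(-rate * s)
           / math.gamma(alpha) for s in range(1, smax + 1)]
    tot = sum(pdf)
    return [x / tot for x in pdf]

def estimate_R(n, w):
    """Poisson autoregressive MLE: total cases over total expected
    infection pressure Lambda(t) = sum_s w[s] * n(t - s)."""
    num = den = 0.0
    for t in range(len(w), len(n)):
        num += n[t]
        den += sum(w[s - 1] * n[t - s] for s in range(1, len(w) + 1))
    return num / den

r = 0.0387
incidence = [math.exp(r * t) for t in range(120)]   # established growth

w_true = discretize_gamma(3, 0.2, 60)      # true generation time density
w_back = discretize_gamma(3, 0.2387, 60)   # contracted "backward" density
R_hat_true = estimate_R(incidence, w_true)
R_hat_back = estimate_R(incidence, w_back)
print(round(R_hat_true, 2))   # close to the true R0 = 1.7
print(round(R_hat_back, 2))   # biased low, close to the 1.57 reported above
```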