The ideal reporting interval for an epidemic to objectively interpret the epidemiological time course
The reporting interval of infectious diseases is often determined as a time unit in the calendar regardless of the epidemiological characteristics of the disease. No guidelines have been proposed to choose the reporting interval of infectious diseases. The present study aims at translating coarsely reported epidemic data into the reproduction number and clarifying the ideal reporting interval to offer detailed insights into the time course of an epidemic. We briefly revisit the dispersibility ratio, i.e. ratio of cases in successive reporting intervals, proposed by Clare Oswald Stallybrass, detecting technical flaws in the historical studies. We derive a corrected expression for this quantity and propose simple algorithms to estimate the effective reproduction number as a function of time, adjusting the reporting interval to the generation time of a disease and demonstrating a clear relationship among the generation-time distribution, reporting interval and growth rate of an epidemic. Our exercise suggests that an ideal reporting interval is the mean generation time, so that the ratio of cases in successive intervals can yield the reproduction number. When it is impractical to report observations every mean generation time, we also present an alternative method that enables us to obtain straightforward estimates of the reproduction number for any reporting interval that suits the practical purpose of infection control.
Notifications of infectious diseases occur in regular time intervals to inform infectious disease epidemiologists and public health officials about the magnitude of epidemics (Giesecke 2002). Case notification also gives information about (i) the time trends of infection, i.e. whether the time course of an epidemic is in the upward or downward direction, (ii) an indication of how steep the rise and fall elements are, and (iii) sometimes about the impact of intervention measures, e.g. if the introduction of mass vaccination results in a reduction in the number of infections (Chorba 2001). However, in many instances, the observed data do not permit capturing such a change in the epidemiological time course because the reporting interval is often defined as a time unit in the calendar (e.g. week, month or year) for practical convenience. Guidelines for choosing a specific reporting interval to understand the epidemiological dynamics of infectious diseases are currently lacking.
A statistical method to determine the reporting interval is density estimation, which may suggest a bin width to plot the histogram of case reports (Silverman 1986; Scott 1992). However, we expect that the epidemic curve spikes when successive waves of infections result in successive waves of reported cases, and in this sense, using bin width as recommended by density estimation (i.e. the reporting interval informed by the smoothing method) could suggest too coarse bins that smooth out several generations of cases occurring in a single reporting interval. To interpret the time course of an epidemic, case notifications are used to estimate a key variable that characterizes transmissibility with time. The effective reproduction number at time t, Rt, defined as the average number of secondary cases per primary case at time t (for t > 0), is a useful measure to inform about the transmission potential of a disease and indications of the expected number of secondary transmissions and of control efforts required to curb the epidemic (Ferguson et al. 2001, 2005; Haydon et al. 2003; Wallinga & Teunis 2004; Cauchemez et al. 2006a,b; Fraser 2007; Garske et al. 2007; White & Pagano 2008a,b). There are algorithms for transforming epidemic curves into the time course of Rt (Wallinga & Teunis 2004; Cauchemez et al. 2006a), but these require symptom onsets in fine time scale. Although the most precise reporting interval (e.g. reporting in a continuous time scale) would certainly yield the most ideal interpretation of the transmission dynamics, it is often impractical to report cases on an hourly or daily basis.
The present study proposes guidelines for selecting optimal reporting intervals, demonstrating that the ideal bin width should be determined by the distribution of the generation time, which is defined as the time from infection of a primary case to infection of a secondary case infected by the primary case (Svensson 2007). When it is impractical to report observations every mean generation time, we introduce an alternative simple algorithm to deal with interval censoring. In all cases, we show that the observed data permit obtaining straightforward estimates of the effective reproduction number that are useful for epidemic control. To understand the implications associated with the number of cases in a defined reporting interval, we start our discussion with a brief historical note on the earliest concept of Rt proposed by Clare Oswald Stallybrass (1881–1951).
2. Stallybrass's dispersibility
We first discuss a historical theory by Stallybrass, who wrote one of the earliest epidemiologic textbooks, Principles of Epidemiology, in 1931 (Stallybrass 1931), proposing ‘dispersibility’ as one of the epidemiological markers (see electronic supplementary material for a detailed historical account of Stallybrass). Dispersibility was defined as a measurement of the ‘total effect of factors affecting the spread of any specific infection at a given time and place’ (Stallybrass 1931); the factors he discussed were ‘sometimes intrinsic but more often depending upon either external or secondary factors’. Stallybrass calculated the ‘dispersibility ratio’ using epidemic data given in terms of the number of cases by reporting interval as follows:
Nevertheless, the ratios did not highlight epidemiological characteristics (e.g. generation time) of the disease, not allowing the comparison of ratios obtained for different diseases. To address this issue, Stallybrass introduced a ‘correcting factor’, i.e.
3. Correcting the dispersibility ratio
As a prelude to the estimation of Rt using coarsely reported data, here we correct the dispersibility ratio in the light of the relationship between the reproduction number and generation time (Roberts & Heesterbeek 2007; Wallinga & Lipsitch 2007) and show that this relationship enables us to adjust the reporting interval with respect to the mean generation time. The second earliest concept of the effective reproduction number was proposed by Nold (1979) who defined Rt using the mean generation time, μ, as follows:
Even when the reporting interval is not an exact multiple of the mean generation time, the relationship between R and μ can be derived. Assuming exponential growth of cases with the intrinsic growth rate r, the ratio of cases in successive reports is given by Jk+1/Jk = exp(rΔt) (appendix A 3). If we further assume that the generation-time distribution follows a delta function with mean μ, R is given by exp(μr) (Wallinga & Lipsitch 2007). Combining the two expressions gives R = (Jk+1/Jk)^(μ/Δt) = (Jk+1/Jk)^(1/n), which is the relationship shown in equation (3.6), where n = Δt/μ (equation (3.5)) is, under this assumption, a positive real number.
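As a minimal illustration of this relationship, the following Python sketch computes R from two successive case counts under the constant-R, delta-function assumptions stated above. The function name `reproduction_number_from_ratio` and the example counts are hypothetical and are not taken from the datasets analysed in this study.

```python
def reproduction_number_from_ratio(j_k, j_k1, dt, mu):
    """Return R = (J_{k+1} / J_k) ** (1 / n) with n = dt / mu.

    Assumes a constant R, exponential growth of cases and a delta-function
    generation-time distribution with mean mu (same time unit as dt).
    """
    if j_k <= 0 or j_k1 <= 0:
        raise ValueError("case counts must be positive")
    if dt <= 0 or mu <= 0:
        raise ValueError("dt and mu must be positive")
    n = dt / mu  # number of generations per reporting interval
    return (j_k1 / j_k) ** (1.0 / n)


# Hypothetical counts: 100 cases in one month and 400 in the next,
# with a 15-day mean generation time (two generations per report).
r_monthly = reproduction_number_from_ratio(100, 400, dt=30, mu=15)

# Same counts, but with the reporting interval equal to the mean
# generation time: R is then simply the ratio of successive counts.
r_matched = reproduction_number_from_ratio(100, 400, dt=15, mu=15)

print(r_monthly, r_matched)
```

With n = 1 the exponent disappears, which is why a reporting interval equal to the mean generation time lets the raw ratio be read directly as R.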
4. Estimation of R and ideal reporting interval
4.1. Approximating the epidemic curve
A constant R applies only when exponential (or geometric) growth of cases is continuously observed over time or in an endemic state. Nevertheless, with only a slight extension of the model, the ratio of cases in successive reporting intervals becomes extremely useful for interpreting the course of an epidemic, especially when the denominator of the ratio is sufficiently large. Our strategy is illustrated in figure 2. Even when we do not have access to data in fine time scale, the effective reproduction numbers, Rt, in each reporting interval can be estimated, assuming exponential (or geometric) increase in infected individuals in each interval. Allowing a different growth rate in each reporting interval yields an approximated epidemic curve.
Here we extend and correct the theory of the dispersibility ratio, examining two different historical datasets, i.e. epidemics of smallpox and influenza. For the case of smallpox, we examine monthly incidence of smallpox for the entire Netherlands from 1870 to 1873 (Ministerie van Binnenlandsche Zaken, The Netherlands, 1875). The epidemic of variola major started in January 1870 with 20 575 cases reported over a period of 48 months. The original data are available from the electronic supplementary material. The influenza dataset is the daily incidence of the fall wave of the Spanish influenza pandemic in San Francisco from 23 September to 24 November 1918, which was revisited previously (Department of Hygiene, Japanese Ministry of Interior 1922; Chowell et al. 2007). A total of 28 310 influenza cases were reported during an observation period of 63 days. We selected these datasets to illustrate the two methods described in the following sections.
4.2. Smallpox: geometric approximation
Suppose that smallpox cases are reported only monthly. Because the generation time of smallpox is approximately half a month (Lotz 1880; Nishiura & Eichner 2007), it is difficult to estimate Rt by generation using monthly reports alone. Nevertheless, assuming that the reproduction numbers of two generations in a single reporting interval are identical, it is feasible to approximate Rt for each reporting interval. Let the effective reproduction numbers in reporting intervals k and k + 1 be Rk and Rk+1, respectively. We assume geometric growth of cases with a constant growth factor in each reporting interval. In a heterogeneously mixing population, Rk is interpreted as the average number of secondary cases generated by a typical primary case in the reporting interval k, which is given by the dominant eigenvalue of the next-generation matrix in that reporting interval (Diekmann & Heesterbeek 2000). Given an observation of Jk cases in interval k, the expected number of cases in the next interval k+1, E(Jk+1| Jk), is given by
Assuming a mean generation time of 15 days for smallpox, we have exactly two generations (i.e. n = 2) for each monthly report. Applying equation (4.1) to the observed monthly smallpox incidence in The Netherlands, Rk can be estimated (figure 3). As Rk is estimated by the simple ratio of cases in equation (4.1), the model perfectly predicts the coarsely reported number of cases in each interval. The approximated Rt represents the increase and decrease in cases with time (figure 3b). The 95 per cent confidence interval (CI) is derived from the profile likelihood (appendix A 2). As the precision of the estimate is influenced by the observed number of cases (especially, by the denominator of the ratio in equation (4.1)), wide 95 per cent CIs are observed during the early and later stages of the epidemic.
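The geometric approximation can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reproduction of equation (4.1): each interval is taken to contain exactly n generations, every generation within interval k grows by the factor Rk, and the first generation of interval k+1 descends from the last generation of interval k. Under these assumptions the ratio Jk+1/Jk equals the sum of Rk+1^i for i = 1..n divided by the sum of Rk^(-i) for i = 0..n-1, which reduces to Rk+1 when n = 1 and to R^n when Rk = Rk+1 = R. The ratio alone cannot identify the reproduction number of the first interval, so `r_first` is a user-supplied assumption; all function names and the example counts are hypothetical.

```python
def geometric_sum(r, start, stop):
    """Sum of r**i for i = start..stop (inclusive)."""
    return sum(r ** i for i in range(start, stop + 1))


def expected_next_cases(j_k, r_k, r_k1, n):
    """Expected J_{k+1} given J_k under the assumed geometric model."""
    return j_k * geometric_sum(r_k1, 1, n) / geometric_sum(1.0 / r_k, 0, n - 1)


def solve_next_r(j_k, j_k1, r_k, n, iterations=200):
    """Solve expected_next_cases(j_k, r_k, R, n) == j_k1 for R by bisection."""
    if j_k <= 0 or j_k1 <= 0 or r_k <= 0:
        raise ValueError("case counts and r_k must be positive")
    target = (j_k1 / j_k) * geometric_sum(1.0 / r_k, 0, n - 1)
    lo, hi = 0.0, 1.0
    # R + R**2 + ... + R**n increases with R, so bracket the root first.
    while geometric_sum(hi, 1, n) < target:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if geometric_sum(mid, 1, n) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def estimate_r_series(cases, n, r_first):
    """Estimate R_k for every interval; r_first is an assumed starting value."""
    estimates = [r_first]
    for j_k, j_k1 in zip(cases, cases[1:]):
        estimates.append(solve_next_r(j_k, j_k1, estimates[-1], n))
    return estimates


# Hypothetical monthly counts with two generations per report (n = 2).
monthly_cases = [100, 360, 180]
r_estimates = estimate_r_series(monthly_cases, n=2, r_first=1.5)
print([round(r, 3) for r in r_estimates])
```

Because each estimate is solved from the observed ratio, feeding the estimates back into `expected_next_cases` reproduces the reported counts, which mirrors the statement above that the model predicts the coarsely reported data exactly. The estimates do depend on the assumed `r_first`.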
It should be noted that when both Rk and Rk+1 are close to 1, equation (4.1) results in our correction of R in Stallybrass's dispersibility (i.e. equation (3.6)). Moreover, if n = 1 (i.e. if each reporting interval contains exactly one generation), equation (4.1) is reduced to
To assess the approximation, i.e. if we can suggest the mean generation time as the reporting interval, the following condition representing the relationship between variance-to-mean ratio, σ2/μ, of the generation-time distribution and intrinsic growth rate, r0, of an epidemic is useful (appendix A 1):
4.3. Influenza: exponential approximation
The strategy discussed above for smallpox applies only when the reporting interval is an integer multiple of the generation time. When such a strategy is difficult to apply, we instead assume exponential growth in each reporting interval using different growth rates for each interval. Let the exponential growth rates in intervals k and k+1 be rk and rk+1. Given the number of cases in interval k, Jk, the expected number of cases in the next interval k + 1, E(Jk+1| Jk), is
Figure 4a shows the estimated Rk using the daily incidence of the Spanish influenza pandemic in San Francisco. Although there is not yet a consensus on the generation time of influenza, with estimates ranging from 2.6 to 5.3 days (Fraser 2007), here we assume for simplicity that the mean μ is 3 days following recent studies (Carrat et al. 2008; White & Pagano 2008a). If we further assume that the generation-time distribution is given by a delta function, we can calculate Rk = exp(μrk), and make a comparison between our approximate Rk and another approximate Rt by each generation time (i.e. equation (4.2)). Again, we observe wide uncertainty bounds for Rk, where there are only a small number of cases. Nevertheless, even when we only have weekly reports of influenza in hand (figure 4b), figure 4c visually confirms overall a good approximation of Rk to Rt. Note that Rk is drawn according to the corresponding reporting interval k. Although the precision of Rk is limited for the coarsely reported data (see below), Rk based on weekly reports (or the prediction based on equation (4.4)) perfectly predicts the observed weekly data. It should be noted that the approximate Rk is still useful to observe the threshold condition (where Rk < 1), enabling us to understand the time course of an epidemic. It should also be noted that we get Rk > 1 for the fifth week of the epidemic (figure 4c), even though the number of cases in the next interval (i.e. sixth week) was smaller; this reflects dependency between adjacent reporting intervals (i.e. equation (4.4)). If we assume random mixing of individuals, 1 − 1/Rk suggests a required control effort to contain an epidemic (e.g. required coverage of mass vaccination in a given interval). Although homogeneous mixing is often not the case in reality, figure 4d would inform public health experts about an estimate of required effort and allow an assessment of control measures. 
Figure 4e shows time variations in the estimated reproduction numbers obtained from weekly data, assuming three different mean generation times (i.e. 2, 3 and 4 days). When Rk > 1, the longer the generation time, the higher the Rk we get; i.e. our analytical understanding in equation (4.5) is maintained even when the observation is coarsely reported. The relationship between the generation time and Rk is reversed when Rk < 1.
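The exponential approximation and its dependence on the assumed generation time can be sketched as follows. The sketch uses only the piecewise exponential incidence described in appendix A 3, where the incidence in interval k is proportional to exp(rk t) and is continuous at the interval boundary, together with the delta-function relation Rk = exp(μrk). The growth rate of the first interval (`r_first`) cannot be identified from the ratio alone and is an assumption supplied by the user; all function names and the weekly counts are hypothetical.

```python
import math


def interval_integral(r, dt):
    """Integral of exp(r * t) for t in [0, dt]; tends to dt as r -> 0."""
    if abs(r) < 1e-12:
        return dt
    return math.expm1(r * dt) / r


def solve_next_growth_rate(j_k, j_k1, r_k, dt, iterations=200):
    """Find r_{k+1} reproducing the observed ratio J_{k+1} / J_k.

    Model: J_k = m * F(r_k) and J_{k+1} = m * exp(r_k * dt) * F(r_{k+1}),
    where F is interval_integral. F increases with r, so bisection works.
    """
    if j_k <= 0 or j_k1 <= 0:
        raise ValueError("case counts must be positive")
    target = (j_k1 / j_k) * interval_integral(r_k, dt) * math.exp(-r_k * dt)
    lo, hi = -1.0, 1.0
    while interval_integral(lo, dt) > target:
        lo *= 2.0
    while interval_integral(hi, dt) < target:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if interval_integral(mid, dt) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def growth_rates(cases, dt, r_first):
    """Growth rate per interval; r_first is an assumed starting value."""
    rates = [r_first]
    for j_k, j_k1 in zip(cases, cases[1:]):
        rates.append(solve_next_growth_rate(j_k, j_k1, rates[-1], dt))
    return rates


def reproduction_number(r, mu):
    """R = exp(mu * r), valid for a delta-function generation time."""
    return math.exp(mu * r)


def control_effort(big_r):
    """Fraction 1 - 1/R under random mixing; zero when R <= 1."""
    return max(0.0, 1.0 - 1.0 / big_r)


# Hypothetical weekly counts, not the San Francisco data.
weekly_cases = [50, 400, 1500, 900]
rates = growth_rates(weekly_cases, dt=7.0, r_first=0.3)
for mu in (2.0, 3.0, 4.0):
    print(mu, [round(reproduction_number(r, mu), 2) for r in rates])
```

For a positive growth rate the printed Rk increases with the assumed mean generation time, and for a negative growth rate the ordering is reversed, in line with the pattern described for figure 4e.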
Figure 5a compares the approximated epidemic curves with the observed Spanish influenza cases in San Francisco in 1918 (Department of Hygiene, Japanese Ministry of Interior 1922). As the reporting interval increases, the quality of the approximation diminishes. Figure 5b measures the deviation of approximated curves from observed data as a function of reporting interval. The saturated model, which is useful when the number of parameters equals the number of data points, is employed, allowing comparison of the deviance (i.e. lack of fit) between different reporting intervals. Although a more explicit test of significance cannot be employed, figure 5b shows that a reporting interval whose length is two or three times the mean generation time still approximates well the crude picture of the epidemic curve; e.g. a reporting interval of 7 days yields smaller deviance (χ2 = 889.8) than that of 5 days (χ2 = 3436.4). However, when the interval is too long compared with the generation time, the deviance becomes too large, and it is difficult to capture the observed epidemic pattern. Furthermore, it should be noted that the prediction of weekly reports based on our algorithms is not influenced by the true length of the generation time; as can be seen from equation (4.4), the linear approximation to the observed epidemic curve is independent of the generation-time distribution. By contrast, the precision of approximating Rk using a fixed reporting interval is influenced by the length of the generation time. Reporting intervals that are shorter than or close to the mean generation time yield more precise Rk than longer reporting intervals (§5).
The present study recommends that the reporting interval for case notifications should be taken equal to the mean generation time. This permits estimation of Rt by taking the ratio of cases in successive reporting intervals. If the mean generation time is short, and it is impractical to report observations in every generation time, our alternative algorithm (i.e. equation (4.4)) permits an explicit adjustment of the ratio of cases in successive reports to yield Rt. The method suffers from wide uncertainty when there is only a small number of cases (e.g. during early and late stages of an epidemic), but our approach improves substantially on a previous approach with similar intent (Honhold et al. 2004) in that our method yields a strictly interpretable quantity, Rt, to understand the epidemiological pattern of spread. To the best of our knowledge, this study is the first to estimate the effective reproduction numbers from coarsely reported data by adjusting the reporting interval based on the generation time and discussing the ideal length of reporting intervals in relation to the epidemiological characteristics of a disease. Although the interval of case notification may frequently be influenced by administrative factors, we believe that the present study provides a basis to choose the reporting interval, thereby offering a practical guide for the relevant considerations.
With historical reference to Stallybrass's dispersibility ratio (Stallybrass 1931), we have shown that the ratio of cases in successive reporting intervals is an interpretable measure in a special case (i.e. constant R over time), clarifying technical flaws in the original descriptions by Stallybrass. Moreover, explicitly adjusting the reporting interval to mean generation time, we extended Stallybrass's dispersibility ratio to estimate Rt, approximating the observed epidemic curve by assuming constant growth rates in each reporting interval. Approximating Rt can still capture thresholds and suggest required control efforts, helping public health experts understand the time course explicitly.
Our second algorithm (equation (4.4)) is particularly useful when the reporting interval is shorter than the mean generation time (i.e. Δt < μ, or equivalently, n < 1 where n, in this assumption, is a positive real number). When a single reporting interval does not include several different generations, a short reporting interval with n < 1 can more precisely reflect the transmissibility with time, because n < 1 indicates that infection of cases observed in an interval more likely had happened in previous intervals (not within the same interval) than the case for n > 1, allowing us to precisely capture Rt. Both of our proposed algorithms assume linear growth in each interval, by considering the corresponding interval as separate from its adjacent intervals. Theoretically, it is certainly better to have more precise data (e.g. observation in a continuous time scale) than coarsely reported data in order to fully capture the dependency of infected individuals between adjacent time periods. Considering that only the ratio can account for the dependency between infected individuals in our approaches, we get a straightforward conclusion for our second algorithm: the smaller n is and the smaller the variance-to-mean ratio of the generation-time distribution is, the more precise the estimates of Rk that are obtained.
As a technical note, our method is not suitable for infectious diseases with extremely long generation times (compared with the reporting interval, e.g. in HIV/AIDS). This technical limitation is identical to that of the method given previously (Wallinga & Teunis 2004), and therefore a different approach has to be employed for slowly progressing diseases (Gran et al. 2008). Moreover, the distribution of generation times has to be carefully interpreted for approximation, especially when the distribution is right skewed. Although the mean generation time, μ, is used to calculate the correcting factor, the variance-to-mean ratio needs to be examined in relation to the doubling time to precisely suggest the length of the reporting interval (appendix A 1). When the generation-time distribution is skewed, our second algorithm should be used to estimate Rt because equation (4.5) translates the exponential growth rates in each reporting interval to Rt, highlighting the skewed nature of the generation-time distribution. The estimate of Rt is obtained without too apparent deviations from the observed data when the reporting interval is two or three times the mean generation time, but the approximation is worsened as the interval becomes much longer than the mean generation time.
To use our algorithm in various practical settings, it should be noted that our estimation procedures made the following assumptions.
(i) We are dealing with epidemics where demographic stochasticity (i.e. variation in the numbers of secondary transmissions by chance) is negligible. In other words, we have an unbiased estimator of the growth rate rk or the reproduction number Rk.
(ii) Deterministic exponential (or geometric) growth of cases is assumed in each reporting interval. The growth rate is the same for all subpopulations.
(iii) The number of cases in each interval is measured with perfect accuracy (i.e. no underreporting and no reporting delay).
Whereas we have shown that the distribution of the generation time plays a key role in interpreting the epidemiological time course of an epidemic, it should be noted that the methods for estimating the generation time have yet to be fully established (e.g. clarification of the most useful field data that lead to an estimate of generation time, and thus are useful for estimating R). In particular, the generation time in a heterogeneously mixing population in relation to our condition (4.3) is a topic of future research. Moreover, it is impossible to know in advance the generation time of a newly emerging infectious disease in a population. Therefore, besides the need to develop statistical methods for quantifying the generation-time distribution from empirical observation, we suggest that notifications of emerging infectious diseases be reported as precisely as the public health authority can achieve. Moreover, when it is difficult to fully quantify the generation-time distribution, our study emphasizes the importance of quantifying, at least, the mean generation time so that we can understand the epidemiological time course.
Reporting cases in a fine time scale is crucial for many purposes, but it is often impractical in the public health field to collect and report disease data in very precise time intervals owing to financial, logistic and technical constraints. From our exercise, we showed that the use of the mean generation time as the ideal reference length for the reporting interval of surveillance would be extremely useful to estimate the effective reproduction number. To obtain a quick view of the time course of an epidemic, reporting cases every mean generation time would allow the estimation simply by using a ratio of cases in adjacent reporting intervals, yet saving the cost of reporting in a very short interval. Calculation of the ratio does not require difficult computations. Thus, for example, if the current reporting interval of a disease is close to the mean generation time, one may revise the interval to the mean generation time and estimate the effective reproduction number without spending much additional effort.
This work was supported by The Netherlands Organisation for Scientific Research (NWO grant ID: 918.56.620 and 851.40.074). G.C. received funding from the College of Liberal Arts and Sciences of Arizona State University.
A.1. An ideal length of the reporting interval
To suggest a condition in which the mean generation time can be reasonably used as a reporting interval, we consider the relationship between the generation-time distribution (with mean μ and variance σ2), reporting interval, Δt, and the intrinsic growth rate of an epidemic, r0 (Keyfitz 1968). Following the classic mathematical definition of the length of generation in demography by Lotka (Keyfitz 1968), it is deemed ideal if the reporting interval Δt satisfies the following relationship (i.e. if Δt itself exactly corresponds to the length of generation):
Figure 6a,b shows sensitivity of Δt to the different σ2/μ and r0 for the plausible parameter space of human influenza (μ = 3 days and r0 = 0.14 per day; Ferguson et al. 2005; Chowell et al. 2007; White & Pagano 2008a). When σ2/μ is 0 (i.e. σ2 = 0), Δt would be exactly the same as the mean generation time (figure 6a), which is certainly expected from equation (A 1). Δt is in general slightly shorter than the mean generation time for σ2 > 0. Especially, when the distribution is extremely skewed (e.g. σ2/μ = 3.0), the mean generation time is apparently longer than Δt. The tendency of observing shorter Δt than μ is more pronounced when the intrinsic growth rate r0 is large (figure 6b), especially for a skewed generation-time distribution. These figures highlight the critical importance in empirically investigating both the generation-time distribution and the intrinsic growth rate.
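The qualitative pattern described above can be illustrated with a short Python sketch. It does not reproduce equation (A 1). Instead it assumes Lotka's length of generation T = ln(R0)/r0 together with a normally distributed generation time, for which ln(R0) = μr0 − σ2r0^2/2 and hence T = μ − σ2r0/2. For other distributions this is only a second-order approximation. The function name `generation_length` is hypothetical.

```python
def generation_length(mu, variance, r0):
    """Length of generation T = mu - variance * r0 / 2.

    Assumes a normal generation-time distribution (or, more loosely, a
    second-order cumulant approximation); not the paper's equation (A 1).
    """
    if mu <= 0 or variance < 0:
        raise ValueError("mu must be positive and variance non-negative")
    return mu - 0.5 * variance * r0


mu = 3.0    # mean generation time in days, as assumed for influenza
r0 = 0.14   # intrinsic growth rate per day

# Variance-to-mean ratios of 0, 1 and 3 (variance = ratio * mu).
for ratio in (0.0, 1.0, 3.0):
    t = generation_length(mu, ratio * mu, r0)
    print(ratio, round(t, 3))
```

Under this approximation the suggested interval equals μ when the variance is zero and becomes shorter as either the variance-to-mean ratio or the growth rate increases, which is the direction of the effect described for figure 6.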
A.2. Geometric approximation of Rt
We assume that the reporting interval Δt is a multiple of the mean generation time μ and denote the ratio of Δt to μ by n (where n, in this assumption, is a positive integer given by equation (3.5) in the main text). Supposing that the effective reproduction numbers in kth and (k+1)th reports are Rk and Rk+1, respectively, the numbers of cases in kth and (k+1)th reports, Jk and Jk+1, are
When Rk is constant over time (=R), equation (A 5) results in an interpretation of the corrected Stallybrass dispersibility ratio, i.e. equation (3.6) in the main text. Moreover, when n = 1 (i.e. when the reporting interval is exactly the mean generation time), the likelihood function (A 6) is reduced to
A.3. Exponential approximation of Rt
Let rk and rk+1, respectively, be exponential growth rates in reporting intervals k and k + 1. We assume that the intervals k and k + 1 correspond to periods (t0, t0 + Δt) and (t0 + Δt, t0 + 2Δt), respectively, in calendar time scale, where Δt is the length of the reporting interval. We denote the incidence (i.e. number of newly infected individuals) at time t by I(t) and assume that I(t0) = m, where m is constant. Following the exponential growth, we get I(t0 + t) = m exp(rkt) (=mIk(t)) in interval k and I(t0 + Δt + t) = I(t0+Δt)exp(rk+1t) (=I(t0 + Δt)Ik+1(t)) in interval k + 1. Accordingly, Jk and Jk+1 are given by
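Integrating the incidence defined above over each reporting interval gives the following sketch. It follows only from the stated definitions, with I(t0 + Δt) = m exp(rkΔt) by continuity at the interval boundary, and assumes that rk and rk+1 are non-zero; the layout is illustrative and need not match the numbered equations of the original appendix.

```latex
\begin{align}
J_k &= \int_0^{\Delta t} m\,e^{r_k t}\,\mathrm{d}t
     = \frac{m\left(e^{r_k \Delta t}-1\right)}{r_k},\\
J_{k+1} &= \int_0^{\Delta t} I(t_0+\Delta t)\,e^{r_{k+1} t}\,\mathrm{d}t
     = \frac{m\,e^{r_k \Delta t}\left(e^{r_{k+1}\Delta t}-1\right)}{r_{k+1}},\\
\frac{J_{k+1}}{J_k} &=
     \frac{r_k\,e^{r_k \Delta t}\left(e^{r_{k+1}\Delta t}-1\right)}
          {r_{k+1}\left(e^{r_k \Delta t}-1\right)}.
\end{align}
```

The constant m cancels in the ratio, and when rk = rk+1 = r the ratio reduces to exp(rΔt), in agreement with the constant-growth case used in §3.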
Bailey N. T. J. 1964 The elements of stochastic processes with applications to the natural sciences. New York, NY: Wiley.
Chorba T. L. 2001 Disease surveillance. In Epidemiologic methods for the study of infectious diseases (eds Thomas J. C. & Weber D. J.), pp. 138–162. New York, NY: Oxford University Press.
Delater. 1923 La Grippe Dans La Nation Armée De 1918 A 1921. Revue d'Hygiene 45, 406–426. [In French.]
Department of Hygiene, Japanese Ministry of Interior. 1922 Influenza (Ryukousei Kanbou). Tokyo, Japan: Ministry of Interior. [In Japanese.]
Diekmann O. & Heesterbeek J. A. P. 2000 Mathematical epidemiology of infectious diseases: model building, analysis and interpretation. Chichester, UK: John Wiley and Sons.
Giesecke J. 2002 Modern infectious disease epidemiology, 2nd edn. London, UK: Arnold.
Honhold N., Taylor N. M., Mansley L. M. & Paterson A. D. 2004 Relationship of speed of slaughter on infected premises and intensity of culling of other premises to the rate of spread of the foot-and-mouth disease epidemic in Great Britain, 2001. Vet. Rec. 155, 287–294.
Keyfitz N. 1968 Introduction to the mathematics of population. London, UK: Addison-Wesley Publishing Co.
Lotz T. 1880 Pocken und vaccination. Bericht über die Impffrage, erstattet im Namen der schweizerischen Sanitätskommission an den schweizerischen Bundersrath. Basel, Switzerland: Benno Schwabe, Verlagsbuchhandlung. [In German.]
Ministerie van Binnenlandsche Zaken, The Netherlands. 1875 De pokken-epidemie in Nederland in 1870–1873. s'Gravenhage, The Netherlands: Van Weelden en Mingelen. [In Dutch.]