A farewell to R: time-series models for tracking and forecasting epidemics

The time-dependent reproduction number, Rt, is a key metric used by epidemiologists to assess the current state of an outbreak of an infectious disease. This quantity is usually estimated using time-series observations on new infections combined with assumptions about the distribution of the serial interval of transmissions. Bayesian methods are often used with the new cases data smoothed using a simple, but to some extent arbitrary, moving average. This paper describes a new class of time-series models, estimated by classical statistical methods, for tracking and forecasting the growth rate of new cases and deaths. Very few assumptions are needed and those that are made can be tested. Estimates of Rt, together with their standard deviations, are obtained as a by-product.


Introduction
The degree of infectiousness of a disease is given by the basic reproduction number, R 0 , defined as the number of infections that are expected to result from a single infectious individual in a completely susceptible population. As an infection spreads, immunity starts to develop and for serious diseases, such as coronavirus disease 2019 (COVID-19), social behaviour may change endogenously, or may be modified, perhaps by the imposition of lockdown and social distancing measures. The progress of an epidemic is then usually tracked by the effective, or instantaneous, reproduction number, R t , which is the number of people in a population who get infected by an individual at any specific time (e.g. [1][2][3]). Such tracking is of considerable importance for planning, but it raises the question of whether estimating R t is to be regarded as an end in itself or as a means to an end, namely tracking and forecasting the number of new cases, hospital admissions and deaths.
Harvey & Kattuman [4]-hereafter HK-developed a class of generalized logistic (GL) time-series models for predicting future values of a variable which, when cumulated, is subject to an unknown saturation level. These models are relevant for many disciplines, but attention in HK was focused on applications for coronavirus. 1 Observations on the cumulative series are transformed to growth rates and the logarithms of these growth rates are modelled with a time trend. Allowing this trend to be time varying introduces flexibility and enables the effects of changes in policy and the environment to be tracked by filters for the level and slope. The filters are functions of current and past observations implied by the model. They can produce nowcasts of the current level of the incidence curve, together with forecasts of its future direction. Estimation is by maximum likelihood (ML) and goodness of fit can be assessed by standard statistical test procedures.
The methods used by epidemiologists to assess the current state of an infectious disease use time-series observations on new infections, together with information on the distribution of the serial interval of transmissions, sometimes called the infection profile (e.g. [5][6][7][8]). A brief description of the method in [5] can be found in appendix A. Bayesian methods are often used to combine the information on the serial interval with the observations on new cases, often smoothed by a simple, but to some extent arbitrary, moving average. These formulae effectively link estimates of R t to the growth rate in new cases, as do the more general formulae given in [1].
In our approach, estimates of the growth rate of new cases are produced directly by the time-series model from the raw data. The nowcasts and forecasts of R t , together with the equivalent of Bayesian credible intervals, therefore emerge as a by-product. The underlying assumptions are clear and are subject to diagnostic tests, so estimates of R t are implicitly validated. In contrast to R t , which is not observed directly, the accuracy of forecasts of future observations can be assessed ex post, providing further testing of the effectiveness of the model.
The HK model is reviewed in §2 and in §3 it is shown how new cases growth rate estimates can be used to nowcast R t and make short-term predictions. The implicit weights in the model-based filter are compared with the weights in the simple moving average ratio estimators used by the Robert Koch Institute, Germany (RKI). In §4, data from Germany and Florida are used to illustrate how the model is able to assess the importance of spikes in new cases and track second waves. Section 5 concludes by suggesting that tracking an epidemic by methods dependent on R t may be neither necessary nor desirable: the focus should be on the growth rates of new cases and deaths, together with their predicted time path.

The dynamic Gompertz model and its implementation
The model in HK uses data on the time series of the cumulative total, Y t , of a target series, such as confirmed cases of a disease or deaths. HK show how the theory of GL growth curves suggests observational models of the form where y t = ΔY t = Y t − Y t−1 is the daily change and ε t is a disturbance term. The model for the Gompertz curve is obtained by setting ρ = 1, but subtracting lnY t−1 from both sides gives a simple time trend regression for the logarithm of the growth rate of the cumulated series, that is, lng t where g t = y t /Y t−1 or ΔlnY t .
Remark 2.1. The growth curve for the ascending phase of an epidemic proposed in [9] implies an observational equation of the form (2.1) with γ = 0 and with ρ a deceleration parameter in the range 0 ≤ ρ ≤ 1; see also [7]. When ρ = 1 the cumulative total grows exponentially. The introduction of a time trend with γ < 0 gives sub-exponential growth.
Deterministic trends are too inflexible for most practical time-series modelling. A stochastic trend may be introduced into the equation for lng t and this extra flexibility allows ρ to be set to 1. The resulting dynamic Gompertz model is ð2:3Þ and the normally distributed and serially independent irregular, level and slope disturbances, ε t , η t and ζ t respectively, are mutually independent. When s 2 z is positive but s 2 h ¼ 0, the trend, δ t , is an integrated random walk (IRW). It is this form of the stochastic trend that turns out to be most useful for tracking an epidemic because it is the movements in slope γ t which are crucial for that purpose. The key parameter is then the signal-noise ratio, q ¼ s 2 z =s 2 1 . A deterministic trend is obtained when q is zero. Other components, such as day of the week effects, may be included in the right-hand side of (2.2).
Stochastic trend models can be estimated using techniques based on state-space models and the Kalman filter (KF) [10,11]. Here the computations were performed using the STAMP package 3 [12]. The KF outputs the estimates of the state vector (δ t , γ t ) 0 . Estimates of the state at time t conditional on information up to and including time t are denoted (δ t|t , γ t|t ) 0 and given by the contemporaneous filter; the predictive filter, which outputs (δ t+1|t , γ t+1|t ) 0 , estimates the state at time t + 1 from the same information set. It may sometimes be useful to review past movements by the smoother, denoted (δ t|T , γ t|T ) 0 , which is the estimate of the state at time t based on all T observations. Estimation of the unknown variance parameters is by ML.
Tests for normality and residual serial correlation are based on the standardized innovations, that is, one-step-ahead prediction errors, v t = lng t − δ t|t−1 , t = 3, …, T. Figure 1 shows data for the logarithm of the growth rate of the cumulated series of new cases of COVID-19 in England from early November 2020 until 17 February 2021. 4 The model, which includes a day of the week component and a signal-noise ratio set to q = 0.005, was fitted to observations up to and including 4 February 2021. The bold line indicates the smoothed estimates of the trend and, as will be shown in the next section, it is the estimates of the trend and slope at the end of the series that provide the information needed to compute the nowcasts of R t . The dashed line shows the forecasts from 5 February 2021 onwards. As can be seen these two-week-ahead forecasts are successful in capturing the trend and day of the week movements. Recursions for making forecasts of the actual number of new cases, that is, the y 0 t s, are given in HK. However, the emphasis in this article is on nowcasting and forecasting of the growth rate of y t and this only requires estimates of δ t and γ t .
Remark 2.2. When y t is small, it may be better to specify its distribution, conditional on past values, as discrete. The usual choice is the negative binomial. The way in which a dynamic model may be constructed is set out in HK; software can be found in [13].

Tracking R
The lag of τ reflects the generation interval, which is the number of days that must elapse before an infected person can transmit the disease.
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20210179 The length of the moving average, k, determines the degree of smoothing; a value of k = 7 has the advantage of removing the day of the week effect but at the cost of a slower response. The rationale for b R k,t,t comes from [5]. In Germany, the national figure used by RKI is based 5 on setting τ = 4 and k = 4 or 7, but with some prior nowcasting of the data as described in [14].
A little algebraic manipulation shows is an implicit estimator of g y,t , the growth rate in y t , and the exponential approximation applies when b g y,t is small. In a dynamic Gompertz model, the growth rate of g t is tracked by the filtered estimates of the slope, that is, γ t|t , while the growth rate itself is tracked by g t|t = expδ t|t . Following the continuous time argument 6 leads to g y,t being estimated as where t 0 is the time at which the estimates are deemed to be reasonably reliable. The nowcast of R t suggested by equation The RKI estimator for COVID-19 implies τ = 4.
A general formula linking R t to g y,t is given in [1]. When g y,t is estimated by (3.3), the expression is where M(.) is the moment-generating function of the serial interval distribution, defined as the time between the onset of symptoms in a primary case and the onset of symptoms in secondary cases. When the distribution is degenerate, so that all secondary infections occur after exactly τ days, e R M t ¼ expðtg y,tjt Þ, which is the same as e R e t,t . When the serial interval has a gamma distribution with parameters a and b, implying a mean of ab and a variance of ab 2 , e R M t ¼ ð1 þ bg y,tjt Þ a . Keeping the mean constant and letting b → 0 confirms that e R M t ¼ expðtg y,tjt Þ, where τ is the mean generation interval. Setting a = b = 2, which is consistent with some of the estimates of the mean and variance obtained for COVID-19, yields

Sampling variability of nowcasts
When q, the signal-noise ratio in the Gaussian IRW model, is treated as known, the distribution of γ t , conditional on current and past observations, is normal with a mean γ t|t and a variance s 2 g,tjt that are produced by the KF. The growth rate of the incidence curve, g y,t , depends on g t as well as γ t but, as argued below, its contribution to the variability of g y,t is dominated by that of γ t . When the variability in g t is royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20210179 ignored, the probability that R t exceeds 1, that is, Prðg t . Àg tjt Þ where g t|t is treated as fixed, can be obtained directly from the conditional distribution of γ t . This probability does not depend on which equation estimates R t from g y,t|t and it does not depend on τ.
When R τ,t is defined as 1 + τg y,t , its distribution, again conditional on current and past observations, is normal with mean 1 + τg y,t|t and standard deviation (SD) τσ γ,t|t . On the other hand, the conditional distribution of R e t,t is lognormal with mean Note that expt 2 s 2 g,tjt À 1 ≃ t 2 s 2 g,tjt , so, when E t (R τ,t ) is close to 1, SD t ðR e t,t Þ will be very close to the SD t (R τ,t ). Why is the variability in g t|t ignored? From equation (2.2), g t = expδ t and, because δ t is normal, g t is lognormal with mean m g,tjt ¼ expðd tjt þ 0:5s 2 d,tjt Þ and variance Varðg t Þ ¼ m 2 is typically small so μ g,t|t ≃ expδ t|t = g t|t and Varðg t Þ ≃ but, although s 2 d,tjt is usually larger than s 2 g,tjt , the former is multiplied by g 2 tjt to get Var(g t ), whereas Varðg t Þ ¼ s 2 g,tjt ; note that s 2 d,tjt itself does not depend on the value of g t|t . Although g t|t can be high near the beginning of an epidemic, it tends to fall quite rapidly and once the epidemic is underway it rarely exceeds 0.05. The example of Florida, where the second wave increases g t|t , shows that, even in this case, Var(g t ) remains negligible compared with Var(γ t ).

Predictions of R
Predictions of R t in the dynamic Gompertz model can be made from predictions of g y,t , that is, where ℓ is the number of steps ahead. When g y,T|T is positive, so any estimate of R T given by (3.5) is greater than 1, there is still a saturation level for Y t so long as γ T|T is negative; for example as T → ∞, e R e t,Tþ'jT ! expðtg TjT Þ. When γ T|T is zero, the growth of y t is exponential and in this case it is helpful to characterize it by the doubling time, ln2/g y,T|T = 0.693 exp( − δ T|T ). When γ T|T is positive, as can happen at the start of a new wave, predictions of g y,t should not be made from (3.8). However, it may still be useful to quote the doubling time based on g y,T|T .
If, as in the previous sub-section, it can be assumed that g t is relatively small, the predictive distribution of g y,T+ℓ , and hence of R T+ℓ , is available because the conditional distribution of γ T+ℓ given observations up to and including time T is Gaussian with mean γ T+ℓ|T = γ T|T and variance s 2 Tþ'jT , as produced by the predictive equations of the KF. If the observation at time t actually relates to an event ℓ days earlier, the current R t is better estimated by an ℓ-step-ahead forecast. When γ T|T is negative, this forecast will be less than the nowcast.

Weights
The filtered estimates of g t and γ t in the dynamic Gompertz model, equation (2.2), are obtained by discounting past observations, with the rate of discounting depending on the signal-noise ratio, q. Weights implied by the KF and smoother for estimated states in a linear model can be obtained as output from the STAMP package, using a method described in [16]. The forcing variable in the filter is lng t and the weights assigned to it in the contemporaneous filter are the weights for γ t|t plus the weights for g t|t . When Y t is much larger than y t , as will be the case when an epidemic has been underway for some time, g t|t will be relatively small and attention can be focused on γ t|t . Then, if the weights for the slope, γ t|t , are denoted w j , j = 0, 1, 2, …, where the last approximation follows because lnY t−j−1 is assumed to be changing very slowly and P tÀ2 j¼0 w j ¼ 0. When multiplied by τ, the weights in equation (3.9) feed directly into the estimators of R t implied by equation (3.4) which is similar in form to (3.1) but with summations replaced by products. Figure 2 shows the weights for the slope produced when q is 0.001 and 0.01. ML estimates of q are typically between these values. Setting q = 0.005, which has the first four weights positive and the next 17 negative, is a reasonable default. A higher value gives a faster response, which may be appropriate when there is a sharp change in the environment, perhaps because of a change in policy. However, it comes at the cost of making nowcasts and forecasts less stable.

The early phase of an epidemic
Any modelling is very difficult at the start of an epidemic because of the lack of data; see, for example, the remarks in appendix 3 of [5]. In the dynamic trend model initializing the KF in the absence of prior information is generally done with a non-informative (diffuse) prior on the level and slope, as in the STAMP package; alternatively they can be estimated as unknown parameters. However, in the early part of an epidemic the growth is exponential or very close to it; see, for example, the analysis of the 1918 outbreak of Spanish flu in [17]. Thus, we could set γ 0 = 0. The filter for δ t , the level of lng t , can have a prior distribution informed by knowledge about the basic reproduction number, R 0 . A rough estimate of lng 0 is then given from Choosing a suitable variance for δ 0 is more problematic.

Waves and spikes
After an epidemic has peaked, daily cases start to fall and the concern shifts to the possibility of a second wave and the royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20210179 need to deal with outbreaks indicated by spikes in the data so that they do not morph into waves. The monitoring of waves and spikes raises different issues, primarily because a wave applies to a whole nation or a relatively large geographical unit, whereas a spike is localized.

Spikes
When national numbers are low, a localized outbreak can also result in a jump in the national estimate of R t . However, such a jump does not indicate that there has been a sudden change in the way the infection spreads and so has few implications for overall policy. Figures for new cases in Germany show a sharp increase towards the end of June 2020, caused by an outbreak at a meat-processing factory in the Gütersloh area in Westphalia. Estimates produced by RKI at the time showed a big increase in R t , accompanied by what seems to us to be a rather narrow credible interval. Figure 3    royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20210179 numerator. Although b R 7,4,t irons out some of the daily movement, the estimate of R t is still affected. The model-based e R e 4,t evolves more smoothly. After June the data give no indication of a sustained increase in new cases so the jump in estimates of R t , particularly b R 4,t , can safely be classed as a spike. The model was estimated using data from 25 March to 26 June 2020, using cases data sourced from the European Centre for Disease Prevention and Control (ECDC) website. 7 Estimates obtained using RKI's nowcast data are not very different. 8 The fit was good with very little evidence of residual serial correlation; the Q(15) statistic is 9.58. A Gaussian distribution seems a good approximation because the Bowman-Shenton test statistic, which is asymptotically distributed as x 2 2 under the null hypothesis, is only 0.77. The estimate of q was 0.0026.
The SD of the conditional distribution of γ t is 0.0276. Thus the SD of R t = 1 + 4γ t is 0.110. For R e 4,t setting E t ðR e 4,t Þ ¼ 1 gives SD t ðR e 4,t Þ ¼ 0:111, so the probability that it lies in the interval 9 [0.895, 1.117] is 0.68. It makes little difference whether R t is taken to be normal or lognormal. As regards the contribution of g t to the variability g y,t , the 26 June value of g T|T was only 0.0030 and SD(g T ) was less than 1% of the SD of γ T .

Waves
The US state of Florida, the third most populous in the USA with a population of around 20 million, provides an example of a second wave. A graph of daily new cases 10 from early March until 19 July 2020 shows a peak in early April followed by a steady decline. This is similar to the pattern for Germany and reflects the fact that Florida, like Germany, was in lockdown during April 2020. After April restrictions in Florida were eased there was a levelling out in May 2020, followed by a sharp rise in June. Figure 4 shows the logarithm of the growth rate of the number of confirmed cases, deaths and fraction of positives, starting 22 March 2020. (Before 22 March the data are very erratic.) After May there was an increase in testing.
However, the growth rate in tests is roughly constant from the end of May onwards and this shows up in figure 4, where the logarithm of the growth in the proportion of positives follows a similar path to that of the logarithm of the growth in total cases. This suggests that confirmed cases are still a good indicator of the path of new infections.
Fitting the dynamic Gompertz model, with a daily component, to data on confirmed cases from 22 March to 12 July 2020 gave residuals with very little residual serial correlation as the Q(16) statistic was only 8.42. The Bowman-Shenton test statistic was only 0.11 so a Gaussian distribution cannot be rejected. Graphical confirmation for the good fit is provided by figure 5.
The signal-noise estimate, q, was 0.0014. Figure 6 shows the filtered estimates of g t and γ t . At the beginning of June, γ t|t becomes positive and its sharp rise is accompanied by an attendant rise in g t|t . The increase in γ t|t continues until the end of June, when it changes direction and g t|t peaks.
The implied time series of nowcasts of R t follows directly from their sum, g y,t|t . Thus e R e 4,t reaches 1.5 by the end of June 2020 and then falls in July so that ultimately e R e 4,t ≃ 1:1. The filtered estimates of g t and γ t for confirmed cases in Florida are very different from those for Germany in that g t|t no longer becomes negligible with time. Indeed for most of June it is of a similar order of magnitude to γ t|t . Nevertheless its contribution to the variability of g y,t is still negligible. The SD of the conditional distribution of γ t is 0.0275 while that of δ t is 0.1296, translating into a SD of 0.0057 for g t . If the covariance term is ignored, the SD of g y,t is 0.0281, which is only a little above the SD of γ t .
Re-estimating the model with another week of data has γ T|T down to −0.031 but the SD is little changed at 0.027. The estimate of g T is 0.034, giving e R e 4,T ¼ 1:01. Finally estimating using data up to 12 August leaves γ T|T virtually unchanged but, because g T|T is down to 0.011, e R e 4,T is below 1, with a value of 0.91.

Apr 2020
May Jun Jul

Conclusion
New time-series models are able to track the progress of an epidemic by providing nowcasts and forecasts of the daily number of new cases and deaths. Estimates and forecasts of the instantaneous reproduction number R t can be computed as a by-product, using a formula that links it to the estimated growth rate of new cases, based on assumptions made about the serial interval distribution. The availability of the full conditional distribution allows the variability of the estimates to be assessed.
Current methods for tracking R t do not pay due attention to the time-series properties of the data, whereas the method described in this paper is based on time-series techniques that have been shown to be effective in a range of disciplines. The dynamic response depends on a signal-noise ratio that can be estimated from the data rather than being inferred from knowledge about the serial interval of infections. An important element in time-series methodology is diagnostic checking and the fit of the model. We show how diagnostic methods can be applied in the context of epidemics and in doing so we raise questions about some of the assumptions,  Figure 6. Filtered estimates of the growth rate, g t , and slope, γ t , for confirmed cases in Florida.
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20210179 explicit or implicit, that are currently made in the estimation of R t . The ability of the model to track spikes and waves is illustrated with COVID-19 data from Germany and Florida. We stress again that computing R t is a by-product of our approach. Information on R 0 could be used at the start of an epidemic, but with a dynamic time-series model its impact soon wears off. After that, calculations involving R t play no part in nowcasting and forecasting daily cases and deaths.
Data accessibility. Data employed in this study are publicly available.
The data for reproducing figure 1 in the paper were downloaded from https://coronavirus.data.gov.uk/. The data are provided as electronic supplementary material.
Since R t , like γ t , is a random variable this is not, strictly speaking, a confidence interval. In a fully Bayesian framework, it would be called a credible interval. 10 Data on Florida are sourced from: https://covidtracking.com/data. Appendix A. Cori et al. [5] method for estimating R t Following Cori et al. [5], Thompson et al. [6] generate an estimate of the current level of new cases, y t , that combines the estimate of R t with an estimate, L tjtÀ1 , of the previous level based on the sum of new cases in the previous time period weighted by the infectivity function, or infectious profile after infection, f j , j = 0, 1, 2, …. This estimate can be written L tjtÀ1 ¼ P t j¼1 f j y tÀj , where P t j¼t f j ¼ 1 with f j describing the serial distribution. This distribution is based on prior knowledge, such as data collected from household studies in the early phase of an infection. The estimate of R t is obtained by Bayesian methods. Cori et al. [5, appendix 1] assume a Poisson distribution for y t and a (conjugate) gamma prior for R t−1 with parameters a and b. The posterior mean of R t -its nowcast-is then P j i¼1 f j y tÀi ¼ a þ P kÀ1 j¼0 y tÀj b À1 þ P t j¼1 w j y tÀj , ( A 1 Þ while the associated nowcast for the mean of new cases is It is proposed in [6] that a = 1 and b = 5 at the outset so that both the mean, ab, and SD, ab 2 , are set to 5. With no prior information a = 1/b = 0. The numerator in b R Ã k,t provides an estimate of the level at time t based on the last k observations. The value of k reflects a trade-off between response and stability. The choice of equal weights seems to be arbitrary. The weights in the second term reflect the structure implied by the sometimes imperfect knowledge of the distribution of the serial interval. Cori et al. [5, appendix 2] suggest letting the summation in the denominator of equation (A 1) start at j = τ, where τ is the generation interval. Approximating the weights by a simple moving average of length k and assuming no prior information then gives equation (3.1), which is the formula for the instantaneous reproduction number used by RKI. The value of k may be set equal to the length of the serial interval.
In our time-series approach, the lag structure depends solely on the observations, y t , and the properties of the fitted model. The weights in figure 2 are comparable with those used to construct b R Ã k,t in equation (A 1), except that they are multiplicative in the implied estimator of R e t . The negative weights in figure 2 correspond to the weights in the denominator of b R Ã k,t .