Beyond $R_0$: the importance of contact tracing when predicting epidemics

The basic reproductive number, $R_0$, is one of the most common and most commonly misapplied numbers in public health. Nevertheless, estimating $R_0$ for every transmissible pathogen, emerging or endemic, remains a priority for epidemiologists the world over. Although often used to compare outbreaks and forecast pandemic risk, this single number belies the complexity that two different pathogens can exhibit, even when they have the same $R_0$. Here, we show how predicting outbreak size requires both an estimate of $R_0$ and an estimate of the heterogeneity in the number of secondary infections. To facilitate rapid determination of outbreak risk, we propose a reformulation of a classic result from random network theory that relies on contact tracing data to simultaneously determine the first moment ($R_0$) and the higher moments (representing the heterogeneity) in the distribution of secondary infections. Further, we show how this framework is robust in the face of the typically limited amount of data for emerging pathogens. Lastly, we demonstrate that without data on the heterogeneity in secondary infections for emerging pathogens like 2019-nCoV, the predicted outbreak size is highly uncertain, ranging in the case of 2019-nCoV from 5% to 40% of susceptible individuals. Taken together, our work highlights the critical need for contact tracing during emerging infectious disease outbreaks and the need to look beyond $R_0$ when predicting epidemic size.


I. INTRODUCTION
In 1918, individuals infected with influenza typically passed on the virus to between 1 and 2 of their social contacts [1]. The same was true for those infected with Ebola virus during the 2014 outbreak in West Africa [2,3]. Nevertheless, Ebola virus disease infected a tenth of one percent of the number of individuals believed to have been infected by the 1918 influenza virus [4,5]. While improvements in healthcare and public health measures, as well as changes in human behavior, partially explain the massive discrepancy between Ebola virus disease in 2014 and influenza in 1918 [6], there is another critical difference between these two diseases: heterogeneity in the number of secondary cases resulting from a single infected individual. Here, we demonstrate analytically that quantifying the variability in the number of secondary infections via contact tracing is critically important for quantifying the transmission risk of novel pathogens. We further show how a lack of publicly available contact tracing data on cases of novel coronavirus (2019-nCoV) prevents the global public health community from determining the true pandemic risk of this novel virus.
The basic reproduction number of an epidemic, $R_0$, is the expected number of secondary cases [7], or infections, produced by a primary case over the course of their infectious period in a completely susceptible population [8]. It is a simple metric that is commonly used to describe the transmissibility of emerging and endemic pathogens [9]. If $R_0 = 2$, one case turns to two, on average, and two turn to four as the epidemic grows. And if $R_0 < 1$, the epidemic will die out. However, we are seldom concerned with epidemics that emerge and quickly die out. To observe an epidemic requires some level of sustained transmission, i.e., $R_0 > 1$. Almost 100 years ago, pioneering work from Kermack and McKendrick [10][11][12] first demonstrated how to estimate the final size of an epidemic for a pathogen with $R_0 > 1$. Specifically, they consider a scenario such that:
1. the disease results in complete immunity or death,
2. all individuals are equally susceptible,
3. the disease is transmitted in a closed population,
4. contacts occur according to the law of mass-action,
5. and the population is large enough to justify a deterministic analysis.
Under these assumptions, Kermack and McKendrick show that an epidemic with a given $R_0$ will infect a fixed fraction $R(\infty)$ of the susceptible population by solving
$$R(\infty) = 1 - \exp\left[-R_0 R(\infty)\right]. \tag{1}$$
This solution describes a final outbreak size equal to 0 when $R_0 \le 1$ and increasing roughly as $1 - \exp(-R_0)$ when $R_0 > 1$. Therefore, a larger $R_0$ leads to a larger outbreak, which infects the entire population in the limit $R_0 \to \infty$. This direct relationship between $R_0$ and the final epidemic size is at the core of the conventional wisdom that a larger $R_0$ will cause a larger outbreak and that small variations in $R_0$ can lead to vastly different total case counts. Unfortunately, the equation relating $R_0$ to final outbreak size from Kermack and McKendrick is only valid when all the above assumptions hold, which in practice is rarely the case.
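The Kermack and McKendrick final-size relation has no closed-form solution, but it can be solved numerically. The following is a minimal sketch, assuming simple fixed-point iteration on $R(\infty) = 1 - \exp[-R_0 R(\infty)]$; the function name and tolerance are illustrative choices, not part of the original analysis.

```python
import math


def final_size_mass_action(r0, tol=1e-12, max_iter=10_000):
    """Solve R = 1 - exp(-r0 * R) for the largest root by fixed-point iteration."""
    if r0 <= 1.0:
        return 0.0  # at or below the threshold only the trivial root R = 0 exists
    size = 1.0  # starting from 1 makes the iterates decrease towards the nonzero root
    for _ in range(max_iter):
        new_size = 1.0 - math.exp(-r0 * size)
        if abs(new_size - size) < tol:
            return new_size
        size = new_size
    return size


for r0 in (0.9, 1.5, 2.0, 3.0):
    print(f"R0 = {r0:.1f} -> final size = {final_size_mass_action(r0):.3f}")
```

Under these mass-action assumptions the final size depends on $R_0$ alone, which is the property the network analysis relaxes.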
In fact, seemingly trivial violations of the above assumptions can lead to vastly different outbreak sizes even when $R_0$ is held constant. As a result, relying on $R_0$ alone is often misleading when comparing different pathogens or outbreaks of the same pathogen in different settings [13][14][15]. This is especially critical considering that most outbreaks are not shaped by the "average" individuals but rather by a minority of superspreading events [13,16]. For example, public health officials are currently trying to determine whether 2019-nCoV can be contained or whether it will cause a pandemic [17]. Despite numerous estimates of 2019-nCoV's $R_0$ (ranging from 1.5 to just under 4 [18][19][20][21][22][23]), as we discussed above in the comparison of Ebola virus disease and 1918 influenza, whether 2019-nCoV can be contained, like SARS, or will cause a pandemic, like 2009 H1N1, depends largely on the heterogeneity in the number of secondary infections. Indeed, recent work modeling the effectiveness of contact tracing and isolation on preventing 2019-nCoV concluded that the probability of containing the outbreak was significantly lower if the heterogeneity in secondary infections of 2019-nCoV was low, like influenza, as compared to higher, like SARS [24]. To more fully quantify how heterogeneity in the number of secondary infections affects outbreak size, we turn towards network epidemiology and derive an equation for the total number of infected individuals using all moments of the distribution of secondary infections.

II. ANALYSIS FROM NETWORK THEORY
Random network theory allows us to relax some of the assumptions made by Kermack and McKendrick, mainly to account for heterogeneity and stochasticity in the number of secondary infections caused by a given individual.
We first follow the analysis of Ref. [25] and define $G_0(x)$ as the probability generating function (PGF) of the degree distribution $\{p_k\}$ (number of contacts), i.e.
$$G_0(x) = \sum_{k=0}^{\infty} p_k x^k. \tag{2}$$
When following a random edge, we define the excess degree as the number of other edges around that node. Because an edge is $k$ times more likely to reach a node of degree $k$ than a node of degree 1, the excess degree distribution is generated by
$$G_1(x) = \frac{1}{\langle k \rangle} \sum_{k=1}^{\infty} k\, p_k x^{k-1} = \frac{G_0'(x)}{\langle k \rangle}, \tag{3}$$
where $\langle k \rangle$ is the average degree and acts as a normalisation constant, and $G_0'(x)$ denotes the derivative of $G_0(x)$ with respect to $x$. We now assume that the network in question is the network of all edges that would transmit a disease if given the chance.
Consequently, $G_1(x)$ generates the number of secondary infections that individual nodes would cause if infected. And, if we infect a random node as patient zero, its entire connected component (a maximal subset of nodes in which a path exists between every pair of nodes) will be infected. To calculate the largest possible epidemic, we thus look for the size of the giant connected component (GCC).
To calculate the size of the GCC, we first look for the probability $u$ that following a random edge leads to a node not part of the GCC. For that node to not be part of the GCC, all of its excess edges must also not lead to the GCC. This simple observation leads to the self-consistent equation
$$u = G_1(u). \tag{4}$$
The size of the GCC is a fraction of the full population $N$ that we will denote $R(\infty)$ because it corresponds to the potential, macroscopic, outbreak size. Noting that a node of degree $k$ is not in the GCC with probability $u^k$, $R(\infty)$ can be written as
$$R(\infty) = \sum_{k} p_k \left(1 - u^k\right) = 1 - G_0(u), \tag{5}$$
because it contains all nodes except those with no edges leading to the GCC. This solution is exact in the limit of large population size.
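The self-consistent calculation above can be sketched numerically. The example below assumes a finite degree distribution given as a list `p` with `p[k]` the probability of degree `k`, and solves $u = G_1(u)$ by iterating from $u = 0$; the function name and the example distribution are hypothetical and chosen only for illustration.

```python
def giant_component_size(p, tol=1e-12, max_iter=100_000):
    """Return (u, size) for a degree distribution p, where p[k] = P(degree = k)."""
    mean_degree = sum(k * pk for k, pk in enumerate(p))

    def g0(x):
        # PGF of the degree distribution
        return sum(pk * x**k for k, pk in enumerate(p))

    def g1(x):
        # PGF of the excess degree distribution, G0'(x) / <k>
        return sum(k * pk * x ** (k - 1) for k, pk in enumerate(p) if k > 0) / mean_degree

    u = 0.0  # iterating upwards from 0 converges to the smallest root of u = G1(u)
    for _ in range(max_iter):
        new_u = g1(u)
        if abs(new_u - u) < tol:
            u = new_u
            break
        u = new_u
    return u, 1.0 - g0(u)


# Hypothetical degree distribution over degrees 0, 1, 2 and 3.
u, size = giant_component_size([0.1, 0.4, 0.3, 0.2])
print(f"u = {u:.4f}, giant component fraction = {size:.4f}")
```

For this small example the fixed point can be checked by hand: $u = G_1(u)$ reduces to the quadratic $0.6u^2 - u + 0.4 = 0$, whose smallest root is $u = 2/3$.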

III. REINTERPRETATION AND RESULTS
The network approach naturally accounts for heterogeneity, meaning that some individuals will cause more infections than others. The network approach also accounts for stochasticity explicitly: even with $R_0 > 1$, there is a probability $1 - R(\infty)$ that patient zero lies outside of the giant outbreak and therefore only leads to a small outbreak that does not invade the population.
However, the analysis in terms of PGFs is obviously more involved than simply assuming mass-action mixing and solving Eq. (1). In fact, the PGF $G_0(x)$ requires a full distribution of secondary cases per primary case, which will in practice involve a polynomial of high order.
To clarify and potentially simplify the approach, we propose to reformulate the classic network model in terms of the cumulant generating function (CGF) of secondary cases. The CGF $K(y)$ of a random variable $X$ can be written as $K(y) = \sum_{n=1}^{\infty} \kappa_n y^n / n!$, where $\kappa_n$ are the cumulants of the distribution of secondary infections. These are useful because the cumulants are easier to interpret, i.e., $\kappa_1$ is simply the average number of secondary cases $R_0$, $\kappa_2$ is the underlying variance, $\kappa_3$ is related to the skewness and $\kappa_4$ to the kurtosis of the full distribution. By definition, a PGF $G(x)$ of a random variable is linked to $K(y)$ through $G(x) = \exp[K(\ln x)]$. Therefore, we can replace the PGF $G_1(x)$ for the distribution of secondary infections by a function in terms of the cumulants of that distribution.
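To illustrate how the first few cumulants might be read off contact tracing data, here is a minimal sketch that assumes simple plug-in estimates from central moments (not bias-corrected estimators). The function name and the list of traced secondary case counts are hypothetical.

```python
def sample_cumulants(secondary_cases):
    """Return plug-in estimates (kappa_1, kappa_2, kappa_3, kappa_4) from case counts."""
    n = len(secondary_cases)
    mean = sum(secondary_cases) / n

    def central_moment(order):
        return sum((c - mean) ** order for c in secondary_cases) / n

    m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)
    # In terms of central moments: kappa_2 = m2, kappa_3 = m3, kappa_4 = m4 - 3 * m2^2.
    return mean, m2, m3, m4 - 3.0 * m2**2


# Hypothetical data: number of secondary infections traced to each of ten cases.
traced_counts = [0, 0, 0, 1, 1, 1, 2, 2, 3, 10]
k1, k2, k3, k4 = sample_cumulants(traced_counts)
print(f"kappa_1 (R0) = {k1:.2f}, kappa_2 = {k2:.2f}, kappa_3 = {k3:.2f}, kappa_4 = {k4:.2f}")
```

With few traced cases the higher cumulants are dominated by rare large counts, so estimates of this kind should be read with caution.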
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. medRxiv preprint doi: https://doi.org/10.1101/2020.02.10.20021725; this version posted February 12, 2020.

A. Analysis of cumulants and derivation of Kermack-McKendrick
Using the cumulants $\kappa_n$, we can re-write $G_1(x)$ as
$$G_1^C(x) = \exp\left[\sum_{n=1}^{\infty} \frac{\kappa_n (\ln x)^n}{n!}\right], \tag{6}$$
which is obviously not as practical as a simple polynomial when solving for $u = G_1^C(u)$. However, in some cases, cumulants can have significant advantages over the polynomial representation. For instance, the cumulants of a sum of independent random variables are simply the sum of the cumulants, which is convenient when dealing with diseases with multiple modes of transmission.
We can quickly derive Kermack and McKendrick's result from the previous generating function since their solution involves a well-mixed population, which is equivalent to a Poisson distribution of secondary infections in our framework. $G_1^C(x)$ is convenient for a Poisson distribution because its cumulants are $\kappa_n = R_0$ for all $n$. Moreover, since $G_0(x) = G_1(x)$ in the Poisson case, we know that the final proportion of susceptible individuals is directly given by $u = G_1^C(u)$, or
$$u = \exp\left[R_0 \sum_{n=1}^{\infty} \frac{(\ln u)^n}{n!}\right] = \exp\left[R_0 \left(e^{\ln u} - 1\right)\right]. \tag{7}$$
Simplifying $e^{\ln u} = u$ in this last equation and substituting $u = 1 - R(\infty)$ yields Kermack and McKendrick's formula, Eq. (1). For more general distributions, it is useful to rewrite Eq. (6) as
$$G_1^C(x) = \exp\left[\kappa_1 \ln x + \frac{\kappa_2}{2} (\ln x)^2 + \frac{\kappa_3}{6} (\ln x)^3 + \frac{\kappa_4}{24} (\ln x)^4 + \dots\right]. \tag{8}$$
The solution to $u = G_1^C(u)$ gives the probability that every infection caused by patient zero fails to generate an epidemic. Importantly, Eq. (8) has an alternating nature because $\ln x$ is negative for $x < 1$, such that its $n$-th power is positive when $n$ is even and negative when $n$ is odd.
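The Poisson special case can be checked numerically. The sketch below assumes the cumulant series is truncated at 30 terms (an arbitrary illustrative cutoff) and compares it with the closed form $\exp[R_0 (x - 1)]$; the function name and the values of $R_0$ and $x$ are hypothetical.

```python
import math


def g1_from_cumulants(x, cumulants):
    """Evaluate exp(sum_n kappa_n (ln x)^n / n!) for cumulants = [kappa_1, kappa_2, ...]."""
    log_x = math.log(x)
    exponent = sum(
        kappa * log_x**n / math.factorial(n)
        for n, kappa in enumerate(cumulants, start=1)
    )
    return math.exp(exponent)


r0 = 2.0
poisson_cumulants = [r0] * 30  # every Poisson cumulant equals R0; series truncated at 30 terms
x = 0.4

series_value = g1_from_cumulants(x, poisson_cumulants)
closed_form = math.exp(r0 * (x - 1.0))
print(f"cumulant series: {series_value:.6f}, closed form: {closed_form:.6f}")

# Signs of the first four terms in the exponent for x < 1: odd orders lower it, even orders raise it.
signs = [math.copysign(1.0, math.log(x) ** n) for n in range(1, 5)]
print(signs)
```

The alternating signs printed at the end are the numerical counterpart of the statement that odd and even cumulants pull the extinction probability in opposite directions.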
This observation, that the terms of Eq. (8) alternate in sign, can be interpreted as follows. A disease needs a high average number of secondary infections (high $\kappa_1 = R_0$) to spread, but given that average, a disease with small variance in secondary infections will spread much more reliably and be less likely to stochastically die out (see Fig. 1). Given a variance, a disease with high skewness (i.e., with positive deviations contributing to most of the variance) will be more stable than a disease with negative skewness (i.e., with most deviations being towards small numbers of secondary infections). Given a skewness, a disease will be more stable if it has frequent small positive deviations rather than infrequent large deviations, hence a smaller kurtosis, as stochastic die-out could easily occur before any of those large infrequent deviations occur.
Clearly, our re-interpretation already highlights a striking result: higher moments of the distribution of secondary cases can lead a disease with a lower $R_0$ to more easily invade a population and to reach a larger final outbreak size than a disease with a higher $R_0$. We can investigate this conclusion further using a simple example of normally distributed secondary infections.

B. Normal distributions and the impact of variance
A second useful application of the cumulants formulation involves diseases with a large reproductive number $R_0$ whose distribution of secondary infections can be convincingly modeled by a normal distribution. The raw moments of a normal distribution are quite complicated, but the cumulants are simple: $\kappa_1$ is equal to the mean $R_0$, $\kappa_2$ is equal to the variance $\sigma^2$, and all other cumulants are 0. We can thus write
$$G_1^C(x) = \exp\left[R_0 \ln x + \frac{\sigma^2}{2} (\ln x)^2\right], \tag{9}$$
and solving for $u = G_1^C(u)$ yields
$$u = \exp\left[\frac{2 (1 - R_0)}{\sigma^2}\right]. \tag{10}$$
This equation can then be used for direct comparison of the probability of invasion of two different diseases with normal distributions of secondary infections. Given a transmission event from patient zero to a susceptible individual, disease B will be more likely to invade the population than disease A if
$$\frac{R_0^{B} - 1}{\sigma_B^2} > \frac{R_0^{A} - 1}{\sigma_A^2}. \tag{11}$$
For example, a disease with half the basic reproductive number of another will still be more likely to invade a population and lead to a larger outbreak if its variance is only slightly less than half the variance of the other disease.
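The comparison of two diseases with normally distributed secondary infections can be made concrete with a short sketch. It assumes the nontrivial root of $u = G_1^C(u)$ for the normal case, $u = \exp[2(1 - R_0)/\sigma^2]$, which follows from dividing $\ln u = R_0 \ln u + \sigma^2 (\ln u)^2 / 2$ by $\ln u \neq 0$. The function name and the parameter values for diseases A and B are hypothetical.

```python
import math


def extinction_probability_normal(r0, variance):
    """Nontrivial root u = exp(2 (1 - R0) / sigma^2) of u = G1C(u) when R0 > 1."""
    if r0 <= 1.0:
        return 1.0  # no nontrivial root: transmission chains die out
    return math.exp(2.0 * (1.0 - r0) / variance)


# Hypothetical diseases: B has half the R0 of A but a much smaller variance.
disease_a = {"r0": 6.0, "variance": 9.0}
disease_b = {"r0": 3.0, "variance": 3.0}

u_a = extinction_probability_normal(**disease_a)
u_b = extinction_probability_normal(**disease_b)
print(f"u_A = {u_a:.4f}, u_B = {u_b:.4f}")
print("B more likely to invade than A:", u_b < u_a)
```

In this illustration disease B invades more easily despite its lower $R_0$ because $(R_0^B - 1)/\sigma_B^2 = 2/3$ exceeds $(R_0^A - 1)/\sigma_A^2 = 5/9$.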
FIG. 2. Final size of outbreaks with different $R_0$ and distributions of secondary cases. We use a negative binomial distribution of secondary cases and scan a realistic range of parameters. The range of parameters corresponding to the current outbreak of 2019-nCoV is highlighted by a red box. The range of potential $R_0$ comes from a 95% confidence interval using early data and classic deterministic models [18,19]. The range of the dispersion parameter $k$ comes from analogy with severe acute respiratory syndrome [13,24]. Importantly, for a fixed average, a smaller dispersion parameter corresponds to a larger variance of the underlying distribution of secondary cases.
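A calculation of the kind summarized in Fig. 2 can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure behind the figure: secondary cases are taken to follow a negative binomial distribution with mean $R_0$ and dispersion $k$, whose PGF is $(1 + R_0 (1 - x)/k)^{-k}$, and patient zero is assumed to behave like any other case, so that $G_0 = G_1$. The function name and parameter values are hypothetical.

```python
def negative_binomial_final_size(r0, k, tol=1e-12, max_iter=1_000_000):
    """Final outbreak fraction 1 - G0(u), assuming G0 = G1 is a negative binomial PGF."""

    def pgf(x):
        # PGF of a negative binomial with mean r0 and dispersion parameter k
        return (1.0 + r0 * (1.0 - x) / k) ** (-k)

    u = 0.0  # iterating upwards from 0 converges to the smallest root of u = G1(u)
    for _ in range(max_iter):
        new_u = pgf(u)
        if abs(new_u - u) < tol:
            u = new_u
            break
        u = new_u
    return 1.0 - pgf(u)


# Hypothetical scan: same R0, strong heterogeneity (small k) versus weak heterogeneity (large k).
for k in (0.16, 1.0, 10.0):
    size = negative_binomial_final_size(2.0, k)
    print(f"R0 = 2.0, k = {k:>5} -> final size = {size:.3f}")
```

Holding $R_0$ fixed, smaller values of $k$ (larger variance) give smaller final sizes in this sketch, which is the qualitative dependence on heterogeneity that the figure is meant to convey.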

IV. DISCUSSION
From re-emerging pathogens like yellow fever and measles to emerging threats like MERS-CoV and Ebola, the World Health Organization monitored 119 different infectious disease outbreaks in 2019 alone [26]. For each of these outbreaks, predicting both the epidemic potential and the most likely number of cases is critically important for responding efficiently and effectively. This need for rapid situational awareness is why $R_0$ is so widely used in public health. However, our main analysis, and Eq. (11) in particular, show that not only is $R_0$ insufficient to fully determine the final size of an outbreak, but a larger outbreak with a lower $R_0$ is relatively easy to obtain considering the randomness associated with most transmission events and the heterogeneity of physical contacts. To address the need for rapid quantification of risk, while acknowledging the shortcomings of $R_0$, we use network science methods to derive both the probability of an epidemic and its final size.
However, these results are not without important caveats. Specifically, we must remember that distributions of secondary cases, just like $R_0$ itself, are just as much a product of a pathogen as of the population in which it spreads. For example, aspects of the social contact network [27], metapopulation structure [28,29], mobility [30,31], adaptive behavior [32,33], higher-order contact structure [34,35], and even other pathogens [36,37], all interact to cause complex patterns of disease emergence, spread, and persistence. Therefore, great care must be taken when using any of these tools to compare outbreaks or to inform current events with past data.
Three types of data could potentially be used in real time to improve predictions by considering secondary case heterogeneity. First, contact tracing data are collected to identify people who may have come into contact with an infectious individual. While mostly a preventive measure to identify cases before complications, contact tracing directly informs us about potential secondary cases caused by a single individual, and therefore provides us with an estimate for $G_1(x)$. Both for generating accurate predictions of epidemic risk and for controlling the outbreak, it is vital to begin contact tracing before numerous transmission chains become widely distributed across space [38,39]. Second, viral genome sequences provide information on both the timing of the outbreak [40][41][42] and the structure of secondary cases [43][44][45]. For example, methods exist to reconstruct transmission trees for sampled sequences using simple mutational models to construct a likelihood for a specific transmission tree [46,47] and translate coalescent rates into key epidemiological parameters [48][49][50]. Despite the potential for genome sequencing to revolutionize outbreak response, the global public health community still struggles to coordinate data sharing across international borders, between academic researchers, and with private companies [51][52][53].
Third, and most often the first real-time information available on novel pathogens, are data related to similar past outbreaks. For example, in Fig. 2 we make a range of predictions for the final size of 2019-nCoV in Wuhan based on $R_0$ estimates from early cases [18,19] and on the underlying distribution of secondary cases by analogy with severe acute respiratory syndrome in a similar population [13]. Based on this uncertainty, we obtain a range of final outbreak sizes (given as a fraction of the total susceptible population) between 5% and 40%. Critically, this large range stems from uncertainty in the heterogeneity of secondary infections. If the heterogeneity is large, such that sustained transmission is mostly maintained by so-called "super-spreading" events, then the outbreak is more likely to end stochastically, less likely to spread extensively, and easier to manage with contact tracing, screening and infection control [24]. With less heterogeneity, the outbreak almost certainly cannot be contained and we must prepare for a pandemic of 2019-nCoV [54,55]. Clearly, we are in dire need of contact tracing data and/or high-resolution pathogen genome surveillance for this outbreak.
In conclusion, we reiterate that when accounting for the full distribution of secondary cases caused by an infected individual, there is no direct relationship between $R_0$ and the size of an outbreak. We also stress that neither $R_0$ nor the full secondary case distribution is a property of the disease itself; both are instead set by properties of the pathogen, the host population and the context of the outbreak. Nevertheless, we provide a straightforward methodology for translating estimates of transmission heterogeneity into epidemic forecasts. Altogether, predicting outbreak size based on early data is an incredibly complex challenge but one that is increasingly within reach due to new mathematical analyses and faster communication of public health data.