Survival dynamical systems: individual-level survival analysis from population-level epidemic models

In this paper, we show that solutions to ordinary differential equations describing the large-population limits of Markovian stochastic epidemic models can be interpreted as survival or cumulative hazard functions when analysing data on individuals sampled from the population. We refer to the individual-level survival and hazard functions derived from population-level equations as a survival dynamical system (SDS). To illustrate how population-level dynamics imply probability laws for individual-level infection and recovery times that can be used for statistical inference, we show numerical examples based on synthetic data. In these examples, we show that an SDS analysis compares favourably with a complete-data maximum-likelihood analysis. Finally, we use the SDS approach to analyse data from a 2009 influenza A(H1N1) outbreak at Washington State University.


Introduction
Despite their ubiquity in modern epidemiology, mathematical models of epidemics suffer many theoretical and practical drawbacks. Due to the need for mathematical tractability, such models often ignore important characteristics of disease transmission patterns and the underlying populations. This often leads to poor predictions. During the SARS epidemic of 2002-2003, the number of cases in China was predicted to reach 30 000 during the first four months of the epidemic. In fact, there were fewer than 800 cases reported during that time [1]. A more recent example is the Centers for Disease Control and Prevention (CDC) prediction of the 1 400 000 cases of Ebola in West Africa during 2013-2016 outbreak [2,3]. Although the CDC team did indicate that their prediction was the 'worst-case scenario', the inaccuracy of this upper bound prediction has highlighted the need for better mathematical models of epidemics and their control.
A typical challenge in the problem of epidemic control is how to relate the global, population-level dynamics of infection transmission to local, individuallevel intervention (e.g. vaccination). This dichotomy is reflected in two distinct approaches to modelling epidemiological processes. Agent-based models capture individual-level histories of infection and removal. By contrast, ecological models look at the population at an aggregate level, keeping track of summary statistics such as the counts of susceptible, infected and recovered/removed individuals. Although both agent-based and ecological models are routinely used in practice and in the literature, the two scales of analysis are almost always considered separately [4].
The Kermack-McKendrick model [5] is the most fundamental example of an ecological model. It assumes the population is segregated into susceptible (S), infected (I) and recovered/removed (R) compartments. The time evolutions of the population proportions in compartments (denoted by S t , I t and R t ) are described by the following well-known system of ordinary differential equations (ODEs): _ I t ¼ bS t I t À gI t and _ R t ¼ gI t : 9 > > > = > > > ; (1:1) Here, β and γ are the infection and recovery rates, respectively. Solutions to equation (1.1) are often called the susceptible-infected-recovered (SIR) curves (figure 1). The law of mass action has been implicitly assumed, so any infectious individual can infect any susceptible individual. The ODEs model in equation (1.1) averages out individual dynamics, so it does not capture the stochastic fluctuation of epidemic processes in real life. In particular, the practical problems of applying equation (1.1) to data are: 1. Population size. Since the quantities in the SIR equations are proportions, it is not immediately clear how to apply them to real epidemics, which occur in finite susceptible populations. Moreover, the size of the population is often unknown. 2. Likelihood. Since the SIR equations are deterministic, we cannot write a likelihood for epidemic data without further, often ad hoc, statistical assumptions about the form of the likelihood function. 3. Aggregation over individuals. The SIR model represents the mean-field equations for (scaled) population counts, aggregating out individual characteristics. 4. Aggregation over time. The real data are typically aggregated not just over the population but also over observed time periods, leading to interval censoring 1 that cannot be easily incorporated into the SIR equations.
In this paper, we show that simple algebraic manipulation of the SIR equation (1.1) uncovers a precise probability law for the individual transitions between compartments. We refer to this interpretation of the solutions of equation (1.1) as a survival dynamical system (SDS). This new interpretation allows us to apply tools from survival analysis to population-level epidemic data. It directly addresses the first two problems listed above, and it lays a theoretical foundation for addressing the latter two problems. We focus on Markovian mass-action SIR models in this paper, but the SDS approach generalizes to non-Markov and networkbased epidemic models.
The rest of the paper is structured as follows. First, we briefly review the relevant background on mathematical modelling in epidemiological literature. In §2, we make the SDS interpretation of the SIR equation (1.1) precise. In §3, we show how this approach can be used for statistical inference and compare the performance of estimators based on SDS likelihoods to those based on standard complete-data likelihoods. In §4, we use an SDS likelihood to analyse 2009 influenza A(H1N1) outbreak data from Washington State University. Finally, we conclude the paper with a brief discussion in §5. Additional mathematical preliminaries, statistical inference results and other material are provided in the appendices. A list of symbols used in the paper is provided in   royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 such that S i (t) = 1 if he or she is in the susceptible compartment at time t and S i (t) = 0 otherwise. Similarly, define the processes I i for the infected compartment and R i for the recovered compartment. Naturally, S i (t) + I i (t) + R i (t) = 1. For time T ∈ (0, ∞), we assume that the process {(S i (t), where Y 1 , Y 2 , …, Y n+m and Z 1 , Z 2 , …, Z n+m are independent unit-rate Poisson processes. Models of this form are often called agent-based models in the literature [9,10]. An intuitive explanation behind the random time change represetation in equation (1.2) is as follows: consider individual i who is initially susceptible. He or she will change status from susceptible to infected as soon as one of the infected individuals make an infectious contact. Because infected individuals make infectious contacts independently, the amount of time the i-th individual will remain susceptible has an exponential distribution with rate n À1 b P nþm j¼1 I j . Once infected, he/she cannot be infected again. Therefore, the jump of the local process S i from 1 to 0 can be equivalently described by the jump of the process Y i ( Ð t 0 n À1 bS i (s) P nþm j¼1 I j (s) ds), where Y i is a unit-rate Poisson process. Note that when the local process S i jumps from 1 to 0, the process I i also jumps from 0 to 1. When i is in infected status, he/she will recover after an exponentially distributed amount of time with rate γ. Therefore, the jump of the local process I i from 1 to 0 can be equivalently described by the jump of Z i ( Ð t 0 gI i (s) ds), where Z i is a unitrate Poisson process. Similar arguments give the equation for the local process R i . The random time change representation in equation (1.2) for the entire ensemble {(S i (t), I i (t), R i (t))} i=1, …,n+m; t∈[0,T] follows from these considerations.
An equivalent construction of the agent-based model in equation (1.2) was proposed by Sellke [11]. Let T i,I denote the amount of time i remains susceptible, provided he or she was susceptible initially. Given the history of the infection process I(s) ¼ P mþn j¼1 I j (s) up to time t, the conditional probability that individual i remains susceptible until time t is given by Therefore, to each susceptible individual i, we can assign an independent EXPONENTIAL(1) random variable Q i and change his/her status from susceptible to infected when Once a susceptible individual gets infected, he or she recovers after an infectious period that follows an exponential distribution with rate γ. If we denote the recovery time of the i-th individual by T i,R , it follows immediately from equation (1.2) that T i,R − T i,I and T i,I are independent and T i,R − T i,I has an exponential distribution with rate γ. Symbolically, T i,R À T i,I ? T i,I and T i,R À T i,I EXPONENTIAL(g): (1:4) The fate of an individual is entirely described by the statistical distributions given in equations (1.3) and (1.4). The Sellke construction can also be derived using a statistical representation of agent-based models under the law of mass action based on contact intervals [12,13]. In this case, the contact interval distribution is EXPONENTIAL(β). These considerations lead to algorithm 1.1 for simulating the process in equation (1.2), which is known as the Sellke construction [7,14,15]. It can be easily verified that algorithm 1.1 is equivalent to simulating the system in equation (1.2).

Population level: ecological susceptible-infectedrecovered model
The simplest way to derive an ecological model from the agent-based model in equation (1.2) is via lumping or aggregation of states. When the aggregation of states is strongly lumpable [16,17] (also see appendix A), the resulting aggregated process remains Markovian for any choice of the initial distribution. For the SIR process, let X :¼ {S, I, R} denote the state space of each individual. Then, X nþm is the state space of the ensemble of individual-based S i , I i , R i processes. Define the macro-level processes I i (t) and R(t) ¼ . Partition X nþm into X 1 , X 2 , . . . , X L such that any two states in each X l produce the same counts for S(t), I(t), R(t), for l = 1, 2, …, L. It is easy to see that the Markov chain described in equation (1.2) is (strongly) lumpable with respect to the partition {X 1 , X 2 , . . . , X L } (see [10,17,18]). That is, the lumped process (S, I, R) is also Markovian for any choice of the initial distribution. Therefore, we can write where Y and Z are independent unit-rate Poisson processes. This system can be simulated using the Doob-Gillespie algorithm (see algorithm B.1 in appendix B). This ecological model is convenient in that it is amenable to asymptotic analysis. Indeed, for very large populations, we can approximate the scaled stochastic SIR dynamics by a system of ODEs [19,20]. This is sometimes called mean-field or fluid limit of the Markov jump process. For our SIR system in equation (1.6), the scaled process (S n , I n , R n ) := (S/n, I/n, R/n) satisfies   (1:7) By virtue of the Poisson law of large numbers (LLN) [6], which asserts that n −1 V(nt) ≈ t for a unit-rate Poisson process V when n is large, the processes in equation (1.7) converge to the solution of the following system of ODEs as n → ∞ and m/n → ρ ∈ (0, 1): These are identical to the Kermack-McKendrick ODEs in equation (1.1). The introduction of ρ is convenient because it sets s 0 = 1, i 0 ¼ r and r 0 = 0. The rate of convergence to this LLN ODEs limit can be computed using sample path large deviations principle (LDP) of the Markov process in equation (1.7). Standard tools from [21][22][23] as well as related results from [24][25][26] can be borrowed for this purpose.

Survival dynamical systems
The ODEs in equation (1.8) that describe the large-population limit of the ecological SIR model can be given an agent-based probabilistic interpretation. It is convenient to rewrite equation (1.8) as follows: where R 0 ¼ b=g is the basic reproduction number. Here, the first two equations are obtained by partially solving the ODEs system using the integrating factor (first equation) and variation of parameter (second equation) methods.
In the limit of a large population, the time of infection T I of a randomly chosen susceptible individual has the survival function This is a direct analogue of equation (1.3) where the stochastic quantity n À1 Ð t 0 I(u) du is replaced by its deterministic limit Ð t 0 i s ds may be thought of as the cumulative hazard and bi t as the hazard function of the random variable T I . This hazard is sometimes called the force of infection. In the limit of large n, the units become independent due to the phenomenon known as mean-field independence or propagation of chaos [27][28][29].
Because T I is an improper random variable, its survival, cumulative hazard and hazard functions are also improper. The probability that T I = ∞ equals s ∞ , which is the limiting proportion of individuals who remain susceptible. Setting s ∞ = 1 − τ and τ = r ∞ − ρ where r ∞ is the limiting proportion of recovered individuals, we see that τ must satisfy the deterministic final size equation The final size equation is a contraction map, so it is amenable to numerically efficient fixed-point iteration schemes. Because 0 ≤ τ < 1, we may interpret τ as the probability that T I < ∞. Given that T I < ∞, its conditional survival function iss and its probability density is Let T R be the time of removal of an infected individual who is infected at time T I (with T I < T R ), and let be the infected proportion of the population excluding the remaining initial infecteds. From equations (2.1) and (2.5), we obtain Because f t (u) is a density function, the right-hand side above is a convolution of the conditional density f τ of T I royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 and the (exponential) density of T R − T I , the infectious period. It follows that the right-hand side quantity is itself a density of the variable T R , which is the sum of the independent random variables T I and T R − T I . Note the analogy of this result with equation (1.4). Let us denote the infectious period by the random variable W := T R − T I . These considerations give us algorithm 2.1 for simulating individual histories in the SIR model. See figure 2 for a pictorial representation of the idea.
Analysing timepoints (T I , T R ) according to algorithm 2.1 addresses all four issues of macro SIR model in equation (1.1) described in §1. Algorithm 2.1 no longer requires the population size ( problem 1). Generation of individual trajectories according to algorithm 2.1 allows us to specify a likelihood function ( problem 2), account for differences in individual characteristics ( problem 3), and overcome issues with censoring or interval-based data ( problem 4). Algorithm 2.1 brings us back from ecological to agent-based models and completes a conceptual 'micro-macro-micro' loop. The SDS interpretation has similarities with symbolic dynamical systems [30][31][32].

Parameter estimation
Under the stochastic agent-based SIR model equation (1.2) or its aggregated ecological version in equation (1.6), the vector of parameters of interest is θ = (β, γ, ρ) with m = I(0) = ρn. The parameter τ is expressible in terms of θ via equation (2.3). The size of the initial susceptible population (n) is usually unknown and may be considered a nuisance parameter.
The estimation of this nuisance parameter is often problematic, and popular methods such as profile likelihoods do not always yield good estimates. In order to address this problem, we propose a likelihood based on the SDS interpretation of the SIR model in equation (1.1) that does not require n (although n still may be estimated, see algorithm 4.1 in §4). Before going into the details of SDS likelihood, we describe the exact likelihood based on the Doob-Gillespie algorithm (see algorithm B.1 in appendix B). To emphasize the utility of the SDS likelihood, we compare its performance to an exact likelihood that is given the correct value of n.

Exact (Doob-Gillespie) likelihood
Assume we observe a total of R} denotes the type of event. Of these events, z I are infections and z R are removals. Put X(t) = (S(t), I(t), R(t)). Then, following algorithm B.1, the exact log-likelihood for θ is where the last two integrals may be also written as finite sums. It is important to note that the above likelihood is conditional on the initial value X(0) = (n, ρn, 0), which we assume to be known. From equation (3.1), the maximum-likelihood estimate (MLE) for β and γ can be derived aŝ Because we know the population size n and the trajectory X(t) t∈[0,T ] when using the exact likelihood, the parameter ρ = n −1 I(0) is also known exactly.

Survival dynamical system likelihood
Following the discussion in §2, an approximation of the exact likelihood function ℓ 1 (θ) in equation (3.1) can be obtained from equation (1.3) by replacing the process n −1 I(u) with its asymptotic limit i (as n → ∞) and considering the individual trajectories as independent. Since we let n → ∞, the exact value of the initial size of the susceptible population is no longer needed.
with probability 1 -t Figure 2. SDS derived from SIR equation (1.1). To each individual, we assign random variables T I and T R specifying his/her infection and recovery times, respectively. The laws of T I and T R are given by equations (2.5) and (2.8).
Algorithm 2.1. Pseudocode for simulating a single SDS trajectory. royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 Assume we randomly sample N + M individuals of whom N are initially susceptible and M initially infected. We observe these N + M individuals up to the cut-off time T and record their infection or recovery times. Suppose K out of the N initially susceptible individuals get infected at times t 1 , t 2 , …, t K and L of them recover by time T. Pair each infection time t i with the corresponding duration of infectious period w i if the individual recovers by time T. If the individual does not recover by time T, pair t i with the censored recovery period w i = T − t i . Among the M initially infected individuals, supposeL individuals recover by the cut-off T at times e 1 , e 2 , . . . , eL. Then, following algorithm 2.1, we have the following SDS likelihood: where, as described in §2, and τ = r ∞ − ρ satisfies equation (2.3). In the next section, we evaluate the performance of the SDS likelihood from equation (3.3) in MLE and Markov chain Monte Carlo (MCMC) implementations.

Bayesian estimation using Markov chain Monte Carlo
In order to construct a posterior distribution for θ, we assign gamma priors to the parameters β, γ and ρ: and r GAMMA (a r , b r ): The positive quantities a β , b β , a γ , b γ , a ρ and b ρ are appropriately chosen hyper-parameters. The posterior distribution of θ is obtained by Bayes' rule: it is proportional to the product of the likelihood function given in equation (3.3) and the three priors above. Unfortunately, the posterior distribution of the SDS likelihood cannot be written in closed form. Even if a conditional posterior distribution is obtained, any closedform expression for the probability density function would require solutions s t , i t , r t to equation (2.1), which are themselves functions of θ. Thus, we cannot employ a generic Gibbs sampler method [33,34], and we need a more efficient updating algorithm than the standard Metropolis-Hastings algorithm. Here, we adopt the robust adaptive metropolis (RAM) algorithm [35,36], which adapts the tuning constant and the variance-covariance matrix of the proposal distribution to maintain a consistent acceptance ratio in the Metropolis steps, which helps achieve good mixing of the chain. The variance-covariance matrix is updated during the MCMC iterations. In algorithm B.2, in appendix B, we provide pseudocode for implementing an MCMC procedure for drawing posterior samples using RAM.

Simulation study
The SDS likelihood presented in the previous section has several theoretical advantages. Two of the main advantages are: (a) it does not require knowledge of the number of initially susceptible individuals n and (b) it works with partial data in that it requires trajectories of only a randomly chosen sample of individuals. Nevertheless, the SDS likelihood is based on an LLN approximation of a large population, so it is important to evaluate the accuracy of this approximation. In this section, we compare the accuracy of the inference based on the SDS likelihood (without n) to that of the exact likelihood (with n). Though the comparison is deliberately unfair in that exact value of n and full data trajectories are supplied only to the exact likelihood, our objective is to see how much worse the inferences from the SDS likelihood are due to the approximation error as well as lack of n and full data trajectories. The data used for parameter inference are generated according to algorithm 1.1.
We compare three different inference methods:  (3.4) to infer θ. Because of conjugacy of the gamma priors, the posteriors are also gamma distributions [33]. In particular, they are given by royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 For each of the 18 scenarios, we generate 100 sets of synthetic epidemic data using algorithm 1.1. Each generated dataset has n + n × ρ rows (one for each individual in the epidemic) and two columns (one for T I and one for T R ). To ensure the prior distributions in our Bayesian inference are uninformative, we set a i = i × 0.01 and b i = 0.01 for i = β, γ and ρ. For Method 2, we generate 1000 samples without any burn-in phase or thinning because Monte Carlo simulations are sufficient. For Method 3, we iterated the MCMC procedures 11 000 times. The first 1000 iterations are removed as burn-in. After burn-in, every 10th iteration is stored as a posterior sample. In total, 1000 posterior samples are used for estimation. For the Bayesian methods (i.e. Method 2 and Method 3), we estimate the parameters β, γ and ρ by taking the means of 1000 posterior samples. Figure 3 summarizes the numerical results of the parameter setting θ 1 . Figure 4 shows the results of the parameter setting θ 2 , and figure 5 shows the results of the parameter setting θ 3 . In addition to the parameter estimates ( posterior means), error bars (1.96 s.d.) are also provided. These figures show that Method 3 based on the SDS likelihood fares well against Methods 1 and 2 based on the exact likelihood. Barring minor exceptions, Method 3 yielded accurate estimates for all three parameters β, γ and ρ even for relatively small values of n. The results for n = 10 2 are particularly encouraging. Tables 2 and 3 show that the mean squared error (MSE) decreases with increasing n across all three methods. As expected, the quality of inferences for the large cut-off time settings is better than that for the small cut-off time settings.
Since ρ is assumed known for Methods 1 and 2, it is estimated only in Method 3. Figures 3c, 4c and 5c show that the quality of estimation is sometimes poor when n is small. Note the n = 10 2 case in particular. Nevertheless, it is estimated accurately when n is moderately large.  royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 Further numerical results and explanations are provided in appendix C. Method 3 seems to have a slightly larger variance than the other two methods. Even though visual inspection suggests that Method 3 achieves comparable performance against Method 2 and Method 3, a more objective criterion would be useful. Such a criterion should take into account both the biases and the MSE of the methods. For instance, information criteria such as the focused information criterion [37] can be used for this purpose. However, our intention here is not to find which method performs the best, but rather to find how the approximate SDS likelihood performs against the exact likelihoods. Since figures 3-5 and the additional results in appendix C provide satisfactory evidence in favour of Method 3 and give adequate insight into its performance, we do not perform any further comparative analysis.
Instead, we apply the SDS likelihood to a real dataset in the next section.

Data analysis
In the autumn of 2009, a new strain of influenza spread around the world after its initial outbreak in the state of Veracruz, Mexico in April 2009. The influenza A(H1N1)pdm09 virus was a triple reassortment of bird, swine and human flu viruses further combined with a Eurasian pig influenza virus [38]. Unlike most strains of influenza, this influenza A(H1N1) virus did not disproportionately infect adults older than 60 years, and it spread easily among young, healthy adults. This feature of the virus resulted in multiple outbreaks of the disease on college campuses across the continental USA. An outbreak on the campus of Washington State University (WSU) in  royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 Pullman, Washington began in late August 2009, upon the return of students for the autumn semester. Over a period of slightly more than three months, almost 2300 students were seen at the campus health centre with influenza-like illnesses that were treated as influenza A(H1N1) infections. 2 Figure 6 shows daily counts of new infections for 105 days beginning on 22 August 2009. The counts were obtained directly from the cases of 'influenza-like illness' among students who visited or called the WSU Student Health Services seeking care. In our statistical analysis, the collected daily counts were considered as records of 'new infectives'. This particular dataset is interesting because it was obtained from an approximately closed population. The WSU campus is located in a town with a large student population (around 18 000 students) and a relatively small resident population (around 9000 residents). The location is relatively remote, with an average population density of only eight households per square mile in the surrounding rural areas.
As discussed in an earlier analysis of this dataset [38], these data may have been subject to both over-reporting and under-reporting: Some students may have assumed they had H1N1 when they had other influenza-like illnesses, while some students infected with H1N1 may not have sought medical care. However, such misreporting was considered to be relatively minor compared to the overall counts in the dataset [39]. This dataset was analysed earlier using a stochastic SIR model with parameters estimated using both likelihood-based and least-squares methods. Here, we re-analyse it using the SDS likelihood, emphasizing its multilevel nature by showing how the shape of the epidemic curve reflects changes in risk of infection in students who were susceptible. royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 The density of the infection time (conditional on T I < ∞) is given by f t (t) ¼ À_ s t =t (see equation (2.5)). Consequently, for the collection of n individuals at risk out of which k are seen to be infected at times t 1 < Á Á Á < t k < T where T < ∞ in the observation time horizon (i.e. censoring time), we have the log-likelihood function for infection times where θ = (β, γ, ρ) is the vector of free parameters, with τ being an implicit function of θ according to equation (2.3). Note that the above likelihood is conditional on the number of individuals at risk n, which is also typically unknown, and that the value 0 ≤ k ≤ n is a random variable. In particular, if T is sufficiently large, we have approximately k BINOMIAL(n, t).
Note that this implies in particular that if we do not know the value of n but have observed k, a reasonable estimate of the former is k/τ. In general, to impute a value of n, we could take n NEGBINOM(k, t), the negative binomial distribution. Conditionally on the value of k the (unobserved) recovery likelihood is then the usual log likelihood for the exponential survival model. Assuming r individuals have recovered after infectious periods w 1 < · · · < w r < T, we have where H g (Á) and h g (Á) are, respectively, the survival function and the probability density function of the exponential distribution with rate γ. Averaging the infectious periods used in the previous analysis [38,39], we assume here that the recovery times have an exponential distribution with mean γ −1 = 5.5 days (see also [40,41]), so γ was not estimated. The complete log-likelihood conditional on the population size n, the parameters and observables is then ' 0 (t 1 . . . , t k , w 1 , . . . , w r ju, n) ¼ ' I (t 1 , . . . , t k ju, n) þ ' R (w 1 , . . . , w r ju, k): Based on this SDS likelihood, algorithm 4.1 may be used for obtaining the posterior distributions of the parameters θ and n given the WSU dataset. Table 2. Summary of the numerical results for the longer cut-off times. Here, the values of T are 9 for θ 1 , 6 for θ 2 and 7 for θ 3 such that T is near the end of the epidemic process (also see figure 8). Method 3 yields accurate estimates without requiring knowledge of the size of the susceptible population n. Values in italics indicate the results corresponding to the best performing method.  4: Perform Metropolis-Hastings step for the target conditional distribution of (ujn) using the complete log-likelihood 5: Calculate t based on the current value of u using formula (2.3) 6: Sample the conditional distribution of (nju) by drawing n NEGBINOM(k,t). 7: until convergence royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 The results of applying algorithm 4.1 to the WSU dataset are summarized in table 4 and in figure 7. As in previous sections, independent, non-informative gamma priors were used for θ. The uniform (improper) prior was used for n. The maximum a posteriori estimate (MAP) of the effective population size ( population at risk) was found to be n = 7051. This is much smaller than the value of approximately 18 000 (total WSU student body) assumed in the previous analyses [38,39]. Consequently, the MAP value of R 0 % 1:06 is slightly smaller than that obtained in the previous analysis, and the SDS-based MAP for ρ is substantially larger than other estimates of the initially infected. Contrary to previous analysis [39], these values suggest that the high peak of an epidemic in early days of the academic year was not caused by high infectivity among newly infected students but rather by a high number of already infected individuals (high value of ρ). This point was already made in [38].

Discussion
In this paper, we present a new way of using classical SIRtype epidemic models for statistical inference. Our method addresses all four problems identified in §1. Indeed, parameter estimation based on the SDS likelihood (described in §3) does not require the effective population size n, addressing problem 1. The SDS likelihood, being a direct consequence of the SDS interpretation of the SIR equation (1.1), provides a principled way of specifying a likelihood function from epidemiological field data where the effective population size is unknown but large, addressing problem 2. Although we do not explicitly illustrate this here, the independence of individuals' contributions to the SDS likelihood also addresses the problem of aggregation over individuals Table 3. Summary of the numerical results for the shorter cut-off times. Here, we fix T = 3 so that the epidemic process is near its peak at T (also see figure  8). Method 3 yields accurate estimates without requiring knowledge of the size of the susceptible population n. Values in italics indicate the results corresponding to the best performing method.   ( problem 3) and over time ( problem 4). Moreover, due to its product form, the SDS likelihood method is easier to implement and analyse than methods based on partially observed CTMC (e.g. the Doob-Gillespie likelihood). The SDS method allows a novel approach to the monitoring of epidemics. Instead of longitudinally counting the number of infections, a random sample of individuals can be monitored continuously for changes in their health status. This is akin to a sentinel sensor network. Similar ideas have been routinely explored in communication networks literature in computer science (e.g. network probing and monitoring) [42]. The use of individual-level longitudinal data rather than counts allows much greater flexibility in estimating the effects of covariates (e.g. vaccination status) on infectiousness and susceptibility, and it extends easily to non-Markov transmission models.
Using the SDS likelihood, it typically suffices to have much smaller sample of transition data than other inference methods such as the exact likelihood method. Due to the asymptotic independence of infection and recovery times of individuals (see §2), the SDS likelihood takes a particularly simple form, facilitating a convenient implementation of a suitable MCMC scheme. We have made our code implementation of the SDS likelihood and MCMC scheme publicly available [43].
The SDS framework proposed here can be readily extended to accommodate a wide class of compartmental models with some partial ordering among compartments. The classical SIR model has been chosen here as an important example to illustrate the ideas underpinning SDS likelihoods. Indeed, the machinery developed in the present paper goes beyond compartmental SIR models, and it can be applied to more general epidemic processes as well as to many compartmental models arising in physics and chemistry. In particular, we believe SDS likelihoods can be applied to certain subclasses of chemical reaction network models in which the individual species molecules can be tracked as they undergo chemical reactions.
In many studies of epidemiological field data, the effective population size is assumed to be very large. For instance, a total population size of 10 6 was assumed in [44,45]. Our method is particularly appropriate for such settings. For smaller populations, knowledge of the rate of convergence of the scaled processes to the LLN limit is crucial for assessing the quality of inference based on the SDS likelihood. Therefore, to fully evaluate the appropriateness of the SDS approximation, one should first establish an LDP for the scaled process of interest. This is particularly important for small-scale epidemics. Even though our numerical results are encouraging for values of n as small as 100, quantifying the rate of convergence will be useful. Although we did not consider an LDP in this paper, we believe that the standard techniques [21][22][23]25,26,46] can be applied for this purpose in our context.
Another direction of future investigation will be to consider network-based systems and non-Markovian systems. For many epidemiological scenarios, the mass-action assumption is untenable. Several network-based models have been proposed in recent times [47][48][49]. Asymptotic study of those models in the form of various large-graph limits has also been done [50][51][52]. Therefore, extending our method to network-based models appears to be a natural next step.
Data accessibility. This article has no additional data.  Acknowledgements. The authors acknowledge the anonymous reviewers whose constructive feedback significantly improved the expository quality of the paper. In particular, the reference [37] was pointed out to the authors by one of the reviewers. A large part of this research was conducted during the Mathematical Biosciences Institute (MBI) semester-long programme on modelling infectious diseases in spring 2018. The authors thank MBI and its staff for their hospitality.
Endnotes 1 A random variable is said to be interval-censored when it cannot be observed exactly and is only known to lie within an interval. 2 In fact, as described in [39], for the first 10 days of the outbreak, all suspected cases were tested and laboratory-confirmed to be H1N1, after which all cases were considered H1N1. 3 There is also a notion of weak lumpability in the theory of Markov processes.

A.1. Lumpability of a Markov chain
Consider a CTMC C t on a state space Y :¼ {1, 2, . . . , K} for a finite positive integer K. Given a partition {Y 1 , Y 2 , . . . , Y M } of Y, define the processC t such thatC t :¼ i whenever C t [ Y i , for i = 1, 2, …, M. The original CTMC C t is said to be strongly lumpable with respect to the partition {Y 1 , Y 2 , . . . , Y M } of Y if the processC t is also a CTMC for every choice of initial distribution of C t . The processC t is often called the aggregated or the lumped process. Intuitively, lumpability is the property that disjoint sets of states can be identified by representative states such that the induced stochastic process on the representative states (which we call the aggregated or lumped process) is also Markovian for every choice of initial distribution of the original CTMC. In our individual-level model described in §1.1, the representative states are given by the partition {X 1 , X 2 , . . . , X L } of the state space X nþm . The representative states then correspond to the population counts S(t), I(t), R(t). The (strong) lumpability 3 of a CTMC can also be described in terms of lumpability of a linear system of ODEs. Consider the linear system _ y ¼ yA, where A = ((a i,j )) is a K × K matrix (representing the transition rate or the infinitesimal generator matrix of the corresponding CTMC on state space Y).
Definition A.1 (lumpability of a linear system [10,18]). The linear system _ y ¼ yA is said to be lumpable with respect to a The matrix B is often called a lumping of A. The following is immediate: if B is a lumping of A, then there exists an K × M matrix V such that AV = VB.
Refer to [16,17,53] for further reading and numerous characterizations of Markov chain lumpability. For each parameter setting, we chose two cut-off times: one near the peak of the epidemic and one near the end of the epidemic. The vertical lines in figure 8 indicate the smaller cutoff time for each of the three settings of the parameter values.
Our numerical results are summarized in tables 2 and 3. For the longer cut-off times, table 2 provides a summary of the simulation results for the three parameter settings and different initial numbers of susceptibles n. Here, the values of T are 9 for θ 1 , 6 for θ 2 and 7 for θ 3 . In each case, the epidemic is almost at its end by time T (figure 8). The first three columns show estimates of β from Methods 1, 2 and 3. Similarly, the next three columns show estimates of γ. The last column shows Method 3 estimates of ρ (recall that ρ is known exactly for Method 1 and Method 2). The rows of the table are divided into three parts corresponding the parameter settings θ 1 , θ 2 and θ 3 . Each of the three parts is further subdivided into the three different susceptible population sizes n = 10 2 , 10 3 and 10 3 . Finally, in each cell, we show the average of 100 posterior means and the MSE of parameter estimators. As we can see, Method 3 based on the SDS likelihood yields accurate estimates for all three parameters β, γ and ρ even for relatively small values of n (see the results for n = 10 2 ).
Whereas table 2 considers data collected until a T near the end of an epidemic, table 3 considers data with cut-off T = 3 that is close to the peak of an epidemic. (See figure 8 for a visualization of the SIR curves corresponding to these three parameter settings truncated at T = 3 by a vertical line.) The table formats are identical. Since the inference is based on royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048 heavily truncated data, the MSE in table 3 are higher than  those in table 2. Also, the sharp decrease in MSE with increasing n in table 2 is less pronounced in table 3. Nevertheless, the estimates obtained are still quite accurate. Also, the MSE for Method 3 are slightly better than those of Method 1 or 2. Interestingly, the parameter ρ is accurately estimated by Method 3.
In figures 9 and 10, we show the posterior distributions of the Method 3 estimators of β, γ and β based on the SDS likelihood. To avoid repetition, we show only two posterior plots: figure 9 shows results for the parameter setting θ 1 under the smaller cut-off time, and figure 10 shows results for the parameter setting θ 2 under the larger cut-off time. As shown in tables 3 and 2, the variances of the posterior distributions shrink drastically as we increase n from 10 2 to 10 3 . We do not show the posterior distributions for the n = 10 4 case because it does not provide any additional insights into the quality of the inference procedure except for the fact that the posterior variance further reduces.
Finally, figure 11 shows additional diagnostic statistics for the MCMC implementation of Method 3. We show the thinned trace of a single Markov chain for n = 10 2 and 10 3 . As expected, the chain mixes faster when n = 10 3 than when n = 10 2 because Method 3 is based on an LLN of the scaled Poisson processes keeping track of the population counts. As before, we omit the n = 10 4 case. For completeness, we consider the third parameter setting θ 3 in figure 11. The Markov chains also converge for the other parameter settings (not shown). royalsocietypublishing.org/journal/rsfs Interface Focus 10: 20190048