Efficient Bayesian inference of fully stochastic epidemiological models with applications to COVID-19

Epidemiological forecasts are beset by uncertainties about the underlying epidemiological processes, and the surveillance process through which data are acquired. We present a Bayesian inference methodology that quantifies these uncertainties, for epidemics that are modelled by (possibly) non-stationary, continuous-time, Markov population processes. The efficiency of the method derives from a functional central limit theorem approximation of the likelihood, valid for large populations. We demonstrate the methodology by analysing the early stages of the COVID-19 pandemic in the UK, based on age-structured data for the number of deaths. This includes maximum a posteriori estimates, Markov chain Monte Carlo sampling of the posterior, computation of the model evidence, and the determination of parameter sensitivities via the Fisher information matrix. Our methodology is implemented in PyRoss, an open-source platform for analysis of epidemiological compartment models.


Introduction
Given the uncertainties in the mechanisms of viral transmission, and the difficulties in determining numbers of infections and deaths, a Bayesian approach is natural [9][10][11][12][13][14]. This allows the range of likely outcomes to be quantified and characterized. The evidence in favour of different epidemiological models can also be assessed, in the light of data.
Compartment models are widely used as models of epidemiological dynamics [15][16][17]. Within these models, individuals are grouped into cohorts, for example according to their age or location. The key assumption is that the rates of contact between individuals depend only on their cohorts. The resulting models have sufficient complexity to be useful in forecasting, while remaining simple enough that Bayesian analyses are tractable [10,11,13,18].
Such analyses require three main ingredients: the definition of a model, the prior distributions of the inference parameters and an efficient method for the evaluation of the posterior distribution [15][19][20][21][22][23]. In this work, we derive an approximation to the model likelihood directly from the model definition, via a functional central limit theorem (CLT), similarly to [24][25][26][27][28][29][30]. Hence, for any given model, the approximated likelihood can be derived by a generic and automated procedure. This enables rapid Bayesian fitting of models to data with fully quantified uncertainties, as implemented in the PyRoss package [31]. It also enables sampling from the posterior by Markov chain Monte Carlo (MCMC), and the evaluation of model evidence (also known as marginal likelihood), which enables Bayesian model comparison [32][33][34]. The results presented here build on an earlier technical report [13] which discussed automated fitting of such models to data.
A variety of Bayesian inference approaches are possible in calculations of this type, which make different assumptions (either implicit or explicit) about the role of random fluctuations in the disease propagation and the surveillance of the epidemic. A common approach is to consider a deterministic generative model for the disease, and to treat the data collection (surveillance) as a stochastic process [9,11,18,35]. The disease dynamics is analysed by solving ordinary differential equations (or equivalent equations in discrete time); the likelihood is then computed by a simple formula. This approach is fast and flexible, but the use of deterministic disease models can bias the results, as can assumptions about independence of observed data points: see for example [36]. Other approaches [37][38][39] consider fully stochastic compartment models and estimate parameters using particle filters (or sequential Monte Carlo methods). Such computations avoid the biases mentioned above, but are much more expensive (for any given parameter set, multiple stochastic trajectories must be generated, and one aims to optimize over all parameter choices).
The methodology that we present is intermediate between these two approaches: the aim is to mitigate the biases associated with deterministic disease models, without the computational cost of stochastic simulations. The CLT approximation to the likelihood can be evaluated quickly by solving ordinary (deterministic) differential equations [25][26][27][28][29][30]. The underlying models include stochastic aspects of disease transmission, and the approach avoids any assumption of independent data points. On the other hand, the CLT approximation assumes that the epidemic is spreading in a large well-mixed population. As such, it can suffer from bias if applied to localized outbreaks or small populations. In such cases, the CLT approximation is no longer suitable, but methods for computing the likelihood from stochastic simulation should be applicable [37][38][39].
As an example where the proposed methodology is appropriate, we analyse an age-resolved population-level model of England and Wales, using data for recorded deaths from COVID-19 over the period 6 March to 15 May 2020, and inferring more than 40 model parameters, with priors informed by existing literature. Given the large numbers of cases in this period, the CLT approximation to the likelihood is justifiable. We compare several variants of the model, which differ in their assumed contact structure; we also compare the model evidence [32][33][34] for the different variants. For such large models (with so many parameters), methods that estimate the likelihood by simulation of stochastic trajectories are intractable. To our knowledge, previous work on epidemiological inference within CLT approximations [26][27][28][29][30] has not analysed models of this complexity.
A more detailed picture of the epidemic would be available by combining multiple data sources (for example, positive tests as well as deaths), but the example presented here illustrates the general methodology. The intended future applications of these methods are to similar (population-level) models with higher complexity, e.g. [40].
In the following, models and definitions are given in §2, the likelihood approximation is discussed in §3, and the inference methods are summarized in §4. The approach is validated in §5 by performing inference on a synthetic dataset for a simple compartment model. The models for England and Wales are defined in §6, while §7 shows the results. We conclude with a discussion in §8. Some technical details and supplementary results are provided in appendices.

[Published in R. Soc. Open Sci. 8: 211065; royalsocietypublishing.org/journal/rsos]

Compartment models

Definition
Consider a compartment model where N individuals are grouped into M cohorts, according to some attributes (for example, age and/or gender). Each cohort is divided into L epidemiological classes, indexed by ℓ = 1, 2, …, L. We assume a single susceptible class, which is ℓ = 1. Other classes may be either infectious or non-infectious: the canonical example is an SIR model which corresponds to L = 3, in which case the recovered (R) class is non-infectious. The analysis presented here is straightforwardly generalized to more complex compartment models, as might be used (for example) to model different pathogen strains, or vaccinated individuals with reduced susceptibility, or testing and quarantining [40].
In the general case, the total number of compartments is M × L and the state of the system can be specified as a vector
$\mathbf{n} = (n_1, n_2, \ldots, n_{M\times L})$. (2.1)
We use boldface notation throughout this work to indicate both vectors and matrices. Each element of $\mathbf{n}$ is a non-negative integer, such that the number of individuals in class ℓ and cohort i is $n_{i+M(\ell-1)}$. For example, $n_1, \ldots, n_M$ are the numbers of susceptible individuals in each cohort. The disease propagation involves individuals moving between the epidemiological classes, by a Markov population process [41]. (Models may also include immigration or emigration steps where the total population changes.) The parameters of the model are $\theta = (\theta_1, \theta_2, \ldots)$, indexed by a label a. The various stochastic transitions are indexed by $\xi = 1, 2, \ldots$. In transition ξ, the population $\mathbf{n}$ is updated by a vector $\mathbf{r}_\xi$ with integer elements, that is
$\mathbf{n} \to \mathbf{n} + \mathbf{r}_\xi$ with rate $w_\xi(t, \theta, \mathbf{n})$. (2.2)
For example, if transition ξ involves a single individual moving from compartment α to compartment β then $\mathbf{r}_\xi$ has −1 in the α-th place and +1 in the β-th place, with all other elements being zero. Consistent with the Markovian assumption, the rate $w_\xi(t, \theta, \mathbf{n})$ depends on the current state, the parameters of the model and the time t. Two common types of transition are infection, and progression from one stage to another. For example, in the simple SIR case (with M = 1), we write $(n_1, n_2, n_3) = (S, I, R)$ with total population N = S + I + R. Taking infection and recovery parameters as θ = (β, γ), the infection transition has $\mathbf{r}_1 = (-1, 1, 0)$ and rate $w_1 = \beta S I/N$, while progression from I to R has $\mathbf{r}_2 = (0, -1, 1)$ and $w_2 = \gamma I$. The general formalism used here covers simple SIR models as well as more complex ones, e.g. §6.
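To make the formalism concrete, the SIR example above can be encoded as a list of (update vector, rate function) pairs. The following is a minimal sketch with illustrative names and parameter values; it is not the PyRoss API.

```python
import numpy as np

# Each transition xi is a pair (r_xi, w_xi): an integer update vector and a
# rate function of (t, theta, n), as in (2.2). SIR example with M = 1.

N = 1000  # total population (illustrative)

def make_sir_transitions():
    # theta = (beta, gamma); n = (S, I, R)
    infection = (np.array([-1, 1, 0]),
                 lambda t, theta, n: theta[0] * n[0] * n[1] / N)  # w1 = beta*S*I/N
    progression = (np.array([0, -1, 1]),
                   lambda t, theta, n: theta[1] * n[1])           # w2 = gamma*I
    return [infection, progression]

transitions = make_sir_transitions()
theta = (0.3, 0.1)
n = np.array([990, 10, 0])
rates = [w(0.0, theta, n) for _, w in transitions]
# infection rate: 0.3*990*10/1000 = 2.97; recovery rate: 0.1*10 = 1.0
```

Any compartment model of the class described above can be specified in this way, as a list of update vectors and rate functions.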

Contact dynamics and the well-mixed assumption
As illustrated by the SIR example, it is a general feature that progression transitions have rates that are linear in $\mathbf{n}$, but infections are bilinear. As usual, we consider compartment models that assume a well-mixed population, in the sense that the typical frequency of meetings between individuals depends only on their cohort. These frequencies are described by the contact matrix [15][16][17][42], which appears in the rates $w_\xi$ for infectious transitions (for an example, see §6 below).
This well-mixed assumption neglects the detailed social structure of the population, for example that friends and family members meet each other much more frequently than other individuals. Despite this (coarse) approximation, compartment models are valuable tools for practical analysis of epidemics, and are useful for inference. Still, it must be borne in mind in the following that these models are not microscopically resolved descriptions of individuals' behaviour, but rather approximate descriptions that capture the main features of disease dynamics, and its dependence on model parameters.

Average dynamics and law of large numbers
We will be concerned with epidemics in large populations, with the well-mixed assumptions described above. We consider an approximate likelihood that is derived by considering a limit of large population, which is controlled by a large parameter V. In models where the total population N is fixed then V = N. If the population is uncertain or subject to change then V is taken as a suitable reference value, for example the prior mean population at time t = 0. (As an example with changing population, we imagine a model that includes birth of new individuals and death by non-epidemiological causes, which can be modelled by transitions that add/remove individuals to/from S (or other) classes, instead of transferring them between classes.) Now define
$\mathbf{x} = \mathbf{n}/V$, (2.3)
whose elements indicate the fractions of individuals in each compartment. To ensure a suitable large-population limit (within the well-mixed assumption), we require that the rates $w_\xi$ have a specific dependence on V:
$w_\xi(t, \theta, V\mathbf{x}) = V v_\xi(t, \theta, \mathbf{x})$, (2.4)
where $v_\xi$ is the transition rate per individual (as opposed to the rate for the population). This assumption corresponds to frequency-dependent transmission, as is commonly assumed in models of human disease [17]. The methodology described here can be generalized to models with density-dependent transmission, but we focus here on the frequency-dependent case, which is the relevant one for application to COVID-19. Given the parameters θ and an initial condition x(0), models of this form obey a law of large numbers in the limit of large population $V \to \infty$ [43][44][45][46]. In this limit, almost all stochastic trajectories x(t) lie close to a single deterministic trajectory, $\bar{\mathbf{x}}(t)$, which can be obtained as the solution of an ordinary differential equation
$\frac{d}{dt}\bar{\mathbf{x}} = \sum_\xi \mathbf{r}_\xi\, v_\xi(t, \theta, \bar{\mathbf{x}})$. (2.5)
The sum in this equation runs over all possible values of ξ; we do not write the range explicitly in such cases, for compactness of notation.
Equation (2.5) is straightforwardly solved by numerical methods, so $\bar{\mathbf{x}}$ can be computed. Note that the initial condition for (2.5) is x(0). As $V \to \infty$, this means that a finite fraction of the population must be infected at t = 0, which is required for the law of large numbers to hold. As a result, this theory does not apply in the very early stages of an epidemic where only a few individuals have been infected.
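As an illustration of how (2.5) is solved in practice, the following sketch integrates the SIR version of the law-of-large-numbers ODE in fraction variables x = n/V. Parameter values and the integration window are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Deterministic SIR path from (2.5): dx/dt = sum_xi r_xi * v_xi(t, theta, x).
beta, gamma = 0.3, 0.1  # illustrative parameters

def rhs(t, x):
    s, i, r = x
    v_inf = beta * s * i   # v_1: per-individual infection rate
    v_rec = gamma * i      # v_2: per-individual recovery rate
    # contributions r_1 * v_1 + r_2 * v_2 with r_1 = (-1,1,0), r_2 = (0,-1,1)
    return np.array([-v_inf, v_inf - v_rec, v_rec])

x0 = np.array([0.99, 0.01, 0.0])  # 1% initially infectious
sol = solve_ivp(rhs, (0.0, 160.0), x0, rtol=1e-8)
s_final, i_final, r_final = sol.y[:, -1]
# mass conservation: s + i + r = 1 along the whole path
```

With these parameters the basic reproduction number is β/γ = 3, so most of the population is eventually infected and the final susceptible fraction is small.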

Central limit theorem
The (approximate) likelihood that we use for Bayesian analysis rests on a functional CLT [43][44][45][46][47] for fluctuations of the epidemiological state about the mean value $\bar{\mathbf{x}}$. The structure of the CLT is outlined here; it applies in the limit $V \to \infty$. The associated approximation for the likelihood is discussed in §3.2, which also discusses its applicability when V is finite.
The CLT is derived for a fixed initial condition x(0). To analyse fluctuations, consider the (scaled) deviation of the epidemiological state x from its average,
$\mathbf{u}(t) = \sqrt{V}\,[\mathbf{x}(t) - \bar{\mathbf{x}}(t)]$, (2.6)
with u(0) = 0. (The factor of $\sqrt{V}$ is standard in CLTs; it is chosen so that typical trajectories of the model have u of order unity, as $V \to \infty$.) By considering the increment in u over a short time-interval and taking the limit of large V, one finds [48, §4.5.9] that u obeys a stochastic differential equation
$d\mathbf{u} = \mathbf{J}\mathbf{u}\,dt + \sum_\xi \mathbf{s}_\xi\, dW_\xi$, with $\mathbf{s}_\xi = \mathbf{r}_\xi \sqrt{v_\xi(t, \theta, \bar{\mathbf{x}})}$, (2.7)
where $W_1, W_2, \ldots$ are independent standard Brownian motions (Wiener processes); our notation suppresses the dependence of $\bar{\mathbf{x}}$ and u on the time t, for compactness. The elements of the square matrix J are
$J_{kl} = \sum_\xi r_{\xi,k}\, \frac{\partial v_\xi}{\partial x_l}(t, \theta, \bar{\mathbf{x}})$.
In the physics literature, the derivation of (2.7) uses the van Kampen expansion [46,48]; the application in population dynamics is due to Kurtz [43][44][45].
One sees that J and $\mathbf{s}_\xi$ depend on the deterministic path $\bar{\mathbf{x}}$ but not on the random variable u, so (2.7) is a time-dependent Ornstein-Uhlenbeck process. The CLT applies as $V \to \infty$; it states that u has Gaussian fluctuations with mean zero, and a covariance that can be derived from (2.7). This result applies to the covariance at any fixed time, and to correlations between fluctuations at different times. The correlations are discussed in appendix A, for example (A 6, A 8).
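For the equal-time covariance, a standard consequence of (2.7) is the Lyapunov ODE $d\boldsymbol{\Sigma}/dt = \mathbf{J}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\mathbf{J}^{\mathsf{T}} + \mathbf{B}$ with $\mathbf{B} = \sum_\xi v_\xi\, \mathbf{r}_\xi \mathbf{r}_\xi^{\mathsf{T}}$, which can be integrated alongside the mean path. The following sketch does this for the SIR example; parameter values are illustrative and this is not PyRoss code.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Mean path x(t) from (2.5) and fluctuation covariance Sigma(t) from (2.7),
# integrated together:  dSigma/dt = J Sigma + Sigma J^T + B.
beta, gamma = 0.3, 0.1
r = [np.array([-1.0, 1.0, 0.0]), np.array([0.0, -1.0, 1.0])]  # update vectors

def v(x):  # per-individual rates v_xi(x)
    s, i, _ = x
    return [beta * s * i, gamma * i]

def jac(x):  # J_kl = sum_xi r_xi,k * dv_xi/dx_l
    s, i, _ = x
    dv = [np.array([beta * i, beta * s, 0.0]),   # gradient of v_1
          np.array([0.0, gamma, 0.0])]          # gradient of v_2
    return sum(np.outer(rx, dvx) for rx, dvx in zip(r, dv))

def rhs(t, y):
    x, sigma = y[:3], y[3:].reshape(3, 3)
    xdot = sum(vx * rx for vx, rx in zip(v(x), r))
    J = jac(x)
    B = sum(vx * np.outer(rx, rx) for vx, rx in zip(v(x), r))
    sdot = J @ sigma + sigma @ J.T + B
    return np.concatenate([xdot, sdot.ravel()])

y0 = np.concatenate([[0.99, 0.01, 0.0], np.zeros(9)])  # Sigma(0) = 0
sol = solve_ivp(rhs, (0.0, 20.0), y0, rtol=1e-8)
sigma_T = sol.y[3:, -1].reshape(3, 3)
# sigma_T is symmetric and positive semi-definite
```

The covariance of the populations themselves is then $V^{-1}$ times `sigma_T` (in fraction variables), reflecting the $\sqrt{V}$ scaling in (2.6).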

Data and likelihood
A central task in Bayesian inference is to compute the posterior distribution of the parameters θ, given some observational data. The posterior probability density function (pdf) of the parameters is [19][20][21]
$P(\theta\,|\,\mathrm{data}) = \frac{P(\mathrm{data}\,|\,\theta)\, P(\theta)}{Z(\mathrm{data})}$, (3.1)
where Z(data) is called the model evidence, which is fixed by normalization of the posterior. We now describe how the observed data are incorporated in our methodology, after which we discuss the likelihood. A technical aspect of our approach is that the initial condition of the system at time t = 0 must be parametrized in terms of θ (or explicitly provided).

Data
In practical situations, observations of the epidemiological state are subject to uncertainty. Our methodology includes all random aspects of the observation (surveillance) process directly in the model. This means that measured data can be identified with the populations of certain model compartments; the remainder of the compartments are not observed, and correspond to latent variables. For example, the model of §6 (below) includes an observed compartment for deceased individuals, which should properly be interpreted as a compartment for deceased individuals who were diagnosed with COVID-19. Other compartments (for example, susceptible and infected) are latent variables, which are independent of diagnosis. The measurement process is then modelled through the (stochastic) transition from infected compartment to deceased compartment, whose rate depends on the probability of correct diagnosis.
We assume that observations are made at an ordered set of positive times, indexed by m = 1, 2, …. Specifically, at time $t_m$, one observes a vector with $m_{\mathrm{obs}}$ elements, which are linear combinations of the compartment populations at that time. That is,
$\mathbf{n}^{\mathrm{obs}}(t_m) = \mathbf{F}\,\mathbf{n}(t_m)$, (3.2)
where F is a matrix of size $m_{\mathrm{obs}} \times (ML)$, which we call the filter matrix. Now define a vector Y that contains all the observed data, by collecting the individual observation vectors,
$\mathbf{Y} = \left(\mathbf{n}^{\mathrm{obs}}(t_1), \mathbf{n}^{\mathrm{obs}}(t_2), \ldots\right)$. (3.3)
This vector corresponds to the data in (3.1).
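As a small illustration of the filter matrix, suppose a hypothetical model with M = 2 age cohorts and L = 3 classes ordered (S, I, D); the compartment ordering and observation choices below are our assumptions, chosen only to show how F acts on the state vector.

```python
import numpy as np

# Filter matrix F of shape (m_obs, M*L) selecting linear combinations of
# compartment populations, as in (3.2).
# Illustrative state layout: n = (S1, S2, I1, I2, D1, D2).
# Observations: total deaths, and deaths in cohort 2 alone (m_obs = 2).

M, L = 2, 3
F = np.zeros((2, M * L))
F[0, 4] = F[0, 5] = 1.0   # row 0: D1 + D2
F[1, 5] = 1.0             # row 1: D2 only

n = np.array([500.0, 400.0, 30.0, 20.0, 7.0, 3.0])
obs = F @ n
# obs = [10., 3.]
```

Daily death counts, as used in §5 and §6, correspond to differences of such observations at successive times; these are again linear in the compartment populations.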

Approximated likelihood
The likelihood is denoted by
$\mathcal{L}(\theta) = P(\mathbf{Y}\,|\,\theta)$. (3.4)
Hence we require a computationally tractable estimate of this probability. The formula that we use is based on the CLT for the path x(t), as discussed in §2.4. Given the parameters θ, the most likely observation is $\bar{\mathbf{Y}}$, whose elements are
$\bar{\mathbf{n}}^{\mathrm{obs}}(t_m) = V\mathbf{F}\bar{\mathbf{x}}(t_m)$, (3.5)
analogous to (3.3). Define also the (scaled) deviation of the data from this value,
$\boldsymbol{\Delta} = \frac{1}{\sqrt{V}}\left(\mathbf{Y} - \bar{\mathbf{Y}}\right)$. (3.6)
As in (2.6), the scaling is such that elements of Δ are typically of order unity. In the specific case considered here, the functional CLT states that the log-likelihood obeys
$\log \mathcal{L}(\theta) \simeq -\tfrac{1}{2}\boldsymbol{\Delta}^{\mathsf{T}} \mathbf{G}^{-1} \boldsymbol{\Delta} - \tfrac{1}{2}\log\det(2\pi \mathbf{G}) + \mathrm{const.}$, (3.7)
where the approximate equality is accurate as $V \to \infty$, and $\mathbf{G}^{-1}$ denotes the inverse of a square covariance matrix G, whose form is dictated by the CLT of §2.4: see appendix A, in particular (A 8). Given a compartment model with parameters θ and a filter matrix F, computation of G requires numerical solution of a (matrix-valued) ODE. We note once more that (3.7) is an approximation to the likelihood of the underlying compartment model, valid for large populations. Similar approximations have been applied previously in epidemiological inference, for example [26][27][28][29][30], and in physical sciences [25,49,50]. The results of those studies indicate that inference based on the CLT approximation can be effective in practice, but for any given V, the accuracy of the Gaussian (CLT) assumption is not easy to assess.
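Once Δ and G are available, (3.7) is a standard multivariate-Gaussian evaluation. A minimal sketch, using a Cholesky factorization for speed and numerical stability; the Δ and G values below are illustrative stand-ins for quantities computed from the model.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Gaussian log-likelihood of (3.7): -0.5*Delta^T G^{-1} Delta - 0.5*log det(2 pi G).

def clt_log_likelihood(delta, G):
    c, low = cho_factor(G)                       # Cholesky factor of G
    quad = delta @ cho_solve((c, low), delta)    # Delta^T G^{-1} Delta
    logdet = 2.0 * np.sum(np.log(np.diag(c)))    # log det G from the factor
    return -0.5 * (quad + logdet + len(delta) * np.log(2.0 * np.pi))

G = np.array([[2.0, 0.5], [0.5, 1.0]])  # illustrative covariance
delta = np.array([0.3, -0.2])           # illustrative scaled residual
ll = clt_log_likelihood(delta, G)
```

The Cholesky route avoids forming $\mathbf{G}^{-1}$ explicitly, which matters when the number of observations (and hence the size of G) is large.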
To address this question of accuracy, we highlight a few situations where caution is advised in practical settings. First, the CLT is restricted to typical fluctuations of the stochastic process, so its application requires that the observed data Y lie within a few standard deviations of their most likely values $\bar{\mathbf{Y}}$. In other words, the likelihood will only be accurate for models with a reasonable fit to the data. Second, for computation of the mean trajectory, the error in the CLT approximation comes from nonlinear processes (such as infection of a susceptible individual), while linear processes (such as recovery of an infectious individual) do not require any approximation. For well-mixed models, this means that a necessary condition for the CLT is that the total number of infectious individuals should be large compared with unity, so that the fluctuations in this quantity are not too large. Third, changes in compartment populations are integers, but they are treated as real numbers by the CLT. The associated error is that of replacing a (difference of) Poisson-distributed integers by a Gaussian-distributed real number.
Among these three factors, the first two must be taken seriously when applying the methodology proposed here. In the examples considered below, the models do fit the data, and the total number of infectious individuals is numerically large at all times considered, giving confidence in the CLT approximation. For the third factor, we note that if some compartment populations are numerically small, observations of these compartments will tend to have little impact on the likelihood (because their mean occupancies will probably be comparable with their variances). In this case, the methodology will correctly infer that this observation has little effect on the posterior distribution: this weak dependence can help to mitigate errors associated with a breakdown of the CLT. We return to this point in §5, below.
Finally, we observe that in the practical context of epidemiology, the approximation error of the likelihood must be considered together with the fact that any compartment model is already a coarse approximation of the real-world disease progression. This is especially true for population-level models, given the well-mixed assumption for contacts within cohorts. In such cases, the aim is not for absolute accuracy in parameter estimation or likelihood computation. Instead, the likely applications would be Bayesian model comparison and forecasting, as discussed in §4. For those applications, it is vital to address sources of systematic bias in the inference process. By incorporating stochastic disease progression and dependency among observed data points, the CLT mitigates at least some of the biases of simpler approaches [36], at manageable computational cost.

Inference methodology
This section briefly describes the inference methodology, as implemented in PyRoss [13,31].

Model estimation
The methodology has been implemented for a general class of compartment models as defined above, including progression and infection transitions. Specifically, if transition ξ involves progression from compartment α, one has
$w_\xi(t, \theta, \mathbf{n}) = g_\xi(t, \theta)\, n_\alpha$, (4.1)
with arbitrary dependence of $g_\xi$ on the parameters θ and the time t. For infection reactions, suppose that transition ξ involves a susceptible individual in cohort i being infected by an individual in some infectious class. Denote the population of susceptible individuals in cohort i by $S_i$ and the population of individuals in cohort j of the infectious class by $I^{(k)}_j$; here k is a label for the relevant infectious class. Then the generic infection rate is
$w_\xi(t, \theta, \mathbf{n}) = S_i \sum_j K^{\xi}_{ij}(t, \theta)\, \frac{I^{(k)}_j}{N_j}$,
where $K_\xi$ has arbitrary dependence on the parameters θ and the time t, and $N_j$ is the population of cohort j. The form of K depends on the rates of contacts between cohorts and on various epidemiological parameters; a specific example is given in §6 below. Once the model is specified, the inference methodology is automated. We outline the method, with details in appendix B and [13]. Given the data and some parameter values θ, the non-normalized posterior is computed (up to the normalization factor Z) by combining the prior information with (3.7). This posterior is optimized over θ using the covariance matrix adaptation evolution strategy (CMA-ES) [51], yielding the maximum a posteriori (MAP) parameters $\theta^*$. We also compute the Hessian matrix of the log-posterior using finite differences.
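The MAP step can be sketched as follows. The paper uses CMA-ES; here scipy's Nelder-Mead optimizer is a simple stand-in, applied to an illustrative quadratic toy posterior (not the CLT likelihood itself), followed by a finite-difference Hessian as described above.

```python
import numpy as np
from scipy.optimize import minimize

# Toy negative log-posterior: narrow Gaussian likelihood around an assumed
# ground truth theta_true, plus a weak Gaussian prior centred at zero.
theta_true = np.array([0.3, 0.1])

def neg_log_posterior(theta):
    return 0.5 * np.sum(((theta - theta_true) / 0.01) ** 2) \
        + 0.5 * np.sum((theta / 10.0) ** 2)

res = minimize(neg_log_posterior, x0=np.array([0.5, 0.5]), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-10,
                        "maxiter": 2000, "maxfev": 2000})
theta_map = res.x  # MAP estimate

def hessian_fd(f, x, h=1e-4):
    """Finite-difference Hessian, for the Laplace approximation at the MAP."""
    d = len(x)
    H = np.zeros((d, d))
    for a in range(d):
        for b in range(d):
            ea, eb = np.eye(d)[a] * h, np.eye(d)[b] * h
            H[a, b] = (f(x + ea + eb) - f(x + ea - eb)
                       - f(x - ea + eb) + f(x - ea - eb)) / (4 * h * h)
    return H

H = hessian_fd(neg_log_posterior, theta_map)
```

The inverse of H provides a local (Laplace) estimate of the posterior covariance at the MAP, which is also a useful starting point for the MCMC sampling of §4.2.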
We consider the Fisher information matrix (FIM) [19], which measures the information provided by the data about the inferred parameters of the model. It is a matrix with elements
$F_{ab} = \left\langle \frac{\partial \log \mathcal{L}}{\partial \theta_a} \frac{\partial \log \mathcal{L}}{\partial \theta_b} \right\rangle$,
where the angled brackets denote an average over the stochastic dynamics of the model, with fixed parameters θ. Recalling (3.4), this means that one averages over all possible values of Y according to the model dynamics, instead of using the observed data. The sensitivity of parameter a with respect to the (expected) data can then be estimated as
$\sigma_a = 1/\sqrt{F_{aa}}$;
for more detail see [52,53]. The FIM is defined as an average over the stochastic dynamics, but the Gaussian structure of the likelihood (3.7) means that the FIM can be estimated by a deterministic computation, see appendix B.1.
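A minimal sketch of this deterministic FIM computation, for the special case of a Gaussian likelihood whose covariance G does not depend on θ: the FIM then reduces to $F_{ab} = (\partial_a \bar{\mathbf{Y}})^{\mathsf{T}} \mathbf{G}^{-1} (\partial_b \bar{\mathbf{Y}})$. The mean function below is an illustrative linear toy model, not the epidemic model.

```python
import numpy as np

# FIM for a Gaussian likelihood with parameter-independent covariance G:
#   F_ab = (d mu / d theta_a)^T G^{-1} (d mu / d theta_b),
# with per-parameter sensitivity sigma_a = 1/sqrt(F_aa) (Cramer-Rao bound).

def mu(theta):          # toy mean of 3 observations (assumed model)
    a, b = theta
    return np.array([a, a + b, a + 2 * b])

def fisher(theta, G, h=1e-6):
    d = len(theta)
    grads = []
    for a in range(d):      # central finite differences for d mu / d theta_a
        e = np.zeros(d); e[a] = h
        grads.append((mu(theta + e) - mu(theta - e)) / (2 * h))
    Ginv = np.linalg.inv(G)
    return np.array([[ga @ Ginv @ gb for gb in grads] for ga in grads])

G = np.diag([1.0, 1.0, 1.0])
F = fisher(np.array([0.5, 0.2]), G)
sens = 1.0 / np.sqrt(np.diag(F))   # sensitivity estimate per parameter
# F = [[3, 3], [3, 5]] for this linear toy model
```

In the full methodology G also depends on θ, which contributes an additional trace term to the FIM; the sketch above captures only the mean-path contribution.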

Posterior sampling and the role of priors
To go beyond the MAP, we sample the posterior for θ by MCMC, using the emcee package [54]. In what follows, the results depend significantly on the prior, as well as the likelihood. This is natural in our (Bayesian) approach, because there are many sources of uncertainty in epidemiological modelling, and we incorporate available knowledge into prior distributions, informed by whatever expert judgement is available. Posterior sampling reveals which parameters are identifiable (constrained by the data) and which are only weakly identifiable (their posterior distribution remains close to the prior). The result of this process is that identifiable parameters are determined by the data, while weakly identifiable ones are determined by expert judgement, through the prior. (For experiments in the physical sciences, one might hope for enough data that the inferred parameters depend weakly on the prior, but that is unlikely in the epidemiological context.)
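The sampling step can be illustrated as follows. The paper uses the emcee ensemble sampler; as a dependency-free stand-in, this sketch is a plain random-walk Metropolis sampler on a toy one-dimensional log-posterior whose moments are known.

```python
import numpy as np

rng = np.random.default_rng(42)

def log_post(theta):           # toy target: Gaussian, mean 2.0, std 0.5
    return -0.5 * ((theta - 2.0) / 0.5) ** 2

def metropolis(log_p, theta0, n_steps, step=0.5):
    chain, theta, lp = [], theta0, log_p(theta0)
    for _ in range(n_steps):
        prop = theta + step * rng.normal()     # symmetric proposal
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            theta, lp = prop, lp_prop
        chain.append(theta)
    return np.array(chain)

chain = metropolis(log_post, 0.0, 20000)
chain = chain[5000:]    # discard burn-in, as in Sec. 5
```

Histograms of such chains, compared against the prior, are what reveal which parameters are identifiable and which remain prior-dominated.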

Model comparison
A significant advantage of Bayesian approaches is the ability to compare the evidence for different models in the light of data [32][33][34]. Several criteria exist for choosing among different models. We consider here the model evidence: this is not as easy to compute as some other criteria, but it has a firm theoretical basis, see for example ch. 28 of [21]. Recalling (3.1), the evidence in favour of any model may be expressed in terms of the likelihood $\mathcal{L}$ and the prior P as
$Z = \int \mathcal{L}(\theta)\, P(\theta)\, d\theta$,
which is also known as the marginal likelihood. It tends to be large if the model corresponds to high likelihood, but the integral over parameters θ means that Z is strongly suppressed in cases where fitting the data requires fine-tuning of the parameters. This ensures that overfitted models have low evidence. Hence, models (or hypotheses) with larger Z are to be preferred (at least in the absence of prior information about which model is more likely). In practice, it is more convenient to work with the log-evidence. We compute Z using thermodynamic integration, see appendix B.2. The evidence is useful for Bayesian model comparison and model averaging [33]; an example of model comparison is given in §7.3 below.
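Thermodynamic integration rests on the identity $\log Z = \int_0^1 \mathbb{E}_\beta[\log \mathcal{L}]\, d\beta$, where $\mathbb{E}_\beta$ denotes the average over the tempered posterior $\propto \mathcal{L}(\theta)^\beta P(\theta)$. The following sketch checks the identity on a conjugate-Gaussian toy problem (one observation, Gaussian prior), where the tempered averages and the exact evidence are both available in closed form; in practice the tempered averages come from MCMC.

```python
import numpy as np

y, sig_l, sig_p = 1.3, 0.5, 2.0   # datum, likelihood std, prior std (illustrative)

def mean_log_like_tempered(beta):
    # Tempered posterior ~ exp(-beta*(y-theta)^2/(2 sig_l^2) - theta^2/(2 sig_p^2))
    tau = 1.0 / sig_p**2 + beta / sig_l**2   # tempered posterior precision
    m = (beta * y / sig_l**2) / tau          # tempered posterior mean
    e_sq = (y - m) ** 2 + 1.0 / tau          # E_beta[(y - theta)^2]
    return -0.5 * np.log(2 * np.pi * sig_l**2) - e_sq / (2 * sig_l**2)

betas = np.linspace(0.0, 1.0, 2001)
vals = np.array([mean_log_like_tempered(b) for b in betas])
h = betas[1] - betas[0]
log_z = h * (0.5 * vals[0] + vals[1:-1].sum() + 0.5 * vals[-1])  # trapezoid rule

# exact marginal likelihood: y ~ N(0, sig_p^2 + sig_l^2)
var = sig_p**2 + sig_l**2
log_z_exact = -0.5 * np.log(2 * np.pi * var) - y**2 / (2 * var)
```

The quadrature over β is cheap; the cost in realistic models is the MCMC estimation of $\mathbb{E}_\beta[\log \mathcal{L}]$ at each temperature.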
To interpret the model evidence, it is also useful to compute the deviance D [55], which is related to the posterior average of the log-likelihood as $D = -\mathbb{E}_{\mathrm{post}}[\log \mathcal{L}]$. Noting that the posterior distribution is $P_{\mathrm{post}}(\theta) = \mathcal{L}(\theta) P(\theta)/Z$, one has
$D_{\mathrm{KL}}(P_{\mathrm{post}}\,\|\,P) = \mathbb{E}_{\mathrm{post}}[\log \mathcal{L}] - \log Z$.
This may be rearranged as
$\log Z = -D - D_{\mathrm{KL}}(P_{\mathrm{post}}\,\|\,P)$,
where $D_{\mathrm{KL}}(P_{\mathrm{post}}\,\|\,P)$ is the Kullback-Leibler (KL) divergence between prior and posterior. Hence, the evidence is large for models with high likelihood (low deviance), but subtracting the KL divergence means that the evidence is penalized for models where the posterior distribution is too sharply peaked, or too different from the prior assumptions. This avoids overfitting [34].

Forecasts and nowcasts
Given samples from the posterior, several kinds of forecast and nowcast are possible. The time period over which data are used for inference is called the inference window. In a deterministic forecast or nowcast, we compute the average path $\bar{\mathbf{x}}(t)$ for a given set of parameters. This allows prediction of the population of unobserved (latent) compartments. If this is performed for times t within the inference window, we refer to it as a nowcast. The path $\bar{\mathbf{x}}(t)$ can also be computed outside this window; this is a forecast. By sampling parameters from the posterior, the range of behaviour can be computed. However, this computation only captures the role of parameter uncertainty; it neglects the inherent stochasticity of the model.
In a conditional nowcast, we use the functional CLT to derive a (Gaussian) distribution for the population of the latent compartments, conditional on the observed data. Samples from this distribution can be generated, which allow the role of stochasticity to be assessed, see appendix B.3. We emphasize that the nowcast requires sample paths that are conditional on the data, for times within the inference window. Such conditional distributions cannot be sampled by direct simulation of the model, but the functional CLT enables sampling (under the assumption of large V).
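Because observed data and latent compartments are jointly Gaussian under the CLT, the conditional nowcast reduces to the standard Gaussian conditioning formulas $\mu_{X|Y} = \mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y)$ and $\Sigma_{X|Y} = \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}$. A minimal sketch with an illustrative joint mean and covariance (not model output):

```python
import numpy as np

# State: two latent components X = (X1, X2) and one observed component Y.
mu = np.array([10.0, 4.0, 2.0])          # joint mean (X1, X2, Y)
S = np.array([[4.0, 1.0, 1.5],           # joint covariance
              [1.0, 2.0, 0.5],
              [1.5, 0.5, 1.0]])
y_obs = 2.8                              # observed value of Y

S_xx, S_xy, S_yy = S[:2, :2], S[:2, 2:], S[2:, 2:]
mu_cond = mu[:2] + (S_xy / S_yy[0, 0]).ravel() * (y_obs - mu[2])
S_cond = S_xx - S_xy @ S_xy.T / S_yy[0, 0]

# Draw conditional samples of the latent state given the data.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu_cond, S_cond, size=1000)
```

With vector-valued observations the scalar division becomes a solve against $\Sigma_{YY}$; the structure is otherwise identical.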
Finally, we consider stochastic trajectories that extend beyond the inference window, which we call a stochastic forecast. In this case, we first use a conditional nowcast to sample the latent compartments at the end of the inference window, after which we simulate the stochastic dynamics (by Gillespie [56] or tau-leaping [57] methods). This yields trajectories of the full stochastic model, with integer-valued populations. (In contrast to the conditional nowcast, these sample paths are conditional only on data from the past. Hence they can be sampled directly, due to the Markov property.) These processes are analogous to computations with hidden Markov models (HMMs) [58]. The latent compartments correspond to the hidden variables, which are to be estimated. Also, nowcasting corresponds to sampling from the filtered distribution of the HMM, and the stochastic forecast is an HMM method for prediction. Inference based on the functional CLT leads to a multivariate Gaussian distribution for the latent variables, which provides the filtered distribution directly (at this level of approximation).

Inference validation with synthetic data
To validate the methodology described so far, we consider a simple example model of SEIR type, with an additional compartment (D) for deceased individuals. There is a single age cohort, with population V. The compartment populations are denoted by S (susceptible), E (exposed), I (infectious), R (recovered) and D (deceased). The rates for the stochastic population model are
$S \to E$ with rate $c\beta SI/V$; $\quad E \to I$ with rate $\gamma_E E$; $\quad I \to R$ with rate $(1-f)\gamma_I I$; $\quad I \to D$ with rate $f\gamma_I I$. (5.1)
Here, f is the infection fatality ratio (IFR), c is the rate of contacts, β is the infection probability per contact and $\gamma_E$, $\gamma_I$ are rates for progression from E and I, respectively. Note that n = (S, E, I, R, D) is a vector of integer-valued populations, and recall from (2.3) that the corresponding fractions of the total population are x = n/V. Hence (5.1) is consistent with (2.4); the rates w correspond to the numbers of individuals that are transferred (on average) between the compartments, per unit time. We take (β, f) = (0.035, 0.02) and rates (c, $\gamma_E$, $\gamma_I$) = (20, 0.35, 0.25) per day. For inference, we generate synthetic data by direct simulation of the stochastic model using Gillespie [56] or tau-leaping methods [57], depending on the population (see below). The simulation runs over an 80-day period which spans the course of the epidemic; the initial condition has a fraction 10^−3 of the population in the E compartment, and 4 × 10^−4 in the I compartment, with all other individuals being susceptible.
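A sketch of the Gillespie simulation of (5.1) follows. The population is much smaller than in the paper's validation runs, so that the loop is fast; parameter values are those quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, f = 0.035, 0.02
c, gamma_E, gamma_I = 20.0, 0.35, 0.25
V = 2000  # small population for illustration only

updates = np.array([[-1, 1, 0, 0, 0],    # S -> E (infection)
                    [0, -1, 1, 0, 0],    # E -> I
                    [0, 0, -1, 1, 0],    # I -> R
                    [0, 0, -1, 0, 1]])   # I -> D

def rates(n):
    S, E, I, R, D = n
    return np.array([c * beta * S * I / V,
                     gamma_E * E,
                     (1 - f) * gamma_I * I,
                     f * gamma_I * I])

n = np.array([V - 3, 2, 1, 0, 0])        # small initial seed
t, t_end = 0.0, 80.0
while t < t_end:
    w = rates(n)
    w_tot = w.sum()
    if w_tot == 0.0:                     # no events possible: epidemic over
        break
    t += rng.exponential(1.0 / w_tot)    # waiting time to next event
    xi = rng.choice(4, p=w / w_tot)      # which transition fires
    n = n + updates[xi]

# total population S + E + I + R + D = V is conserved by every transition
```

Trajectories of this kind, with the daily increments of the D compartment extracted, provide the synthetic death counts used for the inference tests below.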
We take the daily numbers of deaths as observed data from this synthetic trajectory and we attempt to infer the 'true model' (the model that generated the data). We perform inference using data from a time window that starts when the total number of deaths first exceeds 0.2% of the total population (this is a random time which depends on the stochastic trajectory). We use the methods described above to infer the rate β and also initial conditions for the compartments S, E, I, denoted $S_0$, $E_0$, $I_0$. (Note, these are the initial conditions at the beginning of the inference window, not the initial conditions at time zero.) The total population V and the values of (c, $\gamma_E$, $\gamma_I$) are fixed at their true values. (Since the initial population $D_0$ of the D compartment is observed, the initial condition for R is computed as $R_0 = V - S_0 - E_0 - I_0 - D_0$.) For the analysis of this section, the prior for β is a Gaussian whose mean is 0.8 times the true value, with standard deviation one half of its mean. The prior means for $S_0$, $E_0$, $I_0$ are obtained by considering the fastest growing linear mode of the deterministic dynamics (see appendix C).
The approximate likelihood of (3.7) is accurate for large populations. As initial validation we take $V = 10^8$. A trajectory of the true model is generated by the tau-leaping method (stochastic generation of a full trajectory by the Gillespie method would already take a significant computational effort, comparable with the total time taken for inference of MAP parameters). We consider observed data from a 20-day inference window. We maximize the posterior over the four inferred parameters following §4.1, and we sample the full posterior distribution by MCMC following §4.2. For MCMC sampling of this model, we use an ensemble of eight walkers [54] (twice the number of inferred parameters); we take several thousand MCMC iterations per walker, resulting in a sampling time more than 50 times larger than the autocorrelation time of the underlying Markov chain. We discard one-third of the samples for burn-in of the chain.
Results are shown in figure 1. The inference machinery accurately infers all four parameters from just one stochastic trajectory. The posterior uncertainty is low; this is expected because the population of the model is very large, so the CLT is an accurate description of its dynamics, and the fluctuations between trajectories are very small. (In particular, the number of deaths on each day is more than 10^4, and the populations of the S, E, I classes are more than 10^6, so the natural scale for fluctuations in these numbers is N^(−1/2) ≈ 0.1-1%.) To illustrate model forecasting, we consider a similar situation, but now with population V = 10^6. In this situation, daily deaths are in the 100s, so one may expect significant day-to-day fluctuations as well as some deviations from CLT behaviour. We perform inference using data from increasingly long time windows with lengths of 4-20 days; for each dataset we run independent computations of the MAP, and MCMC sampling. Figure 2 shows stochastic forecasts, as described in §4.4, as well as posterior distributions of β. As the data used for inference increase, the posterior uncertainty is reduced, as is the forecast uncertainty. Each forecast includes 40 trajectories, so the range of outcomes in each forecast can be used as a rough estimate of a 97.5% credible interval. One sees that the synthetic data fall inside the forecast uncertainty for all time windows considered. This shows that the forecast uncertainty of the model is a reliable guideline for future behaviour.
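The forecast band used above (the range of an ensemble of sampled trajectories) reduces to taking a pointwise minimum and maximum over the ensemble. A sketch, with synthetic Poisson trajectories standing in for the sampled forecasts of §4.4:

```python
import numpy as np

rng = np.random.default_rng(0)

# 40 hypothetical forecast trajectories of daily deaths over 14 days,
# stand-ins for posterior-predictive samples drawn as in Section 4.4
n_traj, horizon = 40, 14
trajectories = rng.poisson(lam=200.0, size=(n_traj, horizon))

# Rough credible band: the pointwise range of the ensemble on each day
lower = trajectories.min(axis=0)
upper = trajectories.max(axis=0)
print(lower, upper)
```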
The method validation of this section aims to establish two things: first, that the numerical implementation is adequate; and second, that the approximate likelihood (3.7) yields reliable results for inference and forecasting. In this particular example, the values of the observed data are in the 100s, and we verify that the approximate likelihood (3.7) yields reliable forecasts and posterior uncertainties. Since the CLT is valid when compartment populations are large, an important question is how the performance of this methodology behaves as one considers smaller populations, especially in models with more compartments. This question is a subtle one: some results are shown in appendix C, with a discussion. As a general point, it is important that our proposed applications are for inference and forecasting based on single stochastic trajectories, as observed in epidemics. We expect in general, and the example of appendix C confirms, that parameter inference is challenging for small populations. In particular, for models with small compartment populations, the CLT approximate likelihood (3.7) will break down, which contributes to biased parameter estimates. However, we also expect large posterior uncertainties in such cases. In this situation, the results of appendix C indicate that the true model parameters are well inside the inferred posterior uncertainty, as they should be.
Finally, we remark that while inference of model parameters from synthetic data is a useful exercise, well-mixed models of real epidemics are abstractions that make strong assumptions about the disease (and surveillance) dynamics (recall §2.2). In this context, there is no 'true model': the success of inference cannot be judged by its accuracy, but rather by its ability to fit (and forecast) the behaviour of observed time series in a consistent way, similar to figures 1 and 2. To assess this, we now apply a similar methodology to a model for COVID-19 in England and Wales.

COVID-19 in England and Wales: model
We analyse a well-mixed compartment model for England and Wales, using data published by the Office for National Statistics (ONS), for numbers of deaths where COVID-19 was mentioned on the death certificate [59]. We consider the period 6 March to 15 May 2020, which covers the imposition of lockdown, and the associated peak in weekly deaths. (The first recorded deaths took place in the week ending 6 March, the lockdown was imposed on 23 March, and the peak in deaths was in late March and early April.) In numerical data, time is measured in weeks, starting from 6 March.
The model uses time-dependent contact structures to model non-pharmaceutical interventions (NPIs), which include the lockdown as well as other behavioural changes (mask wearing, additional hand washing, etc). For consistency with the well-mixed assumption of the model, our data excludes deaths taking place in care homes, since these individuals probably have unusual contacts, which are primarily inside their own institutions.
More precisely, individuals in the model are defined to exclude care-home residents, and we assume negligible transmission of infection from care homes to non-residents. (Note, there is no such assumption on transmission in the opposite direction, from non-residents into care homes.) We also assume (i) that all deaths in care homes were for individuals aged 75+, and (ii) that the care-home population is small in comparison with the total, so that the N_i are fixed at the total cohort populations, without being adjusted to exclude care-home residents. These assumptions simplify the model; they are not perfectly accurate, but we argue that the associated approximations are negligible compared with the (coarse) well-mixed assumption discussed in §2.2, and with the uncertainties in the identification of COVID-related deaths. Before embarking on the details of the model, we point out that it includes 128 compartments, and we will infer either 46 or 47 parameters, depending on the variant. This is a challenging numerical task. It is likely that fits of similar quality could be achieved by a model with significantly fewer parameters; the dependence of parameters on age is also not very strong, so the number of age cohorts might be reduced without much loss of accuracy. However, one purpose of this example is to test the capacity of the approach to handle models of this complexity, with a view to future work with (for example) compartments for quarantined/vaccinated individuals [40] and/or multiple variants of the virus. In this example, the inference computations are within the capability of desktop workstations, although long runs were required for the MCMC and evidence computations; see below for details.

Definition and epidemiological parameters
We consider M = 16 age cohorts, which correspond to 5-year age bands from 0-4 to 70-74, and a single cohort for all individuals of age 75+. The population of cohort i is N_i and V = Σ_i N_i. Given the short time period considered here, we neglect vital dynamics (birth, ageing and death by causes other than COVID-19).
The disease model is broadly consistent with other studies such as [8,10-12], although the treatment of individuals in the later stages of (more severe) disease is different, as discussed below. There are L = 8 epidemiological classes, illustrated in figure 3. Susceptible individuals (S) move to the exposed class E when they become infected. The exposed class represents the latent period, so these individuals are not infectious; they progress with rate γ_E to an activated class A, which is infectious but non-symptomatic. We sometimes also denote this class by I^(0). From A, all individuals progress to class I^(1), with rate γ_A. Hence I^(1) includes cases that never develop symptoms, as well as paucisymptomatic and severe cases. (Paucisymptomatic cases are defined as those with very mild symptoms, following [60].) These situations are distinguished by their progression from stage I^(1): the total progression rate is γ_1, with an age-dependent fraction α_i of individuals (asymptomatic/paucisymptomatic cases) recovering into class R; the remainder progress to a symptomatic infectious stage I^(2). There is progression from I^(2) to I^(3) with rate γ_2. After this, the (total) progression rate from I^(3) is γ_3, of which an age-dependent fraction f_i of individuals die (transition to D) while the remainder recover to R. Hence f_i corresponds to the case fatality ratio (CFR), and the infection fatality ratio (IFR) is (1 − α_i)f_i.

Figure 3. (a) Epidemiological classes of the example model, with rate constants indicated; the infection rate λ_i is given in (6.2). Here, ᾱ_i = 1 − α_i and similarly f̄_i = 1 − f_i. The colouring distinguishes susceptible, infectious and non-infectious compartments. The transition I^(1) → R represents rapid recovery of asymptomatic/paucisymptomatic cases, see main text. (b-d) Contact matrices for the different model variants. The colour indicates the rate of contacts between individuals in different age groups.
Specifically, each row corresponds to an age cohort for susceptible individuals who make contacts with infected individuals of various ages (corresponding to the different columns). See §6.2 for further details.
Individuals in R are immune; we assume no reinfection within the period considered in this work. The inclusion of several infectious stages allows flexibility in the model as to the distribution of times between infection and recovery or death.
The infection process for cohort i depends on a contact rate matrix C̃, the susceptibility to infection β_i of that cohort, and on the infectiousness of the infected individual (which depends on its infectious stage). Specifically, the rate for infection of individuals in cohort i by those in infectious stage k is

β_i S_i Σ_j C̃_ij(t) ν_k I^(k)_j(t) / N_j,    (6.1)

where S_i is the population of the relevant susceptible compartment, I^(k)_j(t) is the population of the infectious stage for cohort j, and ν_k is the infectiousness of stage k. There are separate transitions ξ for infection of every cohort i, and for every infectious stage k. Comparing (4.2) and (6.1) shows that K_j(t, u) = β_i C̃_ij(t) ν_k V/N_j for this transition. The choice of contact (rate) matrix is discussed in §6.2, below. The (deterministic) equations that describe the average evolution of this model are given in appendix D.1; the force of infection for individuals in cohort i (the infection rate per susceptible individual) is denoted by λ_i and can be deduced from (6.1),

λ_i(t) = β_i Σ_j C̃_ij(t) Σ_k ν_k I^(k)_j(t) / N_j.    (6.2)

(In these equations, I^(0)_i should be identified as A_i, and N_j is the total population of cohort j.) Since we only consider data for numbers of deaths, it is not possible to infer all epidemiological parameters. For example, the data do not provide information about absolute numbers of cases, nor on the relative numbers of symptomatic and asymptomatic cases. For this reason, we fix the α and f parameters to estimated (age-dependent) values based on surveillance data from Italy in the early stages of the pandemic [60]. These estimates are discussed in appendix D.2; they are subject to considerable uncertainty, but the resulting model is still flexible enough to fit the data. All remaining parameters are inferred. The β parameters are age-dependent; all other epidemiological parameters are assumed independent of age. As noted above, the initial condition x(0) must be determined from the inference parameters θ. Details of this procedure and full specification of all prior distributions are given in appendix D.2.
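The force of infection vectorizes naturally over cohorts and infectious stages. A sketch with illustrative sizes and values (the real β_i, contact rates and populations are those inferred or fixed in appendix D):

```python
import numpy as np

rng = np.random.default_rng(2)

M, K = 16, 4                        # age cohorts; infectious stages A, I1, I2, I3
beta = rng.uniform(0.02, 0.08, M)   # per-cohort susceptibility (illustrative)
C = rng.uniform(0.0, 2.0, (M, M))   # contact-rate matrix (illustrative)
nu = np.array([1.0, 1.0, 0.1, 0.1]) # infectiousness by stage; late stages small
N = np.full(M, 3.5e6)               # cohort populations (illustrative)
I = rng.integers(0, 5000, (K, M)).astype(float)  # stage-k populations per cohort

# Force of infection per susceptible in cohort i:
#   lambda_i = beta_i * sum_j C_ij * (sum_k nu_k * I^(k)_j) / N_j
lam = beta * (C @ ((nu @ I) / N))
print(lam.shape)  # (16,)
```

The matrix-vector form `C @ ((nu @ I) / N)` evaluates the double sum over stages and cohorts in two contractions, which scales well when the number of cohorts is increased.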
Compared with other models such as those of [8,10-12], the main difference in our approach is that individuals in the later stages of the disease (I^(2) and I^(3)) can still pass on the infection, albeit with reduced probabilities given by ν_2, ν_3 in (6.2). Such individuals have high viral load but low levels of (viable) virus in the respiratory tract [61,62], indicating that ν_2 = ν_3 = 0 might be the most realistic choice, as in [8,10-12]. Still, the model considered here is suitable for illustrative purposes (in practice, the inferred ν_2, ν_3 ≈ 0.1 are small; see also figure 11 below, and the associated discussion).

Model variants (contact matrices and NPIs)
We consider a Bayesian model comparison, based on several variants of the model described above, which differ in their contact structure.
In the absence of any NPI, infective contacts are described by (bare) contact matrices C, such that C_ij is the mean number of contacts per day with individuals in cohort j, for an individual in cohort i. To account for NPIs, we assume that individuals in cohort i have their activities multiplied by a time-dependent factor a_i(t) ≤ 1, so that the mean number of contacts per day during the NPI is changed to a_i(t) C_ij a_j(t). In the absence of any intervention, a_i = 1. Note also, C_ij is a number of contacts, but the quantity C̃_ij that appears in (6.1) is a contact rate; hence we take

C̃_ij(t) = η a_i(t) C_ij a_j(t),

where η is a basic rate of 1 day^(−1). Our numerical implementation measures time in weeks, so η = 7 week^(−1). We consider three possibilities for the bare contact matrix C, see figure 3 and appendix D.3. Two of the choices are the matrices proposed by Prem et al. [63] and Fumanelli et al. [64], which are both based on the POLYMOD study [65]. The third is a simple proportional mixing assumption, which is that individuals meet each other at random,

C_ij = c_0 N_j / V,

where c_0 is a constant chosen to give a total number of contacts comparable to that of [63]. We refer to the resulting contact matrices (and the associated model variants) as C_F, C_P, C_M for the models of Fumanelli et al. [64], Prem et al. [63], and proportional mixing, respectively. We do not distinguish at all between different types of contact (for example, home, work, school); the reasons for this are discussed in §7.1, below. We also consider two possibilities for the NPI parameters a_i(t), as shown in figure 4. These were chosen to mimic the patterns of activity in the UK, based on data published by Google [66]. The first possibility is a step-like NPI, with a linear decrease from a_i = 1 to a_i(t) = a_i^F over a time period W_lock, after which a_i(t) remains constant at a_i^F.
The mid-point of the step-like decrease is at time t_lock; the parameters of the NPI are t_lock, W_lock and the various a_i^F. The second possibility is an NPI-with-easing: it involves the same step-like decrease, followed by a linear increase whose end point (at the end of the period considered) is controlled by an additional lockdown-easing parameter r (larger values correspond to more contacts). We emphasize that the Google data informed the functional forms chosen for a_i(t), but all numerical parameters in this function are inferred. Priors and further model details are given in appendix D.2.
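The step-like activity factor a_i(t) described above can be written as a clipped linear ramp. The sketch below assumes this piecewise-linear form; the parameter values are illustrative, not inferred values:

```python
import numpy as np

def activity_step(t, t_lock, w_lock, a_f):
    """Step-like NPI: linear ramp from 1 down to a_f, centred at t_lock
    with width w_lock, then constant (illustrative functional form)."""
    ramp = (t_lock + w_lock / 2 - t) / w_lock   # 1 at ramp start, 0 at ramp end
    return np.clip(a_f + (1 - a_f) * ramp, a_f, 1.0)

t = np.linspace(0, 10, 101)                     # time in weeks
a = activity_step(t, t_lock=2.5, w_lock=1.0, a_f=0.3)
print(a[0], a[-1])
```

The NPI-with-easing variant would append a linear increase after the step, with the end value set by the easing parameter r; that extension is a straightforward modification of the same clipped-ramp construction.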

COVID-19 in England and Wales: results
We have applied the methodology of §4 to the models of §6. The total number of inference parameters (for initial conditions, epidemiological parameters and contact structure) is either 46 or 47, depending on the NPI. This number could be reduced by considering a smaller number of age cohorts, but we retain them here to illustrate that the methodology is applicable in models of this complexity.
As a baseline, we perform inference using data for the seven-week period 6 March to 24 April 2020, with the remaining three weeks of our data period used to assess the resulting Bayesian forecast. For this model, converged estimates of MAP parameters are available within a few minutes on a desktop computer. For posterior sampling, we use the emcee package [54] with a number of walkers equal to twice the number of inferred variables. The estimated autocorrelation times of the underlying Markov chains were in the range 3000-5000, and sampling runs were in the range 3 × 10^4 to 10^5 iterations to ensure convergence, with the initial one-third of samples discarded to allow for burn-in. Each sampling run took several days on a single desktop workstation.
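As a self-contained stand-in for the affine-invariant ensemble sampler used here, the sketch below runs a plain random-walk Metropolis chain on a toy four-parameter posterior and discards the first third as burn-in; the target density, step size and chain length are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for a 4-parameter log-posterior (a narrow Gaussian);
# the paper instead samples its CLT posterior with the emcee package.
ndim = 4
target_mean = np.array([0.5, 1.0, 2.0, 0.1])

def log_post(theta):
    return -0.5 * np.sum(((theta - target_mean) / 0.05) ** 2)

# Random-walk Metropolis, discarding the first third as burn-in
n_steps, step = 20000, 0.03
theta = target_mean * 0.8              # deliberately offset start
chain = np.empty((n_steps, ndim))
lp = log_post(theta)
for i in range(n_steps):
    prop = theta + step * rng.standard_normal(ndim)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:   # Metropolis accept/reject
        theta, lp = prop, lp_prop
    chain[i] = theta
samples = chain[n_steps // 3:]
print(samples.mean(axis=0))
```

In practice the run length should be judged against the estimated autocorrelation time, as in the text, rather than fixed in advance.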

7.1. Step-like NPI

The model variants behave almost identically and fit the data used for inference. However, the forecasts are not accurate. We attribute this primarily to lockdown easing: this is neglected within the model shown (which has r = 0), so an accurate forecast should not be expected. Forecasting is explored further in §7.3, including more realistic models with r > 0. Figure 6a shows inferred values of latent (unobserved) compartments, using a deterministic nowcast with parameters from the posterior. As expected, they show a rise and fall in the number of infected individuals, with different stages having their peaks at different times. An important set of (age-dependent) parameters are the β_i, which determine the susceptibility to infection. Figure 6b shows inferred values of β_i for the C_F model, including the range of posterior samples and the posterior mean, which are compared with the MAP estimate and the prior. The inferred values of β are quite far from the prior mean; these parameters are very uncertain a priori. (This uncertainty is incorporated by using lognormal priors for the β_i with a standard deviation of one half of the mean, see appendix D.2.) The main feature in the inferred result is the large value of β_i for the oldest cohort (75+). The inferred values of other parameters are discussed in appendix D.4; they are generally consistent with the prior assumptions.
To rationalize the inferred β, it is easily verified that, for a model with the assumed contact structure, CFR and α, the inferred value of β for the oldest cohort must be larger than for all other cohorts, in order to capture the age-dependence of deaths in England and Wales, which is very skewed towards the older age groups. There are at least two reasons why inference might lead to such a large β: either the assumed CFR (or α) has too weak an age-dependence, which is being compensated by an age-dependent β; or the contacts of elderly individuals are indeed more likely to result in infection, perhaps for medical reasons, or because of increased time in high-risk environments (such as hospitals). This distinction could be settled if accurate data for numbers of infections were included in the analysis, but it is not possible with the data considered here. The results of [11] suggest that susceptibility is age-dependent, but the dependence is weaker than we infer, indicating that both effects are in play. Figure 6c shows the (MAP) inferred β parameters for the models with different contact matrices. While the trend is similar, there are significant differences. Nevertheless, the behaviour of the inferred models is almost identical, recall figure 5c. The reason is that the behaviour of the model is dominated by the infection rates of (6.1): different contact matrices can still lead to similar model behaviour, because of the freedom to adjust the β_i. In this sense, our results can be interpreted as inference of an 'infective contact matrix' whose elements are β_i C_ij. It is notable from figure 3 that the C_P contact matrix includes some large differences between cohorts with similar ages, particularly in contacts with the 75+ cohort. These can be traced back to the finite dataset of the original POLYMOD study [65].
For the C_P model variant, these large fluctuations lead to an inferred β_i with a complicated dependence on age, for cohorts in the 60+ group. In the C_F variant, the dependence on age is much smoother, both for contacts and for β. Compared with the contact matrices that are based on POLYMOD [63,64], the C_M variant has many more contacts for older individuals, so the inferred β is lower in the older cohorts.

Fisher information matrix and model evidence
We now discuss the FIM (4.3) for the C_F model variant with step-like NPI. Two items of particular interest are parameters θ_a whose inferred values are very sensitive to the data, and soft modes of the parameter space along which the likelihood varies slowly. These modes indicate aspects of the model that are mostly determined by the prior. The sensitivities of (4.4) provide useful information on the first point. Figure 7 shows the results. The parameters most sensitive to the data are the rates γ_E, γ_A and γ_1, the susceptibility of the oldest age cohort β_75+, and the time of lockdown t_lock, consistent with the discussion so far. These parameters have s_a > 100, indicating that changes of order 1% in their values are sufficient to change the log-likelihood by an amount of order unity.
Soft directions around the MAP parameters, in which the model behaviour is expected to change very little, do exist. They arise from small eigenvalues of the FIM, and the corresponding eigenvectors. One example of such a soft mode is discussed in appendix D.4. The existence of soft modes speaks in favour of a Bayesian approach, in that prior information about the disease is used to fix those parameters which are not determined by the data. This makes best use of all information sources, including expert-derived priors.
We have also computed the evidence for these models, see figure 8. The C_F and C_P contact matrices lead to similar log-evidences, with the C_P variant higher by around 3 units (we use natural logarithms throughout). The contact matrix with proportional mixing leads to a log-evidence that is smaller by around 8 units. We conclude that this model can still fit the data with reasonable accuracy, but the inference computation is sensitive enough to infer that the contact structure has some assortativity. Also shown is the posterior mean of the log-likelihood, E_post[log L], which is the negative of the deviance, recall (4.7). The similar behaviour of the evidence and deviance indicates that the differences between the models are primarily in the quality of the fit, rather than the amount of fine-tuning required for the parameters. Given the very naive assumptions of the proportional mixing model, we argue that the difference of 8 units in log-evidence should be regarded as a mild effect. Our conclusion is that the inference computation is not extremely sensitive to prior assumptions on the contact structure. Based on this result, it seems that more detailed modelling of contacts (for example, separation by work/home/school) will have relatively little impact on the quality of inference, given the very large uncertainties within the model about the values of β_i.

NPI-with-easing
We now consider the NPI-with-easing. Since the behaviour with different contact matrices is very similar, we restrict to the C_F model variant. Figure 9a is a deterministic forecast analogous to figure 5c; it shows how the easing parameter r leads to increased uncertainty in the forecast, in a way that is more consistent with the data. By contrast, figure 9b shows a stochastic forecast as defined in §4.4. This accounts for stochasticity in the epidemiological dynamics; it automatically matches the data within the inference window. The results of the two kinds of forecast are similar, indicating that the dominant source of uncertainty comes from the model parameters. To explore the effects of lockdown easing in more detail, we consider the effect of increasingly long inference windows, always comparing the model forecasts for the same 10-week period. Results are shown in figure 10. The agreement between the inferred model and the data increases, as expected; we find that this model can accurately fit the data, with reasonable parameter values.
For the seven-week inference window, figure 11 shows that the distribution of r is still close to the prior. This is consistent with the result of figure 8, that the evidence of the variant with easing is comparable to the variants with step-like-NPI. That is, the additional parameter r leads to a mild improvement in the fit to the data, and fine-tuning of its value is not required.
When considering longer time windows, we note that deaths are a lagging indicator of the number of cases, which means that r is still not fully determined by the data. That is, these results still depend significantly on the prior (for details see appendix D.2). Nevertheless, increasing the inference window causes the posterior distribution of r to shift towards larger values, leading to improved agreement with the data. There are also significant differences in these posterior distributions, for example if 9 or 10 weeks of data are used for inference; this limits the robustness of the forecasting and indicates a possible tendency to overfitting. We attribute this primarily to the simple linear easing assumed in our NPI. Most other parameters depend weakly on the period used for inference (see appendix D.4). The posterior distributions for ν_L in figure 11 are also similar to the prior, showing that this parameter is weakly identifiable. As noted above, an alternative modelling hypothesis would be that ν_L = 0, as in [8,10-12], consistent with the results summarized in [62]. In this work, that possibility is suppressed by the (lognormal) prior for ν_L. The expert judgement of [62] might be used to refine the model by adjusting this prior; this illustrates the adaptability of the Bayesian framework.
In evaluating these results, we note that both the model and the likelihood assume a well-mixed population. In practice, individuals have correlated behaviour, which can be expected to enhance stochastic fluctuations. For this reason, it is likely that the functional CLT underestimates the variance of the data, given the model. This can lead to an overfitting effect. There are also uncertainties in the data that are not accounted for in the likelihood, such as possible under-detection of COVID-related deaths in the early period of the epidemic. Recalling that the deceased population in the model includes only those individuals who were diagnosed with COVID-19, such an under-detection might be modelled (in this framework) by a time-dependent CFR.

Discussion
We have described a methodology for inference and forecasting in epidemiological compartment models, where all stochastic aspects of disease propagation and measurement are modelled on an equal footing. The likelihood is justified from first principles and derived directly from the model, which means that it can be computed automatically from the model definition, given appropriate data.

Example model
This methodology has been used to calibrate a model for the COVID-19 pandemic in England and Wales, based on death data. We have compared models with different contact structures, showing that fine details of the contact matrix have very little effect on model behaviour and forecasts. Indeed, the model with proportional mixing behaves very similarly to those with contact matrices derived from the POLYMOD dataset [63-65]. This may be surprising at first glance, but the fact that the β_i parameters are inferred separately for each cohort means that the model has enough flexibility to infer how many infective contacts are made by each group. More specifically, it is the infection rate constant K of (4.2) that determines the model behaviour, so one sees from (6.1) that differences in contact matrices can be partially compensated by changes in β. (The compensation is only partial because, while β_i controls the relative numbers of infections, the contact matrix also determines the assortativity of mixing.) In contrast to the details of mixing among cohorts, modelling assumptions about the time-dependence of the contact structure have a significant impact on forecasting, as one should expect. This is illustrated by the dependence of the behaviour on the easing factor r.
Within the time period considered, the model gives forecasts that are reasonably accurate and robust. However, we have identified a possible tendency to overfitting, some of which may be due to the well-mixed assumption that is used in the likelihood. Another common approach uses negative binomial distributions in the likelihood [10,11,18]; this corresponds to a larger variance for the numbers of deaths (overdispersion), compared with the CLT. It would be interesting to consider the inclusion of an overdispersion factor in the likelihood used here, as a way of accounting for correlations in the contact structure.
In terms of model calibration, the main limitation of this study is the fact that we do not use data for case numbers, which means that the CFR cannot be inferred. In the UK, the rates and policies for testing for COVID-19 have had complex time-dependence, which means that robust estimation of case numbers is challenging. Incorporation of a time-dependent testing capacity into this framework is a direction of ongoing research. The extension of this framework to geographically resolved models is also under active investigation.

Methodology: strengths and weaknesses
The example models of §5 and §6 show that the methodology is effective in population-level models of large epidemics. These are situations in which the CLT approximation to the likelihood is expected to be valid, so they should fall within the applicability of these methods. We repeat that for models with small populations (where demographic noise becomes very large), models that do not rely on the CLT approximation should be preferred [37-39]. Compared with inference methods with deterministic disease dynamics [10-12], the approach is somewhat more expensive, because of the requirement to compute the CLT covariance for the trajectory. Still, the examples show that relatively complicated models are within reach.
For the simple model of §5, accurate parameter estimation is possible when the population is very large, based on a single stochastic trajectory. For smaller populations, experiments on single trajectories show that the posterior uncertainty grows, which is again consistent with theoretical expectations. As this happens, the true model parameters remain inside the posterior credible intervals, as they should. Hence, while the posterior distributions may suffer some bias due to deviations from CLT behaviour, they are still reasonable estimates of parameter uncertainty.
For the model of England and Wales in §6, the scheme infers parameters that fit the data, and the forecast of figure 10 indicates that the posterior distributions are reasonably accurate, even when considering more than 40 parameters. Given the modelling assumptions (particularly the well-mixed assumption, without over-dispersion), this result shows that the method has promise. Further work on more detailed and accurate models [40] will provide new and stringent tests of its applicability.

Data accessibility. The example models of §5 and 6 were analysed using the PyRoss library [31]. The analysis codes and the resulting data are available at https://github.com/rljack2002/infExampleCovidEW (this includes the example with synthetic data, and the example model for England and Wales). These data will also be available at https://doi.org/10.17863/CAM.72839.
where B(t, u) is a square matrix. Equation (A 2) can be solved (numerically) for Σ. Second, let 〈·〉_{u(s)} denote an average, conditional on the value of u(s). Taking the first moment of (2.7), one arrives at a linear equation for the average value of u; hence for t ≥ s one has

〈u(t)〉_{u(s)} = U(s, t) u(s),

where the matrix U(s, t) is the time-evolution operator, which solves the associated linear differential equation with initial condition U_ij(s, s) = δ_ij. This equation is readily solved (numerically) for U.
The third result is that the covariance of u between two times s, t can be obtained (for t ≥ s) by composing the covariance Σ(s) with the propagator U(s, t); that is,

Cov(u(s), u(t)) = Σ(s) U(s, t)^T.    (A 6)

To exploit this last result, consider the vector obtained by concatenating the full state of the system over the observed time points, analogous to (3.3),

X = (x(t_1), x(t_2), . . .).    (A 7)

Its mean is clearly X̄ = (x̄(t_1), x̄(t_2), . . .). Now define a (scaled) deviation from the mean as D = (X − X̄)√V, and denote the covariance of this vector by G̃. From (A 6), this symmetric matrix is formed of blocks that depend on Σ and U:

G̃_{mn} = Σ(t_m) U(t_m, t_n)^T for t_n ≥ t_m, with G̃_{nm} = G̃_{mn}^T.    (A 8)

Since (2.7) is an Ornstein-Uhlenbeck process, it can be shown additionally [44-46] that the distribution of D is asymptotically Gaussian, with the given covariance G̃. Since the observed data are related linearly to X according to (3.2) and (3.3), one then obtains (3.7), with the covariance of Δ given by the corresponding linear transformation of G̃. Since all elements of G̃ can be evaluated, this allows computation of the likelihood (3.7).

Appendix B. Implementation details for inference

B.1. Fisher information matrix
For a multivariate normal distribution, such as the likelihood obtained in §3, with mean vector Ȳ(θ) and covariance matrix G(θ), the elements of the FIM (4.3) are [67]

I_mn = (∂_m Ȳ)ᵀ G⁻¹ (∂_n Ȳ) + ½ tr(G⁻¹ (∂_m G) G⁻¹ (∂_n G)),

where ∂_m denotes the derivative with respect to θ_m. This form is advantageous since its computation only requires first-order derivatives, which are estimated as finite differences.
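A minimal sketch of this computation (our own code, not the PyRoss implementation) takes user-supplied maps θ → Ȳ(θ) and θ → G(θ), forms the derivatives by central finite differences, and evaluates the FIM formula above:

```python
import numpy as np

def fim_multivariate_normal(mean_fn, cov_fn, theta, eps=1e-5):
    """FIM of a multivariate normal likelihood,
    I_mn = d_m(Ybar)^T G^{-1} d_n(Ybar) + 0.5 tr(G^{-1} d_m(G) G^{-1} d_n(G)),
    with central finite differences for the first derivatives.
    mean_fn / cov_fn map a parameter vector to Ybar(theta) and G(theta);
    the names are ours, for illustration only."""
    p = len(theta)
    Ginv = np.linalg.inv(cov_fn(theta))
    dY, dG = [], []
    for m in range(p):
        e = np.zeros(p); e[m] = eps
        dY.append((mean_fn(theta + e) - mean_fn(theta - e)) / (2 * eps))
        dG.append((cov_fn(theta + e) - cov_fn(theta - e)) / (2 * eps))
    I = np.zeros((p, p))
    for m in range(p):
        for n in range(p):
            I[m, n] = dY[m] @ Ginv @ dY[n] \
                + 0.5 * np.trace(Ginv @ dG[m] @ Ginv @ dG[n])
    return I
```

For the textbook case of n i.i.d. observations from N(μ, σ²), this reproduces the familiar entries I_μμ = n/σ² and I_σσ = 2n/σ², which is a convenient sanity check.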

B.2. Evidence computation
We use a thermodynamic integration method [68,69] to compute the log-evidence

log P(Y) = log ∫_D e^{A(θ)} P(θ) dθ,

where A(θ) = log L(θ) is the log-likelihood, and the domain D is the support of the prior P(θ). Compared with alternatives (for example, nested sampling [70]), thermodynamic integration allows robust estimation of convergence, and of its numerical uncertainties. We summarize the method, which is based on an integration path from a tractable (and normalized) distribution P̃ to the posterior. To estimate this integral, we take a sequence 0 = z_0 < z_1 < ⋯ < z_n = 1 and we use trapezoidal quadrature,

f(1) − f(0) ≈ Σ_{i=0}^{n−1} ½ (z_{i+1} − z_i) [f′(z_i) + f′(z_{i+1})].

For each quadrature point, f′(z_i) is estimated by an MCMC computation of the expectation value (B 5). These are independent MCMC estimates, which facilitates analysis of numerical uncertainties.
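The following toy sketch (our own Metropolis sampler, not the PyRoss implementation) illustrates the scheme: for each z on the inverse-temperature grid, it samples p_z ∝ exp[(1 − z) log P̃ + z log π], where π is the unnormalized posterior, estimates f′(z) = ⟨log(π/P̃)⟩_z, and applies the trapezoidal rule:

```python
import numpy as np

def log_evidence_ti(log_ref, log_target, dim, zs, n_steps=20000, step=1.0, seed=0):
    """Thermodynamic-integration sketch.  log_ref is the log of a
    NORMALIZED reference density (the role of P-tilde); log_target is the
    unnormalized log-posterior A(theta) + log P(theta).  For each z we run
    a random-walk Metropolis chain targeting
    p_z ~ exp((1 - z) log_ref + z log_target), estimate
    f'(z) = <log_target - log_ref>_z, and integrate over z by the
    trapezoidal rule.  All names here are ours, for illustration."""
    rng = np.random.default_rng(seed)
    fprime = []
    for z in zs:
        th = np.zeros(dim)
        lp = (1 - z) * log_ref(th) + z * log_target(th)
        samples = []
        for i in range(n_steps):
            prop = th + step * rng.standard_normal(dim)
            lp_prop = (1 - z) * log_ref(prop) + z * log_target(prop)
            if np.log(rng.random()) < lp_prop - lp:   # Metropolis accept
                th, lp = prop, lp_prop
            if i >= n_steps // 5:                     # discard burn-in
                samples.append(log_target(th) - log_ref(th))
        fprime.append(np.mean(samples))
    # trapezoidal quadrature over the inverse-temperature grid
    return sum(0.5 * (zs[i + 1] - zs[i]) * (fprime[i] + fprime[i + 1])
               for i in range(len(zs) - 1))
```

With a standard-normal reference and an unnormalized shifted Gaussian target, the exact log-evidence is ½ log(2π), which the quadrature recovers to within MCMC error.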
To select a suitable Ã, the idea is that the closer P̃ is to the posterior, the shorter the integration path, and the easier the computation. However, P̃ must be a normalized distribution on D. A suitable choice for P̃ is therefore a truncated Gaussian approximation to the posterior, around the MAP parameters θ*. Let H be the Hessian matrix of the (negative) log-posterior, whose elements are

H_mn = −∂²/(∂θ_m ∂θ_n) log P(θ|Y) |_{θ=θ*}.

Then the truncated Gaussian approximation of the posterior distribution is

P̃(θ) ∝ exp[−½ (θ − θ*)ᵀ H (θ − θ*)]

for θ ∈ D, and P̃ = 0 otherwise. Rejection sampling is used to sample this distribution and to simultaneously obtain its normalization constant, so that Ã(θ) = log(P(θ|Y)/P̃(θ)) can be computed, consistent with (B 3). Hence (B 7) can be computed.
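The rejection-sampling step can be sketched as follows (our own helper, not PyRoss's API): draw from the untruncated Gaussian and keep points inside D; the acceptance fraction then estimates the Gaussian mass of D, i.e. the normalization constant of the truncated distribution:

```python
import numpy as np

def sample_truncated_gaussian(mean, cov, in_domain, n_samples, seed=0):
    """Draw from N(mean, cov), keeping only points inside the prior
    support D (the in_domain predicate).  The acceptance fraction
    estimates the Gaussian mass of D, i.e. the normalization constant
    of the truncated Gaussian.  Names are ours, for illustration."""
    rng = np.random.default_rng(seed)
    kept, total = [], 0
    while len(kept) < n_samples:
        x = rng.multivariate_normal(mean, cov)
        total += 1
        if in_domain(x):
            kept.append(x)
    return np.array(kept), len(kept) / total   # samples, est. normalization
```

For a standard normal truncated to the half-line x > 0, the estimated normalization is close to ½, as expected.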

B.3. Conditional nowcast
To sample latent compartments during the inference period, we use the CLT for u discussed in §2.4. It is convenient to assume that the latent populations are to be inferred at the times t_m where data were collected. (This assumption is easily relaxed, at the expense of some heavier notation.) Hence, the populations of the latent compartments are encoded in the vector X of (A 7). Given the model parameters, the distribution of X obeys a CLT analogous to (3.7), with G̃ as in (A 8). Since this distribution is (multivariate) Gaussian, and the data depend linearly on X, it is straightforward to condition on the data and obtain a Gaussian distribution for the latent compartments, which can then be sampled.
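The conditioning step is standard Gaussian algebra. A minimal sketch (our own code and naming, with the observation map written as a generic matrix F) is:

```python
import numpy as np

def condition_gaussian(mu, G, F, y):
    """Condition a Gaussian X ~ N(mu, G) on noise-free linear
    observations F X = y, as when the data are a linear map of the
    latent state vector X.  Returns the mean and covariance of
    X | F X = y.  (Sketch with our own naming, not the PyRoss API.)"""
    S = F @ G @ F.T
    K = G @ F.T @ np.linalg.inv(S)      # "gain" matrix
    mu_c = mu + K @ (y - F @ mu)
    G_c = G - K @ F @ G
    return mu_c, G_c
```

For example, conditioning a standard bivariate normal on x_1 + x_2 = 2 gives conditional mean (1, 1) and conditional covariance [[0.5, −0.5], [−0.5, 0.5]]; the conditional distribution can then be sampled with any multivariate-normal generator.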
Appendix C. Details and additional results for simple SEIR model

C.1. Priors (including use of linearized dynamics)

For the example of §5, the Gaussian prior for β was described in the main text. In addition, we note that all priors are truncated to avoid negative values for parameters, as well as very large values. (This truncation has very little effect on the inferred parameters.) Priors are also required for S_0, E_0, I_0, which are compartment populations at the start of the inference period. Denoting this time by t_0, this means that the vector x(t_0) must be determined from the inference parameters θ (together with the observed value of the D compartment).
A convenient estimate of x(t_0) is available by linearizing the average dynamics (2.5) about the state x_S where all individuals are susceptible. The behaviour of the resulting equation (at early times) is dominated by the largest eigenvalue of the matrix J(0, θ, x_S), as obtained from (2.8). The corresponding eigenvector dominates the evolution of the early stages of the epidemic, up to transient effects of the initial condition, which are controlled by the smaller (sub-dominant) eigenvalues. A suitable baseline estimate for the initial condition is then

x_lin(t_0) = x_S + κ v*_θ,    (C 1)

where v*_θ is the dominant eigenvector (which depends on the epidemiological parameters), and κ is a parameter. The normalization of the eigenvector is Σ_α |v*_{θ,α}| = 1, where v*_{θ,α} is the α-th element of v*_θ; this means that the non-susceptible (infected + recovered + deceased) fraction of the population at t = t_0 is approximately one half of κ.
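Computing this baseline estimate only requires the dominant eigenvector of the Jacobian. A minimal sketch (our own helper; the Jacobian passed in would come from (2.8), and the toy matrix in the usage example is ours) is:

```python
import numpy as np

def linearized_initial_condition(J, kappa):
    """Sketch of (C 1): take the eigenvector of the Jacobian
    J(0, theta, x_S) with the largest real eigenvalue, normalize so
    that sum_a |v_a| = 1, and scale by kappa.  The deviation of the
    state from x_S is then kappa * v.  (Our own code, not PyRoss's.)"""
    vals, vecs = np.linalg.eig(J)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    if v[np.argmax(np.abs(v))] < 0:     # fix the arbitrary eigenvector sign
        v = -v
    v = v / np.abs(v).sum()
    return kappa * v
```

The returned vector is proportional to the dominant eigenvector, so applying J to it simply rescales it by the leading eigenvalue.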
In the example of §5, this linearization is used to fix the prior for the initial condition parameters S_0, E_0, I_0: the eigenvector v*_θ is computed for the true model epidemiological parameters. The fraction of deceased individuals at t_0 is known, which is used to fix κ, leading to estimates for S_0, E_0, I_0. These estimates are used for the prior means. The prior distributions are taken to be Gaussian: the prior standard deviations for E_0, I_0 are one-third of their means; the prior standard deviation for S_0 is equal to that of E_0.

Figure 12 shows results of an inference computation for a total population Ω = 10^4, similar to figure 1 of the main text. One sees that the numbers of daily deaths are mostly in single digits, so the CLT approximation (3.7) is not expected to be fully accurate. Still, the MAP trajectory provides a very reasonable fit to the data. The posterior distribution of β is significantly narrower than the prior, and both the posterior mean and the MAP value are close to the true value. The posterior distributions for the initial conditions follow the prior quite closely, indicating that the data are not sufficient to identify their values individually. However, the inference machinery does find significant posterior correlations among the initial conditions, even if their marginals are broad. This shows that the data do constrain these parameters significantly.

Figure 12 also shows that the method infers significant posterior correlations between the parameters (which are independent under the prior). There is a correlation between E_0 and I_0 because the model is more sensitive to the total number of initial infections E_0 + I_0 than to the difference between E_0 and I_0. Also, models with similar likelihood to the true model can be obtained by assuming a significant recovered population R_0 at time t_0; this reduces the susceptible population, which can be compensated by an increased β.
The resulting models also provide reasonable fits to the data.

C.2. Additional results (effect of smaller populations)
Finally, we consider the possibility that the approximate likelihood (3.7) might lead to biased (or misleading) estimates of parameter values when populations are small (so the CLT breaks down). A Bayesian analysis of bias in this context would consider the effect on inference of providing increasing quantities of data. However, such a situation is not realistic for epidemiological applications in which one typically has data from a single epidemic (or outbreak), to be used for inference and forecasting.
To explore this situation, we mimicked a practical application of the method, as follows. We repeated the numerical experiment of figure 12 with 16 independent sets of synthetic data. For these computations, the prior mean for β was set to its true value, to avoid trivial bias in its posterior estimate. For each experiment, we computed the posterior mean of β, and its 95% credible interval (CI), from the sampled posterior histogram.

royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 8: 211065

From these 16 experiments, we took the average of the posterior mean and the average CI, which are shown in figure 13, together with similar experiments with larger populations, up to Ω = 10^6. The average posterior mean is close to the true value, even for small populations; this indicates that the approximate likelihood does not lead to large systematic errors in estimates of this parameter. Instead, the main effect of reducing the population is that the inferred CI on β becomes increasingly wide. This is expected because the sharpness of the approximate likelihood (3.7) grows in proportion to Ω, leading to sharp parameter estimates for large populations, but large uncertainty when numbers are smaller. The same message is apparent from figure 12: reduction of the population leads to broad posterior histograms, with the true value well inside the credible range.
Of course, these results do not establish that the approximate likelihood (3.7) will not lead to biased (or misleading) estimates in some situations, because of breakdown of the CLT in small populations. On the other hand, these results serve as a stress-test for the method: it does not lead to systematic errors or misleading estimates of uncertainty in this example, even when daily observations are in single digits, outside the strict range of validity of the CLT.
Appendix D. Details and additional results for the example model of COVID-19

D.1. ODEs for average dynamics

For the (stochastic) model defined in §6, the deterministic equations for the mean (2.5) can be written in terms of the compartment populations. For this section alone, let S_i be the average population of susceptible individuals in cohort i, and similarly for all other classes. Our notation omits the dependence of these populations on time, for compactness. Using dots to indicate time derivatives, (2.5) becomes a closed system of ODEs for the compartment populations, in which ā_i = 1 − a_i and f̄_i = 1 − f_i, and λ_i is given by (6.2).
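To illustrate the structure of such mean-field equations, the sketch below implements a reduced age-structured SEIR right-hand side in the spirit of (2.5). This is NOT the full S/E/A/I^(k)/R/D model of §6; the compartment structure and parameter names here are ours, chosen only to show how the force of infection λ_i = β_i Σ_j C_ij I_j / N_j (cf. (6.2)) couples the cohorts:

```python
import numpy as np

def seir_mean_rhs(t, y, beta, gamma_E, gamma_I, C, N):
    """Illustrative mean-field ODE right-hand side for a reduced
    age-structured SEIR model (a sketch, not the model of §6).
    y stacks (S_i, E_i, I_i) over the M cohorts; C is the contact
    matrix, N the cohort populations, beta the per-cohort infection
    rates, gamma_E / gamma_I the E -> I and I -> removed rates."""
    M = len(N)
    S, E, I = y[:M], y[M:2*M], y[2*M:3*M]
    lam = beta * (C @ (I / N))          # force of infection per cohort
    dS = -lam * S
    dE = lam * S - gamma_E * E
    dI = gamma_E * E - gamma_I * I
    return np.concatenate([dS, dE, dI])
```

This right-hand side can be passed to any standard ODE integrator; the full model of §6 would add the A, I^(2), I^(3), R and D classes, with the a_i and f_i branching fractions, in the same pattern.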

D.2. Parameters, priors and initial conditions
This section gives additional details of the parameters in the model of §6, and the priors used for inference. The infectiousness parameters ν_k are defined relative to the first infectious stage, so ν_1 = 1. (This loses no generality because the ν parameters only appear through the combination β_i ν_k.) In practice, we parametrize ν in terms of a single inference parameter ν_L: we take ν_0 = ν_1 = 1, with ν_2 = ν_3 = ν_L. The early stages of a COVID-19 case are much more infectious than the later stages, so ν_L < 1. As discussed in §6, a common modelling assumption [8,10–12] is that the infectious period is (approximately) independent of whether the individual is symptomatic or not, which corresponds to ν_L = 0. This approach is supported by some medical data [61,62]. For this illustrative computation, we retain ν_L as a parameter: its prior mean is 0.1, which means that most infections happen in the early stages of the disease.

The γ and ν parameters are assumed independent of age, while β, α, f, a are age-dependent. The fixed (age-dependent) CFR was estimated from the data for 18–24 March in table 4 of [60], by fitting an exponential dependence for ages 40–90 and then extrapolating this function to younger age groups. This leads to f = A e^{age/ξ} with A = 1.43 × 10⁻⁴ and ξ = 13.1 years. The fraction of asymptomatic/paucisymptomatic cases is taken from table 2 of [60], using linear interpolation to obtain values for the cohorts considered here. The relevant numbers are shown in figure 14. As discussed in the main text, these numbers are subject to considerable uncertainty, but the resulting model is flexible enough to fit the data used here. It would be valuable to incorporate additional data to constrain these variables, for example testing data, which can provide information on the number of cases. The priors for the inferred epidemiological parameters are summarized in table 1.
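The fitted CFR curve is a one-line formula; a small sketch (function name ours) evaluates it with the fitted constants above:

```python
import numpy as np

def cfr_exponential(age, A=1.43e-4, xi=13.1):
    """Age-dependent CFR f = A * exp(age / xi), with A and xi as fitted
    in appendix D.2 (exponential fit for ages 40-90, extrapolated to
    younger cohorts).  age may be a scalar or an array of ages in years."""
    return A * np.exp(np.asarray(age, dtype=float) / xi)
```

For example, this gives f ≈ 1.4 × 10⁻⁴ at age 0, rising to roughly 3% at age 70, consistent with the strong age dependence of COVID-19 fatality.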
The γ parameters are fixed by the disease itself and can be constrained based on medical data; see e.g. [71] for a discussion. We take Gaussian priors for these parameters, with standard deviation 10% of the mean. Other parameters, like β_i, ν_L, a_i, r, are associated with disease transmission and are much less well characterized, hence the use of less informative priors, which are lognormal with standard deviation 50% of the mean. (For positive parameters with large uncertainty, the lognormal prior is much more heavy-tailed than a Gaussian with the same standard deviation, while still penalizing very small values.) The lockdown parameters t_lock and W_lock have a prior standard deviation of 1 day.
To determine the initial condition for the model in terms of the inference parameters, we use the linearized dynamics to obtain (C 1). As a first estimate, it is reasonable to take x(0) = x_lin, with κ taken as an inference parameter. In practice, our initial condition is obtained by modifying this x_lin. First, the D (deceased) compartments are initialized from the observation data, which overrides the value from the dominant eigenvector. Second, the E, A, I^(k) compartments for the oldest cohort are determined by a separate procedure (detailed in the next paragraph), which allows extra flexibility in the inference. Finally, the S compartments (for all age cohorts) are chosen to enforce the (fixed) total population of each cohort. (Using (C 1) automatically ensures the correct cohort populations, but modifying x(0) from x_lin(0) means that compensation is required to enforce this constraint.)

The modified initial condition for the oldest cohort is based on a hypothesis that infections started in younger age groups, before spreading into the elderly population. The inferred result is consistent with such a hypothesis. Since initial conditions are unknown a priori, we take broad prior distributions, which are lognormal with standard deviation one half of the mean. The prior mean values were chosen based on preliminary computations, to obtain reasonable agreement with deaths in the first few weeks. Denoting these mean values by μ, we take μ_κ = 5 × 10⁻⁴ and, for the oldest cohort, (μ_E, μ_A, μ_{I1}, μ_{I2}, μ_{I3}) = (2000, 1200, 300, 60, 40); see also figure 16 below.

D.3. Contact matrices
The contact matrices for the C^P and C^F model variants are based on [63,64], as we now explain. On general grounds, one expects contacts to obey a reciprocal relation: the matrix Q with elements Q_ij = N_i C_ij, where N_i is the population of cohort i, should be symmetric,

Q_ij = Q_ji.    (D 2)

For C^P, the contact matrix is taken from [63], by summing the contributions from home/work/school/other. It is notable (e.g. figure 3) that these data do not satisfy (D 2); this can be traced back to the reporting of contacts by the participants of [65].
For C^F, the data in [64] are provided as a (non-normalized) estimate of Q, based on n_Q single-year age cohorts. As for C^P, we sum the contributions from different environments to obtain a single matrix. Let A_i be the set of single-year age cohorts corresponding to the (5-year) cohort i, and denote the reported estimate of Q by Q′. Then we take

C^F_ij = (1/(χ N_i)) Σ_{a∈A_i} Σ_{b∈A_j} Q′_ab.

The constant χ is included because Q′ is not normalized; we take χ = 3M n_Q so that the typical numbers of contacts are comparable with [63]. This scaling is somewhat arbitrary, but errors/uncertainties in this factor can be compensated by rescaling the β parameters of the model. The matrix Q′ is symmetric, so the resulting contact matrices obey (D 2).
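The aggregation and the reciprocity check can be sketched as follows. This is our own reconstruction of the procedure described in the text (in particular the placement of χ and N_i in the aggregation formula is an assumption), not the PyRoss implementation:

```python
import numpy as np

def aggregate_contact_matrix(Qp, groups, N, chi):
    """Aggregate a single-year reciprocal-contact estimate Q' into
    coarse cohorts: sum Q' over the index sets A_i, A_j and divide by
    chi * N_i (an assumed placement of the factors, following the text).
    Qp: (n_Q, n_Q) symmetric matrix; groups: list of index arrays A_i;
    N: coarse-cohort populations; chi: overall normalization constant."""
    M = len(groups)
    C = np.zeros((M, M))
    for i, Ai in enumerate(groups):
        for j, Aj in enumerate(groups):
            C[i, j] = Qp[np.ix_(Ai, Aj)].sum() / (chi * N[i])
    return C

def reciprocity_residual(C, N):
    """Maximum deviation from the reciprocity relation (D 2):
    Q_ij = N_i C_ij should be symmetric."""
    Q = N[:, None] * C
    return np.abs(Q - Q.T).max()
```

Because the blocks of a symmetric Q′ transpose into each other, the aggregated matrices satisfy (D 2) exactly, which the residual function verifies.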

D.4. Additional results
This section shows additional results from the inference methodology.
We have emphasized that our Bayesian analysis accounts for all sources of uncertainty in the model, including parameter uncertainty and the stochasticity inherent in the compartment model. As a direct measure of this stochasticity, figure 15 shows a conditional nowcast, as defined in §4.4. We show results summed over age cohorts, and for one representative cohort. These results illustrate the fluctuations that are captured by the functional CLT. For the summed data, the fluctuations are small, consistent with the relatively large numbers of individuals. At the level of specific cohorts, the fluctuations are significant.

Figure 16 shows the posterior distributions of parameters for the C^F model with step-like NPI. This complements figure 6b of the main text, where similar results are shown for the parameter β. In most cases, the posterior distributions overlap strongly with the priors. We note, however, that the parameters have significant correlations under the posterior, which are not apparent here since we only show the marginals for individual variables. (For example, the initial rate of epidemic growth depends on a particular combination of the γ and β parameters; this growth rate is tightly constrained by the data, even if individual γ and β parameters remain uncertain.)

For the C^F model variant with step-like NPI, we evaluated the FIM at the MAP parameters. Hence, (i) we gain understanding of how sensitive our model is expected to be to small parameter perturbations, and (ii) we understand whether there are soft directions along which the likelihood depends weakly on the parameters. The sensitivities for model parameters are discussed in §7.2; see figure 7. There are corresponding sensitivities for the parameters that determine the initial condition; see figure 17a. The parameter κ is sensitive to the data, as expected, since it determines the size of the epidemic at early times.
The soft modes of the likelihood are determined by the eigenvalues and eigenvectors of the FIM. The eigenvectors corresponding to small eigenvalues define directions in which the likelihood is expected to change very little. Let v^FIM be an eigenvector of the FIM I with a small eigenvalue; its elements correspond to differences of the parameters θ from their MAP values. It is convenient to normalize these as fractional changes with respect to the MAP, by defining a vector ṽ^FIM with elements

ṽ^FIM_α ∝ v^FIM_α θ*_α,

which is normalized such that Σ_α |ṽ^FIM_α|² = 1. Hence, large values of ṽ^FIM_α indicate parameters that are significantly affected by the soft mode.
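This analysis amounts to an eigendecomposition of the FIM followed by the rescaling above. A minimal sketch (our own code; the fractional-change normalization follows our reading of the text) is:

```python
import numpy as np

def soft_modes(fim, theta_map, n_modes=1):
    """Soft-mode analysis: eigendecompose the (symmetric) FIM, take the
    eigenvectors with the smallest eigenvalues, rescale elements as
    fractional changes v_a * theta*_a, and renormalize to unit length.
    Returns the smallest eigenvalues and the rescaled mode vectors."""
    vals, vecs = np.linalg.eigh(fim)        # eigenvalues in ascending order
    modes = []
    for k in range(n_modes):
        v = vecs[:, k] * theta_map          # fractional-change rescaling
        modes.append(v / np.linalg.norm(v))
    return vals[:n_modes], modes
```

For a diagonal FIM the softest mode is simply the axis with the smallest entry, which makes a convenient check.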
An example is given in figure 17b. If one increases γ_A and reduces γ_1 appropriately, the result is a model with the same (mean) infectious period. A similar effect is obtained by increasing γ_2 and reducing γ_3. Accordingly, the soft mode of figure 17b shows that the model behaviour (and hence the likelihood) is almost unchanged if one increases the parameters γ_1, γ_2 and simultaneously reduces γ_A, γ_3; other model parameters have small contributions to this mode, the next largest element in magnitude being the a^F_i parameter. Due to these degenerate directions in parameter space, a pure maximum-likelihood estimation (MLE) approach to inference of any model displaying soft modes potentially leads to misleading results. Bayesian inference, on the other hand, has the natural ability to remove soft modes, by virtue of the additional information provided by the priors.
Finally, we consider the effects of the inference window. To complement figures 10 and 11, we show in figure 18 the dependence of the inferred parameters on the time period used for inference (figure 11 shows similar results for the parameters r, ν_L). Most parameters depend only weakly on the time window, which indicates a robust forecast.