Accuracy of parameter estimation for auto-regulatory transcriptional feedback loops from noisy data

Bayesian and non-Bayesian moment-based inference methods are commonly used to estimate the parameters defining stochastic models of gene regulatory networks from noisy single cell or population snapshot data. However, a systematic investigation of the accuracy of the predictions of these methods remains missing. Here, we present the results of such a study using synthetic noisy data of a negative auto-regulatory transcriptional feedback loop, one of the most common building blocks of complex gene regulatory networks. We study the error in parameter estimation as a function of (i) number of cells in each sample; (ii) the number of time points; (iii) the highest-order moment of protein fluctuations used for inference; (iv) the moment-closure method used for likelihood approximation. We find that for sample sizes typical of flow cytometry experiments, parameter estimation by maximizing the likelihood is as accurate as using Bayesian methods but with a much reduced computational time. We also show that the choice of moment-closure method is the crucial factor determining the maximum achievable accuracy of moment-based inference methods. Common likelihood approximation methods based on the linear noise approximation or the zero cumulants closure perform poorly for feedback loops with large protein–DNA binding rates or large protein bursts; this is exacerbated for highly heterogeneous cell populations. By contrast, approximating the likelihood using the linear-mapping approximation or conditional derivative matching leads to highly accurate parameter estimates for a wide range of conditions.

Accuracy of parameter estimation for auto-regulatory transcriptional feedback loops from noisy data Zhixing Cao and Ramon Grima School of Biological Sciences, the University of Edinburgh, Max Born Crescent, Edinburgh EH9 3BF, UK ZC, 0000-0003-2600-5806; RG, 0000-0002-1266-8169 Bayesian and non-Bayesian moment-based inference methods are commonly used to estimate the parameters defining stochastic models of gene regulatory networks from noisy single cell or population snapshot data. However, a systematic investigation of the accuracy of the predictions of these methods remains missing. Here, we present the results of such a study using synthetic noisy data of a negative auto-regulatory transcriptional feedback loop, one of the most common building blocks of complex gene regulatory networks. We study the error in parameter estimation as a function of (i) number of cells in each sample; (ii) the number of time points; (iii) the highest-order moment of protein fluctuations used for inference; (iv) the moment-closure method used for likelihood approximation. We find that for sample sizes typical of flow cytometry experiments, parameter estimation by maximizing the likelihood is as accurate as using Bayesian methods but with a much reduced computational time. We also show that the choice of momentclosure method is the crucial factor determining the maximum achievable accuracy of moment-based inference methods. Common likelihood approximation methods based on the linear noise approximation or the zero cumulants closure perform poorly for feedback loops with large protein -DNA binding rates or large protein bursts; this is exacerbated for highly heterogeneous cell populations. By contrast, approximating the likelihood using the linear-mapping approximation or conditional derivative matching leads to highly accurate parameter estimates for a wide range of conditions.

Introduction
In recent years, it has been shown that a significant percentage of genes in bacteria and yeast are auto-regulated [1 -3], i.e. a transcription factor activates or represses the expression of its own gene. We here choose to focus on negative auto-regulation (repression) because this motif confers significant advantages to cellular function including the reduction of intrinsic noise [4] and the speeding up of the response time [5]. It is also the case that the molecular mechanism of circadian oscillators relies on negative autoregulation of gene expression [6,7]. Given the widespread availability of experimental data on the number of mRNAs and proteins at the single cell level [8][9][10][11], a natural question is how can we use these data to infer the rate constants and other relevant parameters of negative auto-regulatory transcriptional feedback loops.
A number of early studies used rate equations to identify the underlying network structure of gene regulatory networks or to infer rate constants [12,13]. However, clearly this is not the ideal framework since rate equations are deterministic while it is well known that gene expression is highly stochastic [14]. Thus, there has been considerable effort at devising methods to infer parameters of auto-regulatory gene regulatory networks from noisy time course data using the chemical master equation (the discrete state and continuous time stochastic description of reaction kinetics [15]) or one of its numerous approximations [16][17][18][19][20][21][22]. These studies can be distinguished according to the type of kinetics used to describe autoregulatory networks (mass-action or non-mass-action) and by the choice of method used to perform parameter inference (approximate Bayesian computation, Markov chain Monte Carlo (MCMC) algorithms and maximum likelihood methods).
Studies assuming mass-action kinetics, such as [16][17][18][19], describe the interactions of DNA, mRNA and protein using the first-and second-order reactions while those using nonmass action kinetic models [20][21][22] employ Hill or logical functions to describe effective interactions between mRNA and protein without an explicit description of the DNA. Approximate Bayesian computing approaches perform exhaustive stochastic simulations using the stochastic simulation algorithm (SSA) [23] and accept parameter values if the differences between simulation and experimental data are sufficiently small [19,24,25]. These methods are asymptotically exact, but they suffer from poor computational efficiency due to the very large number of required SSA runs. Inference using the Finite State Projection (FSP) algorithm is usually more efficient than that using the SSA; however, this is limited to small reaction networks [26]. A different approach, which is relatively more computationally efficient, involves approximating the likelihood (by approximating the chemical master equation) and then using a random walk scheme (MCMC), to explore parameter space and thus to finally obtain the posterior distributions of parameters [16,18,21,22,27,28]. The most common approximations used are the linear-noise approximation (LNA) and the two-moment approximation (2MA), presumably because these are the simplest and most well-known approximations of the chemical master equation in the literature of stochastic chemical kinetics [29]. A third (non-Bayesian) approach is typically the most computationally efficient of the approaches mentioned thus far and involves a direct maximization of the approximate likelihood using numerical optimization techniques [17,30,31]. We collectively label the aforementioned MCMC and maximum likelihood methods under the umbrella of moment-based inference because they involve solving a closed set of ordinary differential equations for the approximate moments. We emphasize that approximations are necessary because the chemical master equation can rarely be solved for all times when the reaction system has bimolecular reactions [29], and such reactions are very common in vivo, e.g. the protein -DNA binding reaction in an auto-regulatory transcriptional feedback loop.
All auto-regulatory networks have two properties in common: (i) they are typically very noisy particularly as proteins are produced in short bursts due to translational bursting [32] and (ii) they all have at least one protein -DNA bimolecular reaction which controls the strength of feedback. Unfortunately, common approximation methods, such as the LNA and the 2MA, are valid in the limit of small noise [33] and the error between their predictions and the exact solution of the chemical master equation increases with the size of bimolecular rate constants [29,33,34]. The question of how accurate the parameter estimates are is thus a pressing one and it has not been addressed properly because published studies to date have focused on method development and only verified the method's accuracy on a few parameter sets. In this article, we fill this gap in the literature by performing an exhaustive systematic analysis to understand the factors affecting the accuracy of parameter prediction in auto-regulatory transcriptional feedback loops using moment-based inference methods. In particular, we study how the accuracy of parameter estimation, using both MCMC and maximum-likelihood methods, varies across large swaths of parameter space and how the accuracy is affected by the number of time points of the observed data, the number of cells from which data are collected, the highest order of the moments used for inference and the choice of moment-closure method used to approximate the likelihood. Our results show that for cases where large bursts in protein production are evident and/or where strong feedback is suspected, approximation of the likelihood using the LNA and 2MA leads to large errors in the parameter estimates; this can be avoided by the use of more sophisticated moment-closure techniques.

Model of an auto-regulatory transcriptional feedback loop
The auto-regulatory (repressive) genetic feedback loop which is the centre of this study is shown in figure 1a. When a gene is in the ON state (G), proteins are produced and subsequently degraded via a first-order reaction. The protein can bind to the gene and turn it OFF (denoted as the state G*); in this state, the protein can only be degraded. The proteins are produced in bursts with a mean burst size b. Note that the latter is the mean number of proteins produced per mRNA during its lifetime. The burst size distribution is chosen to be geometric; this distribution was previously derived for the common case of  Figure 1. (a) Schematic of the negative auto-regulatory transcriptional feedback loop and the parameters to be inferred: the protein production rate r u , the mean protein burst size b, the degradation rate d, and the promoter switching rates s b and s u . The burst size distribution is geometric and given by c(i) (see text for justification). (b) Five (independent) single cell trajectories generated using the SSA for the parameter set: fast mRNA decay (translational bursting) [35] and has been also verified experimentally [32]. Hence, while mRNA is not explicitly described in our model, its effects are implicitly described through the protein burst size distribution. Note that transcriptional bursting (bursts of mRNA occurring when the promoter spends a long time in the OFF state) is also implicit by the same reasoning. If the cells are identical, then the model has five parameters to be estimated: r u (rate of protein production), d (rate of protein degradation), b (mean protein burst size), s b (the binding rate of protein to gene in the ON state) and s u (the rate at which the gene switches from OFF to ON). If the cells are non-identical, then we assume a lognormal distribution in r u (see §3.2 for justification and detailed discussion of cellular heterogeneity) and there are six parameters to be determined: the mean and standard deviation of r u , b, d, s b and s u .

Synthetic data
Consider an experimental set-up where the number of molecules of a certain protein is measured for N cells at L different time points. This is usually done by using an empirical formula to convert the fluorescence of a tagged protein in a cell to the number of molecules in that cell. If x i (t l ) is the number of proteins in cell i at the lth time point t l , then we can calculate the set of kth central moment measurements, i.e.m k ¼ {m k (t 1 ), . . . ,m k (t L )} where:m The experimental set-up can be of two types: (i) fluorescence from the same N cells is measured at each time point, i.e. single cell data where individual cells can be tracked or (ii) population snapshot data whereby N cells are randomly selected from a much larger cell population such that the chances that the same cell is measured at different time points are negligible, e.g. flow cytometry. For both cases, we will make the simplifying assumption that there is no correlation between fluorescence measurements at any two different points in time. This assumption is naturally enforced when collecting population snapshot data. For single cell data, this assumption holds provided the interval between consecutive time points is much larger than the autocorrelation time of protein fluctuations. We simulate an experiment and generate synthetic data using the SSA. The time series data for the auto-regulatory circuit shown in figure 1a (the number of proteins sampled at a number of equidistant time points) are generated for a certain set of values of the parameters using the SSA. Specifically the algorithm simulates the following set of reactions: where m is a discrete random variable sampled from the geo- The initial condition is zero proteins in state G. Each realization of the SSA simulates temporal data measured from a single cell (figure 1b shows typical single cell trajectories). For each time point, we then compute the moments of the molecule numbers across the population of cells using equation (2.1). This is the data input to the inference methods which are described next.

Bayesian inference
We will assume that the number of cells in our experiments is quite large such that by the central limit theorem the sample moments are approximately Gaussian distributed. This assumption is readily fulfilled in flow cytometry experiments where measurements of tens of thousands of cells or more [27,36] are routine. It is less clear if the assumption is valid for microfluidic set-ups which collect single cell data and which typically can at most sample of the order of a thousand cells [37]. The simplest method of inference would involve using only mean data but unfortunately for our auto-regulatory circuit this method does not enable the identification of all parameters-this is since r u and b appear as a product in the rate equations for the mean concentrations (see equation (B 3) in appendix B) which makes their individual estimation impossible (the implicit reason is that the effective mean rate of protein production at any time is r u b). Hence at least the mean and variance of protein numbers at each time point are needed to identify all parameters. Now it is known that the covariance between the sample mean and the sample variance at each time point tends to zero as the sample size increases [38]. Hence for large cell numbers, the likelihood that at time point t l we measure the first and second central moments {m 1 (t l ),m 2 (t l )} given the parameter vector u, can approximately be written as the product of two Gaussians, one for the mean and one for the variance: where the variance s 2 k (t l ) is related to moment measurements by the equations: andũ k (t l , u) is the kth moment at time t l as predicted by the chemical master equation given the parameter vector u. Since most master equations cannot be solved when there are protein-DNA binding reactions [29], an approximation of the master equation is necessary to calculate the likelihood above. Zechner et al. [27] used the 2MA whereby one obtains closed approximate equations for the first two moments from the chemical master equation by assuming that the third-order cumulants are zero [33,39]. However, generally the approximation method used can be any type of moment-closure method (see next section). Due to the independence of fluorescence measurements at any two different points in time, it then follows that the likelihood that we measure the moment vectorsm 1 ,m 2 (the first two moments measured at L time points) given the parameter vector u, is as follows: Thus within a Bayesian framework the posterior distribution of the parameter vector u is given by where p(u) is the prior distribution on u. A parameter search can then be performed to maximize the parameter posterior using an adaptive Metropolis -Hastings MCMC sampler (see appendix A for a description of the algorithm, choice of prior and proposal distributions, burnin time, etc.). MCMC samples converge in distribution to the posterior, and as such any statistics computed using a finite sample (after the burnin time) is an approximation to the posterior. We define the highest mode of the posterior distribution (the maximum a posteriori, MAP) to be the parameter estimate and the width of the distribution is a measure of uncertainty. The use of an adaptive sampler prevents the chain getting easily stuck by adapting to the global covariance of the posterior distribution. The MCMC was coded in the Julia language [40] and its typical runtime (to achieve convergence of the chain) for the applications discussed in this paper was many hours, in some cases as high as 20 h (all simulations run on a single core of an Intel w Xeon w Silver 4114 CPU @ 2.20 GHz). TheR ratio [41] was very close to 1 for times larger than the burn-in time which is a strong indicator of chain convergence. We note that this method can be easily extended to include information about higher-order moments than two. For example, if we wished to use the first three central moments of the protein number data for inference, then equation (2.5) would be replaced by where the variance s 2 k (t l ) is related to moment measurements by equation (2.4) and one further equation

Maximum-likelihood estimator
An alternative frequentist method of estimation involves finding the parameter vector that maximizes the likelihood. It is immediately clear from the form of equation (2.5) that this is tantamount to minimizing the negative logarithm of the likelihood. To be specific, the parameter vector is found by solving the optimization problem: : This is the maximum-likelihood estimator (MLE) that we use throughout this paper. Note that the pre-factors are neglected due to their constant values. The positivity constraint on the parameter values can be easily handled by the ln 2 exp transformation. Specifically, equation (2.9) is equivalent to: allowing u e to be the optimization variables over the entire real space, and the actual parameters can then be deduced from exp (u e ). Of course, this estimator can also be extended to include information about higher-order moments than two by changing the upper limit of the sum over k. The MLE estimator here used can be seen as a special case of the generalized method of moments estimator used in [42]. Since the variances s 2 1 (t l ) and s 2 2 (t l ) converge to 0 when the number of cells N tends to infinity, the normal distributions in the likelihood equation (2.5) turn to Delta functions which are only non-zero form k (t l ) ¼m k (t l , u). Thus it follows that in the infinite cell number limit, the MAP estimate from MCMC will be equal to the value of u which minimizes the mismatch between the predictions and measurements of moments, which is the same value obtained from the MLE equation (2.9). This is of course only true if the support of the prior distribution is wide enough.
MLE is computationally very efficient compared to momentbased Bayesian inference using an adaptive MCMC sampler; this is its main advantage. A main difference from Bayesian inference is that it leads to a point-wise estimate of the model parameters (rather than a posterior distribution). The MLE was computed using an adaptive differential evolution Algorithm [43,44] implemented in the Julia language [45]. This leads to an efficient global numerical optimization with a typical runtime under a minute for the applications discussed in this paper.

Computation of error in parameter estimates
The set of synthetic moments generated by the SSA is the input to the MLE and MCMC algorithms described in the previous sections which subsequently output predictions for the parameter values. We then compute two types of fractional errors for each parameter u i : (2:12) The first error quantifies the difference between the MLE and the mode of the MCMC-derived posterior (the MAP estimate), while the second and third errors quantify the error between the MAP estimate (or the MLE estimate) and the true parameter value, i.e. the parameter values input into the SSA and used to generate the synthetic data.

Choices for the moment-closure approximation method
Since the chemical master equation of the feedback loop can only be solved in steady state [46], the likelihoods need to be approximated by a moment-closure method. There are a wide variety of such methods [47], each with their own advantages. We shall consider six types of approximations: LNA [15], the three moment approximation (3MA) [33], derivative matching (DM) [48], conditional derivative matching (CDM) [49], conditional Gaussian approximation (CG) [49] and the linear-mapping approximation (LMA) [50]. The LNA was described in the Introduction. The 3MA is an elaboration of the 2MA explained earlier; while in the latter we assume the third cumulant is zero, in the former we assume that the fourth cumulant is zero. The 3MA gives a closed set of equations for the first three moments and is a more accurate approximation of the chemical master equation than the 2MA [33,51]. Hence, in this article, we use the 3MA instead of the more common 2MA. DM involves matching time derivatives of the exact (not closed) moment equations with that of the approximate (closed) moment equations at some initial time. CDM is a conditional version of DM, i.e. where DM is performed conditional on the state of the low abundance species, e.g. the promoter states. CG is a special case of the conditional method of moments developed earlier by Hasenauer et al. [52]; it can also be seen as a conditional version of the 2MA, again where the conditioning is on the promoter state. The LMA is a not a true moment-closure method in the usual sense of the word because it actually gives approximate expressions for the timedependent probability distributions of a wide class of gene regulatory networks (which moment-closure methods cannot give). The LMA is based on an approximate mapping of the dynamics of a gene regulatory system with protein -DNA binding reactions to a system with no binding reactions. Appendices B and C contain the equations defining each of these closures for the auto-regulatory transcriptional feedback loop for the case of identical and non-identical cells, respectively. (ii) the number of time points; (iii) the highest-order moment used for inference; (iv) the moment-closure method used for likelihood approximation.
Testing the independence assumption of the likelihood function. We first explicitly confirm the assumption behind our method of inference, namely that the sample mean and sample variance are independent at each time point such royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 16: 20180967 that we can write the likelihood as a product of likelihoods for each moment (see §2.3). We fix the parameters in the SSA to r u ¼ 13, b ¼ 3, d ¼ 1, s b ¼ 0.001 and s u ¼ 0.1, generate the synthetic data for a number of N cells at 30 time points (interval 1), compute the central moments using equation (2.1) and the correlation coefficient of sample mean and sample variance for time t l using the following: r(m 1 (t l ),m 2 (t l )) ¼ cov(m 1 (t l ),m 2 (t l )) s 1 (t l )s 2 (t l ) ¼m 3 (t l ) Ns 1 (t l )s 2 (t l ) : (3:1) Note that the last step uses an exact result for the covariance of sample mean and sample variance derived in [38]. In figure 2a, we show r(m 1 (t l ),m 2 (t l )) averaged over all 30 time points (denoted as L) as a function of N. The very small value of the time-averaged correlation coefficient of sample mean and sample variance is practically negligible for populations with more than a hundred cells and hence the assumption of independence of sample mean and sample variance in our inference methods holds. This was found to be the case for all parameter values explored in this study. Quantifying the differences between the MLE and MAP estimate as a function of N. We now fix the moment-closure method of approximating the likelihood to be the 3MA and fix the number of time points to 30 (with interval 1). In figure 2b, we plot FE MLE2MAP (see equation (2.10)) as a function of N which quantifies the difference in the parameters obtained using the MLE and the MAP estimate of MCMC. The fractional errors decrease rapidly with increasing N showing rapid convergence of the two estimates. The number of cells in flow cytometry measurements (we shall refer to this as the sample size from now on) tends to be much larger than 10 4 and hence the difference between the two estimates (for all five parameters) is less than 3%. In figure 2d, we show the corresponding posterior distributions obtained from MCMC for each of the five parameters as a function of N, while the vertical green dashed line shows the MLE. Note that these results do, of course, depend on the choice of prior distribution but are independent of the particular choice, we always found that the fractional error between the MLE and MAP estimates decreases rapidly with N (this is, of course, expected as the amount of informative observations increases, the effect of the prior is diluted). The very small differences between the two estimators make a strong case for the use of the MLE rather than the MCMC method for typical flow cytometry sample sizes since the computational time of the former is at most a few minutes, while of the latter is many hours.
Quantifying the differences between the inferred and true parameter values as a function of N. In figure 2c, we plot FE MAP2True (see equation (2.11)) as a function of N which quantifies the difference between the MAP estimate of the parameter and the true value of the parameter. This figure clearly shows that the percentage error does not significantly decrease with N and can be as high as 40% for typical flow cytometry sample sizes. Now the sampling error due to the finite cell number N and the error due to assuming independence of sample mean and sample variance rapidly go to zero as N ! 1. The only error remaining in this limit is the systematic error which is the error due to likelihood approximation by the moment-closure method. Hence, figure 2c shows that the systematic error due to likelihood approximation by the 3MA by far dominates the other errors. The differences between the MAP estimate and the true value (red vertical line) can be better appreciated in figure 2d where we plot the posteriors of the parameter distributions as a function of N. In particular, the case N ¼ 10 5 (last row of figures in figure 2d) is remarkable since the red vertical line (the true value) is way off from the narrowly peaked (converged) posterior. Note that since the differences between MLE and the MAP estimate are very small (figure 2b), the error computed between the MLE and the true value, FE MLE2True , is very similar to that reported in figure 2c for FE MAP2True .
Quantifying the systematic error in the inferred parameter values as a function of the type of moment-closure approximation for the likelihood. We have previously found that the systematic error was very large using the 3MA. Next we investigate how this error varies with the choice of moment-closure approximation. Since it is computationally unfeasible to generate a very large number of cell samples using the SSA, for this study we use the FSP algorithm [53]) to directly obtain the time-dependent probability distribution of the genetic feedback loop, from which we calculate the moments for 30 time points (interval one). Because we truncated the FSP to a very large protein number compared to the mean protein number, the results obtained from FSP are practically the same as the exact solution of the master equation, i.e. the limit N ! 1 of the SSA. In particular, by comparing the mean and variance from FSP with that from SSA, for various parameter sets, we estimated that the relative error in FSP's moments is less than 1% for all times. We then use the moments of the probability distribution at the 30 time points to generate the MLE of the five parameters. These and the true parameter values are used to calculate the fractional error for each parameter using equation (2.12). In figure 3, we show a heat map of the fractional error averaged over all five parameters as a function s b , b and r u for the six different types of moment-closure approximations mentioned in the §2.6. The heat map shows that the systematic error increases rapidly with increasing rate of protein -DNA binding s b and with increasing mean burst size b (there is only a weak dependence on the translation rate of proteins in the ON state r u ). This dependence is to be expected since (i) s b controls the strength of the only bimolecular reaction in the feedback loop and we know that it is the presence of this reaction which necessitates the use of moment-closure approximation (from the master equation of a system with only zero or first-order reactions, one can derive a closed set of moment equations and hence no approximation is necessary in this case [29]). (ii) b controls the size of protein number fluctuations and we know that most approximations are valid for small noise only [29,33]. Note that the maximum systematic error using the 3MA and the LNA is of order 1, while the maximum systematic error using the LMA, CDM, DM or CG is of order 10 22 . The 2MA (the lesser accurate version of the 3MA) and the LNA have been the methods of choice for inference in the literature, presumably because of their simplicity. Hence our results make a strong case for the use of the more sophisticated LMA, CDM, DM or CG for cases where large bursts in protein production are evident and/or where strong feedback is suspected.
We further corroborated the results in figure 3 by generating synthetic SSA data for two points in parameter space, r u ¼ however, provides additional important information: it shows the error for each parameter and the posterior distributions obtained from MCMC. The error in the protein -DNA binding rate s b is the largest or the second largest error among the five parameters when the LNA and 3MA are used to approximate the likelihood. It is visually clear that the posteriors generated using the LMA and CDM (blue and red distributions, respectively, in the first and third rows of figure 4) are centred or almost centred on the true parameter value (red vertical line)-this is obviously not true for the posteriors generated using the LNA and 3MA (yellow and green distributions, respectively, in the second and fourth rows of figure 4). However, the posteriors from the LMA and CDM are not necessarily narrower than those from the LNA and 3MA and hence the choice of moment closure scheme does not appear to significantly impact the uncertainty in the MAP estimate. In tables 1 and 2, we compare the MAP estimate of MCMC in figure 4 with the MLE for the same synthetic SSA data as well as with the MLE using synthetic FSP data (reported in figure 3). All three are in good agreement for the four moment-closures tested, thus providing another verification of the superiority of LMA/CDM over LNA/ 3MA for moment-based inference.
In the electronic supplementary material, we also demonstrate the accuracy of distribution reconstruction from inferred parameters, the robustness of the MLE estimates to external noise and the convergence of the MCMC chain. A short description of each follows. In electronic supplementary material figure S1, we reconstruct the time-dependent distribution of molecule numbers (using FSP) based on 3MA and CDM inferred kinetic parameters reported in table 1. We find that both methods lead to a distribution that is visually close to that generated using the true parameter values, with the accuracy being highest for the CDM-reconstructed distribution which is virtually indistinguishable from the true distribution. We have also tested the robustness of the MLE inference method to noise added to the measured moments; this additional noise mimics sources of noise other than intrinsic noise inherent in the synthetic SSA data. In electronic supplementary material, figure S2, we show that the fractional error averaged over all parameters increases linearly with the size of added noise. In electronic supplementary material, figure S3, we plot the Gelman -RubinR ratio as a function of the number of iterations of the MCMC chain where the likelihood is approximated using the LMA moment equations-the ratio quickly tends to 1 after the burn-in time demonstrating chain convergence. Note that the same quick convergence is seen for all MCMC results reported in this article.
Quantifying the differences between the inferred and true parameter values as a function of the number of time points L and the highest-order moment used for inference. Thus far, we have fixed the number of time points to L ¼ 30 and the highest-order

Inference from non-identical cells
Thus far, we have assumed inference from a population of identical cells, but, of course, this is an ideal which does not exist in nature. Variability between cells can be modelled by choosing rate constants to vary from one cell to another one. Generally, rate constants might even change with time in a single cell, but this is likely a secondary effect compared to cell-to-cell variation in the rate constants. In particular, previous experimental studies have shown that one of the major sources of gene expression variability in yeast is cell-to-cell variation in transcription factor expression [54]. In our model, the protein is the repressing transcription factor and its expression is controlled by the rate constant r u . Hence we choose this constant to vary from cell to cell, while the     other rate constants are identical across cells. In agreement with previous studies, the distribution of r u across cells is chosen to be lognormal [55]. Specifically we fix the parameter set to b ¼ 5, d ¼ 1, s b ¼ 0.001, s u ¼ 0.1 and r u to be lognormally distributed (across cells) with mean 13 and standard deviation 0.1 or 0.3. Hence the parameters to be inferred are now six: the mean and standard deviation of r u , b, d, s u and s b . Figure 6 shows the MCMC posterior distributions for these six parameters using the 3MA, LMA and CDM moment-closure approximations. The mean percentage error across all parameters is 57-95% using the 3MA, 7-8% using the LMA and 2% using the CDM. In comparison, the mean percentage error across all parameters is 25 -41% using the 3MA, 0.6-2% using the LMA and 0.6-1% using the CDM for the case of identical cells ( figure 4). The main conclusions to be drawn are as follows: (i) inference for non-identical cells leads to parameter predictions with significantly larger errors than that for identical cells; (ii) the LMA and CMD closure leads to much more accurate results than the 3MA-the CDM is particularly accurate and seems the best choice. We did not test the LNA, but since for identical cells the LNA and 3MA always fared very similar, we expect the same in this case too. In tables 3 and 4, we compare the MAP estimate of MCMC in figure 6 with the MLE using the same synthetic SSA data; as expected, we find the MLE and MAP estimates to agree very closely for all six parameters and using all moment-closure approximations.

Discussion and conclusion
In this article, we have reported the results of an exhaustive study of the factors influencing the accuracy of momentbased MCMC and MLE methods for an auto-regulatory transcriptional feedback loop. Using the Bayesian method devised in [27] and its corresponding MLE, we showed that using only the first two moments of synthetic protein data, the accuracy of parameter estimation for large sample sizes is largely controlled by the choice of moment-closure method used to approximate the likelihood. The errors were found to increase with the size of the protein -DNA binding rate, the mean protein burst size and the heterogeneity of transcription rate across the cell population. We showed that using only mean data is not sufficient to identify all parameters and that at least mean and variance are needed to perform such a task. Using more than two moments of synthetic protein data does not necessarily lead to better accuracy-in particular this does not affect the accuracy    boasted very small maximum errors of about 1%. Our study also confirms that for sample sizes typical of flow cytometry (tens of thousands of cells) MLE approaches are more favourable than Bayesian methods since both methods lead to virtually indistinguishable estimates (if the same likelihood approximation is used), but the computation of MLE takes a few minutes, while MCMC takes many hours. Of course, MCMC approaches have the additional advantage of estimating the uncertainty in the parameter estimates; however, this could also be computationally efficiently estimated using normal approximations to the posterior [41].
Our study was specifically for a negative auto-regulatory feedback transcriptional feedback loop which does not incorporate cooperativity in the protein -DNA binding reaction nor protein dimerization reactions, as some previous studies did. Incorporating both of these would lead to a higher degree of nonlinearity in the law of mass action (since both cooperativity and dimerization imply more second-order reactions in our model). Under such conditions, one would expect even larger errors from the prediction of momentbased inference methods than what we have found because moment-closure approximations naturally perform best for systems with weakly nonlinear mass action laws [29]. It would also be interesting to investigate (i) whether the results here found for negative feedback loops extend to positive feedback loops and (ii) how the present inference method can be extended for use with spatially extended data [56]. These are topics for a future study.
In few instances [16], some studies have used the chemical Fokker-Planck equation (CFPE) as a means to compute the approximate likelihood. This method cannot be used in our moment-based inference because the moment equations of the CFPE are not closed. However, given that it was proved in [33,57] that the 3MA is more accurate than the CFPE (in the limit of large system sizes) and that the CFPE's predictions for the protein distributions of autoregulatory gene regulatory networks [58] can be very different from those of the chemical master equation, it appears highly likely that Bayesian inference methods based on the CFPE cannot outperform the LMA, CDM, DM and CG moment-based methods described in this article.
It remains to be seen whether particular moment-closures are more advantageous compared to others when one is interested in the more general problem of inferring both the network connectivity and the parameter values. However, given the large translational mean protein bursts measured in vivo (in the approximate range of 1 to 1000; see fig. 5a in [8]) and the rapid increase in estimation error with mean burst size that we identified in this study, it seems likely that new techniques (such as those based on the concept of convergent moments [59]) may be needed to ensure accurate inference of complex noisy gene regulatory networks with multiple interconnected feedback loops. Competing interests. We declare we have no competing interests. Funding. Z.C. gratefully acknowledges support of the UK Research Councils' Synthetic Biology for Growth programme and of the BBSRC, EPSRC and MRC (B/M018040/1). R.G. acknowledges support from BBSRC grant no. BB/M025551/1. Acknowledgements. R.G. thanks Christoph Zechner, Barbel Finkenstadt, Christian Fleck, John Fricks and Guido Sanguinetti for their useful discussions.

Appendix A. Implementation of adaptive Markov chain Monte Carlo
Our initial exploration used a traditional non-adaptive Metropolis-Hastings algorithm (as in [27]) but we noted that this often resulted in the chain getting stuck for very long periods of time. This problem can be traced to the very narrow likelihoods at large sample sizes. Since it is well known that the choice of proposal distribution strongly affects the time taken for the chain to converge, we opted to instead use an adaptive MCMC which removes the need of a particular choice by automatically tuning the proposal distribution using the history of the process [61]. This led to convergence in a reasonable amount of time. The particular updating mechanism we used is a variant of algorithm 4 in [62]. The pseudocode of the adaptive MCMC is presented below.
1. Extract statisticsm k (t l ) (sample mean and sample variance) and s 2 k (t l ) (variance of sample mean and variance of sample variance) for k ¼ 1, 2 for all time points using the formulae equations (2.1) and (2.4). 2. Select wide lognormal prior p(u) in equation (2.6) and initialize u old [ R J . 3. Select lognormal proposal distribution q(u new ju old ) and initialize covariance matrix as identity matrix C I [ R JÂJ . 4. Initialize the adaptive parameters: (i) the starting point and termination point of adaptation I 1 ¼ 100 and I 2 ¼ 5 Â 10 5 , respectively; (ii) the optimal acceptance rate opt (0.23 in this paper); (iii) learning gain K (100 in this paper); (iv) Robbins-Monro order 1 [ (0, 1] (0.9 for this paper); (v) s d is initially set equal to (2.38) 2 /J; (vi) parameter for preventing singularity e d ¼ 2 Â 10 25 I. (* Do Adaptive Metropolis -Hasting MCMC *) 5. For i from 1 to Number_of_samplings 6.
Sample u new from proposal distribution q(u new ju old ); (* Calculate Probability *) 8.
Calculate moment predictionsm k (t l , u) from moment equations for all t l and k and for both u new and u old ; 9.
Calculate p(ujm 1 , . . . ,m n ) as per equation (2.6) for both u new and u old in which p(m k (t l )ju) is approximated by using normal distribution equation (2.5); 10.
Calculate both q(u new ju old ) and q(u old ju new ); 11.
If u a 13.
u old u new and record u old in u r ; 14.
Set (* Adaptation initiates at the I 1 -th trial and terminates at I 2 -th trial *) 23.
if i . I 1 and i , I 2 24.
C s d cov(ur ) þ s d e d ; 25. end 26. end 27. Calculate the probability distribution of all the recordings of u old in the last 2/3 chain length-this is the posterior. 28. Pick the mode of the posterior for each parameter as the MAP estimate for that parameter.

Remark 1 (early adaptation termination)
In practice, we use the early-adaptation-termination adaptive MCMC in which the adaptation is terminated after 5 Â 10 5 runs after the acceptance a reaches steady state (the change on a is below the tolerance). The motivation of early termination on adaptation is to circumvent the ever-increasing computation demand on the calculation of covariance in step 24.

Remark 3 (u old initialization)
The adaptive MCMC's u old is initiated with the estimates of MLE obtained using the same dataset.

Remark 4 (convergence)
Generally, the posterior calculated up to chain length equal to 10 6 is virtually indistinguishable from that calculated with a chain length which is an order of magnitude longer and hence we chose to terminate the entire chain at 10 7 . In all cases this guaranteed a converged posterior. The computation time of the full chain takes 10 h or more for the identical cell case and at least 20 h for the heterogeneous cell case.