Numerical method for parameter inference of systems of nonlinear ordinary differential equations with partial observations
Abstract
Parameter inference of dynamical systems is a challenging task faced by researchers and practitioners across many fields. In many applications, only some of the state variables are observable. In this paper, we propose a method for parameter inference of a system of nonlinear coupled ordinary differential equations with partial observations. Our method combines fast Gaussian process-based gradient matching with deterministic optimization algorithms. Initialized with values obtained from Bayesian steps with low sampling numbers, our deterministic optimization algorithm is accurate, robust and efficient under partial observations and large noise.
1. Introduction
Many problems in science and engineering can be modelled by systems of ordinary differential equations (ODEs). It is often difficult or impossible to measure some parameters of such systems directly. Therefore, various methods have been developed to estimate parameters from available data. Mathematically, such problems are classified as inverse problems, which have been widely studied [1–4]. They can also be treated as parameter inference in statistics. Statistical inference enables the construction of data-efficient learning machines and offers advantages in dealing with noisy data, and it is increasingly studied and discussed (e.g. [5–7]).
For nonlinear ODEs, standard statistical inference is time consuming, as numerical integration is needed after each update of the parameters [8,9]. Recently, gradient matching techniques have been proposed to circumvent the high computational cost of numerical integration [7,10–12]. These techniques are based on minimizing the difference between the time derivatives (gradients) and the right-hand side of the equations. This usually involves a two-step process: data interpolation and parameter adaptation. Among these techniques, non-parametric Bayesian modelling with Gaussian processes is one of the most promising approaches; it performs data interpolation with a Gaussian process and parameter adaptation by matching the solution or the model. An adaptive gradient matching method based on a product-of-experts approach and a marginalization over the derivatives of the state variables was proposed by Calderhead et al. [8] and extended by Dondelinger et al. [10]. Barber & Wang [13] proposed a Gaussian process-based method in which the state variables are marginalized. Macdonald et al. [9] provided an interpretation of the above paradigms. Wenk et al. [14] proposed a fast Gaussian process-based gradient matching (FGPGM) algorithm with a theoretical framework for systems of nonlinear ODEs, which was shown to be more accurate, robust and efficient.
For many practical problems, the variables are only partially observable, or not observable at all times [15]. As a consequence, parameter inference is more challenging, even for a coupled system whose parameters are uniquely determined by the partially observed data under certain initial conditions. It is not clear whether gradient matching techniques can be applied when there are latent variables. The Markov chain Monte Carlo (MCMC) algorithm can sidestep the issue of parameter identifiability in many cases, but convergence remains a serious issue [7]. Therefore, the graphical model, density formula and algorithm for inference with partial observations, in particular the treatment of the latent variables, are worth investigating. Moreover, attention must be paid to the feasibility, accuracy, robustness and computational cost of numerical computations for such problems.
In this work, we focus on parameter inference with partially observable data. The main idea is to treat the observable and unobservable variables differently. For the observable variables, we use the same approach as Wenk et al. [14]. We provide three approaches to deal with the unobserved variables, including numerical integration and Gaussian process sampling, and compare their performance. To circumvent the high computational cost of sampling in Bayesian approaches, we also combine the extended FGPGM with a least-squares optimization method. The remainder of the paper is organized as follows. In §2, we present the numerical method for parameter identification problems with partial observations. Numerical examples are presented in §3. Concluding remarks are given in §4.
2. Algorithm
The main strategy of Gaussian process-based gradient matching is to minimize the mismatch between the data and the ODE solutions in a maximum-likelihood sense, exploiting the fact that Gaussian processes are closed under differentiation. In this section, we extend the FGPGM method proposed in [14] to the situation with unobserved variables.
In this work, we would like to estimate the time-independent parameters θ of the following dynamical system,

$$\dot{x}(t) = f(x(t), \theta),$$

where the states are split as x = (xM, xL): xM collects the observed states, for which noisy observations y = xM + ε with ε ∼ N(0, σ²I) are available at times t, and xL collects the latent (unobserved) states.
The idea of Gaussian process-based gradient matching is as follows. Firstly, we put a Gaussian process prior on xM,

$$p(x_M \mid \phi) = \mathcal{N}(x_M; \mathbf{0}, K_\phi),$$

where Kϕ is the covariance matrix induced by the kernel kϕ with hyperparameters ϕ.

Figure 1. Chain graph with partially observable variables.
Theorem 2.1.
Given the modelling assumptions summarized in the graphical probabilistic model in figure 1,
The proof is given in appendix B.
In our computation, xL can be obtained by integrating the ODE system numerically with the proposed θ and the initial values of xM and xL. The target is then to maximize the likelihood function ρ(xM, θ|y, ϕ, σ, γ).
The present algorithm is a combination of Gaussian process-based gradient matching and a least-squares optimization. In the GP gradient matching step, the Gaussian process model is first fitted by inferring the hyperparameters ϕ. Secondly, the states and parameters are inferred using a one-chain MCMC scheme on the density, as in [14]. Finally, the parameters estimated above are used as the initial guess in the least-squares optimization. The algorithm can be summarized as follows.
Input: y, f(x, θ), γ, NMCMC, Nburnin, t, σs, σp
Step 1. Fit the Gaussian process model to the data.
Step 2. Infer xM, xL and θ using MCMC:
  for each of the NMCMC + Nburnin iterations do
    for each state do
      Propose a new state value using a Gaussian distribution with standard deviation σs.
      Accept the proposed value based on the density (equation (2.8)).
      Add the current value to the sample chain.
    for each parameter do
      Propose a new parameter value using a Gaussian distribution with standard deviation σp.
      Infer xL by integration or sampling (detailed below).
      Accept the proposed value based on the density (equation (2.8)).
      Add the current value to the sample chain.
  Discard the first Nburnin samples of the chain and take the means of the remaining samples.
  Return xM, xL, θ.
Step 3. Optimize equation (2.11) using θ from Step 2 as the initial guess.
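As a minimal illustration of Steps 2 and 3, the sketch below runs a one-chain Metropolis random walk over states and parameters and then refines the resulting estimate with a deterministic least-squares fit. Here `log_density` stands in for the log of the density in equation (2.8) and `residuals` for the misfit between the integrated ODE solution and the data; all names are hypothetical, and the sketch is not the code of [14].

```python
# Sketch of Steps 2-3 (hypothetical names): Metropolis sampling of states and
# parameters, followed by least-squares refinement started at the MCMC mean.
import numpy as np
from scipy.optimize import least_squares

def mcmc_then_refine(log_density, residuals, x0, theta0,
                     n_mcmc, n_burnin, sigma_s, sigma_p, seed=0):
    rng = np.random.default_rng(seed)
    x, theta = np.array(x0, float), np.array(theta0, float)
    samples = []
    for _ in range(n_mcmc + n_burnin):
        for i in range(x.size):                       # one-at-a-time state updates
            prop = x.copy(); prop[i] += sigma_s * rng.standard_normal()
            if np.log(rng.uniform()) < log_density(prop, theta) - log_density(x, theta):
                x = prop
        for j in range(theta.size):                   # one-at-a-time parameter updates
            prop = theta.copy(); prop[j] += sigma_p * rng.standard_normal()
            if np.log(rng.uniform()) < log_density(x, prop) - log_density(x, theta):
                theta = prop
        samples.append(theta.copy())
    theta_init = np.mean(samples[n_burnin:], axis=0)  # FGPGM-style estimate
    return least_squares(residuals, theta_init).x     # Step 3: deterministic refinement
```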
In Step 1, the Gaussian process model is fitted to the data by maximizing the logarithm of the marginal likelihood of the observations y at times t,

$$\log p(y \mid t, \phi, \sigma) = -\frac{1}{2} y^{T} (K_\phi + \sigma^2 I)^{-1} y - \frac{1}{2} \log \left| K_\phi + \sigma^2 I \right| - \frac{n}{2} \log 2\pi,$$

where n is the number of observations.
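As an illustration of Step 1, the following sketch maximizes this log marginal likelihood over the kernel hyperparameters and the noise level; the squared-exponential kernel, the log-parametrization and all names are assumptions made for this example, not the implementation of [14].

```python
# Sketch of Step 1 (assumed kernel and names): fit GP hyperparameters
# phi = (sf, l) and noise sigma by maximizing log N(y | 0, K_phi + sigma^2 I).
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(t1, t2, sf, l):
    # k(t1, t2) = sf^2 exp(-(t1 - t2)^2 / (2 l^2))
    d = t1[:, None] - t2[None, :]
    return sf**2 * np.exp(-0.5 * (d / l)**2)

def neg_log_marglik(log_params, t, y):
    sf, l, sigma = np.exp(log_params)                 # log-parametrization keeps them positive
    K = rbf_kernel(t, t, sf, l) + sigma**2 * np.eye(t.size)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log N(y | 0, K) = 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2 pi)
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * t.size * np.log(2 * np.pi)

def fit_gp(t, y):
    res = minimize(neg_log_marglik, np.log([1.0, 1.0, 0.1]), args=(t, y))
    return np.exp(res.x)                              # (sf, l, sigma)
```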
For the inference of the unobserved variables, there are the following possible approaches. The first is to integrate the whole ODE system, which depends only on the initial conditions and the proposed value of the parameters. An alternative is to integrate the system partially, that is, to integrate only the equations for the unobservable variables. The coupled observable variables are regarded as known coefficients whose values are taken from the observations. To ensure the stability and convergence of the integration, smoothing and interpolation may be necessary if the data of the observable variables are sparse and noisy. In these two methods, the numerical integration is carried out only after each update of θ. The third approach is to treat the latent variables with the same sampling process as the observable states under the Gaussian process assumption and to apply gradient matching to them as well, in which case the posterior probability is adapted to include the latent states.
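As a concrete illustration of the second treatment (partial integration), the sketch below integrates only the latent equation of the Lotka–Volterra system of §3.1, with the observed state supplied as a smoothed, interpolated coefficient; the spline smoothing level and all names are assumptions.

```python
# Sketch of the partial-integration treatment for a latent Lotka-Volterra state:
# dx2/dt = -theta3 x2 + theta4 x1(t) x2, with x1(t) taken from smoothed observations.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.interpolate import UnivariateSpline

def latent_by_partial_integration(t_obs, y1_obs, theta, x2_0, t_eval):
    # Smoothing/interpolation of the observable, needed when data are sparse and noisy.
    x1_of_t = UnivariateSpline(t_obs, y1_obs, s=0.1 * t_obs.size)
    def rhs(t, x2):
        return -theta[2] * x2 + theta[3] * x1_of_t(t) * x2
    sol = solve_ivp(rhs, (t_eval[0], t_eval[-1]), [x2_0], t_eval=t_eval)
    return sol.y[0]
```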
3. Experiments
For the Gaussian process regression step for the observable variables, the code published alongside Wenk et al. [14] was used. The gradient matching part was then adapted to the partial observation case according to figure 1. In the following, FGPGM refers to this modified fast Gaussian process gradient matching algorithm adapted for the partial observation case.
3.1. Lotka–Volterra

The Lotka–Volterra system was originally proposed by Lotka [16] to model the prey–predator interaction system, whose dynamics are given by

$$\dot{x}_1 = \theta_1 x_1 - \theta_2 x_1 x_2, \qquad \dot{x}_2 = -\theta_3 x_2 + \theta_4 x_1 x_2.$$
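For concreteness, the sketch below generates data of this type: the system is integrated and only noisy observations of x1 are kept, with x2 treated as latent. The values θ* = (2, 1, 4, 1), the initial condition (5, 3), 20 points on [0, 2] and the noise level 0.1 follow the common benchmark setting of [8,14] and should be read as assumptions here.

```python
# Data generation sketch (assumed benchmark settings): observe x1 only.
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, x, theta):
    x1, x2 = x
    return [theta[0] * x1 - theta[1] * x1 * x2,
            -theta[2] * x2 + theta[3] * x1 * x2]

theta_true = np.array([2.0, 1.0, 4.0, 1.0])
t_obs = np.linspace(0.0, 2.0, 20)
sol = solve_ivp(lotka_volterra, (0.0, 2.0), [5.0, 3.0],
                t_eval=t_obs, args=(theta_true,), rtol=1e-8)
rng = np.random.default_rng(0)
y1 = sol.y[0] + 0.1 * rng.standard_normal(t_obs.size)  # observed, noisy x1
x2_truth = sol.y[1]       # withheld from inference; ground truth for the latent state
```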

Figure 2. Reconstruction and inference results for the Lotka–Volterra system, showing the state evolution over time and the parameter distributions. x1 is observable and x2 is the latent variable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared.

Figure 3. The state evolution over time for the Lotka–Volterra system and the parameter inference results. x2 is observable and x1 is the latent variable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) θ.
method | | | | |
---|---|---|---|---|
FGPGM | 1.69 × 10−2 | 0.83 × 10−2 | 3.47 × 10−2 | 2.44 × 10−2 |
FGPGM + Opt. | 1.40 × 10−2 | 0.49 × 10−2 | 4.26 × 10−2 | 3.63 × 10−2 |

method | | | | |
---|---|---|---|---|
FGPGM | 8.00 × 10−2 | 4.79 × 10−2 | 8.32 × 10−2 | 2.35 × 10−2 |
FGPGM + Opt. | 9.82 × 10−2 | 1.88 × 10−2 | 8.39 × 10−2 | 1.17 × 10−2 |

method | | | | |
---|---|---|---|---|
FGPGM | 2.99 × 10−2 | 3.08 × 10−2 | 1.98 × 10−2 | 1.92 × 10−2 |
FGPGM + Opt. | 0.80 × 10−2 | 0.65 × 10−2 | 1.37 × 10−2 | 1.07 × 10−2 |

method | | | | |
---|---|---|---|---|
FGPGM | 7.16 × 10−2 | 0.06 × 10−2 | 1.34 × 10−1 | 0.07 × 10−2 |
FGPGM + Opt. | 1.79 × 10−2 | 1.76 × 10−2 | 7.77 × 10−2 | 1.88 × 10−2 |
The sensitivities of the variables to the parameters are listed in table 5.
Sij | x1 | x2 |
---|---|---|
θ1 | 0.20 | 0.61 |
θ2 | 0.52 | 1.13 |
θ3 | 0.40 | 0.33 |
θ4 | 1.27 | 0.98 |
The sensitivity indices Sij are evaluated at the true parameter set θ*.
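As an illustration of how such indices can be computed, the sketch below uses central finite differences with the normalization S_ij = θ*_j · max_t |∂x_i/∂θ_j| / max_t |x_i|; this particular normalization is one common choice and is stated here as an assumption.

```python
# Finite-difference sensitivity indices at theta* (assumed normalization).
import numpy as np
from scipy.integrate import solve_ivp

def sensitivity_indices(rhs, x0, theta_star, t_grid, h=1e-4):
    def trajectory(theta):
        return solve_ivp(rhs, (t_grid[0], t_grid[-1]), x0,
                         t_eval=t_grid, args=(theta,), rtol=1e-8).y
    base = trajectory(theta_star)                       # shape (n_states, n_times)
    S = np.zeros((len(theta_star), base.shape[0]))
    for j, th in enumerate(theta_star):                 # assumes theta*_j != 0
        up = np.array(theta_star, float); dn = up.copy()
        up[j] += h * th; dn[j] -= h * th                # relative central perturbation
        dx_dth = (trajectory(up) - trajectory(dn)) / (2 * h * th)
        S[j] = th * np.abs(dx_dth).max(axis=1) / np.abs(base).max(axis=1)
    return S                                            # S[j, i]: sensitivity of x_i to theta_j
```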
The cases with a larger measurement noise level (standard deviation 0.5) are shown in figures 4 and 5, corresponding to x1 and x2 observations, respectively. The errors of the reconstructed solutions and identified parameters are summarized in tables 6–9. It can be seen that the prediction of the unknown variable can deviate far from the ground truth if we use the FGPGM method alone (without Step 3 of the algorithm above). The inference of states and parameters is improved after further applying the deterministic optimization, and this improvement is more pronounced than in the low-noise case.
Figure 4. Reconstruction and inference results for the Lotka–Volterra system with x1 observable and x2 latent. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. The observation noise has a standard deviation of 0.5 (large noise case). (a) x1, (b) x2, (c) θ.

Figure 5. The state evolution over time for the Lotka–Volterra system. x2 is observable and x1 is the latent variable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. The observation noise has a standard deviation of 0.5 (large noise case). (a) x1, (b) x2, (c) θ.
method | | | | |
---|---|---|---|---|
FGPGM | 1.56 × 10−1 | 7.30 × 10−2 | 4.03 × 10−1 | 3.81 × 10−1 |
FGPGM + Opt. | 3.02 × 10−2 | 3.08 × 10−2 | 2.70 × 10−2 | 2.68 × 10−2 |

method | | | | |
---|---|---|---|---|
FGPGM | 6.37 × 10−2 | 2.76 × 10−1 | 2.01 | 5.75 × 10−1 |
FGPGM + Opt. | 1.78 × 10−1 | 1.06 × 10−1 | 2.75 × 10−1 | 5.51 × 10−2 |

method | | | | |
---|---|---|---|---|
FGPGM | 3.22 × 10−1 | 1.89 × 10−1 | 8.80 × 10−2 | 5.34 × 10−2 |
FGPGM + Opt. | 6.88 × 10−2 | 6.68 × 10−2 | 8.40 × 10−2 | 7.71 × 10−2 |

method | | | | |
---|---|---|---|---|
FGPGM | 6.40 × 10−1 | 3.44 × 10−1 | 1.10 | 3.63 × 10−1 |
FGPGM + Opt. | 4.35 × 10−2 | 4.35 × 10−2 | 4.72 × 10−1 | 2.16 × 10−1 |
3.2. Spiky dynamics
This example is the system proposed by FitzHugh [17] and Nagumo et al. [18] for modelling the spike potentials in giant squid neurons, abbreviated as the FHN system. It involves two ODEs with three parameters,

$$\dot{x}_1 = \theta_3 \left( x_1 - \frac{x_1^3}{3} + x_2 \right) \quad (3.4)$$

$$\dot{x}_2 = -\frac{1}{\theta_3} \left( x_1 - \theta_1 + \theta_2 x_2 \right). \quad (3.5)$$
The FHN system has notoriously fast-changing dynamics due to its highly nonlinear terms. In the following numerical tests, the Matérn 5/2 kernel [19] was used and γ was set to 0.3, the same value as in Wenk et al. [14]. We assume that one of the two variables is observable. The data were generated with θ* = (θ1, θ2, θ3) = (0.2, 0.2, 3) and perturbed by Gaussian noise with an average signal-to-noise ratio SNR = 100, with 100 data points uniformly spaced in [0, 10]. The burn-in number and the number of valid samples in the FGPGM step are Nburnin = 7500 and NMCMC = 10 000, respectively.
In this case, if we use the FGPGM step alone, the reconstructed solution corresponding to the identified parameters may deviate significantly from the true time series (see figure 6, where data of x1 are observable). It was pointed out in [14] that all GP-based gradient matching algorithms lead to smoother trajectories than the ground truth, and this effect becomes more severe with sparse observations. The least-squares optimization after FGPGM largely reduces this effect and gives a better reconstruction of the solution (figure 7).
Figure 6. Results for the FHN system obtained from the FGPGM method, without further optimization. x1 is observable and x2 is the latent variable. The ground truth, the FGPGM result and the reconstructed solution (integration of the ODEs with the inferred parameters) are compared. (a) x1, (b) x2.

Figure 7. The state evolution over time and identified parameters for the FHN system. x1 is observable and x2 is the latent variable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) θ.
Figures 7 and 8 present the results with only x1 and only x2 observed, respectively. In both cases, the identified parameters are more accurate than those obtained using FGPGM alone. From the sensitivity check in table 10, θ1 is expected to be the most accurately identified because it is the most sensitive of the three parameters, whereas θ2 is the least sensitive and is therefore harder to identify. The numerical results agree with this. It is worth mentioning that only 3500 samples were taken in the FGPGM step, and the time for the optimization step was much less than that for the FGPGM step. The time needed for the whole process is thus greatly reduced compared with Wenk et al. [14], where 100 000 MCMC samples were used.
Figure 8. The state evolution over time and identified parameters for the FHN system. x2 is observable and x1 is the latent variable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) θ.
Sij | x1 | x2 |
---|---|---|
θ1 | 2.33 | 1.24 |
θ2 | 0.44 | 0.31 |
θ3 | 1.01 | 0.55 |
In this example, we also notice that if we use the least-squares optimization method alone, the local-minimum effect can lead to reconstructions far from the ground truth, making it even less robust than the FGPGM method. For example, if we choose an initial guess of the parameters near (θ1, θ2, θ3) = (1.51, 2.2, 1.78), the cost functional falls into a local minimum during the gradient-based search (figure 9). The existence of many local minima in the full observation case has been pointed out in, e.g., [7,20]. These results clearly illustrate how the combination of FGPGM and least-squares optimization avoids local-minimum solutions.
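A minimal sketch of this effect is given below, under assumed data-generation settings and with x1 taken as the observed variable for illustration: the same least-squares fit of the FHN system is started once near the reported problematic point (1.51, 2.2, 1.78) and once closer to the truth, and the first start can converge to a spurious minimum.

```python
# Local-minimum illustration for pure least-squares fitting of the FHN system
# (synthetic noiseless data; initial condition and solver settings are assumptions).
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def fhn(t, x, th):
    v, r = x
    return [th[2] * (v - v**3 / 3 + r), -(v - th[0] + th[1] * r) / th[2]]

t = np.linspace(0.0, 10.0, 100)
x1_data = solve_ivp(fhn, (0, 10), [-1.0, 1.0], t_eval=t,
                    args=((0.2, 0.2, 3.0),), rtol=1e-8).y[0]   # observe x1 only

def residuals(th):
    sol = solve_ivp(fhn, (0, 10), [-1.0, 1.0], t_eval=t, args=(th,), rtol=1e-8)
    return sol.y[0] - x1_data

for start in [(1.51, 2.2, 1.78), (0.3, 0.3, 2.5)]:
    fit = least_squares(residuals, start)
    print(start, "->", fit.x)   # the first start can stall far from (0.2, 0.2, 3)
```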
Figure 9. Results for the FHN system with x1 latent, obtained using least-squares optimization alone with an initial guess of the parameters near a local minimum point. (a) x1, (b) x2.
3.3. Protein transduction
Finally, the protein transduction system proposed in Vyshemirsky & Girolami [21] was adopted to illustrate the performance of the method on ODE systems with more equations. The system is described by

$$\dot{x}_1 = -\theta_1 x_1 - \theta_2 x_1 x_3 + \theta_3 x_4,$$
$$\dot{x}_2 = \theta_1 x_1,$$
$$\dot{x}_3 = -\theta_2 x_1 x_3 + \theta_3 x_4 + \frac{\theta_5 x_5}{\theta_6 + x_5},$$
$$\dot{x}_4 = \theta_2 x_1 x_3 - \theta_3 x_4 - \theta_4 x_4,$$
$$\dot{x}_5 = \theta_4 x_4 - \frac{\theta_5 x_5}{\theta_6 + x_5}.$$
Figure 10 gives the results with x3 unobserved. In fact, the situations in which one of the other variables is unknown give better results than the case illustrated here, and are therefore not presented. We can see that x3 was not well fitted by the FGPGM step alone (Nburnin = 7500, NMCMC = 10 000), whereas further applying the least-squares optimization produces satisfactory results, with the parameters θ2 and θ4 being significantly improved. The sensitivity check is summarized in table 11, from which we can see that θ2 is less sensitive and thereby harder to infer accurately.
Figure 10. The state evolution over time and inferred parameters for the protein transduction system. x3 is unknown and the other variables are observable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) x3, (d) x4, (e) x5, (f) θ.
Sij | x1 | x2 | x3 | x4 | x5 |
---|---|---|---|---|---|
θ1 | 2.86 | 9.78 | 1.73 | 1.77 | 3.33 |
θ2 | 0.70 | 0.98 | 0.22 | 0.59 | 0.41 |
θ3 | 1.35 | 2.11 | 0.47 | 0.92 | 0.90 |
θ4 | 0.26 | 0.43 | 0.03 | 2.64 | 0.62 |
θ5 | 1.53 | 2.58 | 24.48 | 0.90 | 49.38 |
θ6 | 0.04 | 0.07 | 0.60 | 0.02 | 1.21 |
It was mentioned in Dondelinger et al. [10] that θ5 and θ6 in this system are difficult to fit; θ5/θ6 is relatively sensitive but is still likely to be overestimated. It was pointed out in [14] that the parameters of the protein transduction system are unidentifiable. In the present case with partial observation, θ6 is far from the truth, with an inferred value around 3.0, and is therefore not shown in the figure. However, the observations can still be acceptably reconstructed.
It is also of interest to examine the performance of our method with more latent variables. In this model, although x2 is not involved in the equations for the other variables, the data of x2 help infer θ1. We also notice that ẋ3 + ẋ4 + ẋ5 = 0, which means that with known initial conditions the histories of these three variables can be recovered from any two of them. If x1 and x2 are both missing, it is impossible to identify θ1. Therefore, in the following test we choose the data of x2, x4 and x5 as observations. The data have Gaussian noise with standard deviation 0.01, the same as in the previous case with one latent variable. It can be seen from figure 11 that the result from the FGPGM step is worse than in the case with only one latent variable, but the final reconstruction of the latent variables and the parameter identification after Step 3 are not significantly different from the case with one latent variable.
Figure 11. The state evolution over time and inferred parameters for the protein transduction system. x1 and x3 are unknown and the other variables are observable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) x3, (d) x4, (e) x5, (f) θ.
In order to imitate real situations, we considered a case with abnormal measurements, in which some observations deviate significantly from the ground truth. Moreover, only x1 and x5 are observable; note that further reducing the observed variables would make the system unidentifiable. The results are shown in figure 12. At t = 40 and t = 60, the measurement errors are abnormally large (figure 12a,e). The identified parameters and reconstructed state series are not severely affected by these large measurement errors, which indicates some robustness. This also illustrates that the sigmoid kernel is a proper choice for this example. The performance in more realistic situations deserves further investigation, especially for complex dynamical systems.
Figure 12. The state evolution over time and inferred parameters for the protein transduction system. x2, x3 and x4 are unknown and the other variables are observable. The ground truth, the FGPGM result and the result from the combination of FGPGM and optimization are compared. Observations at t = 40, 60 are abnormal, with significant deviation from the ground truth. (a) x1, (b) x2, (c) x3, (d) x4, (e) x5, (f) θ.
In addition, the dependence of the errors of the inferred parameters on the noise level has also been investigated. Table 12 compares different observation noise levels; the measurement errors are independent and identically distributed Gaussian, and x1 and x5 are observable. On the whole, the identified parameters become less accurate as the observation noise level increases. When the standard deviation increases to 0.02, some parameters show such large errors that the identification should be regarded as failed. We note that small errors in some of the parameters do not necessarily imply a better reconstruction of the solution or prediction; this relates to the sensitivities of the parameters and variables, which need further investigation.
noise level (s.d.) | method | | | | | |
---|---|---|---|---|---|---|
0.005 | FGPGM | 0.0127 | 0.1009 | 0.0007 | 0.0312 | 0.0372 |
0.01 | FGPGM | 0.0055 | 0.1312 | 0.0590 | 0.1316 | 0.0308 |
0.02 | FGPGM | 0.0193 | 0.1730 | 1.1257 | 1.4630 | 0.0272 |
0.005 | FGPGM + Opt. | 0.0253 | 0.0489 | 0.0327 | 0.0638 | 0.0602 |
0.01 | FGPGM + Opt. | 0.0267 | 0.0702 | 0.0270 | 0.0736 | 0.0704 |
0.02 | FGPGM + Opt. | 0.0121 | 0.1063 | 1.0824 | 1.4883 | 0.0477 |
3.4. Comparisons among treatments for unobservable variables
We now illustrate the performance of the three ways to infer the unobservable variables proposed in §2. We take the spiky dynamics of §3.2 as an example and choose x2 as the unobservable variable. Here the observation noise for x1 is increased to a standard deviation of 0.1 and the time resolution is reduced to 50 points. The results of the three methods are compared in figures 13–17. In all cases, 2000 burn-in cycles and 3000 sampling cycles are used. In the full integration case, the ode45 solver is applied, while in the partial integration case the Euler formula is adopted. The other computational parameters are unified as γ = 1.0 × 10−3, and the standard deviations that control the sampling steps are 0.001 and 0.12 for x1 and θ, respectively. The initial guess for the unobservable variable is plotted as a dashed line in figure 15b.
Figure 13. The state evolution over time and identified parameters for the FHN system. x2 is unobservable and inferred by integrating the whole ODE system. (a) x1, (b) x2, (c) θ.

Figure 14. The state evolution over time and identified parameters for the FHN system. x2 is unobservable and inferred by integrating the equation for x2 with the inferred x1 as a known coefficient. (a) x1, (b) x2, (c) θ.

Figure 15. The state evolution over time and identified parameters for the FHN system. x2 is unobservable and inferred by sampling under the Gaussian process assumption. (a) x1, (b) x2, (c) θ.

Figure 16. The state evolution over time and identified parameters for the FHN system. x2 is unobservable and inferred by sampling under the Gaussian process assumption; the initial guess of x2 differs from that in figure 15b. (a) x1, (b) x2, (c) θ.

Figure 17. The state evolution over time and identified parameters for the FHN system. x2 is unobservable and inferred by sampling under the Gaussian process assumption; the initial guess of x2 at t = 0 is also far from the ground truth. (a) x1, (b) x2, (c) θ.
The total CPU times for the full integration, partial integration and sampling approaches are 207.6 s, 180.7 s and 529.4 s, respectively. In the partial integration case, if denoising and interpolation of the observed variables, or a high-order integration scheme, are needed to ensure stability and convergence, the computational cost will increase. If most of the variables in a high-dimensional system are unobservable, partial integration may not save significant time compared with full integration. In the present example, the system is so simple that the computational expense of integration is very low compared with sampling and probability evaluation; the CPU time is therefore much higher in the third approach, where the number of sampled states is doubled. However, when the system is complex enough that the cost of solving it becomes an issue, the sampling approach with proper initial guesses may show advantages, since its cost depends on the number of time points and sparse approximations work well in Gaussian process methods [8]. The results show that the sampling approach yields a better parameter approximation than the integration methods (figure 15c), although its inference of x2 is worse in terms of the curve features. It should be noted that the results of the full sampling method depend on the initial guess: an inappropriate initial distribution may lead to large deviations from the truth at the same computational parameters (figure 16b,c). If the system is large and most of the variables are unobservable, their initial guesses may become an issue. To investigate the case of unknown initial conditions, we propose an initial guess in which x2 at t = 0 deviates substantially from the ground truth (figure 17b). It can be seen that the sampling is still able to find a solution close to the truth for this system. We should point out that a lack of initial conditions may lead to non-uniqueness of the identification for some systems [22].
Based on the above discussion, we suggest the following. If the system is small enough that solving the ODEs is cheap, one can adopt the integration approaches, since they are independent of prior information on the unobserved variables. If the integration requires a fine time step and most of the variables in the system are unobservable, full integration is preferred, since it avoids the denoising and interpolation of the observable variables that appear in the equations for the unobservable ones. For large-scale problems with reliable prior information, the full sampling method may have advantages in both accuracy and efficiency.
3.5. Indirect observation
In this section, we consider the case where the observable output y(t) is a function of the original system variables x(t). The observation with noise then becomes Y(t) = y(x(t)) + ɛ. Take the following system as an example:
For such problems, we take the time derivative of the output function,

$$\dot{y} = \nabla_x y(x) \cdot \dot{x} = \nabla_x y(x) \cdot f(x, \theta),$$

so that the gradient matching can be carried out on the observed output.

Figure 18. The state evolution over time and identified parameters for example 4. Only the output y is observable. (a) x1, (b) x2, (c) y, (d) θ.
4. Discussion
In this work, we proposed an effective method for parameter inference of coupled ODE systems with partially observable data. Our method builds on the previous work known as FGPGM [14], which avoids product-of-experts heuristics. We provide the treatment of latent variables and the graphical framework for such problems, and derive the corresponding likelihood formula. To improve the accuracy and efficiency of the method, we also employ least-squares optimization in our computation. In our numerical tests, only 10% of the sampling number suggested in the literature is used for the FGPGM step, which already provides a good initial guess for the least-squares optimization. Owing to the existence of latent variables, three approaches for the treatment of unobservable variables are provided and their performances are discussed; suggestions for choosing among them are given according to the system scale and the reliability of the prior information. The present algorithm is robust and accurate under large data noise and partial observations. At the same time, thanks to the Gaussian process-based inference step, the least-squares optimization step more easily escapes local minima and reaches the global minimum, which reduces the dependence on the initial guess of the parameters.
Data accessibility
Data and relevant code for this research work are stored in GitHub: https://github.com/yuchensufe/PartialObservation-GPGM/tree/v1.0 and have been archived within the Zenodo repository: https://doi.org/10.5281/zenodo.4501573.
Authors' contributions
Y.C. and S.X. designed the algorithm. Y.C. conducted numerical simulations. Y.C., J.C., A.G., H.H. and S.X. contributed to the research conceptualization and prepared the manuscript.
Competing interests
We declare we have no competing interests.
Funding
This work was supported in part by Kunshan Government Research Fund (grant no. R97030012S 2020DKU0032 (S1310)), the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Fields Institute for Research in Mathematical Sciences.
Acknowledgements
The authors also express gratitude to Prof. Xin Gao, Nathan Gold and other members of the Fields CQAM Lab on Health Analytics and Modelling for very beneficial discussions.
Appendix A. Preliminaries
In the following, we list some preliminaries on derivatives of a Gaussian process that are used in this work; the proofs can be found in, e.g., [14,19,23]. Denote by Xt a random process, by x its realization and by Ẋt its time derivative.
Definition A.1. [23]

The sequence of random variables Xn converges to the random variable X in the first-mean sense (limit in mean) if

$$\lim_{n \to \infty} E\left[ |X_n - X| \right] = 0.$$
Definition A.2.

The stochastic process Xt is first-mean differentiable, with derivative Ẋt, if

$$\lim_{\delta t \to 0} E\left[ \left| \frac{X_{t+\delta t} - X_t}{\delta t} - \dot{X}_t \right| \right] = 0.$$
Definition A.3.

For a given random variable X, the moment generating function (MGF) is defined by

$$M_X(s) = E\left[ e^{sX} \right].$$
Proposition A.4.

If MX(s) is the MGF of X, then

1. $M_X^{(m)}(0) = E[X^m]$, where $E[X^m]$ is the mth moment of X.
2. Let X and Y be two random variables. X and Y have the same distribution if and only if they have the same MGF.
3. X ∼ N(μ, η²) if and only if $M_X(s) = \exp\left(\mu s + \tfrac{1}{2}\eta^2 s^2\right)$.
4. If X and Y are two independent random variables, then the MGF of their sum is $M_{X+Y}(s) = M_X(s)\,M_Y(s)$.
By the above proposition, one has the following.
Lemma A.5.
If X and Y are two independent Gaussian random variables with means μX, μY and variances η²X, η²Y, then X + Y is a Gaussian random variable with mean μX + μY and variance η²X + η²Y.
Definition A.6. [19]
A real-valued stochastic process {Xt}t∈T, where T is an index set, is a Gaussian process if all of its finite-dimensional distributions are multivariate normal. That is, for any choice of distinct values t1, t2, …, tN ∈ T, the random vector X = (Xt1, …, XtN)ᵀ has a multivariate normal distribution with joint probability density function

$$p(x) = (2\pi)^{-N/2} |\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right),$$

where μ is the mean vector and Σ the covariance matrix.
A Gaussian process depends only on its mean and covariance functions. A usual choice of covariance function is the squared exponential kernel,

$$k_\phi(t_1, t_2) = \exp\left( -\frac{(t_1 - t_2)^2}{2 l^2} \right),$$

where l is a hyperparameter representing the non-local interaction length scale.
Let t0, δt ∈ R and let Xt be a Gaussian process with constant mean μ and kernel function kϕ(t1, t2), assumed to be first-mean differentiable. Then Xt0 and the difference quotient (Xt0+δt − Xt0)/δt are jointly Gaussian distributed.
The above derivation shows that Xt0 and (Xt0+δt − Xt0)/δt are jointly Gaussian distributed. Using the definition of the first-mean derivative and the fact that rth-mean convergence implies convergence in distribution, it follows that Xt and Ẋt are jointly Gaussian.
In general, the states x and their derivatives ẋ at the observation times are jointly Gaussian distributed.
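For reference, the standard form of this joint distribution, consistent with [14,19], is reconstructed below; it assumes a constant mean μ (so the derivative process has zero mean) and uses the once- and twice-differentiated kernel as cross- and auto-covariances.

```latex
% Joint distribution of a GP and its derivative at the observation times
% (reconstruction; cf. [14,19]).
\begin{pmatrix} \mathbf{x} \\ \dot{\mathbf{x}} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \boldsymbol{\mu} \\ \mathbf{0} \end{pmatrix},
\begin{pmatrix} K_{\phi} & K'_{\phi} \\ (K'_{\phi})^{T} & K''_{\phi} \end{pmatrix}
\right),
\quad
(K'_{\phi})_{ij} = \left.\frac{\partial k_{\phi}(t_i, s)}{\partial s}\right|_{s = t_j},
\quad
(K''_{\phi})_{ij} = \left.\frac{\partial^{2} k_{\phi}(r, s)}{\partial r\,\partial s}\right|_{r = t_i,\ s = t_j}.
```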
Lemma A.7. (Matrix inversion lemma)

Let Σ be a p × p matrix (p = n + m), partitioned as

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

where Σ11 is n × n and Σ22 is m × m. If Σ and Σ22 are invertible, then

$$\Sigma^{-1} = \begin{pmatrix} A & -A\,\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}A & \;\Sigma_{22}^{-1} + \Sigma_{22}^{-1}\Sigma_{21}A\,\Sigma_{12}\Sigma_{22}^{-1} \end{pmatrix},
\qquad A = \left( \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)^{-1}.$$
Lemma A.8. (Conditional Gaussian distributions)

Let X1 ∈ Rⁿ and X2 ∈ Rᵐ be jointly Gaussian random vectors with distribution

$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right).$$

According to the above lemma, we have the conditional distribution

$$X_1 \mid X_2 = x_2 \sim \mathcal{N}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).$$
Lemma A.9.
Appendix B. Proof of theorem 2.1
Proof.
The joint density over all variables in figure 1 can be represented as