Royal Society Open Science
Open Access Research article

Numerical method for parameter inference of systems of nonlinear ordinary differential equations with partial observations

Yu Chen
School of Mathematics, Shanghai University of Finance and Economics, Shanghai, People’s Republic of China
Centre for Quantitative Analysis and Modeling (CQAM), The Fields Institute for Research in Mathematical Sciences, 222 College Street, Toronto, Ontario, Canada

Jin Cheng
School of Mathematical Sciences, Fudan University, Shanghai 200433, People’s Republic of China
School of Mathematics, Shanghai University of Finance and Economics, Shanghai, People’s Republic of China

Arvind Gupta
Computer Science, University of Toronto, Toronto, Ontario, Canada

Huaxiong Huang
Joint Mathematical Research Centre of Beijing Normal University and BNU-HKBU United International College, Zhuhai, People’s Republic of China
Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada
Centre for Quantitative Analysis and Modeling (CQAM), The Fields Institute for Research in Mathematical Sciences, 222 College Street, Toronto, Ontario, Canada
Computer Science, University of Toronto, Toronto, Ontario, Canada
[email protected]

Shixin Xu
Duke Kunshan University, 8 Duke Ave, Kunshan, Jiangsu, People’s Republic of China
[email protected]
    Abstract

    Parameter inference of dynamical systems is a challenging task faced by many researchers and practitioners across various fields. In many applications, it is common that only a limited subset of the variables is observable. In this paper, we propose a method for parameter inference of a system of nonlinear coupled ordinary differential equations with partial observations. Our method combines fast Gaussian process-based gradient matching with deterministic optimization algorithms. By using initial values obtained from Bayesian steps with low sampling numbers, our deterministic optimization algorithm is accurate, robust and efficient under partial observations and large noise.

    1. Introduction

    Many problems in science and engineering can be modelled by systems of ordinary differential equations (ODEs). It is often difficult or impossible to measure some parameters of the systems directly. Therefore, various methods have been developed to estimate parameters based on available data. Mathematically, such problems are classified as inverse problems, which have been widely studied [1–4]. They can also be treated as parameter inference problems in statistics. Statistical inference enables the construction of data-efficient learning machines and shows advantages in dealing with noisy data, and it is increasingly being studied and discussed (e.g. [5–7]).

    For nonlinear ODEs, standard statistical inference is time consuming because numerical integration is needed after each update of the parameters [8,9]. Recently, gradient matching techniques have been proposed to circumvent the high computational cost of numerical integration [7,10–12]. These techniques are based on minimizing the difference between the time derivatives (gradients) and the right-hand side of the equations. This usually involves a process consisting of two steps: data interpolation and parameter adaptation. Among them, non-parametric Bayesian modelling with Gaussian processes is one of the most promising approaches; it performs data interpolation with a Gaussian process and parameter adaptation by matching the solution or the model. An adaptive gradient matching method based on a product-of-experts approach and a marginalization over the derivatives of the state variables was proposed by Calderhead et al. [8] and extended by Dondelinger et al. [10]. Barber & Wang [13] proposed a Gaussian process-based method in which the state variables are marginalized. Macdonald et al. [9] provided an interpretation of the above paradigms. Wenk et al. [14] proposed a fast Gaussian process-based gradient matching (FGPGM) algorithm with a theoretical framework for systems of nonlinear ODEs, which was shown to be more accurate, robust and efficient.

    For many practical problems, the variables are only partially observable, or not observable at all times [15]. As a consequence, parameter inference is more challenging, even for a coupled system whose parameters are uniquely determined by the partially observed data under certain initial conditions. It is not clear whether the gradient matching techniques can be applied when there are latent variables. The Markov chain Monte Carlo algorithm is able to sidestep the issue of parameter identifiability in many cases, but convergence remains a serious issue [7]. Therefore, the graphical model, density formula and algorithm for inference with partial observations, and especially the treatment of the latent variables, are worth investigating. Moreover, we need to pay attention to the feasibility, accuracy, robustness and computational cost of numerical computations for such problems.

    In this work, we focus on the case of parameter inference with partially observable data. The main idea is to treat the observable and non-observable variables differently. For observable variables, we use the same approach as proposed by Wenk et al. [14]. We provide three approaches to deal with the unobserved variables, including numerical integration and Gaussian process sampling, and compare their performances. To circumvent the high computational cost of sampling in Bayesian approaches, we also combine the extended FGPGM with a least-squares optimization method. The remainder of the paper is organized as follows. In §2, we give the numerical method for parameter identification problems with partial observations. Numerical examples are presented in §3. Some concluding remarks are given in §4.

    2. Algorithm

    The main strategy of Gaussian process-based gradient matching is to minimize the mismatch between the data and the ODE solutions in a maximum-likelihood sense, making use of the property that the Gaussian process family is closed under differentiation. In this section, we extend the FGPGM method proposed in [14] to the situation with unobserved variables.

    In this work, we would like to estimate the time-independent parameters θ of a dynamical system described by

    \dot{X}(t) = f(X(t), \theta),  (2.1)

    where \dot{X} is the vector of time derivatives of the variable X = (X_1(t), X_2(t), \ldots, X_{N_1}(t)) and f can be a nonlinear vector-valued function. We assume only part of the variables are measurable and denote them as X_M. Throughout the paper, we use the subscript M to specify measurable components. They are observed at discrete time points as Y(t_i) (i = 1, \ldots, N_2) with noise ε such that Y = X_M + ε. We assume that the noise is Gaussian, ε(t_i) ∼ N(0, σ²I); then

    \rho(y | x_M, \sigma) = N(y | x_M, \sigma^2 I),  (2.2)

    where x_M and y are the realizations of X_M and Y, respectively. The latent/unmeasurable variables are denoted as X_L, with dim(X_M) + dim(X_L) = dim(X).

    The idea of Gaussian process-based gradient matching is as follows. First, we put a Gaussian process prior on x_M,

    \rho(x_M | \mu_M, \phi) = N(x_M | \mu_M, C_\phi).  (2.3)

    Here C_\phi denotes the covariance matrix, whose components are given by C_\phi(i, j) = k_\phi(t_i, t_j) for a kernel function k_\phi parametrized by the hyperparameters \phi. Then, according to lemma A.8, the conditional distribution of the kth state derivatives is

    \rho(\dot{x}_k | x_k, \phi_k) = N\big(\dot{x}_k \,\big|\, D_k (x_k - \mu_k), A_k\big),  (2.4)

    where

    D_k = C_{\phi_k}(\dot{x}_k, x_k) \, C_{\phi_k}(x_k, x_k)^{-1}  (2.5)

    and

    A_k = C_{\phi_k}(\dot{x}_k, \dot{x}_k) - C_{\phi_k}(\dot{x}_k, x_k) \, C_{\phi_k}(x_k, x_k)^{-1} \, C_{\phi_k}(x_k, \dot{x}_k).  (2.6)

    Here x_k = (x_k(t_1), x_k(t_2), \ldots, x_k(t_{N_2})). For more details, we refer to appendix A. We set up the chain graph in figure 1 to show the relationship between the variables. In figure 1, F_M is the deterministic output of the measurable parts of the ODEs, whose realization is determined by x and θ. \bar{F}_M is introduced to incorporate the model uncertainty, which is assumed to be Gaussian noise with standard deviation γ, so that for its realization \bar{f}_M we have

    \rho(\bar{f}_M | \dot{x}_M, \gamma) = N(\bar{f}_M | \dot{x}_M, \gamma I).  (2.7)
    Then the joint density can be represented by the following theorem.
    Figure 1. Chain graph with partially observable variables.

    Theorem 2.1.

    Given the modelling assumptions summarized in the graphical probabilistic model in figure 1,

    \rho(x_M, \theta | y, \phi, \sigma, \gamma) \propto \rho(\theta)\, N(x_M | 0, C_\phi)\, N(y | x_M, \sigma^2 I)\, N\big(f_M(x_M, \tilde{x}_L(x_M, \theta), \theta) \,\big|\, D x_M, A + \gamma I\big),  (2.8)

    where \tilde{x}_L(x_M, \theta), involved in f_M, is the solution of the latent part of the system determined by θ and x_M.

    The proof is given in appendix B.

    In our computation, x̃_L can be obtained by integrating the ODE system numerically with the proposed θ and the initial values of x_M and x_L. The goal is then to maximize the likelihood function ρ(x_M, θ | y, φ, σ, γ).
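    To make the use of the density (2.8) concrete, the following is a minimal Python sketch of how its logarithm could be evaluated for a proposed pair (x_M, θ) when a single component is observed. It is not the authors' implementation: the helper names (rhs, obs_idx), the use of scipy's solve_ivp to obtain x̃_L and the flat prior on θ are illustrative assumptions; the GP matrices C_φ, D and A are assumed to have been precomputed from the fitted kernel.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import multivariate_normal

def log_density(x_M, theta, y, t, C_phi, D, A, sigma, gamma, rhs, x0, obs_idx):
    """Unnormalized log of rho(x_M, theta | y, phi, sigma, gamma) in eq. (2.8),
    for a single observed component obs_idx of the system x' = rhs(x, theta).
    The latent trajectory x_tilde_L is obtained by integrating the full ODE system."""
    n = len(t)
    lp = multivariate_normal.logpdf(x_M, mean=np.zeros(n), cov=C_phi)         # GP prior on x_M
    lp += multivariate_normal.logpdf(y, mean=x_M, cov=sigma**2 * np.eye(n))   # observation model
    sol = solve_ivp(lambda s, x: rhs(x, theta), (t[0], t[-1]), x0,
                    t_eval=t, rtol=1e-8)                                      # x_tilde_L(x_M, theta)
    x_full = sol.y.copy()
    x_full[obs_idx] = x_M                      # keep the sampled values of the observed state
    f_M = np.array([rhs(x_full[:, k], theta)[obs_idx] for k in range(n)])     # measurable part of f
    lp += multivariate_normal.logpdf(f_M, mean=D @ x_M,
                                     cov=A + gamma * np.eye(n))               # gradient matching
    return lp                                  # a flat prior rho(theta) is assumed
```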

    The present algorithm is a combination of Gaussian process-based gradient matching and a least-squares optimization. In the GP gradient matching step, the Gaussian process model is first fitted by inferring the hyperparameters φ. Secondly, the states and parameters are inferred using a one-chain MCMC scheme on the density, as in [14]. Finally, the parameters estimated above are used as the initial guess for the least-squares optimization. The algorithm can be summarized as follows.

    Algorithm

    Input: y, f(x, θ), γ, N_MCMC, N_burnin, t, σ_s, σ_p
    Step 1. Fit the Gaussian process model to the data.
    Step 2. Infer x_M, x_L and θ using MCMC:
    S ← ∅
    for i = 1 → N_MCMC + N_burnin do
      T ← ∅
      for each state x_Mj(t_k) do
       Propose a new state value using a Gaussian distribution with standard deviation σ_s
       Accept the proposed value based on the density (equation (2.8))
       Add the current value to T
      end for
      for each parameter do
       Propose a new parameter value using a Gaussian distribution with standard deviation σ_p
       Infer x_L by integration or sampling (detailed later in this section)
       Accept the proposed value based on the density (equation (2.8))
       Add the current value to T
      end for
      Add the mean of T to S
    end for
    Discard the first N_burnin samples of S
    Return x_M, x_L, θ
    Step 3. Optimize equation (2.11) using θ from Step 2 as the initial guess.
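    As a complement to the pseudocode above, here is a minimal sketch (not the authors' code) of the accept/reject operation used to "accept the proposed value based on the density": a standard Metropolis step with a symmetric Gaussian proposal is assumed, and log_density_at stands for the log of equation (2.8) with all other quantities held fixed.

```python
import numpy as np

def metropolis_update(value, log_density_at, sigma_step, rng):
    """One Metropolis step: propose value' ~ N(value, sigma_step^2) and accept it
    with probability min(1, rho(value') / rho(value))."""
    proposal = value + sigma_step * rng.normal()
    log_ratio = log_density_at(proposal) - log_density_at(value)
    return proposal if np.log(rng.uniform()) < log_ratio else value

# Illustrative use on one scalar quantity (a state value or a parameter).
rng = np.random.default_rng(0)
log_density_at = lambda v: -0.5 * (v - 1.0) ** 2          # placeholder target density
v = 0.0
for _ in range(1000):
    v = metropolis_update(v, log_density_at, sigma_step=0.1, rng=rng)
```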

    In Step 1, the Gaussian process model is fitted to the data by maximizing the log marginal likelihood of the observations y at times t,

    \log \rho(y | t, \phi, \sigma) = -\frac{1}{2} y^{\mathrm T} (C_\phi + \sigma^2 I)^{-1} y - \frac{1}{2} \log |C_\phi + \sigma^2 I| - \frac{n}{2} \log 2\pi,  (2.9)

    with respect to the hyperparameters φ and σ, where σ is the estimated standard deviation of the observation noise and n is the number of observations.
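    A minimal sketch of this fitting step for one observed state, assuming an RBF kernel and scipy's general-purpose optimizer, is given below; the kernel parametrization, the log-transform of the hyperparameters and the placeholder data are illustrative choices rather than the exact implementation of [14].

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(t, lengthscale, signal_var):
    """Squared-exponential kernel k(t_i, t_j) = s^2 exp(-(t_i - t_j)^2 / (2 l^2))."""
    d = t[:, None] - t[None, :]
    return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

def neg_log_marginal_likelihood(log_params, t, y):
    """Negative of eq. (2.9); log_params = log of (lengthscale, signal variance, noise std)."""
    l, s2, sigma = np.exp(log_params)
    K = rbf_kernel(t, l, s2) + sigma**2 * np.eye(len(t))
    sign, logdet = np.linalg.slogdet(K)
    return 0.5 * y @ np.linalg.solve(K, y) + 0.5 * logdet + 0.5 * len(t) * np.log(2 * np.pi)

# Step 1 for one observed state: fit phi = (lengthscale, signal variance) and sigma jointly.
t = np.linspace(0.0, 2.0, 20)
y = np.sin(3.0 * t) + 0.1 * np.random.randn(t.size)        # placeholder observations
res = minimize(neg_log_marginal_likelihood, x0=np.log([0.5, 1.0, 0.1]), args=(t, y))
lengthscale, signal_var, noise_std = np.exp(res.x)
```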

    For the inference of the unobserved variables, there are several possible approaches. The first is to integrate the whole ODE system, which depends only on the initial conditions and the proposed value of the parameters. An alternative is to integrate the system partially, that is, to integrate only the equations for the unobservable variables; the coupled observable variables are regarded as known coefficients whose values are taken from the observations. To ensure the stability and convergence of the integration, smoothing and interpolation may be necessary if the data of the observable variables are sparse and noisy. In these two methods, the numerical integration of x̃_L is carried out only after each update of θ. The third approach to deal with the latent variables is to apply the same sampling process as for the observable states under the Gaussian process assumption and perform gradient matching, in which the posterior probability is adapted as

    \rho(x_L | x_M, \theta, \phi, \sigma, \gamma) \propto \rho(\theta)\, N(x_L | 0, C_\phi)\, N\big(f_L(x_M, x_L, \theta) \,\big|\, D x_L, A + \gamma I\big).  (2.10)

    Note that although there is no observation to match for the latent variables, the gradient matching is still valid. Thus, in the integration approaches only the observable variables are sampled, while in the third approach all the variables are sampled together, with the likelihood function updated in each MCMC sampling cycle. The performances of these three methods are discussed in §3.4. The last step is to solve the following minimization problem:

    \min_\theta \| x_M(\theta) - y \|_{L^2(0,T)}^2 = \min_\theta \sum_{i=1}^{N_2} | x_M(t_i) - y(t_i) |^2.  (2.11)

    In the optimization process, a gradient descent method is adopted, with the gradient computed numerically in each search step. One advantage of least-squares optimization is its ability to obtain accurate results at low computational cost, especially for observations with Gaussian noise. However, it requires a proper initial guess of the parameters to avoid falling into local minima, whereas gradient matching is relatively less sensitive to the initial guess. On the other hand, FGPGM requires a large number of MCMC samples for the expectations of the random variables to be meaningful, which increases the computational cost. Therefore, by combining the two methods, it is possible to use only a few MCMC samples to obtain a good initial guess for the least-squares optimization.
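    The following sketch illustrates Step 3 on a small stand-in system (the two-state linear ODE and all numbers are purely illustrative). scipy's least_squares, which approximates the Jacobian by finite differences, is used here in place of the plain gradient descent described above; theta_fgpgm plays the role of the initial guess delivered by the FGPGM step.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def rhs(t, x, theta):
    """Illustrative stand-in system: x1' = -theta1*x1 + x2, x2' = -theta2*x2."""
    return [-theta[0] * x[0] + x[1], -theta[1] * x[1]]

def residuals(theta, t_obs, y_obs, x0, obs_idx):
    """Entries x_M(t_i; theta) - y(t_i); their squared sum is the cost in eq. (2.11)."""
    sol = solve_ivp(rhs, (t_obs[0], t_obs[-1]), x0, t_eval=t_obs, args=(theta,), rtol=1e-8)
    return sol.y[obs_idx] - y_obs

t_obs = np.linspace(0.0, 5.0, 25)
x0 = [1.0, 2.0]                                             # initial condition (assumed known)
y_obs = solve_ivp(rhs, (0.0, 5.0), x0, t_eval=t_obs, args=([0.7, 1.3],),
                  rtol=1e-8).y[0] + 0.05 * np.random.randn(t_obs.size)   # observe x1 only
theta_fgpgm = [0.5, 1.0]                                    # initial guess from Step 2
fit = least_squares(residuals, theta_fgpgm, args=(t_obs, y_obs, x0, 0))
```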

    3. Experiments

    For the Gaussian process regression step for the observable variables, the code published alongside Wenk et al. [14] was used. The gradient matching part was then adapted to the partial observation case according to figure 1. In the following, FGPGM refers to this modified fast Gaussian process gradient matching algorithm adapted to the partial observation case.

    3.1. Lotka–Volterra

    The Lotka–Volterra system was originally proposed by Lotka [16] to model prey–predator interaction; its dynamics are given by

    \dot{x}_1 = \theta_1 x_1(t) - \theta_2 x_1(t) x_2(t)  (3.1)

    and

    \dot{x}_2 = -\theta_3 x_2(t) + \theta_4 x_1(t) x_2(t),  (3.2)
    where θ1, θ2, θ3, θ4 > 0. In the present work, only one of the two variables was observed, together with the initial value of the other. The rest of the set-up is the same as in Gorbach et al. [11] and Wenk et al. [14]. The observed series lies in the time interval [0, 2] at 20 uniformly distributed observation times. The initial values of the variables are (5, 3). The history of the observable variable is generated by numerically integrating the system with the true parameters θ* = (θ1, θ2, θ3, θ4) = (2, 1, 4, 1) and adding Gaussian noise with standard deviation 0.1. The radial basis function (RBF) kernel was used for the Gaussian process. For the model uncertainty, we set γ = 0.3. For the FGPGM step, the burn-in number and valid sample number are Nburnin = 7500 and NMCMC = 10 000, respectively. The results with x1 being observed are shown in figure 2. Those with observation of x2 are given in figure 3. The errors of the reconstructed solution and parameters are summarized in tables 1–4. At this noise level, the inferences with and without the optimization step both perform well, with the overall relative error being of order 10−2. In the case with measurement of x2, we can see that the optimization process improves the results from FGPGM, with the discernible errors of θ1 and θ3 in figure 3c being reduced. Although θ2 and θ4 appear quite accurate when using FGPGM only (Nburnin = 7500, NMCMC = 10 000) in table 4, the relatively large error in θ3 may affect the reconstructed solution owing to the coupling effect of the parameters in the system. By contrast, the parameter errors after the combination of FGPGM and optimization are all of order 10−2.
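    For reference, the synthetic data described above could be generated as in the following sketch (the solver tolerance and the random seed are arbitrary choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, x, theta):
    """Right-hand side of equations (3.1)-(3.2)."""
    x1, x2 = x
    return [theta[0] * x1 - theta[1] * x1 * x2, -theta[2] * x2 + theta[3] * x1 * x2]

theta_true = [2.0, 1.0, 4.0, 1.0]
x0 = [5.0, 3.0]
t_obs = np.linspace(0.0, 2.0, 20)               # 20 uniformly spaced observation times on [0, 2]
sol = solve_ivp(lotka_volterra, (0.0, 2.0), x0, t_eval=t_obs, args=(theta_true,), rtol=1e-8)
rng = np.random.default_rng(0)
y1 = sol.y[0] + rng.normal(0.0, 0.1, t_obs.size)  # observed series: x1 plus N(0, 0.1^2) noise
# x2 is latent: only its initial value x0[1] is assumed to be known.
```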
    Figure 2. Reconstruction and inference results for the Lotka–Volterra system, showing the state evolution over time and parameter distributions. x1 is observable and x2 is latent variable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared.

    Figure 3. The state evolution over time for Lotka–Volterra system and parameter inference results. x2 is observable and x1 is latent variable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) θ.

    Table 1. Error of reconstructed solution with x1 being observable.

    method ‖x1 − x1*‖_{L∞}/‖x1*‖_{L∞} ‖x1 − x1*‖_{L2}/‖x1*‖_{L2} ‖x2 − x2*‖_{L∞}/‖x2*‖_{L∞} ‖x2 − x2*‖_{L2}/‖x2*‖_{L2}
    FGPGM 1.69 × 10−2 0.83 × 10−2 3.47 × 10−2 2.44 × 10−2
    FGPGM + Opt. 1.40 × 10−2 0.49 × 10−2 4.26 × 10−2 3.63 × 10−2

    Table 2. Error of reconstructed parameters with x1 being observable.

    method |θ1 − θ1*| |θ2 − θ2*| |θ3 − θ3*| |θ4 − θ4*|
    FGPGM 8.00 × 10−2 4.79 × 10−2 8.32 × 10−2 2.35 × 10−2
    FGPGM + Opt. 9.82 × 10−2 1.88 × 10−2 8.39 × 10−2 1.17 × 10−2

    Table 3. Error of reconstructed solution with x2 being observable.

    method ‖x1 − x1*‖_{L∞}/‖x1*‖_{L∞} ‖x1 − x1*‖_{L2}/‖x1*‖_{L2} ‖x2 − x2*‖_{L∞}/‖x2*‖_{L∞} ‖x2 − x2*‖_{L2}/‖x2*‖_{L2}
    FGPGM 2.99 × 10−2 3.08 × 10−2 1.98 × 10−2 1.92 × 10−2
    FGPGM + Opt. 0.80 × 10−2 0.65 × 10−2 1.37 × 10−2 1.07 × 10−2

    Table 4. Error of reconstructed parameters with x2 being observable.

    method |θ1 − θ1*| |θ2 − θ2*| |θ3 − θ3*| |θ4 − θ4*|
    FGPGM 7.16 × 10−2 0.06 × 10−2 1.34 × 10−1 0.07 × 10−2
    FGPGM + Opt. 1.79 × 10−2 1.76 × 10−2 7.77 × 10−2 1.88 × 10−2

    The sensitivities of the variables to the parameters are listed in table 5.

    Table 5. Sensitivity of each variable to parameters for Lotka–Volterra system at θ* = (2, 1, 4, 1). The sensitivity index is defined in equation (3.3).

    Sij x1 x2
    θ1 0.20 0.61
    θ2 0.52 1.13
    θ3 0.40 0.33
    θ4 1.27 0.98

    The sensitivity indices at the true parameter set θ* are defined as

    S_{ij} = \frac{1}{\| x_i \|_{L^2(T_1,T_2)}} \left\| \frac{\partial x_i(t; \theta)}{\partial \theta_j} \right\|_{L^2(T_1,T_2)} \Bigg|_{\theta = \theta^*},  (3.3)

    which are normalized by the magnitude of the solution. The derivative is approximated by a numerical difference. Near the true parameter set, the variables are relatively less sensitive to θ1 and θ3 than to the other parameters. This explains why θ1 and θ3 are less accurately identified in the numerical tests (figures 2c and 3c).
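    A finite-difference sketch of this sensitivity computation for the Lotka–Volterra system is shown below; the time grid, the perturbation size delta and the solver tolerances are illustrative choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, x, theta):
    x1, x2 = x
    return [theta[0] * x1 - theta[1] * x1 * x2, -theta[2] * x2 + theta[3] * x1 * x2]

def sensitivity_indices(rhs, theta_star, x0, t_grid, delta=1e-4):
    """Approximate S_ij of eq. (3.3) by forward differences in theta_j."""
    base = solve_ivp(rhs, (t_grid[0], t_grid[-1]), x0, t_eval=t_grid,
                     args=(theta_star,), rtol=1e-10).y
    norms = np.sqrt(np.trapz(base**2, t_grid, axis=1))              # ||x_i||_{L2(T1,T2)}
    S = np.zeros((len(theta_star), base.shape[0]))
    for j in range(len(theta_star)):
        pert = np.array(theta_star, dtype=float)
        pert[j] += delta
        xp = solve_ivp(rhs, (t_grid[0], t_grid[-1]), x0, t_eval=t_grid,
                       args=(pert,), rtol=1e-10).y
        dx_dtheta = (xp - base) / delta                             # d x_i / d theta_j
        S[j] = np.sqrt(np.trapz(dx_dtheta**2, t_grid, axis=1)) / norms
    return S                                                        # rows: theta_j, columns: x_i

S = sensitivity_indices(lotka_volterra, [2.0, 1.0, 4.0, 1.0], [5.0, 3.0],
                        np.linspace(0.0, 2.0, 200))
```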

    The cases with a larger measurement noise level (standard deviation 0.5) are shown in figures 4 and 5, corresponding to observations of x1 and x2, respectively. The errors of the reconstructed solution and identified parameters are summarized in tables 6–9. It can be seen that the prediction of the unknown variable can deviate far from the ground truth if the FGPGM method is used alone (without Step 3 in the algorithm above). The inference of states and parameters improves after further applying the deterministic optimization, and this improvement is more pronounced than in the low-noise case.

    Figure 4. Reconstruction and inference results for the Lotka–Volterra system with x1 being observable and x2 latent. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. The observation noise has a standard deviation 0.5 (large noise case). (a) x1, (b) x2, (c) θ.

    Figure 5. The state evolution over time for Lotka–Volterra system. x2 is observable and x1 is latent variable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. The observation noise has a standard deviation 0.5 (large noise case). (a) x1, (b) x2, (c) θ.

    Table 6. Error of reconstructed solution with x1 being observable (large noise).

    method ‖x1 − x1*‖_{L∞}/‖x1*‖_{L∞} ‖x1 − x1*‖_{L2}/‖x1*‖_{L2} ‖x2 − x2*‖_{L∞}/‖x2*‖_{L∞} ‖x2 − x2*‖_{L2}/‖x2*‖_{L2}
    FGPGM 1.56 × 10−1 7.30 × 10−2 4.03 × 10−1 3.81 × 10−1
    FGPGM+Opt. 3.02 × 10−2 3.08 × 10−2 2.70 × 10−2 2.68 × 10−2

    Table 7. Error of reconstructed parameters with x1 being observable (large noise).

    method |θ1 − θ1*| |θ2 − θ2*| |θ3 − θ3*| |θ4 − θ4*|
    FGPGM 6.37 × 10−2 2.76 × 10−1 2.01 5.75 × 10−1
    FGPGM+Opt. 1.78 × 10−1 1.06 × 10−1 2.75 × 10−1 5.51 × 10−2

    Table 8. Error of reconstructed solution with x2 being observable (large noise).

    method ‖x1 − x1*‖_{L∞}/‖x1*‖_{L∞} ‖x1 − x1*‖_{L2}/‖x1*‖_{L2} ‖x2 − x2*‖_{L∞}/‖x2*‖_{L∞} ‖x2 − x2*‖_{L2}/‖x2*‖_{L2}
    FGPGM 3.22 × 10−1 1.89 × 10−1 8.80 × 10−2 5.34 × 10−2
    FGPGM + Opt. 6.88 × 10−2 6.68 × 10−2 8.40 × 10−2 7.71 × 10−2

    Table 9. Error of reconstructed parameters with x2 being observable (large noise).

    method |θ1 − θ1*| |θ2 − θ2*| |θ3 − θ3*| |θ4 − θ4*|
    FGPGM 6.40 × 10−1 3.44 × 10−1 1.10 3.63 × 10−1
    FGPGM + Opt. 4.35 × 10−2 4.35 × 10−2 4.72 × 10−1 2.16 × 10−1

    3.2. Spiky dynamics

    This example is a system proposed by FitzHugh [17] and Nagumo et al. [18] for modelling the spike potentials in giant squid neurons, abbreviated as the FHN system. It consists of two ODEs, (3.4) and (3.5), with three parameters.

    \dot{x}_1 = \theta_1 \left( x_1 - \frac{x_1^3}{3} + x_2 \right)  (3.4)

    and

    \dot{x}_2 = -\frac{1}{\theta_1} \left( x_1 - \theta_2 + \theta_3 x_2 \right).  (3.5)

    The FHN system has notoriously fast-changing dynamics due to its highly nonlinear terms. In the following numerical tests, the Matérn 5/2 kernel [19] was used and γ was set to 0.3, the same as in Wenk et al. [14]. We assume one of the two variables is observable; the data were generated with θ* = (θ1, θ2, θ3) = (0.2, 0.2, 3) and corrupted by Gaussian noise with an average signal-to-noise ratio SNR = 100. There were 100 data points uniformly spaced in [0, 10]. The burn-in number and valid sample number in the FGPGM step are Nburnin = 7500 and NMCMC = 10 000, respectively.

    In this case, if we use the FGPGM step alone, the reconstructed solution corresponding to the identified parameters may deviate significantly from the true time series (see figure 6, where data of x1 are observable). It was pointed out in [14] that all GP-based gradient matching algorithms lead to smoother trajectories than the ground truth, and this becomes more severe with sparse observations. The least-squares optimization after FGPGM can substantially reduce this effect and give a better reconstruction of the solution (figure 7).

    Figure 6. Results for the FHN system obtained from the FGPGM method, without further optimization. x1 is observable and x2 is the latent variable. The ground truth, FGPGM result and reconstructed solution (integration of the ODEs with the inferred parameters) are compared. (a) x1, (b) x2.

    Figure 7. The state evolution over time and identified parameters for FHN system. x1 is observable and x2 is latent variable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) θ.

    Figures 7 and 8 present the results with only x1 or only x2 observed, respectively. In both cases, the identified parameters are more accurate than those obtained using FGPGM only. From the sensitivity check in table 10, θ1 is expected to be the most accurately identified because it is the most sensitive of the three parameters, whereas θ2 is the least sensitive and is harder to identify. The numerical results agree with this. It is worth mentioning that in the FGPGM step only 3500 samples were taken, and the time for the optimization step was much less than for the FGPGM step. This means the time needed for the whole process can be greatly reduced compared with Wenk et al. [14], where 100 000 MCMC samples were used.

    Figure 8. The state evolution over time and identified parameters for FHN system. x2 is observable and x1 is latent variable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) θ.

    Table 10. Sensitivity of each variable to parameters for FHN system at θ* = (0.2, 0.2, 3.0). The sensitivity index is defined as equation (3.3).

    Sij x1 x2
    θ1 2.33 1.24
    θ2 0.44 0.31
    θ3 1.01 0.55

    In this example, we also notice that if the least-squares optimization method is used alone, local minima can lead to reconstructions far from the ground truth, making it even less robust than the FGPGM method. For example, if we choose an initial guess of the parameters near (θ1, θ2, θ3) = (1.51, 2.2, 1.78), the cost functional falls into a local minimum during the gradient-based search (figure 9). The existence of many local minima in the full observation case has been pointed out in e.g. [7,20]. These results clearly illustrate the ability of the combined FGPGM and least-squares optimization to avoid local minimum solutions.

    Figure 9. Results of FHN system with x1 being latent, obtained by merely using least-square optimization with initial guess of parameters near a local minimum point. (a) x1, (b) x2.

    3.3. Protein transduction

    Finally, the protein transduction system proposed in Vyshemirsky & Girolami [21] was adopted to illustrate the performance of the method on ODE systems with more equations. The system is described by

    \dot{x}_1 = -\theta_1 x_1 - \theta_2 x_1 x_3 + \theta_3 x_4,  (3.6)
    \dot{x}_2 = \theta_1 x_1,  (3.7)
    \dot{x}_3 = -\theta_2 x_1 x_3 + \theta_3 x_4 + \frac{\theta_5 x_5}{\theta_6 + x_5},  (3.8)
    \dot{x}_4 = \theta_2 x_1 x_3 - \theta_3 x_4 - \theta_4 x_4  (3.9)
    and
    \dot{x}_5 = \theta_4 x_4 - \frac{\theta_5 x_5}{\theta_6 + x_5}.  (3.10)
    We adopted the same experimental set-up as Dondelinger et al. [10] and Wenk et al. [14], as follows. In the FGPGM step, γ = 10^−4. The observations were made at the discrete times [0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100]. The initial condition was [1, 0, 1, 0, 0] and the data were generated by numerically integrating the system under θ* = [0.07, 0.6, 0.05, 0.3, 0.017, 0.3] and adding Gaussian noise with standard deviation 0.01. A sigmoid kernel was used to deal with the logarithmically spaced observation times and the typically spiky form of the dynamics, as in the previous papers.

    Figure 10 gives the result with x3 being unobserved. The situations in which one of the other variables is unknown give better results than the case illustrated here and are therefore not presented. We can see that x3 was not well fitted by the FGPGM step alone (Nburnin = 7500, NMCMC = 10 000), whereas further applying the least-squares optimization generates satisfactory results, with the parameters θ2 and θ4 being significantly improved. The sensitivity check is summarized in table 11, from which we can see that θ2 is less sensitive and thereby harder to infer accurately.

    Figure 10. The state evolution over time and inferred parameters for protein transduction system. x3 is unknown and other variables are observable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) x3, (d) x4, (e) x5, (f) θ.

    Table 11. Sensitivity of each variable to parameters for protein transduction system at θ* = [0.07, 0.6, 0.05, 0.3, 0.017, 0.3]. The sensitivity index is defined as equation (3.3).

    Sij x1 x2 x3 x4 x5
    θ1 2.86 9.78 1.73 1.77 3.33
    θ2 0.70 0.98 0.22 0.59 0.41
    θ3 1.35 2.11 0.47 0.92 0.90
    θ4 0.26 0.43 0.03 2.64 0.62
    θ5 1.53 2.58 24.48 0.90 49.38
    θ6 0.04 0.07 0.60 0.02 1.21

    It was mentioned in Dondelinger et al. [10] that θ5 and θ6 in this system are difficult to fit: θ5/θ6 is relatively sensitive but is still likely to be overestimated. It was pointed out in [14] that the parameters of the protein transduction system are unidentifiable. In the present case with partial observation, θ6 is far from the truth, with the inferred value being around 3.0, and is therefore not presented in the figure. However, the observations can still be acceptably reconstructed.

    It is also of interest to see the performance of our method in cases with more latent variables. In this model, although x2 is not involved in the equations for the other variables, the data of x2 help infer θ1. We also notice that \dot{x}_3 + \dot{x}_4 + \dot{x}_5 = 0, which means that, with known initial conditions, the histories of these three variables can be recovered from any two of them. If x1 and x2 are both missing, it is impossible to identify θ1. Therefore, in the following test we choose the data of x2, x4 and x5 as observations. The data have Gaussian noise with standard deviation 0.01, the same as in the previous case with one latent variable. It can be seen from figure 11 that the result from the FGPGM step is worse than in the case with only one latent variable, but the final reconstruction of the latent variables and the parameter identification after Step 3 are not significantly different from the case with one latent variable.

    Figure 11. The state evolution over time and inferred parameters for protein transduction system. x1 and x3 are unknown and other variables are observable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. (a) x1, (b) x2, (c) x3, (d) x4, (e) x5, (f) θ.

    In order to imitate real situations, we considered the case with abnormal measurements, in which some observations deviate significantly from the ground truth. Moreover, only x1 and x5 are observable; note that further reducing the observable variables would make the system unidentifiable. The results are shown in figure 12. At t = 40 and t = 60, the measurement errors are abnormally large (figure 12a,e). The identified parameters and reconstructed state series do not appear severely affected by the large measurement errors, which indicates some robustness. This also illustrates that the sigmoid kernel is a proper choice for this example. The performance in more realistic situations is worth further investigation, especially for complex dynamical systems.

    Figure 12. The state evolution over time and inferred parameters for protein transduction system. x2, x3, x4 are unknown and other variables are observable. The ground truth, FGPGM result, and result from combination of FGPGM and optimization are compared. Observations at t = 40, 60 are abnormal with significant deviation from ground truth. (a) x1, (b) x2, (c) x3, (d) x4, (e) x5, (f) θ.

    The dependence of the errors of the inferred parameters on the noise level has also been examined. Table 12 gives a comparison among observation noise levels; the measurement errors are Gaussian and independent and identically distributed, and x1 and x5 are observable. On the whole, the identified parameters become less accurate as the observation noise level increases. When the standard deviation increases to 0.02, some parameters show such large errors that the identification should be regarded as failed. We note that small errors in some of the parameters do not necessarily imply a better reconstruction of the solution or prediction; this relates to the sensitivities of the parameters and variables, which need further investigation.

    Table 12. Error of inferred parameters under various noise levels.

    noise level (s.d.) method |θ1 − θ1*| |θ2 − θ2*| |θ3 − θ3*| |θ4 − θ4*| |θ5 − θ5*|
    0.005 FGPGM 0.0127 0.1009 0.0007 0.0312 0.0372
    0.01 FGPGM 0.0055 0.1312 0.0590 0.1316 0.0308
    0.02 FGPGM 0.0193 0.1730 1.1257 1.4630 0.0272
    0.005 FGPGM + Opt. 0.0253 0.0489 0.0327 0.0638 0.0602
    0.01 FGPGM + Opt. 0.0267 0.0702 0.0270 0.0736 0.0704
    0.02 FGPGM + Opt. 0.0121 0.1063 1.0824 1.4883 0.0477

    3.4. Comparisons among treatments for unobservable variables

    We now illustrate the performances of the three ways of inferring the unobservable variables proposed in §2. We take the spiky dynamics of §3.2 as the example and choose x2 as the unobservable variable. Here the observation noise for x1 is increased to a standard deviation of 0.1 and the time resolution is reduced to 50 points. The results of the three methods are compared in figures 13–17. In all cases, 2000 burn-in cycles and 3000 sampling cycles are implemented. In the full integration case, the ode45 solver is applied, while in the partial integration case the Euler formula is adopted. The other computational parameters are kept the same: γ = 1.0 × 10^−3, and the standard deviations that control the sampling steps are 0.001 and 0.12 for x1 and θ, respectively. The initial guess for the unobservable variable is plotted as a dashed line in figure 15b.
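    For concreteness, a sketch of the partial integration treatment is shown below: the latent equation is advanced by forward Euler steps while the observed variable enters as a known, interpolated coefficient. The interpolation scheme, grids and placeholder data are assumptions for illustration; the FHN latent equation (3.5) is used as the example.

```python
import numpy as np

def integrate_latent_euler(t_grid, x1_of_t, x2_0, theta, g):
    """Partial integration: forward-Euler steps of the latent equation x2' = g(x1, x2, theta),
    with the observed x1 supplied as a (smoothed/interpolated) function of time."""
    x2 = np.empty_like(t_grid)
    x2[0] = x2_0
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        x2[k + 1] = x2[k] + dt * g(x1_of_t(t_grid[k]), x2[k], theta)
    return x2

t_obs = np.linspace(0.0, 10.0, 50)
x1_noisy = np.sin(t_obs) + 0.1 * np.random.randn(t_obs.size)       # placeholder observations of x1
x1_of_t = lambda s: np.interp(s, t_obs, x1_noisy)                   # simple linear interpolation
g_fhn = lambda x1, x2, th: -(x1 - th[1] + th[2] * x2) / th[0]       # latent equation (3.5)
t_fine = np.linspace(0.0, 10.0, 500)                                # finer grid for stability
x2_path = integrate_latent_euler(t_fine, x1_of_t, 0.0, [0.2, 0.2, 3.0], g_fhn)
```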

    Figure 13. The state evolution over time and identified parameters for FHN system. x2 is unobservable and inferred by the integration of the whole ODE system. (a) x1, (b) x2, (c) θ.

    Figure 14. The state evolution over time and identified parameters for FHN system. x2 is unobservable and inferred by integrating the equation for x2 with inferred x1 as known coefficient. (a) x1, (b) x2, (c) θ.

    Figure 15. The state evolution over time and identified parameters for FHN system. x2 is unobservable and inferred by sampling under the assumption of Gaussian process. (a) x1, (b) x2, (c) θ.

    Figure 16. The state evolution over time and identified parameters for FHN system. x2 is unobservable and inferred by sampling under the assumption of Gaussian process. The initial guess of x2 is different from that in figure 15b. (a) x1, (b) x2, (c) θ.

    Figure 17. The state evolution over time and identified parameters for FHN system. x2 is unobservable and inferred by sampling under the assumption of Gaussian process. The initial guess of x2 at t = 0 is also far from the ground truth. (a) x1, (b) x2, (c) θ.

    The total CPU times corresponding to the full integration, partial integration and sampling approaches are 207.6 s, 180.7 s and 529.4 s, respectively. In the partial integration case, if denoising and interpolation of the observed variables or a high-order integration scheme are needed to ensure stability and convergence, the computational cost will increase. If most of the variables in a high-dimensional system are unobservable, partial integration may not save significant time compared with full integration. In the present example, the system is so simple that the computational expense of integration is very low compared with sampling and probability calculation; therefore, the CPU time is much higher in the third approach, where the number of sampled states is doubled. However, when the system is complex enough that the computational cost of solving it becomes an issue, the sampling approach with proper initial guesses may show advantages, since its cost depends on the number of time points and sparse approximation works well in Gaussian process methods [8]. It can be seen from the results that the sampling approach yields a better parameter approximation than the integration methods (figure 15c), though its inference of x2 is worse in terms of the curve features. It should be noted that the results of the full sampling method depend on the initial guess: an inappropriate initial distribution may lead to large deviations from the truth with the same computational parameters (figure 16b,c). If the system is large and most of the variables are unobservable, their initial guesses may become an issue. To investigate the case of unknown initial conditions, we propose an initial guess such that x2 at t = 0 deviates substantially from the ground truth (figure 17b). It can be seen that the sampling is still able to find a solution close to the truth for this system. We should point out that the lack of initial conditions may lead to non-uniqueness of the identification for some systems [22].

    According to the above discussion, we suggest that if the system is very small, so that solving the ODEs is cheap, one can adopt the integration approaches, since they are independent of prior information on the unobserved variables. If the integration requires a fine time step and most of the variables in the system are unobservable, full integration is preferred, since denoising and interpolation of the observable variables that appear in the equations for the unobservable ones can be avoided. For large-scale problems with reliable prior information, the full sampling method may have advantages in both accuracy and efficiency.

    3.5. Indirect observation

    In this section, we consider the case in which the observable output y(t) is a function of the original system variable x(t). The observation with noise then becomes Y(t) = y(x(t)) + ε. Take the following system as an example:

    \dot{x}_1 = \theta_1 x_2 \, e^{-\theta_2 x_1}  (3.11)

    and

    \dot{x}_2 = x_1,  (3.12)

    and the measurement equation is

    y = e^{x_1}.  (3.13)

    For such a problem, we take the time derivative of the output function,

    \dot{y} = \theta_1 x_2 \, e^{(1 - \theta_2) x_1},

    and obtain a new three-equation system for {x1, x2, y}, in which the first two variables are unobservable and the last one is observable. The above method can then be applied to the new system. Assume that the observation is taken at 25 uniformly distributed time points within [0, 5], with Gaussian noise of standard deviation 0.2. The true values of the parameters are θ* = (0.5, 1), and the initial values are x(0) = [0.5, 0]. In the Gaussian process gradient matching step, the Matérn 5/2 kernel was used and γ = 10^−3. The numerical results are provided in figure 18. The burn-in number and total sampling number are 2000 and 5000, respectively. In this example, the FGPGM step already achieves a satisfactory inference of the parameters (figure 18d), and the further application of the least-squares optimization loses accuracy at the end of the time interval owing to the sparsity of the data (figure 18c), which leads to deviations in the parameters.
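    A sketch of the augmented system used for this example is given below; the solver, seed and tolerances are illustrative assumptions, and the third component carries the derived observation equation.

```python
import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, state, theta):
    """Augmented system for (x1, x2, y) with y = exp(x1):
    equations (3.11)-(3.12) plus y' = theta1 * x2 * exp((1 - theta2) * x1)."""
    x1, x2, y = state
    th1, th2 = theta
    dx1 = th1 * x2 * np.exp(-th2 * x1)
    dx2 = x1
    dy = th1 * x2 * np.exp((1.0 - th2) * x1)
    return [dx1, dx2, dy]

theta_true = [0.5, 1.0]
x0 = [0.5, 0.0, np.exp(0.5)]                        # y(0) = exp(x1(0))
t_obs = np.linspace(0.0, 5.0, 25)
sol = solve_ivp(augmented_rhs, (0.0, 5.0), x0, t_eval=t_obs, args=(theta_true,), rtol=1e-8)
rng = np.random.default_rng(1)
y_obs = sol.y[2] + rng.normal(0.0, 0.2, t_obs.size)  # only y is observed, with noise s.d. 0.2
```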
    Figure 18. The state evolution over time and identified parameters for example 4. Only y(t)=ex1(t) is observable. (a) x1, (b) x2, (c) y, (d) θ.

    4. Discussion

    In this work, we proposed an effective method for parameter inference of coupled ODE systems with partially observable data. Our method is based on previous work known as FGPGM [14], which avoids product-of-experts heuristics. We provide the treatment of latent variables and a graphical framework for such problems, from which the likelihood formula is derived. In order to improve the accuracy and efficiency of the method, we also use least-squares optimization in our computation. In our numerical tests, we use only 10% of the sampling number suggested in the literature for the FGPGM step, which already provides a good initial guess for the least-squares optimization. Owing to the existence of latent variables, three approaches for treating the unobservable variables are provided and their performances discussed, with suggestions for choosing among them according to the system scale and the reliability of prior information. The present algorithm is robust and accurate under large data noise and partial observations. At the same time, thanks to the Gaussian process-based inference step, the least-squares optimization is more likely to escape local minima and reach the global minimum, which reduces the dependence on the initial guess of the parameters.

    Data accessibility

    Data and relevant code for this research work are stored in GitHub: https://github.com/yuchensufe/PartialObservation-GPGM/tree/v1.0 and have been archived within the Zenodo repository: https://doi.org/10.5281/zenodo.4501573.

    Authors' contributions

    Y.C. and S.X. designed the algorithm. Y.C. conducted numerical simulations. Y.C., J.C., A.G., H.H. and S.X. contributed to the research conceptualization and prepared the manuscript.

    Competing interests

    We declare we have no competing interests.

    Funding

    This work was supported in part by the Kunshan Government Research Fund (grant no. R97030012S 2020DKU0032 (S1310)), the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Fields Institute for Research in Mathematical Sciences.

    Acknowledgements

    The authors also express gratitude to Prof. Xin Gao, Nathan Gold and other members of the Fields CQAM Lab on Health Analytics and Modelling for very beneficial discussions.

    Appendix A. Preliminaries

    In the following, we list some preliminaries on derivatives of a Gaussian process that are used in this work; the proofs can be found in, e.g., [14,19,23]. We denote a random process by X_t, its realization by x and its time derivative by \dot{X}_t.

    Definition A.1. [23]

    A sequence of random variables X_n converges in the first-mean sense (limit in mean) if, for some random variable X,

    \lim_{n \to \infty} E(|X_n - X|) = 0.  (A 1)

    Definition A.2.

    The stochastic process X_t is first-mean differentiable if, for some \dot{X}_t,

    \lim_{\delta t \to 0} E\left| \frac{X_{t+\delta t} - X_t}{\delta t} - \dot{X}_t \right| = 0.  (A 2)

    Definition A.3.

    For a given random variable X, the moment generating function (MGF) is defined by

    \Phi_X(t) = E[\exp(Xt)] = \int \exp(xt)\, \rho(x)\, \mathrm{d}x.  (A 3)

    Proposition A.4.

    If \Phi_X(t) is the MGF of X, then

    1. \left. \mathrm{d}\Phi_X/\mathrm{d}t \right|_{t=0} = m, where m = E[X] is the first moment of X.

    2. Let X and Y be two random variables. X and Y have the same distribution if and only if they have the same MGF.

    3. X \sim N(\mu, \eta^2) if and only if \Phi_X(t) = \exp(\eta^2 t^2/2 + \mu t).

    4. If X and Y are two independent random variables, then \Phi_{X+Y}(t) = \Phi_X(t)\, \Phi_Y(t).

    By the above propositions, one has

    Lemma A.5.

    If X and Y are two independent Gaussian random variables with means μ_X, μ_Y and variances η_X², η_Y², then X + Y is a Gaussian random variable with mean μ_X + μ_Y and variance η_X² + η_Y².

    Definition A.6. [19]

    A real-valued stochastic process \{X_t\}_{t \in T}, where T is an index set, is a Gaussian process if all of its finite-dimensional distributions are multivariate normal. That is, for any choice of distinct values t_1, t_2, \ldots, t_N \in T, the random vector X = (X_{t_1}, \ldots, X_{t_N})^{\mathrm T} has a multivariate normal distribution with joint Gaussian probability density function

    \rho_{X_{t_1} X_{t_2} \cdots X_{t_N}}(x_{t_1}, \ldots, x_{t_N}) = \frac{1}{(2\pi)^{N/2} \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_X)^{\mathrm T} \Sigma^{-1} (x - \mu_X) \right),  (A 4)

    where the mean vector is defined as

    (\mu_X)_i = E[X_{t_i}]  (A 5)

    and the covariance matrix as (\Sigma)_{ij} = \mathrm{cov}(X_{t_i}, X_{t_j}).

    A Gaussian process depends only on its mean and covariance functions. A usual choice of covariance function is the squared exponential \mathrm{cov}(X_{t_i}, X_{t_j}) = k_\phi(t_i, t_j) = \exp\left(-\frac{1}{2l^2} |t_i - t_j|^2\right), where the hyperparameter l represents the interaction length scale.

    Let t_0, \delta t \in \mathbb{R} and let X_t be a Gaussian process with constant mean \mu and kernel function k_\phi(t_1, t_2), assumed to be first-mean differentiable. Then X_{t_0} and X_{t_0+\delta t} are jointly Gaussian distributed,

    \begin{pmatrix} X_{t_0} \\ X_{t_0+\delta t} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu \\ \mu \end{pmatrix}, \Sigma \right),  (A 6)

    with density function

    \rho(x_{t_0}, x_{t_0+\delta t}) = \frac{1}{2\pi \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} \begin{pmatrix} x_{t_0} - \mu \\ x_{t_0+\delta t} - \mu \end{pmatrix}^{\mathrm T} \Sigma^{-1} \begin{pmatrix} x_{t_0} - \mu \\ x_{t_0+\delta t} - \mu \end{pmatrix} \right),  (A 7)

    where

    \Sigma = \begin{pmatrix} k_\phi(t_0, t_0) & k_\phi(t_0, t_0+\delta t) \\ k_\phi(t_0+\delta t, t_0) & k_\phi(t_0+\delta t, t_0+\delta t) \end{pmatrix}.  (A 8)

    If we define the linear transformation

    T = \begin{pmatrix} 1 & 0 \\ -\dfrac{1}{\delta t} & \dfrac{1}{\delta t} \end{pmatrix},  (A 9)

    then we have

    \begin{pmatrix} X_{t_0} \\ \dfrac{X_{t_0+\delta t} - X_{t_0}}{\delta t} \end{pmatrix} = T \begin{pmatrix} X_{t_0} \\ X_{t_0+\delta t} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, T \Sigma T^{\mathrm T} \right),  (A 10)

    i.e.

    \rho\left( X_{t_0}, \frac{X_{t_0+\delta t} - X_{t_0}}{\delta t} \right) = N\left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, T \Sigma T^{\mathrm T} \right),  (A 11)

    where

    T \Sigma T^{\mathrm T} = \begin{pmatrix} k_\phi(t_0, t_0) & \dfrac{k_\phi(t_0, t_0+\delta t) - k_\phi(t_0, t_0)}{\delta t} \\ \dfrac{k_\phi(t_0+\delta t, t_0) - k_\phi(t_0, t_0)}{\delta t} & \dfrac{k_\phi(t_0+\delta t, t_0+\delta t) - k_\phi(t_0, t_0+\delta t) - k_\phi(t_0+\delta t, t_0) + k_\phi(t_0, t_0)}{\delta t^2} \end{pmatrix}.  (A 12)

    The above derivation shows that X_{t_0} and (X_{t_0+\delta t} - X_{t_0})/\delta t are jointly Gaussian distributed. Using the definition of first-mean differentiability and the fact that rth-mean convergence implies convergence in distribution, it is clear that X_{t_0} and \dot{X}_{t_0} are jointly Gaussian,

    \begin{pmatrix} X_{t_0} \\ \dot{X}_{t_0} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} k_\phi(t_0, t_0) & \left. \dfrac{\partial k_\phi(a, b)}{\partial b} \right|_{a=t_0, b=t_0} \\ \left. \dfrac{\partial k_\phi(a, b)}{\partial a} \right|_{a=t_0, b=t_0} & \left. \dfrac{\partial^2 k_\phi(a, b)}{\partial a \, \partial b} \right|_{a=t_0, b=t_0} \end{pmatrix} \right).  (A 13)

    In general, X = (X_{t_1}, \ldots, X_{t_k})^{\mathrm T} and \dot{X} = (\dot{X}_{t_1}, \ldots, \dot{X}_{t_k})^{\mathrm T} are jointly Gaussian,

    \begin{pmatrix} X \\ \dot{X} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} C_\phi(X, X) & C_\phi(X, \dot{X}) \\ C_\phi(\dot{X}, X) & C_\phi(\dot{X}, \dot{X}) \end{pmatrix} \right).  (A 14)

    Here (C_\phi(a, b))_{ij} = \mathrm{cov}(a_i, b_j) is the covariance between a_i and b_j, built from the predefined kernel of the Gaussian process. By linearity of the covariance operator and the predefined kernel function k_\phi(a, b), we have

    C_\phi(X_{t_i}, \dot{X}_{t_j}) = \left. \frac{\partial k_\phi(a, b)}{\partial b} \right|_{a=t_i, b=t_j},  (A 15)

    C_\phi(\dot{X}_{t_i}, X_{t_j}) = \left. \frac{\partial k_\phi(a, b)}{\partial a} \right|_{a=t_i, b=t_j}  (A 16)

    and

    C_\phi(\dot{X}_{t_i}, \dot{X}_{t_j}) = \left. \frac{\partial^2 k_\phi(a, b)}{\partial a \, \partial b} \right|_{a=t_i, b=t_j}.  (A 17)
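    For the squared exponential kernel used in the main text, these cross-covariances have simple closed forms; the short sketch below (an illustration, with an arbitrary length scale) writes them out and checks one of them against a finite difference.

```python
import numpy as np

def rbf(a, b, l):
    """k_phi(a, b) = exp(-(a - b)^2 / (2 l^2))."""
    return np.exp(-0.5 * ((a - b) / l) ** 2)

def rbf_db(a, b, l):
    """dk/db, i.e. C_phi(X_a, Xdot_b) in (A 15)."""
    return ((a - b) / l**2) * rbf(a, b, l)

def rbf_da(a, b, l):
    """dk/da, i.e. C_phi(Xdot_a, X_b) in (A 16)."""
    return (-(a - b) / l**2) * rbf(a, b, l)

def rbf_dadb(a, b, l):
    """d2k/(da db), i.e. C_phi(Xdot_a, Xdot_b) in (A 17)."""
    return (1.0 / l**2 - (a - b)**2 / l**4) * rbf(a, b, l)

# Finite-difference check of (A 15) at one pair of points.
a, b, l, h = 0.3, 1.1, 0.7, 1e-6
assert abs(rbf_db(a, b, l) - (rbf(a, b + h, l) - rbf(a, b - h, l)) / (2 * h)) < 1e-6
```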

    Lemma A.7. (Matrix inversion lemma)

    Let \Sigma be a p \times p matrix (p = n + m),

    \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},  (A 18)

    where the sub-matrices have dimensions n \times n, n \times m, etc. Suppose \Sigma, \Sigma_{11}, \Sigma_{22} are non-singular, and partition the inverse in the same way as \Sigma,

    \Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}.  (A 19)

    Then

    \Lambda_{11} = (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1}, \quad
    \Lambda_{12} = -(\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1}, \quad
    \Lambda_{21} = -(\Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12})^{-1} \Sigma_{21} \Sigma_{11}^{-1} \quad \text{and} \quad
    \Lambda_{22} = (\Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12})^{-1}.  (A 20)

    Lemma A.8. (conditional Gaussian distributions)

    Let X \in \mathbb{R}^D and Y \in \mathbb{R}^M be jointly Gaussian random vectors with distribution

    \begin{pmatrix} X \\ Y \end{pmatrix} \sim N(\mu, \Sigma),  (A 21)

    where

    \mu = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}.  (A 22)

    Then the conditional Gaussian density function is

    \rho_{Y|X}(y | x) = \frac{\rho_{XY}(x, y)}{\rho_X(x)} = \frac{1}{(2\pi)^{M/2} \det(\Sigma_{Y|X})^{1/2}} \exp\left( -\frac{1}{2} (y - \mu_{Y|X})^{\mathrm T} \Sigma_{Y|X}^{-1} (y - \mu_{Y|X}) \right),  (A 23)

    where

    \mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (x - \mu_X)  (A 24)

    and

    \Sigma_{Y|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}.  (A 25)

    According to the above lemma, we have the following conditional distribution.

    Lemma A.9.

    \rho(\dot{x} | x) = N\big(\dot{x} \,\big|\, D(x - \mu_X), A\big),  (A 26)

    where

    D = C_\phi(\dot{X}, X)\, C_\phi(X, X)^{-1}  (A 27)

    and

    A = C_\phi(\dot{X}, \dot{X}) - C_\phi(\dot{X}, X)\, C_\phi(X, X)^{-1}\, C_\phi(X, \dot{X}).  (A 28)

    Appendix B. Proof of theorem 2.1

    Proof.

    The joint density over all variables in figure 1 can be represented as

    \rho(x_M, \dot{x}_M, y, x_L, f_M, \bar{f}_M, \theta | \phi, \sigma, \gamma) = \rho_{\mathrm{GP}}(x_M, \dot{x}_M, y | \phi, \sigma)\, \rho_{\mathrm{ODE}}(f_M, \bar{f}_M, \theta, x_L | x_M, \dot{x}_M, \gamma),  (B 1)

    where

    \rho_{\mathrm{GP}}(x_M, \dot{x}_M, y | \phi, \sigma) = \rho(x_M | \phi)\, \rho(y | x_M, \sigma)\, \rho(\dot{x}_M | x_M, \phi)  (B 2)

    and

    \rho_{\mathrm{ODE}}(f_M, \bar{f}_M, \theta, x_L | x_M, \dot{x}_M, \gamma)
    = \rho(\theta)\, \rho(f_M, \bar{f}_M, x_L | \theta, x_M, \dot{x}_M, \gamma)
    = \rho(\theta)\, \rho(x_L | \theta, x_M)\, \rho(f_M, \bar{f}_M | x_L, \theta, x_M, \dot{x}_M, \gamma)
    = \rho(\theta)\, \delta(\tilde{x}_L(\theta, x_M) - x_L)\, \rho(f_M, \bar{f}_M | x_L, \theta, x_M, \dot{x}_M, \gamma)
    = \rho(\theta)\, \rho(f_M | \theta, x_M, \tilde{x}_L(x_M, \theta))\, \rho(\bar{f}_M | \dot{x}_M, \gamma)\, \delta(f_M - \bar{f}_M)
    = \rho(\theta)\, \delta\big(f_M(\theta, x_M, \tilde{x}_L(x_M, \theta)) - f_M\big)\, N(f_M | \dot{x}_M, \gamma I)
    = \rho(\theta)\, N\big(f_M(\theta, x_M, \tilde{x}_L(x_M, \theta)) \,\big|\, \dot{x}_M, \gamma I\big),  (B 3)

    so that \rho_{\mathrm{ODE}} is independent of f_M, \bar{f}_M and x_L; here \tilde{x}_L is deterministically determined by x_M and \theta through integration. Using lemma A.9, we have

    \rho(x_M, \dot{x}_M, \theta, y | \phi, \sigma, \gamma) = \rho(\theta)\, N(x_M | 0, C_\phi)\, N(y | x_M, \sigma^2 I)\, N(\dot{x}_M | D x_M, A)\, N\big(f_M(\theta, x_M, \tilde{x}_L(x_M, \theta)) \,\big|\, \dot{x}_M, \gamma I\big).  (B 4)

    Integrating \dot{x}_M out yields

    \rho(x_M, \theta, y | \phi, \sigma, \gamma) = \rho(\theta)\, N(x_M | 0, C_\phi)\, N(y | x_M, \sigma^2 I)\, N\big(f_M(\theta, x_M, \tilde{x}_L(x_M, \theta)) \,\big|\, D x_M, A + \gamma I\big).  (B 5)

    Finally, we get

    \rho(x_M, \theta | y, \phi, \sigma, \gamma) \propto \rho(\theta)\, N(x_M | 0, C_\phi)\, N(y | x_M, \sigma^2 I)\, N\big(f_M(\theta, x_M, \tilde{x}_L(x_M, \theta)) \,\big|\, D x_M, A + \gamma I\big).  (B 6)

    Footnotes

    Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
