Bayesian inference for diffusion processes: using higherorder approximations for transition densities
Abstract
Modelling random dynamical systems in continuous time, diffusion processes are a powerful tool in many areas of science. Model parameters can be estimated from timediscretely observed processes using Markov chain Monte Carlo (MCMC) methods that introduce auxiliary data. These methods typically approximate the transition densities of the process numerically, both for calculating the posterior densities and proposing auxiliary data. Here, the Euler–Maruyama scheme is the standard approximation technique. However, the MCMC method is computationally expensive. Using higherorder approximations may accelerate it, but the specific implementation and benefit remain unclear. Hence, we investigate the utilization and usefulness of higherorder approximations in the example of the Milstein scheme. Our study demonstrates that the MCMC methods based on the Milstein approximation yield good estimation results. However, they are computationally more expensive and can be applied to multidimensional processes only with impractical restrictions. Moreover, the combination of the Milstein approximation and the wellknown modified bridge proposal introduces additional numerical challenges.
1. Introduction
Diffusion processes are used in many areas of science as a powerful tool to model continuoustime dynamical systems that are subject to random fluctuations. A diffusion process can be equivalently described by a stochastic differential equation (SDE).
If the SDE yields an analytical solution, the transition densities of the corresponding diffusion process are explicitly known and parameter estimation can be easily performed through a maximum likelihood approach, as demonstrated in [1]. However, in the majority of applications, this is not the case, and the transition densities are intractable.
When the transition densities are unknown, another challenge for parameter estimation is the type of available data. In practice, a process can only be observed at discrete points in time. A comprehensive overview of the methods for parameter inference from highfrequency data (i.e. where interobservation times are small) can be found in [2, ch. 6]. For parameter estimation from lowfrequency observations, Markov chain Monte Carlo (MCMC) techniques have been developed that introduce imputed data points to reduce the time steps between data points. This concept of Bayesian data imputation for the inference of diffusions has been used and developed further by many authors such as [3–6]. These methods are applicable to multidimensional processes and were extended for the case of latent process components as well as for the occurrence of measurement error. Thus, they are very promising for the use in realdata applications (e.g. [2,7]).
The concept of these MCMC algorithms is to construct a Markov chain whose elements are samples from the joint posterior density of the parameter and the imputed data points conditioned on the observations. This construction is achieved via a Gibbs sampling approach by alternately executing the following two steps:
(1)  drawing the parameter conditional on the augmented path that consists of the observed data points and imputed data points and  
(2)  drawing the imputed data points conditional on the current parameter and the observed data points. 
The numerical approximation of the transition densities of the process is necessary not only for calculating the posterior densities, but also for proposing the imputed data points. In both contexts, the Euler–Maruyama scheme is the standard approximation technique in the literature, including all of the aforementioned references. To reduce the amount of imputed data and the number of necessary iterations for the computationally expensive estimation method, one possible solution is to employ higherorder approximation schemes.
Therefore, we investigate the utilization and usefulness of such higherorder approximations on the example of the Milstein scheme. A closed form of the transition density based on the Milstein scheme is derived in [8]. In [9], this closed form is used to estimate the parameters of a hyperbolic diffusion process from highfrequency financial data, but not in the context of Bayesian data augmentation. For the latter, Elerian et al. [3] propose the possible use of the Milstein scheme. However, the specific implementation and benefit of this framework, in particular when using sophisticated proposal methods, remain unclear and, therefore, are the focus of this work. For our investigation, we first explain how to integrate the Milstein scheme into the framework of Bayesian data augmentation and then assess the effectiveness of this new combination in a simulation study which is a common approach in the literature (e.g. [10,11]).
This article is organized as follows. In §2, we define diffusion processes, describe the numerical approximation of their paths, and explain the derivation of the transition densities of the processes based on these approximations. In §3, we elaborate on the parameter estimation methods for diffusion processes using Bayesian data augmentation and the approximated transition densities. In §4, we give some comments about our implementation of these methods and in §5, we explain the setup of our simulation study. In §§6 and 7, we present the results and discussion. The source code of our implementation and the simulation study is publicly available at https://github.com/fuchslab/Inference_for_SDEs_with_the_Milstein_scheme.
2. Approximation of the transition density of a diffusion process
We consider a ddimensional timehomogeneous Itô diffusion process, (X_{t})_{t≥0}, a stochastic process that fulfils the following SDE:
In this work, we use rather simple, wellknown examples of such a diffusion process in order to focus on the investigated estimation methods and make the article easy to follow. Our example for the main text is the geometric Brownian motion (GBM), and in appendix D, we also provide all relevant details for the Cox–Ingersoll–Ross (CIR) process. The GBM is described by the following SDE:
In different contexts, one often considers the logarithm of the GBM, log X_{t}, which is simply a normally distributed random variable for fixed t, with corresponding SDE
2.1. Approximation of the solution of an SDE
Unlike the GBM, most SDEs do not have an analytical solution; thus, their transition densities are not explicitly known. Instead, numerical approximation schemes are used for the solution of the SDEs. Kloeden & Platen [14] have provided a detailed description of these methods. Several of the approximation schemes are based on the stochastic Taylor expansion. For a general treatment of this expansion, we refer the interested reader to [14]. The most commonly used approximation is the Euler(–Maruyama) scheme, which approximates the ddimensional solution (X_{t})_{t≥0} of an SDE by setting Y_{0} = x_{0} and, then, successively calculating the following:
A discretetime approximation Y^{Δ} with maximum step size Δ > 0 converges with strong order γ > 0 at time T to the solution X_{T} of a given SDE if there exists a positive constant C independent of Δ and a Δ_{0} > 0 such that
By adding another term of the stochastic Taylor expansion to equation (2.5), one obtains the Milstein scheme that approximates the ddimensional process (X_{t})_{t≥0} by setting Y_{0} = x_{0} and, then, successively calculating for the ith component
When σ(Y_{k}, θ) is constant in Y_{k}, the last term vanishes and the Milstein scheme reduces to the Euler scheme. If μ(Y_{k}, θ) is once continuously differentiable and σ(Y_{k}, θ) is twice continuously differentiable regarding Y_{k}, then, the Milstein scheme is strongly convergent of order 1.0, which is higher than that of the Euler scheme. An illustration of this difference in the simulation of SDE trajectories is presented e.g. in [15]. However, there is a severe restriction on the practical applicability of the Milstein scheme because the stochastic double integral in the last term of (2.6) only yields an analytical solution for j = l. Although approximation techniques for the double integral exist (e.g. [14]), they are unsuitable for our purposes. On the one hand, we wish to avoid adding yet another layer of approximation and, thus, additional computational time. On the other hand, we must find the distribution of Y_{k+1} based on approximation schemes (2.5) and (2.6), which is also not explicitly possible when adding another approximation. For this reason, we focus on models where the double integral appears exclusively for the same components of the Brownian motion. For example, this is the case when the process is driven by a onedimensional Brownian motion (i.e. the diffusion function σ(Y_{k}, θ) is of dimension d × 1). Hence, the diffusion model includes only one source of noise that may affect each of the components of the process. More generally, we require that
—  if an entry σ_{rj}(Y_{k}, θ) is nonzero, then the entries of all other columns and all rows must not depend on ${Y}_{k}^{(r)}$, and  
—  if an entry σ_{il}(Y_{k}, θ) depends on ${Y}_{k}^{(r)}$, then the entries of all other columns in row r must be zero. 
Assume that the ith component of the diffusion process appears in the ith row of the diffusion function and that the respective entry of the diffusion function does not depend on the remaining components ${Y}_{k}^{(r)}$, r ≠ i (the contrary would impose restrictions on other rows, as described above). Then, the ith component of the approximated process is
Moreover, note that if we consider the approximation ${Y}_{k+1}^{(i)}$ in equation (2.8) as a function $g(\mathrm{\Delta}{B}_{k}^{(j)})$ of the increment of the Brownian motion, g is quadratic in $\mathrm{\Delta}{B}_{k}^{(j)}$. Therefore, the function g has a global extremum with value
Since our example, the GBM, is a onedimensional process, the double integral in equation (2.6) vanishes and the Milstein scheme for the GBM yields the following:
2.2. Transition densities based on approximation schemes
While sampling diffusion paths is fairly straightforward for both approximation schemes as described above, determining the corresponding transition density is less apparent for the Milstein scheme. Since the Euler scheme is a linear transformation of $\mathrm{\Delta}{B}_{k}\sim \mathcal{N}(0,\sqrt{\mathrm{\Delta}{t}_{k}}{I}_{q})$, where I_{q} denotes the mdimensional identity matrix, the transition density derived from the Euler scheme is also a multivariate Gaussian density:
For the Milstein scheme, deriving the transition density is more complicated, even in the case of a onedimensional diffusion process, which we consider here. Elerian [8] derived the transition density by first rearranging the Milstein scheme to obtain a transformation of a noncentral chisquared distributed variable for which the density is known, and then applying the random variable transformation theorem. In appendix A, we present an alternative derivation that directly applies the random variable transformation theorem to ΔB_{k}. Both approaches produce the same result. For simplicity of notation, we set μ_{k} := μ(Y_{k}, θ), σ_{k} : = σ(Y_{k}, θ) and ${\sigma}_{k}^{\prime}:=\mathrm{\partial}\sigma (y,\theta )/\mathrm{\partial}y\hspace{0.17em}{}_{y={Y}_{k}}$. Then, the transition density based on the Milstein approximation for a onedimensional diffusion process is as follows:
For the GBM, we have σ(X_{t}, θ) = σX_{t} with parameter σ > 0, the process taking values in ${\mathbb{R}}_{+}$. Therefore, we obtain a lower bound for the possible values of Y_{k+1}
In figure 2, we illustrate the transition densities based on the GBM solution, Euler scheme and Milstein scheme for two different parameter settings. We observe that the Milstein transition density better approximates the mode of the transition density of the solution than the Euler transition density does. On the other hand, while the support of the Euler transition density is the set of all real numbers, the Milstein transition density puts zero weight on the values of Y_{k+1} that are below the lower bound, even though some of the values are feasible according to the transition density of the solution process.
Other approximation methods for the transition densities were developed for example in [16–18]. Here, we focus on the numerical approximation methods described above. Because for the estimation methods introduced in the next section, it is crucial to not only be able to approximate the transition density, but also sampling from the resulting density needs to be possible and fast.
3. Bayesian data augmentation for the parameter estimation of diffusions
With lowfrequency observations ${X}^{\mathrm{obs}}=({X}_{{\tau}_{0}},\dots ,{X}_{{\tau}_{M}})$ of the process (X_{t})_{t≥0} described by the SDE (2.1), we wish to estimate parameter θ. In this work, we assume that all observations are complete (i.e. there are no latent or unobserved components for all observations) and that there are no measurement errors. The approximation schemes for the solution of the SDE as introduced in §2 are only appropriate for small time steps. Therefore, we introduce additional data points X^{imp} at intermediate time points (as visualized in figure 3 and explained in detail in §3.2) and estimate the parameter θ from the augmented path $\{{X}^{\mathrm{obs}},{X}^{\mathrm{imp}}\}$. To this end, a twostep MCMC approach is used to construct the Markov chain ${\{{\theta}_{(i)},{X}_{(i)}^{\mathrm{imp}}\}}_{i=1,\dots ,L}$, the elements of which are samples from the joint posterior distribution π(θ, X^{imp}  X^{obs}):
Step (1) Parameter update: Draw ${\theta}_{(i)}\sim \pi ({\theta}_{(i)}\hspace{0.17em}\hspace{0.17em}{X}^{\mathrm{obs}},{X}_{(i1)}^{\mathrm{imp}})$,
Step (2) Path update: Draw ${X}_{(i)}^{\mathrm{imp}}\sim \pi ({X}_{(i)}^{\mathrm{imp}}\hspace{0.17em}\hspace{0.17em}{X}^{\mathrm{obs}},{\theta}_{(i)})$.
A general introduction to MCMC methods is presented in [19]. The resulting MCMC chain ${\{{\theta}_{(i)},{X}_{(i)}^{\mathrm{imp}}\}}_{i=l+1,\dots \hspace{0.17em},\hspace{0.17em}L}$, after discarding the first l elements as burnin, can be considered a sample drawn from the joint posterior distribution π(θ, X^{imp}  X^{obs}) and can be used for a fully Bayesian analysis. The two steps of the algorithm are described in detail in the following two subsections. We use π to denote the exact densities of the process that is the (full conditional) posterior densities as well as the transition densities. The meaning becomes clear from the arguments. Approximated densities are indicated by a corresponding superscript.
3.1. Parameter update
In Step (1), a parameter proposal θ* is drawn from a proposal density q(θ*  θ, X^{obs}, X^{imp}) which may or may not depend on the imputed and observed data. If a proposal θ* = θ + u with an update u that is independent of the current parameter value θ is used, the proposal strategy is called a random walk proposal. Proposal θ* is accepted with the following probability:
Due to Bayes’ theorem and the fact that a diffusion process has the Markov property, the (full conditional) posterior density can be represented as
The transition density $\pi ({X}_{{t}_{k+1}}\hspace{0.17em}\hspace{0.17em}{X}_{{t}_{k}},\theta )$ is generally not explicitly known, but it can be approximated by the Euler or Milstein scheme as described in §2.
3.2. Path update
Since a diffusion process has the Markov property, the likelihood function of parameter θ factorizes as
After initializing the imputed data by linear interpolation, the path is updated using the Metropolis–Hastings algorithm. A proposal ${X}_{({\tau}_{i},{\tau}_{i+1})}^{\mathrm{imp}\ast}$ is drawn from a distribution with density q, which may depend on the observed data, current imputed data and parameter θ. It is accepted with the following probability:
The challenging aspect of the path update step involves determining how to propose new points. The simplest approach uses the (approximated) transition density to propose a new point by conditioning only on the point to the left of the new point. We call this proposal method the leftconditioned proposal and illustrate it in figure 4a. The proposal density of an entire path segment is simply the product
This proposal strategy considers the information from the observation ${X}_{{\tau}_{i}}$ on the left, while the proposed path segment is independent of the observation ${X}_{{\tau}_{i+1}}$ on the right. This may lead to a large jump in the last step from ${X}_{{t}_{m1}}$ to ${X}_{{\tau}_{i+1}}$, as can be seen in figure 4c,d, and hence, to an improbable transition. Therefore, the acceptance probability for the leftconditioned proposal ${X}_{({\tau}_{i},{\tau}_{i+1})}^{\mathrm{imp}\ast}$, and consequently, the acceptance rate of the MCMC sampler is usually low.
A number of more sophisticated proposal strategies have been suggested. Ch. 7.1 in [2] reviews some of these. Here, we first consider the modified bridge (MB) proposal, which conditions on both the previous data point and the following observation on the right, as visualized in figure 4b. This strategy was originally proposed by Durham & Gallant [20] and first applied in the Bayesian framework in [21]. More recently, Whitaker et al. [10] suggested improved bridge constructs, and van der Meulen & Schauer [22] proposed the socalled guided proposals.
For the MB proposal, the proposal density of an entire path segment factorizes again as follows:
In [20], it is suggested to approximate the two transition densities on the righthand side by the Euler scheme and to further approximate $\mu ({X}_{{t}_{k+1}}^{\ast},\theta )$ and $\sigma ({X}_{{t}_{k+1}}^{\ast},\theta )$ by $\mu ({X}_{{t}_{k}}^{\ast},\theta )$ and $\sigma ({X}_{{t}_{k}}^{\ast},\theta )$, respectively. This way, they obtain that (3.4) is approximately proportional to a Gaussian density which we will use for the MB proposal based on the Euler scheme
We now consider the Milstein approximation for the two factors on the righthand side of (3.4). The first factor resembles the Milstein transition density stated in §2.2. With the same notation, Δ_{+} = t_{m} − t_{k+1}, and t_{m} = τ_{i+1}, the second factor is as follows:
For the GBM, we have X_{t} > 0 and ${\sigma}_{k+1}^{\ast}=\sigma {X}_{{t}_{k+1}}^{\ast}>0$ and thus, obtain the following bounds for ${\pi}^{\mathrm{Mil}}({X}_{{t}_{m}}{X}_{{t}_{k+1}}^{\ast},\theta )$, the second factor in (3.4):
Since the MB proposal takes into account information not only from the left data point but also from the observation on the right, it does not have a large jump in the last step as the leftconditioned proposal does. This is also apparent in the simulations in figure 4e,f. Therefore, the acceptance probability and acceptance rate are usually higher for the MB proposal than for the leftconditioned proposal. As appendix B demonstrates, the acceptance probability is even equal to 1 for the MB proposal if only one data point is imputed between two observations (i.e. the number of interobservation intervals is m = 2). This holds when using the Milstein scheme to approximate the transition density for the likelihood function and proposal density, but also when using the Euler scheme without the approximation of μ_{k+1} and σ_{k+1} by μ_{k} and σ_{k}, respectively.
The density of the MB proposal based on the Euler scheme in equation (3.5) can also be interpreted as the density that results from applying the Euler scheme to the following diffusion process:
Thus far, our path update has only been applied to imputed points between two observations. It can easily be extended to a case with several observations along the path by simply decomposing the path into independent path proposals, multiplying the respective acceptance probabilities and collectively accepting or rejecting the proposals. Moreover, the entire path does not have to be updated all at once, but can be divided into several path segments that are successively updated. Different algorithms for choosing the update interval are summarized in [2] and appendix C describes one of them.
Another challenge in the context of Bayesian data augmentation and the MCMC scheme discussed above is the dependence between the parameter components included in the diffusion function and the missing path segments between two observations. Roberts & Stramer [5] were the first to highlight that, in the discretized setting (as we consider it here), the dependence leads to a slower convergence of the MCMC algorithm as the number of imputed points m increases. All estimation methods compared here are affected by this issue in the same way; we hence do not further consider it here.
We have introduced a number of possible options for the choices to be made when constructing an estimation method in the framework as described so far:
—  approximate the transition densities in the likelihood function based on the Euler or Milstein scheme,  
—  use the leftconditioned, the MB or the DBM proposal, and  
—  use the Euler or Milstein scheme for the proposal densities (for the leftconditioned or MB proposal). 
(MBEE) MB proposal and transition density both based on the Euler scheme,
(MBEM) MB proposal based on the Euler scheme and transition density based on the Milstein scheme,
(MBMM) MB proposal and transition density both based on the Milstein scheme, and
(DBMM) DBM proposal (which is based on the Milstein scheme) and transition density based on the Milstein scheme.
To our knowledge, we are the first to use the Milstein scheme in the MCMC context described above.
4. Implementation
The implementation is relatively straightforward for the majority of the estimation procedures, and only the combination of the MB proposal and the Milstein approximation requires additional explanation. As mentioned, when approximating the two factors on the righthand side of (3.4) by the transition density based on the Milstein scheme, there is no closed formula for the normalization constant to obtain a proper density. The normalization is necessary because the proposal density for a path segment is the product of several of the terms from (3.4), where the condition on the left point, ${X}_{{t}_{k}}^{\ast}$, differs between a newly proposed segment and the last accepted segment if several consecutive points are imputed. Therefore, the normalization constants differ and do not cancel out in the acceptance probability. Normalization is not necessary only in the case where just one point is imputed between two observations (i.e. m = 2 subintervals) because the left point, ${X}_{{t}_{k}}$, is always a (fixed) observed point that is not updated. Thus, the normalization constants cancel out in the acceptance probability. For m > 2, we numerically integrate the product (3.4) over ${X}_{{t}_{k+1}}$ to obtain the normalization constant. The product in (3.4) may be very small (but not zero everywhere in a nonempty feasible set $\mathcal{D}$) and may thus numerically integrate to zero, especially when the upper interval bound of the feasible set is infinite. To overcome this problem, we take two measures. First, we do not integrate over the entire set of feasible points but determine the maximum of the product numerically and then integrate over the interval that includes all points with a function value of at least 10^{−20} times this maximum. Second, we rescale the product in (3.4) by dividing by the maximum before integrating.
To sample from the Milstein MB proposal density, we employ rejection sampling. For this, normalization of the product in (3.4) is not necessary. Again, we numerically determine the maximum d_{max} of the product, and the interval $\mathcal{I}$ that includes all points with a function value of at least 10^{−20} times this maximum. Then, we uniformly sample (u_{1}, u_{2}) from rectangle $\mathcal{I}\times (0,{d}_{\mathrm{max}})$ and accept u_{1} as a proposal ${X}_{{t}_{k+1}}^{\ast}$ if the unnormalized density value of (3.4) at u_{1} is at most u_{2}.
For the combination of the MB proposal and the Milstein approximation, the set of feasible proposal points may be empty. In this case, our implementation shifts to the Euler approximation for this point, i.e. the point is proposed with the MB proposal based on the Euler scheme and also the corresponding factor of the proposal density in the acceptance probability is based on the Euler scheme. In addition, for all methods, a negative point may be proposed, which is not feasible for a GBM. Therefore, in this case, we propose a new point. For both cases, we count the number of times that they occur during the estimation procedure. In the following simulation study, no cases of switching to the Euler scheme occurred and negative proposals occurred only very rarely (less than 1‰ of the number of iterations in the very worst case).
We implemented the described estimation procedures in R v. 3.6.2 [23]. The source code of our implementation and the following simulation study is publicly available at https://github.com/fuchslab/Inference_for_SDEs_with_the_Milstein_scheme.
5. Simulation study
In this section, we study the computational performance of the competing inference methods on the (relatively simple) benchmark model GBM. As a second (on the application side more often studied) benchmark model, the CIR process is investigated in appendix D. In this work, we focus on Bayesian inference by data augmentation and compare the four approaches listed at the end of §3.2. Conceptually different inference procedures, as summarized e.g. in [2], are not considered as competitors here as they would be employed in different data contexts. There are two aspects that are important to consider when we want to evaluate the different methods:
(a)  the accuracy with which the true posterior distribution is approximated based on one of the approximation schemes and a given number m and  
(b)  the accuracy with which we are able to draw from this approximated posterior distribution. 
For the simulation study, we generated 100 paths of the GBM in the time interval [0, 1] using the solution (2.3) with the parameter combination θ = (α, σ^{2})^{T} = (1, 2)^{T} and initial value x_{0} = 100. Figure 5 illustrates some of these paths. From each path, we took M = 20 equidistant points (i.e. the interobservation time Δt was 0.05) and applied each of the four described estimation methods once. We imputed data such that we got m = 2 and m = 5 interobservation intervals. We also included the case m = 1, i.e. no data were imputed and only Step (1) from §3, the parameter update, was repeated in the estimation procedure where the likelihood of the path in the acceptance probability is approximated by the Euler or the Milstein scheme. For the prior distribution of the parameters, we assumed that they were independently distributed with $\alpha \sim \mathcal{N}(0,10)$ and σ^{2} ∼ IG(κ_{0} = 2, ν_{0} = 2), where IG denotes the inverse gamma distribution with shape parameter κ_{0} and scale parameter ν_{0}. The a priori expectations of the parameters are thus $\mathbb{E}(\alpha )=0$ and $\mathbb{E}({\sigma}^{2})=2$.
Each of the estimation procedures performs the following steps:
(1)  Draw initial values for the parameters α and σ^{2} from the prior distributions.  
(2)  Initialize Y^{imp} by linear interpolation.  
(3)  Repeat the following steps:

Figures 6 and 7 present the output from one estimation procedure on the example of the combination MBMM of the MB proposal and the Milstein approximation for the proposal density and the likelihood function. From each estimation procedure, we obtained an MCMC chain of dimension n(m − 1) + 2. For each chain, we used the two components for parameters α and σ^{2} and calculated the mean, the median, and the variance after cutting off a burnin phase of 5000 iterations. To justify our use of independent proposals for the parameter update, we show in appendix E that the parameters are not strongly correlated.
As a benchmark, we also sampled from the true parameter posterior distribution based on the solution of the GBM. We used the Stan software [24,25] which provides an efficient C++ implementation of Hamiltonian Monte Carlo (HMC) sampling with the NoUturn sampler to sample from the true parameter posterior distribution. For each posterior distribution corresponding to one of the 100 sample paths, we generated four HMC chains with 500 000 iterations each. The first half of the chains was discarded as warmup and the remaining draws were combined to give a sample of size 10^{6}. We calculated the multivariate effective sample size (ESS) as defined in [26] which provides the size of an independent and identically distributed sample equivalent to our samples in terms of variance and found that the ESS of the obtained samples from the true posterior distribution is well over 500 000. For each of these samples, we also calculated the mean, the median and the variance.
The estimation procedures and time measurements were performed on a cluster of machines with the following specifications: AMD Opteron™ Processor 6376 (1.40 GHz), 512GB DDR3RAM.
6. Results
Figures 8 and 9 and tables 1 and 2 summarize the results of running each of the methods once for 1 h for each of the 100 GBM trajectories. Figures 8 and 9 show the density plots of the difference between the respective statistic (mean, median or variance) calculated for a sample from the approximated posterior distribution obtained by the respective method and the statistic for a sample from the true posterior distribution of the same sample path. Each density plot aggregates 100 such difference values, one for each of the 100 GBM trajectories. Table 1 tabulates the root mean square error (RMSE) based on these differences for each of the considered methods, discretization levels m and statistics. We use the RMSE as the measure of the overall accuracy. The lower the RMSE is, the higher the accuracy of the respective method. Table 2 empirically evaluates the computational efficiency of the considered methods, including the number of iterations completed after 1 h, the multivariate ESS based on the obtained sample after discarding a burnin phase of 5000 iterations, and the acceptance rates of the parameter and the path proposals. Each of these quantities is averaged over the 100 GBM trajectories and the coefficient of variation is also stated.
method  RMSEs for α  RMSEs for σ^{2}  

mean  median  variance  mean  median  variance  
m = 1  Euler  0.282  0.244  0.456  0.638  0.600  0.471 
Milstein  0.851  0.780  1.158  0.282  0.265  0.176  
m = 2  MBEE  0.266  0.238  0.526  0.211  0.198  0.141 
MBEM  0.311  0.302  0.476  0.109  0.106  0.057  
MBMM  0.315  0.305  0.470  0.112  0.107  0.057  
DBMM  0.318  0.308  0.485  0.101  0.099  0.044  
m = 5  MBEE  0.277  0.254  0.524  0.113  0.098  0.127 
MBEM  0.288  0.274  0.474  0.031  0.031  0.050  
MBMM  0.292  0.278  0.492  0.040  0.037  0.058  
DBMM  0.291  0.275  0.472  0.031  0.030  0.037 
method  number of iterations after 1 h  multivariate effective sample size  acceptance rate of the parameters  acceptance rate of the path  

mean  c.v.  mean  c.v.  mean  c.v.  mean  c.v.  
m = 1  Euler  25 134 301  0.03  1 273 744  0.16  0.518  0.02  —  — 
Milstein  454 863  0.03  146 362  0.41  0.425  0.14  —  —  
m = 2  MBEE  8 583 614  0.03  170 827  0.19  0.442  0.01  0.842  0.04 
MBEM  1 816 144  0.03  24090  0.38  0.417  0.03  0.799  0.05  
MBMM  300 870  0.03  6881  0.21  0.417  0.03  1.000  0.00  
DBMM  754 024  0.10  28 089  0.31  0.417  0.03  0.839  0.04  
m = 5  MBEE  6 765 054  0.10  49 885  0.18  0.310  0.01  0.892  0.02 
MBEM  892 487  0.02  5033  0.24  0.304  0.01  0.844  0.03  
MBMM  78 215  0.04  573  0.20  0.304  0.01  0.978  0.01  
DBMM  879 227  0.03  5535  0.21  0.304  0.01  0.884  0.02 
For the drift parameter α of the GBM, the four considered schemes perform comparably for m = 2 and m = 5. In particular, the use of the Milstein approximation does not improve the accuracy of the posterior mean and median for the same discretization level m. The accuracy of the posterior variance is slightly improved by the use of the Milstein approximation when data are imputed. Moreover, for MBEE, the accuracy does not consistently improve as m is increased. Whereas, the accuracy for the methods including the Milstein scheme improves considerably when imputed data are introduced (i.e. m > 1) and it improves slightly when m is increased from 2 to 5.
For the diffusion parameter σ^{2} of the GBM, we clearly see an improvement in overall accuracy for the methods involving the Milstein scheme. Combination DBMM turns out to be the most accurate, closely followed by MBEM in case of the mean and median.
According to table 2, the number of iterations completed within 1 h varies substantially among the different estimation procedures. It is always higher for the procedures that use the Euler approximation, while especially combination MBMM is very timeconsuming and thus completes fewer iterations. Similarly, the multivariate ESS varies substantially among the different estimation procedures. It is higher for m = 2 than for m = 5 for each of the considered estimation procedures. The acceptance rate of the parameters is slightly lower when the Milstein scheme is used for the approximation of the likelihood function. In addition, the acceptance rate of the parameters decreases as the number of imputed points increases. The acceptance rate of the path is highest for combination MBMM. For MBEE, it would be just as high if one would not substitute μ_{k+1} and σ_{k+1} by μ_{k} and σ_{k}. For MBEE, MBEM and DBMM, the acceptance rate of the path increases as the number of imputed points increases.
7. Summary and discussion
We have demonstrated how to implement an algorithm for the parameter estimation of SDEs from lowfrequency data using the Milstein scheme to approximate the transition density of the underlying process. Our motivation was to improve numerical accuracy and thus reduce the amount of imputed data and computational overhead. However, our findings are rather discouraging: we found that this method can be applied to multidimensional processes only with impractical restrictions. Moreover, we showed that the combination of the MB proposal with the Milstein scheme for the proposal density may lead to an empty set of possible proposal points, which would require switching to the Euler scheme in order to proceed. One of the strengths of the original (Eulerbased) MCMC scheme is its generic character and applicability. Through this, it possesses a practical advantage over otherwise more sophisticated methods such as the Exact Algorithm [27]. This strength does not translate to the Milsteinbased MCMC scheme due to the limited applicability of the Milstein approximation especially in the multidimensional setting. Thus, methods like the Exact Algorithm may be a reasonable alternative. The limited applicability of the Milstein approximation would also persist for advanced forms of the discussed MCMC scheme like the innovation scheme in [6] or for even more generic algorithms like particle MCMC as studied in [28].
In our simulation study, we found that the overall accuracy for the estimates for the drift parameter of the GBM does not necessarily improve when the Milstein scheme is used. Fewer iterations are completed for the methods involving the Milstein scheme and also the ESS is substantially lower. Thus, the poor sampling efficiency might outweigh the (potential) increase in accuracy of the approximation of the posterior distribution. Especially, the combination MBMM results in a particularly low number of iterations and a low ESS. Owing to the already quite low ESS achieved by the Milsteinbased methods for m = 5 subintervals between two observations, we did not consider higher discretization levels. Moreover, note that tuning the variance hyperparameters for the random walk proposals of the parameters in Step (3) in the simulation study to reach an optimal acceptance rate might lead to a higher ESS. However, since the acceptance rates achieved in the simulation study lie in a range where the sampling efficiency is rather robust to changes in the acceptance rate as shown in [29] (in the highdimensional limit), we do not expect the change in the ESS after tuning to be substantial.
For the estimates for the GBM diffusion parameter, the overall accuracy is increased by the use of the Milstein scheme. DBMM turns out to be the most effective combination in terms of overall accuracy.
We conducted another simulation study on the example of the CIR process, as shown in appendix D, and the results are very similar as for the GBM. The use of the Milstein approximation does not consistently improve the overall accuracy for the drift parameter; however, it does improve the accuracy for the diffusion parameter. Again combination DBMM achieves the highest accuracy, closely followed by MBEM.
It was expected that the use of the Milstein scheme would make a difference for the estimates for the diffusion parameters because the additional term added by the Milstein scheme compared to the Euler scheme involves the diffusion function and its derivative. Nevertheless, the general applicability of the Euler scheme remains a great advantage and the search for different proposal schemes such as in [10,22] rather than for different numerical discretization schemes may be a more promising way towards more efficient estimation algorithms for diffusion processes.
Data accessibility
The source code of our implementation and the simulation study is publicly available at https://github.com/fuchslab/Inference_for_SDEs_with_the_Milstein_scheme.
Authors' contributions
C.F. devised the project and provided supervision. S.P. implemented the described algorithms, carried out the simulation study and drafted the manuscript. Both authors contributed to the final version of the manuscript, gave final approval for publication and agree to be held accountable for the work performed therein.
Competing interests
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Funding
Our research was supported by the German Research Foundation within the SFB 1243, Subproject A17, by the Federal Ministry of Education and Research under grant no. 01DH17024, and by the Helmholtz pilot project ‘Uncertainty Quantification’.
Acknowledgements
The authors wish to thank three anonymous reviewers for very valuable suggestions that helped to significantly improve this article.
Appendix A. Derivation of the transition density based on the Milstein scheme
The Milstein scheme
Appendix B. Derivation of the acceptance probability for the MB proposal for m = 2 interobservation intervals
As stated in §3.2, the acceptance probability for the path update between two consecutive observations ${X}_{{\tau}_{i}}$ and ${X}_{{\tau}_{i+1}}$ with the MB proposal is
Appendix C. Choice of path update interval
For choosing the update interval, we use the random block size algorithm as suggested in [3]. Assuming that the augmented path contains a total of n + 1 data points Y_{0}, …, Y_{n}, it is divided into update segments ${Y}_{({c}_{0},{c}_{1})},{Y}_{({c}_{1},{c}_{2})},\dots $ by the following algorithm:
(1)  Set c_{0} = 0 and j = 1.  
(2)  While c_{j−1} < n:

Such a random choice of the path update interval is a simple way to vary the set of points that are updated together within one iteration.
Appendix D. Additional example: Cox–Ingersoll–Ross process
The onedimensional CIR process fulfils the SDE
For the CIR process, we have $\sigma ({X}_{t},\theta )=\sigma \sqrt{{X}_{t}}$ with parameter σ > 0, the process taking values in ${\mathbb{R}}_{0}$. We therefore obtain a lower bound for the possible values of ${X}_{{t}_{k+1}}$ when applying the Milstein scheme
For the simulation study, we generated 100 paths of the CIR process in the time interval [0, 1] with the parameter combination θ = (α, β, σ^{2})^{T} = (1, 1, 2)^{T} and initial value x_{0} = 10. From each path, we took 20 equidistant points and ran each of the described estimation methods once for 1 h to perform inference for the parameters β and σ^{2}, assuming α to be known. For the prior distribution of the parameters, we assumed that they were independently distributed with β ∼ IG(κ_{b} = 3, ν_{b} = 3) and σ^{2} ∼ IG(κ_{s} = 3, ν_{s} = 4). The a priori expectations of the parameters are thus $\mathbb{E}(\beta )=\frac{3}{2}$ and $\mathbb{E}({\sigma}^{2})=2$. For each estimation procedure, the steps as outlined in §5 were taken. As proposal densities for the parameters in Step (3), we used ${\beta}^{\ast}\sim \mathcal{L}\mathcal{N}(\mathrm{log}{\beta}_{i1},0.25)$ and ${\sigma}^{2\ast}\sim \mathcal{L}\mathcal{N}(\mathrm{log}{\sigma}_{i1}^{2},0.25)$.
The sampling results are summarized in figures 10 and 11 and tables 3 and 4. Similar to the results for the GBM, the use of the Milstein approximation does not consistently improve the overall accuracy for the drift parameter β. The accuracy increases for increasing m for most of the methods. Only combination MBMM has lower accuracy for m = 5 due to the low sampling efficiency and the resulting low ESS. For the diffusion parameter σ^{2}, the use of the Milstein approximation and increasing m both improve the overall accuracy. Again combination DBMM achieves the highest accuracy, closely followed by MBEM.
method  RMSEs for β  RMSEs for σ^{2}  

mean  median  variance  mean  median  variance  
m = 1  Euler  0.0179  0.0115  0.0478  0.1603  0.1530  0.0673 
Milstein  0.0174  0.0110  0.0587  0.1306  0.1233  0.0595  
m = 2  MBEE  0.0099  0.0064  0.0265  0.0910  0.0865  0.0417 
MBEM  0.0105  0.0063  0.0413  0.0656  0.0619  0.0309  
MBMM  0.0151  0.0120  0.0462  0.0658  0.0625  0.0325  
DBMM  0.0097  0.0061  0.0330  0.0653  0.0617  0.0308  
m = 5  MBEE  0.0052  0.0036  0.0144  0.0400  0.0380  0.0194 
MBEM  0.0077  0.0049  0.0375  0.0271  0.0259  0.0156  
MBMM  0.0307  0.0204  0.1103  0.0509  0.0420  0.0615  
DBMM  0.0085  0.0052  0.0321  0.0270  0.0256  0.0156 
method  number of iterations after 1 h  multivariate effective sample size  acceptance rate of the parameters  acceptance rate of the path  

mean  c.v.  mean  c.v.  mean  c.v.  mean  c.v.  
m = 1  Euler  23 461 023  0.11  2 422 521  0.14  0.443  0.03  —  — 
Milstein  4 685 450  0.03  480 549  0.08  0.442  0.03  —  —  
m = 2  MBEE  8 482 241  0.06  422 034  0.10  0.384  0.03  0.964  0.01 
MBEM  1 944 229  0.05  94 071  0.10  0.383  0.03  0.957  0.01  
MBMM  186 588  0.06  9429  0.13  0.383  0.03  1.000  0.00  
DBMM  1 905 354  0.04  95 262  0.10  0.383  0.03  0.968  0.01  
m = 5  MBEE  6 851 197  0.05  114 344  0.10  0.272  0.03  0.976  0.01 
MBEM  966 579  0.04  15 599  0.13  0.272  0.03  0.965  0.01  
MBMM  37 648  0.12  574  0.25  0.272  0.03  0.993  0.00  
DBMM  906 791  0.08  14 881  0.14  0.272  0.03  0.975  0.01 
Also for the CIR process, the number of iterations completed after 1 h and the multivariate ESS of the obtained sample vary substantially between the different procedures. Both quantities are highest for combination MBEE, they are similar for MBEM and DBMM, and particularly low for MBMM.
Footnotes
References
 1.
DacunhaCastelle D, FlorensZmirou D . 1986Estimation of the coefficients of a diffusion from discrete observations. Stochastics 19, 263284. (doi:10.1080/17442508608833428) Crossref, Google Scholar  2.
 3.
Elerian O, Chib S, Shephard N . 2001Likelihood inference for discretely observed nonlinear diffusions. Econometrica 69, 959993. (doi:10.1111/14680262.00226) Crossref, ISI, Google Scholar  4.
Eraker B . 2001MCMC analysis of diffusion models with application to finance. J. Bus. Econ. Stat. 19, 177191. (doi:10.1198/073500101316970403) Crossref, ISI, Google Scholar  5.
Roberts GO, Stramer O . 2001On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika 88, 603621. (doi:10.1093/biomet/88.3.603) Crossref, ISI, Google Scholar  6.
Golightly A, Wilkinson DJ . 2008Bayesian inference for nonlinear multivariate diffusion models observed with error. Comput. Stat. Data Anal. 52, 16741693. (doi:10.1016/j.csda.2007.05.019) Crossref, ISI, Google Scholar  7.
Golightly A, Wilkinson DJ . 2006Bayesian sequential inference for stochastic kinetic biochemical network models. J. Comput. Biol. 13, 838851. (doi:10.1089/cmb.2006.13.838) Crossref, PubMed, ISI, Google Scholar  8.
Elerian O . 1998A note on the existence of a closed form conditional transition density for the Milstein scheme. Working paper. Nuffield College, University of Oxford, Oxford, UK. Google Scholar  9.
Tse YK, Zhang X, Yu J . 2004Estimation of hyperbolic diffusion using the Markov chain Monte Carlo method. Quant. Finance 4, 158169. (doi:10.1080/14697680400000020) Crossref, ISI, Google Scholar  10.
Whitaker GA, Golightly A, Boys RJ, Sherlock C . 2017Improved bridge constructs for stochastic differential equations. Stat. Comput. 27, 885900. (doi:10.1007/s1122201696603) Crossref, ISI, Google Scholar  11.
Mrázek M, Pospíšil J . 2017Calibration and simulation of Heston model. Open Math. 15, 679704. (doi:10.1515/math20170058) Crossref, ISI, Google Scholar  12.
Øksendal BK . 2003Stochastic differential equations: an introduction with applications, 6th edn. Berlin, Germany: Springer. Crossref, Google Scholar  13.
Iacus S . 2008Simulation and inference for stochastic differential equations. New York, NY: Springer. Crossref, Google Scholar  14.
Kloeden PE, Platen E . 1992Numerical solution of stochastic differential equations. Berlin, Germany: Springer. Crossref, Google Scholar  15.
Bayram M, Partal T, Buyukoz GO . 2018Numerical methods for simulation of stochastic differential equations. Adv. Differ. Equ. 2018, 17. (doi:10.1186/s1366201814665) Crossref, ISI, Google Scholar  16.
AïtSahalia Y . 2002Maximum likelihood estimation of discretely sampled diffusions: a closedform approximation approach. Econometrica 70, 223262. (doi:10.1111/14680262.00274) Crossref, ISI, Google Scholar  17.
AïtSahalia Y . 2008Closedform likelihood expansions for multivariate diffusions. Ann. Statist. 36, 906937. (doi:10.1214/009053607000000622) Crossref, ISI, Google Scholar  18.
Filipović D, Mayerhofer E, Schneider P . 2013Density approximations for multivariate affine jumpdiffusion processes. J. Econ. 176, 93111. (doi:10.1016/j.jeconom.2012.12.003) Crossref, ISI, Google Scholar  19.
Gilks W, Richardson S, Spiegelhalter D . 1996Markov chain Monte Carlo in practice. London, UK: Chapman & Hall. Google Scholar  20.
Durham GB, Gallant AR . 2002Numerical techniques for maximum likelihood estimation of continuoustime diffusion processes. J. Bus. Econ. Stat. 20, 297316. (doi:10.1198/073500102288618397) Crossref, ISI, Google Scholar  21.
Chib S, Shephard N . 2002[Numerical techniques for maximum likelihood estimation of continuoustime diffusion processes]: Comment. J. Bus. Econ. Stat. 20, 325327. Google Scholar  22.
van der Meulen F, Schauer M . 2017Bayesian estimation of discretely observed multidimensional diffusion processes using guided proposals. Electron. J. Stat. 11, 23582396. (doi:10.1214/17EJS1290) Crossref, ISI, Google Scholar  23. R Core Team. 2019R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.Rproject.org/. Google Scholar
 24.
Carpenter B et al. 2017Stan: a probabilistic programming language. J. Stat. Softw., Articles 76, 132. Crossref, ISI, Google Scholar  25. Stan Development Team. RStan: the R interface to Stan; 2019. R package version 2.19.1. http://mcstan.org/. Google Scholar
 26.
Vats D, Flegal JM, Jones GL . 2019Multivariate output analysis for Markov chain Monte Carlo. Biometrika 106, 321337. (doi:10.1093/biomet/asz002) Crossref, ISI, Google Scholar  27.
Beskos A, Papaspiliopoulos O, Roberts GO . 2008A factorisation of diffusion measure and finite sample path constructions. Methodol. Comput. Appl. Probab. 10, 85104. (doi:10.1007/s1100900790604) Crossref, ISI, Google Scholar  28.
Golightly A, Wilkinson DJ . 2011Bayesian parameter inference for stochastic biochemical network models using particle Markov chain monte carlo. Interface Focus 1, 807820. (doi:10.1098/rsfs.2011.0047) Link, ISI, Google Scholar  29.
Roberts GO, Rosenthal JS . 2001Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 16, 351367. (doi:10.1214/ss/1015346320) Crossref, ISI, Google Scholar  30.
Schmidt KD . 2009Maßund Wahrscheinlichkeit [Measure and Probability]. Berlin, Germany: Springer. Google Scholar  31.
Gillespie DT . 1992Markov processes: an introduction for physical scientists. Boston, MA: Academic Press. Google Scholar