Open AccessResearch articles

# Bayesian inference for diffusion processes: using higher-order approximations for transition densities

Susanne Pieschner

Susanne Pieschner

Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany

Department of Mathematics, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany

Find this author on PubMed

and
Christiane Fuchs

Christiane Fuchs

Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany

Department of Mathematics, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany

Data Science Group, Faculty of Business Administration and Economics, Universität Bielefeld, Postfach 100131, 33501 Bielefeld, Germany

[email protected]

Find this author on PubMed

Published:https://doi.org/10.1098/rsos.200270

## Abstract

Modelling random dynamical systems in continuous time, diffusion processes are a powerful tool in many areas of science. Model parameters can be estimated from time-discretely observed processes using Markov chain Monte Carlo (MCMC) methods that introduce auxiliary data. These methods typically approximate the transition densities of the process numerically, both for calculating the posterior densities and proposing auxiliary data. Here, the Euler–Maruyama scheme is the standard approximation technique. However, the MCMC method is computationally expensive. Using higher-order approximations may accelerate it, but the specific implementation and benefit remain unclear. Hence, we investigate the utilization and usefulness of higher-order approximations in the example of the Milstein scheme. Our study demonstrates that the MCMC methods based on the Milstein approximation yield good estimation results. However, they are computationally more expensive and can be applied to multidimensional processes only with impractical restrictions. Moreover, the combination of the Milstein approximation and the well-known modified bridge proposal introduces additional numerical challenges.

### 1. Introduction

Diffusion processes are used in many areas of science as a powerful tool to model continuous-time dynamical systems that are subject to random fluctuations. A diffusion process can be equivalently described by a stochastic differential equation (SDE).

If the SDE yields an analytical solution, the transition densities of the corresponding diffusion process are explicitly known and parameter estimation can be easily performed through a maximum likelihood approach, as demonstrated in [1]. However, in the majority of applications, this is not the case, and the transition densities are intractable.

When the transition densities are unknown, another challenge for parameter estimation is the type of available data. In practice, a process can only be observed at discrete points in time. A comprehensive overview of the methods for parameter inference from high-frequency data (i.e. where inter-observation times are small) can be found in [2, ch. 6]. For parameter estimation from low-frequency observations, Markov chain Monte Carlo (MCMC) techniques have been developed that introduce imputed data points to reduce the time steps between data points. This concept of Bayesian data imputation for the inference of diffusions has been used and developed further by many authors such as [36]. These methods are applicable to multidimensional processes and were extended for the case of latent process components as well as for the occurrence of measurement error. Thus, they are very promising for the use in real-data applications (e.g. [2,7]).

The concept of these MCMC algorithms is to construct a Markov chain whose elements are samples from the joint posterior density of the parameter and the imputed data points conditioned on the observations. This construction is achieved via a Gibbs sampling approach by alternately executing the following two steps:

 (1) drawing the parameter conditional on the augmented path that consists of the observed data points and imputed data points and (2) drawing the imputed data points conditional on the current parameter and the observed data points.
In both steps, direct sampling from the corresponding conditional distribution is generally not possible; therefore, a Metropolis–Hastings algorithm is applied. The (full conditional) posterior densities are reformulated as the product of the transition densities of the process in both steps and the prior density of the parameter in the first step. Because the transition densities are intractable, they can only be numerically approximated.

The numerical approximation of the transition densities of the process is necessary not only for calculating the posterior densities, but also for proposing the imputed data points. In both contexts, the Euler–Maruyama scheme is the standard approximation technique in the literature, including all of the aforementioned references. To reduce the amount of imputed data and the number of necessary iterations for the computationally expensive estimation method, one possible solution is to employ higher-order approximation schemes.

Therefore, we investigate the utilization and usefulness of such higher-order approximations on the example of the Milstein scheme. A closed form of the transition density based on the Milstein scheme is derived in [8]. In [9], this closed form is used to estimate the parameters of a hyperbolic diffusion process from high-frequency financial data, but not in the context of Bayesian data augmentation. For the latter, Elerian et al. [3] propose the possible use of the Milstein scheme. However, the specific implementation and benefit of this framework, in particular when using sophisticated proposal methods, remain unclear and, therefore, are the focus of this work. For our investigation, we first explain how to integrate the Milstein scheme into the framework of Bayesian data augmentation and then assess the effectiveness of this new combination in a simulation study which is a common approach in the literature (e.g. [10,11]).

This article is organized as follows. In §2, we define diffusion processes, describe the numerical approximation of their paths, and explain the derivation of the transition densities of the processes based on these approximations. In §3, we elaborate on the parameter estimation methods for diffusion processes using Bayesian data augmentation and the approximated transition densities. In §4, we give some comments about our implementation of these methods and in §5, we explain the set-up of our simulation study. In §§6 and 7, we present the results and discussion. The source code of our implementation and the simulation study is publicly available at https://github.com/fuchslab/Inference_for_SDEs_with_the_Milstein_scheme.

### 2. Approximation of the transition density of a diffusion process

We consider a d-dimensional time-homogeneous Itô diffusion process, (Xt)t≥0, a stochastic process that fulfils the following SDE:

$dXt=μ(Xt,θ) dt+σ(Xt,θ) dBt,X0=x0,$2.1
with state space $X⊆Rd$, starting value $x0∈X$, and a q-dimensional Brownian motion, (Bt)t≥0. The model parameter θΘ is from an open set $Θ⊆Rp$. In addition, we assume that the drift function $μ : X×Θ→Rd$ and diffusion function $σ : X×Θ→Rd×q$ fulfil the Lipschitz condition and growth bound to ensure that (2.1) has a unique solution (e.g. [12, ch. 5]).

In this work, we use rather simple, well-known examples of such a diffusion process in order to focus on the investigated estimation methods and make the article easy to follow. Our example for the main text is the geometric Brownian motion (GBM), and in appendix D, we also provide all relevant details for the Cox–Ingersoll–Ross (CIR) process. The GBM is described by the following SDE:

$dXt=αXt dt+σXt dBt,X0=x0,$2.2
with state space $X=R+$, starting value $x0∈X$ and the two-dimensional parameter θ = (α, σ)T, where $α∈R$ and $σ∈R+$, $R+$ being the set of all strictly positive real numbers. The GBM is especially suitable as a benchmark model because it has an explicit solution. The stochastic process
$Xt=x0exp((α−12σ2)t+σBt),$
fulfils (2.2) for all t ≥ 0. Hence, the multiplicative increments of the GBM are lognormally distributed as follows:
$XtXs∼LN((α−12σ2)(t−s),σ2(t−s)),$
for ts ≥ 0, and the transition density is explicitly known as
$p(s,x,t,y)=P(Xt=y | Xs=x)=12π(t−s)σyexp(−(log⁡y−log⁡x−(α−12σ2)(t−s))22σ2(t−s)).$2.3
A derivation of the solution of the GBM and its transition density can be found in [13].

In different contexts, one often considers the logarithm of the GBM, log Xt, which is simply a normally distributed random variable for fixed t, with corresponding SDE

$d(log⁡Xt)=(α−12σ2)dt+σ dBt,log⁡X0=log⁡x0.$2.4
However, we do not employ this transformation here because of the constant diffusion function in (2.4). For the log-transformed GBM, the approximation methods that we wish to compare would yield an identical approximation.

#### 2.1. Approximation of the solution of an SDE

Unlike the GBM, most SDEs do not have an analytical solution; thus, their transition densities are not explicitly known. Instead, numerical approximation schemes are used for the solution of the SDEs. Kloeden & Platen [14] have provided a detailed description of these methods. Several of the approximation schemes are based on the stochastic Taylor expansion. For a general treatment of this expansion, we refer the interested reader to [14]. The most commonly used approximation is the Euler(–Maruyama) scheme, which approximates the d-dimensional solution (Xt)t≥0 of an SDE by setting Y0 = x0 and, then, successively calculating the following:

$Yk+1=Yk+μ(Yk,θ)Δtk+σ(Yk,θ)ΔBk,$2.5
where Δtk = tk+1tk, $ΔBk=Btk+1−Btk$ and Yk is the approximation of $Xtk$ for $k=0,1,2,…$. The approximation improves as the time step Δtk decreases. The Euler scheme contains only the time component and the stochastic integral of multiplicity one from the stochastic Taylor expansion of process (Xt)t≥0, and has strong order of convergence 0.5.

A discrete-time approximation YΔ with maximum step size Δ > 0 converges with strong order γ > 0 at time T to the solution XT of a given SDE if there exists a positive constant C independent of Δ and a Δ0 > 0 such that

$E(|XT−YTΔ|)≤CΔγ,$
for all Δ ∈ (0, Δ0). Strong convergence ensures a pathwise approximation of the solution process (Xt)t≥0 of the given SDE. The higher the order of strong convergence is, the faster the mean absolute error between the approximation and the solution decreases as the maximum time step size Δ decreases.

By adding another term of the stochastic Taylor expansion to equation (2.5), one obtains the Milstein scheme that approximates the d-dimensional process (Xt)t≥0 by setting Y0 = x0 and, then, successively calculating for the ith component

$Yk+1(i)=Yk(i)+μi(Yk,θ)Δtk+∑ j=1qσij(Yk,θ)ΔBk(j)+∑ j=1q∑l=1q∑r=1dσrj(Yk,θ)∂σil∂y(r)(Yk,θ)∫tktk+1∫tks dBu(j)dBs(l)$2.6
for k = 0, 1, … and i = 1, …, d.

When σ(Yk, θ) is constant in Yk, the last term vanishes and the Milstein scheme reduces to the Euler scheme. If μ(Yk, θ) is once continuously differentiable and σ(Yk, θ) is twice continuously differentiable regarding Yk, then, the Milstein scheme is strongly convergent of order 1.0, which is higher than that of the Euler scheme. An illustration of this difference in the simulation of SDE trajectories is presented e.g. in [15]. However, there is a severe restriction on the practical applicability of the Milstein scheme because the stochastic double integral in the last term of (2.6) only yields an analytical solution for j = l. Although approximation techniques for the double integral exist (e.g. [14]), they are unsuitable for our purposes. On the one hand, we wish to avoid adding yet another layer of approximation and, thus, additional computational time. On the other hand, we must find the distribution of Yk+1 based on approximation schemes (2.5) and (2.6), which is also not explicitly possible when adding another approximation. For this reason, we focus on models where the double integral appears exclusively for the same components of the Brownian motion. For example, this is the case when the process is driven by a one-dimensional Brownian motion (i.e. the diffusion function σ(Yk, θ) is of dimension d × 1). Hence, the diffusion model includes only one source of noise that may affect each of the components of the process. More generally, we require that

$σrj(Yk,θ)∂σil∂y(r)(Yk,θ)≡0for j≠l,$2.7
so that only j = l is inside the double integral. Relation (2.7) implies the following:
 — if an entry σrj(Yk, θ) is non-zero, then the entries of all other columns and all rows must not depend on $Yk(r)$, and — if an entry σil(Yk, θ) depends on $Yk(r)$, then the entries of all other columns in row r must be zero.
In particular, this means that unless the rth row of the diffusion function contains only zeros, component $Yk(r)$ can only appear in one column of the diffusion function (and if it appears, then the entries of all other columns in row r must be zero). Moreover, each component of the diffusion process (Xt)t≥0 can only be directly affected by more than one component of the Brownian motion, if the size of all stochastic effects (i.e. all entries of the diffusion function) does not depend on the respective component of the diffusion process. Further, if all d components of the diffusion process appear in the diffusion function, then the process can be affected by at most d components of the Brownian motion. Besides, if all d components of the diffusion process appear in the diffusion function and the process shall be affected by d components of the Brownian motion, the diffusion function must be a (possibly column-wise permuted) diagonal matrix. In many applications, these are not realistic assumptions.

Assume that the ith component of the diffusion process appears in the ith row of the diffusion function and that the respective entry of the diffusion function does not depend on the remaining components $Yk(r)$, ri (the contrary would impose restrictions on other rows, as described above). Then, the ith component of the approximated process is

$Yk+1(i)=Yk(i)+μi(Yk,θ)Δtk+σij(Yk,θ)ΔBk(j)+σij(Yk,θ)∂σij∂y(i)(Yk,θ)12((ΔBk(j))2−Δtk)$2.8
for k = 0, 1, … and where j is the column index of the one non-zero entry depending on $Yk(i)$ in the ith row of the diffusion function.

Moreover, note that if we consider the approximation $Yk+1(i)$ in equation (2.8) as a function $g(ΔBk(j))$ of the increment of the Brownian motion, g is quadratic in $ΔBk(j)$. Therefore, the function g has a global extremum with value

$g∗=Yk(i)−12σij(Yk,θ)/(∂σij∂y(i)(Yk,θ))+(μi(Yk,θ)−12σij(Yk,θ)∂σij∂y(i)(Yk,θ))Δtk.$2.9
Hence, there is a bound on the range of possible values for $Yk+1(i)$ resulting from the Milstein scheme which might exclude values that the solution process $Xtk$ could take. Whether this is a lower or upper bound depends on the sign of the diffusion function and its derivative. The second derivative of g is given by
$∂2g(ΔBk(j))∂(ΔBk(j))2=σij(Yk,θ)∂σij∂y(i)(Yk,θ)=:g″.$
Thus, the extremum g* is a maximum and puts an upper bound on the possible values of $Yk+1(i)$ if g″ < 0, and g* is a minimum and puts a lower bound on $Yk+1(i)$ if g″ > 0. For the case where g″ = 0, the Milstein scheme reduces to the Euler scheme.

Since our example, the GBM, is a one-dimensional process, the double integral in equation (2.6) vanishes and the Milstein scheme for the GBM yields the following:

$Yk+1=Yk+αYkΔtk+σYkΔBk+12σ2Yk((ΔBk)2−Δtk),$
for $k=0,1,…$, where the first three summands also correspond to the Euler scheme. Figure 1 illustrates the two approximation schemes. It presents three trajectories of the GBM, which are represented by red points and which were simulated by setting a seed for the random number generator and, then, sampling from the explicit transition density (2.3). The same seed was used to sample the increments of the Brownian motion from the normal density and then transform them by (2.5) and (2.6) to obtain the Euler (black) and the Milstein (blue) approximation of the trajectories. We observe that in almost all cases, the Milstein approximation is either closer to or as close to the points of the trajectories as the Euler approximation.

#### 2.2. Transition densities based on approximation schemes

While sampling diffusion paths is fairly straightforward for both approximation schemes as described above, determining the corresponding transition density is less apparent for the Milstein scheme. Since the Euler scheme is a linear transformation of $ΔBk∼N(0,ΔtkIq)$, where Iq denotes the m-dimensional identity matrix, the transition density derived from the Euler scheme is also a multivariate Gaussian density:

$πEuler(Yk+1|Yk)=ϕ(Yk+1|Yk+μ(Yk,θ) Δtk,σ(Yk,θ)σT(Yk,θ)Δtk),$
where ϕ(y|a, b) denotes the multivariate Gaussian density with mean $a∈Rd$ and covariance matrix $b∈Rd×d$ evaluated at y.

For the Milstein scheme, deriving the transition density is more complicated, even in the case of a one-dimensional diffusion process, which we consider here. Elerian [8] derived the transition density by first rearranging the Milstein scheme to obtain a transformation of a non-central chi-squared distributed variable for which the density is known, and then applying the random variable transformation theorem. In appendix A, we present an alternative derivation that directly applies the random variable transformation theorem to ΔBk. Both approaches produce the same result. For simplicity of notation, we set μk := μ(Yk, θ), σk : = σ(Yk, θ) and $σk′:=∂σ(y,θ)/∂y |y=Yk$. Then, the transition density based on the Milstein approximation for a one-dimensional diffusion process is as follows:

$πMil(Yk+1|Yk)=exp(−Ck(Yk+1)Dk)2πΔtkAk(Yk+1)⋅[exp(−Ak(Yk+1)Dk)+exp(Ak(Yk+1)Dk)]$
with
$Ak(Yk+1)=(σk)2+2σkσk′(Yk+1−Yk−(μk−12σkσk′)Δtk),Ck(Yk+1)=σk+σk′(Yk+1−Yk−(μk−12σkσk′)Δtk),Dk=σk(σk′)2Δtk$
and for
$Yk+1≥Yk−12σkσk′+(μk−12σkσk′)Δtk,if σkσk′>0$2.10
and
$Yk+1≤Yk−12σkσk′+(μk−12σkσk′)Δtk,if σkσk′<0.$2.11
The bounds in (2.10) and (2.11) coincide with the bound (2.9) on the range of possible values Yk+1 resulting from the Milstein scheme in §2.1. For values of Yk+1 within the respective bound, Ak(Yk+1) is non-negative and its square root takes real values; otherwise, the transition density is equal to zero. Hence, there is a lower or an upper bound on the support of πMil. Moreover, one can show that the value of the transition density tends to infinity as Yk+1 approaches the bound. However, the interval for which the density increases towards infinity may be arbitrarily narrow depending on the parameter setting.

For the GBM, we have σ(Xt, θ) = σXt with parameter σ > 0, the process taking values in $R+$. Therefore, we obtain a lower bound for the possible values of Yk+1

$Yk+1≥Yk(12+(α−12σ2)Δtk).$2.12
Depending on the parameter combination θ = (α, σ)T, this lower bound may be negative, in which case the support of the transition density includes the entire state space of the GBM.

In figure 2, we illustrate the transition densities based on the GBM solution, Euler scheme and Milstein scheme for two different parameter settings. We observe that the Milstein transition density better approximates the mode of the transition density of the solution than the Euler transition density does. On the other hand, while the support of the Euler transition density is the set of all real numbers, the Milstein transition density puts zero weight on the values of Yk+1 that are below the lower bound, even though some of the values are feasible according to the transition density of the solution process.

Other approximation methods for the transition densities were developed for example in [1618]. Here, we focus on the numerical approximation methods described above. Because for the estimation methods introduced in the next section, it is crucial to not only be able to approximate the transition density, but also sampling from the resulting density needs to be possible and fast.

### 3. Bayesian data augmentation for the parameter estimation of diffusions

With low-frequency observations $Xobs=(Xτ0,…,XτM)$ of the process (Xt)t≥0 described by the SDE (2.1), we wish to estimate parameter θ. In this work, we assume that all observations are complete (i.e. there are no latent or unobserved components for all observations) and that there are no measurement errors. The approximation schemes for the solution of the SDE as introduced in §2 are only appropriate for small time steps. Therefore, we introduce additional data points Ximp at intermediate time points (as visualized in figure 3 and explained in detail in §3.2) and estimate the parameter θ from the augmented path ${Xobs,Ximp}$. To this end, a two-step MCMC approach is used to construct the Markov chain ${θ(i),X(i)imp}i=1,…,L$, the elements of which are samples from the joint posterior distribution π(θ, Ximp | Xobs):

• Step (1) Parameter update: Draw $θ(i)∼π(θ(i) | Xobs,X(i−1)imp)$,

• Step (2) Path update: Draw $X(i)imp∼π(X(i)imp | Xobs,θ(i))$.

A general introduction to MCMC methods is presented in [19]. The resulting MCMC chain ${θ(i),X(i)imp}i=l+1,… , L$, after discarding the first l elements as burn-in, can be considered a sample drawn from the joint posterior distribution π(θ, Ximp | Xobs) and can be used for a fully Bayesian analysis. The two steps of the algorithm are described in detail in the following two subsections. We use π to denote the exact densities of the process that is the (full conditional) posterior densities as well as the transition densities. The meaning becomes clear from the arguments. Approximated densities are indicated by a corresponding superscript.

#### 3.1. Parameter update

In Step (1), a parameter proposal θ* is drawn from a proposal density q(θ* | θ, Xobs, Ximp) which may or may not depend on the imputed and observed data. If a proposal θ* = θ + u with an update u that is independent of the current parameter value θ is used, the proposal strategy is called a random walk proposal. Proposal θ* is accepted with the following probability:

$ζ(θ∗,θ)=1∧π(θ∗ | Xobs,Ximp) q(θ | θ∗,Xobs,Ximp)π(θ | Xobs,Ximp) q(θ∗ | θ,Xobs,Ximp).$
Otherwise, the previous θ value is kept.

Due to Bayes’ theorem and the fact that a diffusion process has the Markov property, the (full conditional) posterior density can be represented as

$π(θ | Xobs,Ximp)∝(∏k=0n−1π(Xtk+1 | Xtk,θ))p(θ),$
where $π(Xtk+1 | Xtk,θ)$ denotes the transition density of the process (Xt)t≥0, n + 1 is the total number of data points in the augmented path, and p denotes the prior density of the parameter. We choose a random walk proposal where the r components of θ* that take values on the entire real line $R$ are drawn from the normal distribution $N(θj,γj2)$ for $j=1,…,r$ and some predefined $γj∈R+$. The (remaining) strictly positive components are drawn from a lognormal distribution $LN(log⁡θj,γj2)$, for j = r + 1, …, p. In this case, the acceptance probability reduces to
$ζ(θ∗,θ)=1∧(∏k=0n−1π(Xtk+1 | Xtk,θ∗)π(Xtk+1 | Xtk,θ))p(θ∗)p(θ)(∏ j=r+1 pθj∗θj),$3.1
as derived in [2, ch. 7.1.3].

The transition density $π(Xtk+1 | Xtk,θ)$ is generally not explicitly known, but it can be approximated by the Euler or Milstein scheme as described in §2.

#### 3.2. Path update

Since a diffusion process has the Markov property, the likelihood function of parameter θ factorizes as

$π(Xτ0,…,XτM | θ)=π(Xτ0 | θ)∏i=1Mπ(Xτi | Xτi−1,θ),$3.2
and the latent path segments between observations are conditionally independent given the observations. Hence, it is sufficient to consider the imputation problem in Step (2) only for one path segment between two consecutive observations $Xτi$ and $Xτi+1$. As figure 3 illustrates, the time interval between the two observations is divided into m subintervals, such that the endpoints of these intervals are $τi=t0 and the time steps are Δtk = tk+1tk for $k=0,…,m−1$. We denote the observations by $X{τi,τi+1}obs={Xτi,Xτi+1}$ and the imputed data points by $X(τi,τi+1)imp={Xt1,…,Xtm−1}$.

After initializing the imputed data by linear interpolation, the path is updated using the Metropolis–Hastings algorithm. A proposal $X(τi,τi+1)imp∗$ is drawn from a distribution with density q, which may depend on the observed data, current imputed data and parameter θ. It is accepted with the following probability:

$ζ(X(τi,τi+1)imp∗,X(τi,τi+1)imp)=1∧π(X(τi,τi+1)imp∗ | X{τi,τi+1}obs,θ) q(X(τi,τi+1)imp | X(τi,τi+1)imp∗,X{τi,τi+1}obs,θ)π(X(τi,τi+1)imp | X{τi,τi+1}obs,θ) q(X(τi,τi+1)imp∗ | X(τi,τi+1)imp,X{τi,τi+1}obs,θ).$3.3
Otherwise, the proposal is discarded and the previously imputed data $X(τi,τi+1)imp$ is kept. Due to the Markov property, we have:
$π(X(τi,τi+1)imp∗ | X{τi,τi+1}obs,θ)π(X(τi,τi+1)imp | X{τi,τi+1}obs,θ)=∏k=0m−1π(Xtk+1∗ | Xtk∗,θ)π(Xtk+1 | Xtk,θ),$
where $Xt0∗=Xt0=Xτi$, $Xtm∗=Xtm=Xτi+1$ and $π(Xtk+1 | Xtk,θ)$ denotes the transition density of process (Xt)t≥0.

The challenging aspect of the path update step involves determining how to propose new points. The simplest approach uses the (approximated) transition density to propose a new point by conditioning only on the point to the left of the new point. We call this proposal method the left-conditioned proposal and illustrate it in figure 4a. The proposal density of an entire path segment is simply the product

$qLC(X(τi,τi+1)imp∗ | Xτi,,θ)=∏k=0m−2π(Xtk+1∗ | Xtk∗,θ),$
where $Xt0∗=Xτi$. Thus, the acceptance probability reduces to
$ζ(X(τi,τi+1)imp∗,X(τi,τi+1)imp)=1∧(∏k=0m−1π(Xtk+1∗ | Xtk∗,θ)π(Xtk+1 | Xtk,θ))(∏k=0m−2π(Xtk+1 | Xtk,θ)π(Xtk+1∗ | Xtk∗,θ))=1∧π(Xτi+1 | Xtm−1∗,θ)π(Xτi+1 | Xtm−1,θ),$
where $Xtm∗=Xtm=Xτi+1$. Here, the transition density can again be approximated by the Euler or Milstein scheme from §2.

This proposal strategy considers the information from the observation $Xτi$ on the left, while the proposed path segment is independent of the observation $Xτi+1$ on the right. This may lead to a large jump in the last step from $Xtm−1$ to $Xτi+1$, as can be seen in figure 4c,d, and hence, to an improbable transition. Therefore, the acceptance probability for the left-conditioned proposal $X(τi,τi+1)imp∗$, and consequently, the acceptance rate of the MCMC sampler is usually low.

A number of more sophisticated proposal strategies have been suggested. Ch. 7.1 in [2] reviews some of these. Here, we first consider the modified bridge (MB) proposal, which conditions on both the previous data point and the following observation on the right, as visualized in figure 4b. This strategy was originally proposed by Durham & Gallant [20] and first applied in the Bayesian framework in [21]. More recently, Whitaker et al. [10] suggested improved bridge constructs, and van der Meulen & Schauer [22] proposed the so-called guided proposals.

For the MB proposal, the proposal density of an entire path segment factorizes again as follows:

$qMB(X(τi,τi+1)imp∗ | Xτi,,Xτi+1,θ)=∏k=0m−2π(Xtk+1∗ | Xtk∗,Xτi+1,θ),$
where $Xt0∗=Xτi$. We apply Bayes’ theorem and the Markov property to rewrite the left- and right-conditioned proposal density of one point as
$π(Xtk+1∗ | Xtk∗,Xτi+1,θ)∝π(Xtk+1∗ | Xtk∗,θ) π(Xτi+1 | Xtk+1∗,θ),$3.4
for $k=0,…,m−2$.

In [20], it is suggested to approximate the two transition densities on the right-hand side by the Euler scheme and to further approximate $μ(Xtk+1∗,θ)$ and $σ(Xtk+1∗,θ)$ by $μ(Xtk∗,θ)$ and $σ(Xtk∗,θ)$, respectively. This way, they obtain that (3.4) is approximately proportional to a Gaussian density which we will use for the MB proposal based on the Euler scheme

$πEuler(Xtk+1∗ | Xtk∗,Xτi+1,θ)=ϕ(Xtk+1∗ | Xtk∗+(Xτi+1−Xtk∗τi+1−tk) Δtk, (τi+1−tk+1τi+1−tk)Σ(Xtk∗,θ)Δtk),$3.5
where $Σ(Xtk∗,θ)=σ2(Xtk∗,θ)$ and ϕ is defined in §2.2.

We now consider the Milstein approximation for the two factors on the right-hand side of (3.4). The first factor resembles the Milstein transition density stated in §2.2. With the same notation, Δ+ = tmtk+1, and tm = τi+1, the second factor is as follows:

$πMil(Xtm|Xtk+1∗,θ)=exp⁡(−Fm(Xtk+1∗)/Gm(Xtk+1∗))2πΔ+Em(Xtk+1∗)×[exp(−Em(Xtk+1∗)Gm(Xtk+1∗))+exp(Em(Xtk+1∗)Gm(Xtk+1∗))]$
with
$Em(Xtk+1∗)=(σk+1∗)2+2σk+1∗σ∗k+1′(Xtm−Xtk+1∗−(μk+1∗−12σk+1∗σ∗k+1′)Δ+),Fm(Xtk+1∗)=σk+1∗+σ∗k+1′(Xtm−Xtk+1∗−(μk+1∗−12σk+1∗σ∗k+1′)Δ+),Gm(Xtk+1∗)=σk+1∗(σ∗k+1′)2Δ+,$
for $Em(Xtk+1∗)≥0$ (which cannot be rearranged for $Xtk+1∗$ in general); otherwise, the density is equal to zero. The terms $μk+1∗$ and $σk+1∗$ are similar to μk+1 and σk+1, but $Xtk+1$ is replaced by $Xtk+1∗$. Here, we do not respectively approximate μk+1 and σk+1 by μk and σk because doing so does not lead to simplification. Moreover, there is no closed formula for the normalization constant needed to scale the product of the two transition densities to a proper density.

For the GBM, we have Xt > 0 and $σk+1∗=σXtk+1∗>0$ and thus, obtain the following bounds for $πMil(Xtm|Xtk+1∗,θ)$, the second factor in (3.4):

$Xtk+1∗≤Xtm12+(α−12σ2)Δ+=:u2nd,if 12+(α−12σ2)Δ+>0(Case I),Xtk+1∗≥Xtm12+(α−12σ2)Δ+=:l2nd,if 12+(α−12σ2)Δ+<0(Case II)andXtk+1∗≥0,if 12+(α−12σ2)Δ+=0(Case III).$
From (2.12), we obtain the following lower bound for $πMil(Xtk+1∗ | Xtk∗,θ)$, the first factor in (3.4):
$Xtk+1∗≥Xtk∗(12+(α−12σ2)Δtk)=:l1st.$
At the same time, proposals $Xtk+1∗$ for the GBM should always be strictly positive to be in the state space. Let l: = max{0, l1st}. The constraints on $Xtk+1∗$ derived from the two factors in (3.4) lead to three cases for the set $D$ of feasible points of $Xtk+1∗$ for the GBM (assuming $Xtm>0$)
$D={∅,if ( Case I) applies and l1st>u2nd,[l,u2nd],if ( Case I) applies and l1st≤u2nd,[l,∞),if ( Case II) or ( Case III) apply.$

Since the MB proposal takes into account information not only from the left data point but also from the observation on the right, it does not have a large jump in the last step as the left-conditioned proposal does. This is also apparent in the simulations in figure 4e,f. Therefore, the acceptance probability and acceptance rate are usually higher for the MB proposal than for the left-conditioned proposal. As appendix B demonstrates, the acceptance probability is even equal to 1 for the MB proposal if only one data point is imputed between two observations (i.e. the number of inter-observation intervals is m = 2). This holds when using the Milstein scheme to approximate the transition density for the likelihood function and proposal density, but also when using the Euler scheme without the approximation of μk+1 and σk+1 by μk and σk, respectively.

The density of the MB proposal based on the Euler scheme in equation (3.5) can also be interpreted as the density that results from applying the Euler scheme to the following diffusion process:

$dXt=(Xτi+1−Xtτi+1−t) dt+τi+1−tk+1τi+1−tσ(Xt,θ) dBt$
for t ∈ [tk, tk+1]. See [10] for a detailed discussion of the connection between the modified bridge and the continuous-time conditioned process. Applying the Milstein scheme to this process yields another proposal scheme to which we refer as the diffusion bridge Milstein (DBM) proposal. For the DBM proposal, the proposal density of a path segment also factorizes as:
$qDBM(X(τi,τi+1)imp∗ | Xτi,,Xτi+1,θ)=∏k=0m−2π(Xtk+1∗ | Xtk∗,Xτi+1,θ),$
where $Xt0∗=Xτi$, and each factor $π(Xtk+1∗ | Xtk∗,Xτi+1,θ)$ corresponds to the density based on the Milstein scheme from §2.2 where we replace μk by $(Xτi+1−Xtk)/(τi+1−tk)$, σk by $(τi+1−tk+1)/(τi+1−tk)σ(Xtk,θ)$, and σk′ by $(τi+1−tk+1)/(τi+1−tk)∂σ(y,θ)/∂y |y=Xtk$. Like the MB proposal, the DBM proposal takes into account information from the observation on the right, and, therefore, it does not have a large jump in the last step as illustrated in figure 4g.

Thus far, our path update has only been applied to imputed points between two observations. It can easily be extended to a case with several observations along the path by simply decomposing the path into independent path proposals, multiplying the respective acceptance probabilities and collectively accepting or rejecting the proposals. Moreover, the entire path does not have to be updated all at once, but can be divided into several path segments that are successively updated. Different algorithms for choosing the update interval are summarized in [2] and appendix C describes one of them.

Another challenge in the context of Bayesian data augmentation and the MCMC scheme discussed above is the dependence between the parameter components included in the diffusion function and the missing path segments between two observations. Roberts & Stramer [5] were the first to highlight that, in the discretized setting (as we consider it here), the dependence leads to a slower convergence of the MCMC algorithm as the number of imputed points m increases. All estimation methods compared here are affected by this issue in the same way; we hence do not further consider it here.

We have introduced a number of possible options for the choices to be made when constructing an estimation method in the framework as described so far:

 — approximate the transition densities in the likelihood function based on the Euler or Milstein scheme, — use the left-conditioned, the MB or the DBM proposal, and — use the Euler or Milstein scheme for the proposal densities (for the left-conditioned or MB proposal).
In the following, we will omit the left-conditioned proposal due to the inefficiency that we already pointed out. Instead, we will consider the following four combinations:
• (MBE-E) MB proposal and transition density both based on the Euler scheme,

• (MBE-M) MB proposal based on the Euler scheme and transition density based on the Milstein scheme,

• (MBM-M) MB proposal and transition density both based on the Milstein scheme, and

• (DBM-M) DBM proposal (which is based on the Milstein scheme) and transition density based on the Milstein scheme.

Combination MBE-M merges the Euler and Milstein scheme. We include it here because it combines the faster scheme for the proposals (where accuracy is less important) and the more accurate scheme for the acceptance probability.

To our knowledge, we are the first to use the Milstein scheme in the MCMC context described above.

### 4. Implementation

The implementation is relatively straightforward for the majority of the estimation procedures, and only the combination of the MB proposal and the Milstein approximation requires additional explanation. As mentioned, when approximating the two factors on the right-hand side of (3.4) by the transition density based on the Milstein scheme, there is no closed formula for the normalization constant to obtain a proper density. The normalization is necessary because the proposal density for a path segment is the product of several of the terms from (3.4), where the condition on the left point, $Xtk∗$, differs between a newly proposed segment and the last accepted segment if several consecutive points are imputed. Therefore, the normalization constants differ and do not cancel out in the acceptance probability. Normalization is not necessary only in the case where just one point is imputed between two observations (i.e. m = 2 subintervals) because the left point, $Xtk$, is always a (fixed) observed point that is not updated. Thus, the normalization constants cancel out in the acceptance probability. For m > 2, we numerically integrate the product (3.4) over $Xtk+1$ to obtain the normalization constant. The product in (3.4) may be very small (but not zero everywhere in a non-empty feasible set $D$) and may thus numerically integrate to zero, especially when the upper interval bound of the feasible set is infinite. To overcome this problem, we take two measures. First, we do not integrate over the entire set of feasible points but determine the maximum of the product numerically and then integrate over the interval that includes all points with a function value of at least 10−20 times this maximum. Second, we rescale the product in (3.4) by dividing by the maximum before integrating.

To sample from the Milstein MB proposal density, we employ rejection sampling. For this, normalization of the product in (3.4) is not necessary. Again, we numerically determine the maximum dmax of the product, and the interval $I$ that includes all points with a function value of at least 10−20 times this maximum. Then, we uniformly sample (u1, u2) from rectangle $I×(0,dmax)$ and accept u1 as a proposal $Xtk+1∗$ if the unnormalized density value of (3.4) at u1 is at most u2.

For the combination of the MB proposal and the Milstein approximation, the set of feasible proposal points may be empty. In this case, our implementation shifts to the Euler approximation for this point, i.e. the point is proposed with the MB proposal based on the Euler scheme and also the corresponding factor of the proposal density in the acceptance probability is based on the Euler scheme. In addition, for all methods, a negative point may be proposed, which is not feasible for a GBM. Therefore, in this case, we propose a new point. For both cases, we count the number of times that they occur during the estimation procedure. In the following simulation study, no cases of switching to the Euler scheme occurred and negative proposals occurred only very rarely (less than 1‰ of the number of iterations in the very worst case).

We implemented the described estimation procedures in R v. 3.6.2 [23]. The source code of our implementation and the following simulation study is publicly available at https://github.com/fuchslab/Inference_for_SDEs_with_the_Milstein_scheme.

### 5. Simulation study

In this section, we study the computational performance of the competing inference methods on the (relatively simple) benchmark model GBM. As a second (on the application side more often studied) benchmark model, the CIR process is investigated in appendix D. In this work, we focus on Bayesian inference by data augmentation and compare the four approaches listed at the end of §3.2. Conceptually different inference procedures, as summarized e.g. in [2], are not considered as competitors here as they would be employed in different data contexts. There are two aspects that are important to consider when we want to evaluate the different methods:

 (a) the accuracy with which the true posterior distribution is approximated based on one of the approximation schemes and a given number m and (b) the accuracy with which we are able to draw from this approximated posterior distribution.
We are interested in the overall accuracy, i.e. the combination of (a) and (b), achieved within a fixed amount of computational time.

For the simulation study, we generated 100 paths of the GBM in the time interval [0, 1] using the solution (2.3) with the parameter combination θ = (α, σ2)T = (1, 2)T and initial value x0 = 100. Figure 5 illustrates some of these paths. From each path, we took M = 20 equidistant points (i.e. the inter-observation time Δt was 0.05) and applied each of the four described estimation methods once. We imputed data such that we got m = 2 and m = 5 inter-observation intervals. We also included the case m = 1, i.e. no data were imputed and only Step (1) from §3, the parameter update, was repeated in the estimation procedure where the likelihood of the path in the acceptance probability is approximated by the Euler or the Milstein scheme. For the prior distribution of the parameters, we assumed that they were independently distributed with $α∼N(0,10)$ and σ2 ∼ IG(κ0 = 2, ν0 = 2), where IG denotes the inverse gamma distribution with shape parameter κ0 and scale parameter ν0. The a priori expectations of the parameters are thus $E(α)=0$ and $E(σ2)=2$.

Each of the estimation procedures performs the following steps:

(1)

Draw initial values for the parameters α and σ2 from the prior distributions.

(2)

Initialize Yimp by linear interpolation.

(3)

Repeat the following steps:

Parameter update: Apply random walk proposals.

 (a) Draw a proposal $α∗∼N(αi−1,0.25)$. (b) Draw a proposal $σ2∗∼LN(log⁡σi−12,0.25)$. (c) Accept both or none.

Path update:

 (a) Choose an update interval (ta, tb) as described in appendix C with λ = 5. (b) Draw a proposal $X(ta,tb)imp∗$ according to the investigated method. (c) Accept or reject the proposal.

We let each procedure run for 1 h and evaluate the overall accuracy of the obtained sample compared to a sample from the true posterior distribution (as described below).

Figures 6 and 7 present the output from one estimation procedure on the example of the combination MBM-M of the MB proposal and the Milstein approximation for the proposal density and the likelihood function. From each estimation procedure, we obtained an MCMC chain of dimension n(m − 1) + 2. For each chain, we used the two components for parameters α and σ2 and calculated the mean, the median, and the variance after cutting off a burn-in phase of 5000 iterations. To justify our use of independent proposals for the parameter update, we show in appendix E that the parameters are not strongly correlated.

As a benchmark, we also sampled from the true parameter posterior distribution based on the solution of the GBM. We used the Stan software [24,25] which provides an efficient C++ implementation of Hamiltonian Monte Carlo (HMC) sampling with the No-U-turn sampler to sample from the true parameter posterior distribution. For each posterior distribution corresponding to one of the 100 sample paths, we generated four HMC chains with 500 000 iterations each. The first half of the chains was discarded as warm-up and the remaining draws were combined to give a sample of size 106. We calculated the multivariate effective sample size (ESS) as defined in [26] which provides the size of an independent and identically distributed sample equivalent to our samples in terms of variance and found that the ESS of the obtained samples from the true posterior distribution is well over 500 000. For each of these samples, we also calculated the mean, the median and the variance.

The estimation procedures and time measurements were performed on a cluster of machines with the following specifications: AMD Opteron™ Processor 6376 (1.40 GHz), 512GB DDR3-RAM.

### 6. Results

Figures 8 and 9 and tables 1 and 2 summarize the results of running each of the methods once for 1 h for each of the 100 GBM trajectories. Figures 8 and 9 show the density plots of the difference between the respective statistic (mean, median or variance) calculated for a sample from the approximated posterior distribution obtained by the respective method and the statistic for a sample from the true posterior distribution of the same sample path. Each density plot aggregates 100 such difference values, one for each of the 100 GBM trajectories. Table 1 tabulates the root mean square error (RMSE) based on these differences for each of the considered methods, discretization levels m and statistics. We use the RMSE as the measure of the overall accuracy. The lower the RMSE is, the higher the accuracy of the respective method. Table 2 empirically evaluates the computational efficiency of the considered methods, including the number of iterations completed after 1 h, the multivariate ESS based on the obtained sample after discarding a burn-in phase of 5000 iterations, and the acceptance rates of the parameter and the path proposals. Each of these quantities is averaged over the 100 GBM trajectories and the coefficient of variation is also stated.

Table 1. Empirical characteristics for evaluating the overall accuracy of the parameter estimation procedures for different numbers m of subintervals between two observations aggregated over 100 deviations between the respective statistics calculated for the sample from the approximated posterior and for the sample from the true posterior distribution, one for each of the 100 sample paths of the GBM. The lowest RMSE (root mean square error) per m and per statistic is printed in bold.

methodRMSEs for α
RMSEs for σ2
meanmedianvariancemeanmedianvariance
m = 1Euler0.2820.2440.4560.6380.6000.471
Milstein0.8510.7801.1580.2820.2650.176
m = 2MBE-E0.2660.2380.5260.2110.1980.141
MBE-M0.3110.3020.4760.1090.1060.057
MBM-M0.3150.3050.4700.1120.1070.057
DBM-M0.3180.3080.4850.1010.0990.044
m = 5MBE-E0.2770.2540.5240.1130.0980.127
MBE-M0.2880.2740.4740.0310.0310.050
MBM-M0.2920.2780.4920.0400.0370.058
DBM-M0.2910.2750.4720.0310.0300.037

Table 2. Empirical characteristics for evaluating the computational efficiency of the parameter estimation procedures for different numbers m of subintervals between two observations aggregated over 100 trajectories of the GBM. Each of the procedures was run for 1 h. Acceptance rates are defined to take values between 0 and 1. For m = 1, no data points were imputed and only Step (1), the parameter update, was repeated in the estimation procedure. Specifications for the computing power are stated in the main text. c.v. denotes the coefficient of variation.

methodnumber of iterations after 1 h
multivariate effective sample size
acceptance rate of the parameters
acceptance rate of the path
meanc.v.meanc.v.meanc.v.meanc.v.
m = 1Euler25 134 3010.031 273 7440.160.5180.02
Milstein454 8630.03146 3620.410.4250.14
m = 2MBE-E8 583 6140.03170 8270.190.4420.010.8420.04
MBE-M1 816 1440.03240900.380.4170.030.7990.05
MBM-M300 8700.0368810.210.4170.031.0000.00
DBM-M754 0240.1028 0890.310.4170.030.8390.04
m = 5MBE-E6 765 0540.1049 8850.180.3100.010.8920.02
MBE-M892 4870.0250330.240.3040.010.8440.03
MBM-M78 2150.045730.200.3040.010.9780.01
DBM-M879 2270.0355350.210.3040.010.8840.02

For the drift parameter α of the GBM, the four considered schemes perform comparably for m = 2 and m = 5. In particular, the use of the Milstein approximation does not improve the accuracy of the posterior mean and median for the same discretization level m. The accuracy of the posterior variance is slightly improved by the use of the Milstein approximation when data are imputed. Moreover, for MBE-E, the accuracy does not consistently improve as m is increased. Whereas, the accuracy for the methods including the Milstein scheme improves considerably when imputed data are introduced (i.e. m > 1) and it improves slightly when m is increased from 2 to 5.

For the diffusion parameter σ2 of the GBM, we clearly see an improvement in overall accuracy for the methods involving the Milstein scheme. Combination DBM-M turns out to be the most accurate, closely followed by MBE-M in case of the mean and median.

According to table 2, the number of iterations completed within 1 h varies substantially among the different estimation procedures. It is always higher for the procedures that use the Euler approximation, while especially combination MBM-M is very time-consuming and thus completes fewer iterations. Similarly, the multivariate ESS varies substantially among the different estimation procedures. It is higher for m = 2 than for m = 5 for each of the considered estimation procedures. The acceptance rate of the parameters is slightly lower when the Milstein scheme is used for the approximation of the likelihood function. In addition, the acceptance rate of the parameters decreases as the number of imputed points increases. The acceptance rate of the path is highest for combination MBM-M. For MBE-E, it would be just as high if one would not substitute μk+1 and σk+1 by μk and σk. For MBE-E, MBE-M and DBM-M, the acceptance rate of the path increases as the number of imputed points increases.

### 7. Summary and discussion

We have demonstrated how to implement an algorithm for the parameter estimation of SDEs from low-frequency data using the Milstein scheme to approximate the transition density of the underlying process. Our motivation was to improve numerical accuracy and thus reduce the amount of imputed data and computational overhead. However, our findings are rather discouraging: we found that this method can be applied to multidimensional processes only with impractical restrictions. Moreover, we showed that the combination of the MB proposal with the Milstein scheme for the proposal density may lead to an empty set of possible proposal points, which would require switching to the Euler scheme in order to proceed. One of the strengths of the original (Euler-based) MCMC scheme is its generic character and applicability. Through this, it possesses a practical advantage over otherwise more sophisticated methods such as the Exact Algorithm [27]. This strength does not translate to the Milstein-based MCMC scheme due to the limited applicability of the Milstein approximation especially in the multidimensional setting. Thus, methods like the Exact Algorithm may be a reasonable alternative. The limited applicability of the Milstein approximation would also persist for advanced forms of the discussed MCMC scheme like the innovation scheme in [6] or for even more generic algorithms like particle MCMC as studied in [28].

In our simulation study, we found that the overall accuracy for the estimates for the drift parameter of the GBM does not necessarily improve when the Milstein scheme is used. Fewer iterations are completed for the methods involving the Milstein scheme and also the ESS is substantially lower. Thus, the poor sampling efficiency might outweigh the (potential) increase in accuracy of the approximation of the posterior distribution. Especially, the combination MBM-M results in a particularly low number of iterations and a low ESS. Owing to the already quite low ESS achieved by the Milstein-based methods for m = 5 subintervals between two observations, we did not consider higher discretization levels. Moreover, note that tuning the variance hyperparameters for the random walk proposals of the parameters in Step (3) in the simulation study to reach an optimal acceptance rate might lead to a higher ESS. However, since the acceptance rates achieved in the simulation study lie in a range where the sampling efficiency is rather robust to changes in the acceptance rate as shown in [29] (in the high-dimensional limit), we do not expect the change in the ESS after tuning to be substantial.

For the estimates for the GBM diffusion parameter, the overall accuracy is increased by the use of the Milstein scheme. DBM-M turns out to be the most effective combination in terms of overall accuracy.

We conducted another simulation study on the example of the CIR process, as shown in appendix D, and the results are very similar as for the GBM. The use of the Milstein approximation does not consistently improve the overall accuracy for the drift parameter; however, it does improve the accuracy for the diffusion parameter. Again combination DBM-M achieves the highest accuracy, closely followed by MBE-M.

It was expected that the use of the Milstein scheme would make a difference for the estimates for the diffusion parameters because the additional term added by the Milstein scheme compared to the Euler scheme involves the diffusion function and its derivative. Nevertheless, the general applicability of the Euler scheme remains a great advantage and the search for different proposal schemes such as in [10,22] rather than for different numerical discretization schemes may be a more promising way towards more efficient estimation algorithms for diffusion processes.

### Data accessibility

The source code of our implementation and the simulation study is publicly available at https://github.com/fuchslab/Inference_for_SDEs_with_the_Milstein_scheme.

### Authors' contributions

C.F. devised the project and provided supervision. S.P. implemented the described algorithms, carried out the simulation study and drafted the manuscript. Both authors contributed to the final version of the manuscript, gave final approval for publication and agree to be held accountable for the work performed therein.

### Competing interests

The authors declare that there are no conflicts of interest regarding the publication of this paper.

### Funding

Our research was supported by the German Research Foundation within the SFB 1243, Subproject A17, by the Federal Ministry of Education and Research under grant no. 01DH17024, and by the Helmholtz pilot project ‘Uncertainty Quantification’.

## Acknowledgements

The authors wish to thank three anonymous reviewers for very valuable suggestions that helped to significantly improve this article.

## Appendix A. Derivation of the transition density based on the Milstein scheme

The Milstein scheme

$Yk+1=Yk+μ(Yk,θ)Δtk+σ(Yk,θ)ΔBk+12σ(Yk,θ)∂σ∂y(Yk,θ)((ΔBk)2−Δtk),$
can be considered a variable transformation of the random variable $Z∼N(0,1)$ with density ϕ(z) using the transformation function
$f(z)=az2+bz+c,$
where the coefficients are defined as
$a=12σ(Yk,θ)∂σ∂y(Yk,θ)Δtk,b=σ(Yk,θ)Δtkandc=Yk+[μ(Yk,θ)−12σ(Yk,θ)∂σ∂y(Yk,θ)]Δtk,$
and whose derivative and inverse function are
$f′(z)=2az+b$
and
$f−1(y)=−b2a±b2+4a(y−c)2afor y≥−b24a+c.$
By applying the random variable transformation theorem as found in [30, p. 269] or [31, p. 27], the density ρY of Yk+1 can be derived as follows:
$ρY(y)=∑{z∈R:f(z)=y}ϕ(z)|f′(z)|=ϕ(−b2a−b2+4a(y−c)2a)|f′(−b2a−b2+4a(y−c)2a)|+ϕ(−b2a+b2+4a(y−c)2a)|f′(−b2a+b2+4a(y−c)2a)|=12πexp(−12(−b2a−b2+4a(y−c)2a)2)|b+2a(−ba−b2+4a(y−c)2a)|+12πexp(−12(−b2a+b2+4a(y−c)2a)2)|b+2a(−b2a+b2+4a(y−c)2a)|=12π(exp(−18a2(b2+2bb2+4a(y−c)+b2+4a(y−c)))|−b2+4a(y−c)|+exp(−18a2(b2−2bb2+4a(y−c)+b2+4a(y−c)))|b2+4a(y−c)|)=exp⁡(−b2+2a(y−c)4a2)2πb2+4a(y−c)(exp(−bb2+4a(y−c)4a2)+exp(bb2+4a(y−c)4a2))=exp⁡(−b2+2a(y−c)4a2)2πb2+4a(y−c)⋅2cosh(bb2+4a(y−c)4a2).$
After substituting the coefficients a, b and c and abbreviating μk := μ(Yk, θ), σk := σ(Yk, θ) and $σk′:=σ′(Yk,θ)=∂σ(y,θ)/∂y|y=Yk$, we obtain the transition density based on the Milstein scheme
$πMil(Yk+1|Yk,θ)=exp(−(σkΔtk)2+212σkσk′Δtk(Yk+1−Yk−(μk−12σkσk′)Δtk)4(12σkσk′Δtk)2)2π(σkΔtk)2+412σkσk′Δtk(Yk+1−Yk−(μk−12σkσk′)Δtk)⋅[exp(−σkΔtk(σkΔtk)2+412σkσk′Δtk(Yk+1−Yk−(μk−12σkσk′)Δtk)4(12σkσk′Δtk)2)+exp(σkΔtk(σkΔtk)2+412σkσk′Δtk(Yk+1−Yk−(μk−12σkσk′)Δtk)4(12σkσk′Δtk)2)]=exp⁡(−Ck(Yk+1)Dk)2πΔtkAk(Yk+1)⋅[exp(−Ak(Yk+1)Dk)+exp(Ak(Yk+1)Dk)]$
with
$Ak(Yk+1)=(σk)2+2σkσk′(Yk+1−Yk−(μk−12σkσk′)Δtk)Ck(Yk+1)=σk+σk′(Yk+1−Yk−(μk−12σkσk′)Δtk)Dk=σk(σk′)2Δtk$
and for
$Yk+1≥Yk−12σkσk′+(μk−12σkσk′)Δtk,if σkσk′>0andYk+1≤Yk−12σkσk′+(μk−12σkσk′)Δtk,if σkσk′<0.$
In the case of σk = 0, Yk+1 conditioned on Yk is deterministic. For σk = 0, the Milstein scheme reduces to the Euler scheme.

## Appendix B. Derivation of the acceptance probability for the MB proposal for m = 2 inter-observation intervals

As stated in §3.2, the acceptance probability for the path update between two consecutive observations $Xτi$ and $Xτi+1$ with the MB proposal is

$ζ(X(τi,τi+1)imp∗,X(τi,τi+1)imp)=1∧π(X(τi,τi+1)imp∗ | X{τi,τi+1}obs,θ)qMB(X(τi,τi+1)imp | Xτi,,Xτi+1,θ)π(X(τi,τi+1)imp | X{τi,τi+1}obs,θ)qMB(X(τi,τi+1)imp∗ | Xτi,,Xτi+1,θ)=1∧∏k=0m−1π(Xtk+1∗ | Xtk∗,θ)π(Xtk+1 | Xtk,θ)∏k=0m−2π(Xtk+1 | Xtk,Xτi+1,θ)π(Xtk+1∗ | Xtk∗,Xτi+1,θ),$
where $Xt0∗=Xt0=Xτi$ and $Xtm∗=Xtm=Xτi+1$. For the case where only one data point is imputed between two observations (i.e. m = 2) this reduces to
$ζ(X(τi,τi+1)imp∗,X(τi,τi+1)imp)=1∧π(Xt1∗ | Xτi,θ)π(Xτi+1 | Xt1∗,θ)π(Xt1 | Xτi,θ)π(Xτi+1 | Xt1,θ)π(Xt1 | Xτi,Xτi+1,θ)π(Xt1∗ | Xτi,Xτi+1,θ)=1∧[π(Xt1∗ | Xτi,θ)π(Xτi+1 | Xt1∗,θ)π(Xt1 | Xτi,θ)π(Xτi+1 | Xt1,θ)π(Xt1 | Xτi,θ) π(Xτi+1 | Xt1,θ)/π(Xτi+1 | Xτi,θ)π(Xt1∗ | Xτi,θ) π(Xτi+1 | Xt1∗,θ)/π(Xτi+1 | Xτi,θ)]=1.$
This relation holds for any (approximated) transition density $π(Xtk+1 | Xtk,θ)$.

## Appendix C. Choice of path update interval

For choosing the update interval, we use the random block size algorithm as suggested in [3]. Assuming that the augmented path contains a total of n + 1 data points Y0, …, Yn, it is divided into update segments $Y(c0,c1),Y(c1,c2),…$ by the following algorithm:

(1)

Set c0 = 0 and j = 1.

(2)

While cj−1 < n:

 (a) Draw Z ∼ Po(λ) and set $cj=min{c j−1+Z,n}$. (b) Increment j.

Here, Z ∼ Po(λ) denotes the Poisson distribution with parameter λ.

Such a random choice of the path update interval is a simple way to vary the set of points that are updated together within one iteration.

## Appendix D. Additional example: Cox–Ingersoll–Ross process

The one-dimensional CIR process fulfils the SDE

$dXt=α(β−Xt) dt+σXt dBt,X0=x0,$
with starting value $x0∈R+$ and parameters α, β, $σ∈R+$. If 2αβ > σ2, the process is strictly positive (i.e. $X=R+$) otherwise it is non-negative (i.e. $X=R0$). The transition density is explicitly known as
$p(s,x,t,y)= c(vu)η/2 e−(u+v)Iη(2uv)$
for t > s ≥ 0, where
$c=2ασ2(1−e−α(t−s)),u=cx e−α(t−s),v=cy,η=2αβσ2−1,$
and $Iη$ denotes the modified Bessel function of the first kind of order η, i.e.
$Iη(z)=∑k=0∞(z2)2k+η1k! Γ(k+η+1)$
for $z∈R$, where Γ is the Gamma function.

For the CIR process, we have $σ(Xt,θ)=σXt$ with parameter σ > 0, the process taking values in $R0$. We therefore obtain a lower bound for the possible values of $Xtk+1$ when applying the Milstein scheme

$Xtk+1≥(α(β−Xtk)−14σ2)Δtk=:lleft.$
The second bound that occurs when combining the MB proposal with the Milstein scheme is as follows:
$Xtk+1≥β−1α(1Δ+Xtm+14σ2)=:lright.$
The set $D$ of feasible points of $Xtk+1$ for the CIR process when combining the MB proposal with the Milstein scheme is thus $D=[l,∞)$ with l := max(0, lleft, lright).

For the simulation study, we generated 100 paths of the CIR process in the time interval [0, 1] with the parameter combination θ = (α, β, σ2)T = (1, 1, 2)T and initial value x0 = 10. From each path, we took 20 equidistant points and ran each of the described estimation methods once for 1 h to perform inference for the parameters β and σ2, assuming α to be known. For the prior distribution of the parameters, we assumed that they were independently distributed with β ∼ IG(κb = 3, νb = 3) and σ2 ∼ IG(κs = 3, νs = 4). The a priori expectations of the parameters are thus $E(β)=32$ and $E(σ2)=2$. For each estimation procedure, the steps as outlined in §5 were taken. As proposal densities for the parameters in Step (3), we used $β∗∼LN(log⁡βi−1,0.25)$ and $σ2∗∼LN(log⁡σi−12,0.25)$.

The sampling results are summarized in figures 10 and 11 and tables 3 and 4. Similar to the results for the GBM, the use of the Milstein approximation does not consistently improve the overall accuracy for the drift parameter β. The accuracy increases for increasing m for most of the methods. Only combination MBM-M has lower accuracy for m = 5 due to the low sampling efficiency and the resulting low ESS. For the diffusion parameter σ2, the use of the Milstein approximation and increasing m both improve the overall accuracy. Again combination DBM-M achieves the highest accuracy, closely followed by MBE-M.

Table 3. Empirical characteristics for evaluating the overall accuracy of the parameter estimation procedures for different numbers m of subintervals between two observations aggregated over 100 deviations between the respective statistics calculated for the sample from the approximated posterior and for the sample from the true posterior distribution, one for each of the 100 sample paths of the CIR process. The lowest RMSE per m and per statistic is printed in bold.

methodRMSEs for β
RMSEs for σ2
meanmedianvariancemeanmedianvariance
m = 1Euler0.01790.01150.04780.16030.15300.0673
Milstein0.01740.01100.05870.13060.12330.0595
m = 2MBE-E0.00990.00640.02650.09100.08650.0417
MBE-M0.01050.00630.04130.06560.06190.0309
MBM-M0.01510.01200.04620.06580.06250.0325
DBM-M0.00970.00610.03300.06530.06170.0308
m = 5MBE-E0.00520.00360.01440.04000.03800.0194
MBE-M0.00770.00490.03750.02710.02590.0156
MBM-M0.03070.02040.11030.05090.04200.0615
DBM-M0.00850.00520.03210.02700.02560.0156

Table 4. Empirical characteristics for evaluating the computational efficiency of the parameter estimation procedures for different numbers m of subintervals between two observations aggregated over 100 trajectories of the CIR process. Each of the procedures was run for 1 h. Acceptance rates are defined to take values between 0 and 1. For m = 1, no data points were imputed and only Step (1) from §3, the parameter update, was repeated in the estimation procedure. Specifications for the computing power are stated in the main text. c.v. denotes the coefficient of variation.

methodnumber of iterations after 1 h
multivariate effective sample size
acceptance rate of the parameters
acceptance rate of the path
meanc.v.meanc.v.meanc.v.meanc.v.
m = 1Euler23 461 0230.112 422 5210.140.4430.03
Milstein4 685 4500.03480 5490.080.4420.03
m = 2MBE-E8 482 2410.06422 0340.100.3840.030.9640.01
MBE-M1 944 2290.0594 0710.100.3830.030.9570.01
MBM-M186 5880.0694290.130.3830.031.0000.00
DBM-M1 905 3540.0495 2620.100.3830.030.9680.01
m = 5MBE-E6 851 1970.05114 3440.100.2720.030.9760.01
MBE-M966 5790.0415 5990.130.2720.030.9650.01
MBM-M37 6480.125740.250.2720.030.9930.00
DBM-M906 7910.0814 8810.140.2720.030.9750.01

Also for the CIR process, the number of iterations completed after 1 h and the multivariate ESS of the obtained sample vary substantially between the different procedures. Both quantities are highest for combination MBE-E, they are similar for MBE-M and DBM-M, and particularly low for MBM-M.

## Appendix E. Analysis of the correlation between the parameters

In this section, we provide several plots (see figures 1215) showing that the parameters of the two benchmark models are not strongly correlated in order to justify our use of independent parameter proposals in the simulation study.

## Footnotes

### References

• 1.
Dacunha-Castelle D, Florens-Zmirou D. 1986Estimation of the coefficients of a diffusion from discrete observations. Stochastics 19, 263-284. (doi:10.1080/17442508608833428)
• 2.
Fuchs C. 2013Inference for diffusion processes. Berlin, Germany: Springer.
• 3.
Elerian O, Chib S, Shephard N. 2001Likelihood inference for discretely observed nonlinear diffusions. Econometrica 69, 959-993. (doi:10.1111/1468-0262.00226)
• 4.
Eraker B. 2001MCMC analysis of diffusion models with application to finance. J. Bus. Econ. Stat. 19, 177-191. (doi:10.1198/073500101316970403)
• 5.
Roberts GO, Stramer O. 2001On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika 88, 603-621. (doi:10.1093/biomet/88.3.603)
• 6.
Golightly A, Wilkinson DJ. 2008Bayesian inference for nonlinear multivariate diffusion models observed with error. Comput. Stat. Data Anal. 52, 1674-1693. (doi:10.1016/j.csda.2007.05.019)
• 7.
Golightly A, Wilkinson DJ. 2006Bayesian sequential inference for stochastic kinetic biochemical network models. J. Comput. Biol. 13, 838-851. (doi:10.1089/cmb.2006.13.838)
• 8.
Elerian O. 1998A note on the existence of a closed form conditional transition density for the Milstein scheme. Working paper. Nuffield College, University of Oxford, Oxford, UK. Google Scholar
• 9.
Tse YK, Zhang X, Yu J. 2004Estimation of hyperbolic diffusion using the Markov chain Monte Carlo method. Quant. Finance 4, 158-169. (doi:10.1080/14697680400000020)
• 10.
Whitaker GA, Golightly A, Boys RJ, Sherlock C. 2017Improved bridge constructs for stochastic differential equations. Stat. Comput. 27, 885-900. (doi:10.1007/s11222-016-9660-3)
• 11.
Mrázek M, Pospíšil J. 2017Calibration and simulation of Heston model. Open Math. 15, 679-704. (doi:10.1515/math-2017-0058)
• 12.
Øksendal BK. 2003Stochastic differential equations: an introduction with applications, 6th edn. Berlin, Germany: Springer.
• 13.
Iacus S. 2008Simulation and inference for stochastic differential equations. New York, NY: Springer.
• 14.
Kloeden PE, Platen E. 1992Numerical solution of stochastic differential equations. Berlin, Germany: Springer.
• 15.
Bayram M, Partal T, Buyukoz GO. 2018Numerical methods for simulation of stochastic differential equations. Adv. Differ. Equ. 2018, 17. (doi:10.1186/s13662-018-1466-5)
• 16.
Aït-Sahalia Y. 2002Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica 70, 223-262. (doi:10.1111/1468-0262.00274)
• 17.
Aït-Sahalia Y. 2008Closed-form likelihood expansions for multivariate diffusions. Ann. Statist. 36, 906-937. (doi:10.1214/009053607000000622)
• 18.
Filipović D, Mayerhofer E, Schneider P. 2013Density approximations for multivariate affine jump-diffusion processes. J. Econ. 176, 93-111. (doi:10.1016/j.jeconom.2012.12.003)
• 19.
Gilks W, Richardson S, Spiegelhalter D. 1996Markov chain Monte Carlo in practice. London, UK: Chapman & Hall. Google Scholar
• 20.
Durham GB, Gallant AR. 2002Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. J. Bus. Econ. Stat. 20, 297-316. (doi:10.1198/073500102288618397)
• 21.
Chib S, Shephard N. 2002[Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes]: Comment. J. Bus. Econ. Stat. 20, 325-327. Google Scholar
• 22.
van der Meulen F, Schauer M. 2017Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Electron. J. Stat. 11, 2358-2396. (doi:10.1214/17-EJS1290)
• 23.
R Core Team. 2019R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/. Google Scholar
• 24.
Carpenter Bet al. 2017Stan: a probabilistic programming language. J. Stat. Softw., Articles 76, 1-32.
• 25.
Stan Development Team. RStan: the R interface to Stan; 2019. R package version 2.19.1. http://mc-stan.org/. Google Scholar
• 26.
Vats D, Flegal JM, Jones GL. 2019Multivariate output analysis for Markov chain Monte Carlo. Biometrika 106, 321-337. (doi:10.1093/biomet/asz002)
• 27.
Beskos A, Papaspiliopoulos O, Roberts GO. 2008A factorisation of diffusion measure and finite sample path constructions. Methodol. Comput. Appl. Probab. 10, 85-104. (doi:10.1007/s11009-007-9060-4)
• 28.
Golightly A, Wilkinson DJ. 2011Bayesian parameter inference for stochastic biochemical network models using particle Markov chain monte carlo. Interface Focus 1, 807-820. (doi:10.1098/rsfs.2011.0047)
• 29.
Roberts GO, Rosenthal JS. 2001Optimal scaling for various Metropolis–Hastings algorithms. Stat. Sci. 16, 351-367. (doi:10.1214/ss/1015346320)
• 30.
Schmidt KD. 2009Maßund Wahrscheinlichkeit [Measure and Probability]. Berlin, Germany: Springer. Google Scholar
• 31.
Gillespie DT. 1992Markov processes: an introduction for physical scientists. Boston, MA: Academic Press. Google Scholar