# Linking Gaussian process regression with data-driven manifold embeddings for nonlinear data fusion

## Abstract

In statistical modelling with Gaussian process regression, it has been shown that combining (few) high-fidelity data with (many) low-fidelity data can enhance prediction accuracy, compared to prediction based on the few high-fidelity data only. Such information fusion techniques for multi-fidelity data commonly approach the high-fidelity model *f*_{h}(*t*) as a function of *two* variables (*t*, *s*), and then use *f*_{l}(*t*) as the *s* data. More generally, the high-fidelity model can be written as a function of several variables (*t*, *s*_{1}, *s*_{2}, …); the low-fidelity model *f*_{l} and, say, some of its derivatives can then be substituted for these variables. In this paper, we will explore mathematical algorithms for multi-fidelity information fusion that use such an approach towards improving the representation of the high-fidelity function with only a few training data points. Given that *f*_{h} may not be a simple function—and sometimes not even a function—of *f*_{l}, we demonstrate that using additional functions of *t*, such as derivatives or shifts of *f*_{l}, can drastically improve the approximation of *f*_{h} through Gaussian processes. We also point out a connection with ‘embedology’ techniques from topology and dynamical systems. Our illustrative examples range from instructive caricatures to computational biology models, such as Hodgkin–Huxley neural oscillations.

### 1. Introduction

Recent advances in both algorithms and hardware are increasingly making machine learning an important component of mathematical modelling for physico-chemical, engineering, as well as biological systems (e.g. [1–4]). Part of these developments focuses on multi-resolution and multi-fidelity data fusion [5,6]. Fusing information from models constructed at different levels of resolution/fidelity has been shown to enhance prediction accuracy in data-driven scientific computing. Richardson extrapolation, for example, has been widely used to improve the rate of convergence using different resolution discretizations in many practical applications [7]. Also, multi-grid methods in numerical analysis solve numerical PDEs effectively using multi-resolution and linearly dependent discretizations [8–10]. Through advances in machine learning algorithms, if some data from a fine-resolution simulation are missing, it becomes possible to estimate them by exploiting data from a low-resolution simulation [11,12].

In addition, if high-fidelity data are costly to obtain (experimentally or computationally) while low-fidelity data are relatively cheap, a combination of a few high-fidelity data and many low-fidelity data can also lead to overall computational efficiency. For example, we may be able to combine a few experimental data points from a high-resolution measurement with extensive data from a computer simulation or from lower-resolution measurements. Recently, Gaussian process (GP) regression has been widely used to effectively combine multi-fidelity data [13–17]. ‘Classical’ information fusion by GP had focused only on linear dependency between high- and low-fidelity data via auto-regressive schemes such as the coKriging approach of Kennedy & O’Hagan [18]. If two (or more) datasets have *nonlinear* dependency, such linear-type approaches will lose their effectiveness.

When a high-fidelity model *f*_{h}(*t*) depends nonlinearly on a low-fidelity model *f*_{l}(*t*), the nonlinear auto-regressive Gaussian process (NARGP) [17] has been observed to achieve highly accurate prediction results by introducing an additional dimension: *f*_{h}(*t*) is approximated as a curve on a two-dimensional manifold parametrized by *t* and a second variable, *s*. The data points on this manifold are of the form (*f*_{h}(*t*), *t*, *s* = *f*_{l}(*t*)). More generally, assuming $t\in {\mathbb{R}}^{d}$, NARGP finds a smooth manifold over a (*d* + 1)-dimensional input space. In this framework, the low-fidelity model provides the additional ‘latent’ variable of the high-fidelity model.

Deep multi-fidelity GPs were introduced [19] as an improvement, especially in the context of discontinuities in *f*_{h} with respect to *t* (and *f*_{l}). This approach focuses on constructing a useful transformation *T*(*t*) of the input variables *t* through a (deep) neural network. Then, the high-fidelity function *f*_{h}(*t*) is approximated as a GP of (just) *f*_{l}(*T*(*t*)). One must now, of course, perform optimization for the additional network hyperparameters.

In this paper, we discuss a connection of NARGP with data-driven embeddings in topology/dynamical systems, and extend it (in the spirit of such data-driven embeddings) in an attempt to improve the numerical approximation of *f*_{h} in the ‘sparse *f*_{h}, rich *f*_{l}’ data setting. In what follows we will (rather arbitrarily) ascribe the characterization ‘high-fidelity’ or ‘low-fidelity’ to different functions used in our illustrations; this characterization is solely based on the number of available data points for each. In the spirit of the Richardson extrapolation or the multi-grid computations mentioned above, one expects that observations/data at multiple fidelities are obtained by a systematic fine/coarse-graining process of studying the same system.

For our first example, the ‘high-fidelity’ function *f*_{h}, for which we only have a few data points, is a function of *t*; but it also happens that we can describe it as a function of *f*_{l} (figure 1).

In this framework, the high-fidelity datasets (*t*, *f*_{h}(*t*)) can be regarded as ‘ground truth’ obtained from experimental measurements or the high-fidelity model. The low-fidelity datasets (*t*, *f*_{l}(*t*)), on the other hand, are obtained from a model with ‘qualitatively’ correct features—here, the right frequency—yet ‘quantitatively’ inaccurate observations (wrong scaling).

When we choose *t* as the coordinate parametrizing *f*_{h}(*t*) (the green curve in figure 1*a*), GP regression fails to represent the high-frequency sine function with just a few training data points, as shown in figure 1*b*. However, as figure 1*c* shows, if we choose *f*_{l} as the coordinate of *f*_{h}(*f*_{l}) (coloured red in figure 1*a*), the GP regression *can* represent the simple quadratic function quite effectively. If we still need the parametrization of *f*_{h} by *t*, we can obtain it through the ‘data rich’ *f*_{l} : *f*_{h}(*t*) ≡ *f*_{h}(*f*_{l}(*t*)).
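This effect can be reproduced with a minimal, self-contained sketch (our own illustration, not the exact figure 1 functions): we take *f*_{l}(*t*) = sin(8*πt*) and a hypothetical *f*_{h} that is quadratic in *f*_{l}, and compare GP regression parametrized by *t* with GP regression parametrized by *f*_{l}.

```python
import numpy as np

def rbf(a, b, ell):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_mean(x_train, y_train, x_test, ell, jitter=1e-6):
    """Posterior mean of a zero-mean GP with an RBF kernel."""
    K = rbf(x_train, x_train, ell) + jitter * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)
    return rbf(x_test, x_train, ell) @ alpha

rng = np.random.default_rng(3)
f_l = lambda t: np.sin(8 * np.pi * t)   # low-fidelity: right frequency
f_h = lambda t: f_l(t) ** 2             # hypothetical f_h, quadratic in f_l

t_train = rng.uniform(0.0, 1.0, 10)     # only a few high-fidelity points
t_test = np.linspace(0.0, 1.0, 200)

# (a) parametrize f_h by t: the GP cannot resolve the high frequency
pred_t = gp_mean(t_train, f_h(t_train), t_test, ell=0.2)
# (b) parametrize f_h by f_l(t): a simple quadratic, easy to fit
pred_s = gp_mean(f_l(t_train), f_h(t_train), f_l(t_test), ell=1.0)

err_t = np.sqrt(np.mean((pred_t - f_h(t_test)) ** 2))
err_s = np.sqrt(np.mean((pred_s - f_h(t_test)) ** 2))
```

With only 10 training points, the regression in the *f*_{l} coordinate attains a visibly smaller test error than the regression in *t*.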

If *f*_{h} *is not* a function of *f*_{l}, however (as has been observed in the literature [17], and as can be seen in figure 2*a*), more variables are necessary to create a domain over which *f*_{h} is a function.

In the NARGP framework, the variable used in addition to *f*_{l} is *t* itself. In this paper, we will also advocate the use of delays or derivatives of *f*_{l} as additional variables. This approach can also help remove the explicit dependence of *f*_{h} on *t*, since embedding theories in dynamical systems [20–25] guarantee that we can choose any generic observation of *t*, or derivatives and/or delays of this observation, as a replacement for *t*; see §2.2 for more details.

In this paper, all examples follow the same problem set-up. We only have a few high-fidelity data points (ground truth), while plentiful data points are available from the low-fidelity model (characterized by fewer modelling terms, or perturbed model parameters when compared with the high-fidelity one). In addition, the high-fidelity function can be written as a simple function of *t*, *f*_{l}(*t*) and its derivatives. With this set-up, we demonstrate the effectiveness of the proposed framework through pedagogical examples in §3.

In §3.4, we apply the proposed framework to the Hodgkin–Huxley model, describing the behaviour of action potentials in a neuron. Here, the high- and the low-fidelity functions are action potentials at two different values of the external current. This is a case where *f*_{h} is a complicated function of *t* and does not only depend on *f*_{l}; yet as we will see, delays of *f*_{l} will help us construct an accurate approximation of *f*_{h}.

The paper is organized as follows. In §2, we review the NARGP framework and concepts of ‘embedology’. Also, we illustrate why and when this framework is successful. In §3, we demonstrate the effectiveness of our proposed framework via some pedagogical examples and the biologically motivated application to the Hodgkin–Huxley model. In §4, we summarize our results and discuss open issues for further development of general multi-fidelity information fusion in modelling and simulation practice across scientific domains.

### 2. Methods

#### 2.1. Nonlinear information fusion algorithms

‘Classical’ multi-fidelity data fusion algorithms require a linear (or almost linear) dependency between different fidelity datasets. Under this constraint, we can merge two or more datasets by using scaling and shifting parameters such as the Kennedy and O’Hagan coKriging approach [18]. However, more generally, nonlinear dependencies may exist between the datasets, typically degrading the quality of the results of linear information fusion algorithms. In order to resolve nonlinear dependencies between datasets, the use of a space-dependent scaling factor *ρ*(*x*) [26] or, alternatively, deep multi-fidelity GPs [19] have been introduced; clearly, the improvement they bring requires additional hyperparameter optimization.

When the high-fidelity model *f*_{h} depends nonlinearly on the low-fidelity model *f*_{l}, but can be written as a simple function of *t* and *f*_{l}, the NARGP [17] is an appropriate choice. In this framework, a one-dimensional high-fidelity function *f*_{h} is assumed to be a ‘simple’ function *g* of two variables (*t*, *s*), i.e. it is a curve that lies in the two-dimensional manifold described by *g*. Then, GP regression in the two-dimensional space is performed, where the data for *s* are the *f*_{l}(*t*) data,

$${f}_{\mathrm{h}}(t)=g(t,s),\phantom{\rule{1em}{0ex}}s={f}_{\mathrm{l}}(t).$$ (2.1)

Algorithmically, classical auto-regressive GPs employ an *explicit* method, such as a scaling constant *ρ* between two covariance kernels *k*_{1} and *k*_{2},

$$k(t,{t}^{\prime})={\rho}^{2}{k}_{1}(t,{t}^{\prime})+{k}_{2}(t,{t}^{\prime}).$$ (2.2)

The NARGP framework, on the other hand, employs an *implicit* approach via the automatic relevance determination (ARD) weights [27] in the extended space parametrized by *t* and *s*: a different scaling hyperparameter for each of the two dimensions in the kernel. In many applications, a radial basis function has been used as the covariance kernel, where ARD implies a different scaling hyperparameter *θ*_{i} for each dimension:

$$k(\mathbf{x},{\mathbf{x}}^{\prime})={\sigma}^{2}\exp\left(-\sum_{i}{\theta}_{i}{({x}_{i}-{x}_{i}^{\prime})}^{2}\right).$$ (2.3)
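For concreteness, equation (2.3) can be implemented in a few lines of NumPy (a sketch with our own helper names, independent of GPy’s API):

```python
import numpy as np

def ard_rbf(X, Y, theta, sigma2=1.0):
    """ARD radial basis kernel: k(x, x') = sigma2 * exp(-sum_i theta_i (x_i - x'_i)^2).

    X: (n, d), Y: (m, d), theta: (d,) per-dimension scaling hyperparameters.
    A large theta_i makes the GP sensitive to dimension i; theta_i -> 0
    effectively switches that dimension off (automatic relevance determination).
    """
    diff = X[:, None, :] - Y[None, :, :]          # (n, m, d) pairwise differences
    return sigma2 * np.exp(-(diff ** 2 @ np.asarray(theta)))

# a point is maximally correlated with itself: k(x, x) = sigma2
X = np.array([[0.1, -0.3], [0.4, 0.2]])
K = ard_rbf(X, X, theta=[1.0, 5.0])
```

In NARGP-style regression, optimizing the two entries of `theta` lets the GP decide how strongly *t* and *s* = *f*_{l}(*t*) each influence *f*_{h}.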

Figure 2 showcases an example where *f*_{h} cannot be written as a function of *f*_{l} alone. Following NARGP, we set *s* = *f*_{l}(*t*); we then approximate the two-dimensional function *g*(*t*, *s*), which only requires a few training data point pairs for the GP regression. Then, *f*_{h} can be written as *f*_{h}(*t*) = *g*(*t*, *s* = *f*_{l}(*t*)) (figure 2*b*). Figure 2*c* demonstrates that we can, alternatively, use delays of *f*_{l} instead of *t* as an additional variable. A rationalization of this follows in the next section.

#### 2.2. Data-driven higher-dimensional embeddings

The theorem of Whitney [20] states that any sufficiently smooth manifold of dimension $d\in \mathbb{N}$ can be embedded in Euclidean space ${\mathbb{R}}^{n}$, with *n* = 2*d* + 1 dimensions always sufficient. Nash [21] showed that this embedding can even be isometric if the manifold is compact and Riemannian, although the required bound on *n* is then higher. Many results on the reconstruction of invariant sets in the state spaces of dynamical systems are based on these two theorems [22–25]. Here, the *n* embedding dimensions are usually formed by *n* scalar observations of the system state variables. Instead of *n* different observations, recording *n* time delays of a single scalar observable is also possible, as originally formulated by Takens [22] (see also [28]).

Given a smooth, *d*-dimensional manifold *M*, as well as an observable $h\hspace{0.17em}:\hspace{0.17em}M\to \mathbb{R}$, it is possible to construct an embedding $\varphi \hspace{0.17em}:\hspace{0.17em}M\to {\mathbb{R}}^{n}$ through delays of the observable along the flow ${\mathrm{\Phi}}_{\tau}$ on *M*,

$$\varphi (p)=(h(p),h({\mathrm{\Phi}}_{-\tau}(p)),\dots ,h({\mathrm{\Phi}}_{-(n-1)\tau}(p))).$$ (2.4)

In our setting, the point on the manifold is *p* = *t* and the observable is *h*(*t*) = *f*_{l}(*t*). Figure 2*c* shows an example where the embedding from delays of *f*_{l} to *t* is successful, and figure 3*e* shows an example where *f*_{h} is not a function over the manifold that is parametrized by delays of *f*_{l}. Sauer *et al.* [24] specified the conditions on the trajectory of the observable that have to be satisfied so that the state space can be embedded successfully. They also extended the results to embeddings of invariant sets with fractal dimension. In the context of the dynamical systems that generate the trajectories, the observable and the vector field together form a tuple that has to be generic (Takens [22]) or prevalent (Sauer *et al.* [24]). The two papers show that ‘almost all’ tuples are admissible, where the notions of genericity and prevalence make this probabilistic statement precise in the respective function spaces. These results are crucial for the numerical exploitation of the embedding theorems, since they show that ‘almost all’ observables of a dynamical system can be chosen for a delay embedding.

In many applications, the intrinsic dimension *d* of *M* is unknown, and *n* is preemptively chosen large (larger than necessary). Manifold learning techniques are then often capable of reducing the embedding dimension, bringing it closer to the minimal embedding dimension necessary for the given manifold.

We also note that in our framework, if we can obtain low-fidelity data from a (dynamic) process instead of just single measurements, the same embedding and manifold learning techniques can be used even if the ‘independent variable’ *t* is not known [29].

Let us consider equation (2.4) again. NARGP performs GP regression with the two observations {*t*, *f*_{l}(*t*)}. Using embedding theory, we can rationalize (a) why the two observations were necessary and (b) why performing GP regression with an additional delay of the scalar observable, {*f*_{l}(*t*), *f*_{l}(*t* − *τ*)}, is equally appropriate over a relatively small time horizon (figure 2*c*). Note that the delay coordinates *f*_{l}(*t*) and *f*_{l}(*t* − *τ*) trace an ellipse with period 0.25 in *t*. Hence, if the data are collected over times longer than 0.25, using only delays will fail to represent the high-fidelity function due to multivaluedness (see figure 3*e* and §3.1); *t* itself, used as an observable, will resolve this.
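This multivaluedness is easy to check numerically. The sketch below (our own illustration, with *f*_{l}(*t*) = sin(8*πt*) as in §3.1) verifies that the delay pair repeats exactly after one period, so *t* cannot be recovered from delays alone on longer windows:

```python
import numpy as np

f_l = lambda t: np.sin(8 * np.pi * t)   # period 0.25 in t

def delay_pair(t, tau=0.01):
    """Two delay coordinates of the scalar observable f_l."""
    return np.stack([f_l(t), f_l(t - tau)], axis=-1)

t = np.linspace(0.0, 0.25, 100, endpoint=False)
pairs = delay_pair(t)

# the same delay pair re-occurs exactly one period later, so a regression
# over delays alone cannot distinguish t from t + 0.25
pairs_shifted = delay_pair(t + 0.25)
```

Over a single period the delay pairs do trace out the full ellipse; beyond it, they simply repeat, which is why retaining *t* as an input resolves the ambiguity.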

#### 2.3. Extending the formulation through delays

Now we provide a mathematical formulation of the approach. We assume that for two compact sets $A,B\subset \mathbb{R}$, *f*_{h} : *A* → *B* and *f*_{l} : *A* → *B* are *C*^{K}-smooth functions with *K* ≥ 2, and we want to construct an interpolant of *f*_{h} with a numerical scheme (here, a GP). The domain *A* of the functions is chosen to be a subset of $\mathbb{R}$ (i.e. one dimensional)—because the interpretation of this domain in all our examples can be that of time. To emphasize this, we choose the symbols *t* and *s* to denote one-dimensional variables, and *x* = (*x*_{1}, *x*_{2}, …) for variables with more than one dimension. All statements in this section can be made analogously in domains with arbitrary, finite dimension. We chose to present them in one dimension to simplify notation, and because all examples in the following sections use one input variable for the functions *f*_{l} and *f*_{h}. We assume that

(1) only a small number of data points {(*t*_{i}, *f*_{h}(*t*_{i}))} of the high-fidelity function is available, while the low-fidelity function *f*_{l} can be evaluated (or accurately interpolated) at arbitrary points;

(2) the high-fidelity function can be written as

$${f}_{\mathrm{h}}(t)=g(t,{f}_{\mathrm{l}}(t),{f}_{\mathrm{l}}^{(1)}(t),\dots ,{f}_{\mathrm{l}}^{(K)}(t)),$$ (2.7)

where ${f}_{\mathrm{l}}^{(i)}(t)$ denotes the *i*-th derivative, *i* ∈ {1, …, K}; and

(3) the partial derivatives of *g* are uniformly bounded,

$${\parallel \frac{\mathrm{\partial}}{\mathrm{\partial}{x}_{l}}g({x}_{1},\dots ,{x}_{K+2})\parallel}_{{L}^{\mathrm{\infty}}}\le {c}_{g}\phantom{\rule{1em}{0ex}}\mathrm{\forall}l\in \{1,\dots ,K+2\}.$$ (2.8)

Assumption (1) outlines the general setting for multi-fidelity modelling, as explained in the introduction. Assumptions (2) and (3) define the relation between the low- and high-fidelity functions. It is difficult to state precisely for which classes of functions *f*_{l} and *f*_{h} these assumptions are satisfied. A special case is if the norm of the derivative (∂/∂*t*)*f*_{h} is also bounded by *c*_{g}, in which case *g* does not need to depend on *f*_{l} at all. Another case is given if the function *f*_{h} is in the (linear) span of *f*_{l} and its derivatives, in which case *g* is a linear function. More generally, the Taylor series of *f*_{h} reveals why assumptions (2) and (3) are required for a successful numerical approximation of *g* with few data points of *f*_{h}. Assume we want to evaluate *f*_{h} in a small neighbourhood of a point *t*_{0} ∈ *A*. Then we can write

$${f}_{\mathrm{h}}(t)=\sum_{k=0}^{\mathrm{\infty}}\frac{{f}_{\mathrm{h}}^{(k)}({t}_{0})}{k!}{(t-{t}_{0})}^{k}.$$ (2.9)

Using assumption (2) on *f*_{h}, we can express each derivative ${f}_{\mathrm{h}}^{(k)}(t)$, formally by the chain rule, through the partial derivatives of *g* and the derivatives of *f*_{l}; for the first derivative, with *x*_{1} = *t* and ${x}_{i+2}={f}_{\mathrm{l}}^{(i)}(t)$,

$${f}_{\mathrm{h}}^{(1)}(t)=\frac{\mathrm{\partial}g}{\mathrm{\partial}{x}_{1}}+\sum_{i=0}^{K}\frac{\mathrm{\partial}g}{\mathrm{\partial}{x}_{i+2}}{f}_{\mathrm{l}}^{(i+1)}(t),$$ (2.10)

and analogously for the higher derivatives (equations (2.11) and (2.12)). The derivatives of *f*_{h}(*t*) can be large (because the derivatives ${f}_{\mathrm{l}}^{(i)}(t)$ can be large), but if we know all ${f}_{\mathrm{l}}^{(i)}(t)$, we only have to estimate *g* from data of *f*_{h} and ${f}_{\mathrm{l}}^{(i)}$. Crucially, we do not approximate the function *f*_{h} and its derivatives. The derivatives of *g* are bounded by *c*_{g} through assumption (2.8), and *g* is a *C*^{K} function, so only a few data points are necessary for a good fit. If we have access to function values of *f*_{l} over ‘delays’ (at discrete shifts, say in the form of a finite difference stencil) in space, rather than its derivatives, we can use the Newton series approximation of *f*_{h} instead of equation (2.9) for an analogous argument. For functions *f*_{h} that are analytic around the expansion point *t*_{0}, the function *f*_{h} can be evaluated at *t* close to it by the Newton forward-difference series

$${f}_{\mathrm{h}}(t)=\sum_{m=0}^{\mathrm{\infty}}\binom{(t-{t}_{0})/\mathrm{\Delta}t}{m}{\mathrm{\Delta}}_{m}{f}_{\mathrm{h}}({t}_{0}),$$

where ${\mathrm{\Delta}}_{m}{f}_{\mathrm{h}}$ is the *m*-th finite difference approximation of *f*_{h}, with a small step size Δ*t*. By equation (2.7), these differences can be expressed through *g* and delays of *f*_{l} (instead of delays of *f*_{h}), analogously to equations (2.10)–(2.12). Using delays in space instead of derivatives has numerical advantages, especially in cases where *f*_{h} or *f*_{l} are not differentiable (or even have discontinuities). It also enables us to estimate the derivatives of *f*_{l} implicitly, in case only the function *f*_{l} is available (and not its derivatives). Note that the delays (or derivatives) in this section are used to explain the numerical advantages of assumptions (2) and (3) above. This is different from their use in the previous section on embedology, where delays were used to construct a map back to the domain of the functions *f*_{l} and *f*_{h}, in case it is not directly accessible. Figures 1 and 2 show results from the two examples that demonstrate these two approaches.

#### 2.4. Outline of the numerical approach

In order to employ delay coordinates of the low-fidelity function, we must be able to evaluate shifted versions of it. A necessary condition for the proposed framework is therefore that the low-fidelity function is given explicitly, or can be learned well enough from given data that low-fidelity function values can be accurately approximated (interpolated) at arbitrary points. If the state variable *t* is not available, the low-fidelity model should be a generic observation of *t* in order to be useful in employing Takens’ embedding theorem [22,24]. Under these conditions, we now present a summary of the workflow.

If the low-fidelity model is given in the form of (rich) data, we train a GP regression model for it from these data {(*t*_{l,i}, *f*_{l}(*t*_{l,i}))|*i* = 1, …, *n*_{l}} by minimizing the negative log marginal likelihood. This data-driven step can be skipped if the low-fidelity model is explicitly given, as in the above examples. After that, we compute predictive posterior means of the low-fidelity model at the points *t*_{h} where the high-fidelity data are available. We also compute a number of shifts of the low-fidelity function at the points *t*_{h} − *kτ* and at the test points *t**. Next, we train another GP regression model for the high-fidelity dataset in the higher-dimensional space, $\{({\hat{\mathit{t}}}_{\mathit{i}},{y}_{h}({t}_{h,i}))|i=1,\dots ,{n}_{h}\}$ with ${\hat{\mathit{t}}}_{\mathit{i}}={[{t}_{h,i},{f}_{\mathrm{l}}({t}_{h,i}),{f}_{\mathrm{l}}({t}_{h,i}-\tau ),\dots ,{f}_{\mathrm{l}}({t}_{h,i}-n\tau )]}^{\mathrm{T}}$. The number of delays *n* is strongly linked to (in a sense, it determines) the simplicity of the function *g* in equation (2.7). In this paper, the observations *y*_{h}(*t*_{h,i}) are obtained from the high-fidelity model directly, i.e. *y*_{h}(*t*_{h,i}) = *f*_{h}(*t*_{h,i}). Then, we construct and optimize a covariance matrix *K* using a radial basis function, which is the *de facto* default kernel function in GP regression (see equation (2.3)). Generally, the choice of kernel defines the class of functions that the GP can access [27]. Since the correct class may not be known in advance for a specific application, there is no fully systematic way to choose the kernel. Finally, we compute the predictive posterior mean (${\overline{\mathbf{y}}}^{\ast}$) and variance (cov(**y***)) at all the test points (${\hat{\mathit{T}}}^{\mathbf{\ast}}$) in the higher-dimensional space by conditioning the joint Gaussian prior distribution on all the training points ($\hat{\mathit{T}}$):

$${\overline{\mathbf{y}}}^{\ast}=K({\hat{\mathit{T}}}^{\ast},\hat{\mathit{T}})\,K{(\hat{\mathit{T}},\hat{\mathit{T}})}^{-1}{\mathbf{y}}_{h}$$

and

$$\mathrm{cov}({\mathbf{y}}^{\ast})=K({\hat{\mathit{T}}}^{\ast},{\hat{\mathit{T}}}^{\ast})-K({\hat{\mathit{T}}}^{\ast},\hat{\mathit{T}})\,K{(\hat{\mathit{T}},\hat{\mathit{T}})}^{-1}K(\hat{\mathit{T}},{\hat{\mathit{T}}}^{\ast}).$$

Each new delay burdens the optimization with a single additional hyperparameter. For more details, refer to [17,27]. In this paper, all GP computations are performed with the open source Python package GPy [30].
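The workflow above can be sketched end-to-end in plain NumPy (our own minimal stand-in for the GPy-based implementation: a fixed isotropic kernel instead of optimized ARD hyperparameters, a hypothetical phase-shifted pair of test functions in the spirit of §3.1, and three delays):

```python
import numpy as np

def rbf(A, B, ell):
    """Isotropic RBF kernel between row-stacked inputs A (n,d) and B (m,d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xs, ell, noise=1e-4):
    """Posterior mean and covariance by conditioning the joint Gaussian prior."""
    K = rbf(X, X, ell) + noise * np.eye(len(X))
    Ks = rbf(Xs, X, ell)
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    cov = rbf(Xs, Xs, ell) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

# hypothetical stand-ins: a phase-shifted pair of oscillations
f_l = lambda t: np.sin(8 * np.pi * t)
f_h = lambda t: np.sin(8 * np.pi * t + np.pi / 10)

def extend(t, tau=0.02):
    """Extended input [t, f_l(t), f_l(t - tau), f_l(t - 2 tau)]."""
    return np.column_stack([t, f_l(t), f_l(t - tau), f_l(t - 2 * tau)])

rng = np.random.default_rng(1)
t_h = rng.uniform(0.0, 1.0, 15)      # few high-fidelity training points
t_s = np.linspace(0.0, 1.0, 200)     # test points

# plain GP on t only vs GP in the extended, delay-augmented space
mean_plain, _ = gp_posterior(t_h[:, None], f_h(t_h), t_s[:, None], ell=1.0)
mean_ext, cov_ext = gp_posterior(extend(t_h), f_h(t_h), extend(t_s), ell=1.0)

err_plain = np.sqrt(np.mean((mean_plain - f_h(t_s)) ** 2))
err_ext = np.sqrt(np.mean((mean_ext - f_h(t_s)) ** 2))
```

With 15 training points, the delay-augmented GP attains a clearly smaller test error than the plain GP on *t* alone, mirroring the qualitative behaviour reported in §3.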

### 3. Results

We introduce three pedagogical examples to demonstrate our approach. First, we explore the case where *f*_{h} is a phase shifted version of *f*_{l} (§3.1). Then, we show that oscillations with different periods (leading to different recurrences) present a more challenging scenario, which however can still be resolved by using shifts of *f*_{l}. The third example involves discontinuities in *f*_{h} and *f*_{l}. After these three examples, in §3.4, we demonstrate the approach in the context of the Hodgkin–Huxley model. We investigate the effectiveness of the proposed framework by comparing it to three established frameworks: (1) single GP regression with high-fidelity data only (GP or Kriging), (2) auto-regressive method with a constant scaling parameter *ρ* (AR1 or coKriging), and (3) NARGP in the same computational environment.

#### 3.1. Phase shifted oscillations

Using models at different levels of resolution (e.g. for biological oscillators) will often give oscillations that have very comparable periods but are phase-shifted. Let us start by considering two such functions with different phases on *t* ∈ [0, 1].

Our ‘high-fidelity function’ can be rewritten by a trigonometric addition formula: this shows that it can be expressed in terms of *t*^{2}, the low-fidelity function *f*_{l}(*t*) = sin(8*πt*) and (up to a constant factor) its first derivative ${f}_{\mathrm{l}}^{(1)}(t)\propto \mathrm{cos}(8\pi t)$.

Delays of *f*_{l} with a small step size *τ* contain enough information to numerically estimate its derivatives; hence we can also write *f*_{h} as

$${f}_{\mathrm{h}}(t)=g(t,{f}_{\mathrm{l}}(t),{f}_{\mathrm{l}}(t-\tau ),{f}_{\mathrm{l}}(t-2\tau )),$$ (3.3)

and *g* is trained on these four-dimensional data. In addition, we perform GP regression in a three-dimensional space constructed from only the three delays, without the explicit *t* coordinate.

As shown in figure 3*b*, the single GP regression model provides inaccurate predictive posterior means due to lack of high-fidelity data. While the linear auto-regressive model (AR1) also fails to predict the high-fidelity values, the NARGP (with 10 high-fidelity data points and 100 low-fidelity data points) catches the trend of the high-fidelity data, yet still yields inaccurate results: NARGP is informed only by *t* and *f*_{l}(*t*), but *not* by ${f}_{\mathrm{l}}^{(1)}(t)$. Similarly, the GP regression with only delays (no information about *t*) in figure 3*e* fails to represent the high-fidelity function for these long observation windows. Beyond 0.25, *t* cannot be recovered from the shifts of *f*_{l} because *f*_{l} is only a generic observer of *t* ∈ [0, 0.25].

As shown in figure 3*f* , the GP using *t* and three delays of *f*_{l} provides an excellent prediction with only 10 high-fidelity data points (and 100 low-fidelity data points). This means that, in the four-dimensional space, *g* (see equation (3.3)) has small derivatives, which then helps to employ GP regression successfully.

Next, we investigate the sensitivity and scalability of the proposed framework with respect to the number of high-fidelity data points (training data points). We train all GP regression models with 10, 15, 20 and 25 randomly chosen high-fidelity data points and 100 uniformly distributed low-fidelity data points. The error is obtained by averaging over 10 trials of random data selection. The log *L*^{2} error with respect to the number of high-fidelity data points is presented in figure 4*a*.

The two established approaches (AR1 and NARGP) and the GP with only delays show no significant accuracy enhancement as the number of training points increases. The reason for the consistently large errors is the lack of the additional information provided by the derivatives. The GP *in the higher-dimensional space* that includes *t*, on the other hand, shows a strong correlation between accuracy and the number of training points: more high-fidelity points visibly improve the approximation.

#### 3.2. Different periodicity

In this example, the high- and the low-fidelity model oscillations are not just phase shifted, but they also are characterized by different periods. In applications, this could arise if we tried to match observations of oscillations of *the same model* at two *different parameter values*. Different (possibly irrationally related) oscillation periods dramatically complicate the dependency across the two datasets.

We consider data that are both phase shifted *and* have different periods.

The high-fidelity function can be rewritten by a trigonometric addition formula; the resulting sine and cosine terms can then be rewritten again by trigonometric subtraction formulae. This shows that the high-fidelity function can be written in terms of sin(*bt*), cos(*bt*), *f*_{l}(*t*) and ${f}_{\mathrm{l}}^{(1)}(t)$, where *b* reflects the difference between the two oscillation frequencies. Since sin(*bt*) and cos(*bt*) have low frequency compared to the original frequency 8, the bound *c*_{g} for the derivatives of *g* (see §2.3) is smaller. It is then reasonable that we can approximate the high-fidelity function in the higher-dimensional space with only a few training data points.

We perform the GP regression in two different extended spaces: (1) three additional delays, for a four-dimensional input space in total (GP+E), and (2) five additional delays, for a six-dimensional input space in total (GP+E(2)), and compare them to a single GP, AR1 and NARGP. Examples of regression results with 15 high-fidelity data points and 200 uniformly distributed low-fidelity data points are shown in figure 5. The GP in the four-dimensional space provides better regression results than the other established methods, and the GP in the six-dimensional space presents the best results.

Moreover, as shown in figure 5*b*, the phase discrepancy between the high- and low-fidelity functions increases as time increases, resulting in larger error for larger values of *t* (figure 5*b*–*d*). However, the GPs in the higher-dimensional spaces provide accurate prediction results over this time observation window.

The sensitivity to the number of high-fidelity data is shown in figure 4*b*. The GPs in the four- and six-dimensional space show significant computational accuracy gain compared to all other methods. These results demonstrate the capability of the proposed framework for period-shifted and phase-shifted data fusion.

#### 3.3. Models with discontinuities

In general, a smooth stationary covariance kernel cannot capture discontinuous model data. In order to resolve this problem, a non-stationary kernel has been introduced [31,32], with space-dependent hyperparameters. Moreover, nested GPs were also used successfully to alleviate this problem [33]. Both approaches introduce, of course, additional hyperparameters to optimize.

In this example, we introduce a low-fidelity function *f*_{l} that is defined piecewise, with a jump discontinuity between *t* ∈ [0, 0.5) and *t* ∈ [0.5, 1], together with a corresponding high-fidelity function *f*_{h}. The high-fidelity function *f*_{h} is also discontinuous, but it can be expressed as a linear function of *t* and *f*_{l}(*t*), i.e. through a linear function *g* in two variables.

Examples of regression results with 10 high-fidelity data points and 200 uniformly distributed low-fidelity data points are shown in figure 6. Since *g* is a *linear* function of *t* and *f*_{l}(*t*), the NARGP, as well as our GPs in the higher-dimensional spaces, provide highly accurate prediction results with just a few high-fidelity data.

In analysing the sensitivity to the number of high-fidelity data points (figure 4*c*), there is no significant accuracy gain after 10 such high-fidelity data points. That is because 10 training data points are enough to represent a linear function accurately. It is worth noting that, here, the NARGP provides better prediction results with 20 training data points compared to the GPs in the higher-dimensional space, possibly due to overfitting.

#### 3.4. The Hodgkin–Huxley model

Based on the results of our pedagogical examples, we apply the proposed framework to a famous model of a cellular process, a version of the Hodgkin–Huxley equations [34]. In 1952, Hodgkin and Huxley introduced a mathematical model that describes the initiation and propagation of action potentials in a neuron. Specifically, they devised electrical equivalent circuits to mimic the ion channels through which ions cross the cell membrane. The model for the intracellular action potential (*V*_{m}) can be written as a simple ODE,

$${C}_{\mathrm{m}}\frac{\mathrm{d}{V}_{\mathrm{m}}}{\mathrm{d}t}=-{I}_{\mathrm{ion}}+{I}_{\mathrm{ext}},$$

where *C*_{m} is the membrane capacitance, and *I*_{ion} and *I*_{ext} represent the total ionic current and the external current, respectively.

The total ionic current *I*_{ion} = *I*_{Na} + *I*_{K} + *I*_{L} is the sum of three individual currents: a sodium current (*I*_{Na}), a potassium current (*I*_{K}) and a leakage current (*I*_{L}). In order to calculate the three individual currents in time, the Hodgkin–Huxley model introduces gates which regulate the flow of ions through the channels; the three ionic currents are affected by the three different gates *n*, *m* and *h*. Based on these gates, the individual ionic currents can be calculated as

$${I}_{\mathrm{Na}}={\overline{g}}_{\mathrm{Na}}{m}^{3}h({V}_{\mathrm{m}}-{E}_{\mathrm{Na}}),\phantom{\rule{1em}{0ex}}{I}_{\mathrm{K}}={\overline{g}}_{\mathrm{K}}{n}^{4}({V}_{\mathrm{m}}-{E}_{\mathrm{K}})\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{I}_{\mathrm{L}}={\overline{g}}_{\mathrm{L}}({V}_{\mathrm{m}}-{E}_{\mathrm{L}}),$$

where ${\overline{g}}_{\ast}$ and *E*_{*} denote the maximal conductance and the reversal potential of the sodium (* ≡ Na), potassium (* ≡ K), and leakage (* ≡ L) current. The three gates *n*, *m* and *h* can then be modelled by ODEs of the form

$$\frac{\mathrm{d}x}{\mathrm{d}t}={\alpha}_{x}({V}_{\mathrm{m}})(1-x)-{\beta}_{x}({V}_{\mathrm{m}})x,\phantom{\rule{1em}{0ex}}x\in \{n,m,h\},$$

with voltage-dependent opening and closing rates *α*_{x} and *β*_{x}.

In this paper, we set the model parameter values to ${\overline{g}}_{\mathrm{Na}}=1.2$, ${\overline{g}}_{\mathrm{K}}=0.36$, ${\overline{g}}_{\mathrm{L}}=0.003$, *E*_{Na} = 55.17, *E*_{K} = −72.14 and *E*_{L} = −49.42 [35]. We assume that we have too few high-fidelity data points to directly estimate the function *V*_{m}(*t*). In addition, we assume that we have many data points from a low-fidelity model which has a slightly perturbed model parameter (*I*_{ext}) compared to the ‘true’ high-fidelity model. In this paper, we set *I*_{ext} = 1.0 for the high-fidelity model and *I*_{ext} = 1.05 for the low-fidelity model, resulting in different oscillation periods (and a phase shift when we start at the same initial conditions). The action potentials *V*_{m} of the two different fidelity models are shown in figure 7*a*,*b*.
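Such multi-fidelity traces can be generated along the following lines (our own illustration: classical RK4 time stepping and the conductance/reversal-potential values quoted above, but with classic squid-axon gate kinetics and *C*_{m} = 0.01 as stand-ins, since the exact rate functions of [35] are not reproduced here):

```python
import numpy as np

# parameter values quoted in the text; C_m and the alpha/beta gate kinetics
# below are classic squid-axon stand-ins (assumptions of this sketch, not [35])
g_Na, g_K, g_L = 1.2, 0.36, 0.003
E_Na, E_K, E_L = 55.17, -72.14, -49.42
C_m = 0.01

def hh_rhs(state, I_ext):
    """Right-hand side of the Hodgkin-Huxley ODEs for state (V, n, m, h)."""
    V, n, m, h = state
    a_n = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    b_n = 0.125 * np.exp(-(V + 65.0) / 80.0)
    a_m = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    b_m = 4.0 * np.exp(-(V + 65.0) / 18.0)
    a_h = 0.07 * np.exp(-(V + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    I_ion = (g_Na * m**3 * h * (V - E_Na) + g_K * n**4 * (V - E_K)
             + g_L * (V - E_L))
    dV = (-I_ion + I_ext) / C_m
    return np.array([dV, a_n * (1 - n) - b_n * n,
                     a_m * (1 - m) - b_m * m, a_h * (1 - h) - b_h * h])

def simulate(I_ext, T=50.0, dt=0.01):
    """Integrate with classical RK4; returns the V_m trace."""
    state = np.array([-65.0, 0.317, 0.052, 0.596])  # near-rest initial gates
    out = np.empty(int(T / dt))
    for i in range(out.size):
        out[i] = state[0]
        k1 = hh_rhs(state, I_ext)
        k2 = hh_rhs(state + 0.5 * dt * k1, I_ext)
        k3 = hh_rhs(state + 0.5 * dt * k2, I_ext)
        k4 = hh_rhs(state + dt * k3, I_ext)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return out

v_high = simulate(I_ext=1.0)    # 'high-fidelity' trace
v_low = simulate(I_ext=1.05)    # slightly perturbed 'low-fidelity' trace
```

Integrating with *I*_{ext} = 1.0 and *I*_{ext} = 1.05 yields two oscillatory voltage traces that drift apart in phase, which is the data-fusion setting studied here.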

Examples of regression results for *V*_{m} by five different methods with 20 high-fidelity data points and 300 uniformly distributed low-fidelity data points are shown in figure 7*b*–*f*. Since the two datasets are phase-shifted, the single GP, AR1 and NARGP fail to accurately approximate the high-fidelity model. However, GPs in the higher-dimensional spaces provide reasonable prediction results. The GP in the six-dimensional space (GP+E(2)) shows significant improvement in the form of large uncertainty reduction as well as high prediction accuracy.

The sensitivity to the number of high-fidelity data points is shown in figure 4*d*. The proposed framework achieves higher accuracy than all the other methods, and improves markedly as new high-fidelity points are added.

### 4. Conclusion

In this paper, we explored mathematical algorithms for multi-fidelity information fusion and their links with ‘embedology’, motivated by the NARGP approach. These modifications/extensions of kriging show promise in improving the representation of data-poor ‘high-fidelity’ datasets by exploiting data-rich ‘low-fidelity’ datasets. Given that *f*_{h} may not be a simple function—and sometimes not even a function—of *f*_{l}, we demonstrated that using additional functions of *t*, such as derivatives or shifts of *f*_{l}, can drastically improve the approximation of *f*_{h} through GPs.

The limitations of the proposed framework arise in the form of the curse of dimensionality and of overfitting. As the number of hyperparameters in the GP framework grows with an increasingly high-dimensional input space, the optimization cost grows (and there is always the possibility of converging to local, unsatisfactory minima). Adaptively testing for the ‘best’ number of delays is possible, and will be pursued in future work. The natural option of using multiple low-fidelity models (instead of delays of just one of them) is also being explored. Techniques that systematically find all the local hyperparameter minima (in the spirit of the reduced gradient method [36]) may also be useful in this effort. Another promising research direction involves the construction of data-informed kernels (e.g. through ‘neural-net-induced Gaussian processes’ [37]) for more realistic and unbiased predictions. Alternatively, it is interesting to consider transformations of the input space using manifold learning techniques and the so-called Mahalanobis distance [38,39], which has been demonstrated to successfully match different (yet conjugate) models [40,41].
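Since the marginal-likelihood surface over the hyperparameters can have local, unsatisfactory minima, one standard mitigation is to optimize from several random restarts and keep the best optimum found. A hedged sketch on hypothetical 1D toy data, with a log-parametrized lengthscale and noise variance (not the paper's actual optimization routine):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (25, 1))            # hypothetical toy inputs
y = np.sin(6.0 * X[:, 0]) + 0.1 * rng.standard_normal(25)

def neg_log_marginal_likelihood(theta):
    """Negative log evidence of an RBF GP; theta = (log ell, log noise_var)."""
    ell, noise = np.exp(theta)
    d2 = (X[:, None, 0] - X[None, :, 0]) ** 2
    K = np.exp(-0.5 * d2 / ell**2) + (noise + 1e-8) * np.eye(len(X))
    try:
        L, lower = cho_factor(K, lower=True)
    except (np.linalg.LinAlgError, ValueError):
        return np.inf                          # numerically invalid proposal
    alpha = cho_solve((L, lower), y)
    return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(X) * np.log(2.0 * np.pi))

# Multi-restart optimization: start from several random hyperparameter
# guesses and keep whichever local optimum has the lowest objective.
best = min(
    (minimize(neg_log_marginal_likelihood, rng.normal(0.0, 1.0, 2),
              method="Nelder-Mead")
     for _ in range(5)),
    key=lambda r: r.fun,
)
ell_opt, noise_opt = np.exp(best.x)
```

In the higher-dimensional (delay-embedded) input spaces of this paper, the same strategy applies with one lengthscale per input dimension, at correspondingly higher optimization cost.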

What we believe is a most promising direction for the use of these techniques is the reconciliation of different granularity multi-scale models—having, say, an atomistic ‘high-fidelity’ simulation enhanced by a continuum ‘low-fidelity’ approximate closure. Thus, ‘heterogeneous data fusion’ becomes a version of multi-fidelity data fusion [12]. In this paper, the fusion tools simply ‘filled in the gaps’ in a single manifestation of the high-fidelity data. In a time-dependent context, ‘full space, full time’ low-fidelity simulations can help complete and thus accelerate ‘small space, small time’ high-fidelity simulations—in a form reminiscent of the patch-dynamics approach in equation-free computation [42] (see also [11,12]). Using a qualitatively correct (even though quantitatively inaccurate) low-fidelity model—as opposed to just the local Taylor series that play the role of low-fidelity modelling in patch dynamics—may very much improve the computational savings of such multi-scale computation schemes.

### Data accessibility

This article has no additional data.

### Competing interests

We declare we have no competing interests.

### Funding

S.L., F.D. and I.G.K. gratefully acknowledge the partial support of NSF, NIH and DARPA. G.E.K. gratefully acknowledges the financial support of DARPA (grant no. N66001-534 15-2-4055) as well as the MIT ESRDC grant.

## Footnotes

†Department of Applied Mathematics and Statistics and Department of Medicine, Johns Hopkins University, Baltimore, MD, USA.

One contribution of 15 to a theme issue ‘Multi-resolution simulations of intracellular processes’.

### References

- 1. Tarca AL, Carey VJ, Chen Xw, Romero R, Drăghici S. 2007 Machine learning and its applications to biology. *PLoS Comput. Biol.* **3**, e116. (doi:10.1371/journal.pcbi.0030116)
- 2. Holzinger A, Jurisica I. 2014 *Interactive knowledge discovery and data mining in biomedical informatics: state-of-the-art and future challenges*. Lecture Notes in Computer Science, vol. 8401. Berlin, Germany: Springer.
- 3. Libbrecht MW, Noble WS. 2015 Machine learning applications in genetics and genomics. *Nat. Rev. Genet.* **16**, 321–332. (doi:10.1038/nrg3920)
- 4. Villoutreix P, Andén J, Lim B, Lu H, Kevrekidis IG, Singer A, Shvartsman SY. 2017 Synthesizing developmental trajectories. *PLoS Comput. Biol.* **13**, e1005742. (doi:10.1371/journal.pcbi.1005742)
- 5. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. 2004 A statistical framework for genomic data fusion. *Bioinformatics* **20**, 2626–2635. (doi:10.1093/bioinformatics/bth294)
- 6. Willett P. 2013 Combination of similarity rankings using data fusion. *J. Chem. Inf. Model.* **53**, 1–10. (doi:10.1021/ci300547g)
- 7. Dimov I, Zlatev Z, Faragó I, Havasi Á. 2017 *Richardson extrapolation: practical aspects and applications*, vol. 2. Berlin, Germany: Walter de Gruyter.
- 8. Brandt A. 2001 Multiscale scientific computation: review 2001. In *Multiscale and multiresolution methods* (eds TJ Barth, T Chan, R Haimes). Lecture Notes in Computational Science and Engineering, vol. 20, pp. 3–95. Berlin, Germany: Springer. (doi:10.1007/978-3-642-56205-1_1)
- 9. Brandt A. 2005 Multiscale solvers and systematic upscaling in computational physics. *Comput. Phys. Commun.* **169**, 438–441. (doi:10.1016/j.cpc.2005.03.097)
- 10. Brandt A, Brannick J, Kahl K, Livshits I. 2011 Bootstrap AMG. *SIAM J. Sci. Comput.* **33**, 612–632. (doi:10.1137/090752973)
- 11. Lee S, Kevrekidis IG, Karniadakis GE. 2017 A general CFD framework for fault-resilient simulations based on multi-resolution information fusion. *J. Comput. Phys.* **347**, 290–304. (doi:10.1016/j.jcp.2017.06.044)
- 12. Lee S, Kevrekidis IG, Karniadakis GE. 2017 A resilient and efficient CFD framework: statistical learning tools for multi-fidelity and heterogeneous information fusion. *J. Comput. Phys.* **344**, 516–533. (doi:10.1016/j.jcp.2017.05.021)
- 13. Le Gratiet L, Garnier J. 2014 Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. *Int. J. Uncertain. Quantif.* **4**, 365–386. (doi:10.1615/Int.J.UncertaintyQuantification.v4.i5)
- 14. Perdikaris P, Venturi D, Royset J, Karniadakis G. 2015 Multi-fidelity modelling via recursive co-kriging and Gaussian–Markov random fields. *Proc. R. Soc. A* **471**, 20150018. (doi:10.1098/rspa.2015.0018)
- 15. Perdikaris P, Venturi D, Karniadakis GE. 2016 Multifidelity information fusion algorithms for high-dimensional systems and massive data sets. *SIAM J. Sci. Comput.* **38**, B521–B538. (doi:10.1137/15M1055164)
- 16. Parussini L, Venturi D, Perdikaris P, Karniadakis G. 2017 Multi-fidelity Gaussian process regression for prediction of random fields. *J. Comput. Phys.* **336**, 36–50. (doi:10.1016/j.jcp.2017.01.047)
- 17. Perdikaris P, Raissi M, Damianou A, Lawrence N, Karniadakis G. 2017 Nonlinear information fusion algorithms for data-efficient multi-fidelity modeling. *Proc. R. Soc. A* **473**, 20160751. (doi:10.1098/rspa.2016.0751)
- 18. Kennedy MC, O’Hagan A. 2000 Predicting the output from a complex computer code when fast approximations are available. *Biometrika* **87**, 1–13. (doi:10.1093/biomet/87.1.1)
- 19. Raissi M, Karniadakis G. 2016 Deep multi-fidelity Gaussian processes. (http://arxiv.org/abs/1604.07484)
- 20. Whitney H. 1936 Differentiable manifolds. *Ann. Math.* **37**, 645–680. (doi:10.2307/1968482)
- 21. Nash J. 1966 Analyticity of the solutions of implicit function problems with analytic data. *Ann. Math.* **84**, 345–355. (doi:10.2307/1970448)
- 22. Takens F. 1981 Detecting strange attractors in turbulence. In *Dynamical systems and turbulence, Warwick 1980* (eds D Rand, LS Young), pp. 366–381. Berlin, Germany: Springer. (doi:10.1007/BFb0091924)
- 23. Kostelich EJ, Yorke JA. 1990 Noise reduction: finding the simplest dynamical system consistent with the data. *Physica D* **41**, 183–196. (doi:10.1016/0167-2789(90)90121-5)
- 24. Sauer T, Yorke JA, Casdagli M. 1991 Embedology. *J. Stat. Phys.* **65**, 579–616. (doi:10.1007/BF01053745)
- 25. Kennel MB, Brown R, Abarbanel HD. 1992 Determining embedding dimension for phase-space reconstruction using a geometrical construction. *Phys. Rev. A* **45**, 3403–3411. (doi:10.1103/PhysRevA.45.3403)
- 26. Le Gratiet L. 2013 Multi-fidelity Gaussian process regression for computer experiments. PhD thesis, Université Paris-Diderot–Paris VII.
- 27. Rasmussen CE, Williams CKI. 2006 *Gaussian processes for machine learning*. Cambridge, MA: MIT Press.
- 28. Packard NH, Crutchfield JP, Farmer JD, Shaw RS. 1980 Geometry from a time series. *Phys. Rev. Lett.* **45**, 712–716. (doi:10.1103/PhysRevLett.45.712)
- 29. Dietrich F, Kooshkbaghi M, Bollt EM, Kevrekidis IG. 2018 Manifold learning for bifurcation diagram observations. (http://arxiv.org/abs/1810.12952)
- 30. GPy: a Gaussian process framework in Python, since 2012. http://github.com/SheffieldML/GPy.
- 31. Schmidt AM, O’Hagan A. 2003 Bayesian inference for non-stationary spatial covariance structure via spatial deformations. *J. R. Stat. Soc. B* **65**, 743–758. (doi:10.1111/rssb.2003.65.issue-3)
- 32. Paciorek CJ, Schervish MJ. 2004 Nonstationary covariance functions for Gaussian process regression. In *Proc. 16th Int. Conf. on Neural Information Processing Systems, Whistler, Canada, 9–11 December 2003*, pp. 273–280. Cambridge, MA: MIT Press.
- 33. Hensman J, Lawrence ND. 2014 Nested variational compression in deep Gaussian processes. (http://arxiv.org/abs/1412.1370)
- 34. Hodgkin AL, Huxley AF. 1952 A quantitative description of membrane current and its application to conduction and excitation in nerve. *J. Physiol.* **117**, 500–544. (doi:10.1113/jphysiol.1952.sp004764)
- 35. Siciliano R. 2012 *The Hodgkin–Huxley model: its extensions, analysis and numerics*. Montreal, Canada: McGill University.
- 36. Quapp W. 2003 Reduced gradient methods and their relation to reaction paths. *J. Theor. Comput. Chem.* **2**, 385–417. (doi:10.1142/S0219633603000604)
- 37. Pang G, Yang L, Karniadakis GE. 2018 Neural-net-induced Gaussian process regression for function approximation and PDE solution. (http://arxiv.org/abs/1806.11187)
- 38. Singer A, Coifman RR. 2008 Non-linear independent component analysis with diffusion maps. *Appl. Comput. Harmon. Anal.* **25**, 226–239. (doi:10.1016/j.acha.2007.11.001)
- 39. Singer A, Erban R, Kevrekidis IG, Coifman RR. 2009 Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diffusion maps. *Proc. Natl Acad. Sci. USA* **106**, 16090–16095. (doi:10.1073/pnas.0905547106)
- 40. Dsilva CJ, Talmon R, Gear CW, Coifman RR, Kevrekidis IG. 2016 Data-driven reduction for a class of multiscale fast-slow stochastic dynamical systems. *SIAM J. Appl. Dyn. Syst.* **15**, 1327–1351. (doi:10.1137/151004896)
- 41. Kemeth FP *et al.* 2018 An emergent space for distributed data with hidden internal order through manifold learning. *IEEE Access* **6**, 77402–77413. (doi:10.1109/ACCESS.2018.2882777)
- 42. Kevrekidis IG, Gear CW, Hyman JM, Kevrekidis PG, Runborg O, Theodoropoulos C. 2003 Equation-free, coarse-grained multiscale computation: enabling microscopic simulators to perform system-level analysis. *Commun. Math. Sci.* **1**, 715–762. (doi:10.4310/CMS.2003.v1.n4.a5)