Considering discrepancy when calibrating a mechanistic electrophysiology model

Uncertainty quantification (UQ) is a vital step in using mathematical models and simulations to make decisions. The field of cardiac simulation has begun to explore and adopt UQ methods to characterize uncertainty in model inputs and how it propagates through to outputs or predictions; examples of this can be seen in the papers of this issue. In this review and perspective piece, we draw attention to an important and under-addressed source of uncertainty in our predictions—that of uncertainty in the model structure or the equations themselves. The difference between imperfect models and reality is termed model discrepancy, and we are often uncertain as to the size and consequences of this discrepancy. Here, we provide two examples of the consequences of discrepancy when calibrating models at the ion channel and action potential scales. Furthermore, we attempt to account for this discrepancy when calibrating and validating an ion channel model using different methods, based on modelling the discrepancy using Gaussian processes and autoregressive-moving-average models, then highlight the advantages and shortcomings of each approach. Finally, suggestions and lines of enquiry for future work are provided. This article is part of the theme issue ‘Uncertainty quantification in cardiac and cardiovascular modelling and simulation’.


S1. Differences between Model T and Model F
Here we show the equations of the currents in Model T [1] and Model F [2] that differ in their kinetics. Below, g_X is the conductance (a constant), E_X is the reversal potential, X_o is the extracellular and X_i the intracellular concentration of ion X, V is the membrane voltage, F is Faraday's constant, and R is the ideal gas constant.

S2. Modelling discrepancy using a Gaussian process
In order to add a discrepancy term to our basic measurement model (see main text), we model the i-th observation as

y_i = f_i(θ, u_C) + δ(φ, v_C) + ε_i,   ε_i ~ N(0, σ²),   (S2.1)

where δ(φ, v_C) is the model discrepancy term, a function with arguments v_C and parameters φ. Note that the inputs v_C can be independent of the inputs u_C passed to the mechanistic model. We choose v_C to be (1) time t, and (2) the open probability O (i.e. O in Eq. (3.3)) and the voltage V.
Following [3] we place a zero-mean Gaussian process prior on the discrepancy function,

δ(φ, ·) ~ GP(0, κ_φ(·, ·)),   (S2.2)

where κ_φ(·, ·) is the covariance function (also known as the covariance kernel) parameterised by φ. One common choice for the covariance function is the squared exponential,

κ_φ(v, v') = α² exp( −(1/2) Σ_{j=1}^q (v_j − v'_j)² / ρ_j² ),   (S2.3)

where q is the number of covariates, such as time or open probability as mentioned above, representing v_C. The parameter ρ_j quantifies the characteristic length-scale along the j-th covariate and α denotes the marginal variance of the GP prior. Together they constitute the parameter vector φ = [α, ρ_1, …, ρ_q]. Since our measurement noise is Gaussian with variance σ² we can analytically marginalise the discrepancy function to obtain the marginal likelihood of the N observations Y_C = [y_1, …, y_N]^T, conditioned on the parameters θ, φ of the mechanistic and discrepancy models respectively, as well as the calibration inputs u_C and v_C:

p(Y_C | θ, φ, u_C, v_C) = N( f(θ, u_C), Σ_NN + σ² I ),   (S2.4)

where Σ_NN is the N × N matrix of covariance function evaluations at all pairs of inputs v_C, and I is the N × N identity matrix.
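As an illustration, the squared exponential covariance and the Gaussian marginal likelihood in Eq. S2.4 can be sketched in a few lines of Python. This is a minimal sketch using NumPy; the function names and the Cholesky-based evaluation are our own choices for illustration, not part of the original implementation.

```python
import numpy as np

def sq_exp_kernel(v1, v2, alpha, rho):
    """Squared exponential covariance between covariate sets
    v1 (n x q) and v2 (m x q), with length-scales rho (length q)."""
    d = (v1[:, None, :] - v2[None, :, :]) / rho       # pairwise scaled differences
    return alpha**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

def gp_log_marginal_likelihood(y, f, v, alpha, rho, sigma):
    """log N(y | f, Sigma_NN + sigma^2 I) as in Eq. S2.4,
    evaluated stably via a Cholesky factorisation."""
    n = len(y)
    K = sq_exp_kernel(v, v, alpha, rho) + sigma**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    z = np.linalg.solve(L, y - f)                     # solves L z = (y - f)
    return -0.5 * z @ z - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)
```

The Cholesky route avoids forming the explicit inverse of the covariance matrix and gives the log-determinant for free from the factor's diagonal.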

Inference of the model and GP parameters
We proceed by first placing suitable prior distributions, p(θ) and p(φ), on the model and GP parameters, and then obtain the posterior distribution using Bayes' theorem:

p(θ, φ | Y_C, u_C, v_C) ∝ p(Y_C | θ, φ, u_C, v_C) p(θ) p(φ).   (S2.5)

Since this posterior distribution is analytically intractable, due to the non-linear dependence on θ and φ, we resort to Markov chain Monte Carlo (MCMC) in order to obtain samples from it.
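A generic random-walk Metropolis sampler for such a posterior might look as follows. This is a minimal sketch: `log_posterior` is a hypothetical stand-in for the sum of the log marginal likelihood and the log priors, and the isotropic Gaussian proposal with a fixed step size is an illustrative choice only (in practice adaptive schemes are typically used).

```python
import numpy as np

def metropolis(log_posterior, x0, n_iter=5000, step=0.05, seed=0):
    """Random-walk Metropolis: propose x' ~ N(x, step^2 I) and accept
    with probability min(1, exp(log_posterior(x') - log_posterior(x)))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_posterior(x)
    samples = []
    for _ in range(n_iter):
        x_prop = x + step * rng.normal(size=x.shape)
        lp_prop = log_posterior(x_prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
            x, lp = x_prop, lp_prop
        samples.append(x.copy())
    return np.array(samples)
```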

Predictions
Having inferred the parameters θ and φ we may want to predict the output of the model for a new set of validation inputs u_V and v_V. In general we consider the number of validation points M to be different from the number of measurement points N, although these numbers can be the same for specific choices of calibrations. We denote the column vector of the corresponding M predicted outputs as Y_V = [y_1^V, …, y_M^V]^T, and the model evaluations with the new inputs as f(θ, u_V) = [f_1(θ, u_V), …, f_M(θ, u_V)]^T. Furthermore, we denote the collection of calibration inputs at the N training points (those corresponding to the measurements) as I_C = (u_C, v_C), and at the prediction points as I_V = (u_V, v_V). Note that for fixed values of the parameters θ and φ we can analytically obtain the predictive distribution of Y_V:

p(Y_V | Y_C, θ, φ, I_C, I_V) = N(μ_V, Σ_V),   (S2.6)

where the mean and variance are given by [4]

μ_V = f(θ, u_V) + Σ_MN (Σ_NN + σ²I)^{−1} (Y_C − f(θ, u_C)),   (S2.7)
Σ_V = Σ_MM + σ²I − Σ_MN (Σ_NN + σ²I)^{−1} Σ_NM,   (S2.8)

where Σ_MN and Σ_NM denote the M × N and N × M matrices of covariance function evaluations between the training and prediction inputs,

Σ_MN = κ_φ(v_V, v_C),   (S2.9)
Σ_NM = κ_φ(v_C, v_V),   (S2.10)

and Σ_MM is the covariance evaluated at the prediction inputs v_V only:

Σ_MM = κ_φ(v_V, v_V).   (S2.11)

Finally, to obtain the marginal (i.e. integrating out the parameters) predictive distribution,

p(Y_V | Y_C, I_C, I_V) = ∫ p(Y_V | Y_C, θ, φ, I_C, I_V) p(θ, φ | Y_C, I_C) dθ dφ,   (S2.12)

we use Monte Carlo integration with the samples of θ and φ obtained through MCMC.
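The predictive mean and variance above can be sketched as follows. This is a minimal NumPy sketch under our own naming conventions; it repeats a squared exponential kernel helper so the snippet is self-contained.

```python
import numpy as np

def kernel(v1, v2, alpha, rho):
    """Squared exponential covariance between covariate sets."""
    d = (v1[:, None, :] - v2[None, :, :]) / rho
    return alpha**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

def gp_predict(y_c, f_c, v_c, f_v, v_v, alpha, rho, sigma):
    """Predictive mean and covariance of Y_V: the GP discrepancy is
    conditioned on the calibration residuals y_c - f_c and added back
    onto the mechanistic prediction f_v."""
    n, m = len(y_c), len(f_v)
    S_nn = kernel(v_c, v_c, alpha, rho) + sigma**2 * np.eye(n)
    S_mn = kernel(v_v, v_c, alpha, rho)          # Sigma_MN
    S_mm = kernel(v_v, v_v, alpha, rho)          # Sigma_MM
    w = np.linalg.solve(S_nn, y_c - f_c)
    mean = f_v + S_mn @ w
    cov = S_mm + sigma**2 * np.eye(m) - S_mn @ np.linalg.solve(S_nn, S_mn.T)
    return mean, cov
```

With a very small noise level, predicting back at the training inputs essentially reproduces the observations, which is a useful sanity check of the conditioning.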

Sparse Gaussian Process
The above formulation of the discrepancy model suffers from a crucial computational bottleneck stemming from the need to invert the covariance matrix Σ_NN when evaluating the marginal likelihood in Eq. S2.4, as well as when drawing posterior predictions in Eq. S2.12 (in turn using Eq. S2.8). In all the calibration problems under consideration here, we have a large number of data points (time-series measurements), with N ≥ 80000. It therefore becomes infeasible to apply Gaussian processes to model the discrepancy without tackling the excessive computational load of repeatedly inverting a large matrix.
In order to alleviate this computational bottleneck we use a sparse approximation of the true covariance function. Quiñonero-Candela et al. [5] provide an extensive review of such sparse approximation techniques. Following [5] we use a set of P inducing inputs (or pseudo-inputs) x_C with an associated latent function δ(φ, x_C) representing the discrepancy corresponding to the inducing inputs. This inducing function is assigned a zero-mean GP prior:

δ(φ, x_C) ~ GP(0, κ_φ(·, ·)).   (S2.13)

Let us denote the vector of discrepancy function evaluations at all the training points as δ_{φ,u_C} and at the inducing points as δ_{φ,x_C}. The priors at the training and inducing points are then p(δ_{φ,u_C}) = N(0, Σ_NN) and p(δ_{φ,x_C}) = N(0, Σ_PP) respectively, where Σ_PP denotes the covariance evaluated at all pairs of inducing inputs:

Σ_PP = κ_φ(x_C, x_C).   (S2.14)

We can then approximate the prior on the true discrepancy function δ_{φ,u_C} by marginalising the inducing discrepancies:

p(δ_{φ,u_C}) ≈ ∫ p(δ_{φ,u_C} | δ_{φ,x_C}) p(δ_{φ,x_C}) dδ_{φ,x_C},   (S2.15)

where Σ_NP, Σ_PN denote the covariance matrices containing the cross-covariances between the training and inducing inputs (evaluated in the same way as in Eqs. S2.9, S2.10). This sparse approximation was first introduced in [6] to scale the GP regression model. It is widely known as the fully independent training conditional (FITC) approximation in machine learning parlance, since the introduction of the inducing inputs and corresponding function values δ_{φ,x_C} induces a conditional independence among all the elements of δ_{φ,u_C} [5], that is

p(δ_{φ,u_C} | δ_{φ,x_C}) = ∏_{i=1}^N p(δ_i | δ_{φ,x_C}).   (S2.16)

Using this approximate prior p(δ_{φ,u_C} | δ_{φ,x_C}) to obtain the marginal likelihood and the prediction terms, we essentially approximate the true covariance Σ_NN as [5]

Σ̃_NN = Q + diag(Σ_NN − Q),   (S2.17)

where diag(A) is a diagonal matrix whose elements match the diagonal of A and the matrix Q is given by

Q = Σ_NP Σ_PP^{−1} Σ_PN.   (S2.18)

The approximation Σ̃_NN thus has the same diagonal elements as Σ_NN, while its off-diagonal elements are the same as those of Q.
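The FITC covariance construction can be sketched as follows. This is a minimal sketch with one-dimensional inputs for brevity; the helper names and default hyperparameters are ours.

```python
import numpy as np

def kern1d(a, b, alpha=1.0, rho=1.0):
    """Squared exponential covariance for 1-D inputs."""
    d = (a[:, None] - b[None, :]) / rho
    return alpha**2 * np.exp(-0.5 * d**2)

def fitc_covariance(v, x_ind, alpha=1.0, rho=1.0, jitter=1e-10):
    """FITC approximation to Sigma_NN: the low-rank part
    Q = Sigma_NP Sigma_PP^{-1} Sigma_PN, with the diagonal of Q
    replaced by the exact diagonal of Sigma_NN."""
    S_np = kern1d(v, x_ind, alpha, rho)                        # Sigma_NP
    S_pp = kern1d(x_ind, x_ind, alpha, rho) + jitter * np.eye(len(x_ind))
    Q = S_np @ np.linalg.solve(S_pp, S_np.T)
    exact_diag = np.full(len(v), alpha**2)                     # SE kernel diagonal
    return Q + np.diag(exact_diag - np.diag(Q))
```

Only the P × P matrix Σ_PP needs to be inverted, so the cost drops from O(N³) to O(N P²); when the inducing inputs coincide with the training inputs, the approximation recovers the full covariance.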

S3. Modelling residuals using an ARMA(p, q) process
In the previous section we modelled the discrepancy as a function drawn from a GP prior. Alternatively, we can address discrepancy using a correlated residual approach (see Section 3.6.2 in [7] for an introduction to this modelling approach). In this case we model the residuals between the data (Y_C)_i and the mechanistic model f_i(θ, u_C) as an ARMA(p, q) process:

r_i = (Y_C)_i − f_i(θ, u_C),   r_i = Σ_{j=1}^p ϕ_j r_{i−j} + ν_i + Σ_{k=1}^q ζ_k ν_{i−k},   ν_i ~ N(0, τ²),   (S3.1)

where ϕ = [ϕ_1, …, ϕ_p]^T and ζ = [ζ_1, …, ζ_q]^T are the vectors of the p ≥ 0 autoregressive coefficients and q ≥ 0 moving-average coefficients of the ARMA process. The rationale behind this modelling approach is that if the mechanistic model is able to explain the measurements adequately then the residuals are essentially uncorrelated measurement noise ~ N(0, σ²). Note that we use a different symbol, ν, as opposed to ε, to represent the noise term in order to highlight the difference in its interpretation. However, the existence of discrepancy between the model output and the observations means that the residuals have unexplained structure, which can be modelled using a pre-determined correlation structure, as expressed through an ARMA(p, q) model.

Inference
We first re-write the normally distributed error term ν_i as

ν_i = r_i − Σ_{j=1}^p ϕ_j r_{i−j} − Σ_{k=1}^q ζ_k ν_{i−k},   (S3.2)

using which we can write the conditional likelihood of the observed data for N measurements as [8]

p(Y_C | θ, φ, u_C) = ∏_{i=p+1}^N N(ν_i | 0, τ²),   (S3.3)

where we have used ν_i for i ≥ p + 1 by assuming that ν_p = ν_{p−1} = … = ν_{p+1−q} = 0, its expected value. Note that calculating the likelihood for all N measurements would require us to introduce extra parameter values, as latent variables, for the past history of the data as well as the error terms before measurement commences. Alternatively, we can reformulate Eq. S3.1 in state-space form and use the Kalman filter algorithm to evaluate the unconditional full likelihood for all i. We refer the reader to [7] for the details of this approach. We point out here that the difference between these two approaches to calculating the likelihood is insignificant for long time series, which is the case for our calibration problems with N ≥ 80000.
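The recursive evaluation of this conditional likelihood can be sketched as follows. This is a minimal sketch under the conditioning conventions described above (the first p residuals are conditioned upon and the pre-sample errors are set to zero); the function name is ours.

```python
import numpy as np

def arma_conditional_loglik(r, phi, zeta, tau):
    """Conditional log-likelihood of residuals r under ARMA(p, q):
    condition on the first p residuals, set pre-sample errors to
    their expected value (zero), and recover nu_i recursively."""
    p, q = len(phi), len(zeta)
    nu = np.zeros(len(r))          # nu[:p] kept at the assumed value 0
    ll = 0.0
    for i in range(p, len(r)):
        ar = sum(phi[j] * r[i - 1 - j] for j in range(p))
        ma = sum(zeta[k] * nu[i - 1 - k] for k in range(q) if i - 1 - k >= 0)
        nu[i] = r[i] - ar - ma     # recovered white-noise error
        ll += -0.5 * np.log(2.0 * np.pi * tau**2) - nu[i]**2 / (2.0 * tau**2)
    return ll
```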
Having defined the likelihood we can again adopt the Bayesian framework to infer posterior distributions of the model parameters θ and the set of ARMA parameters φ = [ϕ, ζ], choosing suitable prior distributions p(θ) and p(φ) respectively and using Bayes' theorem to obtain the posterior:

p(θ, φ | Y_C, u_C) ∝ p(Y_C | θ, φ, u_C) p(θ) p(φ).   (S3.4)

Note that we consider the noise variance known, since its maximum likelihood estimate is given by the mean of the squared errors,

τ² = (1/(N − p)) Σ_{i=p+1}^N ν_i².   (S3.5)

Predictions
In a purely time-series modelling context, where models such as ARMA are extensively used, predictions are used to forecast ahead in time over short intervals. In the context of our calibration problem we are generally interested in predicting outputs for a new calibration u_V. However, since we want to predict M output values corresponding to the new validation inputs, we can simply recast our predictions as one-step-ahead forecasts.
We denote the M predicted outputs as Y_V = [y_1^V, …, y_M^V]^T, and let r_C and f(θ, u_C) be column vectors representing the last p observations and model evaluations obtained with the calibration u_C; we denote the vector of the last q errors as ν_C. Note that in our formulation the m-th prediction, m = 1, …, M, is to be considered as the (N + 1)-th prediction from the model in Eq. S3.1, with the following modification: we replace the mechanistic model evaluations with those at the new validation inputs u_V. Thus, for a particular value of the parameters we have

p(y_m^V | Y_C, θ, φ) = N(μ_m, τ²),   (S3.6)

where the mean of the one-step-ahead predictive distribution is

μ_m = f_m(θ, u_V) + Σ_{j=1}^p ϕ_j r_{N+m−j} + Σ_{k=1}^q ζ_k ν_{N+m−k},   (S3.7)

and its variance is the error variance τ². In order to quantify the uncertainty in the predictions we can integrate out the model and noise parameters [9]:

p(Y_V | Y_C) = ∫ p(Y_V | Y_C, θ, φ) p(θ, φ | Y_C) dθ dφ,   (S3.8)

where we use Monte Carlo integration as in Eq. S2.12.
In order to collect the full set of M predictions Y V we simply use the one-step-ahead forecasting distribution shown above in a recursive manner.
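The recursive one-step-ahead forecasting scheme can be sketched as follows. This is a minimal sketch of the predictive mean only: future errors are replaced by their expected value of zero, and forecast residuals are fed back through the autoregressive part; the function name is ours.

```python
import numpy as np

def arma_forecast_mean(f_v, r_hist, nu_hist, phi, zeta):
    """Mean one-step-ahead ARMA forecasts applied recursively: each
    prediction is the mechanistic output f_v[m] plus the ARMA-predicted
    residual, built from the last p residuals and last q errors."""
    r_hist, nu_hist = list(r_hist), list(nu_hist)
    preds = []
    for m in range(len(f_v)):
        ar = sum(phi[j] * r_hist[-1 - j] for j in range(len(phi)))
        ma = sum(zeta[k] * nu_hist[-1 - k] for k in range(len(zeta)))
        r_next = ar + ma
        preds.append(f_v[m] + r_next)
        r_hist.append(r_next)      # feed the forecast residual back
        nu_hist.append(0.0)        # future errors take their expected value
    return np.array(preds)
```

For a pure AR(1) residual with coefficient 0.5 and last residual 2.0, for example, the forecast residuals decay geometrically (1.0, 0.5, 0.25, …).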

S4. Choice of priors for the ion channel example
Here we specify the choice of priors for the ion channel example.
• For the ion channel ODE model parameters, we chose the uniform prior specified in Beattie et al. [10] and Lei et al. [11,12].
• For the GP model, we have the unbiased noise parameter σ, the length-scales ρ_i, and the marginal variance α:
  – σ: Half-Normal prior with standard deviation 25;
  – ρ_i: Inverse-Gamma prior with shape and scale (5, 5);
  – α: Inverse-Gamma prior with shape and scale (5, 5).
• For the ARMA model, we have the autoregressive coefficients ϕ_i and the moving-average coefficients ζ_i:
  – ϕ_i: Normal prior centred on the maximum likelihood estimate with standard deviation 2.5;
  – ζ_i: Normal prior centred on the maximum likelihood estimate with standard deviation 2.5.

S5. Computing and representing posterior predictive
To compute the posterior predictive, we follow Girolami [13] and write the posterior predictives in Eqs. S2.12 & S3.8 as

p(Y_V | Y_C) ≈ (1/K) Σ_{k=1}^K p(Y_V | Y_C, θ_k, φ_k),   (S5.1)

where θ_k, φ_k are the k-th posterior samples of the parameters. We have checked that the posterior predictive of the ODE models at a given time point is symmetric and close to a Gaussian distribution; for the sake of simplicity, we therefore use summary statistics, namely the predictive mean and credible intervals computed using the variance, to represent the posterior predictive in this paper. The predictive mean is obtained by averaging the means of the K components in Eq. S5.1. Finally, to show the 95% credible intervals of our predictions, we plot the predictive mean ± 1.96 predictive standard deviations.

[Figure caption residue, where T_X is the ten Tusscher model in scenario X and aF_X is the adjusted Fink model using the individual i for scenario X: A A total of 1079 candidates satisfied Error_0% < 0.1, i.e. were below the displayed threshold; using this metric, the best candidate had Error_0% = 1.6%. B Testing the performance of the 1079 candidates with respect to the 2 Hz scenario, a total of 990 candidates satisfy Error_0% < 0.1 and Error_2Hz < 0.1; using these two metrics, the best candidate had Error_0% = 1.8% and Error_2Hz = 1.8%. C Testing the performance of the 1079 candidates with respect to the 75% IKr block scenario, only 80 candidates satisfy Error_0% < 0.1 and Error_75% < 0.1; using these two metrics, the best candidate had Error_0% = 4.5% and Error_75% = 4.5%. D Testing the performance of the 1079 candidates with respect to both the 75% IKr block and 2 Hz scenarios, only 70 candidates satisfy Error_0% < 0.1, Error_75% < 0.1, and Error_2Hz < 0.1; using the three metrics, the best candidate had Error_0% = 4.5%, Error_2Hz = 3.7%, and Error_75% … Panels: A 1 Hz and 0% IKr block; B 2 Hz; C 75% IKr block. "Best 0% prediction" are the results obtained using the best candidate that satisfies Error_0% < 0.1. "Best 0% + 2Hz prediction" are the results obtained using the best candidate that satisfies Error_0% < 0.1 and Error_2Hz < 0.1. "Best overall prediction" are the results obtained using the best candidate that satisfies Error_0% < 0.1, Error_2Hz < 0.1, and Error_75% < 0.1.]
[Table caption residue: θ_n are samples from the posterior distribution generated by MCMC. Only relative differences within a row are meaningful, and we therefore subtract the maximum log-likelihood for each dataset from the results, giving the best model in each row a score of zero. Note that care is needed when interpreting the log-likelihood values for the GP models due to the FITC approximation used to approximate the full likelihood.]

[Figure caption residue: Matérn 3/2 covariance function; the voltage clamp protocol for calibration is the staircase protocol [11].]