Surrogate modelling for the prediction of spatial fields based on simultaneous dimensionality reduction of high-dimensional input/output spaces

Time-consuming numerical simulators for solving groundwater flow and dissolution models of physico-chemical processes in deep aquifers normally require some of the model inputs to be defined in high-dimensional spaces in order to return realistic results. Sometimes, the outputs of interest are spatial fields leading to high-dimensional output spaces. Although Gaussian process emulation has been satisfactorily used for computing faithful and inexpensive approximations of complex simulators, these have been mostly applied to problems defined in low-dimensional input spaces. In this paper, we propose a method for simultaneously reducing the dimensionality of very high-dimensional input and output spaces in Gaussian process emulators for stochastic partial differential equation models while retaining the qualitative features of the original models. This allows us to build a surrogate model for the prediction of spatial fields in such time-consuming simulators. We apply the methodology to a model of convection and dissolution processes occurring during carbon capture and storage.


Introduction
The use of complex mathematical models for simulating and predicting the behaviour of physico-chemical processes is nowadays crucial in a broad range of groundwater disciplines, including contaminant transport and geological storage of CO 2 in deep saline aquifers among many others. The complexity of these models normally involves the implementation of highly high-dimensional input and output spaces, and in particular for groundwater model simulators. Note that standard scalar GP emulation has already been successfully applied to complex and time-consuming scalar valued simulators [5]. Thus, the inputs and outputs of the simulator considered here will be spatial fields defined in very high-dimensional input and output spaces. The GP emulator is able to reproduce (up to a predetermined level of accuracy) the work of the computer model much faster. This is of vital importance in applications such as uncertainty quantification, design optimization and decision theory, where a large number (sometimes millions) of calls to the numerical simulator are required in order to produce a critical assessment. The methodology of the empirical simultaneous GP model reduction (ESGPMR) approach presented in this paper consists of combining two main techniques. (i) We use the method proposed by Higdon et al. [26] to reduce the dimensionality of the output space by using principal component analysis (PCA) and separate independent GP emulators for the coefficients in the PCA basis. Higdon's approach has been successfully adapted to other applications, for example, Holden et al. [27] applied a variation of the method to high-dimensional climate model outputs. Bowman & Woods [22] adapted the method to the field of atmospheric dispersion by using the thin-plate splines technique. (ii) We capture the high-dimensional relationship between the simulator inputs and the coefficients of each of the vectors spanning the reduced output space by exploiting the properties of the KL decomposition of the input permeability fields and using cross-validation (CV). Thus, we find a subspace of the high-dimensional input space leading to an optimal representation of the GP model response surface. To test the GP emulation results, we take as reference (true value) a sample of 256 full numerical simulations. The simulations were obtained over 18 days of continuous intensive CPU computations on a 12-core Intel Xeon cluster processor. The time spent to compute the final prediction of the same 256 spatial fields with the ESGPMR approach on the same processor was 4 h.
The outline of this paper is as follows. In §2, we introduce the computationally expensive numerical simulator of a convection and dissolution process in random heterogeneous porous media. In §3, we describe the framework of the GP emulation methodology. We present the novel method for simultaneous input-output model dimension reduction and we detail how to properly estimate the hyperparameters of a high-dimensional space by using a continuation algorithm. In §4, we test the GP model reduction methodologies with the model problem introduced earlier. Concluding remarks are provided in §5.

Mathematical model and numerical simulator
Dissolution of CO 2 in deep saline aquifers is considered one of the most effective ways of carbon capture and storage [28]. The model studied here focusses on the hydrodynamical part of the problem by setting a model for CO 2 -loaded flows in an idealized two-dimensional geometry. It considers the impact of hydrodynamic dispersion (or dispersivity), permeability heterogeneity and isotropy in porous media on the development of convecting instabilities. For solving the resulting problem, the finite-element method (FEM) is employed. The existence of continuous bifurcations from the no-flow steady-state solutions of the problem adds additional challenge to the search of numerical solutions, and to overcome this an arclength continuation technique [29] is used in conjunction with the FEM.

Convectively enhanced dissolution process in porous media
The dissolution of CO 2 into the brine of the storage site causes an increase in the density of the mixture, leading the CO 2 to sink while reacting with local rock minerals to become a solid carbonate [30]. This leads to the onset of convection rolls and the resulting mixing leads to a greater contact between the injected CO 2 and local minerals, significantly enhancing the carbon capture. This process is known as convectively enhanced dissolution (C-ED) [31,32].
In recent theoretical and numerical works (e.g. [32][33][34]), researchers have investigated the behaviour of CO 2 in deep saline aquifers. These studies focussed on the understanding of a simplified and idealized case where the problem is reduced to the motion of a fluid through a porous medium and where the dispersive transport is based on pure molecular diffusion. This paper will take into account more characteristics of natural aquifers, namely the rock heterogeneity and the hydrodynamic dispersion. This later will be modelled by a dispersion tensor, D, dependent on the local Darcy velocity of the fluid u as follows [34][35][36]: D = D m I + β T u I + (β L − β T )(u ⊗ u/ u ), where ⊗ represents the tensor product, I is the unit (identity) tensor, D m is the molecular diffusion coefficient of the solute in the fluid and β L and β T are, respectively, the longitudinal and transverse dispersion coefficients, which satisfy We consider the C-ED process to occur in a two-dimensional domain representing a random, heterogeneous, isotropic porous medium of depth 2H and length L. The spatial variable is defined by The governing equations for this model are continuity (2.1), Darcy's Law (2.2) and convection-diffusion-reaction (2.3) [32,33,38]: where e z is the outward-pointing unit vector along the ordinate axis, C is the concentration of dissolved CO 2 , u = (u x , u z ) is the fluid velocity and P is the fluid pressure. The parameters K, μ, φ, γ c and g are, respectively, the medium permeability field, the fluid viscosity, the rock porosity, the reaction rate and acceleration due to gravity. The solute undergoes a first-order reaction and is converted into an inert product with no effect on the solution density, thus the density of the fluid is linearized and takes the form, ρ = ρ 0 + β c C, where ρ 0 and β c are the density of the pure fluid and the volumetric expansion coefficient. This assumption allows us to use the Boussinesq approximation [32]. The boundary conditions for the above problem are: The velocity field is represented by using a streamfunction, Ψ , formulation, u x = ∂Ψ/∂z and u z = −∂Ψ/∂x. We can eliminate the pressure field from equation ((2.2)) by satisfying the mass conservation (2.1), resulting in a new set of equations for the unknown field variables (Ψ , C). For the resulting set of governing equations and boundary conditions we follow the same dimensionless formulation used in Crevillén-García et al. [5], where the reader can find a full detailed derivation of formulae and equations. The dimensionless variables and numbers are defined by: where β T and β L are, respectively, the longitudinal and transverse dispersion coefficients [37,38]; K 0 and D 0 are reference permeability and diffusion coefficients, respectively; L is the aspect ratio of the domain; Ra is the Rayleigh number, related to the buoyancy driven flow; and Da is the Damkhöler number, which is the ratio of the chemical reaction rate to the mass transfer rate [32]. In terms of these dimensionless variables and numbers, and dropping the primes for convenience, the following dimensionless governing equations defined in R = [0, L] × [−1, 1] stay as: The Fickian mass flux J = (J x , J z ) (Scheidegger-Bear) [38] satisfies J = D∇C and its components are expressed as follows: where · 2 denotes the standard Euclidean norm. Finally, the corresponding dimensionless form of the boundary conditions is: In this study, we are interested in the long-term behaviour of the system and consequently we will restrict ourselves to the steady-state equations, i.e. by setting ∂C/∂t = 0 in equation (2.5).

Convectively enhanced dissolution numerical simulator
The numerical simulator is built based on an H 1 -conforming FEM [39], and the numerical solutions were computed on a shape-regular rectangular partition of R = [0, π/2] × [−1, 1] ⊂ R 2 comprising 2500 elements (i.e. a computational domain formed by M = 2601 nodes), employing basis functions of polynomial degree 1. All computations were performed using the AptoFEM finite-element toolkit, documented in Antonietti et al. [40], together with the MUMPS linear solver [41,42]. In terms of the CPU time spent by the numerical simulator to compute a single solution of equations (2.4) and (2.5), the choice of different values for the model parameters, Ra and Da, makes no difference. In this paper, we will restrict ourselves to the case L = π/2, Ra = 100, Da = 0.1, β L = π/2 and β T = β L /10.

Generation of random permeability fields
Natural media is heterogeneous in a hierarchy of scales, and it is virtually impossible with today's technologies to resolve this heterogeneity in detail [43]. The permeability values have shown spatial correlation [1][2][3] and a function that has been extensively used [2,5,7,18,44,45] to represent that correlation is the following squared exponential covariance function: where λ represents the correlation length and σ 2 the variance of the process. The simulator described earlier necessitates the values of the permeability K at each of the M nodes of the computational domain in order to solve the problem. It is common in groundwater flow models [18] to model K as a log-Gaussian random field; this guarantees that K > 0 in R. In this study, we will model the permeability as log-Gaussian permeability fields and generate samples of the permeability fields at the nodes with the following procedure [46]. Let (Ω, F , P) be a probability space. If we now let Z ∼ N (m, C), i.e. Z : Ω → R M be a multivariate normally distributed random vector with mean and covariance m = (m 1 , ) is a discrete Gaussian random field. Finally, we set K = exp(Z) to obtain the desired discrete log-Gaussian permeability field, where each of the components of the vector K ∈ R M corresponds to the value of the permeability at a node of the computational domain.
To generate different samples of Z, we will use the KL decomposition method (see [5,7,46,47]. This method uses an eigen-decomposition of the covariance matrix C at the nodes which is then stored for future samples generation. Moreover, the KL expansion may be truncated, which leads to a reduceddimensional formulation that is critical in the emulator construction. The KL decomposition method is summarized as follows.
The covariance matrix C is real-valued and symmetric, and therefore admits an eigen-decomposition [48]: . . , M, be independent and identically distributed (i.i.d.) random variables. We can draw samples from Z ∼ N (m, C) using the KL decomposition of Z using the following expression [46]: The terms ξ i ∼ N (0, 1) are called KL coefficients. To be consistent with the non-dimensional formulation of equations (2.4) and (2.5), we generate a set of log-Gaussian permeability fields with point-wise mean 1 by setting m = −(σ/2)I. Let us define the random vector ξ ∈ R M , distributed according to N (0, I). The numerical simulator is considered as a mapping from ξ ∈ R M to (C, Ψ ) ∈ R M × R M . If we were interested only in one of the two simulator output fields, we could also consider the simulator as either f c : In the next section, we will describe how to build a GP emulator.

Gaussian process emulation of spatial fields in complex simulators
A GP emulator is a statistical approximation of the numerical simulator. In this paper, to build an emulator for a given simulator, we use GP regression methodologies consisting of establishing a prior specification of the functional form of the target simulator which is updated in the light of data provided by using Bayes' rule, which yields a posterior distribution that can be used for inference. That prior specification consists of providing the model with a mean, a covariance structure and a set of observed values (or targets) at carefully selected inputs, so-called design points. The pair formed by the design points and the observed values at such points is called the training set. The mean and covariance functions contain parameters, so-called hyperparameters, that need to be inferred from the training data by solving an optimization problem. For high-dimensional input spaces, the GP model would be impractical. To overcome this optimization issue, we developed an ESGPMR method based on Bayesian inference that is able to recursively find the lowest dimension of the input space for which the GP emulator response surface best approximates the numerical simulator. The method incorporates a continuation routine that helps the optimization algorithm used for the MLE estimates to find adequate initial values for the successive iterations. The continuation routine can be easily made extensive to any existing moderatedimensional GP emulators that groundwater researchers using commercial GP toolboxes discarded because of the impossibility of estimating the hyperparameters appropriately. In the next section, we only give a brief description of the GP framework, stating our choices for the prior specifications of the GP model, the generation of the training data and the final predictive equations used to approximate the simulator. For a detailed description of conditional distributions and the derivation of the final formulae, we refer the reader to Rasmussen & Williams [21].

Gaussian process emulation framework
In this section, we describe the general GP emulation methodology for scalar functions, to then extend it to vector functions in the next section. Let g : R M → R be a scalar simulator. The aim of GP emulation is to learn the functional form of the target model g(·) in the light of a very reduced (due to the timeconsuming simulator) set of data. The design points can be regarded as the locations (in the input space) at which we wish to obtain the values of g(·) to determine the sensitivity of the simulator to different inputs. An exhaustive explanation of the possible choice of design points is addressed in Sacks et al. [19].
To generate the set of design points, we simply spread the points to cover the input space, in this case R M . There are several methods of sampling the inputs, the most common of which are Latin hypercube sampling (LHS) [49,50] and Sobol sequence sampling [51]. In this paper, guided by the successful results obtained in a previous work, we will use the latter. Given the particular definition of the inputs in our model simulator (ξ ∼ N (0, I)), one very intuitive way of building a set of d design points is to, first, use a Sobol sequence to generate d points in [0, 1] M , and second, push the d points component-wise through the inverse cumulative distribution function of M random variables distributed according to Note that by setting σ d > 1, we guarantee that the design points are spread enough in R M to cover all the points sampled from the input distribution (N (0, I)), and therefore we will not be missing some key information about the simulator responses at points far from the mean of the input distribution.
Let us denote by f (·) the GP used to model g(·). For any ξ , ξ ∈ R D , for some D ≤ M, the GP prior mean function is defined as: m(ξ ) := E[f (ξ )] and the covariance function as: and Cov(·, ·) denote the expectation and covariance operators, respectively. One of the methods available in the literature to select the mean and covariance functions for a given model is CV (see [5]). However, the covariance chosen for this GP makes no difference to the scope of the approaches developed in this study, and thus, for simplicity, we will use a mean-zero function and the square exponential (SE) covariance which is given in terms of three hyperparameters as follows [21]: where σ 2 f is the process variance, = ( 1 , . . . , D ) is the length scale, σ 2 n is the noise variance and δ ij is the Kronecker delta. The hyperparameters are collectively represented by θ θ θ = (σ 2 f , , σ 2 n ). Given the set of design points generated with the method described earlier, To avoid numerical instabilities (ill-conditioning of the matrix system), an i.i.d. random noise j ∼ N (0, σ 2 n ), where σ 2 n is the variance in expression (3.1), is typically introduced into the model, and thus the observed values will take the form y j = f c (ξ j ) + j , where y j is the perturbed simulator output at the design pointξ j ∈ R M . If we now write y = [y 1 , . . . , y d ] , we can define the training set as the pair D = {X, y}. Once we have provided the model with a mean-zero function, the SE covariance function (3.1) and the training set D, we can make predictions for new untested inputs ξ * ∈ R D by using the predictive equations for GP regression [21]: and where the (i, j)th entry of Σ(X, X) ∈ R d×d is given by k(ξ i ,ξ j ) and Σ(ξ * , X) = (k(ξ * ,ξ 1 ), . . . , k(ξ * ,ξ d )) . Expression

Reduced-rank approximation of the output space
In this section, we use the method proposed by Higdon et al. [26]. The idea is to use PCA to project the original simulator outputs onto a lower-dimensional space spanned by an orthogonal basis. This is done via singular value decomposition (SVD) as we detail later. Once in the PCA framework, the outputs can be expressed as a linear combination of PCA basis vectors (or the principal components (PCs)) with coefficients treated as independent univariate GPs with distinct sets of correlation lengths. This allows us to build separate GP emulators (as many as PCs considered) to estimate the coefficients of new outputs at untested inputs in the PCA basis. Then we use a linear map for reconstruction to the original output space. By using orthogonal projection, we guarantees a minimal average reconstruction error. The error considered for comparisons between two vectors throughout this paper will be the L 2 -norm relative error (RE) unless stated otherwise. For the two vectors x = (x 1 , . . . , x M ) and y = (y 1 , . . . , y M ), we define the L 2 -norm RE between x and y as Now we can build r separate and independent GPs from the input space R M to R as described in §3.1 by generating also r separate training sets. In this case, the design points in all the training sets are the same, X = {ξ j } d j=1 , withξ j ∈ R M , and the set of observed values are the coefficients of the PCs computed above, i.e. the first r rows of α. Thus, for any untested input ξ * ∈ R M , we use expression (3.2) for each of the r GPs to estimate the r coefficients. These are then stored in vector form and can be mapped back to the original output space to obtain the final GP prediction y * ∈ R M . We can test the accuracy in the prediction by running the numerical simulator at the same input ξ * and compare the result y true ∈ R M with y * . Unfortunately for high-dimensional input spaces this approach is not valid and an additional input space dimension reduction must be performed.
In the next section, we propose a method for overcoming the limitation that GPs have in highdimensional input spaces. This methodology is then combined with the output space reduction method above to build a GP emulator for the full simulator.

The empirical simultaneous Gaussian process model reduction method
Let us clarify the notation first. Suppose we have a set of d design pointsξ j ∈ R M generated as described in §3.1. We run the simulator at those points to obtain the corresponding true d output fields y 1 , . . . , y d where the fields y j are reshaped to form the columns of the M × d outputs matrix Y. Then we use the dimension-reduction method described in §3.2 to obtain the PCA basis and the matrix of coefficients α = (α ij ), i = 1, . . . , M, j = 1, . . . , d. We denote byỸ r the reduced-rank approximation of Y obtained by considering the first r ≤ M PCs of the PCA basis whose columnsỹ r j , j = 1, . . . , d are the correspondent reduced-rank approximations of the observed fields y j , j = 1, . . . , d. As we wish to reduce the dimension M of the original input space, let us define, for any D ≤ M, the training sets as: where y j are the columns of Y (the true fields) andŷ D j denotes the predicted field at pointξ D j . If expression (3.5) does not hold, set r = r + 1 and go to (iii) (to refine the reduced-rank approximation error). If expression (3.5) holds, set D = D max − 1 and go to (v) (to reduce the dimension of the input space) until the expression does not hold, and then return D and r.
Note that the choices of ε and D max are problem-dependent and completely heuristic. In the next section, we will discuss how to choose values for those tolerances by examining the data, although this ultimately depends on the end-user criteria.

Leave-one-out cross-validation and hyperparameters estimation
Estimates of the unknown hyperparameters θ θ θ = (σ 2 f , , σ 2 n ) in expression (3.1) need to be inferred from the data provided to the model. This step is a crucial part of GP emulation. While this is quite simple to solve in one-dimensional problems, it really becomes an optimization issue when dimension increases. In this section, we describe how to estimate those parameters from the training set even if the dimension of the input space is large. To estimate the hyperparameters, we use a technique known as leave-one-out cross-validation (LOO-CV) (see [5,21]). LOO-CV consists of using all training set data but one point (the leave-out) for training, and computing the model prediction error on the leave-out point. This process is repeated until all available d points have been exhausted. We use each of the d leave-out training sets and a conjugate gradient optimizer to obtain estimates of the hyperparameters by maximizing the log marginal likelihood (3.6) with respect to the hyperparameters: The prediction errors during the LOO-CV scheme are quantified through the mean square error (MSE): where m j is the predicted expected value given by expression (3.2) and y j is the corresponding observed value both at the same (leave-out) inputξ j . Note that the MSE depends only on the mean predictions, and thus, sometimes, different CV measures which also take into account predictive variances, such as the negative log validation density loss [21] or the Dawid score [52], might be preferred. For the purpose of this study, the MSE gives us the relevant information about the LOO-CV predictions we need for assessment. For an optimal performance of the optimization algorithm and for avoiding failure due to the existence of possible marginal likelihood multiple optima, a continuation routine must be included in all the independent GP emulators described earlier. This is straightforward and can be implemented as follows: (i) Consider the training sets as in §3.3, i.e. for any D ≤ M: Without loss of generality, let us set r = 1. The method is exactly the same for the other r − 1 GPs. (ii) Estimate the hyperparameters by finding the MLE of expression (3.6) for the D one-dimensional problems until a maximum value D max (large), i.e. by using the Dth KL coefficient of each of the d design points in the training set. We obtain θ θ θ ) . Note that the importance here lies in the length scales arising from the anisotropic covariance function. Note also that we do not need to compute these values a priori and we can do each calculation just before each iteration as needed (depending on D max ). (iii) Start the iteration over the number of KL coefficients D. For D = 1 perform a LOO-CV scheme by using the values obtained in (ii) as initial guess for the estimation of hyperparameters to be used in the GP. Store the hyperparameters as θ θ θ (1) n ) . (iv) Repeat the iterations until D = D max , and for D > 2 take as initial guess the previous estimation of the hyperparameters in the lower-dimensional space and the first estimation obtained in (ii) for the next component. For example, for θ θ θ (2) take as initial guess: (σ 1 , n ) . Store the MSE for all the D iterations to examine convergence. By inspecting figure 2, we can estimate a value for D max and refine the model.

Numerical results
In this section, we discuss the results obtained by applying the reduction and GP emulation methods to the model problem introduced in §2. Let us consider the numerical simulators f c : ξ → C and f Ψ : ξ → Ψ with ξ ∈ R M , distributed according to N (0, I). C ∈ R M and Ψ ∈ R M are the concentration and streamfunction, respectively. Let us show how we build the GP emulator for the concentration simulator f c . Exactly the same procedure applies to f Ψ . The first step is to build a training set. We generate d = 256 points as described in §3.1, i.e. we use a Sobol sequence to generate d points in [0, 1] M , and, second, push the d points component wise through the inverse cumulative distribution function of M random variables distributed according to N (0, σ 2 d ), with σ d = 1.32, to, jointly, form the set of design pointsξ j , j = 1, . . . , d. For the choice of σ d , we tested the GP simulator for three different values: σ d = 1.32 was the one providing more accuracy in the LOO-CV test. To exploit the properties of Sobol sequences and spread the points in the space in an optimal manner, it is recommended [53] that the generated samples are a power of 2. In this study, we use d = 2 8 = 256. A lower number of design points might be used with the same degree of accuracy in the results (see [5]), although as our decisions for the model specifications are based mainly in the LOO-CV technique, we need a relatively large amount of experimental data. For those design points we run the simulator and obtain the correspondent concentration fields f c (ξ 1 Table 1. Relative errors between the true and reduced-rank approximation RE true-red and between and the true and the predicted concentration fields RE true-pred for three different tolerances ε. The numbers of PCs (PC) and KL coefficients (KL) used are also provided. The key parameters used to characterize the heterogeneity of the porous medium appeared in expression (2.6). A detailed analysis of the impact of heterogeneity on the concentration profiles and the streamfunction fields has been conducted previously [5,24]; in these earlier works, a measure of the amount of CO 2 adsorbed through the top boundary in a process of CO 2 storage is computed from both the heterogeneous case and the one for an equivalent homogeneous medium characterized by a constant permeability equal to the mean permeability in the domain. For our simulations, the value of λ will be set to λ = 0.5. This value has been taken from the ranges suggested in the literature (see [18]). The existence of bifurcating branches of solutions in the C-ED model (see [32]), i.e. there is not always guarantee of a unique observed value at a single design point, might lead to inaccurate training data if large values of σ 2 are considered and no classification techniques are employed [5]. Thus, for simplicity and without loss of generality, we will set σ 2 = 0.1. Note that the choice of σ 2 does not directly affect the applicability of the ESGPMR method proposed in this paper but the uniqueness of the simulator outputs. Consequently, the use of larger values for σ 2 would probably necessitate using additional pre-processing tools to classify the variety of branches of observed data before forming the training set. This scenario has been treated in detail previously [5]. Note that for models where there is a one-to-one correspondence between inputs and outputs, the ESGPMR can be applied without restriction.
Once we have generated the training set and decided the prior specifications for the GPs, we use the ESGPMR algorithm to reduce the dimensionality of the input and output spaces in order to bring the original model to a more computationally tractable and accurate problem. Tables 1 and 2 show the results for different accuracy tolerances ε obtained from the set of concentration and streamfunction fields, respectively. It also shows the number of KL coefficients used for the input space, the number of PCs from the PCA basis for the output space and the overall relative (maximum) error achieved. We observe that (as one might expect) the more dimensions that are considered, the more accurate is the overall approximation of the GP emulator. This is also a sign that the ESGPMR algorithm is well    designed. To set an optimal value for D max , we need to conduct a first experiment allowing D max to be large enough to allow us to investigate, for instance by visual inspection, some signs or numerical evidence of convergence. Figure 2 provides us with a valuable information about the latter. For this model, a sensible value for D max might be 100 (higher or lower values are at user discretion). Note that this limit is only illustrative as figure 2 is only considering the first GP emulator or r = 1. Although  the value of D max is just a reference, using more DoF does not seem to be sensible as it could lead to additional numerical errors as well as an exponential increase of the computational cost. It is important to note that D max is just a user's auto-imposed limit depending on the user's own computational resources, and therefore it is related to dimension reduction, while ε is related to the accuracy of the GP results. Thus, the choices of D max and ε are made independently. In these terms, if the user has not reached the desired accuracy in the GP predictions for either a tolerance ε or a relatively large value of D max , it can be concluded that GP emulation is not a priori (one can always try with different prior mean, covariance or likelihood functions) a recommendable surrogate model for the numerical simulator. Figure 3 shows numerical evidence of how the reduction of the MSE depends on the number of PCs considered. Figure 4 shows the permeability field used to compute the concentration and the streamfunction outputs shown in figures 5 and 6. Figure 5 shows the results obtained by using the GP emulator with D = 82 and r = 10 to predict the concentration output field at one untested point ξ * ∈ R M . The RE between the true and the reduced rank approximation was 0.005. The RE between the true and the predicted was 0.01. Figure 6 shows the results for the same input considered in figure 5, where in this case the best resolution achieved was using D = 55 and r = 14. We highlight here that different values of D and r are needed for each GP emulator to achieve the desired tolerance, i.e. while for the concentration we needed D = 5 and r = 3 to achieve a tolerance of 0.05, for the streamfunction we needed D = 55 and r = 14. Furthermore, both algorithms were unable to refine further, and thus the lowest tolerances we could achieve in this study were ε = 0.01 for the concentration and ε = 0.05 for the streamfunction.
Before finishing this section, let us see another application. We can use the GP emulator to approximate scalar-valued quantities of interest from any of the output fields, for instance we can compute the total mass of dissolved solute in the domain R given by BarbaRossa et al. [34]: M = R C. In this case, we only need to emulate the concentration field C for the new input and then compute the integral over R. An intuitive and qualitative way of measuring how close our GP predictions are to the observed values is through scatterplots. This consists of plotting the pairs (predicted outputs, observed outputs) along with the line y = x and checking that the scattered points are not 'far away' from the straight line. For this application, we used the data stored during the LOO-CV scheme for the GP above and computed the set M * 1 , . . . , M * d from the emulated concentration fields. Then we computed M 1 , . . . , M d directly from the simulator concentration outputs C 1 , . . . , C d . Figure 7 shows the scatterplot for the observed values (Y-axis) against the predicted values (X-axis).
In this study, the GP emulation was implemented using the GPML MATLAB TOOLBOX v.3.4 [21].

Conclusion
In this paper, we developed a methodology based on dimensionality reduction and GP emulation for surrogate modelling in SPDEs. The technique can be applied without modification to any model involving vector-valued functions and vector-valued inputs. The ESGPMR algorithm was able to simplify the original mathematical problem while retaining the accuracy in the results. In particular, for the emulation of concentration fields, one of the GP emulators was able to reduce the dimensionality of the input and output spaces from M = 2601 to D = 112 and from M = 2601 to r = 20, respectively, for an overall tolerance of ε = 0.01. The ESGPMR algorithm provides the end-user with a tool for assessing if GP emulation is an efficient surrogate model for a given computationally expensive numerical simulator. This applies to either numerical models where it is not feasible to meet the desired resolution in the GP predictions or when the original problem cannot be adequately reduced to a more tractable model (i.e. D max and r are found to be extremely large for the tolerances given).
Data accessibility. Our data are available in Dryad at https://doi.org/10.5061/dryad.3g280 [54]. Competing interests. The author declares that there are no competing interests. Funding. This research was funded by the EU Panacea project, FP7, grant agreement 282900.