Fast methods for training Gaussian processes on large datasets

Gaussian process regression (GPR) is a non-parametric Bayesian technique for interpolating or fitting data. The main barrier to further uptake of this powerful tool rests in the computational costs associated with the matrices which arise when dealing with large datasets. Here, we derive some simple results which we have found useful for speeding up the learning stage in the GPR algorithm, and especially for performing Bayesian model comparison between different covariance functions. We apply our techniques to both synthetic and real data and quantify the speed-up relative to using nested sampling to numerically evaluate model evidences.


Introduction
A wide range of commonly occurring inference problems can be fruitfully tackled using Bayesian methods. A particularly common inference problem is that of regression; determining the relationship of a control variable x to an output variable y given a set of measurements of {y i } at points {x i }. The solution requires a model y = f (x), which allows us to predict the value of y at an untested value of x. From a Bayesian standpoint, this can be achieved using Gaussian processes (GPs): a GP is a collection of random variables, of which any finite subset have a joint Gaussian probability distribution [1].
Gaussian process regression (GPR) is a powerful mathematical technique for performing non-parametric regression in a Bayesian framework [1][2][3][4][5]. The key assumption underpinning the method is that the observed dataset being interpolated is a realization of a GP with a particular covariance function. This assumption presents us with a challenge: how do we choose the covariance function which gives the best interpolant? The process of choosing the covariance function is known as learning, or training of the GP. In this training process, it is necessary to compute the inverse of the covariance matrix (the matrix formed by evaluating the covariance function pairwise between all n observed points). The time taken to evaluate the inverse of the covariance matrix scales as O(n 3 ) [1], where n is the number of points being interpolated; this has typically restricted the application of GPR to smaller problems (n 10 5 ), although work has been done on extending its applicability to larger datasets [6][7][8][9].
In this paper, we present two techniques that speed up the training stage of the GPR algorithm. The first aims to reduce the dimensionality of the problem, and hence speed up the learning of the hyperparameters for a single covariance function. This does not change the fact that the cost of this process is O(n 3 ); instead it simply reduces the constant in this scaling. The second aims to enable fast Bayesian model comparison between different covariance functions while also incorporating the benefits of the first technique.
We consider maximizing the hyperlikelihood: the conditional probability of the data given a particular set of hyperparameters used to specify the covariance function. 1 We provide an expression for the Hessian matrix of the hyperlikelihood surface and show how this can be used as a valuable tool for comparing the performance of two different covariance functions. We also present modified expressions for the hyperlikelihood, its gradient and its Hessian matrix, which have all been analytically maximized and marginalized over a single-scale hyperparameter. This analytic maximization or marginalization reduces the dimensionality of the subsequent optimization problem and hence further speeds up the training and comparison of GPs.
These techniques are useful when attempting to rapidly fit large, irregularly sampled datasets with a variety of covariance function models. The authors have previously made use of these techniques in exploring the correlation structure of the differences between complicated waveform models in the field of gravitational-wave astronomy [10,11]; this was done so that the effect of different models on the parameter inferences could be marginalized over. There, the behaviour of the data was largely unknown a priori and it was necessary to quantitatively compare a wide range of different covariance functions. Work in this area with larger datasets is ongoing.
In §2, we review the GPR method and discuss methods of efficiently determining a covariance function. In §2.1, we present our expression for the Hessian of the hyperlikelihood along with a discussion of how it can be used for model comparison, and in §2.2, we show how the training of the GP can be accelerated by analytically maximizing or marginalizing the hyperlikelihood over a singlescale parameter. In §3, we apply these methods to both synthetic and real datasets, and compare the computational cost to that of a full numerical evaluation of the Bayesian model evidences. Finally, a brief discussion and concluding remarks are given in §4.

Gaussian process regression and training
The technique of GPR is a method for interpolating (or extrapolating) the data contained in a training set D = {x, y}. The vector x = {x i | i = 1, 2, . . . , n} is called the input vector and the output vector is given by y i = f (x i ) for some unknown function f . The method works by assuming that the data have been drawn from an underlying GP f (x) ∼ GP(μ(x), k(x, x )) with specified mean μ(x) (usually assumed to be zero) and covariance function k(x, x ).
There is freedom in specifying the covariance function; common choices, such as the squared exponential and Matérn function, include a number m of free hyperparameters θ = {θ i | i = 1, 2, . . . , m} that control the properties of the GP, i.e. k(x, x ) = k(x, x ; θ).
The predictive power of the method comes from computing the conditional probability of the function taking a given value at some new (n + 1)th input point x * , given the observed values in D and the values of the hyperparameters θ. This predictive probability distribution P(y(x * ) | D, θ) for the function at the new point is a Gaussian with meanȳ(x * ) and variance [σ y (x * )] 2 [1], and Since the posterior distribution for (2.2) relies upon the form of the covariance, GPR cannot be used to make definite predictions until we have fixed a method for dealing with the unknown hyperparameters θ . Ideally, we place a prior probability distribution on θ and make predictions by evaluating the integral where we have used Bayes' theorem to obtain the second equality. We have introduced the hyperlikelihood given by which encodes the probability that the observed (training) data were drawn from a GP with covariance function k. The integral (2.4) is almost always analytically intractable and prohibitively expensive to evaluate numerically. A common approximate approach is to use the most probable values of the hyperparametersθ, which maximize P(θ | D) [12][13][14].
Assuming the prior distribution is sufficiently flat (or uninformative) over the region of interest, this is equivalent to maximizing the hyperlikelihood P(y | x, θ). Under this approximation, the predictive distribution becomes which is simply the Gaussian in (2.2) with mean and variance evaluated atθ. Implementing the above procedure requires numerically maximizing the hyperlikelihood in (2.5). This can be computationally expensive; in §2.1 and §2.2, we present methods for reducing the cost of maximizing the hyperlikelihood.

Using the gradient and Hessian
The maximization process may be accelerated if the gradient of the hyperlikelihood is known and a gradient-based algorithm, such as a conjugate gradient method [13,15], can be used. The gradient of the logarithm of the hyperlikelihood is given by [1] ∂ θ ln P(y | x, θ) = 1 2 y T K −1 · ∂ θ K · K −1 y − 1 2 Tr(K −1 · ∂ θ K). (2.7) This can be shown by differentiating (2.5) and making use of the standard results The gradient in (2.7) is useful because the rate-determining step in computing the hyperlikelihood is computing the inverse matrix K −1 (usually achieved through a Cholesky decomposition in practice), which is an O(n 3 ) operation. All other steps in (2.5) scale as O(n 2 ) or less. 2 Once the inverse has been calculated, the gradient in (2.7) may also be evaluated in O(n 2 ); so in evaluating the hyperlikelihood for a large training set we can also get the gradient for negligible extra cost. The procedure outlined above can be performed for multiple covariance functions, each yielding a different GP interpolant. It is, therefore, necessary to have a method of comparing the performance of different interpolants to decide which to use. One way to achieve this is to evaluate the (hyperpriorweighted) volume under the hyperlikelihood surface, the hyperevidence, and use this as a figure of merit for the performance. Evaluating this integral is prohibitive, so an approximation is to calculate the Hessian matrix of the ln P(y | x, θ) surface at the peak (the position and value of which have already been found) and to analytically integrate the resulting Gaussian. This procedure assumes flat (or slowly varying) hyperpriors in the vicinity of the peak, but this has already been assumed in going from (2.4) to (2.6). Differentiating the gradient in (2.7), again making use of the results in (2.
This expression has the same advantages as the expression for the gradient; as the inverse of the covariance matrix has already been computed, the Hessian may be evaluated at negligible extra cost. The hyperlikelihood surface may therefore be approximated by the Gaussian [12,16] ln We seek the hyperevidence, which is given by the following integral of the hyperposterior, where we have specified a prior Π (θ) on the hyperparameters; Assuming the hyperposterior is a sufficiently well-peaked distribution, with peak at position θ =θ, the hyperevidence may be written using the Laplace approximation [2] as It is always possible to change the hyperparametrization so that the prior is flat in which case the hyperposterior is proportional to the hyperlikelihood. 3 If such a hyperparametrization has been chosen then Π (θ ) = 1/V (where V is the hyperprior volume, or range of integration), H Π = 0 andθ =θ; therefore This expression is now invariant under further changes to the hyperparameter specification which preserve the property that the prior is constant. We use hyperparametrizations with flat hyperpriors as this choice uniquely specifies the approximation in equation (2.13); although there remains the possibility that another hyperparametrization exists in which the posterior is better approximated as a Gaussian. For two covariance functions, k 1 and k 2 , the odds ratio may be defined as the ratio of the value of (2.13) evaluated with k 1 to the value evaluated using k 2 , and this may be used to discriminate among competing models. The hyperprior volume V in (2.13) acts as an Occam factor, penalizing models with greater complexity [2]. Once suitable prior volumes have been fixed, the Hessian approximation to the hyperevidence is a computationally inexpensive means of comparing covariance functions.
The Hessian may also be used to provide error estimates for the hyperparameters; from (2.10) it can be seen that the inverse of the Hessian is the covariance matrix of the maximum hyperlikelihood estimator of the hyperparameters.

Partial analytic maximization
In general, covariance functions can be arbitrarily complicated, with large numbers of hyperparameters. Inevitably, simple covariance functions are the most prevalent in the literature. If there are a small number of hyperparameters, then even reducing the number of hyperparameters by one can have a great impact on the length of time taken to maximize the hyperlikelihood. In this section, we show how the hyperlikelihood for any covariance function, regardless of complexity, can be analytically maximized over an overall scale parameter, thereby reducing the number of remaining hyperparameters. We also generalize the expressions for the gradient and the Hessian found in §2.1 to this case.
Consider the following transformation of the covariance, k( ; substituting this into the expression for the hyperlikelihood gives, This function always has a unique maximum with respect to variations in σ 2 f at the position These are not the same as the derivatives in (2.7).
As well as maximizing, we can also consider marginalizing over σ f [16]. As we are marginalizing over a scale parameter we use the (improper) Jeffreys prior P(σ f ) = c/σ f [17]. The result is equal to the maximized form, up to a multiplicative constant, θ). (2.18) As before, once the peak hyperlikelihood has been found, the Hessian at the peak position can aid in model comparison. In this case, the Hessian should be calculated using the second derivatives of ln P marg . However, we may instead differentiate ln P max , as this differs only by a constant which will cancel when using the Hessian to compare two models. Differentiating (2.17) with respect to ϑ , 4 Again, these are not the same as the derivatives in (2.9). These expressions for the gradient and the Hessian of the hyperlikelihood, maximized or marginalized over σ 2 f , share the same advantages as the analogous expressions in §2: they may be evaluated in O(n 2 ) time once the hyperlikelihood itself has been evaluated in O(n 3 ) time.

Numerical results
In order to perform model comparison calculations between competing covariance functions, we must first specify at least two different covariance functions. We choose the two functions in (3.1) and (3.2), where (t, t ) ≡ (x, x ). These functions are both based on the periodic covariance function proposed by [2]. The first function k 1 is the product of a single periodic component with time scale T 1 and a simple compact-support polynomial covariance function [18] to describe any non-periodic component of the data. The choice of a compact-support covariance function is especially useful when working with large datasets; this is precisely the situation where the techniques described above are also designed to be of maximum benefit. The second function k 2 includes an additional periodic component with time scale T 2 . In order to avoid double-counting in k 2 , we impose the constraint T 2 ≥ T 1 . Both covariance functions also include an uncorrelated noise term; we define this in such a way that σ f remains an overall scale 4 Here,σ f retains its maximum ln P value from (2.15), although the new maximum ln P marg value has actually now shifted to (σ f ) 2 = nσ 2 f /(n − 1) due to the effect of the hyperprior. For large datasets (n 1), the difference between the two is negligible (the hyperprior becomes uninformative as it is overwhelmed by the hyperlikelihood). hyperparameter which can be maximized or marginalized over analytically as described in §3.2. and The covariance functions are completely specified by the hyperparameters σ f (overall scale), T j (j = 0, 1, 2; time scales) and l j (j = 1, 2; smoothing parameters for the periodic components). The noise parameter σ n could also be taken to be a hyperparameter; instead, for simplicity, we here take σ n to be fixed. As σ n appears in k multiplied by the overall scale, σ f , fixing σ n is roughly equivalent to specifying a fixed fractional error.
We want to perform model comparison using the Laplace approximation outlined previously. This technique requires reparametrizing the covariance function such that the hyperpriors are flat. For the time-scale hyperparameters, which are dimensionful, we choose to use the scale-invariant Jeffreys prior, P(T j ) ∝ 1/T j . This prior is improper if the range of T j is (0, ∞), so we restrict the range to (δt, T), where δt and T are respectively the smallest and largest separations between the sampling points. If there was a time scale in the problem outside of this range, we would be unable to resolve it from the data. We now seek a transformation φ j ≡ φ j (T j ) to a new hyperparameter φ j such that the prior is flat in this parameter, P(φ j ) = const. The conservation of probability gives a differential equation relating the two where the A j s are constants which we can set equal to 1. The range of these new hyperparameters is φ j ∈ (ln(δt), ln( T)) and P(φ j ) = 1/ ln( T/δt). For the smoothness parameters l j , we choose to use lognormal priors, P(l j ) = exp[−(μ − log l j ) 2 / (2σ 2 l )] 2πσ 2 l , with mean μ = 1 and variance σ 2 l = 4. As before, we seek a transformation to some new hyperparameters ξ j in which the prior is flat. The desired transformation is given by where ξ j ∈ (−0.5, 0.5).

Synthetic data
Shown in figure 1 are realizations of GPs with covariance functions k 1 and k 2 . 5 In order to perform test model comparison calculations, a realization of the k 2 GP with n points was drawn and analysed using both the k 1 and k 2 covariance functions. For each covariance, the peak hyperlikelihood was found by numerically maximizing (2.14) using a conjugate gradient method, making use of the gradient in (2.17). The hyperevidence was estimated using (2.13) and the expression for the Hessian in (2.19); the results are summarized in table 1. To verify the accuracy of this estimate, the hyperevidence was also integrated numerically using MULTINEST [19][20][21], which implements a nested sampling algorithm [22]. This was repeated for three different values of n (in the case n = 100, the synthetic data are plotted in the right-hand panel of figure 1), and the results are also summarized in table 1. From table 1 it can be seen that as n is increased, the Bayes factors increasingly favour the more complicated covariance function (and in this case the correct covariance function from which the data was drawn). In almost all cases, the Laplace approximation gives a value ln Z est which is in agreement at better than 2σ with the numerically integrated value ln Z num . There is one exception which is highlighted in italic; this occurs for the most complicated covariance function (with the largest number of hyperparameters) and when the number of data points is smallest. In this situation, it would be expected that the posterior distribution on the hyperparameters may be highly multimodal and/or exhibit strong degeneracies (both of these expectations were confirmed by examining the posterior distribution on the  (a,b) panels, respectively. The horizontal black lines indicate the length scales associated with the different terms in the covariance functions. The hyperparameters for k 1 were chosen to be σ f = 1, φ 0 = 3.5, φ 1 = 1.5 and x 1 = 0. The hyperparameters for k 2 were chosen to be the same as for k 1 and φ 2 = 3 and x 2 = 0. In both cases, the noise was fixed to σ n = 10 −2 . Table 1. A summary of the results of the analysis of synthetic data for three different-sized datasets. The first set of two columns is for a dataset drawn from the k 2 covariance function and analysed with the k 1 covariance function. The first column is the estimated hyperevidence using the Laplace approximation where Z is as given in equation (2.13), while the second is the numerically calculated hyperevidence. The second set of two columns shows results for the same data, but analysed with the k 2 covariance function. The final pair of columns shows the log Bayes factor, ln B ≡ ln Z k 2 − ln Z k 1 , calculated using the approximate and numerical values for the hyperevidence. hyperparameters returned by MULTINEST). This exceptional case serves to highlight situations in which the Laplace approximation should not be trusted. The MULTINEST posteriors in all other cases were verified to be well approximated by a single Gaussian mode. Figure 2 shows the posterior distribution for the parameters of k 2 obtained from the largest (n = 300) synthetic dataset. Our method of model comparison is proposed as a faster alternative to model comparison using numerically evaluated Bayes factors. Simply comparing the peak hyperlikelihood (marginal likelihood) values would also give a measure of the goodness of fit, but this tends to favour more complex models and incurs the risk of overfitting. More sophisticated methods of model selection exist in the literature (see [23,24] and references within), e.g. the comparison of models based on estimated predictive criteria [25][26][27], or the construction of a larger reference model and the subsequent selection of a simpler submodel with similar predictions [28,29]. A detailed numerical comparison to these methods is left for future investigation.
The ln Z num values in table 1, evaluated using MULTINEST, required between 20 000 and 50 000 likelihood evaluations. The maximization routines typically took fewer than 100 likelihood evaluations to find the peak, and then one additional evaluation to calculate the Hessian and hence ln Z est . In order to guard against the possibility of the maximization routines becoming trapped in local maxima, as opposed to the global maximum, the algorithm was run multiple times from randomly selected starting positions. The typical number of runs required to find the global maximum was approximately 10. After these duplicate runs are accounted for, the speed-up factor in calculating ln Z est compared with ln Z num was between 20 and 50 in all cases.

Tidal data from Woods Hole
In order to illustrate the effectiveness of the techniques described above on real data, we consider several tidal datasets of different sizes from Woods Hole, MA, USA [30]. 6 We consider the mean sea-level offset   Figure 2. The one-and two-dimensional marginalized posterior distributions on the hyperparameters of the k 2 covariance function obtained from the largest (n = 300) synthetic dataset. The posterior is well approximated as a normal distribution. Shown in the black curves in the one-dimensional marginalized posterior distributions along the diagonal are the normal approximations obtained by using the techniques described in §2 to maximize and find the Hessian. Using the Hessian to approximate the integral of this distribution (the hyperevidence) leads to an error of approximately 10% (table 1). recorded between 3 January and 15 June 2014, six lunar months, sampled at 2 h intervals (giving n = 1968 data points). This is plotted in figure 3. We also consider a smaller subset of the data (the first lunar month), with n = 328 data points. 7 We interpolate the data using the two covariance functions in (3.1) and (3.2); these functions are well suited to the data, as we expect the sea level to contain harmonics of the various time scales associated with the daily, monthly and yearly cycles of the tides. For simplicity, we fix σ n = 10 −2 , which is the typical fractional error in the sea-level measurements. As in §3.1, we reparametrize the covariance functions so that the T j have Jeffreys priors, and the smoothness parameters have lognormal priors. We use a conjugate gradient maximization algorithm with (2.17) and the Hessian in (2.19) to evaluate the volume in (2.13) and perform model comparison between the two covariance functions.
For the smaller dataset, we find the time scale T 1 = (12.8 ± 0.2) h with k 1 , which corresponds to the two main tides per day. With k 2 , we find the time scales T 1 = (12.44 ± 0.07) h and T 2 = (24.3 ± 1.0) h; the second time scale corresponds to the height difference between the first and second tides of the day. The two time-scale model is highly favoured with a log Bayes factor of 57.8.  For the larger dataset, we find T 1 = (12.80 ± 0.11) h with k 1 , and T 1 = (12.40 ± 0.03) h and T 2 = (23.3 ± 0.3) h with k 2 . In all cases, the (squared) errors are estimated using the diagonal components of the inverse Hessian; it can be seen that the time scales are more precisely measured for the larger dataset, as expected. The two time-scale model is even more conclusively favoured for the larger dataset, with a log Bayes factor of 538. We also find a number of subsidiary hyperlikelihood peaks associated with other time scales in the data, but all subsidiary peaks are strongly suppressed relative to the global peak (by at least ln P of approx. 100) and so we expect our Bayes factor estimates to be robust. Sea-level data are known to contain a large number of different frequencies, which necessitates the use of harmonic analysis in tidal modelling; the number of constituents included in tide prediction calculations has increased from tens [31] to thousands [32] over the past century. Clearly any k 2 -like covariance function with fewer than 10 time scales is simplistic, but the construction of a more detailed tidal model is beyond the scope of this paper. 8 The number of evaluations of (2.13) needed to obtain these results was comparable with the numbers for the synthetic data discussed in §3.1. However, each evaluation here was more expensive (approx. 10 s) due to the size of the dataset. Based on the speed-ups found in §3.1, it would be expected that MULTINEST would take up to approximately one week to calculate the Bayes factor.
Shown in the inset plot in figure 3 are the two interpolants from k 1 and k 2 for the larger dataset, which both perform equally well on the time scale of one week. These interpolants, which are the result of the regression analysis, may be used to estimate the tidal height at a time where a measurement is not available.

Summary
We have described some simple ways in which the computationally expensive training stage of implementing GPR can be accelerated. The analytic maximization of the hyperlikelihood over a single-scale hyperparameter of the covariance function aids in speeding up the maximization of the hyperlikelihood by reducing the dimensionality of the problem; the advantages of this will be most keenly felt in (common) problems where relatively simple covariance functions are used. Meanwhile, the analytic evaluation of the Hessian matrix, either in the manner of (2.9) or (2.19), aids in speeding up the process of model comparison between different types of covariance function. We have successfully demonstrated these techniques on a synthetic dataset where the data was drawn from one of two covariance functions under consideration. In the case of the synthetic data, the size of the dataset was limited to fewer than 300 points, so that the results could be verified by using the MULTINEST algorithm to numerically sample and integrate the posteriors. We also demonstrated the techniques by applying them to a larger real dataset of mean sea-level measurements, where the full MULTINEST calculation would have taken too long to perform. It is to be hoped that these techniques will aid in the wider application of GP methods to larger datasets.
Data accessibility. The original source of the tidal data is the National Oceanic and Atmospheric Administration (http://tidesandcurrents.noaa.gov/waterlevels.html, accessed August 2015). The subset of data used here, as well as our synthetic datasets and our code is available at http://hdl.handle.net/10283/1924.