How real are observed trends in small correlated datasets?

The eye may perceive a significant trend in plotted time-series data, but if the model errors of nearby data points are correlated, the trend may be an illusion. We examine generalized least-squares (GLS) estimation, finding that error correlation may be underestimated in highly correlated small datasets by conventional techniques. This risks indicating a significant trend when there is none. A new correlation estimate based on the Durbin–Watson statistic is developed, leading to an improved estimate of autoregression with highly correlated data, thus reducing this risk. These techniques are generalized to randomly located data points in space, through the new concept of the nearest new neighbour path. We describe tests on the validity of the GLS schemes, allowing verification of the models employed. Examples illustrating our method include a 40-year record of atmospheric carbon dioxide, and Antarctic ice core data. While more conservative than existing techniques, our new GLS estimate finds a statistically significant increase in background carbon dioxide concentration, with an accelerating trend. We conclude with an example of a worldwide empirical climate model for radio propagation studies, to illustrate dealing with spatial correlation in unevenly distributed data points over the surface of the Earth. The method is generally applicable, not only to climate-related data, but to many other kinds of problems (e.g. biological, medical and geological data), where there are unequally (or randomly) spaced observations in temporally or spatially distributed datasets.


Introduction
Ordinary least-squares (OLS) estimation, applied to time-series data, identifies a model having the lowest squared difference from the observed data. A t-test may be used [1] to determine the confidence interval of the regression slope, or to test its significance, but the t-statistic may be invalid if the residuals are correlated with each other. This has been understood for some time [2][3][4], but we review existing techniques, concentrating on the generalized least-squares (GLS) formulation, and provide new techniques to make valid tests of trend or empirical model significance available for the challenging case of small datasets with strong positive error correlation. This approach is then generalized to randomly located data points in space.
We begin with a brief description of OLS and GLS, followed by an example to demonstrate problems with these conventional approaches. For the purposes of this study, we assume we are taking a limited set of samples from a field of values that have homoskedastic normally distributed error with respect to some model, with the error correlated as a function of distance in time or space between the samples.
In §2, we describe the conventional background of OLS estimation and GLS estimation of correlated time-series data, where a first-order autoregressive process is assumed to describe the residual correlation. As an example, we take a 40-year time series of atmospheric background carbon dioxide, comparing linear and parabolic regression models from conventional GLS estimation. This section concludes with some early published alternatives to the matrix algebra GLS approach.
Section 3 develops a new GLS estimate for equally spaced time-series data with highly correlated residuals. The performance of this new estimate is considered both from the point of view of providing an accurate t-distribution for regression slopes in the absence of any real trend, and for its ability to identify a real trend when one is present. The effect of this new estimate on the model confidence interval for our carbon dioxide example is presented.
The new GLS estimate of §3, for equally spaced data points in one-dimensional space, is generalized to randomly spaced data points in multi-dimensional space in §4, by replacing a first-order autoregressive model of error correlation with its equivalent exponential model in multi-dimensional space. We describe techniques to deal with data points co-located in that space. As an initial example, we apply this to the problem of combining two separate time-series datasets of monthly global temperature.
Section 5 provides examples applying the techniques of previous sections to real data, with two examples of time-series climatic data, both equally and unequally spaced, and an example of radioclimate data points unevenly distributed in two dimensions over the surface of the Earth.

Ordinary least-squares estimation
Formulating OLS estimation in matrix algebra, define y as the N × 1 vector of N response or time-series values, and X as the N × (k + 1) matrix of model parameters, each row beginning with 1 (to identify the intercept) followed by x_{1i}, …, x_{ki}. If looking for a trend in a time series, we may have k = 1, and only one parameter x_{1i}, representing time. In general, k is the number of parameters in the model, not including the intercept.
If e is the N × 1 vector of errors, independently normally distributed and homoskedastic, then the OLS estimator b_OLS for b in the model y = Xb + e is

b_OLS = (X′X)^{−1} X′y,   with residuals e = y − X b_OLS.   (2.1)
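As a concrete illustration, the estimator of (2.1) takes only a few lines of NumPy (the function name `ols_fit` and the toy data below are ours, not part of the paper):

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate b = (X'X)^{-1} X'y, with residuals e = y - X b, as in (2.1)."""
    b = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
    e = y - X @ b
    return b, e

# Noise-free line y = 2 + 3t: OLS must recover intercept and slope exactly.
t = np.arange(10.0)
X = np.column_stack([np.ones_like(t), t])   # first column of ones -> intercept
y = 2.0 + 3.0 * t
b, e = ols_fit(X, y)
```

With noise-free data the residuals are zero and b recovers the generating coefficients exactly.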

Data with correlated error
If the residuals are correlated, this suggests correlation between the errors, although the two are not the same, as b_OLS is only an estimate of b. If the errors are correlated, a better estimator may be found, if a reasonable model for this correlation is available. A first-order autoregressive process AR1(ρ) is often a good choice. In this paper, we restrict our consideration to that model, and an equivalent model for datasets randomly distributed in space. This model of error correlation is a short-memory model with exponential decay, which may overestimate trend significance in some cases [5], as compared with long-memory models with hyperbolic decay. However, we find that although the conventional techniques described below work well in estimating negative ρ in AR1(ρ) for small datasets, they provide a biased underestimate of positive ρ-values for small sample sizes, so in this paper, we address that issue.
The parameter ρ may be estimated from the lag 1 component, r_1, of the autocorrelation function (ACF) of the OLS residuals. In OLS estimation, the residual mean is zero, so we may write this estimate of ρ as ρ̂ = r_1, with

r_1 = (Σ_{i=2}^{N} e_i e_{i−1}) / (Σ_{i=1}^{N} e_i²).   (2.2)
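A minimal sketch of this lag 1 estimate, assuming zero-mean residuals (function name ours):

```python
import numpy as np

def acf_lag1(e):
    """Lag-1 autocorrelation of zero-mean residuals: sum(e_i * e_{i-1}) / sum(e_i^2)."""
    e = np.asarray(e, dtype=float)
    return float(np.dot(e[1:], e[:-1]) / np.dot(e, e))

# A strictly alternating residual sequence is strongly negatively correlated at lag 1.
e = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
rho_hat = acf_lag1(e)
```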

Correlation testing the ordinary least-squares scheme
Departure of either the ACF lag terms, or the Durbin–Watson statistic d, of the OLS residuals from the range of variation expected for uncorrelated errors may be used as a test for serial correlation affecting the validity of the OLS estimate. Our main concern is short-memory correlation, so we use

d = Σ_{i=2}^{N} (e_i − e_{i−1})² / Σ_{i=1}^{N} e_i² = SSRFD/SSR,

with a confidence interval based on exact expressions for its expected value E(d) and variance V(d), provided by Durbin & Watson [6] for the case of uncorrelated errors. These expressions involve the model matrix, so the distribution of d depends to some extent on the data in X. A number of methods have been proposed [7] to calculate tail-points of the distribution of d, but the beta approximation has been found [7] to perform well, and is now very convenient, given the availability of functions for the inverse cumulative beta distribution in current computer languages. In this approximation, d/4 is assumed to have a beta distribution β_{p,q}, with parameters p and q set so that the mean and variance of d equal E(d) and V(d), respectively. Taking upper and lower cumulative probabilities as 0.975 and 0.025, respectively, we find the limits of the 95% confidence interval for d, given the null hypothesis of no serial correlation. If d falls outside this range, we conclude either that OLS estimation is unsatisfactory, and an alternative such as GLS estimation must be used, or that the model being fitted to the data is unsuitable.
royalsocietypublishing.org/journal/rsos R. Soc. open sci. 6: 181089
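The beta approximation is straightforward to sketch in code. Below, `durbin_watson` computes d from residuals, and `dw_interval` matches a beta distribution for d/4 to a supplied E(d) and V(d); in practice these come from Durbin & Watson's exact expressions, so the illustrative values here are our own placeholders:

```python
import numpy as np
from scipy.stats import beta

def durbin_watson(e):
    """d = SSRFD / SSR: squared residual forward differences over squared residuals."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def dw_interval(Ed, Vd, alpha=0.05):
    """Interval for d under no serial correlation, approximating d/4 as beta(p, q)
    with mean Ed/4 and variance Vd/16 (Ed, Vd taken as given here)."""
    mu, var = Ed / 4.0, Vd / 16.0
    c = mu * (1.0 - mu) / var - 1.0   # matches beta mean and variance
    p, q = mu * c, (1.0 - mu) * c
    return 4.0 * beta.ppf(alpha / 2, p, q), 4.0 * beta.ppf(1 - alpha / 2, p, q)

lo, hi = dw_interval(Ed=2.0, Vd=0.16)   # illustrative E(d), V(d) values only
```

If d falls outside (lo, hi), the no-correlation null hypothesis is rejected at the chosen level.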

GLS estimation
If a correlation test, such as the Durbin–Watson test described above, suggests that OLS estimation is unsatisfactory, then the OLS estimator of (2.1) is modified by pre-multiplying both y and X by a matrix P. This matrix P aims to transform the error vector e, with correlated elements, into the vector Pe, with uncorrelated elements of equal variance. Pre-multiplying the problem by P, and observing the correct order of multiplication of transposed matrices, results in the GLS estimate

b_GLS = (X′P′PX)^{−1} X′P′P y = (X′S^{−1}X)^{−1} X′S^{−1} y,   where S^{−1} = P′P.   (2.8)

For the AR1(ρ) first-order autoregressive error process, with our homoskedastic assumption, we may take the symmetrical matrix S to be the matrix of correlations between samples, as

S_ij = ρ^{|i−j|},   i, j = 1, …, N,   (2.9)

which has a readily calculated inverse S^{−1}, consisting of only diagonal, superdiagonal and subdiagonal elements:

S^{−1} = (1/(1 − ρ²)) ×
    [  1    −ρ    0   ⋯   0
      −ρ  1+ρ²  −ρ   ⋯   0
       0   −ρ  1+ρ²  ⋯   0
       ⋮    ⋮    ⋮   ⋱   ⋮
       0    0    0   ⋯   1 ].   (2.10)
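For the AR1(ρ) case, both S of (2.9) and its tridiagonal inverse (2.10) are easy to construct directly; a sketch (function names ours):

```python
import numpy as np

def ar1_corr(N, rho):
    """S_ij = rho^|i-j|: AR1 correlation matrix for equally spaced samples, as in (2.9)."""
    idx = np.arange(N)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_corr_inv(N, rho):
    """Tridiagonal inverse of S, as in (2.10): only diagonal, super- and subdiagonal."""
    Sinv = np.zeros((N, N))
    np.fill_diagonal(Sinv, 1.0 + rho ** 2)
    Sinv[0, 0] = Sinv[-1, -1] = 1.0        # corner elements are 1
    i = np.arange(N - 1)
    Sinv[i, i + 1] = Sinv[i + 1, i] = -rho  # off-diagonals are -rho
    return Sinv / (1.0 - rho ** 2)

def gls_fit(X, y, rho):
    """b_GLS = (X' S^-1 X)^-1 X' S^-1 y for the AR1(rho) error process."""
    Sinv = ar1_corr_inv(len(y), rho)
    return np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)

# Noise-free data: GLS recovers the generating coefficients for any rho.
t = np.arange(8.0)
X = np.column_stack([np.ones_like(t), t])
y = 1.0 + 0.5 * t
b0 = gls_fit(X, y, 0.0)   # rho = 0 reduces to OLS
```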

Maximum-likelihood estimation
A standard method of estimating ρ is maximum-likelihood (ML) estimation. Here, we jointly find the most likely estimates for ρ and the model parameters b to explain the observations, assuming a Gaussian error process. In the case of the AR1(ρ) process, numerical evaluation may be avoided by obtaining the ML estimate of ρ conditioned on e_1 [8]:

ρ̂ = Σ_{i=2}^{N} (e_i − ē_{(2)})(e_{i−1} − ē_{(1)}) / Σ_{i=2}^{N} (e_{i−1} − ē_{(1)})²,   (2.11)

with ē_{(1)} = (N − 1)^{−1} Σ_{i=1}^{N−1} e_i and ē_{(2)} = (N − 1)^{−1} Σ_{i=2}^{N} e_i. An iterative scheme, re-evaluating the residuals as e = y − X b_GLS, and re-evaluating ρ̂ from (2.11) until it converges, may be used [9] to obtain the joint ML estimate of ρ and b. However, this is only valid for equally spaced data in one dimension.

Only the matrix S^{−1}, not P, appears in the GLS estimate b_GLS = (X′S^{−1}X)^{−1} X′S^{−1} y, but P is generally not uniquely defined by P′P = S^{−1}. As different real matrices P satisfying this give slightly different values of d, we need to choose P in a consistent way. For the GLS models described in this paper, we may choose P to be the real symmetric matrix that is the principal square root of S^{−1}. The confidence interval for d may be determined in the same way as before, except that p and q are now evaluated for the transformed problem, as in (2.13). If d falls outside the chosen confidence interval, it may indicate that S^{−1} does not adequately compensate for the correlation between errors, or it may indicate that the model being fitted to the data is not appropriate. We see an example of this in §2.8.
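The iterative joint ML scheme can be sketched as follows (a simplified illustration: for clarity we invert S directly rather than exploiting its tridiagonal inverse, and all names are ours):

```python
import numpy as np

def ml_rho(e):
    """Conditional ML estimate (2.11) of rho from residuals e."""
    e = np.asarray(e, dtype=float)
    e1 = e[:-1].mean()   # mean of e_1 .. e_{N-1}
    e2 = e[1:].mean()    # mean of e_2 .. e_N
    num = np.sum((e[1:] - e2) * (e[:-1] - e1))
    den = np.sum((e[:-1] - e1) ** 2)
    return float(num / den)

def joint_ml(X, y, tol=1e-8, itmax=100):
    """Alternate GLS fitting and rho re-estimation until the joint estimate converges."""
    rho, b = 0.0, None
    idx = np.arange(len(y))
    for _ in range(itmax):
        Sinv = np.linalg.inv(rho ** np.abs(idx[:, None] - idx[None, :]))
        b = np.linalg.solve(X.T @ Sinv @ X, X.T @ Sinv @ y)
        rho_new = ml_rho(y - X @ b)
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return rho, b

# Small white-noise example: the estimated rho should be well inside (-1, 1).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), np.arange(30.0)])
y = rng.standard_normal(30)
rho_hat, b_hat = joint_ml(X, y)
```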

Alternatives to GLS
The GLS formulation above is by no means the only way of analysing time-series data with autoregressive errors. The simplest approach [3,10] is to calculate an effective number of samples,

N_eff = N (1 − ρ)/(1 + ρ),   (2.14)

for use in calculating the standard error of the regression coefficients. We find, by comparing this method with GLS estimation with known ρ, that this approach is only applicable to calculating the standard errors of the regression coefficients, not the residual standard error. A recent study, detailing the calculation of regression coefficient standard errors with this method, is provided by [11]. However, this method may fail for small datasets with highly correlated residuals; e.g. if N = 38 and ρ = 0.9, we have N_eff = 2, leading to zero degrees of freedom for k = 2, and infinite standard errors. Even for less extreme cases, there is a risk of this approach over-compensating.
We propose a simple alternative, almost identical for large N, but avoiding this risk, by applying the scaling of (2.14) to the regression coefficient variance estimates, or the square root of this scaling to the OLS-calculated regression coefficient standard errors s_bOLS, to obtain an estimate for GLS estimation with known ρ:

ŝ_bGLS = s_bOLS √((1 + ρ)/(1 − ρ)).   (2.15)

The regression coefficient t-value is then obtained by dividing the coefficient by ŝ_bGLS as usual. A weakness of the schemes of (2.14) and (2.15) is that they do not allow for testing of the correctness of the error correlation model, as described in §2.6 above. Another simple scheme for dealing with correlated time-series data is Cochrane–Orcutt [4] estimation. This involves transforming the dataset by subtracting ρ times the previous sample value from each current sample value in y and X, to provide uncorrelated residuals in OLS estimation of the transformed dataset. The regression coefficients and t-values obtained converge to those of GLS estimation with known ρ as N becomes large compared with −1/ln(ρ), although the first observation is lost, leaving a dataset of size N − 1.
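The effective-sample-size idea of (2.14) and the scaling of (2.15) amount to one line each (function names ours), and the paper's cautionary case N = 38, ρ = 0.9 shows why (2.15) is safer:

```python
import numpy as np

def n_eff(N, rho):
    """Effective number of samples under AR1(rho) correlation, as in (2.14)."""
    return N * (1.0 - rho) / (1.0 + rho)

def se_scaled(se_ols, rho):
    """Eq. (2.15): scale OLS coefficient standard errors by sqrt((1 + rho)/(1 - rho))."""
    return se_ols * np.sqrt((1.0 + rho) / (1.0 - rho))

# N = 38, rho = 0.9 leaves only ~2 effective samples (zero degrees of freedom
# for k = 2), while the (2.15) scaling of the standard error stays finite.
neff = n_eff(38, 0.9)
se = se_scaled(1.0, 0.9)
```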

GLS estimation applied to atmospheric carbon dioxide background
As an example of GLS estimation, consider the 40-year record of monthly mean background atmospheric carbon dioxide concentration, measured at Cape Grim, Tasmania [12]. As there is a small annual variation with the seasons in the data, we take the January to December mean for each complete year, from 1977 to 2016. Using a linear model, of the form y = b_0 + b_1 t, the residuals are very highly correlated, with an extremely low Durbin–Watson statistic of d = 0.105. This indicates that OLS estimation with the linear model is not appropriate; this may be due to error correlation, or to the linear model being unsuitable, or both.
As described above, we assume a first-order autoregressive (AR) process for the errors, and estimate ρ to perform GLS estimation. The methods for estimating ρ described so far yield different results, but the Durbin–Watson tests on the transformed residuals, as described above in §2.6, do not yield results within the 95% confidence interval for uncorrelated residuals for any of the ρ estimates. The ACF lag 1 estimate is ρ = 0.825, giving d = 0.860, and the probability (two-tailed test assuming a beta distribution for d) of d being that far removed from E(d) is only 0.00003.
ML, iterated jointly for ρ and b, estimates ρ = 0.931, giving d = 1.167 and a two-tailed probability of 0.003, still well outside the 95% confidence interval. The Durbin–Watson estimate ρ̂ = 1 − d/2 is slightly higher, with ρ = 0.948, giving d = 1.212, but the two-tailed probability is still only 0.005, well outside the 95% confidence interval. This may suggest that the first-order AR process is an unsuitable model for the errors, but an alternative explanation is that the linear trend model is unsuitable for this data.
The latter explanation tends to be supported when we try a quadratic model to fit the data, of the form y = b_0 + b_1 t + b_2 t², where the times t have been shifted to zero mean. The OLS residuals now give d = 0.850, still indicating that GLS is required, but the estimates of ρ are now smaller: 0.446 for ACF lag 1, 0.575 for Durbin–Watson 1 − d/2, and 0.535 for MLE. Of these three, only the Durbin–Watson estimate gives a two-tailed test on the GLS residuals within the 95% confidence interval, with probability 0.065.
Only by assuming a cubic model, of the form y = b_0 + b_1 t + b_2 t² + b_3 t³, do we get a satisfactory result for the MLE ρ. We now have ρ = 0.518, giving d = 1.617, and a two-tailed probability of 0.078, just inside the 95% confidence interval. Durbin–Watson estimation of ρ yields a very similar result, with ρ = 0.508, giving d = 1.604, and a two-tailed probability of 0.071. The ACF lag 1 estimate is now ρ = 0.455, yielding a still unsatisfactory d = 1.537, and a two-tailed probability of 0.042. All these GLS estimates yield a significant t-value for the b_3 coefficient, in the region of 2.7, indicating that the cubic model is more appropriate than the linear or quadratic models.
The quadratic model for the Durbin–Watson estimate ρ̂ = 1 − d/2 and the cubic model for the ML estimate of ρ are shown in figure 1. The cubic model for the Durbin–Watson estimate is not shown, as it is almost identical to MLE. Prediction intervals, as defined in §2.9 below, are shown. However, the size of these prediction intervals may be underestimated, since the Durbin–Watson test on the transformed residuals in both cases only just falls within the 95% confidence interval.

The prediction interval and supervised machine learning
The prediction interval for OLS estimation, with a linear model of one parameter (k = 1), is defined in texts [1]; consistent with that definition, we define it here for all k, and for GLS estimation. If the best-fit curve and prediction interval are evaluated in the region beyond the available data, we may refer to this as supervised machine learning. If the mean prediction at [x_{02}, …, x_{0,k+1}] is ŷ_0, then the 100(1 − α)% prediction interval of y_0 is ŷ_0 − t_{α/2} s_p < y_0 < ŷ_0 + t_{α/2} s_p, with s_p defined for OLS and for GLS in (2.16). In (2.16), t_{α/2} is the upper α/2 tail-point of the t-distribution with N − k − 1 degrees of freedom, s is the residual standard error, and the s_{b_i} are the standard errors of the elements of b_GLS. This expression assumes the explanatory variables in X are uncorrelated. This may be achieved, when the data are assembled in X, by subtracting from each new variable its OLS estimate in terms of the previous variables. This decorrelation procedure is only required for an accurate estimate of the prediction interval by (2.16); it has no effect on the OLS or GLS estimate of y. However, (2.16) assumes the GLS transformation defined by S^{−1} accurately accounts for the error correlation. If the model of error correlation is an estimate, (2.16) may underestimate the prediction interval. Figure 2 demonstrates this potential problem, using both quadratic and cubic MLE fits to the first 25 years of the Cape Grim carbon dioxide background data, and using the subsequent 15 years to test the prediction interval. Training from just the first 25 years, a quadratic model with OLS estimation still does not pass the Durbin–Watson test on its residuals, but all the GLS methods mentioned above do pass this test on their transformed residuals.
In figure 2, we see that two of the 15 test points (13%) fall outside the MLE 95% prediction interval. A cubic model of carbon dioxide background concentration is not justified by the first 25 years of data, as the b_3 regression coefficient is not statistically significant (t = 0.58), resulting in an excessive prediction interval in the forecast region. Thus with MLE, as defined by (2.11), neither of these forecasts appears completely satisfactory. Very similar results are obtained with the other GLS methods described above; ACF in (2.2), and Durbin–Watson in (2.3).
GLS analysis of highly correlated data

A fundamental requirement
Our example of carbon dioxide measurements demonstrates highly correlated residuals, especially in the case of an ill-fitting regression model. It has been suggested [13] that a number of climatic parameters may naturally have a spectral characteristic with a slope approximating f^{−2}, or, in terms of a first-order autoregressive process, ρ approaching +1. Depending on the type of model chosen to fit the data, we have seen that high correlation between residuals of adjacent data samples may be present. A prime requirement of the GLS estimate is that, for the null hypothesis, i.e. when there is no real trend in the data, the estimated trend divided by the standard error of that trend should follow some known distribution. If ρ is known, this ratio follows the t-distribution for N − k − 1 degrees of freedom, as demonstrated by the dark blue curves in figure 3. If ρ is estimated rather than known, the distribution of this ratio will be different. Figure 3 demonstrates this for the non-iterated Yule–Walker (ACF lag 1) and Durbin–Watson estimates of ρ, as well as the iterated ML joint estimate of ρ and b of (2.11). All of these depart significantly from the t-distribution for N − k − 1 degrees of freedom, though the agreement between the non-iterated Durbin–Watson estimate and the iterated ML estimate is surprisingly close.

Improving the estimate of ρ
The curves of figure 4 show the mean value of various estimates of ρ, for different sample sizes N, with 1000 trials with AR1(ρ = 0.9) noise for each N. The 95% confidence interval for the ML case is indicated by the light green error bars. These curves suggest the differences between t-statistics with known ρ and estimated ρ, in figure 3, may be partly due to a bias towards underestimating ρ as it approaches 1.
A new estimate of ρ, with similar variance to MLE, but with greatly reduced bias for positive values approaching 1, is available by considering Durbin and Watson's exact expressions for E(d) and V(d) in the residuals of OLS estimation, for the ρ = 0 case, shown above at (2.6). The general idea is first to ensure that ρ̂ is unbiased for the ρ = 0 case. This still leaves a variance in ρ̂ that becomes unduly compressed for small N, so we scale the variance to reduce the bias for non-zero ρ.
The remaining problem is that ρ̂ must be restricted to the range −1 < ρ̂ < 1. As the beta distribution has been found [7] to be a good approximation to the distribution of d, an obvious way to restrict the range was by means of a transform and inverse transform using the cumulative beta distribution, but this proved to be numerically impractical for ρ well removed from zero. Instead, we use a hyperbolic tangent transform to predict ρ from the Durbin–Watson statistic d of the OLS residuals, and the expectation E(d) and variance V(d) from (2.6), giving the estimate at (3.1). We refer to this as the tanh adjusted Durbin–Watson (TADW) estimate of ρ. Although (3.1) is theoretically real, in practice slight numerical error sometimes generates a minuscule imaginary component, which must be removed to eliminate problems in subsequent calculations.
The red curve in figure 4 demonstrates the mean ρ prediction performance of TADW, together with 95% error bars, which are of similar magnitude to the MLE error bars. However, the bias in the TADW prediction is much less than in MLE, so the estimate of ρ from (3.1) is generally a better estimate than that provided by ML.

Accurate significance testing
Although the TADW estimate of ρ in (3.1) is less biased than the other estimates in figure 4, the calculated t-statistic for the regression slope still does not accurately follow the t-distribution for N − k − 1 degrees of freedom when N is relatively small, as demonstrated by the continuous brown line in figure 3a. This is to be expected, but it would be convenient to have an estimated t-value following the N − k − 1 degrees of freedom distribution, for consistency with the known-ρ case. From testing of many different cases, we find that a t-value reasonably accurately achieving this, in all but the extreme tails of the distribution, is obtained by simple extrapolation, given in (3.2), from the t-values t_dw obtained from GLS estimation with the simple Durbin–Watson ρ estimate of (2.3), and t_tadw obtained from GLS estimation with the improved Durbin–Watson ρ estimate of (3.1). Similar results, only slightly inferior, are obtained by calculating the t-value from the regression slope divided by a similarly extrapolated standard error s_bextrap, given in (3.4). This extrapolated standard error s_bextrap may then be used in constructing the prediction interval as defined in (2.16).

Separating actual trend from error correlation in small datasets is the significant challenge addressed by this study. As well as demonstrating a valid test of significance in the null hypothesis case, as shown in figure 3, we must demonstrate that real trends may be recognized with adequate sensitivity. In figure 5 we see the result of 100 tests with correlated noise plus a real trend, varying from a significant negative value, through zero, to significant positive. All methods recognize a very significant trend as such, but using MLE and assuming a t-distribution with N − k − 1 degrees of freedom overestimates significance relative to GLS with known ρ for 77% of cases in this test.
On the other hand, the extrapolated t-value from (3.2) overestimates significance relative to known ρ 38% of the time, and underestimates it 62% of the time, so it is slightly less sensitive to real trends than if ρ is known. This is to be expected, as we are relying on an estimate of ρ.

Cape Grim carbon dioxide data: a revised trend estimate
In figure 6, we repeat the plot of figure 1, but this time the GLS estimation uses the new TADW estimate of ρ from (3.1), together with extrapolated standard errors from (3.4), to calculate the prediction intervals. As the cubic model has a statistically significant t-value for the b_3 coefficient, we conclude, based only on this record, that the increase in background carbon dioxide concentration is statistically significant, with an accelerating trend.

GLS for randomly sampled data in N-dimensional space
So far we have examined uniformly spaced data points on a one-dimensional time-line, assuming an autoregressive model for the errors, which has a particular direction in time. However, the correlation matrix S in (2.9) is symmetric, and hence completely agnostic to the direction of time. This suggests generalizing the analysis, by considering its elements as radial basis functions of the radial distance r_ij, however defined, between data points i and j, of the form

f_ij = exp(−r_ij / r_0),   (4.1)

to give the matrix S with elements S_ij = f_ij (so that S_ii = 1).   (4.2)
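Constructing S as an exponential radial basis function of pairwise Euclidean distance, f(r) = exp(−r/r_0), is direct; the sketch below (names ours) also checks that equally spaced one-dimensional points with r_0 = −t_s/ln(ρ) reproduce the AR1 correlation matrix ρ^{|i−j|}:

```python
import numpy as np

def spatial_corr(points, r0):
    """S_ij = exp(-r_ij / r0): exponential decay of correlation with distance."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.exp(-d / r0)

# Equally spaced 1-D points with r0 = -t_s/ln(rho) give the AR1 matrix rho^|i-j|.
rho, ts = 0.7, 1.0
pts = (ts * np.arange(5.0))[:, None]   # 5 points on a line
S = spatial_corr(pts, -ts / np.log(rho))
```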
This matrix must be inverted to perform the GLS estimation of (2.8). This is generally successful, due to the exponential decay of the f_ij terms with distance. If k is large and correlation is considerable, an occasional problem is inversion of (X′S^{−1}X), so the availability of a GLS estimate for a particular type of model is not necessarily guaranteed.

Figure 5. GLS regression slope t-values for 100 randomly generated datasets, each with 100 data points, consisting of AR1(ρ = 0.75) noise e_i = N(0, 1) + ρ·e_{i−1}, plus a known slope. Here, N(0, 1) is a zero-mean, unit-variance, normally distributed variate. We set e_1 = N(0, 1)/√(1 − ρ²), to ensure stationarity. The GLS calculation is performed both with known and estimated values of ρ. The MLE (green trace) |t| exceeds the known-ρ |t| for 77% of these trials, while the extrapolated (dashed brown trace) |t| from (3.2) exceeds the known-ρ |t| in 38% of cases.
However, as we saw with the Cape Grim carbon dioxide data, residual correlation may be partly due to the choice of an unsuitable model for the data. A more suitable model may significantly reduce residual correlation, and resolve the problem if it occurs.

Estimating r_0
A useful tool to model the correlation between spatially distributed residuals is likely to be the semivariogram, used in kriging estimation [14,15]. This plots half the squared difference between all pairs of values against the distance between the locations of those values, so the semivariogram plots γ_ij = ½(e_i − e_j)² against distance r_ij. The aim is to find a function that is a reasonable fit to the scatter of points. At large distances r_ij the semivariogram is expected to tend towards a sill value, equal to the residual variance σ_e², and at smaller distances, a scatter of points around a function of the form

γ(r) = σ_e² (1 − exp(−r/r_0)).

In kriging, this is known as the exponential model, one of several commonly used semivariogram models. We use it here partly for consistency with the first-order autoregressive model for time-series data, and partly because in kriging it tends to have a stability advantage over some other models, without resorting to the nugget effect, due to its large slope near the origin, so it may have a similar advantage here. In kriging, the nugget effect relaxes the constraint that the fitted surface must pass through the known points. Another reason for using the exponential model here is that it is valid in space of any number of dimensions [15]. Naturally, residual correlation in equally spaced time-series data, as discussed above, may be analysed with a semivariogram; in fact, the asymptotic expected value of the semivariogram for this case, with first-order AR errors, is

γ(m) = σ_e² (1 − ρ^m),

where m is the distance in sample spacings, and r_0 = −1/ln(ρ).
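An empirical semivariogram cloud and the exponential model can be sketched as follows (our illustration; the class binning and weighted fitting described below are omitted):

```python
import numpy as np

def semivariogram_cloud(points, values):
    """Half squared difference for every pair of values, against pair distance."""
    i, j = np.triu_indices(len(values), k=1)   # all unordered pairs
    dist = np.linalg.norm(points[i] - points[j], axis=-1)
    gamma = 0.5 * (values[i] - values[j]) ** 2
    return dist, gamma

def exp_model(r, sill, r0):
    """Exponential semivariogram model: sill * (1 - exp(-r / r0))."""
    return sill * (1.0 - np.exp(-r / r0))

# A constant field has zero semivariance at every distance.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
vals = np.array([3.0, 3.0, 3.0, 3.0])
dist, gamma = semivariogram_cloud(pts, vals)
```

The model rises from zero at the origin towards the sill σ_e² at large r, as described above.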

Repeating the tests of figure 3, a least-squares, inverse-distance-squared weighted fit of class means (one m-value per class) proved to be an effective estimate of ρ, quite accurate for large N. However, these semivariogram results are almost identical to those provided by the simple Durbin–Watson estimate of (2.3) or the ML estimate of (2.11), underestimating ρ when it approaches 1 with smaller N. In the case of errors due to a first-order AR process, taking all residual differences into account with a semivariogram appears to have no advantage over the much simpler Durbin–Watson technique, which only considers residual differences between nearest neighbours.

Accordingly, we generalize the concept of the sum of squares of forward differences from a one-dimensional time-line to space of any number of dimensions, by constructing the nearest new neighbour path through space from data point to data point, starting with the point having the greatest sum of distances to all other data points. We then move from point to point, each time going to the nearest neighbour not already selected, until all data points are included in the path, as demonstrated in figure 7. Considering the OLS residuals at those points, we define the residual difference between each point and the next on the path to be the residual forward difference. In the case of equally spaced data points on a one-dimensional time-line, the sum of squares of these residual forward differences is the same SSRFD used to calculate the Durbin–Watson statistic in (2.3), so in the general case we use the term SSRFD for the sum of squares of the residual forward differences described here.
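The nearest new neighbour path and the generalized SSRFD can be sketched as follows (our illustration of the construction described above; function names ours):

```python
import numpy as np

def nearest_new_neighbour_path(points):
    """Start at the point with the greatest sum of distances to all other points,
    then repeatedly move to the nearest point not already on the path."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    path = [int(np.argmax(d.sum(axis=1)))]
    remaining = set(range(len(points))) - set(path)
    while remaining:
        nxt = min(remaining, key=lambda j: d[path[-1], j])
        path.append(nxt)
        remaining.remove(nxt)
    return path

def ssrfd(e, path):
    """Sum of squares of residual forward differences along the path."""
    e_ord = np.asarray(e, dtype=float)[path]
    return float(np.sum(np.diff(e_ord) ** 2))

# Equally spaced 1-D points: the path reduces to simple time order,
# so ssrfd matches the numerator of the Durbin-Watson statistic.
pts = np.arange(5.0)[:, None]
path = nearest_new_neighbour_path(pts)
val = ssrfd([0.0, 1.0, 0.0, 1.0, 0.0], path)
```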
For equally spaced time-series data with ρ > 0, the matrices S in (2.9) and (4.2) are identical if we set r_0 = −t_s/ln(ρ), where t_s is the time between adjacent data points. In the general case of points unevenly distributed in space of one or more dimensions, we no longer have a uniform distance between the points along our path through space from point to point, so the conventional methods described in §2, or our new methods described in §3, for specifying matrix S from d = SSRFD/SSR may not necessarily be appropriate; this needs to be tested.

Random sampling null hypothesis testing
The techniques described above may be tested for sampling at random locations in space, for the null hypothesis of no actual trend, by first generating N independent normally distributed samples, and then introducing correlation consistent with matrix S in (4.2), by pre-multiplying the vector of independent samples by the symmetric real matrix S^{0.5}.
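This null-hypothesis sample generator can be sketched as follows (our illustration; we form the principal square root of S by eigendecomposition, since S is symmetric positive definite):

```python
import numpy as np

def correlated_samples(S, rng):
    """Pre-multiply independent N(0,1) samples by the principal square root of S."""
    w, V = np.linalg.eigh(S)                          # S symmetric positive definite
    S_half = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return S_half @ rng.standard_normal(len(S))

# AR1-equivalent correlation matrix for 50 equally spaced points.
rng = np.random.default_rng(42)
idx = np.arange(50)
S = 0.8 ** np.abs(idx[:, None] - idx[None, :])
z = correlated_samples(S, rng)
```

The resulting vector z has covariance S by construction, and can be fed to the trend-significance tests above.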
The magenta dashed and dotted lines in figure 8 demonstrate that OLS estimation, or assuming r_0 = 0, provides a completely unrealistic confidence test on trend or regression model significance, if the regression coefficient divided by its standard error is assumed to be t-distributed with N − k − 1 degrees of freedom, while GLS estimation with known r_0 is reasonably accurate. The same semivariogram method that worked well with the equally spaced data now performs poorly with the data randomly located in one-dimensional space.
The Durbin–Watson-based techniques of (2.3) and (3.1) perform well, if we define r̂_0 = −r̄/ln(ρ̂), where r̄ is the mean distance between nearest new neighbours along the path.

Dealing with data points co-located in space or time
There may be more than one dataset available for a particular measured quantity, and they may each have measured values for the same time, or location in space. If the datasets have similar residual variance σ_e², this may be handled by introducing a distance r_d, assumed to be constant for all pairs of data points at the same time or location from the two datasets. This r_d represents the difference between datasets as equivalent to differences within the same dataset at that same distance. The value of the ratio k_d = r_d/r_0 may then be estimated from the m pairs of residuals e_i and e_j that are at the same location in time or space, as given in (4.7). If dealing with a single dataset that has multiple measurements at some of the locations, then for all cases, except in calculating the diagonal elements of S where i = j, replace (4.1) with (4.8). Alternatively, if combining two or more datasets, each with data at unique locations, but with the same locations in the datasets, use (4.1) when i and j are from the same dataset, but when i and j are from different datasets, use (4.8).
The semivariogram approach may still be used to fit the exponential function to the OLS residuals despite the m co-located data points, but to estimate ρ and hence r_0 from an equivalent of the Durbin-Watson statistic as described in §4.1, it is useful to construct a pre-multiplication matrix with N columns and N − m rows, which both sorts X and e into nearest new neighbour path order and takes means of co-located rows. The resulting reduced vectors and matrices may be used in estimating ρ, in the calculation of the mean distance between nearest new neighbours r̄, and in the GLS validity testing of §2.6.
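A sketch of constructing such an averaging pre-multiplication matrix, assuming one-dimensional path positions already in nearest new neighbour order (the function name and tolerance parameter are ours):

```python
import numpy as np

def colocate_matrix(positions, tol=1e-9):
    """Build the (N - m) x N matrix that sorts observations along a path
    and replaces each group of co-located rows by their mean.
    `positions` are path positions; m is the number of extra points
    sharing a location with an earlier one."""
    positions = np.asarray(positions, dtype=float)
    order = np.argsort(positions, kind="stable")
    groups = []                      # lists of original indices per location
    for idx in order:
        if groups and abs(positions[idx] - positions[groups[-1][0]]) <= tol:
            groups[-1].append(idx)   # same location: extend current group
        else:
            groups.append([idx])
    M = np.zeros((len(groups), len(positions)))
    for row, g in enumerate(groups):
        M[row, g] = 1.0 / len(g)     # mean of co-located observations
    return M
```

Pre-multiplying X and e by this matrix then yields the reduced system on which ρ and r̄ are estimated.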

Combining two datasets: global temperature anomaly data
To demonstrate the use of (4.7) and (4.8), our example is global monthly temperature data from two recently studied [16] surface climatological datasets: NOAAGlobalTemp [17] provides temperature anomaly data from 1880 to the present, and HadCRUT4 [18] provides similar data from 1850 to the present. For times after 1880, both datasets provide a global temperature anomaly for the same times, creating pairs of points with zero time difference if the two datasets are combined into one to obtain a combined model. The annual means for the 120-year period 1897-2016 for the two datasets are shown in figure 9, together with quadratic best-fit curves and prediction intervals obtained by two different methods.
For both datasets, the Durbin-Watson test on OLS residuals of the monthly data fails, indicating that GLS estimation may be required. However, in both cases the Durbin-Watson test on the transformed residuals of the monthly data, described in §2.6, fails for GLS estimation assuming a first-order AR model. While this might be corrected with a higher-order model for the correlation, a simple alternative is to take annual means of the monthly data. Using the annual means, GLS estimation assuming a first-order AR model is successful.
The simple approach is to take the mean of the two values for each year, and treat it as an equally spaced time series. Fitting a quadratic model, OLS estimation results in a residual Durbin-Watson statistic of 0.802, so clearly either GLS estimation is required, or the quadratic model is not appropriate, or both. Assuming a first-order AR process, the estimates of ρ from (2.2), (2.3), (2.11) and (3.1) are similar, yielding values of 0.582, 0.599, 0.592 and 0.637, respectively. All pass the Durbin-Watson test on the transformed residuals, with this test indicating (2.3) to be the best transformation, followed by (3.1), (2.11) and (2.2).
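The mean-of-datasets route can be reproduced in a few lines: compute the Durbin-Watson statistic d of the OLS residuals, take the simple estimate ρ = 1 − d/2 of (2.2), and refit by GLS. The sketch below uses the standard Prais-Winsten transformation for an AR(1) error model, which we assume is equivalent to the paper's pre-multiplication by P; the function names are ours.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic d = sum (e_t - e_{t-1})^2 / sum e_t^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def gls_ar1(X, y, rho):
    """GLS for an AR(1) error model via the Prais-Winsten transformation,
    a standard form of pre-multiplication by P with P'P proportional to
    the inverse of the AR(1) correlation matrix."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    P = np.zeros((n, n))
    P[0, 0] = np.sqrt(1.0 - rho ** 2)        # first-row scaling
    for t in range(1, n):
        P[t, t - 1], P[t, t] = -rho, 1.0     # quasi-differencing rows
    Xs, ys = P @ X, P @ y
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta, ys - Xs @ beta              # coefficients, transformed residuals
```

The transformed residuals returned here are exactly what the §2.6 validity test examines: their Durbin-Watson statistic should be near E(d) if the AR(1) model is adequate.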
The other approach retains the 120 annual means from each dataset as separate data points, dealing with the co-located points as described above in §4.3. Again we have ρ estimates from (2.3) and (3.1) of 0.599 and 0.637, respectively, and with r̄ = 1 year we have r_0 = 1.951 years and r_0 = 2.216 years, respectively. Of these, the estimate from (2.3) is again indicated by the Durbin-Watson test on the transformed residuals to be the better, although both are acceptable, with d = 2.083 and d = 2.165, respectively. The semivariogram estimate is r_0 = 3.27 years, which appears to be an overestimate, as it only just passes the transformed residual test, with d = 2.355.
The best-fit curves from both approaches are almost identical, with positive slope of +0.00798 degrees per year over the period shown in figure 9, curvature of +0.00007 degrees per year squared, and similar standard errors and prediction intervals. Both methods still indicate statistically significant positive slope and curvature of global temperature anomaly. Slope t = 12.84 for the mean approach and t = 12.93 for the separate approach, while curvature t = 3.59 for the mean approach and t = 3.64 for the separate approach.

Summary of the proposed GLS for data in multiple dimensions
The procedure developed above may be summarized as follows. Initially, perform OLS analysis of the data using the proposed model parameters. Then check for correlation between the residuals to test the validity of the OLS regression analysis. If the data are equally spaced in one dimension, this check is conventionally done by examining the ACF of the residuals, or their Durbin-Watson statistic, comparing with appropriate confidence limits. These are approximated in (2.7) for the Durbin-Watson statistic, as given in [6].
In the case of multiple dimensions, or unequally spaced data in one dimension, conventionally the semivariogram of the residuals may be constructed. We propose a new alternative: finding the nearest new neighbour path connecting the data point locations, described in §4.1 above and figure 7. Order the residuals according to this path, calculate the effective Durbin-Watson statistic accordingly, and test against the confidence limits given in (2.7).
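A greedy construction of such a path, as we read the description of §4.1, might look like the following; the choice of starting point and the tie-breaking are assumptions of this sketch, and the function name is ours:

```python
import numpy as np

def nearest_new_neighbour_path(points, start=0):
    """Greedy path visiting every point once, stepping each time to the
    nearest point not yet visited (a 'new' neighbour). Works for points
    in one or more dimensions."""
    pts = np.asarray(points, dtype=float)
    unvisited = set(range(len(pts))) - {start}
    path = [start]
    while unvisited:
        here = pts[path[-1]]
        # nearest not-yet-visited point to the current path end
        nxt = min(unvisited, key=lambda j: np.linalg.norm(pts[j] - here))
        path.append(nxt)
        unvisited.remove(nxt)
    return path
```

The OLS residuals are then reordered by this index list before computing the effective Durbin-Watson statistic.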
If these tests indicate correlation between residuals is unlikely, then assume that the OLS estimate and its t-test results on the significance of regression coefficients are reliable, and GLS estimation need not be undertaken.
If the Durbin-Watson statistic indicates spatial correlation is likely, a model for this correlation is required. Assuming an exponential model may be appropriate, we provide two estimates to trial: those of (2.3) and (3.1), each combined with (4.6) to estimate r_0 in the model. Perform GLS analysis using these two options, and the Durbin-Watson test on transformed residuals as described in (2.12) and (2.13). The better of the two options from (2.3) and (3.1) is indicated by the better of the two Durbin-Watson test results. If both fall outside the required confidence limits, then conclude that our simple exponential model is unsuitable, and instead use conventional semivariogram analysis, together with a wider range of models [15].

Figure 9. The two datasets of figure 5, NOAAGlobalTemp and HadCRUT4, combined into one dataset, with GLS estimation based on the simple Durbin-Watson estimate, ρ = 1 − d/2. Two approaches are used. One takes the mean of the two datasets for each year to combine into a single dataset (the red dashed best fit, and red dashed and dotted prediction interval), and uses the conventional methods for equally spaced time-series data described in §2. The other (the green continuous best fit, and green dashed prediction interval) keeps the 120 one-year means from each dataset as 240 separate observations, analysed as described in §4.3. The best-fit curves and the calculated prediction intervals are almost the same; the difference is barely visible here, due to the high correlation between the two datasets. Both methods indicate statistically significant positive slope and curvature of global temperature anomaly.
If the two Durbin-Watson test results for the models from (2.3) and (3.1) are both within confidence limits, and both less than E(d), with (3.1) closer than (2.3), then the t-value extrapolation (3.2) may be used. We suggest this criterion to limit its use, as real data may not behave the same as the ideal artificial data in our Monte Carlo testing of figures 3, 8 and 10, where (3.2) seemed generally a better option than (2.3) or (3.1).
There are a number of assumptions in the above procedure. The derivation of E(d) and V(d) in [6] was for OLS analysis of equally spaced one-dimensional data, but will be asymptotically correct in this more general case, so we assume these limits are reasonably accurate for small datasets in multiple dimensions. Another assumption of our method is that the spatial correlation is isotropic within the dimensions occupied by the data. Anisotropic semivariograms and their characterization are discussed in detail in [19]. Further detail on spatial statistical methods is provided in [20].
We assume that if the Durbin-Watson test on the pre-multiplied GLS residuals is satisfactory, then an exponential model is appropriate. If there is doubt about this assumption, a useful test is to construct a semivariogram of the pre-multiplied GLS residuals Pe defined in (2.12).
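Such a semivariogram of the pre-multiplied residuals can be computed with a minimal empirical estimator; the binning scheme and function name below are ours:

```python
import numpy as np

def semivariogram(coords, e, bins):
    """Empirical semivariogram: gamma(h) is the mean of 0.5*(e_i - e_j)^2
    over point pairs whose separation distance falls in each bin, where
    `bins` are the bin edges."""
    coords = np.asarray(coords, dtype=float)
    e = np.asarray(e, dtype=float)
    i, j = np.triu_indices(len(e), k=1)          # all unordered pairs
    diffs = (coords[i] - coords[j]).reshape(len(i), -1)
    d = np.linalg.norm(diffs, axis=1)            # pair separations
    sq = 0.5 * (e[i] - e[j]) ** 2                # semivariance terms
    which = np.digitize(d, bins)
    return np.array([sq[which == b].mean() if np.any(which == b) else np.nan
                     for b in range(1, len(bins))])
```

If the exponential model holds and the pre-multiplication has removed the correlation, the resulting semivariogram of Pe should be approximately flat.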
We note that our method described here only tests for and corrects short-range dependence of the error process. If long-range dependence of the error process is suspected, either from the physical nature of the problem, or from long lag results in the ACF or semivariogram of residuals, then the simple procedure described in this paper may not be appropriate. A comparison of short and long memory models is given in [5].
5. Examples of GLS with regular and irregularly sampled data

5.1. Antarctic ice core temperature data

As an example of uniformly sampled time-series data, we analyse data at 1-year intervals between AD 1800 and AD 1999 from several ice core locations within central Antarctica [21]. The authors describe a modest warming over the last 150 years, based on this data combined with other evidence, so we take their reconstructed temperature data for 1850-1999 as our example.
A significant positive trend of 0.184 degrees per century is suggested by OLS estimation, with t = 2.23, but this apparent significance may not be valid, as the residual Durbin-Watson statistic is d = 1.6. The beta model of the distribution of d indicates a probability of only 1% that d would be that far removed from the uncorrelated expected value, E(d) = 2.01 in this case. Inspection of the temperature data, plotted in figure 11, suggests a linear trend model to be appropriate, so we conclude from d = 1.6 that there is modest positive error correlation, and try GLS estimation using a first-order AR model. This proves to be successful for all the estimates of ρ described in this study. The danger of drawing conclusions from temperature trends observed over tens of years limited to the region of Antarctica has been described [21], and this caution is supported by our analysis of this reconstructed temperature data alone. The result in figure 11 is very different from the trend seen in the two worldwide datasets shown in figure 9.

Cape Grim atmospheric methane concentration
We now take methane measurements at Cape Grim, figure 12, as an example of non-uniformly spaced data. These measurements [12] are available monthly from August 1984 onwards, but the first five measurements, between April 1978 and 1984, were less frequent. This presents a problem for regression analysis, as the monthly measurements reveal a sinusoidal oscillation with a period of 1 year, seen in the red line in figure 12. This leads to severe underestimation of r_0 using a simple polynomial model for the data, whether by our Durbin-Watson approach or the conventional semivariogram method.
One potential solution to this problem is to add two parameters to the regression model for the data: the sine and cosine of 2πt, where t is the time in years. This provides an estimate of the phase and amplitude of the sinusoidal variation, as well as the overall trend, if that is required. However, to fit just the overall trend, a simpler solution is to analyse annual means only, after the start of the monthly data. This approach has been used in the model fitting of figure 12.
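The sine/cosine augmentation of the design matrix can be sketched as follows (function name ours); fitting by least squares then recovers the trend, with the amplitude and phase given by the two harmonic coefficients:

```python
import numpy as np

def seasonal_design(t, degree=1):
    """Design matrix with a polynomial trend plus sin and cos of 2*pi*t
    (t in years), capturing an annual oscillation alongside the trend."""
    t = np.asarray(t, dtype=float)
    cols = [t ** p for p in range(degree + 1)]      # 1, t, ..., t^degree
    cols += [np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)]
    return np.column_stack(cols)
```

For example, monthly data generated with a linear trend plus a pure annual sine term is fitted exactly, with the cosine coefficient returning as zero.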
The linear model from OLS estimation is shown in yellow, although a very low d = 0.096 indicates this is clearly not a valid result. The sparse measurements at the beginning of the series have greater influence on the GLS best-fit line, which is based on the estimate of ρ = 0.978 from our new model (3.1), leading to r_0 = 47.84 years. This very high apparent correlation may be due to an unsuitable model, as we saw earlier with the Cape Grim carbon dioxide data. In this case of methane data, the unsuitable linear model has the added disadvantage of overestimated errors, and an excessively large prediction interval.

Figure 11. The last 150 years of annual data from a 200-year Antarctic ice core temperature record, with linear best fit and 95% prediction interval from GLS estimation. The estimate of ρ = 0.216 for the AR process is from (3.1). The trend is 0.018 degrees per century, which is not statistically significant. The 150-year result from Antarctic ice cores is in contrast to the 120-year result from the two worldwide datasets of figure 9, which has a statistically significant slope of 0.8 degrees per century (t = 12.8) as well as upward curvature (t = 3.6).
A cubic model has regression parameters that are all statistically significant, and the estimate of correlation is much lower. Using (2.2) we have ρ = 1 − d/2 = 0.451, or r_0 = 1.31 years. The slope, curvature and cubic GLS regression coefficients are 3.88, −0.172 and 0.0105, respectively, with t-values 11.0, −12.6 and 8.3, respectively. The residual standard error is now 6.56 ppb, compared with 45.3 ppb for the linear GLS model.

Radio climate modelling in two dimensions
As our final example, we examine the regression modelling of clear-air atmospheric effects on microwave radio propagation in the surface layer of the atmosphere. The dominant clear-air characteristics of the atmosphere affecting radio propagation are vertical moisture gradients and temperature gradients, as they produce vertical refractive index gradients, which cause the predominantly horizontally propagating radio signals to curve up or down. It is important to know not only the mean gradient over the surface layer: gradient variations within the surface layer may also have a significant effect [22].
Ideally, we may imagine this problem being solved by numerical weather prediction (NWP) estimating the state of the lower atmosphere with sufficient resolution, of the order of 1 m vertically, and sufficient accuracy, to predict the radio propagation with a technique such as the parabolic equation model [23]. However, NWP models do not yet have sufficient resolution, or accuracy in many locations [24], so empirical regression models are used. The current accepted prediction model [25] is an OLS estimate from a number of climatic and physical parameters [26]. Recently, an improved OLS regression model has been developed [27], taking advantage of new multipath fading data from countries not previously included in the dataset, and using additional parameters from surface weather station data.

Figure 12. Cape Grim, Tasmania, background methane concentration measurements (thin red line), with annual sinusoidal variations evident after commencement of monthly measurements in mid-1984. This oscillation results in unacceptable values of the residual Durbin-Watson statistic d, so from this time on, annual means (dark blue line) are analysed instead. The best-fit line from OLS estimation is shown in yellow, although a very low d = 0.096 indicates this is clearly not a valid result. The sparse measurements at the beginning of the series have greater influence on the GLS best-fit line (green), in this case based on the estimate of ρ = 0.978 and hence r_0 = 47.84 years from (3.1). Such high apparent correlation may suggest an unsuitable model, as does the excessive size of the prediction interval. A cubic model (magenta curve) has statistically significant regression parameters, lower residual standard error, and a much more appropriate calculated prediction interval. The cubic model has ρ = 0.451 and hence r_0 = 1.31 years.

These new parameters were calculated from weather station temperature, pressure and dew-point measurements, and then interpolated to the radio link location by natural neighbour interpolation [28], using spherical geometry. This very effective technique [29] is perhaps under-used, as it is more difficult to code than kriging, yet it has some significant advantages: it uses only local data, does not require fitting a semivariogram function to the data, does not assume stationarity of the data, yet provides smooth contours, and always provides stable exact interpolation. Kriging, by contrast, often may only provide an approximate fit, employing the nugget effect [14] to remove the constraint of exactly fitting the data.
A significant motivation for this GLS study is that these fading prediction models were fitted to data from radio links around the world, with a very non-uniform distribution, seen in figure 13, and the concern that the statistical significance attributed to regression coefficients may be overestimated.
The most significant predictor of fade depth A_0.01% was found [27] to be a composite parameter v_1, expressed in terms of path length D (km), mean rayline height at standard refractivity gradient H_8500, and N_sA90-10, the interdecile range of the time series of surface refractivity anomaly N_sA [24]. Here, N_sA is the difference between the surface atmospheric radio refractive index and its median value for the same hour of the day and month of the year. The radio refractive index is estimated from atmospheric temperature, pressure and water vapour pressure [30,31].
The correlation of this purely empirical v_1 parameter with fading, as a single-variable model, seemed extraordinary, with OLS estimation indicating t = 18.94. The GLS estimates described here are unanimous in relegating this parameter to t = 10.0 as a single-variable model. We now review the full nine-variable model [27].
Although OLS estimation was used in the initial study [27], the spatial Durbin-Watson test we introduce in this study indicates moderate but significant positive spatial correlation, with d = 1.504, so we repeat the regression analysis with GLS estimation. Results are summarized in table 1. The parameters in this table are described in detail in [27].
There are 327 available data points [27], but only 180 of them are at unique locations; the rest are at locations with more than one data point, due to differences in parameters such as radio frequency or antenna height. For this nine-variable model, we estimate ρ = 1 − d/2 = 0.248, resulting in r_0 = 292.8 km, and with k_d = 0.3095 we have an effective co-located distance r_d = 90.6 km. The ρ = 1 − d/2 estimate from (2.2) is used, as the resulting Durbin-Watson statistic of the transformed GLS residuals is d = 2.001, which is slightly better than with (3.1) (r_0 = 319.5 km and d = 1.989), or with the semivariogram estimate (r_0 = 69 km and d = 1.997). The larger r_0 values generally result in a more severe test of significance than low r_0.

One of the interesting results of the OLS regression is significant and roughly equal and opposite correlation of the two parameters dN1 and dN01_ERAI, as they are both [25,32] predictions of the 1% point of the distribution of refractivity gradient in the 65 m surface layer of the atmosphere, by reanalysis of the NWP of the European Centre for Medium-Range Weather Forecasts. The difference is that dN1 is from an early reanalysis, at 1.5 degrees latitude and longitude resolution, of 2 years of weather data, while dN01_ERAI is from a much more recent and comprehensive reanalysis, at 0.75 degrees latitude and longitude resolution. Our GLS analysis now shows this arguably better version of the parameter to be insignificant in fading prediction, while the early version of the parameter remains significant, when used together with the other parameters in this regression model.

Figure 13. Radio links providing multipath fading data points for the OLS models. Data collected prior to 2007 was included in development of the current ITU-R model [25], and the later data was added in developing the recent model [27].

Conclusion
The statistical significance of OLS regression models may be overestimated when residuals are correlated with each other. Often this is a positive correlation between residuals that are nearby in space or time. Although there are existing GLS techniques for equally spaced time-series data to counteract this, they may not provide adequate correction in the case of highly correlated residuals and small datasets.
For equally spaced data, we describe a new estimate of the first-order autocorrelation, based on the Durbin-Watson statistic d, of the OLS residuals, and the exact expressions for expected value and variance of d. Monte Carlo testing shows this new non-iterative estimate to have similar variance to iterative ML estimation, but with significantly less bias for small positively correlated datasets. Further testing shows that this new model provides a valid statistical test of trend or regression model significance, using a standard t-test.
We generalize these techniques to randomly spaced data in the space of one or more dimensions, by the new concept of the nearest new neighbour path. This is compared with the standard technique of semivariogram estimation, and found to be superior in Monte Carlo testing. Tests of real data described here suggest the value of applying these techniques as well as semivariogram estimation, as we describe a validity test for the GLS scheme that may be applied to data randomly located in space.
Our focus in this study is on small datasets, so we considered only first-order AR models in the case of equally spaced data in one-dimensional space, and the equivalent exponential model in the general case. All the methods described here are asymptotically correct, so may be applied to large datasets; except that long-memory models of correlation may need to be considered. However, examples presented here suggest that for smaller datasets, choosing an appropriate model for the response data is often more important than the choice of model for the error correlation.