Model building and assessment of the impact of covariates for disease prevalence mapping in low-resource settings: to explain and to predict

This paper provides statistical guidance on the development and application of model-based geostatistical methods for disease prevalence mapping. We illustrate the different stages of the analysis, from exploratory analysis to spatial prediction of prevalence, through a case study on malaria mapping in Tanzania. Throughout the paper, we distinguish between predictive modelling, whose main focus is on maximizing the predictive accuracy of the model, and explanatory modelling, where greater emphasis is placed on understanding the relationships between the health outcome and risk factors. We demonstrate that these two paradigms can result in different modelling choices. We also propose a simple approach for detecting over-fitting based on inspection of the correlation matrix of the estimators of the regression coefficients. To enhance the interpretability of geostatistical models, we introduce the concept of domain effects in order to assist variable selection and model validation. The statistical ideas and principles illustrated here in the specific context of disease prevalence mapping are more widely applicable to any regression model for the analysis of epidemiological outcomes but are particularly relevant to geostatistical models, for which the separation between fixed and random effects can be ambiguous.


Introduction
In this paper, our aim is to provide a guiding framework for the formulation, application and validation of geostatistical models [1] for disease prevalence mapping. Model-based geostatistics (MBG) has been extensively used to address scientific problems whose primary objective is to make probabilistic inference on a spatially continuous phenomenon, using data collected over a finite set of geo-referenced locations.
Here, we focus on cross-sectional surveys that are conducted in order to understand the spatial variation of disease prevalence within a geographical area of interest. In low-resource settings, disease registries are typically absent and cross-sectional surveys often provide the only source of health outcome data that can be used to infer the disease burden in a population. In this context, the available data consist of a set of household locations x_i, the numbers n_i of individuals who have been tested for the disease of interest, and the numbers y_i out of n_i who have returned a positive test result. MBG provides a statistically principled, likelihood-based approach for predicting disease prevalence, p(x), at any desired location x in the area of interest by exploiting the spatial correlation between the observations y_i. In order to improve the efficiency of spatial predictions for prevalence, MBG models also allow the inclusion of covariates, d(x), i.e. variables that are observed and recorded at a location x and are considered to be associated with p(x). In this paper, our focus will be on the use of covariates d(x) that are available on a regular grid of locations covering the study area (also known as raster data) and are used as a proxy for the distribution of disease vectors or to capture socio-economic inequalities in the population. In the context of mosquito-borne diseases in developing countries, two examples of this kind of covariate are vegetation density, which quantifies the suitability of a location to serve as a mosquito habitat, and night-time light (NTL), which is often used as a proxy for the local level of economic development. Because of their high spatial resolution, raster data are especially useful for prediction of p(x) at locations where no data have been collected. However, this leads to two fundamental questions. How should we model the relationship between d(x) and p(x) in MBG models? And how should we assess the impact of d(x) on the spatial predictions of p(x)?
The answers to those questions might differ according to the scientific objective of the study. As in Shmueli [2], we distinguish between explanatory modelling and predictive modelling. Under explanatory modelling, the model-building process, and in particular the selection of candidate covariates, should be informed by context-specific scientific knowledge of the underlying disease process, to the extent that this is well understood. Also, interest lies principally in understanding how risk factors represented by the covariates d(x) relate to disease prevalence p(x). In predictive modelling, although context-specific scientific knowledge can still play an important role, the effort is directed primarily towards the development of a model that can predict as accurately as possible future data generated by the same underlying process. For these reasons, different concerns arise under the two paradigms. For example, in explanatory modelling, accounting for measurement error in the covariates is important in order to obtain unbiased estimates of the regression parameters that link d(x) to p(x). In predictive modelling, the selection of covariates is carried out in order to minimize a measure of prediction error, and considerations about how accurately a regression parameter associated with d(x) measures the effect of a putative risk factor are of less concern.
Current applications of disease prevalence mapping in low-resource settings have adopted different approaches to the development and application of geostatistical models, especially in relation to the selection and use of covariates for spatial prediction. The field of malaria epidemiology offers many interesting examples that illustrate this. Weiss et al. [3] propose an algorithmic approach to select a large number of covariates for mapping malaria prevalence across Africa, using standard, non-spatial logistic regression. Specifically, starting with an initial set of more than 50 million covariates, including transformations and interactions of well-established malaria risk factors, the authors proceed through five steps of model selection, making extensive use of cross-validation and the Akaike information criterion for ranking different models. After reducing the number of covariates to 1887, a sixth step is carried out to identify a final set of 20 covariates, each of which is randomly drawn from the remaining set by giving preference to those showing better predictive performance.
In a more recent paper, Bhatt et al. [4] predict malaria prevalence using 15 pre-selected covariates without any further reduction. In a first stage, a predefined set of regression and non-parametric models is fitted separately to the data in order to capture complex interactions between covariates and their nonlinear relationships with prevalence. The resulting predictions for prevalence from each of the applied methods are then combined, using an approach that the authors term Gaussian process stacked generalization. This delivers a single estimate which is then re-used as a predictor in a linear geostatistical model for the logit-transformed prevalence. Using cross-validation, Bhatt et al. [4] show that this approach outperforms other interpolation methods, including standard geostatistical models in which covariates are introduced simply as regression terms in the linear predictor for prevalence. Here, the development of the model-fitting algorithm is focused exclusively on reducing a specified index of predictive performance in a hold-out sample. However, the improved predictive performance is obtained at the expense of the interpretability and, hence, the explanatory power of the model.
Within a Bayesian inferential setting, a widely used approach to variable selection for malaria prevalence mapping has been inspired by the seminal paper of George & McCulloch [5], in which a latent binary variable is introduced into the model in order to identify subsets of the most important covariates. Diboulo et al. [6] use this approach in the development of a geostatistical model for mapping malaria in Burkina Faso. Each regression coefficient associated with a covariate is given a prior consisting of a mixture of two zero-mean Gaussian distributions, with the variance of one of the two constrained to be 1000 times smaller than the variance of the other. The mixing probability of the two distributions is assumed to be 0.5 a priori, and a covariate is then retained in a subsequent fit of the model if there is a posterior probability larger than 0.5 that favours the Gaussian distribution with the larger variance. Variations of this approach have also been used by Giardina et al. [7], Adigun et al. [8] and, for schistosomiasis prevalence mapping, Chammartin et al. [9]. One of the main advantages of these approaches is that the variable selection process takes into account the spatial correlation in the data, unlike in Weiss et al. [3]. However, several arbitrary choices are made in the specification of the mixture distributions and it is unclear to what extent these affect the identification of important covariates.
By contrast, Macharia et al. [10] and Giorgi et al. [11] fit spatio-temporal geostatistical models to malaria prevalence data in Kenya and Somalia without using any covariates other than the age of the examined individuals at each sampled location. The authors' rationale for this choice was a concern about misspecification of the regression relationship between d(x) and p(x), which might yield invalid inferences for p(x) in areas where, owing to the absence of data, predictions would be entirely driven by the covariates d(x). However, the use of an MBG model without any spatially referenced covariates calls into question the face validity of predictions, which consequently revert to the mean prevalence level at prediction locations remote from the sampled locations. Also, it is unclear to what extent misspecification of the regression relationships between covariates and prevalence might distort the resulting predictions. We investigate both of these issues in the present paper.
A variety of approaches have been used for building geostatistical models in the context of other tropical diseases, including Slater et al. [12] and Moraga et al. [13] for lymphatic filariasis, and Zourè et al. [14] and O'Hanlon et al. [15].
Through a geostatistical analysis of malaria prevalence data in Tanzania, this paper describes how to develop geostatistical models, from the first stage of exploratory analysis to the final step of spatial prediction. Simulation studies are also carried out in order to offer further insights into the use and misuse of covariates for disease prevalence mapping. The data and R scripts for the exploratory analysis, parameter estimation and spatial prediction are freely available at: github.com/giorgilancs/covariates.

Introducing the worked example
Throughout the paper, we use the data from the Tanzania Demographic and Health Survey and Malaria Indicators Survey conducted in 2015 (henceforth, DHS-MIS2015) [18]. This consists of 387 geo-referenced locations representing clusters of households falling within predefined geographical areas, known as census enumeration areas (EAs), which were delineated for the 2012 Tanzania Population and Housing Census.
The survey employed a two-stage stratified sampling design to provide estimates for the entire country, for urban and rural areas in Tanzania Mainland, and for Zanzibar. For specific indicators, the sample design allowed estimation for each of the 30 regions (25 regions from Tanzania Mainland and five regions from Zanzibar), while the remaining indicators were representative for each of the nine zones. Stratification was achieved by separating each region into urban and rural areas, resulting in 59 sampling strata. In the first stage, clusters consisting of EAs were selected from the 2012 Tanzania Population and Housing Census sampling frame. In the second stage, 22 households were systematically selected from a complete listing of households within each selected cluster. In all selected households, with a parent's or guardian's consent, children aged 6-59 months were tested for Plasmodium falciparum.
The research question we address is: in what areas of Tanzania is malaria prevalence above 0.3? As in previous studies [10,19], here, we use a 0.3 (i.e. 30%) prevalence threshold to define high malaria burden areas that require intensive and sustained vector control.
Our outcome variable is the number of positive test results, y_i, based on a rapid diagnostic test (RDT) for Plasmodium falciparum antigenaemia, out of n_i examined individuals at location x_i, for i = 1, …, N with N = 790.
In our analysis, we consider the following spatially referenced candidate covariates for modelling P. falciparum prevalence, all of which have been used in most previous studies on malaria risk mapping [20][21][22].
- Population density, obtained from the WorldPop database (www.worldpop.org), is used to account for the higher levels of malaria transmission in low-populated rural areas. This is available as a raster file at a resolution of 3 arc-seconds (approx. 100 m at the equator).

- NTL captures the level of urbanization of a location [23] and is complementary to population density. As a result of lower poverty rates and increased access to health facilities, malaria risk is indeed lower in urban areas [24,25]. A 2013 gridded surface of intercalibrated NTL was obtained from the Geodata portal (geodata.globalhealthapp.net) at a spatial resolution of 1 km².

- Rainfall is an important environmental factor that directly affects the suitability of potential breeding sites for Anopheles mosquitoes. Here, we use the annual mean precipitation based on the Climate Hazards Group Infra-Red Precipitation with Station data (CHIRPS).

The different steps of a geostatistical analysis are summarized in table 1. In what follows, each section of the paper corresponds to a different step presented in table 1. We consider steps 1 to 4 to be an essential component of any geostatistical analysis.
Step 5 is specifically applicable to the primary objective of this paper, i.e. the assessment of the impact of covariates for prevalence mapping.

Exploratory analysis
An initial exploratory analysis of the data provides useful insights into the development of a suitable geostatistical model for prevalence. Here, we focus on two main aspects of exploratory analysis: (i) assessment of the relationship between prevalence and covariates; (ii) testing for residual spatial correlation. Figure 1 is a point-map showing the sampled locations of the dataset and their corresponding empirical prevalence. This plot serves two purposes: it visualizes the spatial coverage of the study region; and it gives initial insight into the spatial pattern of prevalence. We observe that the density of the sampled locations corresponds in large part to the distribution of the population in Tanzania, with few or no locations in sparsely populated areas. This feature of the data will be of particular relevance later in the geostatistical analysis, as the sparsity of the data affects both the predictive power of our model and our ability to validate it. In figure 1, we also observe the presence of locations with an empirical prevalence greater than 0.57 in the northwest and in the eastern part of the country. Many locations with no cases reported are found in large swathes of inland Tanzania, especially in the proximity of uninhabited areas.
In order to explore the association of the covariates with prevalence, we use the empirical logit transformation, defined as

y*_i = log{(y_i + 1/2) / (n_i − y_i + 1/2)}.

The rationale for using the empirical logit is that it matches the scale on which covariates are included as terms in the linear predictor of a logistic model; the addition of 1/2 to the numerator and denominator in the equation above deals with y_i = 0 and y_i = n_i, for which the standard logit transformation of the empirical prevalence would yield minus and plus infinity, respectively. We emphasize that the use of the empirical logit is only for exploratory purposes and that the final geostatistical model should instead be based on a binomial likelihood, this being the natural sampling distribution for prevalence data. Fitting a linear geostatistical model to y*_i may be a more convenient strategy when p(x) is close to 0.5 and the n_i are large, but in other circumstances can result in highly biased inferences; see Stanton & Diggle [28] for details.
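As a minimal sketch (in Python rather than the paper's R scripts; the function name is ours), the empirical logit can be computed as:

```python
import numpy as np

def empirical_logit(y, n):
    """Empirical logit y*_i = log((y_i + 1/2) / (n_i - y_i + 1/2)).

    Adding 1/2 to numerator and denominator keeps the transform
    finite when y_i = 0 or y_i = n_i."""
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    return np.log((y + 0.5) / (n - y + 0.5))

# Example: 3 positives out of 20 tested, plus the two boundary cases
# that the ordinary logit cannot handle.
y_star = empirical_logit([3, 0, 20], [20, 20, 20])
```

All three values are finite, with the boundary cases mapped to large negative and positive values rather than infinities.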
We next generate scatter plots of y*_i against each of the covariates (figure 2). In the case of precipitation and population, we also take the logarithm of the two covariates in order to obtain a more approximately linear relationship with prevalence. We note that all covariates exhibit a very noisy relationship with the y*_i. Regression splines can be used to obtain a smoothing curve as a better visualization of the nature of any remaining nonlinear relationships. In our analysis, we fitted these using the 'mgcv' package [29], available in the R software environment. (Its use is illustrated in the R code available at: github.com/giorgilancs/covariates.) The resulting curves (blue lines of figure 2) provide a useful description of the relationship between the empirical logit and each of the covariates, but are not always easy to interpret. To overcome this issue, linear splines, i.e. piecewise linear functions also known as broken-stick models, can be used to account for nonlinear relationships with more easily interpretable regression parameters.
In our use of linear splines, we aim to capture two different types of nonlinear relationships: unimodal, where an increasing trend is followed by a decreasing one, or vice versa; or 'saturation curve', by which we mean a monotonic relationship that appears to flatten for increasing values of the covariate. Both types of relationship can usually be captured using a linear spline having no more than two knots, say k_1 and k_2, formally expressed in a standard logistic regression as

log{p(x_i) / (1 − p(x_i))} = β_0 + β_1 d(x_i) + β_2 (d(x_i) − k_1)_+ + β_3 (d(x_i) − k_2)_+,

where (a)_+ denotes max(a, 0). In the above equation, exp{β_1} is the multiplicative effect of a unit increase in d(x_i) on the odds, for d(x_i) < k_1. That effect then becomes exp{β_1 + β_2} for k_1 < d(x_i) < k_2, and exp{β_1 + β_2 + β_3} for d(x_i) > k_2. The fitted linear splines are shown in figure 2 by the dashed red lines, whilst the green lines represent simple linear fits. The knots of the splines are chosen through a graphical inspection of figure 2, by placing them in proximity of local maxima of the smoothing spline (blue lines of figure 2). In the case of EVI, NTL and precipitation, the estimated regression relationship, as shown by the red lines, can be described either as a unimodal or a saturation curve, both of which are scientifically interpretable. For example, an increase in the levels of urbanization, as measured by NTL, is associated with a large decrease in prevalence, on average, when moving from rural to moderately urbanized areas. However, for values of NTL larger than 10 the relationship flattens, indicating that higher levels of urbanization are not likely to bring any further decrease in malaria risk.

Table 1. Summary of the steps carried out in a geostatistical analysis, highlighting the objectives of each step and the statistical tools used to pursue them. Note that the list of statistical tools presented in the table is not exhaustive and is limited to those presented in this paper.
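The broken-stick construction can be sketched in Python; the covariate values and knot positions below are hypothetical and would in practice be read off figure 2:

```python
import numpy as np

def linear_spline_basis(d, knots):
    """Columns of the broken-stick design: d, (d - k_1)_+, (d - k_2)_+, ...
    On the linear-predictor scale the slope changes by one coefficient
    at each knot, so the fitted curve is piecewise linear."""
    d = np.asarray(d, dtype=float)
    return np.column_stack([d] + [np.maximum(d - k, 0.0) for k in knots])

# Hypothetical covariate values with knots at 5 and 10: with fitted
# coefficients (b1, b2, b3), the slope is b1 below 5, b1 + b2 between
# 5 and 10, and b1 + b2 + b3 above 10.
X = linear_spline_basis([0, 4, 8, 12], knots=[5, 10])
```

These columns would then enter the linear predictor of a standard logistic (or geostatistical) model alongside an intercept.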
In the case of temperature, values between 25°C and 30°C are considered optimal for P. falciparum sporogony [30], which suggests a unimodal effect of temperature on prevalence. The upper panel of figure 2 does not support this and shows instead an approximately linear relationship. The apparent inconsistency may be due to the low number of locations with temperature higher than 30°C. This may have masked any decreasing trend at higher temperatures [31], making any extrapolation based on the green line beyond this range biologically implausible. In the case of the population density covariate, the observed relationship is decreasing and approximately linear, which can be explained, similarly to NTL, as a negative association between urbanization and malaria risk.

The next step of the model-building process is to assess whether the residual variation, i.e. variation that is not captured by the selected covariates, exhibits evidence of spatial correlation. To this end, we consider a first extension of the standard generalized linear modelling framework in which the notion of residual variation is introduced directly into the linear predictor for the log-odds of prevalence. More specifically, conditionally on a set of independent Gaussian variables Z_i, called random effects, the count y_i is now assumed to be the realization of a binomial distribution, with linear predictor

log{p(x_i) / (1 − p(x_i))} = d(x_i)'β + Z_i,

where d(x_i) is a vector of covariates. The model specified in the equation above belongs to the class of generalized linear mixed models and accounts for overdispersion, which occurs when the data y_i exhibit greater variation than would be expected under a standard binomial model. After fitting the model, we extract the estimates for the Z_i, which we denote by Ẑ_i, and calculate their empirical variogram to assess whether the Ẑ_i show any evidence of spatial correlation. Let Ẑ_i and Ẑ_j denote the estimates of the random effects associated with locations x_i and x_j.
The empirical variogram is calculated by averaging the quantities v_ij = (Ẑ_i − Ẑ_j)²/2 within predefined classes of distance, also known as bins. The empirical variogram is then displayed by plotting the averaged v_ij against the mid-points of each distance bin.
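A Python sketch of this computation (function and argument names are ours):

```python
import numpy as np

def empirical_variogram(coords, z, bin_edges):
    """Average v_ij = (z_i - z_j)^2 / 2 within predefined distance bins.

    coords : (N, 2) array of locations; z : (N,) estimated random effects.
    Returns the bin mid-points and the binned variogram ordinates."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(z), k=1)            # all pairs i < j
    u = np.hypot(*(coords[i] - coords[j]).T)       # pairwise distances
    v = (z[i] - z[j]) ** 2 / 2.0                   # semi-variances
    mids, gamma = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (u >= lo) & (u < hi)
        if in_bin.any():
            mids.append((lo + hi) / 2.0)
            gamma.append(v[in_bin].mean())
    return np.array(mids), np.array(gamma)
```

Plotting `gamma` against `mids` then gives the empirical variogram described above.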
Electronic supplementary material 1, figure S2 shows the resulting empirical variogram for the malaria data, after removing the effects of temperature, NTL, population, EVI and precipitation. In the presence of spatial correlation, the typical shape of the empirical variogram is that of a non-decreasing function of distance. This is because spatial correlation makes the squared differences v_ij smaller, on average, at shorter distances. The empirical variogram (black line) in electronic supplementary material, figure S2 shows a pattern that suggests the presence of spatial correlation. However, this alone does not allow us to conclude that the random effects show evidence of residual spatial correlation; we need to show that the black line is indeed highly incompatible with the absence of spatial correlation. For this purpose, we re-calculate the empirical variogram of the Ẑ_i by randomly re-allocating the Ẑ_i to the locations x_i, and repeat this 10 000 times. The resulting 10 000 variograms can then be used to construct 95% probability intervals for the range of the variation in the v_ij under the assumption of spatial independence. These are denoted in electronic supplementary material, figure S2 by the shaded grey area. The black line falls well outside the 95% envelope, suggesting that the variation in prevalence left unexplained by the covariates is spatially correlated.

In the event that we detect spatial correlation, we further extend the generalized linear mixed model as follows. We now assume that, conditionally on the Z_i and a spatial Gaussian process S(x_i), the y_i are the realization of a binomial distribution with linear predictor

log{p(x_i) / (1 − p(x_i))} = d(x_i)'β + S(x_i) + Z_i.     (4.1)

In this paper, we assume that S(x) has an exponential covariance function, with variance σ². This implies that S(x_i), for i = 1, …, N, follows a multivariate Gaussian distribution, with correlations expressed by

corr{S(x_i), S(x_j)} = exp(−u_ij/ϕ),

where u_ij is the distance between x_i and x_j, and ϕ is a scale parameter that regulates how quickly the spatial correlation approaches zero for increasing distance u_ij.
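The permutation envelope can be sketched as a self-contained Python function (names are ours; the number of permutations is an argument, defaulting to the paper's 10 000):

```python
import numpy as np

def variogram_envelope(coords, z, bin_edges, n_perm=10_000, seed=0):
    """Permutation test for residual spatial correlation.

    Re-computes the binned empirical variogram after randomly
    re-allocating the values z to the locations, n_perm times, and
    returns the observed ordinates together with pointwise 2.5% and
    97.5% permutation quantiles (a 95% envelope under independence)."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(z), k=1)
    u = np.hypot(*(coords[i] - coords[j]).T)
    bin_id = np.digitize(u, bin_edges) - 1         # bin index per pair
    n_bins = len(bin_edges) - 1

    def binned(vals):
        v = (vals[i] - vals[j]) ** 2 / 2.0
        return np.array([v[bin_id == b].mean() if np.any(bin_id == b)
                         else np.nan for b in range(n_bins)])

    observed = binned(z)
    rng = np.random.default_rng(seed)
    perms = np.array([binned(rng.permutation(z)) for _ in range(n_perm)])
    lower, upper = np.nanpercentile(perms, [2.5, 97.5], axis=0)
    return observed, lower, upper
```

An observed variogram falling largely outside the (lower, upper) band, as in figure S2, is evidence against spatial independence of the residuals.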
The exponential covariance function corresponds to a mean-square continuous but non-differentiable Gaussian process S(x). It is a special case of a wider class of covariance functions proposed by Matérn [32] that includes an additional parameter to control the mean-square differentiability of the Gaussian process S(x). This additional parameter is hard to estimate and, in our experience, only materially improves the fit of the model when the empirical variogram shows a clear 'S' shape over small distances, which is typical of Matérn covariance functions with smoothness parameter κ > 0.5. We assume that the Z_i in (4.1) are independent and identically distributed random variables with mean 0 and variance τ². The Z_i are called the nugget effect, reflecting the historical origin of geostatistical methods in the mining industry. In our context, they can be interpreted as a combination of non-spatial over-dispersion and spatial variation on a scale smaller than the smallest distance between any two data locations x_i. It is impossible to disentangle these two effects without independently replicated data at coincident locations. In this paper, we make the pragmatic choice to exclude the Z_i from our spatial predictions of prevalence.
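As a sketch, the covariance matrix of S(x_i) + Z_i implied by the exponential model with a nugget can be assembled as follows (Python; the function name is ours):

```python
import numpy as np

def exponential_covariance(coords, sigma2, phi, tau2=0.0):
    """Covariance matrix of S(x_i) + Z_i under the exponential model:
    sigma2 * exp(-u_ij / phi) between locations, with the nugget
    variance tau2 added on the diagonal."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    u = np.sqrt((diff ** 2).sum(axis=-1))          # distance matrix u_ij
    return sigma2 * np.exp(-u / phi) + tau2 * np.eye(len(coords))
```

Simulating from the model then amounts to drawing a multivariate Gaussian vector with this covariance matrix and plugging it into the linear predictor.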
In (4.1), the sum S(x_i) + Z_i represents all of the variation in prevalence p(x_i) that is not attributable to the covariates d(x_i). If all the variables that contribute to the spatial variation in p(x_i) were available as measured covariates, S(x_i) would be redundant.
In the geostatistical analysis illustrated in this paper, we estimate the model parameters using the Monte Carlo maximum-likelihood (MCML) method implemented in the PrevMap package [33]. For more technical details, we refer the reader to Geyer's work [34][35][36] and to Christensen [37] for an overview of MCML in the context of model-based geostatistics. Here, we only point out that, unlike Bayesian methods of inference, MCML avoids the specification of prior distributions for the parameters of the geostatistical model (namely β, σ², ϕ and τ²) by treating these as unknown constants that are estimated from the data.
Prior distributions are probability distributions that quantify our belief about a model parameter before any empirical evidence is taken into account. In general, choosing a prior is not straightforward. In recent work by Simpson et al. [38] and Fuglstad et al. [39], a principled Bayesian framework is developed to define priors that are invariant to reparametrization and penalize model complexity. Ideally, a prior should be informed by subject-matter knowledge, but translating this into a unique probability distribution can be very difficult; see §4.2.2 of [40] for examples. In the absence of prior knowledge, a widely used approach is to define priors with large variances in order to let the data drive the inference on the model parameters. However, with this approach the choice of how large the prior variance should be is often arbitrary. For this reason, in the context of geostatistical analysis, where informative priors are rarely available, maximum likelihood is our preferred method of estimation. Against this, an advantage of Bayesian inference is that it naturally incorporates parameter uncertainty into predictions of prevalence within a single, principled probabilistic framework. To overcome this limitation of likelihood-based prediction, Giorgi et al. [41] propose to use the distribution of the maximum-likelihood estimator as a 'quasi-posterior' distribution for the model parameters, and sample from this to propagate parameter uncertainty. When comparing this approach with Bayesian inference and with the use of plug-in maximum-likelihood estimates for prevalence prediction, Giorgi et al. [41] find only small differences. This is because, in most applications, the predictive variance of the spatial process S(x) dominates the variance of the parameter estimates.
In order to develop a geostatistical model that is more explicitly informed by knowledge of the disease, it is convenient to group risk factors into different domains, each of which contributes to the transmission of the disease via a different route. In the context of malaria, three main domains are environmental covariates, socio-economic covariates and covariates relating to the history of control interventions. Let δ_j(x_i) denote the cumulative effect on prevalence of the risk factors d_j(x_i) that fall under the j-th domain; we call these domain effects. We then re-express model (4.1) as

log{p(x_i) / (1 − p(x_i))} = δ_1(x_i) + δ_2(x_i) + δ_3(x_i) + S(x_i) + Z_i.

Although the model above is mathematically equivalent to that in (4.1), the introduction of the domain effects δ_j(x_i) can help to improve the interpretability of the model in two different ways. First, at a pre-analysis stage they provide a rationale for the identification of suitable covariates. Second, they guide the variable selection process by requiring each domain to be represented by at least one covariate in the final model used for spatial prediction. In our analysis, the identified covariates provide materially different independent contributions to the explanation of the spatial variation in disease prevalence. For this reason, our approach will be to try to retain as many of them as possible in the model, in order to maximize its explanatory power while guarding against overfitting.
Overfitting occurs when the model delivers predictions that follow too closely the observed values of the outcome. Two possible causes of this are (1) the inclusion of too many covariates relative to the sample size and (2) the use of an over-complex regression function to capture nonlinear relationships between the outcome and the covariates. To detect overfitting, the most commonly used approach is cross-validation, whereby a randomly selected subset of the data, also referred to as the training set, is used to inform predictions at the locations of the remainder of the data, or test set. Then, based on a prediction error summary, the performance of the model is assessed in both the training and test sets. If the model overfits the data, the prediction error obtained from the test set will be substantially larger than that obtained from the training set; for more guidance on how to carry out cross-validation, see §7.10 of [42]. However, this approach has three main limitations: first, it does not provide any insights into the possible causes of overfitting; second, cross-validation becomes less reliable for smaller sample sizes; third, most commonly used prediction error summaries treat the observed fraction of positive cases as the true disease prevalence, which is especially problematic in low prevalence settings.
To overcome these issues when the goal is to carry out variable selection, we propose an alternative approach based on inspection of the correlation matrix of the maximum-likelihood estimates for β. The rationale for this is that, in the presence of overfitting, the estimates of the regression coefficients β are less stable and tend to become linearly dependent, resulting in correlations close to −1 or 1. This allows us to identify which covariates may be removed from the model and how nonlinear functions, used to model the relationship of the outcome with the covariates, may be simplified.
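As an illustration in Python (the covariance matrix below is hypothetical; in practice it would come from the fitting routine, e.g. the inverse of the observed information):

```python
import numpy as np

def correlation_from_covariance(cov_beta):
    """Convert the covariance matrix of the estimators of beta into a
    correlation matrix. Off-diagonal entries close to +/-1 flag
    near-linear dependence between coefficient estimates, which is a
    symptom of overfitting."""
    cov_beta = np.asarray(cov_beta, dtype=float)
    sd = np.sqrt(np.diag(cov_beta))
    return cov_beta / np.outer(sd, sd)

def flag_unstable_pairs(corr, threshold=0.95):
    """Return index pairs (i, j), i < j, whose estimated correlation
    exceeds the threshold in absolute value."""
    i, j = np.triu_indices_from(corr, k=1)
    keep = np.abs(corr[i, j]) > threshold
    return list(zip(i[keep], j[keep]))
```

Flagged pairs point to covariates, or spline terms within a covariate, that are candidates for removal or simplification.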
The left panel of figure 3 shows the correlation matrix for the estimates of β based on the linear splines, as represented by the dashed red lines of figure 2. From now on, we shall refer to this initial geostatistical model as M_I. In this graph, the squares encompass the correlations for all the regression coefficients used to capture the empirical nonlinear relationship between prevalence and a single covariate. We observe that most of the correlations outside these squares are close to zero, with the exception of two pairs of coefficients for population and NTL, whose correlations are approximately −0.7. None of the correlations outside the squares is high enough in absolute value to justify the removal of any of the covariates. If we now examine the correlations of the regression coefficients within each square, we see that, for NTL and precipitation, the estimates have a correlation very close to 1. This strongly suggests overfitting, indicating that the true nonlinear functional relationship may be difficult to recover as a result of the noisy empirical associations evident from figure 2. In this case, a possible solution is to fit a simpler linear relationship for NTL and precipitation. In other cases, where more than one knot has been used to define the linear spline, a solution may be the removal of any knot that is close to the minimum or maximum value of an explanatory variable, where nonlinear relationships between the covariates and the outcome are informed by fewer data points and are therefore less reliable.
We then fit a simplified model in which we assume a logit-linear relationship for NTL and log-precipitation. The right panel of figure 3 shows that this has reduced the correlations between the regression coefficients, with the largest, in absolute value, being around −0.6. We conclude that this simplified model does not show evidence of overfitting. In the next section, we consider spatial prediction based on this simplified geostatistical model, to which we shall refer as M_F; see table 3 for a summary of the covariates and how these are used in the models M_I and M_F, respectively. The point estimates and 95% confidence intervals for the parameters of M_I and M_F are reported in electronic supplementary material 1, table S1.

Spatial prediction
In order to carry out spatial prediction, we must first define (1) a predictive target, say T, and (2) one or more summary statistics of its predictive distribution, i.e. the probability distribution of T conditional on all of the observed data.
Let A denote a geographical area of interest and R_k, for k = 1, …, K, be a set of spatial units forming a partition of A. In our example, A corresponds to the area encompassed by the boundaries of Tanzania, while the R_k are the K = 26 Tanzanian regions. In disease prevalence mapping, two predictive targets that are often of scientific interest are the spatially continuous prevalence surface covering the whole of A, i.e. p(A) = {p(x) : x ∈ A}, and the regional population prevalence

p(R_k) = ∫_{R_k} w(x) p(x) dx / ∫_{R_k} w(x) dx,   (5.1)

where w(x) is the population density at location x. In order to compute the target in the above equation, we approximate each integral by a sum over a regular grid covering R_k. In general, the spatial resolution of the regular grid should be fine enough to make the correlation between values of p(x) in adjacent pixels at least 0.95, in order to generate an appropriately smooth prevalence surface map for p(x). The most commonly used summaries of the predictive distribution of a prevalence surface are maps of its means, standard errors and selected quantiles for each spatial unit. However, public health policies are often developed based on the exceedance, or non-exceedance, of predefined prevalence thresholds, say l. In these cases, the prediction at each location or region is expressed as an exceedance probability (EP), defined as EP(x) = Prob[p(x) > l | y_1, …, y_N] and EP(R_k) = Prob[p(R_k) > l | y_1, …, y_N] for location-specific and spatially aggregated targets, respectively. An EP close to 1 or 0 indicates near-certainty that the target does or does not exceed the specified threshold l, whereas an EP close to 0.5 indicates a high degree of uncertainty.
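In practice, both location-specific EPs and the regional population-weighted prevalence are typically estimated by Monte Carlo from samples of the predictive distribution of prevalence. A minimal sketch, in which the array layout and function names are our own illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def exceedance_prob(prev_samples, threshold):
    """EP(x) = Prob[p(x) > l | data], estimated as the fraction of predictive
    samples above the threshold. prev_samples has shape (n_sim, n_pixels)."""
    return (prev_samples > threshold).mean(axis=0)

def regional_prevalence(prev_samples, weights, region_id, k):
    """Population-weighted average prevalence over region k, computed for each
    predictive sample so that regional EPs can be derived from the result."""
    mask = region_id == k
    w = weights[mask]
    return prev_samples[:, mask] @ w / w.sum()
```

Applying `exceedance_prob` to the output of `regional_prevalence` gives the spatially aggregated EPs for a region.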
Another important issue is to assess whether spatial predictions at unsampled locations should be conducted outside the empirical ranges of the covariates. Electronic supplementary material 1, figure S3 shows a 10 by 10 km regular grid covering the whole of Tanzania, after removing locations with no inhabitants according to the WorldPop (www.worldpop.org) population density data. Locations in black indicate that the value of at least one of the covariates at that location falls outside the empirical range. In this case, extrapolation should be carried out with caution, especially when the fitted regression relationships may follow trajectories that are not supported by scientific knowledge of the phenomenon under investigation. Figure 4 shows the predicted prevalence and the EP for a 0.3 threshold over the regular grid (left panels) and at regional level (right panels). The regional-level estimates are obtained by using population density as the weighting function w(x). We identify areas that are highly likely to exceed 0.3 prevalence in the northwest and southeast of the country, where the EP is at least 80%. Large swathes of the central areas of Tanzania instead have estimated prevalences below 20% and EPs of no more than 20%, indicating a low probability of exceeding a 0.3 threshold. The predictions at regional level mask the spatial variation observed over the grid locations. The largest estimated prevalence is about 0.47 in the north-western region of Geita. In the lower right panel of figure 4, we see that overall three regions in Tanzania have an EP of at least 80% of exceeding a 0.3 threshold.

To pursue (a), we use a Monte Carlo procedure based on the empirical variogram. We first simulate a large number of binomial datasets under the fitted model, say 10 000. For each of these, we fit the model in (3.2) to obtain estimates Ẑ_i and the variogram ordinates v_ij = (Ẑ_i − Ẑ_j)²/2.
The resulting 10 000 variograms are used to compute 95% probability intervals at each of the distance bins. If the variogram obtained from the original data falls within the resulting envelope, we conclude that the chosen spatial correlation function gives a suitable model. Using this approach in our application, we conclude that the assumed exponential spatial correlation is supported by the data, as evidenced by electronic supplementary material 1, figure S4. Failure of this diagnostic check could indicate misspecification of the covariance function or incorrect omission of the nugget effect Z_i in (4.2). These and other examples are shown in §4.3 of Diggle & Giorgi [40]. Note, however, that the 95% probability intervals shown in electronic supplementary material 1, figure S4 are wide, indicating low statistical power to reject the model; in our experience, rejection of the correlation function based on the outlined Monte Carlo approach is more likely to occur at the smaller distance bins of the variogram, where the test is more powerful.
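The Monte Carlo envelope described above can be sketched as follows: a binned empirical variogram is computed from the variogram ordinates (z_i − z_j)²/2, and the envelope is formed from pointwise quantiles over variograms of simulated surfaces. The binning scheme and function names are illustrative assumptions:

```python
import numpy as np

def empirical_variogram(coords, z, bins):
    """Binned empirical variogram: average of (z_i - z_j)^2 / 2 per distance bin."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(z), k=1)           # all unordered pairs (i, j)
    dist, v = d[iu], 0.5 * (z[iu[0]] - z[iu[1]]) ** 2
    idx = np.digitize(dist, bins)
    return np.array([v[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(1, len(bins))])

def variogram_envelope(coords, sims, bins, level=0.95):
    """Pointwise (default 95%) envelope from variograms of simulated surfaces;
    sims has shape (n_sim, n_locations)."""
    vg = np.array([empirical_variogram(coords, s, bins) for s in sims])
    lo, hi = np.nanquantile(vg, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi
```

The observed variogram is then overlaid on `(lo, hi)`: bins where it escapes the envelope suggest a poorly fitting correlation model.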

Model validation
For the assessment of the model calibration (b), we adapt the probability integral transform for count data proposed by Czado et al. [43] to binomial geostatistical models. The details and application of this approach are described in electronic supplementary material 1, §1. In brief, this consists of the following steps.
1. Randomly split the dataset into a training and a test set.
2. Obtain the predictive distribution of prevalence at the locations of the test set, based on the fitted geostatistical models.
3. Apply the probability integral transform (defined in equation (3) of electronic supplementary material 1) to the observed positive counts of the test set.
4. Assess through the cumulative density function whether the transformed data from the previous step follow a uniform distribution.
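As an illustration of steps 2 to 4, the snippet below computes a posterior-predictive CDF for a binomial count by averaging the binomial CDF over samples of prevalence at a test location, then applies a randomized probability integral transform. Note that Czado et al. [43] advocate a non-randomized version; the randomized variant here is a simplified stand-in of our own, and the function names are ours:

```python
import numpy as np
from math import comb

def binom_cdf(y, n, p):
    """P(Y <= y) for Y ~ Binomial(n, p)."""
    if y < 0:
        return 0.0
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(int(y) + 1))

def predictive_cdf(y, n, prev_samples):
    """Posterior-predictive CDF at y, averaging the binomial CDF over
    samples of prevalence at the test location."""
    return np.mean([binom_cdf(y, n, p) for p in prev_samples])

def randomized_pit(y, n, prev_samples, rng):
    """Randomized probability integral transform for a count observation:
    uniform on (F(y-1), F(y)), hence Uniform(0, 1) under a calibrated model."""
    lo = predictive_cdf(y - 1, n, prev_samples)
    hi = predictive_cdf(y, n, prev_samples)
    return lo + rng.uniform() * (hi - lo)
```

Collecting `randomized_pit` values over all test locations and comparing their empirical CDF with that of a uniform distribution completes step 4.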
One of the main advantages of this approach over cross-validated prediction error summaries, such as the mean square prediction error, is that it does not rely on treating the fraction of positive cases as if it were the true prevalence. Also, the use of the probability integral transform allows us to assess the overall compatibility of the predictive distribution of prevalence with the data, rather than just its mean. The results reported in the electronic supplementary material indicate that M_F is a well-calibrated model. For other approaches to the predictive assessment of count data, we refer the reader to Czado et al. [43].
Finally, for the third objective (c), we first compute Pearson's residuals, defined as

e_i = (y_i/n_i − p̂(x_i)) / √{p̂(x_i)[1 − p̂(x_i)]/n_i},

where p̂(x_i) is the mean of the predictive distribution of prevalence at location x_i, for i = 1, …, n. We then compute the variogram based on the e_i and generate the 95% confidence envelope using the permutation test described above. The results are displayed in electronic supplementary material, figure S6 and show no evidence of spatial correlation in the residuals e_i, thus supporting the assumption of conditional independence.
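Assuming the standard binomial form of Pearson's residuals, e_i = (y_i/n_i − p̂(x_i)) / √{p̂(x_i)(1 − p̂(x_i))/n_i} (our reading of the paper's definition), the computation is a one-liner:

```python
import numpy as np

def pearson_residuals(y, n, p_hat):
    """Pearson residuals for binomial prevalence data: the observed fraction
    minus the predicted prevalence, scaled by its binomial standard error."""
    return (y / n - p_hat) / np.sqrt(p_hat * (1 - p_hat) / n)
```

The resulting residuals can be fed directly into an empirical-variogram routine to check for residual spatial correlation.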

Assessment of the contribution of covariates to spatial prediction
We now assess the contribution of the domain effects, δ_j(x_i), to the prevalence predictions.
To this end, we first fit and carry out spatial prediction for four geostatistical models, each of which uses a subset of the covariates in the full model M_F: M_0, which includes no covariates; M_E, which includes only the environmental covariates; M_SES, which includes only the socio-economic covariates; and M_F itself. We then compare the models to assess how the introduction of covariates affects (i) the spatial structure of the unexplained variation in prevalence, S(x_i) + Z_i, (ii) the point estimates of prevalence and (iii) the exceedance probabilities (EPs).
To carry out (i), we examine changes in the estimates of σ², ϕ and τ² through a graphical inspection of the theoretical variograms from the four fitted models. When using an exponential correlation function, as in our application, the theoretical variogram at distance u is

V(u) = τ² + σ²{1 − exp(−u/ϕ)}.

Plugging the maximum-likelihood estimates into the above equation gives the four variograms shown in electronic supplementary material, figure S5. This clearly shows that the introduction of covariates leads to smaller estimates for σ² and larger values for ϕ, but does not materially change the estimate of the nugget variance τ². The practical range, i.e. the distance that gives a spatial correlation of 0.05, obtained from M_SES is closer to that of M_F than to those of the other two models, suggesting that socio-economic variables explain spatial variation on a smaller scale than environmental covariates.
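For the exponential model, both the theoretical variogram and the practical range can be evaluated in closed form; the function names below are ours:

```python
import numpy as np

def exp_variogram(u, sigma2, phi, tau2):
    """Theoretical variogram for an exponential correlation function:
    V(u) = tau^2 + sigma^2 * (1 - exp(-u / phi))."""
    return tau2 + sigma2 * (1.0 - np.exp(-u / phi))

def practical_range(phi, rho=0.05):
    """Distance at which the exponential correlation exp(-u / phi) falls to rho;
    with rho = 0.05 this is roughly 3 * phi."""
    return -phi * np.log(rho)
```

Plotting `exp_variogram` with the maximum-likelihood estimates from each candidate model reproduces the kind of comparison described above.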
To quantify the impact of the covariates on the spatial predictions of the predictive target, we consider two summaries defined for both spatially continuous and spatially aggregated predictions.
The first summary quantifies how similar the prevalence predictions from M_0, M_SES and M_E are to those from the full model M_F, using the root-mean-integrated-square-error (RMISE) criterion. For a model M_i, this is defined for the spatially continuous predictions as

RMISE_C = √{ (1/|A|) ∫_A [p̂_{M_i}(x) − p̂_{M_F}(x)]² dx },   (7.1)

where p̂_{M_i}(x) denotes the predictive mean of prevalence at location x under model M_i; smaller values indicate predictions that are closer to the full model M_F. Similarly, for the spatially aggregated estimates p(R_k) in (5.1), we compute

RMISE_R = √{ (1/K) Σ_{k=1}^{K} [p̂_{M_i}(R_k) − p̂_{M_F}(R_k)]² },   (7.2)

which we interpret in analogous fashion to RMISE_C. To assess the impact of the covariates on the EPs, we propose a summary index that quantifies how well a fitted model is able to discriminate between areas that are above and below a given threshold. Hence, the summary should penalize models that give EPs closer to 0.5 and favour models with EPs closer to 0 or 1. For the spatially continuous EPs, this can be achieved using the following index:

I_C = −(1/|A|) ∫_A [EP_{M_i}(x) − 0.5]² dx.   (7.3)

Similarly, for the EPs of the spatially aggregated average prevalence,

I_R = −(1/K) Σ_{k=1}^{K} [EP_{M_i}(R_k) − 0.5]².   (7.4)

As both I_C and I_R can only take negative values, these two indices are close to 0 when the EPs are close to 0.5. Conversely, I_C and I_R will give large negative values when the EPs are closer to 0 or 1.
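With the integrals approximated by averages over grid pixels, both kinds of summary reduce to a few lines of code; the helper names are illustrative:

```python
import numpy as np

def rmise(pred_model, pred_full):
    """Root-mean-integrated-square-error between two gridded prevalence
    surfaces, approximating the integral by an average over pixels."""
    return np.sqrt(np.mean((pred_model - pred_full) ** 2))

def discrimination_index(ep):
    """-mean((EP - 0.5)^2): close to 0 when EPs hover around 0.5 (poor
    discrimination), large and negative when EPs concentrate near 0 or 1."""
    return -np.mean((ep - 0.5) ** 2)
```

Passing regional rather than pixel-level arrays gives the spatially aggregated analogues of the same two summaries.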
In our application, we approximate the integrals in (7.1) and (7.3) using the 10 by 10 km regular grid shown in electronic supplementary material, figure S3. The values of the proposed summaries for each model are reported in table 2. Regarding the spatially continuous predictive inferences, we observe that M_SES gives prevalence predictions that are closer to M_F than those of the other models. The same is also observed for the regional predictions of the average prevalence. The model that gives the best discrimination of areas above and below a 0.3 prevalence threshold is M_E but, when considering regional-level EPs, all the models considered give similar performances.
We conclude that, in this example, NTL and population density mostly drive the predictions of prevalence in the final model, on both a continuous and a regional scale. Environmental variables help to better discriminate between areas above and below a 0.3 prevalence threshold on a spatially continuous scale. However, according to the EPs of the spatially aggregated prevalences at regional level, the inclusion of covariates does not improve the discrimination between spatial units that do and do not exceed 0.3.

Further insights into the impact of covariates through a simulation study
We carry out a simulation study in order to gain further insights into the impact on spatial prediction arising from: (1) the misspecification of the relationship between covariates and prevalence; (2) the omission of important covariates.
In our simulations, we use M_I as the true model for generating simulated datasets. We then compare the performance of the true model M_I with M_F, which misspecifies the relationships with NTL and precipitation, and with a model that does not use any covariates, denoted by M_0; see table 3.
The simulation proceeds through the following iterative steps.
1. Simulate the true prevalence under M_I at the observed locations (figure 1) and over the 10 by 10 km regular grid (electronic supplementary material 1, figure S3).
2. Based on the simulated prevalence at the observed locations in the previous step, simulate a binomial dataset for the number of malaria cases.
3. Fit models M_I, M_F and M_0 to the simulated dataset from the previous step.
4. For each of the three fitted models, carry out spatial prediction of prevalence over the 10 by 10 km regular grid and at regional level.
5. Repeat steps 1 to 4, 1000 times.
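The iterative steps above can be organized as a generic loop; the callables below stand in for the simulation and model-fitting machinery (e.g. maximum-likelihood fitting of M_I, M_F and M_0), which we do not reproduce here:

```python
import numpy as np

def run_simulation(simulate_truth, simulate_counts, fit_and_predict,
                   models, n_sim, rng):
    """Skeleton of the simulation study: for each replicate, draw a true
    prevalence surface, simulate binomial counts at the observed locations,
    refit each candidate model and store its predictions alongside the truth."""
    results = []
    for _ in range(n_sim):
        truth = simulate_truth(rng)                       # step 1
        counts = simulate_counts(truth, rng)              # step 2
        preds = {name: fit_and_predict(name, counts, rng)  # steps 3-4
                 for name in models}
        results.append((truth, preds))
    return results                                        # step 5: n_sim replicates
```

The list of `(truth, predictions)` pairs can then be reduced to the RMISE, sensitivity and specificity summaries described in the text.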
On completion of these steps, we summarize the performance of M_I, M_F and M_0 through summaries at both pixel-level and regional-level. We quantify the prediction error from the three models using the RMISE as defined in the previous section, replacing p̂_{M_F}(x) in (7.1) and p̂_{M_F}(R_k) in (7.2) with the true prevalences obtained in step 1. We denote these two summaries as RMISE_C^T and RMISE_R^T, respectively. To quantify the accuracy of the classifications of both pixels and regions based on the EPs, we first optimize the sensitivity and specificity of each model and average these across the 1000 simulations, as follows. For each simulated dataset, we use the simulated true prevalence, both at pixel-level and regional-level, to label each pixel and region as being above or below the threshold of 0.3 prevalence. Based on the fitted geostatistical model, we then compute the EPs of 0.3 prevalence and find the value of EP that gives the largest sensitivity and specificity for the classification of pixels and regions. In reporting the results, we use Sens_C and Spec_C to denote the sensitivity and specificity, respectively, at pixel-level, and Sens_R and Spec_R similarly at regional-level.

Table 2. Summaries quantifying the contribution of the covariates to spatial prediction. The summaries RMISE_C and RMISE_R quantify the discrepancy of the point predictions of prevalence between M_0, M_E and M_SES and the full model M_F. The summaries I_C and I_R quantify how close the exceedance probabilities generated from a model are to 0 or 1. The interpretation of the summaries is explained in the main text.
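The search for the EP value that gives the largest sensitivity and specificity can be sketched as a scan over candidate cut-offs, here scored Youden-style by the sum of sensitivity and specificity (an assumption on our part, since the trade-off rule is not spelled out in the text):

```python
import numpy as np

def best_ep_cutoff(ep, above):
    """Scan EP cut-offs and return (cutoff, score, sensitivity, specificity)
    maximizing sensitivity + specificity; `above` is a boolean array flagging
    pixels (or regions) whose true prevalence exceeds the threshold."""
    best = (None, -np.inf, None, None)
    for c in np.unique(ep):
        pred = ep >= c
        sens = np.mean(pred[above]) if above.any() else np.nan
        spec = np.mean(~pred[~above]) if (~above).any() else np.nan
        score = sens + spec
        if score > best[1]:
            best = (c, score, sens, spec)
    return best
```

Averaging the resulting sensitivities and specificities over simulated datasets yields the Sens and Spec summaries reported in the text.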
The results of the simulation study are reported in table 4. The true model M_I yields the smallest RMISE for both the regional and the spatially continuous predictions. The predictive performance of the misspecified model M_F in terms of RMISE is not materially different from that of the true model for both continuous and regional targets. However, the misspecified relationship leads to a substantial decrease in the sensitivity of the model in detecting areas above 0.3 prevalence. The model without covariates, M_0, has a higher RMISE than the other two models but also a substantially higher sensitivity than M_F. This illustrates the general point that model choice needs to be considered in relation to the purpose of the modelling.

Discussion
In this paper, we have introduced a framework for the geostatistical analysis of spatially referenced prevalence data and have illustrated the different stages of the analysis using a case study on malaria mapping in Tanzania. The aim of this framework is to guide researchers to make modelling choices that enhance the predictive or explanatory power of the model depending on the primary objective of the analysis. For each step of a geostatistical analysis, we have proposed the use of statistical tools to address specific modelling objectives. We emphasize that the list of statistical tools presented in table 1 is not exhaustive and alternative solutions are also possible. Below, we further consider the strengths and limitations of our proposed methods.
The models presented in this paper did not explicitly account for the stratification of the DHS-MIS2015 conducted in Tanzania. In electronic supplementary material 2, we revisit the final model, namely M_F, and incorporate the strata variables provided by the DHS-MIS2015, corresponding to the urban and rural strata for each of the regions of Tanzania. The results suggest that the inclusion of population density, NTL and the spatial random effects S(x) adequately adjusts for the urban/rural stratification, thus making the inclusion of the design-based variables unnecessary. However, we emphasize that this result cannot be generalized, and the importance of stratification variables should be assessed in any geostatistical analysis. As explained in §2 of Giorgi et al. [41], failing to account for the stratification of the design can lead to a stochastic dependence between the sampling design and the underlying spatial process, also known as preferential sampling, which can invalidate the predictive inference on prevalence. In the context of vaccination coverage estimation, Dong & Wakefield [44] have carried out a thorough study of the importance of incorporating design variables into geostatistical models, showing that modelling survey stratification and clustering improves the predictive performance of geostatistical models. Previous studies that used geostatistical models for disease mapping did not account for the sampling design explicitly (e.g. [45-48]), but made extensive use of covariates that are correlated with the urban and rural strata and may therefore be sufficient to adjust for the stratification of the design.
The problem of accounting for the sampling design has been extensively investigated in the field of small area estimation (SAE) (e.g. [49-51]). SAE models are spatially discrete models that are used to analyse areal-level outcomes and borrow strength across space through a spatial correlation structure defined by neighbourhood relationships. SAE methods thus provide an alternative to model-based geostatistical methods when the goal of the analysis is exclusively focused on making predictive inferences on disease risk at district level. The recent study by Utazi et al. [52] compared the predictive performances of SAE and model-based geostatistical models, and found that the latter deliver equally reliable areal-level estimates of prevalence. However, as highlighted by Paige et al. [53], district-level estimates based on geostatistical models require the availability of high-quality population density surfaces for the computation of areal-level prevalences (see equation (5.1)) and should, in principle, also be stratified according to the design. Nevertheless, model-based geostatistical methods provide principled, likelihood-based solutions to questions that cannot be addressed with SAE methods, including the ability to combine data at multiple spatial resolutions and to carry out spatial aggregation at whatever level is relevant to a given health policy question.
Another important topic that we did not cover in this paper is the handling of large spatial datasets, which can substantially increase the computational burden. For an overview of some of the current methodologies that are used to approximate a Gaussian process, we refer the reader to ch. 3 of Diggle & Giorgi [40]. Among these, the approximation based on stochastic partial differential equations (SPDE), proposed by Lindgren et al. [54], is one of the most commonly used, especially in conjunction with Bayesian inference [55].
In order to guide the model-building process, we have distinguished between the predictive and explanatory power of the model. The balance between the two should depend on the objectives of the analysis. In our data example on malaria mapping in Tanzania, the objective was to identify areas of high transmission, defined as exceedance of a 0.3 prevalence threshold. Among the models considered, M_I, which included both environmental and socio-economic risk factors as covariates, had the strongest explanatory power. However, our results indicated that some of the regression coefficients of the linear splines were poorly identified, and we therefore proposed a simplified model, M_F, which incorporated most covariates as linear effects in the linear predictor. From a predictive modelling perspective, when considering models that use subsets of the covariates in M_F, we found that the full model M_F delivered only marginally better performance than these reduced models. Also, we found that some covariates helped to improve the accuracy of the point predictions of prevalence but were not as useful in increasing the sensitivity and specificity of the model with respect to the identification of pixels and regions that exceeded a 0.3 prevalence threshold. Furthermore, our simulation study showed that a model that misspecified the relationship between the covariates and prevalence performed as well as the true model in terms of the root-mean-square-error, but had a substantially lower sensitivity in the identification of areas above 0.3 prevalence. We point out that these findings depend on both the properties of the underlying spatial process and the prevalence threshold used for the exceedance probabilities, and hence are not generalizable to other geostatistical analyses.
We have shown that linear splines can help to account for nonlinear relationships between covariates and disease prevalence, and to explain covariate effects through easily interpreted regression coefficients. We have also introduced the concept of domain effects in order to inform the variable selection process and the validation of the model. When considering a larger number of covariates than those used in our example, a reduction in dimensionality could be achieved by combining all the covariates within each domain using, for example, principal component analysis. However, in this case, the explanatory power of the model would rely on the interpretability of the principal components, which may be problematic.
In our framework, validation of the assumed correlation function is carried out through a Monte Carlo procedure based on the empirical variogram. Two limitations of this approach are that the validity of the diagnostic relies on the assumptions of stationarity and isotropy, and that the statistical power to reject a chosen correlation function is often low. The second issue is related to the wider problem that prevalence data are only weakly informative about the smoothness of the underlying spatial process, which can result in highly uncertain estimates of the covariance parameters. The development of more statistically efficient diagnostic tools for correlation functions for count data requires further research.
To assess over-fitting, we have proposed inspection of the correlation matrix of the estimated regression coefficients as a way of detecting over-complex and poorly estimated regression relationships, and of identifying how to simplify them. We emphasize that this approach cannot be used to assess the relative importance of covariates, but only how well the regression relationships of the model are estimated. In our analysis of the Tanzania malaria data, our initial geostatistical model did show evidence of overfitting, with correlations close to 1, and we therefore simplified the model by removing knots of the linear splines at points where the data were sparse. However, it is difficult to provide a single, universal threshold value for the correlation that can be applied to identify overfitting. Our view is that decisions on how to define regression relationships should also rely on subject-matter knowledge and not exclusively on statistical judgement. For this reason, models that have more scientific validity but are less precisely estimated may also be favoured in some contexts.
In our example, information on the minimum and maximum age of the sampled individuals at each location was also available. However, we did not find any discernible relationship between these variables and the empirical prevalence and therefore discarded them from the analysis. When analysing aggregated disease count data, an alternative approach to the standardization of prevalence maps to specific age groups is to define the weights w(x) in (5.1) based on the age-stratified population density at location x.
In studies where the primary goal is to understand the relationship between risk factors and disease risk, rather than geostatistical prediction, some modifications to the framework we have illustrated may be needed. For example, a model for disease prevalence may be first developed by considering only covariates that are known to confound the relationship between disease risk and the variables that are of primary scientific interest. These are then introduced in the model after restricting the variable selection process, as described in this paper, to the confounding factors. Covariates of scientific interest should then be kept in the final model, regardless of their statistical significance. Confidence intervals for the estimated regression relations can instead be used as a way of conveying the strength of the association between the covariates and disease risk.
Finally, we point out that the statistical ideas and principles presented in this paper are applicable to any statistical analysis of epidemiological data based on regression modelling. This also includes extensions of the standard geostatistical model for prevalence mapping to spatiotemporal analysis, modelling of zero-inflated prevalence data, combining data from a mix of randomized and opportunistic surveys [56,57] and multiple diagnostics [58].
Data accessibility. This article has no additional data.