Bayesian cluster geographically weighted regression for spatial heterogeneous data

Spatial statistical models are commonly used in geographical scenarios to ensure spatial variation is captured effectively. However, spatial models and cluster algorithms can be complicated and expensive. One of these algorithms is geographically weighted regression (GWR) which was proposed in the geography literature to allow relationships in a regression model to vary over space. In contrast to traditional linear regression models, which have constant regression coefficients over space, regression coefficients are estimated locally at spatially referenced data points with GWR. The motivation for the adaption of GWR is the idea that a set of constant regression coefficients cannot adequately capture spatially varying relationships between covariates and an outcome variable. GWR has been applied widely in diverse fields, such as ecology, forestry, epidemiology, neurology and astronomy. While frequentist GWR gives us point estimates and confidence intervals, Bayesian GWR enriches our understanding by including prior knowledge and providing probability distributions for parameters and predictions of interest. This paper pursues three main objectives. First, it introduces covariate effect clustering by integrating a Bayesian geographically weighted regression (BGWR) with a post-processing step that includes Gaussian mixture model and the Dirichlet process mixture model. Second, this paper examines situations in which a particular covariate holds significant importance in one region but not in another in the Bayesian framework. Lastly, it addresses computational challenges in existing BGWR, leading to enhancements in Markov chain Monte Carlo estimation suitable for large spatial datasets. The efficacy of the proposed method is demonstrated using simulated data and is further validated in a case study examining children’s development domains in Queensland, Australia, using data provided by Children’s Health Queensland and Australia’s Early Development Census.


Introduction
Spatial regression models and algorithms are widely used to model the relationships between the response variable and the covariates over the region of interest.Cressie [8] proposed a spatial linear regression model in which only the intercept accounts for the spatial random effect.Diggle [13] extended the spatial linear regression to the spatial generalised linear model.Brundson [3] proposed geographically weighted regression (GWR) to capture smoothly varying patterns of the regression coefficients.The GWR fits a local weighted regression model at the location of each observation and captures spatial information by accounting for nearby observations, using a weight matrix defined by a kernel function.Xue [58] extended the GWR to the Cox survival model, by providing a novel approach to analysing spatially dependent survival data.This extension enhances the ability to explore how geographic factors impact time-to-event outcomes.
Despite the appeal of the GWR model, there are some limitations in these frequentist approaches.For example, a critical issue is the violation of the usual assumption of non-constant variation between observations, and the resultant normality assumption for the errors [5].Additionally, it struggles to address issues of model complexity, overfitting, variable selection and multicollinearity; also, the stability and reliability of frequentist GWR might yield unstable results or high variance when dealing with small sample sizes [15].Bayesian GWR is considered one of the best solutions to address these problems [48].In the Bayesian framework, Gelfand [19] built a model with spatially varying coefficients by applying a Gaussian process to the distribution of regression coefficients.LeSage [29] suggested an early version of BGWR, where the prior distribution of the parameters depends on expert knowledge.More recent approaches have been proposed by Ma [37], who proposed BGWR based on the weighted log-likelihood, and Liu [34] proposed BGWR based on a weighted least-squares approach.
In this paper, we propose a new extension that integrates BGWR with unsupervised probabilistic clustering.Cluster analysis is of great interest in many spatial contexts.The most common method for spatial clustering is the scan statistic [27], which is constructed via likelihood ratio statistics.Similar efforts have been made in Bayesian and non-parametric frameworks.For example, Li [31] proposed a non-parametric Bayesian method to find cluster boundaries for areal data.Neill and Moore [41] described an extension of the spatial scan statistics based on improving space-time cluster detection.Unsupervised clustering is a set of statistical and machine learning approaches that partition cohorts into subgroups based on the structure of the data.The most common unsupervised clustering algorithms are K-means [57] and a Gaussian mixture model (GMM) [38].We extend the BGWR to identify groups of observations that exhibit similar behaviour, by clustering the posterior regression coefficients obtained from BGWR with a Gaussian mixture model and the Dirichlet process mixture model.The proposed BGWR model thus clusters the coefficients into distinct homogeneous groups using these two probabilistic clustering algorithms.One notable advantage of our algorithm is its ability to automatically determine the optimal number of clusters without requiring prior knowledge, which sets it apart from the traditional K-means algorithm and provides an edge over frequentist GWR.
In this paper, we also introduce methods that significantly improve the Markov chain Monte Carlo (MCMC) estimation for the BGWR model proposed by Ma [37].Due to the high computational cost of their algorithm, the geographic regions must be divided into subsets and compute BGWR for each subset separately.This compromised the accuracy of the model for areas on the boundaries of the subsets.In our proposed approach this is not necessary, making it more suitable for large spatial scales and numerous areas while preserving information from the boundaries.Our approach involves removing unnecessary computations and factorizing declarations.It also enables the inclusion and efficient handling of both continuous and categorical explanatory variables.We also integrate BGWR with dynamic variable selection using the reversible jump Markov chain Monte Carlo(RJMCMC), which identifies when a particular covariate is important in a specific local region.
The power of the proposed algorithm is demonstrated through comprehensive simulation studies.As a practical illustration, this methodology is then applied to a case study focusing on children's development domains in Queensland, Australia.
Child development includes the biological, psychological, and emotional changes that occur between birth and maturity [53].Children's development in the early years from birth to five years of age is crucial since it is at this time that the foundations for health development, emotional wellbeing, and life success are built [24].The early identification of groups of children who are developmentally vulnerable may lead to prompt early intervention.Physical, social, emotional, speech and language, and communication skills are the five critical domains of growth [25].Development domains have been used in previous research to classify children into subgroups that describe their development using an unsupervised clustering algorithm [1,20,52].While BGWR offers a more robust framework for spatial regression, there remains a challenge in identifying and interpreting localized patterns or homogeneity within the spatial data.This is where the integration with clustering becomes important.Clustering, especially with advanced methods like Gaussian mixture models and the Dirichlet process mixture model, allows us to group similar spatial behaviours based on the posterior regression coefficients derived from BGWR.We are aiming to identify regions within the spatial data where specific behaviours or patterns are consistent, enabling targeted interventions or insights.
The novelty of our approach lies in this unique integration.By combining BGWR with probabilistic clustering, we not only get a refined understanding of spatial relationships but also connect these relationships into distinct clusters or groups.This analysis offers depth via BGWR and comprehensive insights through clustering.Moreover, our methodology's capability to autonomously determine the optimal number of clusters offers a more adaptive and intuitive clustering mechanism than traditional methods.In the context of children's development domains, this can be valuable.Identifying clusters can help in pinpointing areas or groups of children who might have similar developmental challenges or needs, leading to more targeted interventions and policy implementations.

Methods
In this section, we introduce the BGWR model and detail the three methodological contributions of this paper.First, we describe the vectorisation method to enhance BGWR efficiency.Second, we extend the BGWR analysis to include variable selection for each covariate in each specific region.Finally, we describe the extension of the BGWR model to incorporate the Dirichlet process mixture model (DPMM) and Gaussian mixture model (GMM) in order to identify clusters of coefficients of interest.

Bayesian geographically weighted regression
A common assumption for the GWR is that the error terms are normally distributed with constant variance ϵ(s) ∼ N (0, σ 2 I) [30] for a specific geographical location s.There are many situations in different fields where the assumption of constant variance is invalid.According to Paez et al. [42], the error term can be written as ϵ(s) ∼ N (0, Ω(s)), where Ω(s) = σ 2 (s)W (s) with W (s) as a diagonal matrix of geographic weights function f (d i (s)|b) that is a decreasing function of distance d i (s) from the location s to the location i. GWR is seen as a locally weighted regression method that operates by assigning a weight to each and every observation i depending on its distance from a specific geographical location s [56].This local perspective of the variance is often called location heterogeneity.A common approach is to define the observations that are within a certain distance b from a specific location s.Different weights can be used in the GWR model, including: where d i (s) is the distance between the locations s and i.Other weights include the exponential and Gaussian functions, which give the following expressions for weights, respectively: where b represents the bandwidth that controls the decay over distance [7].Equations ( 1) and ( 2) are decreasing functions of d i (s), which shows that an observation far away from the location of interest contributes little to the estimate of parameters at that location.Different choices have been used to find d i (s); the Euclidean distance is the most popular choice when latitude and longitude for each observation are available [59].Other choices of distance matrices include the graph distance [18] and greater circle distance (GCD) [4]; these approaches are used in the BGWR in sections 3 and 4. A further explanation of these distance measures can be found in Appendix A.
The likelihood of the BGWR model can be written as [37]: Y is the n × 1 observation or response, X the n × p predictor matrix, β is p × 1 vector of spatially varying coefficients, and . The posterior distribution is given as In the GWR model, it is crucial to choose a suitable bandwidth for the weighted function.In a BGWR context, a prior can also be applied to the bandwidth b to allow estimation of the best bandwidth given other parameters.The choice of the prior also depends on the measure of distance that is used.A common prior for the bandwidth is [43]: Without any prior knowledge, D can be selected to be large enough that we begin to approximate with a non-informative prior; i.e., we begin with an approximate global model in which all observations are weighted equally.

Vectorisation methodology
In this section, we introduce our first contribution aimed at enhancing the computational efficiency of the BGWR to accommodate large-scale datasets, ensure that there is no loss of information at the boundaries, and remove the need to split the data into smaller regions to run the BGWR model.The use of a multivariate normal distribution in BGWR leads to high computational costs.In this paper, we introduce a new approach based on vectorisation, which is achievable in the "nimble" package in R.This improves efficiency by reducing the number of calculations and nodes in the model and thus enhancing MCMC performance.The likelihoods for each region in the BGWR model are vectorized.The precision matrix σ 2 (s)W −1 (s) is a diagonal matrix, which allows for an alternative representation using univariate Gaussian functions.This permits independent estimation of mean and variance in each dimension and characterization of the multivariate density function as a product of univariate Gaussian functions for each location s.Given the likelihood function in the multivariate case with a diagonal W (s) for the region s, the likelihood becomes: When W (s) is diagonal, the inverse W −1 (s) is also diagonal, so the likelihood function can be rewritten as follows: Expanding the multiplication term: The likelihood function can then be further simplified to the product of univariate Gaussian likelihoods: where w −1 1 (s), w −1 2 (s), . . ., w −1 n (s) is the diagonal that represents the weights from a specific region s for the n different locations of the weight matrix W (s) for location s.Thus, observations with higher weights (indicating higher importance or reliability) contribute more to the overall likelihood value compared to those with lower weights.
The exponent term inside the exponential represents the sum of squared differences between the observed response variable y(s) and the predicted values x T (s)β, weighted by w −1 i , which is the precision or weight associated with each observation.The determinant of a diagonal matrix is simply the product of its diagonal entries.Thus, The likelihood in the above form demonstrates that, when the weighted matrix is diagonal, the multivariate Gaussian likelihood reduces to a product of n univariate Gaussian likelihoods, one for each region s.We can represent the likelihood for each y(s) as: The full likelihood for the entire dataset is the product of the likelihood for each observation.In summary, the BGWR approach that is developed in this paper can be represented in the following form: where f is the weighted function in equations 1 or 2, and b is the bandwidth.The posterior distribution is obtained using MCMC methods and the parameters of interest are estimated using the posterior mean of the respective marginal posterior distributions.The steps of fitting the proposed model in Nimble [11] are provided in Appendix B.

BGWR with Dynamic Variable Selection: RJMCMC Approach
This section introduces our second contribution, the identification of locally important covariates for each location.
In the Bayesian framework, various algorithms have been developed to determine the most relevant predictors that should be included in a model to provide the best explanation for a response variable.A popular approach was developed by Mitchell and Beauchamp [39], which employs a prior distribution that encourages sparse models in Bayesian linear regression.Spike and slab approach to variable selection have also been proposed by George and McCulloch, Chipman [6], and Kuo and Mallick [28] proposed a simpler approach that embeds indicator variables in the regression equation.Green [22] introduced RJMCMC, which enables models of different dimensionality to be explored.Bhattacharya and Dunson [2] proposed a Bayesian non-parametric approach to sparse regression that can handle infinite potential predictors.The approach uses a spike-and-slab prior to calculating posterior model probabilities for each possible subset of variables but does not require specifying the number of potential predictors a priori.Ma [37] employed a traditional spike and slab approach in the context of BGWR.However, their method includes or excludes the coefficient entirely without considering its potential importance in specific regions.
In this paper, a novel approach is described for spatial local BGWR that leverages the inclusion or exclusion of the coefficients based on their importance in specific regions.This advancement enhances the model's ability to capture location-specific effects accurately and opens new possibilities for understanding the spatially varying impact of coefficients.The proposed algorithm replaces the likelihood in equation (11) as: where Γ s is a 1 × p vector with elements γ j (s), and * presents the broadcast operation (element wise multiplication), with other priors given by equations ( 12) to (16).In equation (17), for each region s and each corresponding covariate j, an indicator variable γ j (s) is assigned, following a Bernoulli distribution with probability parameter ψ j .Thus γ j (s) allow probabilistic determination of whether the coefficient β j should be included in the model for the region s or not, depending on the contribution of the jth covariate in the model for the sth region.This dynamic approach to variable selection allows the model to make data-driven decisions about the importance of covariates in explaining the response variable for each region.A RJMCMC algorithm is adopted to implement this approach [22].The steps of the RJMCMC algorithm are detailed in Appendix C. We have tested our algorithm using simulated and real datasets; the code for the analysis of the simulated data is available on the first author's GitHub.Link to BGWR Clustering on GitHub

Cluster BGWR
In this section, we present two techniques for probabilistic clustering of the posterior distributions of the region-specific (local) coefficients obtained from the BGWR analysis.The first parametric unsupervised learning technique assumes that the coefficients can be represented by a known probability distribution, the Gaussian mixture model (GMM).The second is a non-parametric unsupervised learning method that does not make any assumptions about the probability distribution of the coefficients and aims to discover the clusters, using a Dirichlet process mixture model (DPMM).We describe these two approaches in a univariate context and consider a random sample of size n drawn from the posterior distribution of the coefficient of interest, obtained from the MCMC iterations.We denote this sample generically as y = {y 1 , y 2 , ..., y n }.

Gaussian mixture model
The GMM is represented as a weighted sum of Gaussian density functions, each with its own set of parameters [12].The GMM is represented as where ) is a Gaussian density function parameterized by θ k [23].For computational simplicity and to enhance inferential capability, a latent component indicator, z = {z 1 , z 2 , .., z n } is introduced, where each z i is a K-dimensional vector and the kth element of z i takes the value 1 if y i comes from component k, (k = 1, ..., K) and takes the value 0 otherwise.The parameters of the GMM can be estimated in a Bayesian framework using MCMC or approximations such as variational inference.Another popular approach that we adopted here is expectation-maximization (EM) [23].The EM algorithm iterates between two steps until convergence: the expectation step, which takes the conditional expectation of the complete data log-likelihood given the observed data and current parameter estimates, and the maximization step, which maximizes the log-likelihood with respect to the parameters to give updated estimates [45].There are many ways to select the optimal number of clusters (K) for GMM.Common approaches include the silhouette score [46], the Bayesian information criterion (BIC) [54], the elbow method which is based on a plot of the explained variance of the model versus the number of clusters [33], and cross-validation, which involves splitting the data into multiple subsets and then training and evaluating the GMM model on different subsets of the data [17].In this study, we used BIC to find the optimal number of clusters.The choice of BIC is motivated by the nature and assumptions of the GMM.The GMM assumes that data are generated from a mixture of several Gaussian distributions.BIC is particularly suited for model selection in probabilistic models like GMM.It balances the likelihood of the data under the model against the complexity of the model, effectively penalizing overfitting.Given that GMM involves the estimation of several parameters, especially as the number of mixtures (or clusters) increases, BIC becomes a particularly relevant criterion for determining the optimal model.

Dirichlet process mixture model
DPMM is a non-parametric method that replaces the fixed number of clusters with a random probability measure [16].The DPMM is defined by a base distribution G 0 and a concentration parameter α.The base distribution represents the prior distribution over the means of the clusters, while the concentration parameter controls the level of clustering.The DPMM can be summarised as follows: where each θ i is drawn from a mixing distribution G.This mixing distribution has a Dirichlet process prior, with concentration parameter α and base distribution G 0 where The concentration parameter acts as an inverse variance where larger values of α result in smaller variances, which creates more concentrated draws around the mean of the base distribution [50].
The DPMM is typically estimated using a Bayesian approach, such as the Chinese restaurant process or the stick-breaking process [40].The stick-breaking construction is used in this paper.
In this construction, the G can be determined by an infinite sum of weighted point masses: where δ θ k is a point mass of 1 located at θ k , which is sampled directly from G 0 , i.e θ k ∼ G 0 .The weights C k are obtained through the stick-breaking process:

Cluster configurations
Two approaches were considered to determine cluster membership of the observed data.The first method, known as Dahl's approach [10], involves computing the membership matrices for each iteration, denoted as B (1) , . . ., B (M ) , with M being the number of post-burn-in MCMC iterations.
The membership matrix for the cth iteration B (c) is defined as: where 1(•) represents the indicator function.The entries B (c) (i, j) take values in {0, 1} for all i, j = 1, . . ., n and c = 1, . . ., M with B (c) (i, j) = 1, indicating that observations i and j belong to the same cluster in the cth iteration.An empirical estimate of the probability that locations i and j are in the same cluster is given by the average of B (1) , . . ., B (M ) : where denotes the element-wise summation of matrices.The (i, j)th entry of B provides the required empirical estimate.Subsequently, we determine the iteration that exhibits the least squared distance to B as: The second method utilised here is the mode method.This method leverages the posterior samples of z i , where z denotes the cluster assignments specific to each region.Each iteration generates a new set of cluster assignments z, which are dependent on the parameters.Consequently, following multiple iterations, each region will have a collection of cluster assignments z.The mode indicates the cluster with the highest probability of assignment for a given region.Moreover, obtaining probabilities for assignment to alternative clusters provides valuable insights, aiding in the inferential and decision-making process.

Cluster accuracy
To assess the accuracy of our proposed algorithm, we utilised the Rand index (RI) [44] in our simulated study (see section 3), to perform a comparison between the cluster configurations obtained using either Dahl's method or the mode method, and the true clusters.Specifically, we employed this metric for the simulated data.However, for real data analysis, true labelling is unavailable.The RI quantifies the level of agreement between two sets of cluster assignments, denoted as C and C ′ , with respect to a given dataset X = {x 1 , x 2 , . . ., x n }.Each data point x(s) is assigned a cluster label c i in C and c ′ i in C ′ .The computation of RI is based on the following formula: where a is the number of pairs of data points that are in the same cluster in both C and C ′ (true positives), b is the number of pairs of data points that are in different clusters in both C and C ′ (true negatives), c is the number of pairs of data points that are in the same cluster in C but in different clusters in C ′ (false positives), d is the number of pairs of data points that are in different clusters in C but in the same cluster in C ′ (false negatives).
The RI value ranges from 0 to 1, where a value of 1 signifies a perfect agreement between the two clusterings (both C andC ′ fully agree on all pairs of data points).On the other hand, a value close to 0 indicates a low level of agreement between the two clusterings.

Simulated Data Analysis
In this section, we present two simulated datasets: one with spatially varying coefficients and the other without.The first simulation is related to a dataset based on Louisiana and is generated in a manner similar to the work by Ma [37].This dataset comprises a total of 64 regions, with three observations in each region i.
We used a bandwidth parameter D set to 100.The maximum great circle distance in the spatial structure of the 64 regions is 10, so using a bandwidth of 100 induces a weighting scheme that ensures relative weights are assigned appropriately.If the distance between two regions is considerable, the relative weight is approximately exp(−10/100) = 0.904.This approximation thus allows the model to behave similarly to a global model where every observation is equally weighted, ensuring a sufficiently non-informative prior bandwidth b.
In each region, three observations are generated, resulting in a total of 192 observations per replicate.The analysis is replicated 100 times.Each replicate involves running a MCMC chain of length 10,000 without thinning, and the initial 2,000 samples are discarded as burn-in.The mean bandwidth selected in the 100 replicates was calculated as well as the posterior means of the parameters β.The results are reported in Table 1.Additionally, we employed the dynamic variable selection (RJMCMC) technique described in section 2.3.This analysis wasreplicated 100 times.Each replicate involves running a RMCMC chain of length 400,000 without thinning, and the initial 20,000 samples are discarded as burn-in to converge.The results are summarized in Table 2.This table illustrates how our algorithm identifies important covariates for each location.Our proposed BGWR model with vectorisation significantly improves computational efficiency, completing each replicate in under 250 seconds.In contrast, a model that emulated the previous approach suggested by Ma [37], which relies on a multivariate normal distribution to calculate the likelihood [37], took over 15 minutes to run for each replicate.Finally, we evaluated various weighted functions in this analysis.Table 3 presents the WAIC, DIC, and effective sample size (P D ) obtained with different kernels.The explanation of WAIC and DIC for the BGWR model can be found in the Appendix D. The Bi-square kernel demonstrated a better fit to the data compared to the exponential and Gaussian kernels.
The second simulated study was created based on the structure of the state of Georgia, also considered by Ma [36].This dataset includes 159 regions.Six spatially correlated covariates (X 1 to X 6 ) were generated using multivariate normal distributions with spatial weight matrices derived from the distance matrix and parameter bandwidth.The response variable (Y ) in the simulation was generated by the GWR model: Notably, the parameters (β 1 to β 6 ) of this GWR model were spatially varying, based on the spatial weight matrices.We then visually partitioned the areas into three large regions to define true clustering settings based on the spatial coordinates of centroids.This approach allowed us to create distinct spatial patterns in the data, which incorporated spatial autocorrelation, spatial variability, and true clustering settings.proposed algorithm is presented in Table 5.The analysis includes the three settings (Setting 1, Setting 2, and Setting 3) and the two clustering methods (GMM and DPMM).It is apparent that Setting 3 consistently exhibits higher RI values, signifying superior clustering accuracy compared to the other settings.Furthermore, the number of clusters varies across settings and methods.As anticipated, the DPMM method generally generates a larger number of clusters than the GMM method across most settings [32].Additionally, the cluster configurations differ across settings and methods, revealing distinctive patterns and structures in the data for each configuration.The results for the final clusters in each setting are visually presented in Appendix F. (1, 0, 1, 0, 0.5, 2) (1, 0.7, 0.3, 2, 0, 3) (2, 1, 0.8, 1, 0, 1) 2 (2, 0, 1, 0, 4, 2) (1, 0, 3, 2, 0, 3) (4, 1, 0, 3, 0, 1) 3 (9, 0, -4, 0, 2, 5) (1, 7, 3, 6, 0, -1) (2, 0, 6, 1, 7, 0) Table 5: The cluster accuracy and cluster configuration using three different settings when there is spatial variation in the underlying true parameters based on Georgia dataset.4 Real Data Analysis In this section, we provide a comprehensive overview of our findings derived from applying the BGWR model in a real-world scenario.We explain the sources of the data employed and describe the cluster configurations associated with the BGWR parameter coefficients.Moreover, we present a thorough investigation into preschool attendance as estimated by the BGWR model, alongside the corresponding probability values.We also discuss the incorporation of dynamic variable selection into our analysis of the actual data, and we present a visual representation of substantively important coefficients for each region.

Sources of the data
The Children's Health Queensland (CHQ) has created an impressive resource called the CHQ Population Health Dashboard, which provides data on key health outcomes and socio-demographic factors for a one-year period from 2018-2019.The dashboard is based on information from 528 small areas (statistical area level SA2) across the state of Queensland, Australia.The dashboard includes over 40 variables, visualized in a user-friendly format.The case study considered in this paper focuses on the health outcomes section of the dashboard, specifically, vulnerability indicators, which measure developmental vulnerability for children across five Australian Early Development Census (AEDC) domains: • Physical health and wellbeing, which evaluates children's physical readiness for school, their level of physical independence, and their gross and fine motor skills.
• Social competence, which assesses children's overall social competence, responsibility, respect, approaches to learning and readiness to explore new things.
• Emotional maturity, which examines children's pro-social and helping behaviour, anxious and fearful behaviour, aggressive behaviour, and hyperactivity and inattention.
• Language and cognitive skills (school-based), which evaluates children's basic literacy, interest in literacy, numeracy, and memory, as well as their advanced literacy and basic numeracy.
• Communication skills and general knowledge, which measures children's communication skills and general knowledge.The AEDC data also includes two additional indicators: vulnerable on one or more domains (Vuln 1) and vulnerable on two or more domains (Vuln 2).Additionally, the CHQ dashboard encompasses data on socio-demographic factors that may be linked to health outcomes.Three factors considered in the analysis are the Socio-Economic Indexes for Areas (SEIFA) score, attendance at a preschool, and remoteness.The SEIFA score is a socio-economic index that summarizes a variety of data on individual and family economic and social conditions in a given area.It ranges from 1 to 5, with a low score indicating that greater disadvantage.The remoteness factor includes the categories of cities, regional, and remote.In 2018, Queensland had 294 SA2s in cities, 208 regional SA2s, and 24 remote SA2s.The analysis uses data from the AEDC, which is conducted every three years and collects data on children in their first year of school.The data used in this study is from the 2018 census.Due to the aggregated nature of the data, the analysis focuses on the proportion of vulnerable children within each SA2 [51], collected by first-year teachers across the Australian government and non-government schools, with parents' agreement.The final dataset therefore compromises the proportion of children who attended preschool, the SEIFA, and remoteness for each SA2.
Between 3 and 6% of the data had missing variables, which were imputed using the average of the proportions from the neighbouring SA2s.For categorical data, such as remoteness, missing values were imputed using the highest frequency category of the neighbouring SA2s.However, missing values for two islands could not be imputed as these regions have no contiguous neighbours.As a result, the analysis was carried out on the remaining 526 SA2 areas.Ethical considerations The study was approved by the Human Research Ethics Committee of Children's Health Queensland Hospital and Health Service (HREC/21/QCHQ/75725).

Case study analysis
The main objective of the analysis was to examine the factors that influence children's developmental vulnerability in one or more domains (Vuln1) with a particular focus on the importance of attendance at preschool.Using the proposed approach, we aimed to identify clusters of regions that are similar with respect to the influence of attendance at preschool on Vuln1.
The previous work on the BGWR model was designed to deal with continuous covariates [37].However, in this paper, we extended the BGWR model to deal with categorical variables as well, specifically for the "remoteness factor".Since a region can only belong to one of the three levels (cities, regional, or remote), we created a list of covariates for each region for inclusion in the BGWR model.This avoids the critical issue of drawing from the prior when there is no valid information available for a particular category.
In Section 4.2.1, we explore inferences associated with the proportion of attendance at preschool, as derived from the BGWR model.In section 4.2.2, we explore the efficiency of our variable selection algorithm.Finally, in Section 4.2.3, we discuss the probabilistic clustering results from the two algorithms.Additionally, we investigate the probability of each region belonging to specific clusters.

Inferences from the Bayesian geographically weighted regression
We use the graph distance as well as the greater circle distance in the BGWR model.So, we adopted a non-informative prior for the bandwidth and estimated the optimal bandwidth as part of the analysis by choosing a value of D = 100 and the maximum distance between any two points is 10.
The MCMC algorithm was run for 400,000 iterations with a burn-in period of 20,000 iterations.The results reported are based on the remaining 380,000 iterations.We focused on exemplar insights from the BGWR model.First, the posterior mean and 95% credible interval were obtained for the coefficients associated with attendance at preschool in each region.The results are summarised in Figure 2. It appears that there is a substantive negative relationship between the proportion of attendance at preschool and Vuln 1.This negative association is stronger in the South-East of Queensland, particularly in greater Brisbane, and comparatively weaker in the northern regions of Queensland.Several factors could contribute to this relationship, such as parental background, Indigenous status, and access to preschool services.The second insight is on the evaluation of the non-stationary variance.
Figure 3 presents a scatter plot of the posterior means and variances per region derived from the BGWR model.Regions with smaller posterior means generally have larger variances.Noticeably, this relationship is affected by the three levels of remoteness factor in the case study.The third Figure 3: The scatter plot illustrates the relationship between the mean and variance per each region, derived from the posterior of the coefficients associated with attendance at preschool in the BGWR model.insight involves comparisons between various regions.We selected two regions from cities, two outer regions, and two remote areas, for the purpose of identifying differences and similarities.These regions include Thorlands, Clayfield, Rockhampton region -West, Lockyer Valley-East, Ingham region, and Kowanyama Pormpuraaw.A visualization of the correlation between these six regions is shown in Figure 4, which reveals a slight association among the posterior estimates for preschool attendance.Figure 5 presents a comparative analysis of probabilities among the six regions regarding the posterior estimates of the coefficient associated with attendance at preschool.The probabilities are computed by comparing the posterior draws of the coefficient for one region with those of other regions, for each MCMC simulation.The vulnerabilities in our study are expressed as proportions ranging between 0 and 1.Given this scale, large coefficient values would not only be unexpected but also potentially misleading.Instead, smaller coefficients are more appropriate and consistent with the data's inherent structure.Additionally, while the range of these coefficients might appear limited, even small variations can be meaningful.Especially in a context where the dependent variable operates on such a narrow scale, even slight changes in predictor values can lead to significant shifts in outcomes.Therefore, allowing the coefficients to vary, even within a small range, is essential.This flexibility ensures that our model can capture and represent subtle spatial patterns and gradients in vulnerabilities, which might be critical for understanding and addressing them effectively.Assume β p (i) is the coefficient associated with attendace at preschool and (s) location.In Figure 5, the values are computed using the formula P r (β p (k) > β p (q)).This formula is computed over M iterations when the β p coefficient of the region k is greater than the coefficient in region q.Specifically, for each iteration, an indicator function assigns a value of 1 when β p (k) > β p (q) and 0 otherwise.The final probability is the sum of these indicator values divided by the total number of iterations, M .For example, the value at row 5, column 3 (0.87) represents the probability of attendance at preschool in Kowanyama Pormpuraaw to be greater than Ingham region.The analysis was extended to explore P r (β p (i) > βp ), where β p (i) is similar to before, and βp denotes the attendance at the preschool posterior mean for specific region i.The resulting probabilities are visualized in Figure 6, offering insights into the variability and disparities among regions.Over 300 SA2 regions have a probability of less than 0.5 exceeding their posterior mean βp .
The distribution of these probabilities is depicted in Figure 7. Regions with higher probabilities are highlighted, indicating a greater probability of exceeding the posterior mean for that particular region.In this plot, 101 SA2 regions exhibit a probability greater than 0.8 of having (β p (i) > βp ) across M iterations.Additionally, our analysis involved calculating the probability of each region being among the top ten regions for the coefficients associated with attendance at preschool β p .To achieve this, we performed the following steps: For each iteration in our study (denoted by M ), we obtained a posterior draw for β p in each of the 526 SA2 regions.These draws were then sorted, and the top ten were identified.An indicator variable was created for each iteration M , taking the value 1 if it was among the top ten and 0 otherwise.This process was repeated for all iterations in the posterior.After obtaining the indicator values for all iterations, we calculated the probabilities by summing up the values corresponding to a specific region and dividing by the total number of iterations.This gave us the probabilities of each region being in the top ten.The probability of each region being in the top ten is shown in Figure 8.This visual representation highlights that the far north of Queensland stands has a stronger probability of being in the top ten compared to South-East Queensland.
Finally, an assessment was undertaken of the impact of different kernel functions in the BGWR analysis of the case study.The exponential, bi-square and Gaussian functions were considered.Results were presented in Table 6.As in the simulated study in section 3, Table 3, the performance of the algorithm is evaluated based on three criteria: WAIC, DIC and P D (effective sample size).The table shows that the exponential kernel outperforms the other two kernels, bi-square and Gaussian, in all three criteria.Furthermore, we also observed a consistent pattern in the behaviour  of the greater circle distance and the graph distance when utilised with non-informative priors for the bandwidth in both our case study and simulated data.The results from both distance metrics provide similar posterior estimates.The cluster configuration process of DPMM happens in two stages: The first stage is a withinsample cluster configuration, obtained for each of the 500 randomly chosen iterations in the MCMC analysis of the DPMM.The configurations are based on Dahl's method and the mode method.The second step is across-sample cluster configuration.In this stage, we consider the entire set of 500 samples and their corresponding cluster configurations obtained from the first stage and perform clustering configurations again to get the final cluster configuration for each region (see tinuous and categorical variables.The new vectorisation provides remarkable improvements in MCMC sampling efficiency.This enables scalable analysis of large-scale datasets without requiring the dataset to be split into smaller sub-regions.An additional feature of the BGWR model is the facility to identify covariates that are important in some locations but have minimal impact in other locations.The extension of the approach to include probabilistic clustering substantially expands the utility of the analysis and provides insights about similar geographic regions that are of great interest in subdomains such as health.
The proposed methods were successfully implemented in R using the nimble computational framework [11].Additionally, we investigated various weighting schemes based on graph distance and greater circle distance, finding that these approaches yield models with robust parameter estimation capabilities, particularly when facing spatially heterogeneous data.
There are several avenues for future research that extend beyond the scope of this paper.For instance, our Bayesian approach could be extended to handle generalized linear models (GLMs), which would significantly enhance its applicability to a broader range of regression models and diverse data types.This extension would involve adapting the algorithm to accommodate various link functions and likelihood distributions [26].Additionally, integrating penalized methods like ridge regression, lasso, or elastic net regularization into our approach could efficiently handle high-dimensional data and potentially improve the accuracy and stability of parameter estimation.Another direction is to develop a robust framework to determine optimal bandwidths.A comprehensive study on bandwidth selection is essential to ensure the efficiency and accuracy of the BGWR algorithm in handling spatially varying covariate effects within a Bayesian framework [9].
For model interpretability and identifying relevant covariates, appropriate approaches for variable selection in the context of clustered regression should be explored.Utilising DPMM and GMM to obtain clustering information of regression coefficients shows promise but requires further investigation to resolve inconsistencies in the posterior on the number of clusters [21].Moreover, to enhance the reliability and robustness of the proposed algorithm, especially when dealing with spatially dependent data and clustered structures, a prior for spatially clustered regression coefficients should be developed [26,21].
Developing a comprehensive spatial probabilistic framework would provide a solid foundation for handling complex spatial structures, incorporating various spatial components, and enabling integration with other spatial models and techniques.For example, investigating the connection of the proposed BGWR with latent variable models might lead to more advanced and efficient clustering techniques.Such integration could allow for more flexible and adaptive clustering approaches that account for complex spatial patterns [14].Finally, another promising direction to extend this work is to develop the proposed algorithm for a multivariate BGWR model.By considering multiple response variables simultaneously, we can enhance the model's capability to capture complex spatial dependencies and better understand the relationships between the variables in a spatial context.This endeavour holds great promise for advancing the state-of-the-art in spatial statistics and providing a powerful tool for tackling real-world spatial data challenges.

Conclusion
In this paper, we have achieved three main contributions.The first is the combination of Bayesian Geographically weighted regression with clustering via the Gaussian mixture model and the Dirichlet process mixture model to detect patterns in how different factors influence geographic data.The second is the optimisation of the computational algorithm to work well with large geographic areas and many spatial regions, leading to faster estimations.The third is the inclusion of the dynamic variable selection for each location in the BGWR model.
In addition to validating the effectiveness of our method through substantive simulated studies, we considered a case study focusing on children's development in Queensland.Real data from Children's Health Queensland and Australian Early Development Census were utilised for this study.By examining vulnerability in at least one of five critical development domains-physical health and wellbeing, social competence, emotional maturity, language and cognitive skills (schoolbased), and communication skills and general knowledge-our approach successfully identified clusters of regions with similar developmental vulnerabilities.This discovery holds significant implications for health services planning and early intervention efforts, aiming to enhance the overall wellbeing and success of children during their formative years.Furthermore, we conducted an in-depth exploration of the preschool attendance covariate using the BGWR model.
Overall, our research contributes to the field of spatial regression analysis by offering a powerful and efficient tool for understanding spatially varying patterns in the relationship between risk variables and responses and opens up new avenues for exploring spatial clusters and their implications.Our approach can be used in research in various domains beyond child development, such as health services planning and spatial analysis in general.

Appendices A Spatial Distances
When working with areal data, the graph distance is an alternative distance metric that can be used.It is based on the concept of a graph, where V = {v 1 , ..., v m } represents the set of nodes (vertices) and E(G) = {e 1 , ..., e n } represents the set of edges connecting these nodes.The graph distance is defined as the distance between any two nodes in the graph, where d is the graph distance, f is a weighting function, and b represents the bandwidth.In this study, we suppose that f () is a negative exponential function [37], so that: Another way to calculate the distance from the areal data is the greater circle distance (GCD) which calculates the shortest distance d i (s) between two points on the surface of a sphere (e.g., Earth) using their latitude and longitude coordinates.This accounts for the curvature of the Earth and gives the distance along the surface of the sphere, following the path of a great circle.The great circle represents the largest circle that can be drawn on a sphere and passes through the two points.The GCD is given as [4] : To calculate the GCD, the formula involves trigonometric functions.Specifically, it calculates the arc-cosine of a term involving the cosines and sines of the latitudes and longitudes of the two points.a s and b s represent the latitude and longitude for location s, and a i and b i represent the latitude and longitude for location i.The term inside the arc-cosine represents the dot product of two vectors in a 3D space, which is used to determine the angle between the two points.The inverse cosine function of this dot product yields the angle, and by multiplying it with the Earth's radius r, we get the great circle distance between the two points [4].The weighting scheme introduced in graph distance and greater circle distance, where neighbours are assigned the same weight and all others receive positive weight, provides additional assurance of parameter estimation stability when compared to the other weighting schemes.

B Metropolis-Hastings Sampling for Bayesian Geographically Weighted Regression (BGWR)
This section explains how to draw samples from the posterior distribution of the BGWR model parameters, considering the spatial dependencies, likelihood function, and prior distributions.Initialization: • Start with an initial set of values for all parameters: β(s), σ 2 (s), σ β , and b.
Compute Acceptance Ratio R:

Accept or Reject
• Draw a random value u from a uniform distribution U (0, 1).If u < min(1, R), accept the proposed parameters.Otherwise, retain the current parameter values.

Diagnostics
• Repeat the above steps for the desired number of iterations until converge.

C RJMCMC Algorithm
RJMCMC lets the chain switch between models with different parameter counts.In this paper, RJMCMC is used to assess the inclusion or exclusion of predictors at each spatial location.This section explains the RJMCMC steps mentioned in 2.3 and how they're used to find important local variables. Initialization: • Initialize all parameters of the model, i.e., choose initial values for γ j (s), ψ, β(s), σ 2 (s) and b.
• Example: Set γ j (s) = 1 for all s and j (all predictors are included initially).
• Set ψ = 0.5 assumption that each predictor is equally likely to be included or excluded.
• Initialize β(s) and σ 2 (s) based on a simple linear regression with all predictors included.

Proposal
Step: • Select a predictor j and a spatial location s randomly.
• Propose a change in the model.If γ j (s) = 0 (predictor j is currently excluded), propose to include predictor j in the model (γ j (s) = 1).If γ j (s) = 1 (predictor j is currently included), propose to exclude predictor j from the model (γ j (s) = 0).
• Propose a new value for b by adding a small random change to the current value.
Calculation of Acceptance Ratio R: Compute the acceptance ratio as: R = p(y|new model) × p(new model) × q(delete) p(y|current model) × p(current model) × q(add) × J where: • Likelihood of Data Given Model p(y|model): The likelihood function is given by the normal distribution: • Prior Probability of Model p(model): This term includes the prior distribution of the model parameters β(s), σ 2 (s), and ψ.with a Bernoulli prior for γ j (s) and a Beta prior for ψ, β(s) prior is N (0, Σ β ) ,σ 2 (s) prior is IGamma(α 1 , α 2 ) and b prior is uniform(0, D).
• Proposal Densities q(add), q(delete): These represent the probability of proposing to add or delete a predictor.In this case, let's assume q(add) = q(delete) = 1 p , where p is the total number of predictors.This assumes that each predictor is equally likely to be proposed for addition or deletion.
• Jacobian J: The Jacobian accounts for the change in the dimensionality of the parameter space when adding or deleting predictors.In this case, adding a predictor increases the dimension by one, while deleting reduces it by one.We use the Jacobian to properly adjust our calculations in the RJMCMC algorithm.Specifically, we apply the reverse of the Jacobian that was used when we proposed changing the model.This helps to ensure our probabilities are correct and the Markov chain remains balanced.
Calculate the log pointwise predictive density for each data point y(s): finally, calculate the WAIC for the BGWR model using the formula: where V is the sum of the posterior variances V i for all data points, and log p(y(s)|y) is the log pointwise predictive density for each data point.Smaller WAIC value indicates a better-fitting model.
The DIC is given as : In this context, θ denotes the parameters of interest, and θ represents the posterior mean.The deviance function, denoted as Dev(•), is utilised, and the effective number of parameters in the model, denoted as p D , is calculated using the equation p D = Dev(θ) − Dev( θ).The specific expression for the deviance function is provided in [35] and is as follows: The DIC for GWR can be summarised as: The quantities β(s), W (i), and σ2 (s) correspond to the posterior estimates derived from the MCMC results similar to before.A smaller DIC value is a more favourable model [37].

E DPMM Cluster Configuration Process
In this section, we provided a further explanation for the two-stage cluster configuration that we developed to find the final cluster labelling for the DPMM in the case of the mode method.
Stage 1: Within-Sample Cluster Configuration Given: • N : total number of regions • M : number of MCMC samples obtained from BGWR (here, M = 500) • Q: number of MCMC iterations for DPMM for each sample i (here, Q = 2000) • C i : cluster configuration for the ith sample For each sample i from 1 to M : 1. Perform MCMC analysis on the DPMM and determine the cluster assignment for each region.
2. Based on the given regions, we can find the cluster configuration, C i , using the mode method.
3. Arrange the cluster configuration derived from each samplei across Q iteration and put N A if the value does not exist.Each sample i may give a different number of clusters.
Mathematically, for the mode method, the cluster assignment for a region j during the ith iteration, c i,j , is: where p(x j |C i ) represents the probability of region N i belonging to each cluster in configuration C i .Where: • c i,j is the cluster assignment for region j for sample i.
• If the number of clusters during sample i is less than M max , some c i,j values in that row will be NA.
1.For each region j, consider its column in matrix C.
2. Calculate the mode across all entries in that column, ignoring the NA values.
The final cluster assignment for each region j, c final,j , is: c final,j = mode ({c 1,j , c 2,j , . . ., c i,j }) Where: • c final,j is the most frequently assigned cluster for region j across all MCMC 500 samples.
• The mode function will exclude NA values.
Using the matrix C, we construct the final cluster assignment vector C final as: Where each entry c final,j corresponds to the final cluster assignment for each region j.To handle variations in cluster numbers across iterations, the use of NA simplifies the mode calculation by indicating a lack of assignment.

F Simulated Data Further Analysis
In this section, we present additional results from the analysis of simulated data for the three settings across 100 replicates and report the average performance within these replicates.Our main focus is on the spatial distributions of the clusters obtained using our proposed algorithm.The final cluster configuration is determined using both Dahl's method and the mode method as the assignment criteria.
Additionally, in this section, we present an empirical density plot showcasing the optimal number of clusters obtained from each sample of the BGWR coefficients posteriors.To ensure accuracy, we considered approximately 500 samples from the posterior of the beta parameters obtained from Figure 15: The optimal number of clusters obtained from the GMM using BIC from 500 samples from the posterior, for Setting 1, along with the cluster configuration on the spatial map using the proposed algorithm BGWR without replacement during the evaluation of our algorithm.For the simulated data, we ran our algorithm for 5000 iterations, with 2000 burn-in iterations.Furthermore, Figure 16 depicts the spatial distribution obtained from the proposed algorithm using the GMM cluster method.Notably, our algorithm consistently identifies the optimal number of clusters as 3 across the mode method and Dahl's method.Noticeably, in this setting, the range of the Rand index (RI) is higher compared to setting 1.Compared to Figure 15 and Figure 16, Figure 17 demonstrates higher Figure 16: The optimal number of clusters obtained from the GMM using BIC from 500 samples from the posterior, for Setting 2, along with the cluster configuration on the spatial map using the proposed algorithm accuracy (RI) with the true labels.The Mode method and Dahls' method generate more clusters.As anticipated, increasing the signal strength leads to improved clustering accuracy.However, the proposed approach exhibits a tendency for over-clustering.Nevertheless, this tendency diminishes as the signal strength increases.
We compared two models for grouping data, DPMM and GMM, in three different situations.
Figure 17: The optimal number of clusters obtained from the GMM using BIC from 500 samples from the posterior, for Setting 3, along with the cluster configuration on the spatial map using the proposed algorithm.
Notably, the DPMM outperformed the GMM by creating more clusters, especially smaller ones.This pattern persisted consistently across all three scenarios, demonstrating the DPMM's remarkable ability to uncover structures in the data.The DPMM's adaptive nature allowed it to identify complex patterns, making it a powerful tool for revealing hidden insights in complex datasets.

G Case Study Further Analysis
The optimal number of cluster derived from 500 samples with the GMM method and assessed using BIC, is depicted in Figure 21.The provided plots illustrate the probability of each region belonging to various clusters.Since we have 526 regions to visualize on the heatmap, we have

Figure 1 :
Figure 1: Cluster assignment for Georgia regions used for simulation studies.

Figure 2 :
Figure 2: LHS) posterior means and 95% credible interval for the coefficients associated with preschool attendance, RHS) the geographical distribution of the posterior means.

Figure 4 :
Figure 4: The correlation between the posterior distributions of the attendance at preschool parameter across the six regions.

Figure 5 :
Figure 5: Comparison of probabilities for attendance at preschool posterior across the selected six regions.

Figure 6 :
Figure 6: The probabilities of each region to get the posterior draws coefficient associated with preschool attendance greater than its posterior mean.

Figure 7 :
Figure 7: The spatial distribution of the probability of each region to get the posterior draws coefficient associated with preschool attendance greater than its posterior mean.

Figure 8 :
Figure 8: The probability of each region on the map to be in the top ten.

Figure 10 :
Figure 10: The posterior mean summary for each covariate from the local model according to the proposed selection algorithm.

Figure 11 :
Figure 11: Spatial pattern of GMM clusters: case study results using Dahl's and mode methods.

Figure 14 :
Figure 14: The probabilities of each region in Queensland belonging to specific clusters obtained from the DPMM.

Stage 2 :
Across-Sample Cluster ConfigurationGiven the set of cluster configurations from the first stage:

Figure 18 :
Figure 18: The spatial distribution of the clusters obtained from setting 1 using DPMM.

Figure 19 :
Figure 19: The spatial distribution of the clusters obtained from setting 2 using DPMM.

Figure 20 :
Figure 20: The spatial distribution of the clusters obtained from setting 3 using DPMM.

Figure 23 :
Figure 23: The probabilities of the second 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 24 :
Figure 24: The probabilities of the third 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 25 :
Figure 25: The probabilities of the fourth 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 26 :
Figure 26: The probabilities of the fifth 100 regions to belong to each cluster from 1 to 4 for GMM.

Figure 27 :
Figure 27: The probabilities of the last 26 regions to belong to each cluster from 1 to 4 for GMM.

Figure 28 :
Figure 28: The probabilities of the first 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 29 :
Figure 29: The probabilities of the second 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 30 :
Figure 30: The probabilities of the third 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 31 :
Figure 31: The probabilities of the fourth 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 32 :
Figure 32: The probabilities of the fifth 100 regions to belong to each cluster from 1 to 12 for DPMM.

Figure 33 :
Figure 33: The probabilities of the last 26 regions to belong to each cluster from 1 to 12 for DPMM.

Table 1 :
Average parameter estimates and their performance when there is no spatial variation in the underlying true parameters.The performance metrics used include mean absolute bias (MAB), mean standard deviation (MSD), and mean of mean squared error (MMSE).

Table 2 :
Average parameter estimates and the performance of parameter estimates when there is no spatial variation in the underlying true parameters using dynamic variable selection.The performance metrics used include MAB, MSD, and MMSE.

Table 3 :
Model assessment for the proposed algorithm BGWR using different kernels for the simulated data.

Table 4 :
True parameter vectors used in data generation for three clusters.

Table 6 :
WAIC, DIC, and effective sample size values from the proposed BGWR algorithm using different kernels for the case study.
e is the shortest distance connecting a pair of nodes.∞ if the two nodes are not connected.where |V (e)| is the number of edges in e [18].Choosing the right settings for distance methods can be tricky and depend on "how near is really near" One way to decide this is by using graph distance.If regions have common boundaries, they have a graph distance of 1, which means they are close.But if they have a distance of more than 1, they are not so close.Regions that aren't very close should get less importance when we do calculations.The graph distance-based weighted function is given: