Examining the correlates and drivers of human population distributions across low- and middle-income countries

Geographical factors have influenced the distributions and densities of global human population distributions for centuries. Climatic regimes have made some regions more habitable than others, harsh topography has discouraged human settlement, and transport links have encouraged population growth. A better understanding of these types of relationships enables both improved mapping of population distributions today and modelling of future scenarios. However, few comprehensive studies of the relationships between population spatial distributions and the range of drivers and correlates that exist have been undertaken at all, much less at high spatial resolutions, and particularly across the low- and middle-income countries. Here, we quantify the relative importance of multiple types of drivers and covariates in explaining observed population densities across 32 low- and middle-income countries over four continents using machine-learning approaches. We find that, while relationships between population densities and geographical factors show some variation between regions, they are generally remarkably consistent, pointing to universal drivers of human population distribution. Here, we find that a set of geographical features relating to the built environment, ecology and topography consistently explain the majority of variability in population distributions at fine spatial scales across the low- and middle-income regions of the world.


Introduction
While archaeologists have long stated that settlement patterns are complex and multi-factorial, geography has always been a determinant of the location of human settlements with humans primarily settling where resources are available, such as coastal areas and arable lands [1][2][3][4][5]. Access to sufficient resources to meet the needs of a population limit the population densities in any given location while other locations may have climates and topography that are less conducive to supporting human populations. However, the location of human populations is not simply determined by the natural environment, i.e. environmental determinism [6]. Since the agricultural revolution, humans have often been the drivers of change in the natural environment, modifying it in ways to better access resources/services (e.g. transportation networks, densification of services and production in urban areas) or to make the natural environment more productive and habitable (e.g. land conversion to agriculture, wetland drainage, irrigation, shelter in settlements) [7][8][9][10][11]. Sometimes humans have modified the environment in ways that make it less habitable, such as through pollution and desertification, or no longer habitable, such as in the cases of radiation in areas surrounding Chernobyl or desiccation of the Aral Sea [8,12,13]. With these changes, settlements and urban areas and populations continue to grow and their spatial distributions continue to evolve [14][15][16].
Between 2015 and 2050, the UN estimates that the global human population will grow by 2.4 billion [17]. Most of this projected change is anticipated to occur in the least developed countries and in urbanized areas [15,16]. Concurrently, Africa, Asia, Latin America and the Caribbean are estimated to experience the highest rates of urbanization [15]. As a part of this 'urban transition', the majority of Africa and Asia are experiencing large rates of internal migration, international migration and changes in the spatial distribution of natural population growth [15,16]. While Latin America and the Caribbean are predicted to experience decreasing urbanization rates, as was the trend through the 1990s and the early 2000s, the region is expected to have major demographic shifts. These rapidly changing magnitudes, composition and distribution of human populations imply a continued if not increasing need for high-resolution spatially explicit population maps that more accurately capture these changes to facilitate public health, sustainability and policy planning in general.
Over the past 20 years, the advancement of statistical techniques, availability of consistent geospatial data and rise in processing power have been leveraged to more accurately map populations over global scales. Such efforts include the simple gridding of census data matched to administrative boundaries that is undertaken for the Gridded Population of the World (GPW) project [18], and the use of satellite images of night-time lights to map urban areas and allocate populations to them, in the case of the Global Rural Urban Mapping Project (GRUMP) [19][20][21]. Other ongoing efforts, including LandScan [22][23][24], the Global Human Settlement Population Grid (GHS-POP) [25] and WorldPop [26], focus on a multivariate approach, utilizing multiple geospatial layers representing factors related to human population distributions to disaggregate areal unit-based census population counts to fine spatial resolution grid squares. These approaches can assess the contribution of different factors in explaining the observed population distributions (e.g. [26]), providing valuable data on the drivers and correlates of these patterns.
Despite the development of these multivariate approaches, there have been few globally representative comprehensive studies on the relationships between population densities, their associated covariates and the ancillary datasets that represent the covariates at a sub-national scale. Only basic within-country analyses have been undertaken in the course of validation or accuracy assessment, yet no analysis across low-and middle-income countries has occurred [26][27][28][29]. However, some local-scale case studies have investigated associations between covariates and population or residential land to better understand the correlates and drivers of population distributions in different settings [30,31]. Additionally, dasymetric modelling has evolved significantly over the past few years and provided important insights into the relationships between population and ancillary variables [32][33][34][35]. Such analyses have the potential to uncover fundamental patterns in the correlates and drivers of population distributions across the world.
Here, we undertake such an analysis for 32 low-and middle-income countries, focusing on answering the following two questions. (i) What datasets, representing drivers and associated landscapes of population distribution, are the most informative for accurately mapping populations at global scales?; (ii) What are the differences, in terms of relative importance of these datasets, between countries, between regions of countries and within regions of countries? By quantifying the relative importance of the drivers and correlates of human population distributions in relation to observed population densities, the question of how populations are distributed, and how this varies geographically, can begin to be addressed. Furthermore, it will allow informed development of new ancillary datasets with a high probability of importance when placed within a modelling framework and potentially lead to more informed covariate choices in population modelling that can expand the possible end-use applications of the population data. Moreover, by better depicting the relative importance of the drivers and associated landscapes of populations at the global and regional scales the accuracy and precision of high-resolution population mapping and construction of future scenarios will be furthered, benefitting all down-stream applications.

Material and methods
To assess the relationships between population densities and candidate correlates and drivers, we built a machine-learning-based modelling framework to expose the relationships between subnational boundary-matched population census data and a library of geospatial datasets. The population models considered in this study are based on the random forest (RF)-based method as described in Stevens et al. [26]. We took the RF regression model objects for each sample country which were trained at the administrative unit level of the corresponding census-based population data, extracted the covariate importance metrics, standardized what the covariates were representing to facilitate comparisons across models and analysed these data for differences between and within covariate classes as well as within each covariate class between all countries, between regions and within regions to begin to address the possibility of geographic variability in these relationships.

Random forest-based population models
RFs are a non-parametric, nonlinear statistical method that falls within a category of machine-learning methods known as 'ensemble methods'. Ensemble methods take individual decision trees that are considered 'weak learners' and combine them to create a 'strong learner'. The benefits of ensemble methods are that generalizability is increased, performance on large or small datasets is improved and the ability of the method to model difficult learning tasks is more effective. Compared with other ensemble methods RFs are robust to noise, small sample sizes and over-fitting, yet they need little in the way of parameter specifications [36][37][38][39].
RFs independently generate k number of unpruned decision trees using 'bagging' [37,40]. Once a decision tree is grown, the one-third of the bagged training data that the tree was not grown upon remain and are known as the 'out-of-bag' (OOB) data. The decision tree applied to these data and the accuracy of the tree, as measured by the mean squared error (m.s.e.), are stored as the OOB error for that tree [37]. The prediction error of the entire RF model can be estimated by averaging the OOB error of all trees [37]. The OOB error is also used for estimating covariate importance by randomly permutating a given covariate's OOB data with random noise and calculating the average per cent increase in the mean squared error, hereafter the Per.Inc.m.s.e., across all trees of the RF model which used the covariate [37]. For more details on the construction of RFs, see Breiman [37] and Liaw & Wiener [38].
The RF method outlined by Stevens et al. [26] uses an RF regression model and dasymetric mapping methods in a threestep process to estimate a population layer from input census and covariate data. The general steps are as follows: (i) iterative covariate selection for the RF model, (ii) the fitting of the RF model, using all available census units, and creation of a population density weighting layer from the created RF model, and (iii) the dasymetric redistribution of population counts from census-based administrative units to grid cells [29] using the population density weighting layer [26,32,33]. We give a general schematic of the RF process described by Stevens et al. [26] in figure 1. The covariate selection process is identical to step 2, but iterates until the removal of all covariates with a Per.Inc.m.s.e. less than zero. Data input to an RF model varies on a country-by-country basis with highresolution country-specific datasets being used over coarser resolution default datasets, when available. This last detail required the standardization of what each covariate more generally represented to facilitate comparison across models.

Census data
For this investigation, we sampled countries (n ¼ 32) from low-and middle-income countries in four regions of the world where available boundary-matched census data were available at an average spatial resolution (ASR) of 100 km 2 or below: Africa, Central America and the Caribbean (C. America and the Caribbean), South America (S. America) and Southeast Asia (S.E. Asia) [41]. The sampled countries, shown in figure 2, were modelled upon census data from varying years, with differing ASRs [41] of administrative units, and people per administrative unit, shown in table 1. These regions were selected because of their continued and rapidly growing importance in relation to world population [15,17].

Geospatial covariates and standardization
Human population density is highly correlated with environmental and physical factors [35], which can influence distributions of population. As indicated by the literature and availability of global data, the following factors were identified and used as predictive covariates: intensity of night-time lights [42], energy productivity of plants [43], topographic elevation and slope [44,45], climatic factors [46], type of land cover (LC) [27] and presence/absence of roads [47], water features [48], human settlements and urban areas [49], protected areas [50] and locations of points of interest (POIs) and facilities such as health centres and schools [51]. Rather than attempt to standardize the input covariates between countries, we used the most contemporary available datasets on a country-by-country basis to produce the population maps. See Stevens and co-workers [26] and [29] for a typical set of ancillary data included in a given model, with further details provided in Lloyd et al. [52].
For every model run, information about the RF model settings, covariates and their importance, metadata on the covariate datasets themselves and the general results of the RF model were output to summary files, which are included in the electronic supplementary material. From those summaries, we extracted the region modelled, the total variance explained by the model, the covariate names and the Per.Inc.m.s.e. for every covariate included in the model [37]. We then examined the covariates for all sampled countries to reclassify them into the covariate classification groups shown in  Figure 1. General process of using a random forest to created gridded population maps following Stevens et al. [26], where 'out-of-bag' (OOB) data are the approximately one-third of the data not sampled for training any single tree. rsif.royalsocietypublishing.org J. R. Soc. Interface 14: 20170401 purpose of this classification system was to facilitate comparisons between the country models via a standardized framework.
We would expect that covariates within the urban/suburban extents and built environment and urban/suburban proxy classes would be the most important for predicting population density as these typically capture settlements either implicitly or explicitly [10,[64][65][66]. Transportation networks and facilities and service classes would also be expected to be consistently important as transportation networks exist solely to facilitate the movement of people, goods and ideas [66]. Responding to the classic 'location-allocation' problem, facilities and services, e.g. schools and health centres, are often located to promote access by and service to a population. Rivers/waterbodies/waterways are unique in that they can be used by people as both a transportation network and a resource, an attraction for population, but, in some cases, could be perceived as more hazard than resource, e.g. floods, and would therefore serve as a disincentive for a population locating near them. Previous studies have shown that landcover classes can be used for predicting population density by predicting either their absence, e.g. natural or bare surface land cover, or their presence due to their direct impact on land use (LU), e.g. cultivated land cover [8,27,32].

Analysis
From the independently modelled countries, we synthesized generalized data on the relative importance of various covariates in predicting population densities. All analysis and data handling was performed in the R Statistical Environment, version 3.2.2, with a ¼ 0.05 significance levels and appropriate corrections for multiple outcomes where indicated [67].
To account for the differing number of total covariates in each country's model, we calculated a weighted importance rank (WIR). Within each country, we ranked covariates by descending Per.Inc.m.s.e. and then weighted them by the total number of covariates in the final model for a given country, calculated as WIR ¼ within-country ranked importance total number of covariates in country model : Within a given country, a WIR of zero indicates the covariate of highest importance and a WIR of 1 is the least important covariate. Hereafter, unless explicitly stated, within the text, variable class importance is referring to the WIR. To examine potential differences in variable class importance, we used both analytical and graphical methods.
Given the non-normal nature of the covariate importance data, we used the non-parametric form of the Kruskal-Wallis test to test for significant differences between covariate classes across all countries [68]. The inter-regional analyses were of a hierarchical nature using data subsets of a given covariate class and using the region category as the grouping variables, but still using the Kruskal-Wallis test [68,69]. The intra-regional analyses subset the data to a given region and a given covariate class then used a Kruskal-Wallis test to determine whether significant differences in importance for the given covariate class existed between countries of the same region [68]. If any of the Kruskal-Wallis tests were significant they were followed up with post hoc Dunn tests, using Holm's correction for multiple outcomes, to determine between which covariate classes or regions the significant differences occurred [70,71].

Results
The consistent patterns of covariate importance to predicting population density were observed between all sampled countries globally, with similar patterns observed between regions of countries. The correlates pertaining to urban areas and, more surprisingly, topographical features were the most important predictors of population density at all scales of analysis and were the only covariate categories which were consistently significantly more important than other categories, again at all scales.

Global
We present global covariate importances in figure 3. The five most important covariate classes, in descending order of median importance, were urban/suburban extents (0.32), built environment and urban/suburban proxies (0.35), climatic/environmental variables (0.37), populated place covariates (0.42) and transportation networks (0.50). This result matches expectations, as the five most important covariate classes (figure 3) are also the most often included in the final population models.
Globally, for predicting population density, we found that built environment covariates were significantly more important than classified populated place ( p , 0.01), natural/semi-natural vegetation LC ( p , 0.01), general classified LU ( p ¼ 0.04), protected LU ( p , 0.01) and rivers/ waterbodies/waterways covariates ( p , 0.01). We also found that urban/suburban extents were significantly more important

Inter-regional
Another POI was that the strong patterns of association seen at the global level were largely consistent when drivers and correlates were examined between regions. The only significant differences between regions were seen for the non-residential LU variable and the rivers/waterways/ waterbodies variable, the latter shown in table 4. Non-residential Table 2. Reclassification scheme to standardize covariates into variable classes representing spatial drivers and determinants of population. LC, thematically classified land cover; LU, classified land use; nat., natural; OSM, Open Street Map; semi.-nat., semi-natural; veg., vegetation. Note: The references are not exhaustive, but are characteristic of most models. Any of these covariates could be replaced by a country-specific dataset sourced from a one-off source or country partner. Refer to country-specific metadata files provided with the source download from www.worldpop.org.

Intra-regional
Like global patterns, there were no differences between the importance of the covariates urban/suburban extents and built environment and urban/suburban proxies within any region. Within any single region, we found no significant differences in patterns of importance between countries for all given covariate classes. However, between covariate classes across all countries within a given region, we found significant differences within the C. America and the Caribbean and S. America regions and display these in table 5. Similar to the global results, we found within S. America that built environment and urban/suburban proxies were significantly more important than classified populated place ( p , 0.01), protected LU ( p , 0.01) and rivers/waterbodies/waterways covariates ( p ¼ 0.01). Also within S. America, we found that climatic/ environmental variables were significantly more important than classified populated place ( p , 0.01), natural/semi-natural vegetation LC ( p ¼ 0.02), general classified LU ( p ¼ 0.04), protected LU ( p , 0.01) and rivers/waterbodies/waterways covariates ( p , 0.01). For C. America and the Caribbean, we found that the covariates regarding built environment and urban/suburban proxies ( p , 0.01), transportation networks ( p ¼ 0.03), urban/suburban extents ( p , 0.01) and climatic/ environmental variables ( p ¼ 0.02) were significantly more important than rivers/waterbodies/waterways covariates. Additionally, built environment and urban/suburban proxies were found to be significantly more important than classified populated place ( p ¼ 0.01), natural/semi-natural vegetation LC ( p , 0.01) and protected LU ( p , 0.05). Full results including the non-significant findings are included in the electronic supplementary material. We illustrate the consistency of the importance of distribution and their relative importance regionally for each covariate class graphically in figure 5.

Discussion
The majority of predicted population growth across the globe by 2050 is expected to occur in low-and middle-income countries [14,15,17]. With this predicted growth in population and urbanization challenges are expected to arise regarding food security, health and infrastructure, to name but a few    [72][73][74][75][76]. These continued and heightened concerns regarding the implications of the rapid pace of shifting populations in low-and middle-income countries ensure a continued demand for high-resolution gridded population maps in these regions of the world. This continued demand reinforces why understanding the drivers of the spatial distribution of   populations to improve population mapping is important. Moreover, an improved understanding of the fundamental drivers of population distributions and their spatial variations is of value for modelling future growth and designing strategies around such models.
Our results show that variables related to built/urban areas and to climatic/environmental covariates were the most important for predicting population density and were the only covariate classes that were significantly more important than other variable classes, regardless of the scale of analysis. This study begins to quantify commonly held concepts regarding the drivers and correlates of human population distributions, e.g. urban areas are associated with denser populations. Having quantified these patterns globally and regionally allows future work on the more unique aspects of location-specific distributional relationships of populations to be placed within the context of these larger-scale findings, and to help relate observed and past population distributions to historical and cultural contexts and the presence or absence of resources/hazards.
The finding that built area-related covariates were the most important in predicting population density should not be a surprise and it aligns with expectations that an estimated 54% of the world's population live in urbanized areas [15]. There are numerous examples where population density was an important predictor of urban area extent [77][78][79][80]. This study shows that this relationship goes in the other direction as well with built area extent being important in predicting population density. However, caution should be used when using the newer urban/settlement feature datasets such as global human settlement layer and global urban footprint. While they are improvements on the thematically classified 'urban', making use of spectrally and spatially refined optical and radar-based data, they are known to be most accurate in dense urbanized areas [64,65], leading to population model biases in less densely populated or rural contexts by virtue of the settlements being missed in the input covariates [26].
We were surprised how important the climatic/environmental covariate category was in predicting population density. While the category was not broken up for subsequent testing, by examining the covariate importance plots of individual countries we believe that this importance was largely driven by elevation covariates, including derived slope. Previous studies have shown that population is prevalent in the lower elevations of resource-rich coastal zones, deltas and river valleys [81][82][83] and it is simply easier to build on relatively shallowmoderate slopes than on steep slopes. There is also precedence for transportation and elevation covariates being predictive of urban or built land cover, corroborated by our finding that transportation networks and climatic/environmental covariate classes were consistently important predictors of population density [27,84,85]. Water-related covariates being consistently less important than crop or natural vegetation landcover covariates (figure 3) could be a result of the resource/hazard relationship [86] that populations have with waterbodies, which of course is highly context dependent.
Differing data quality of input covariates to the models analysed here should be kept in mind when interpreting these results as they directly affect the observed importance, or non-importance, of the covariates. For instance, the significant difference seen between C. America and the Caribbean and Africa and between C. America and the Caribbean and S.E. Asia within the rivers/waterbodies/waterways covariate class (table 4) is most likely to be due to the different thematic land cover sources used for those regions. While all landcover data used were adjusted to a standard thematic framework and resampled to 100 m [27], the majority of the Africa models used the 300 m resolution Globcover data whereas the S.E. Asia and the C. America and the Caribbean data were based upon the commercial, 30 m resolution, Geocover data [28]. While C. America and the Caribbean and S.E. Asia both used the Geocover dataset, they also sourced OpenStreetMap [55] for data pertaining to river features. OpenStreetMap varies widely as to completeness, coverage and data quality [87,88]. So, we would speculate that the observed significant differences were not likely to be indicative of actual differences in how the population relates to water features between those regions, but are the result of different data sources for the built arearelated covariates being used (table 2). Similar differing data quality or completeness issues are likely to be at the source of the significant differences between regions seen for the residential LU variable, which is entirely based on OpenStreetMap data [55].
These findings are valid only for a specific spatial resolution and modelling scale that may or may not maintain the same structures and relationships at a finer scale, as is typically the case with the modifiable areal unit problem (MAUP) [89]. All covariates are affected to some degree because they are all resampled to 100 m and are further aggregated by some summary measure at the administrative unit level prior to input in the RF from which our covariate importance metrics are derived [26]. Variations in data quality of the census-based population counts and the differing number of administrative units used in each region's countries modelled can partly explain the variance in importances within variable classes between regions. This follows the scale effect of the MAUP, which states that as the number of areal units is decreased there is a decrease in the variability of the observations corresponding to the areal units [89]. The potential of the coarseness of the polygonal census units to have an effect on this variability is less clear, but is likely to have an effect similar to the MAUP zonation effect [89]. So, while we observed very consistent patterns of importance between classes of variables and population density, this is based upon country-level averages of importance derived from a country-specific level of subnational units and then analysed at the country level across all countries and between and within regional groupings of countries. Were we to change the groupings, e.g. change the level of sub-national units from which a country-level RF is constructed, then, following the MAUP, the results would be likely to change. However, given that no significant differences in importance for any covariate class between countries within a given region were found, it would appear that the regional groupings maximized internal homogeneity, better facilitating inter-regional testing for differences.
There are inferential limits to using the RF model to identify/approximate the structure of covariate class relationships to population density. Unlike multiple linear regressions or single regression trees where coefficients and confidence intervals can be quantified, the numerous trees in an RF preclude the tracing of the regression from input to prediction [37]. The strength of an RF to capture nonlinear relationships of covariates and their complex interactions, through its numerous trees, does not make for simple interpretations of the underlying mechanisms of the modelled phenomenon, in this case the driver and correlates of population rsif.royalsocietypublishing.org J. R. Soc. Interface 14: 20170401 distribution [37]. Covariate importance within an RF is also complex because of those same nonlinear relationships and interactions and results in a covariate's importance within an RF being highly conditional on all other covariates present, with similar results not guaranteed in other models, even for the same country [38].
Another consideration when evaluating the importance of covariate classes and their relationships to population density is the varying temporality of the covariate datasets, which may not match the date of the input census data. Therefore, the modelled relationships are imperfect to begin with, as it is impossible to have complete temporal agreement between all input datasets because of well-known availability constraints. Furthermore, the quality of census data varies from country to country as well as from census to census, with completeness and spatial resolution of the administrative units being variable.
Further investigating these covariates in relation to population density could involve utilizing a different modelling framework that would allow for more inferential power as to the structure and nature of the relationships between these covariates and population density. Additionally, focusing our study on specific covariate classes, such as the urban-/suburban-related variable classes, by sourcing novel and forthcoming datasets that help illuminate the heterogeneity within these areas, both internally and across different countries and regions, could increase the predictive ability of a population model regardless of the framework. As these population datasets are scaled up to global extent, the question occurs as to whether these trends persist in high-income regions and once a consistent set of covariates is used for modelling all countries.
Better mapping of potential trends regarding drought [90], water distribution [91], crop distribution [92] and forest distribution [93] continue to improve and refine our spatial awareness of resource distribution, change and environmental patterns, globally. The relationships between population distribution and various ancillary datasets outlined in this paper provide relevant information for future work examining how populations may react to a continually changing landscape. In addition, potential exists to integrate such temporally dynamic datasets into gridded population models for better informing population distribution, not only over space but also over time [94]. However, this study is simply a cross section of covariate relationships to population density; a key question is whether these relationships remain static or are dynamic through time and the answer to that question is of great importance to population growth models, and other population-related fields, looking backwards and forwards through time.
Data accessibility. All original extracted data and code used for analysis has been provided in the electronic supplementary material. The large size of the original model objects precluded their inclusion in the electronic supplementary material. Requests for the original model objects should be sent to the corresponding author.
Authors' contributions. J.J.N. was responsible for research design, coding, data collection, management, processing statistical analyses, interpretation and drafting and production of the final manuscript. F.R.S. was responsible for research design, coding, data collection, interpretation and production of the final manuscript. A.E.G. was responsible for research design, data collection, interpretation and production of the final manuscript. C.L. was responsible for data collection and interpretation. A.S. was responsible for data collection and interpretation, and production of the final manuscript. G.H. was responsible for data collection. N.N.P. was responsible for data collection. A.J.T. was responsible for overall scientific management, interpretation and drafting and production of the final manuscript. All authors gave final approval for publication.