Royal Society Open Science
Open Access | Research article

A worldwide model for boundaries of urban settlements

Abstract

The shape of urban settlements plays a fundamental role in their sustainable planning. Properly defining the boundaries of cities is challenging and remains an open problem in the science of cities. Here, we propose a worldwide model to define urban settlements beyond their administrative boundaries through a bottom-up approach that takes into account geographical biases intrinsically associated with most societies around the world, and reflected in their different regional growth dynamics. The generality of the model allows one to study the scaling laws of cities at all geographical levels: countries, continents and the entire world. Our definition of cities is robust and reproduces one of the most famous results in social sciences: Zipf’s law. According to our results, the largest cities in the world are not in line with what was recently reported by the United Nations. For example, we find that the largest city in the world is an agglomeration of several small settlements close to each other, connecting three large settlements: Alexandria, Cairo and Luxor. Our definition of cities opens the door to the study of the economy of cities in a systematic way, independently of arbitrary definitions that employ administrative boundaries.

1. Introduction

What are cities? In The Death and Life of Great American Cities, Jacobs argues that human relations can be seen as a proxy for places within cities [1]. A modern view of cities establishes that they can be defined by the interactions among several types of networks [2,3], from infrastructure networks to social networks. In recent years, an increasing number of studies have been proposed to define cities through consistent mathematical models [4–15] and to investigate urban indicators at inter- and intra-city scales, in order to shed some light on problems faced by decision-makers [16–31]. Despite the efforts of such studies, properly defining the boundaries of urban settlements remains an open problem in the science of cities. A minimum criterion of acceptability for any model of cities seems to be the one that retrieves a conspicuous scaling law found for USA, UK and other countries, known as Zipf’s law [6,7,32–42]. In 1949, Zipf [43] observed that the frequency of words used in the English language obeys a natural and robust power law behaviour, i.e. a few words are used many times, while many words are used just a few times. Zipf’s law can be represented generically by the following relationship between the size S of objects from a given set and its rank R:

$$R \propto S^{-\zeta}, \qquad (1.1)$$
where ζ=1 is Zipf’s exponent. The size of objects is, in the original context, the frequency of used words. If, instead, such objects are cities, then the sizes stand for the population of each city, and Zipf’s law reflects the fact that there are more small towns than metropolises in the world. We emphasize that it is not straightforward that Zipf’s law, despite its robustness, should hold independently of the city definition, since other scaling relations do not, such as the allometric exponents for CO₂ emissions and light pollution [24,31]. Many other man-made and natural phenomena also exhibit the same persistent result, e.g. earthquakes and incomes [44,45].
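As a hedged illustration of equation (1.1), the following Python sketch ranks a set of sizes and estimates ζ by ordinary least squares in log–log space. The function name `rank_size_exponent` and the synthetic sizes are assumptions made for this example only; the analysis in this paper relies on the maximum-likelihood estimator of [50], not on this regression.

```python
import math


def rank_size_exponent(sizes):
    """Estimate zeta in R ∝ S^(-zeta) by least squares on log R versus log S.

    Illustrative only: a simple regression, not the estimator used in the paper.
    """
    ordered = sorted(sizes, reverse=True)          # rank 1 is the largest size
    xs = [math.log(s) for s in ordered]            # log S
    ys = [math.log(r) for r in range(1, len(ordered) + 1)]  # log R
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx                              # slope of log R vs log S is -zeta


# Synthetic sizes that follow Zipf's law exactly: S = C / R (hypothetical data).
sizes = [1_000_000 / r for r in range(1, 201)]
zeta = rank_size_exponent(sizes)
```

For these synthetic sizes the fitted exponent is ζ = 1 up to floating-point error, since log R is an exact linear function of log S with slope −1.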

Here, we propose a worldwide model to define urban settlements beyond their usual administrative boundaries through a bottom-up approach that takes into account cultural, political and geographical biases naturally embedded in the population distribution of continental areas. After all, it is not surprising that two regions, e.g. one in western Europe and another one in eastern Asia, spatially contiguous in population or in commuting level have different cultural, political or geographical characteristics. Thus, it is also not surprising that such issues yield different stages of the same mechanics of growth. The main goal of our model is to be successful in defining cities even in large regions. Our conjecture is straightforward: there are hierarchical mechanisms, similar to those present in previous studies of cities in the UK [14] and brain networks [46], behind the growth and innovation of urban settlements. These mechanisms are ruled by a combination of general measures, such as the population and the area of each city, and intrinsic factors which are specific to each region, e.g. topographical heterogeneity, political and economic issues, and cultural customs and traditions. In other words, if political turmoil or economic recession plagues a metropolis for a long time, all of its satellites are affected too, i.e. the entire region ruled by the metropolis will be negatively impacted.

2. The models

2.1. City clustering algorithm

In 2008, Rozenfeld et al. [6] proposed a model to define cities beyond their usual administrative boundaries using a notion of spatial continuity of urban settlements, called the city clustering algorithm (CCA) [6–8,11,15,24,30,31]. The CCA is defined for discrete or continuous landscapes [7] by two parameters: a population density threshold D* and a distance threshold ℓ. These parameters describe the populated areas and the commuting distance between areas, respectively. Here, we adopt the following strategy to improve the discrete CCA performance. (i) Supposing a regular rectangular lattice Lx×Ly of sites where the population density of the kth site is Dk, we perform an initial agglomeration by D* to identify all clusters. If Dk>D*, then the kth site is populated and we aggregate it with its populated nearest neighbours. Otherwise, the kth site is unpopulated. (ii) For each populated cluster, we define its shell sites, i.e. sites in the interface between populated and unpopulated areas. (iii) Lastly, we perform a final agglomeration by ℓ, taking into account only the shell sites. If dij<ℓ, where dij is the distance between the ith and jth shell sites, and if they belong to different clusters, then the ith and jth sites belong to the same CCA cluster, even with spatial discontinuity. Otherwise, they indeed belong to different CCA clusters. This simple strategy improves the algorithm’s computational performance because the number of shell sites is proportional to L, where L=√(LxLy) is a linear measure of the lattice.
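The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `cca`, the toy lattice `grid`, the use of four nearest neighbours, the Euclidean distance between site centres (scaled by a hypothetical `cell` size) and the brute-force pairwise comparison of shell sites are all choices made for the example.

```python
from collections import deque
from math import hypot


def cca(density, d_star, ell, cell=1.0):
    """Map each populated (row, col) site to a CCA cluster label (illustrative sketch)."""
    rows, cols = len(density), len(density[0])
    populated = {(i, j) for i in range(rows) for j in range(cols)
                 if density[i][j] > d_star}
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))

    # Step (i): agglomerate populated nearest neighbours (breadth-first search).
    label, n_clusters = {}, 0
    for site in sorted(populated):
        if site in label:
            continue
        label[site] = n_clusters
        queue = deque([site])
        while queue:
            i, j = queue.popleft()
            for di, dj in steps:
                nb = (i + di, j + dj)
                if nb in populated and nb not in label:
                    label[nb] = n_clusters
                    queue.append(nb)
        n_clusters += 1

    # Step (ii): shell sites have at least one unpopulated (or off-lattice) neighbour.
    shell = [s for s in sorted(populated)
             if any((s[0] + di, s[1] + dj) not in populated for di, dj in steps)]

    # Step (iii): merge different clusters whose shell sites are closer than ell.
    parent = list(range(n_clusters))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for k, (i1, j1) in enumerate(shell):
        for i2, j2 in shell[k + 1:]:
            a, b = find(label[(i1, j1)]), find(label[(i2, j2)])
            if a != b and cell * hypot(i1 - i2, j1 - j2) < ell:
                parent[max(a, b)] = min(a, b)

    return {site: find(c) for site, c in label.items()}


# Toy lattice (hypothetical densities): three separated populated patches.
grid = [
    [5, 5, 0, 0, 5],
    [5, 0, 0, 0, 5],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [5, 0, 0, 0, 0],
]
labels_tight = cca(grid, d_star=1, ell=2.0)   # patches stay separate
labels_loose = cca(grid, d_star=1, ell=3.5)   # patches within 3.5 cells merge
```

Restricting step (iii) to shell sites is what keeps the distance comparisons manageable; a production version would likely replace the pairwise loop with a spatial index.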

2.2. City local clustering algorithm

We propose a worldwide model based on the CCA, called the city local clustering algorithm (CLCA), not only to define cities beyond their usual administrative boundaries, but also to take into account the intrinsic cultural, political and geographical biases associated with most societies and reflected in their particular growth dynamics. The traditional CCA, with fixed ℓ and D*, when applied to a large population density map, can introduce biases, defining many clusters in some regions and only a few in others. We present the CLCA with the aim of defining cities even in large regions, in order to overcome this weakness of the CCA. It is possible that other models, such as the models based on street networks proposed by Masucci et al. [13] and Arcaute et al. [14], carry the same burden as the CCA and that local adaptations are necessary for their application to large regions.

The main idea of our model is to analyse the change of the CCA clusters through the variation of D* under the perspective of different regions. First, we define a regular rectangular lattice Lx×Ly of sites, where the population density of the kth site is Dk. We sort all the sites in a list according to the population density, in descending order. Therefore, the site with the greatest population density is the first entry in this list, which we call the first reference site. The reference site can be considered as the current core of the analysed region. Second, we apply the CCA to the lattice, keeping a fixed value of ℓ, for a range of D* decreasing from a maximum value D(max) to a minimum value D(min) with a decrement δ. During the decreasing of D*, clusters are formed and they spread out to all regions of the lattice. Eventually, the cluster that contains the reference site (from now on the reference cluster), together with one or more of the other clusters, will merge from D(i) to D(i+1), where D(i+1)=D(i)−δ. In order to accept or deny the merging of these clusters, we introduce three conditions:

  • (i) If the area Ar(D(i)) of the reference cluster r, i.e. the cluster that contains the rth reference site at D(i), obeys

    $$A_r(D^{(i)}) < A^{*}, \qquad (2.1)$$
    then the reference cluster r always merges with other clusters, because it is still considered very small. In this context, the area A* can be understood as the minimal area of a metropolis.

  • (ii) If the difference between the areas of the reference cluster r at D(i+1) and D(i) obeys

    $$A_r(D^{(i+1)}) - A_r(D^{(i)}) > H^{*} A_r(D^{(i)}), \qquad (2.2)$$
    then the reference cluster r has grown without merging (figure 1a) or there is a merging of at least two large clusters (figure 1b). In the last case, we emphasize that if there are more than two clusters involved in the merging process, the reference cluster r may not be one of the largest. As the first case is not desirable, we can avoid it by reducing the value of δ and keeping the value of H* relatively high. The parameter H* can be understood as the percentage of the area of the reference cluster r at D(i). If the second case happens, we consider the entire region inside of the reference cluster r at D(i+1), but the clusters of this region (which we call the usual clusters) are defined by those at D(i). The usual clusters are the CCA clusters at the imminence of the merging process between D(i) and D(i+1). This includes the reference cluster r itself and one or more of the other clusters before the merging (figure 1b). Furthermore, all of the sites of the reference cluster r at D(i+1) are removed from the initial list of reference sites. This condition is necessary because we should not merge two large metropolises.

  • (iii) In condition (ii), when a reference cluster r is merging with another cluster that covers one or more regions already defined by previous reference clusters at different values of D*, there is a strong likelihood of the emergence of a forbidden region within that cluster. In this case, we force the region already defined by the largest value of D* to grow to the limits of the forbidden region (figure 1c). The forbidden regions are the complementary areas of the reference clusters already defined within the usual clusters. As a consequence of this procedure, some CCA clusters that were hidden after the analysis of the previous reference cluster arise in this forbidden region. We justify this condition by the idea that a metropolis rules the growth of its satellites, as it plays a fundamental role in their socioeconomic relations.
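The threshold sweep and the first two conditions can be summarized in a short Python sketch. The names `thresholds` and `merge_decision`, as well as the labels `"merge"`, `"freeze"` and `"grow"`, are hypothetical and introduced only for illustration; condition (iii), which handles forbidden regions, is not covered here, and the default outcome when neither inequality holds is our reading of the text rather than a stated rule.

```python
def thresholds(d_max, d_min, delta):
    """Density thresholds D(0)=d_max, D(i+1)=D(i)-delta, down to d_min (inclusive)."""
    values = []
    d = d_max
    while d >= d_min:
        values.append(d)
        d -= delta
    return values


def merge_decision(area_now, area_next, a_star, h_star):
    """Classify the step D(i) -> D(i+1) for a reference cluster.

    area_now  : area of the reference cluster at D(i)
    area_next : area of the reference cluster at D(i+1)
    """
    if area_now < a_star:
        # Condition (i), inequality (2.1): the cluster is still very small,
        # so it always merges with other clusters.
        return "merge"
    if area_next - area_now > h_star * area_now:
        # Condition (ii), inequality (2.2): an abrupt jump in area. The usual
        # clusters are those at D(i), and the sites of the reference cluster at
        # D(i+1) are removed from the list of reference sites.
        return "freeze"
    # Assumed default: no abrupt jump, keep lowering the threshold.
    return "grow"


# Integer thresholds in people per square kilometre, as in the parameter set
# D(max)=1000, D(min)=100 and delta=10 used later in the paper.
schedule = thresholds(1000, 100, 10)
```

With A*=50 km² and H*=0.05, for example, a reference cluster of 200 km² that reaches 260 km² at the next threshold satisfies inequality (2.2), whereas one that reaches only 205 km² does not.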

Figure 1. CLCA: representation of the conditions (ii) and (iii). (a) The growth of the reference cluster without the merging process. (b) The rising of the usual clusters. The usual clusters are the CCA clusters at the imminence of the merging process between D(i) and D(i+1). (c) For tth, sth and rth reference clusters (tth is prior to sth which is prior to rth), the merging processes are performed as described in (b), even though there are clusters already defined close to and within the current analysed region in the second and third case, respectively. In the latter, there is the emergence of a forbidden region. The forbidden regions are the complementary areas of the reference clusters already defined within the usual clusters. In order to define the clusters inside those areas, we force the region defined by the largest value of D* to grow to the limits of the forbidden region. Here, we suppose that D(j)>D(k). The filled dots stand for the reference sites.

We apply the same procedure to the second reference cluster, to the third reference cluster and so on. Finally, we also define the isolated clusters with the minimum value of D* for all the cases accepted in condition (ii). For clarity, we note that we sort the population density in descending order for one reason: to favour the merging process of the high-density clusters that arise as D* decreases. In practice, we run our revised discrete CCA just once for the entire range of input parameters and store all of the outputs in order to improve the performance of the model. The apparent simplicity of this task hides a RAM management problem of storing all of the outputs in a medium-performance computer. We overcome this barrier through the zram module [47], available in recent Linux kernels. The zram module creates blocks which compress and store information dynamically in the RAM itself, at the cost of processing time.

3. The dataset

We use the GRUMPv1 [48], available from the Socioeconomic Data and Applications Center (SEDAC) at Columbia University, to apply the CLCA to a single global dataset. The GRUMPv1 dataset is composed of georeferenced rectangular population grids for 232 countries around the world in the year 2000 (figure 2). Such a dataset is a compilation of gridded census and satellite data for the populations of urban and rural areas. These data are provided at a high resolution of 30 arc-seconds, equivalent to 30/3600° or a grid of 0.926×0.926 km at the Equator. We note that despite the heterogeneous population distributions that built the GRUMPv1, its overall resolution is tolerable to the CLCA, since we can identify well-defined clusters around all continents in the raw data.

Figure 2. The Global Rural-Urban Mapping Project (GRUMPv1) dataset. The population map of the entire world from the GRUMPv1 dataset in logarithmic scale.

We calculate the area of each site by the composition of two spherical triangles [49]. The area of a spherical triangle with edges a, b and c is given by

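$$A = 4R_e^{2}\,\tan^{-1}\left[\tan\left(\frac{s}{2}\right)\tan\left(\frac{s_a}{2}\right)\tan\left(\frac{s_b}{2}\right)\tan\left(\frac{s_c}{2}\right)\right]^{1/2}, \qquad (3.1)$$
where s=(a/Re+b/Re+c/Re)/2, sa=s−a/Re, sb=s−b/Re and sc=s−c/Re. In this formalism, Re=6378.137 km is the Earth’s radius and the edge lengths are calculated by the great circle (geodesic) distance between two points i and j on the Earth’s surface:
$$d_{ij} = R_e\cos^{-1}\left[\sin(\phi_i)\sin(\phi_j)+\cos(\phi_i)\cos(\phi_j)\cos(\lambda_j-\lambda_i)\right]. \qquad (3.2)$$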
The values of λi (λj) and ϕi (ϕj), measured in radians, are the longitude and latitude, respectively, of the point i (j). Thus, we are able to define the population density for each site of the lattice, since its population and area are known.
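Equations (3.1) and (3.2) translate directly into code. The sketch below is illustrative only: the function names and the choice of splitting each cell along its south-west to north-east diagonal are assumptions, since the text states only that each site is composed of two spherical triangles.

```python
from math import acos, atan, cos, pi, radians, sin, sqrt, tan

R_E = 6378.137  # Earth's radius in km, as given in the text


def great_circle(lat1, lon1, lat2, lon2):
    """Equation (3.2): geodesic distance in km; all angles in radians."""
    c = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon2 - lon1)
    return R_E * acos(max(-1.0, min(1.0, c)))  # clamp against rounding error


def triangle_area(a, b, c):
    """Equation (3.1): area in km^2 of a spherical triangle with edges a, b, c in km."""
    a, b, c = a / R_E, b / R_E, c / R_E        # edges as angles on the unit sphere
    s = (a + b + c) / 2
    prod = tan(s / 2) * tan((s - a) / 2) * tan((s - b) / 2) * tan((s - c) / 2)
    return 4 * R_E ** 2 * atan(sqrt(max(0.0, prod)))


def cell_area(lat_s, lon_w, step):
    """Area of a grid cell with south-west corner (lat_s, lon_w) and angular size step.

    The cell is split into two triangles along its SW-NE diagonal (an assumption).
    """
    lat_n, lon_e = lat_s + step, lon_w + step
    south = great_circle(lat_s, lon_w, lat_s, lon_e)
    north = great_circle(lat_n, lon_w, lat_n, lon_e)
    west = great_circle(lat_s, lon_w, lat_n, lon_w)
    east = great_circle(lat_s, lon_e, lat_n, lon_e)
    diag = great_circle(lat_s, lon_w, lat_n, lon_e)
    return triangle_area(south, east, diag) + triangle_area(west, north, diag)


# A 30 arc-second cell touching the Equator.
equator_cell = cell_area(0.0, 0.0, radians(30 / 3600))
```

For a 30 arc-second cell at the Equator this gives an area of about 0.86 km². Dividing the population of a site by such an area yields its population density.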

We also pre-process the GRUMPv1 dataset, dividing all countries and continents, and even the entire world, into large regions which we call clusters of regions, to apply our model in a feasible computational time using medium-performance computers. These regions are defined by the CCA with lower and upper bound parameters D*=50 people km⁻² and ℓ=10 km, respectively. We believe that such large clusters can hold the socioeconomic and cultural relations among different urban settlements of a territory. Figure 3a shows the largest clusters of regions in the USA; as we can see, all of the eastern USA is considered a single cluster.

Figure 3. The largest cluster of regions for the USA. (a) The single population density cluster from the eastern USA is defined by the CCA with lower and upper bound parameters D*=50 people km⁻² and ℓ=10 km, respectively. The population, provided by the GRUMPv1 dataset, is shown in logarithmic scale within each populated area. (b) Application of the CLCA for the cluster of regions of the eastern USA. The CLCA cities are represented in several colours, e.g. New York in mustard, Philadelphia in light brown, Washington-Baltimore in light green, Boston in green and Chicago in red. The CLCA parameters used were D(min)=100 people km⁻², D(max)=1000 people km⁻², δ=10 people km⁻², ℓ=3 km, A*=50 km² and H*=0.05.

4. Results

To show the relevance of our model, we apply the CLCA to the GRUMPv1 dataset at three different geographical levels: countries, continents and the entire world. For each case, we consider only a single set of CLCA parameters. We justify our choices with the following assumptions: (i) D(min)=100 people km⁻², a value slightly greater than the lower bound CCA parameter (D*=50 people km⁻²) used to define the clusters of regions; (ii) D(max)=1000 people km⁻², a loosened value of D(max)=∞; (iii) δ=10 people km⁻², a small enough value to avoid the reference clusters growing without merging; (iv) ℓ=3 km, the critical distance threshold, already extensively analysed by previous CCA studies [6,7,24]; (v) A*=50 km², the minimum area of a metropolis, as it is required that A* be reasonably greater than the minimum unit of area from the dataset and smaller than a metropolis’ area; and (vi) H*=0.05, a large enough value to favour the merging of clusters which are similar in size. Figure 3b shows the CLCA cities defined by the single set of CLCA parameters. For other regions, see the electronic supplementary material.

We study the population distribution using the maximum-likelihood estimator (MLE) proposed by Clauset et al. [50]. Their approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov–Smirnov statistic. Figure 4 shows the log–log behaviour of the cumulative distribution function (CDF) for the population of the CLCA cities, considering only the countries with the highest number of CLCA cities for each continent (for other countries, see the electronic supplementary material). Here, Pr(P′≥P) represents the probability that a random population P′ takes on a value greater than or equal to the population P. In all CDF plots, we also show the maximum-likelihood power-law fit, as well as the value of the exponent ζ=α−1, where α is the MLE exponent, and the value of Pmin, the lower bound of the MLE.
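For a fixed lower bound, the continuous maximum-likelihood estimate described by Clauset et al. [50] has a closed form, α = 1 + n / Σ ln(Pi/Pmin), with standard error (α−1)/√n, where n is the number of cities with population of at least Pmin. The Python sketch below illustrates only this step; the function name `zipf_mle` and the synthetic sample are assumptions, and the selection of Pmin through the Kolmogorov–Smirnov statistic is not reproduced here.

```python
from math import log, sqrt


def zipf_mle(populations, p_min):
    """Return (zeta, standard error, n) for the tail P >= p_min.

    Uses the continuous power-law MLE with a given lower bound; choosing
    p_min by Kolmogorov-Smirnov minimization is outside this sketch.
    """
    tail = [p for p in populations if p >= p_min]
    n = len(tail)
    log_sum = sum(log(p / p_min) for p in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need at least one population strictly above p_min")
    alpha = 1.0 + n / log_sum
    zeta = alpha - 1.0              # exponent of the CDF, as in the text
    return zeta, zeta / sqrt(n), n


# Synthetic populations placed at evenly spaced quantiles of a power law
# with zeta = 1 (hypothetical data, for illustration only).
true_zeta, p_min, size = 1.0, 10_000.0, 1000
sample = [p_min * (1 - (i + 0.5) / size) ** (-1 / true_zeta) for i in range(size)]
zeta, err, n_tail = zipf_mle(sample, p_min)
```

The standard-error expression appears consistent with the uncertainties reported in the tables below: for instance, ζ=0.780 estimated from 16 fitted cities gives 0.780/√16=0.195.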

Figure 4. CDF Pr(P′≥P) versus population P, in log–log scale, for the countries with the highest number of cities in each continent (for other countries, see the electronic supplementary material). (a–f) Cities proposed by the CLCA are represented by light blue circles. The solid black line is the maximum-likelihood power-law fit defined by the MLE [50]. The value of the lower bound Pmin and the exponent ζ are also shown. The CLCA parameters used were D(min)=100 people km⁻², D(max)=1000 people km⁻², δ=10 people km⁻², ℓ=3 km, A*=50 km² and H*=0.05.

In figure 5, we show a normalized histogram, with frequency F, of the ζ exponents for all countries (145 out of 232) with at least 10 CLCA cities in the region covered by the maximum-likelihood power-law fit. The mean value of the ζ exponents is ζ̄=0.98, with variance σ²=0.09. The dashed red line stands for the normal distribution N(ζ̄,σ²). In spite of the ζ exponent heterogeneity illustrated by figure 5, Zipf’s law holds for most countries around the globe. We emphasize that these results corroborate previous studies performed for one country or a small number of countries [6,7,32–42]. In particular, figure 5 also agrees with an astute meta-analysis performed by Cottineau [51], who compared the Zipf’s law exponents found in 86 studies. Our results strongly corroborate those presented in that study, except that our exponents range between 0 and 2.

Figure 5. Normalized histogram, with frequency F, of the ζ exponent at the country level. The plot shows those countries (145 out of 232) with at least 10 cities defined by the CLCA in the region covered by the maximum-likelihood power-law fit. We find the mean value of the Zipf exponents ζ̄=0.98 and its variance σ²=0.09. The dashed red line stands for the normal distribution N(ζ̄,σ²). Therefore, Zipf’s law holds for most countries.

Furthermore, we challenge the robustness of our model at higher geographical levels: continents and the entire world. We perform the same analyses and find that our results persist on both scales, i.e. the CLCA cities follow Zipf’s law for continents and the entire world, as illustrated in figures 6 and 7.

Figure 6. CDF Pr(P′≥P) versus population P, in log–log scale, for the continents. (a–f) Cities proposed by the CLCA are represented by light blue circles. The solid black line is the maximum-likelihood power-law fit defined by the MLE [50]. The value of the lower bound Pmin and the exponent ζ are also shown. The CLCA parameters used were D(min)=100 people km⁻², D(max)=1000 people km⁻², δ=10 people km⁻², ℓ=3 km, A*=50 km² and H*=0.05.

Figure 7. CDF Pr(P′≥P) versus population P, in log–log scale, for the entire world. (a–f) Cities proposed by the CLCA are represented by light blue circles. The solid black line is the maximum-likelihood power-law fit defined by the MLE [50]. The value of the lower bound Pmin and the exponent ζ are also shown. The CLCA parameters used were D(min)=100 people km⁻², D(max)=1000 people km⁻², δ=10 people km⁻², ℓ=3 km, A*=50 km² and H*=0.05.

We summarize our results in a set of seven tables: tables 1–6, for countries from Africa, Asia, Europe, North America, Oceania and South America, respectively. Table 7 contains similar information for all continents and the entire world. In all cases, we show the name of the considered region (country, continent or globe), the ISO 3166-1 alpha-3 code associated (only for countries), the number of cities obtained by the CLCA and those covered by the MLE, the lower bound Pmin and the Zipf exponent ζ.

Table 1. African countries. We show the name, the ISO 3166-1 alpha-3 code, the number of cities obtained by the CLCA and the number of those covered by the maximum-likelihood power-law fit defined by the MLE [50] (represented by †), the lower bound Pmin, and the Zipf exponent ζ.

country | ISO | CLCA cities | CLCA cities† | Pmin | ζ
Angola | AGO | 20 | 16 | 43 937 | 0.780 ± 0.195
Benin | BEN | 40 | 30 | 12 607 | 0.780 ± 0.142
Burkina Faso | BFA | 139 | 78 | 12 314 | 1.256 ± 0.142
Botswana | BWA | 79 | 58 | 1674 | 0.785 ± 0.103
Central African Republic | CAF | 37 | 11 | 14 868 | 1.230 ± 0.371
Ivory Coast | CIV | 83 | 47 | 18 400 | 0.962 ± 0.140
Cameroon | CMR | 143 | 93 | 7478 | 0.711 ± 0.074
Democratic Republic of the Congo | COD | 191 | 47 | 25 996 | 0.764 ± 0.111
Congo | COG | 21 | 18 | 17 673 | 1.050 ± 0.248
Comoros | COM | 16 | 15 | 4167 | 0.922 ± 0.238
Cape Verde | CPV | 16 | 11 | 5205 | 1.083 ± 0.327
Algeria | DZA | 273 | 112 | 24 192 | 0.910 ± 0.086
Egypt | EGY | 19 | 12 | 11 967 | 0.511 ± 0.147
Eritrea | ERI | 27 | 12 | 6559 | 0.730 ± 0.211
Ethiopia | ETH | 244 | 147 | 6638 | 0.688 ± 0.057
Gabon | GAB | 33 | 27 | 3108 | 0.844 ± 0.162
Ghana | GHA | 95 | 25 | 54 662 | 1.145 ± 0.229
Guinea | GIN | 34 | 13 | 40 118 | 1.234 ± 0.342
Gambia | GMB | 35 | 33 | 1186 | 0.610 ± 0.106
Guinea-Bissau | GNB | 26 | 14 | 9148 | 1.139 ± 0.305
Kenya | KEN | 179 | 20 | 72 756 | 1.383 ± 0.309
Liberia | LBR | 42 | 19 | 6468 | 0.604 ± 0.139
Libyan Arab Jamahiriya | LBY | 30 | 18 | 40 273 | 1.180 ± 0.278
Lesotho | LSO | 14 | 11 | 1999 | 0.651 ± 0.196
Morocco (includes Western Sahara) | MAR | 58 | 50 | 26 325 | 0.763 ± 0.108
Madagascar | MDG | 138 | 74 | 14 867 | 1.340 ± 0.156
Mali | MLI | 152 | 146 | 4463 | 1.161 ± 0.096
Mozambique | MOZ | 127 | 14 | 128 214 | 1.861 ± 0.497
Malawi | MWI | 179 | 72 | 4194 | 0.779 ± 0.092
Namibia | NAM | 31 | 17 | 12 467 | 1.637 ± 0.397
Niger | NER | 58 | 36 | 10 717 | 0.753 ± 0.126
Nigeria | NGA | 144 | 80 | 89 587 | 0.893 ± 0.100
Sudan | SDN | 77 | 56 | 39 764 | 1.031 ± 0.138
Senegal | SEN | 42 | 34 | 13 475 | 0.798 ± 0.137
Sierra Leone | SLE | 62 | 52 | 1899 | 0.612 ± 0.085
Chad | TCD | 75 | 14 | 19 574 | 1.086 ± 0.290
Togo | TGO | 54 | 11 | 82 964 | 1.667 ± 0.503
Tunisia | TUN | 46 | 36 | 16 130 | 1.014 ± 0.169
United Republic of Tanzania | TZA | 114 | 33 | 73 621 | 0.936 ± 0.163
Uganda | UGA | 155 | 33 | 30 587 | 1.386 ± 0.241
South Africa | ZAF | 1915 | 97 | 53 320 | 1.270 ± 0.129
Zambia | ZMB | 55 | 34 | 7118 | 0.666 ± 0.114
Zimbabwe | ZWE | 28 | 24 | 13 411 | 0.746 ± 0.152

Table 2. Asian countries. We show the name, the ISO 3166-1 alpha-3 code, the number of cities obtained by the CLCA and the number of those covered by the maximum-likelihood power-law fit defined by the MLE [50] (represented by †), the lower bound Pmin, and the Zipf exponent ζ.

country | ISO | CLCA cities | CLCA cities† | Pmin | ζ
Afghanistan | AFG | 95 | 38 | 29 242 | 0.809 ± 0.131
Armenia | ARM | 41 | 19 | 17 088 | 1.256 ± 0.288
Azerbaijan | AZE | 34 | 21 | 17 169 | 0.776 ± 0.169
Bangladesh | BGD | 103 | 58 | 26 586 | 0.581 ± 0.076
Bhutan | BTN | 19 | 15 | 893 | 0.469 ± 0.121
China | CHN | 4782 | 2706 | 29 467 | 0.941 ± 0.018
Cyprus | CYP | 17 | 15 | 626 | 0.486 ± 0.126
Georgia | GEO | 52 | 38 | 6526 | 0.765 ± 0.124
Indonesia | IDN | 2416 | 542 | 12 876 | 0.894 ± 0.038
India | IND | 1040 | 299 | 94 976 | 0.786 ± 0.045
Iran | IRN | 169 | 56 | 100 763 | 1.194 ± 0.160
Israel | ISR | 24 | 20 | 877 | 0.448 ± 0.100
Jordan | JOR | 13 | 11 | 15 253 | 0.803 ± 0.242
Japan | JPN | 270 | 33 | 289 039 | 1.011 ± 0.176
Kazakhstan | KAZ | 77 | 22 | 103 289 | 1.505 ± 0.321
Kyrgyz Republic | KGZ | 134 | 37 | 9117 | 0.991 ± 0.163
Cambodia | KHM | 84 | 24 | 34 495 | 1.735 ± 0.354
Korea | KOR | 131 | 23 | 126 819 | 0.750 ± 0.156
Lao People’s Democratic Republic | LAO | 35 | 20 | 12 595 | 0.958 ± 0.214
Sri Lanka | LKA | 23 | 20 | 8573 | 0.466 ± 0.104
Maldives | MDV | 149 | 40 | 1498 | 1.799 ± 0.285
Myanmar | MMR | 115 | 37 | 69 935 | 1.190 ± 0.196
Mongolia | MNG | 24 | 19 | 13 179 | 1.419 ± 0.325
Malaysia | MYS | 119 | 15 | 157 843 | 1.286 ± 0.332
Nepal | NPL | 39 | 22 | 15 396 | 0.560 ± 0.119
Oman | OMN | 28 | 12 | 34 956 | 1.519 ± 0.438
Pakistan | PAK | 96 | 45 | 90 356 | 0.790 ± 0.118
Philippines | PHL | 352 | 38 | 106 854 | 1.195 ± 0.194
Democratic People’s Republic of Korea | PRK | 53 | 20 | 174 121 | 1.502 ± 0.336
Saudi Arabia | SAU | 57 | 15 | 156 672 | 0.861 ± 0.222
Syrian Arab Republic | SYR | 39 | 20 | 29 908 | 0.647 ± 0.145
Thailand | THA | 100 | 24 | 23 482 | 0.718 ± 0.147
Tajikistan | TJK | 39 | 13 | 17 660 | 0.740 ± 0.205
Turkmenistan | TKM | 30 | 14 | 26 319 | 0.883 ± 0.236
East Timor | TLS | 23 | 15 | 1220 | 0.547 ± 0.141
Turkey | TUR | 338 | 244 | 18 389 | 0.926 ± 0.059
Taiwan | TWN | 16 | 13 | 2186 | 0.344 ± 0.095
Uzbekistan | UZB | 56 | 36 | 15 865 | 0.574 ± 0.096
Vietnam | VNM | 345 | 72 | 35 980 | 0.876 ± 0.103
Yemen | YEM | 46 | 22 | 38 276 | 1.059 ± 0.226

Table 3. European countries. We show the name, the ISO 3166-1 alpha-3 code, the number of cities obtained by the CLCA and the number of those covered by the maximum-likelihood power-law fit defined by the MLE [50] (represented by †), the lower bound Pmin and the Zipf exponent ζ.

country | ISO | CLCA cities | CLCA cities† | Pmin | ζ
Albania | ALB | 46 | 32 | 6030 | 0.783 ± 0.139
Austria | AUT | 116 | 74 | 4383 | 0.754 ± 0.088
Belgium | BEL | 43 | 31 | 9800 | 0.706 ± 0.127
Bulgaria | BGR | 56 | 29 | 33 338 | 1.308 ± 0.243
Bosnia-Herzegovina | BIH | 57 | 17 | 15 708 | 1.186 ± 0.288
Belarus | BLR | 36 | 17 | 73 682 | 1.123 ± 0.272
Switzerland | CHE | 71 | 15 | 55 878 | 1.167 ± 0.301
Czech Republic | CZE | 206 | 33 | 41 254 | 1.393 ± 0.243
Germany | DEU | 331 | 242 | 13 926 | 0.811 ± 0.052
Denmark | DNK | 134 | 85 | 2248 | 0.682 ± 0.074
Spain | ESP | 358 | 36 | 133 759 | 1.192 ± 0.199
Estonia | EST | 51 | 13 | 14 041 | 1.178 ± 0.327
Finland | FIN | 72 | 22 | 27 831 | 1.444 ± 0.308
France | FRA | 1253 | 114 | 42 160 | 1.087 ± 0.102
United Kingdom | GBR | 214 | 22 | 229 133 | 0.983 ± 0.210
Greece | GRC | 320 | 93 | 7639 | 0.930 ± 0.096
Croatia | HRV | 88 | 40 | 9672 | 1.085 ± 0.172
Hungary | HUN | 143 | 25 | 34 474 | 1.189 ± 0.238
Ireland | IRL | 189 | 62 | 4775 | 1.093 ± 0.139
Iceland | ISL | 15 | 12 | 708 | 0.560 ± 0.162
Italy | ITA | 400 | 157 | 19 724 | 0.885 ± 0.071
Lithuania | LTU | 76 | 32 | 10 654 | 1.007 ± 0.178
Latvia | LVA | 75 | 28 | 9276 | 1.107 ± 0.209
Republic of Moldova | MDA | 31 | 23 | 6609 | 0.570 ± 0.119
Macedonia | MKD | 45 | 23 | 11 001 | 0.981 ± 0.205
The Netherlands | NLD | 69 | 16 | 112 058 | 1.288 ± 0.322
Norway | NOR | 105 | 18 | 21 795 | 1.214 ± 0.286
Poland | POL | 236 | 160 | 17 390 | 0.903 ± 0.071
Portugal | PRT | 139 | 32 | 17 110 | 1.027 ± 0.182
Romania | ROU | 522 | 385 | 3129 | 0.740 ± 0.038
Russia | RUS | 622 | 384 | 31 964 | 0.893 ± 0.046
Serbia and Montenegro | SCG | 60 | 27 | 38 415 | 1.340 ± 0.258
Slovakia | SVK | 88 | 20 | 35 068 | 1.468 ± 0.328
Slovenia | SVN | 88 | 32 | 3273 | 0.730 ± 0.129
Sweden | SWE | 168 | 61 | 11 449 | 1.008 ± 0.129
Ukraine | UKR | 164 | 107 | 36 515 | 0.833 ± 0.081

Table 4. North American countries. We show the name, the ISO 3166-1 alpha-3 code, the number of cities obtained by the CLCA and the number of those covered by the maximum-likelihood power-law fit defined by the MLE [50] (represented by †), the lower bound Pmin and the Zipf exponent ζ.

country | ISO | CLCA cities | CLCA cities† | Pmin | ζ
Canada | CAN | 1135 | 308 | 4879 | 0.815 ± 0.046
Costa Rica | CRI | 14 | 11 | 20 751 | 1.195 ± 0.360
Cuba | CUB | 113 | 46 | 34 673 | 1.327 ± 0.196
Guatemala | GTM | 25 | 14 | 28 353 | 0.948 ± 0.253
Honduras | HND | 236 | 35 | 17 120 | 1.290 ± 0.218
Haiti | HTI | 23 | 18 | 21 953 | 0.897 ± 0.211
Mexico | MEX | 474 | 284 | 11 992 | 0.726 ± 0.043
Nicaragua | NIC | 31 | 28 | 9802 | 0.821 ± 0.155
Panama | PAN | 40 | 12 | 17 717 | 1.089 ± 0.314
El Salvador | SLV | 25 | 13 | 21 323 | 0.816 ± 0.226
United States | USA | 22 893 | 1624 | 9874 | 0.876 ± 0.022

Table 5. Oceanian countries. We show the name, the ISO 3166-1 alpha-3 code, the number of cities obtained by the CLCA and the number of those covered by the maximum-likelihood power-law fit defined by the MLE [50] (represented by †), the lower bound Pmin and the Zipf exponent ζ.

country | ISO | CLCA cities | CLCA cities† | Pmin | ζ
Australia | AUS | 177 | 145 | 5332 | 0.788 ± 0.065
Fiji | FJI | 15 | 14 | 936 | 0.807 ± 0.216
Marshall Islands | MHL | 28 | 27 | 44 | 0.760 ± 0.146
New Zealand | NZL | 108 | 79 | 3077 | 0.776 ± 0.087
Papua New Guinea | PNG | 30 | 13 | 13 828 | 1.479 ± 0.410

Table 6. South American countries. We show the name, the ISO 3166-1 alpha-3 code, the number of cities obtained by the CLCA and the number of those covered by the maximum-likelihood power-law fit defined by the MLE [50] (represented by †), the lower bound Pmin and the Zipf exponent ζ.

country | ISO | CLCA cities | CLCA cities† | Pmin | ζ
Argentina | ARG | 749 | 227 | 10 880 | 0.994 ± 0.066
Bolivia | BOL | 83 | 57 | 6729 | 0.841 ± 0.111
Brazil | BRA | 966 | 613 | 18 555 | 1.057 ± 0.043
Chile | CHL | 59 | 19 | 93 915 | 1.422 ± 0.326
Colombia | COL | 402 | 163 | 12 890 | 0.886 ± 0.069
Ecuador | ECU | 94 | 54 | 12 717 | 0.832 ± 0.113
Peru | PER | 417 | 153 | 8279 | 0.867 ± 0.070
Paraguay | PRY | 29 | 26 | 4928 | 0.700 ± 0.137
Uruguay | URY | 79 | 16 | 23 346 | 1.310 ± 0.327
Venezuela | VEN | 81 | 28 | 82 323 | 1.254 ± 0.237

Table 7. Continents and the entire world. We show the name, the number of cities obtained by the CLCA and the number of those covered by the maximum-likelihood power-law fit defined by the MLE [50] (represented by †), the lower bound Pmin and the Zipf exponent ζ.

continent/globe | CLCA cities | CLCA cities† | Pmin | ζ
Africa | 4860 | 660 | 61 569 | 0.940 ± 0.037
Asia | 10 953 | 1167 | 169 588 | 0.947 ± 0.028
Europe | 6118 | 1489 | 33 951 | 0.895 ± 0.023
Oceania | 180 | 103 | 2668 | 0.745 ± 0.073
North America | 24 919 | 1364 | 20 373 | 0.883 ± 0.024
South America | 2934 | 522 | 39 514 | 0.929 ± 0.041
world (except Antarctica) | 50 314 | 8019 | 35 725 | 0.871 ± 0.010

It is remarkable that the top CLCA city, with a population of 63 585 039 people, is composed of three large urban settlements (Alexandria, Cairo and Luxor) connected by several small ones. Figure 8a–c shows the largest cluster of regions in Egypt for the GRUMPv1 dataset, the CLCA cities and night-time lights from the National Aeronautics and Space Administration (NASA) [52], respectively. We believe the main reason for this finding has been present in the northeast of Africa since before the beginning of ancient civilization: the Nile river. It is well known that almost the entire Egyptian population lives in a strip along the Nile river, in the Nile delta and along the Suez Canal, on 4% of the total country area (10⁶ km²), where there are arable lands to produce food [53]. The river and delta regions are composed of some large cities and many small villages, making them extremely dense. Therefore, our results raise the hypothesis that the cities and villages along the Nile can be seen as a kind of ‘megacity’, despite being spatially non-contiguous, owing to the socioeconomic relations, reflected in the high commuting levels, among close subregions.

Figure 8. Northeastern region of Egypt. (a) The cluster of regions defined by the pre-processing of the GRUMPv1 dataset for the northeastern region of Egypt. (b) The largest city defined by the CLCA in the entire world is formed by several cities, including Alexandria, Cairo and Luxor. (c) Night-time lights of the northeast of Egypt provided by National Aeronautics and Space Administration (NASA). The CLCA cities found exhibit a remarkable similarity with the lights across the Nile.

Table 8 shows the top 25 CLCA cities in the entire world by population, and their associated areas. After the top CLCA city, Alexandria-Cairo-Luxor, we emphasize that the 13 next-largest CLCA cities are in Asia. Indeed, we can see that the shape of the tail end of the entire world population distribution (in figure 7) is roughly ruled by the largest CLCA city in Africa and several CLCA cities in Asia.

Table 8. Top 25 cities, by population, in the world. We emphasize that, after the top CLCA city (Alexandria-Cairo-Luxor), the 13 next-largest CLCA cities are in Asia. The largest United Nations city, Tokyo, is just the 4th according to our analyses.

CLCA city | country | CLCA population (people) | CLCA area (km²)
Alexandria-Cairo-Luxor | Egypt | 63 585 039 | 34 434
Dhaka | Bangladesh | 48 419 117 | 26 963
Guangzhou-Macau-Hong Kong | China | 44 384 647 | 12 896
Tokyo | Japan | 34 318 072 | 9189
Kolkata | India | 28 876 910 | 10 408
Patna | India | 28 484 380 | 18 670
Xi’an | China | 25 370 875 | 39 736
Jakarta-Bekasi-Banten | Indonesia | 23 814 197 | 5862
Hanoi-Hai Phong | Vietnam | 22 480 083 | 19 128
New Delhi | India | 22 136 675 | 6914
Seoul | South Korea | 20 318 881 | 3610
Mumbai | India | 18 431 960 | 2443
Manila | Philippines | 17 591 794 | 4039
Mexico City | Mexico | 17 190 725 | 2845
São Paulo | Brazil | 16 984 627 | 2840
Kyoto-Osaka-Kobe | Japan | 16 398 829 | 4608
New York City | USA | 16 364 109 | 4471
Shanghai | China | 15 291 143 | 2529
Kochi-Kottayam-Kollam | India | 14 551 809 | 8091
Surabaya-Gresik-Malang | Indonesia | 14 289 547 | 6891
Los Angeles | USA | 13 615 610 | 5167
Cirebon-Tegal-Kebumen | Indonesia | 12 758 617 | 6818
Semarang-Klaten-Surakarta | Indonesia | 12 456 408 | 6418
Moscow | Russia | 11 894 034 | 1448
Buenos Aires | Argentina | 11 132 081 | 2653

These facts are not in line with what was recently reported by the United Nations (UN) [54], e.g. the largest CLCA city, Alexandria-Cairo-Luxor, is just the 9th largest city according to the UN, and the largest UN city, Tokyo, is just the 4th largest according to our analyses.
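As a rough illustration of how the Table 8 populations relate to a rank-size (Zipf-type) description, the sketch below fits a straight line to log(size) against log(rank) by ordinary least squares. This is not the estimation procedure used in the paper: it uses only the 25 largest CLCA cities rather than the full distribution, and the function name `rank_size_slope` is our own. The resulting slope should therefore be read as a descriptive summary of the table, not as an estimate of the Zipf exponent.

```python
import math

# Populations (people) of the top 25 CLCA cities, copied from Table 8 in rank order.
POPULATIONS = [
    63_585_039, 48_419_117, 44_384_647, 34_318_072, 28_876_910,
    28_484_380, 25_370_875, 23_814_197, 22_480_083, 22_136_675,
    20_318_881, 18_431_960, 17_591_794, 17_190_725, 16_984_627,
    16_398_829, 16_364_109, 15_291_143, 14_551_809, 14_289_547,
    13_615_610, 12_758_617, 12_456_408, 11_894_034, 11_132_081,
]


def rank_size_slope(sizes):
    """Ordinary least-squares slope of log(size) against log(rank).

    Hypothetical helper for illustration only; rank 1 is the largest size.
    """
    ordered = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(size) for size in ordered]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


slope = rank_size_slope(POPULATIONS)
print(f"log-log rank-size slope over the top 25 CLCA cities: {slope:.3f}")
```

A set of sizes that follows size ∝ 1/rank exactly gives a slope of -1 with this helper, which provides a simple sanity check before applying it to real data.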

5. Conclusion

We propose a model that defines urban settlements through a bottom-up approach, beyond their usual administrative boundaries, and that accounts for the intrinsic cultural, political and geographical biases associated with most societies and reflected in their particular growth dynamics. We claim that this property qualifies our model to be applied worldwide, without any regional restrictions. We also propose an alternative strategy to improve the computational performance of the discrete CCA. We emphasize that the CCA can still be used to define cities; however, it requires a different tuning of its parameters for each large region without direct socioeconomic and political relations. Furthermore, we show that the definition of cities proposed by our approach is robust and consistent with one of the most famous results in social science, Zipf’s law, not only for previously studied countries, e.g. the USA, the UK or China, but for all countries (145 of the 232 provided by GRUMPv1) around the world. We also find that Zipf’s law emerges at different geographical levels, such as continents and the entire world. Another highlight of our study is that our model is applied to a single dataset to define all cities. Furthermore, we find that the most populated cities are not the major players in the global economy, such as New York City, London or Tokyo. The largest CLCA city, with a population of 63 585 039 people, is an agglomeration of several small cities close to each other that connects three large cities: Alexandria, Cairo and Luxor. Finally, after the top CLCA city of Alexandria-Cairo-Luxor, we find that the next-largest 13 CLCA cities are in Asia. These facts are not in full agreement with a recent UN report [54]. The largest CLCA city, Alexandria-Cairo-Luxor, is just the 9th largest city according to the UN, while the largest UN city, Tokyo, is just the 4th largest according to our analyses.

Data accessibility

The data supporting this article are available at http://sedac.ciesin.columbia.edu/data/collection/grump-v1. More specifically, the reader can click on ‘Data sets’ and, after that, on ‘Population Count Grid, v1 (1990,1995,2000)’. The code for the proposed model is available at https://doi.org/10.5061/dryad.968nq8n [55].

Authors' contributions

E.A.O. performed the data analysis, the algorithm of the proposed model, and the statistical analysis. He also participated in the design of the study and drafted the manuscript. V.F. carried out the funding acquisition and helped draft the manuscript. J.S.A. participated in the design of the study, carried out the funding acquisition, and helped draft the manuscript. H.A.M. conceived, designed, and coordinated the research, as well as carried out the funding acquisition and helped draft the manuscript. All authors approved the manuscript.

Competing interests

We declare we have no competing interests.

Funding

We gratefully acknowledge funding by CNPq, CAPES, FUNCAP, NSF, ARL Cooperative Agreement no. W911NF-09-2-0053 (the ARL Network Science CTA), NIH-NIBIB 1R01EB022720, NIH-NCI U54CA137788/U54CA132378 and NSF-IIS 1515022.

Acknowledgements

We thank the Global Rural-Urban Mapping Project (GRUMPv1) team for the dataset provided. Furthermore, we would like to thank X. Gabaix for helpful comments and discussions.

Footnotes

Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.4089005.

Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

References
