Is this scaling nonlinear?

One of the most celebrated findings in complex systems in the last decade is that different indexes y (e.g. patents) scale nonlinearly with the population x of the cities in which they appear, i.e. y∼xβ,β≠1. More recently, the generality of this finding has been questioned in studies that used new databases and different definitions of city boundaries. In this paper, we investigate the existence of nonlinear scaling, using a probabilistic framework in which fluctuations are accounted for explicitly. In particular, we show that this allows not only to (i) estimate β and confidence intervals, but also to (ii) quantify the evidence in favour of β≠1 and (iii) test the hypothesis that the observations are compatible with the nonlinear scaling. We employ this framework to compare five different models to 15 different datasets and we find that the answers to points (i)–(iii) crucially depend on the fluctuations contained in the data, on how they are modelled, and on the fact that the city sizes are heavy-tailed distributed.


Introduction
The study of statistical and dynamical properties of cities from a complex-systems perspective is increasingly popular [1]. A celebrated result is the scaling between a city-specific observation y (e.g. the number of patents filed in the city) and the population x of the city as [2] y = αx β , (1.1) with a non-trivial (β = 1) exponent. Super-linear scaling (β > 1) was observed when y quantifies creative or economic outputs and indicates that the concentration of people in large cities leads to an increase in the per capita (y/x) production. Sublinear scaling (β < 1) was observed when y quantifies resource use and suggests that large cities are more efficient in the per capita (y/x) consumption. Since its proposal, nonlinear scaling has been reported in an impressive variety of different aspects of cities [3][4][5][6][7][8][9]. It has also inspired the proposal of different generative processes to explain its ubiquitous occurrence [10][11][12][13][14]. Scalings similar to the one in equation (1.1) appear in physical (e.g. phase transitions) and biological (e.g. allometric scaling) systems suggesting that cities share similarities with these and other complex systems (e.g. fractals).
More recent results cast doubts on the significance of the β = 1 observations [15][16][17]. Reference [15] agrees that economic  outputs are faster than linear in x, but claims that the population x has a limited explanatory factor on the per capita rate y/x of cities and function (1.1) is not better than alternative ones (see [6,18] for opposing arguments). Reference [16] focuses on the case of CO 2 emissions and shows that depending on whether city boundaries or metropolitan areas are used, the value of β changes from β > 1 to β < 1. This point was carefully analysed in [17] for different datasets y. Through a careful study of different possible choices of city boundaries, the authors report that the evidence for β = 1 virtually vanishes. These results ask for a more careful statistical analysis that rigorously quantifies the evidence for β = 1 in different datasets.
In this paper, we propose a statistical framework based on a probabilistic formulation of the scaling law (1.1) that allows us to perform hypothesis testing and model comparison. In particular, we quantify the evidence in favour of β = 1 comparing (through the Bayesian information criterion, BIC) models with β = 1 to models with β = 1. We apply this approach to 15 datasets of cities from five regions and find that the conclusions regarding β vary dramatically not only depending on the datasets, but also on assumptions of the models that go beyond (1.1). We argue that the estimation of β is challenging and depends sensitively on the model because of the following two statistical properties of cities: (i) the distribution of city population has heavy tails (Zipf's law) [1,19] and (ii) there are large and heterogeneous fluctuations of y as a function of x (heteroscedasticity).
Points (i) and (ii) are shown, respectively, in figure 1a,b.
The paper is organized as follows. We start by describing the problem and the datasets we use (in §2) and discussing (in §3) the limitations of the usual statistical approach based on least-squares fitting in log-scale. We then propose a probabilistic formulation together with different statistical models (in §4) and describe (in §5) how they can be compared with each other and to data. Finally, we discuss our main findings (in §6) and summarize our conclusions (in §7).

Data
The general problem we are interested in is to test and estimate the parameters of equation (1.1) based on observations (x i , y i ) for i = 1, . . . , N cities, where x i is the population and y i is the amount of the quantity of interest in city i (as in figure 1b). The quantities x i , y i are estimated within a measurement precision which, in principle, could also be included in the analysis. However, in most cases, this information is not available, and only single measurements of x i , y i exist. The datasets we choose include a variety of different regions, aggregation methods to define city boundaries, and quantities y. They include the data from various countries and regions: 100 metropolitan areas of the United Kingdom (UK), aggregated as in [17]; 381 metropolitan areas of the United States of America (USA), as discussed in [12]; 459 urban areas of the USA; 472 large cities of the European Union (EU); 275 large cities from the members of the Organization for Economic Co-operation and Development (OECD); and 5565 municipalities (administrative units) from Brazil. For each database, we use indexes of economical activity (weekly In the study of scaling laws in biology, the underlying hypothesis and alternatives to the usual leastsquares fitting have been extensively discussed [22,23]. In city data, statistical analysis beyond the usual approach was performed in [3,5,7,8,10]. It typically amounts to an analysis of the residuals ln αx β i − ln y i , e.g. a (visual) comparison of the residuals of the fit to the Gaussian distribution predicted by the model underlying the linear fit in log-log scale. The controversies regarding a nonlinear scaling β = 1 motivate us to search for an alternative statistical framework to test the scaling (1.1) beyond the usual approach with residual analysis.

Probabilistic models
The statistical analysis we propose is based on the likelihood L of the data being generated by different models. Following [5], we assume that the index y (e.g. number of patents) of a city of size x is a random variable with probability density P(y | x). We interpret equation (1.1) as the scaling of the expectation of y with x where This choice corresponds to Taylor's law [24]. It is motivated by its ubiquitous appearance in complex systems [25], where typically δ ∈ [1,2], and by previous analysis of city data that reported non-trivial fluctuations [8,26,27]. The fluctuations in our models aim to effectively describe the combination of different effects, such as the variability in human activity and imprecisions on data gathering. In principle, these effects can be explicitly included in our framework by considering distinct models for each of them. Below, we specify different models P(y | x) compatible with equations (4.1) and (4.2). We consider two classes of models. In the first class, which we call city models, we a priori choose a parametric form for P(y | x), and we use equations (4.1) and (4.2) to fix the free parameters. In the second class, which we call person models, we derive P(y | x) from a generative process for the assignment of y to people that is compatible with equations (4.1) and (4.2). In both cases, the likelihood L of the model is written as a function of the data {(x i , y i )} i=1,...,N and at most four free parameters (α, β, γ and δ).

City models
In this class of models, we assume that each data point y i is an independent realization from the conditional distribution P(y | x i ) and therefore the log-likelihood can be written as In order to explore how the choice of P(y | x) affects the outcome of the statistical analysis, we consider two different continuous distributions (Gaussian and lognormal). 1

Gaussian fluctuations
Here, we consider that P(y | x) is given by a Gaussian distribution with parameters μ N (x) and σ N (x) The relations (4.1) and (4.2) are fulfilled choosing the parameters as The log-likelihood (4.3) is given by This model has P(y ≤ 0 | x) > 0 and therefore observations with y i ≤ 0 can be accounted for. For the observables considered here, y = 0 is a valid observation but y < 0 is not. We consider two cases: Fixed δ = 1. This is the typical fluctuation scaling found when y i is the result of a sum of random variables [25].
Free δ ∈ [1,2]. The general functional form that fulfils equation (4.2). We exclude δ > 2, because, in this case, the probability P(y < 0 | x) of negative values (not feasible for most observables y) remains large for large x.

Lognormal fluctuations
Here, we consider that P(y | x) is given by a lognormal distribution with parameters μ LN (x) and σ LN (x): The relations (4.1) and (4.2) are fulfilled choosing the parameters as (see appendix B) and The log-likelihood (4.3) is given by This model has P(y ≤ 0 | x) = 0 and therefore observations with y i ≤ 0 cannot be accounted for. We again consider two cases: Fixed δ = 2. This scaling is obtained when y i is the product of independent random variables. Furthermore, σ 2 LN (x) and the fluctuations of ln y are independent of x and therefore the maximumlikelihood estimation of β coincides with the estimation obtained with minimum least-squares for ln y, as discussed in §3.

Person model
The starting point for this class of models is the natural interpretation of equation (1.1) that people's efficiency (or consumption) scales with the size of the city they are living in. This motivates us to consider a generative process in which tokens (e.g. a patent, a dollar of GDP, a mile of road) are produced or consumed by (assigned to) individual persons, in the same spirit as in [12,14]. Specifically, consider j = 1, . . . , M persons living in i = 1, . . . , N cities, on which the population of the city i is given by Consider also that there is a total of k = 1, . . . , Y tokens that are randomly assigned to the M persons. A super-linear (sublinear) scaling suggests that a token is more likely to be assigned to someone living in a more (less) populous city. In this spirit, we assume that the probability that a token is assigned to person j depends only on the population x (j) of the city where person j lives as . For β = 1, p( j) = 1/M and each person is equally likely to be assigned a token (independently of the population of its city). Equation (4.10) is a microscopic model, and we are now interested in the macroscopic behaviour of the city: the probability that a city i gets y i tokens, given that its population is x i . Assuming that besides their city, individuals are indistinguishable, the probability p(i) that a token is assigned to a city i is given by a sum of p(j) over persons j on city i, which contains exactly x i terms. Because x (j) = x i when the person j lives in city i, represented by j ∈ i, we obtain The probability of observing y i tokens in each city of size x i is a multinomial distribution Thus, the likelihood can be written as a function of the observed quantities (x i , y i ) as The scaling of the average and variance of y, i.e. equations (4.1), (4.2), is recovered as where we identify that α = Y/Z(β), γ = 1 and δ = 1. For y i 1, this model coincides with the city model with normal fluctuations and the latter choice of parameters. Note that the fluctuations of this model account only for fluctuations of the assignment, and neglects potential fluctuations of measurement imprecisions.

Results
In this section, we compare the models presented above against our 15 datasets. In particular, we address the following questions whose answers are summarized in  Table 1. Summary of the application of our statistical framework to 15 different databases and five models. The entries in the table represent the scaling exponent β. The value obtained through least-squares fitting in log-scale coincides with the value reported in the first column. The error bars were computed with bootstrap. The asterisk indicates that the model has a p-value higher than 0.05. If the difference BIC between the BIC of each model with the same model with a fixed β = 1 is below 0, the model is linear (→), between zero and six is inconclusive (open circle) and higher than six (strong evidence) is super-linear ( )/sublinear ( ). The models were also compared between each other using the respective BICs within the same noise model (grey background has lower BIC) and between all others (bold text indicates the model with the lowest BIC).

Discussion
In this section, we interpret the outcome of the statistical analysis summarized in table 1. We focus on specific findings and their significance to the problem of scaling in cities.

Data are almost never compatible with the proposed models
In almost all cases, the data are not a typical outcome of any of the five proposed models, leading to a rejection of the models (p-value < 0.05). The only exceptions (marked by an asterisk in table 1) are the two lognormal models in UK-income and UK-train stations, and the Gaussian model with free δ for OECD-GDP. There are several possible reasons for the widespread rejection of the models: fluctuations of the data may differ from the fluctuations P(y | x) of the models (e.g. measurement errors are not correctly accounted for by P(y | x)); the observations are not independent (e.g. there are correlations between residuals and city size); different scalings are observed for small and large cities (as discussed in [28] and figure 3). The rejections of the models considered here are a consequence of their strong simplifying hypothesis and show that the development of better models is needed in order to understand the observations and clarify the existence of the nonlinear scaling (1.1). It shows also that the estimated confidence interval cannot be used (in the rejected models) to discard a linear scaling β = 1 [21]. Still, the widespread rejection of models does not imply that the nonlinear scaling (1.1) is rejected altogether because it is possible that the data are well described by another (unknown) model consistent with equation (4.1) but different from the ones considered here, e.g. having different fluctuations in P(y | x). These alternative models can have different fluctuation relations or can account for the known (e.g. spatial [3]) correlations in the data. In particular, the generative process underlying the person model could be generalized to account for other effects beyond city-size population (e.g. individuals could be segmented by income).
Even if most models are rejected, some models can still describe the data better than others (in terms of BIC). The conclusions drawn from such model comparison analysis depend on the used set of models and may change by the introduction of a better model in the future. Our investigations of scaling laws in cities  in the next sections are mostly based on model comparison: we analyse which model and parameters best describe the data, with particular interest in the parameter β.

Different datasets are best described by different models
There is no single model that best describes all databases (the bold value in table 1 appears in different rows). A systematic observation on the 15 datasets is that the person model and the Gaussian model with fixed δ are never the best ones. This indicates that the fluctuations in the (large) cities are much larger than predicted by the scaling δ = 1 used in both models. For the other models, there are databases in which they are the best models: the lognormal with fixed δ = 2 is the best model in the three UK cases and for USA-GDP; the lognormal model with free δ is the best model for USA-roads and EU-cinema capacity; and the Gaussian model with free δ is the best for EU-cinema usage, OECD-GDP and EU-libraries. The inclusion of the additional parameter δ in the lognormal model, related to Taylor's law in equation (4.2), is considered beneficial in eight out of the 15 cases (shaded grey regions in the two first rows of table 1). Altogether, these results show that the model underlying the usual approach (lognormal with fixed δ) is often not the best model.

The estimated β depends on the model
Models consistent with the average scaling (4.1), but that have different assumptions regarding the fluctuations, can lead to different estimations of β. Consider the case of EU-cinema attendance. The value estimated from the lognormal model with fixed δ is β = 1.46 ± 0.19. It coincides with the usual approach (least-square fitting) and suggests a super-linear relation between the number of cinema visitors and the population of cities. However, if we allow for a different fluctuation scaling as in the lognormal model with free δ, a model that is preferred according to our BIC test, we obtain that β = 1.00 ± 0.30, i.e. a linear scaling. Conflicting conclusions are observed also in the EU-theatres database. The data and fittings for these two cases are shown in figure 2. Visual inspection of the graph can be misleading because of the log-scale and the different density of points, and shows the need for more careful (quantitative) statistical analysis. Altogether, the variation of β across different models shows that conclusions regarding β (e.g. β = 1) cannot be done independently from the analysis of the fluctuations. Considering also that different models are preferred for different databases (previous point), this confirms the practical importance of going beyond the usual approach (least-square fitting) both in terms of methods and models, as proposed in this paper.

Models are dominated either by the small or the large cities
The variation on the estimation of β across the different models can be better understood by analysing how the city size distribution shown in figure 1a influences the estimation of β. The least-square fitting minimizes the distance between the curve and the points in logarithmic scales (ln y). Therefore, when data are viewed in the usual double logarithmic plot, the best curve will be the one that passes close  to most points, i.e. it weights a village as much as a million-size city. The fit will be thus dominated by the large number of small cities. The disadvantage of this is that, even if the model describes well most cities, it may fail to describe the behaviour of most of the population. Our person model addresses this issue by giving the same weight to each person, leading to the problem of describing most people but potentially not most cities. To see this, consider the example of the 5565 Brazilian cities. Half of the Brazilian population lives in the 201 largest cities (3.6% of cities); yet, 50% of smallest cities account for only 8.2% of the total population. This is a direct consequence of the heavy-tailed distribution of city sizes, which holds in all our databases (figure 1a). Our city models with free δ in equation (4.2) allow cases beyond the least-squares fitting (δ = 2) and person model (δ = 1). The exponent δ controls how the variance of P(y | x) grows with x. A small variance for large x, obtained for small δ, will force the fitted curve (average) to pass close to the points of large cities. The weight of the large cities is inversely proportional to δ. The general considerations above explain a great extent of the variation of β across the models observed in table 1. The values obtained for the Gaussian model with δ = 1 and the person model are dominated by large cities, in the lognormal δ = 2 case, they are dominated by small cities, whereas for the free δ models, it depends on which best δ is obtained. In the Brazil-AIDS dataset δ ≥ 2 and β is dominated by the small cities (δ = 2 in the Gaussian model and δ = 2.79 in the lognormal model). Accordingly, the values of β for these two models in the second to last row of table 1 are β 1 in agreement with the lognormal with δ = 2 case and in contrast with the Gaussian δ = 1 and person model which have β > 1 and are dominated by the large cities. Figure 3 shows the results for this dataset and emphasizes how different models describe different city sizes. The same reasoning also explains the values of β of other databases reported in table 1 (e.g. all UK cases).
In summary, the 'weights' each statistical model attributes to cities have an impact on the estimated value of β and, in particular, on the visual agreement between the data and the fit in the usual doublelogarithmic plots. When the scaling relation (4.1) holds for all x, the difference between the models will not be significant. However, as we showed in §6.1, data are typically not compatible with models. In the cases in which β varies substantially across models, generalization beyond the simple scaling (1.1) [6] should be considered in order to account for the x dependence of β. In this case, the heavy-tailed distribution of city sizes leads many models to be dominated either by the large amount of small cities or by the few cities containing most of the population. This reasoning explains why cut-offs in minimum city size and aggregation of cities (different city borders) [9,16,17] influence the estimated β. All these procedures have a strong influence on the small cities, which are the dominant ones in the least-square fitting (e.g. aggregation of cities into metropolitan areas reduces the number of small cities). While applying cut-offs for small cities increases the visual agreement between the data and the fit in the log-log plot, this is only justified if the scaling (1.1) is interpreted as being valid only for large cities. The latter interpretation limits the relevance of the scaling because it becomes limited to a small fraction of the total cities.

Is the scaling nonlinear?
New answers to this central question emerge from the results of our paper (summarized in table 1). In three of the 15 cases, we found models which are reasonably compatible with the data, and we can base our conclusions on these models, i.e. on the obtained β and on the model comparison with the case β = 1 (arrows →, ↑, in table 1). This leads to the conclusion that the UK-income and UK-train stations show linear and OECD-GDP shows super-linear scaling. In the remaining 12 cases, conclusions are based solely on model comparison, and we feel more confident to give an answer to this question only when the same conclusion is obtained for models with different fluctuations (i.e. we compare the conclusions obtained in the best model with lognormal and Gaussian fluctuations). We find such an agreement in eight of the 12 cases, so that the scalings are: UK-patents and OECD-patents are linear; USA-GDP, EUmuseum usage and Brazil-GDP are super-linear; USA-roads, EU-libraries and Brazil-AIDS are sublinear. For the remaining four cases, our analysis is inconclusive on the question of linear or nonlinear scaling. Two reasons can lead to this conclusion. The first is that the nonlinear scaling qualitatively changes from β < 1 to β > 1 depending on the assumptions of the fluctuations (e.g. EU N. theatres). The second reason is that in one of the best models there is no sufficient statistical evidence for β = 1 (marked by an open circle in table 1, EU-cinema capacity, EU-cinema usage and Brazil-external). One interesting case falling in this second reason is EU-cinema usage, for which both the lognormal with fixed δ and the best model (Gaussian with free δ) yield β > 1. We still consider this case inconclusive, because the best model, despite showing β = 1.13 ± 0.11, only marginally improves (0 < BIC < 6) upon the model with β = 1. In this case, additional data are required in order to increase the statistical evidence in favour of either situation. The possibility of reaching an inconclusive answer shows the advantage of the statistical framework proposed here. In summary, in 15 datasets, we found four linear, four super-linear and three sublinear scalings.

Conclusion
In summary, we investigated the existence of non-trivial β = 1 scalings in city datasets. We introduced five different models, showed how to compare them and how to estimate β, and finally tested our methods and models in 15 different datasets. We found that in most cases models are rejected by the data and conclusions can be based only on the comparison between the descriptive power of the different models considered here. Moreover, we found that models which differ only in their assumptions on the fluctuations can lead to different estimations of the scaling exponent β. In extreme cases, even the conclusion on whether a city index scales linearly β = 1 or nonlinearly β = 1 with city population depends on assumptions on the fluctuations. A further factor contributing to the large variability of β is the broad city-size distribution that makes models to be dominated either by small or by large cities. In particular, these results show that the usual approach based on least-square fitting is not sufficient to conclude on the existence of nonlinear scaling.
Recent works focused on developing generative models of urban formation that explain nonlinear scalings [10][11][12][13][14]. Our finding that most models are rejected by the data confirms the need for such improved models. The significance of our results on models with different fluctuations is that they show that the estimation of β and the development of generative models cannot be done as separate steps. Instead, it is essential to consider the predicted fluctuations not only in the validation of the model, but also in the estimation of β. Finally, the methods and models used in our paper can be applied to investigate scaling laws beyond cities [20,23].

Appendix A. Databases
We used 15 datasets from five different databases. In each database (UK, USA, EU, OECD and Brazil), the same cities x i were used, and the different datasets are different indexes y. Some of our models cannot consider y i ≤ 0. In order to allow for a comparison across all models, we ignored y i ≤ 0 in all cases, and below we report the number N of cases y i > 0 in each dataset.
-UK: this database corresponds to fig. 5b of [17], was provided by the authors of that paper, includes the aggregation of population in cities proposed in that paper, and corresponds to the period 2000-2011.

Appendix B. Taylor's law in lognormal
Here, we express the parameters of the lognormal distribution, μ LN (x) and σ 2 LN (x), as a function of the parameters of the scaling laws E(y | x) = αx β (4.1) and probability of the data given the model. It can be shown [36] that this quantity can be approximated by B 12 ≈ e BIC/2 , ( F 1 ) where BIC ≡ BIC 2 − BIC 1 is the difference of the respective BICs. Thus, if BIC 1 < BIC 2 , it follows that B 12 > 1, i.e. that model 1 provides a better description of the data than model 2. Regarding the decision about nonlinear scaling, i.e. β = 1, we require that BIC ≡ BIC β=1 − BIC β ≥ 6 (see main text), in line with [40], where it is suggested that this implies strong or very strong evidence for a model with β = 1. This corresponds to B 12 ≥ e 3 ≈ 20.08, i.e. it is at least 20 times more likely that the data come from a model with β = 1.