Scaling and universality in urban economic diversification

Understanding cities is central to addressing major global challenges, from climate change to economic resilience. Although cities are increasingly perceived as fundamental socio-economic units, the detailed fabric of urban economic activities has only recently become accessible to comprehensive analysis with the availability of large datasets. Here, we study the abundances of business categories across US metropolitan statistical areas and provide a framework for measuring the intrinsic diversity of economic activities that transcends the scales of the classification scheme. A universal structure common to all cities is revealed, manifesting self-similarity in internal economic structure as well as in aggregated metrics (GDP, patents, crime). We present a simple mathematical derivation of the universality and provide a model, together with its economic implications of open-ended diversity created by urbanization, for understanding the observed empirical distribution. Given the universal distribution, scaling analyses for individual business categories enable us to determine their relative abundances as a function of city size. These results shed light on the processes of economic differentiation with scale, suggesting a general structure for the growth of national economies as integrated urban systems.

We parameterize the resolution of the classification scheme by the number of digits $r$. As the resolution becomes finer (that is, for larger $r$), the number of distinct business types $D_{\max}(r)$ becomes larger, as shown in Fig. S1, and, correspondingly, saturation sets in at larger cities (Fig. 1B). Most of the analysis of $F_i(N)$ and $f_i$ in the main text uses the data at the highest resolution, $r = 6$. In Section 2.3, we explain that the universal distribution holds at high resolution.

Rank-size abundance distribution
In ecology, species abundance is a key element of biodiversity. It refers to how common or rare a species is relative to other species in a given community, where species potentially or actually compete for similar resources. Here, we treat establishment types as species and the number of establishments of each type as its abundance in an ecosystem. The species-area relation then corresponds to $D(N)$, where the population $N$ plays the role of area and $D$ plays the role of the number of species, as shown in Fig. 1B (main text); the universal rank-size distribution is $f_i$, as shown in Fig. 2. the bottom hundred rank business. It is interesting to note the persistent pattern that restaurants (black) and lawyers' offices (orange) are always at high ranks (within the top ten), while manufacturing, mining, and utilities (shown in shades of blue) are at low ranks. The regularity of these rank positions across cities is derived in Section S3 using the scaling and universality of the rank-size frequency distribution.
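The mapping from establishment types to a rank-size abundance distribution can be illustrated with a minimal sketch. The function name `rank_frequency` and the toy city below are hypothetical, and the normalization $f_i = F_i / N_f$ (dividing by the total number of establishments) is an assumption made for illustration.

```python
from collections import Counter

def rank_frequency(establishment_types):
    """Return (D, f): D is the number of distinct types, and f[i] is the
    share of establishments held by the type of rank i + 1 (rank 1 = most common)."""
    counts = Counter(establishment_types)
    total = sum(counts.values())                     # N_f: total establishments
    ranked = sorted(counts.values(), reverse=True)   # F_i: abundance by rank
    return len(ranked), [c / total for c in ranked]  # D and f_i = F_i / N_f (assumed)

# Hypothetical toy city, not real data.
toy_city = ["restaurant"] * 5 + ["law office"] * 3 + ["mine"] + ["utility"]
D, f = rank_frequency(toy_city)
print(D, f)
```

Ties in abundance (here the two singleton types) are ordered arbitrarily, which does not affect the rank-size curve itself.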

Effect of income level on deviations of the universal distribution
We study how the income level of a city affects the shape of its distribution. Figure 7 shows that, at a given population size, low-income cities generally have lower $F_i(N)$ than high-income cities. This produces a downward vertical shift from the expectation, which registers as a deviation from the universal curve. However, these shifts, and hence the deviations, do not mean that the shapes of the distributions depart from the universal shape. This is shown in Fig. 7(B), where $f(x)$ is rescaled with an optimized population $N$ chosen to minimize the deviations. The deviations largely disappear, and the universal shape becomes even clearer and tighter.

Mathematical derivation of the universality of the distribution
Although a universal function for the diversity of businesses across all cities seems counterintuitive, the result can be heuristically derived in the limit of large cities using a mathematical argument based on a simple sum rule for the total number of establishments as below.
Let $F_i(N)$ be the number of establishments in the $i$th rank in a city of size $N$, shown in Fig. 2A (main text). When summed over all ranks (that is, over all possible types of businesses), this must give the total number of establishments in the city, $N_f(N)$; thus,
$$\sum_{i=1}^{D(N)} F_i(N) = N_f(N).$$
This condition holds even when we treat the discrete rank, $i$, as a continuous variable, $x$, and, correspondingly, the frequencies $F_i(N)$ and $f_i(N)$ as continuous functions, $F(x, N)$ and $f(x)$, provided that $D(N)$ is large enough and the resolution of the categorization is infinitely fine (i.e., $D_{\max} \to \infty$). Careful attention is therefore required for cities with small diversity $D(N)$ and for coarse resolutions $r$ with small $D_{\max}(r)$. With this caveat, the sum rule can be well approximated by
$$\int_{1}^{D(N)} F(x, N)\,dx \approx N_f(N),$$
in which the scaled frequency $f(x) = F(x, N)/N_f(N)$ should become independent of city size for sufficiently large $N$. The surprise in the data is that the predicted collapse onto a single curve independent of city size extends down to relatively small cities (that is, up to relatively high rankings). This precocious scaling of diversity mirrors that observed in urban metrics as a function of $N$, which seems to persist down to cities with populations of only tens of thousands of inhabitants.

The universal shape
The empirical data suggest that the universal form of this scaled rank-frequency, $f(x)$, has three distinct regimes: for small $x$ ($x < x_0$, say), it is well described by a Zipf-like power law with exponent $\gamma$, as shown in the inset of Fig. 2B; for larger $x$ ($x > x_0$), the curve appears exponential (the approximately straight portion in Fig. 2B); finally, as $x$ approaches its maximally allowed value, the total number of establishment categories $D_{\max}$ (the finite resolution of the data given the classification scheme), $f(x)$ drops off dramatically. To a very good approximation, these can be combined into a single analytic form:
$$f(x) = A\,x^{-\gamma}\,e^{-x/x_0}\,\phi(x, D_{\max}),$$
where the finite-resolution cut-off function $\phi(x, D_{\max})$ satisfies conditions that include:
(ii) $\phi(x, D_{\max}) \to 1$ when $D_{\max} \to \infty$: its effect vanishes in the limit of the finest-grained resolution for any value of $x$.
(iii) $\phi(D_{\max}, D_{\max}) = 0$: the cut-off completely dominates when $x$ reaches its maximum value determined by the finite resolution, $x = D_{\max}$.
A simple phenomenological function that satisfies all of these conditions is
$$\phi(x, D_{\max}) = \exp\left[-\left(\frac{x}{D_{\max} - x}\right)^{-\xi}\right]$$
with $\xi < 0$, as manifested in the data (Fig. 2B in the main text). An excellent fit to the data is obtained with $\xi \approx -1.2$. For comparison, we also show fits to the data both with and without the finite-resolution cut-off function, $\phi(x, D_{\max})$. An almost equally good fit is obtained with $\xi = -1$, in which case $\phi(x, D_{\max}) = e^{-x/(D_{\max} - x)}$ and the universal distribution takes on the simple form
$$f(x) = A\,x^{-\gamma}\,e^{-h(x)},$$
where $h(x) = (x/x_0)\,[1 + x_0/(D_{\max} - x)]$, which simplifies further to $x/x_0$ when $x \ll D_{\max}$.
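As a numerical sanity check on the $\xi = -1$ case, the sketch below evaluates the distribution both as a product of a power law, an exponential, and the cut-off $e^{-x/(D_{\max}-x)}$, and in the compact form with $h(x)$. The product form is an assumption inferred from the three regimes described above, and all parameter values (`a`, `gamma`, `x0`, `d_max`) are hypothetical, not fitted values from the data.

```python
import math

def phi(x, d_max):
    """Finite-resolution cut-off for xi = -1: exp(-x / (D_max - x))."""
    return math.exp(-x / (d_max - x))

def f_product(x, a, gamma, x0, d_max):
    """Power law times exponential times cut-off (assumed combined form)."""
    return a * x ** (-gamma) * math.exp(-x / x0) * phi(x, d_max)

def f_compact(x, a, gamma, x0, d_max):
    """Same distribution written with h(x) = (x/x0) * [1 + x0 / (D_max - x)]."""
    h = (x / x0) * (1.0 + x0 / (d_max - x))
    return a * x ** (-gamma) * math.exp(-h)

# Hypothetical parameters chosen only for illustration.
a, gamma, x0, d_max = 1.0, 0.5, 200.0, 1000.0
for x in (1, 10, 100, 500, 900):
    print(x, f_product(x, a, gamma, x0, d_max), f_compact(x, a, gamma, x0, d_max))
```

The two columns agree to floating-point precision, and for $x \ll D_{\max}$ the cut-off is close to 1, so the curve reduces to the power law times $e^{-x/x_0}$.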

Growing model to explain the universal shape
Ultimately we are interested in the frequency distribution $P(s \mid N_f, \ldots)$ that predicts the number of businesses of type $s$ given $N_f$ establishments and other characteristics of a city. This can be written as
$$P(s \mid N_f, \ldots) = \sum_i P(s \mid i, N_f, \ldots)\,F(i \mid N_f, \ldots),$$
where $P(s \mid i, N_f, \ldots)$ is the probability that business type $s$ has rank $i$ in a city of size $N_f$, among other characteristics. Note that $\sum_s P(s \mid i, N_f, \ldots) = 1$, because it is a probability and must be normalized. $F(i \mid N_f, \ldots)$ is the frequency of establishments with rank $i$, which empirically shows a universal form simply proportional to $N_f$. Justified by these findings, we drop the $\ldots$ and write $F(i, N_f)$. This rank-size distribution can be derived from a stochastic process for $f(u, N_f)$, the number of different business types that have appeared $u$ times in $N_f$ total businesses; thus $u f(u, N_f)$ is the total number of occurrences of all business types that appeared $u$ times.
The reason we work with $f(i)$ instead of $f(u, N_f)$ is that $f(i)$ is universal, whereas $f(u, N_f)$ is a complicated function of both $u$ and $N_f$. Recall that a fast decay (e.g., exponential) in the rank-frequency distribution corresponds to slow decay in $f(u, N_f)$, and vice versa. This is well known for power-law distributions but needs to be generalized to our case, given the negative exponential form of the rank-size distribution in its tail.
As noted in the main text, the universal form of $f(i)$ has two distinct regimes: at ranks $i < i_0$ it behaves as a decaying power law with exponent $\gamma < 1$, and at ranks $i \gg i_0$ it decays exponentially. In the first regime $f(i)$ is dominated by $i^{-\gamma}$, that is, $f(u) \sim u^{-1-1/\gamma}$, using the standard conversion between the rank distribution and the cumulative distribution. The Yule-Simon model of preferential aggregation, used in both economic and ecological contexts, generates this power-law distribution with exponent $\rho \equiv 1/\gamma$, which depends on $\alpha$ as $\rho = 1/(1-\alpha)$, where $\alpha$ is the probability of introducing a new business type, $0 \le \alpha \le 1$ (notation and derivation follow Simon's original paper [3]). In this regime we can understand the distribution of business types as a process of stimulated aggregation, in which common business types are likely to attract new establishments of their type (with probability proportional to their frequency), and a new business type is introduced with probability $\alpha = 1 - \gamma$ per new establishment.
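The stimulated-aggregation process described above can be sketched as a short simulation. This is a minimal illustration of a Simon-type process with a constant introduction probability, not the fitting procedure used for the data; the function name `simon_process`, the seed, and the values of `alpha` and `n_establishments` are hypothetical.

```python
import random

def simon_process(n_establishments, alpha, seed=0):
    """Grow a city one establishment at a time and return type counts by rank.

    With probability alpha the new establishment founds a new business type;
    otherwise it copies the type of a uniformly chosen existing establishment,
    so an existing type is picked with probability proportional to its count.
    """
    rng = random.Random(seed)
    owners = [0]   # type label of each establishment created so far
    counts = [1]   # counts[s] = number of establishments of type s
    for _ in range(n_establishments - 1):
        if rng.random() < alpha:
            counts.append(1)                 # introduce a new business type
            owners.append(len(counts) - 1)
        else:
            s = rng.choice(owners)           # preferential aggregation
            counts[s] += 1
            owners.append(s)
    return sorted(counts, reverse=True)

ranked = simon_process(n_establishments=20000, alpha=0.1)
print(len(ranked), ranked[:5])
```

Under these assumptions the number of distinct types grows roughly as $\alpha N_f$, and the top ranks should follow an approximate power law in rank; a fit of the exponent is deliberately left out because it depends on the rank range chosen.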
In the exponential regime, at low frequencies of occurrence, the logarithmic growth of $D$ (Fig. 1B in the main text, and Eq. (6)) is the slowest possible among the power laws derivable from the Yule process. Let us first derive the form of the distribution $f(u, N_f)$ that corresponds to an exponentially decaying rank-frequency.
Assuming that $f(i)$ is sufficiently well described by $A e^{-i/i_0}$ when $i \gg i_0$, we obtain the cumulative distribution in frequency space, $P(X > u)$, as
$$P(X > u) \propto i(u).$$
Changing variables so that $u = A e^{-i/i_0}$, and thus $i = i_0 \ln(A/u)$, leads to
$$P(X > u) \propto i_0 \ln(A/u),$$
which decays slowly with the value of the frequency $u$. Finally, we obtain the probability density via differentiation with respect to $u$, so that
$$f(u) = -\frac{dP(X > u)}{du} \propto \frac{i_0}{u}. \qquad (8)$$
This form of the probability density is not well defined in the continuum, as integrals that include arbitrarily low frequencies result in a logarithmic divergence of the normalization. However, as we have seen more clearly in the rank-frequency picture, there cannot be frequencies lower than $u \sim 1/N_f$, owing to the discreteness of establishment numbers. In this sense the distribution (8) is well defined in our problem, with a cut-off.
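The change of variables $i = i_0 \ln(A/u)$ can be checked numerically: for an exponentially decaying rank-frequency, the number of ranks whose frequency exceeds $u$ should track $i_0 \ln(A/u)$. The sketch below uses hypothetical values for `a`, `i0`, and the number of ranks `d`, and applies the exponential form at all ranks for simplicity.

```python
import math

def ranks_above(u, a, i0, d):
    """Count ranks i = 1..d whose frequency a * exp(-i / i0) exceeds u."""
    return sum(1 for i in range(1, d + 1) if a * math.exp(-i / i0) > u)

# Hypothetical parameters chosen only for illustration.
a, i0, d = 1.0, 50.0, 1000
for u in (0.5, 0.1, 0.01):
    counted = ranks_above(u, a, i0, d)
    predicted = i0 * math.log(a / u)   # i = i0 * ln(A / u)
    print(u, counted, round(predicted, 2))
```

The counted and predicted values differ by less than one rank, reflecting only the discreteness of $i$; the logarithmic dependence on $u$ is what produces the $1/u$ density after differentiation.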
in the exponential region, under the assumption that $D(N_f)$ is a logarithmic function of $N_f$, as shown in Fig. 1 in the main text. It can then be shown [3,4] that the exponent $\rho$ of the probability density corresponding to the Yule process with decreasing rates of establishment introduction is given by Eq. (10), whose detailed derivation follows Eq. (2.34) of Simon's original paper [3]. In the case that $N_f \to \infty$ such that $\alpha \ll 1$, and that the resolution is infinite, that is, $D_0 \gg D(N_f)$, the exponent vanishes, $\rho(N_f) \to 0$. The probability density $f(u, N_f)$ then becomes consistent with Eq. (8):
$$f(u, N_f) \sim u^{-1-\rho(N_f)} \to u^{-1},$$
and the exponential form is obtained in the rank-frequency picture, as observed.
More generally, we can take this slow dynamics of $\rho$ seriously across all scales of $N_f$ and obtain a rank-size frequency that behaves like
$$f(i) \sim i^{-1/\rho(N_f)},$$
with $\rho(N_f) \simeq 1/\gamma > 1$ for small ranks, and $1/\rho \to \infty$ for large ranks.

Scaling analysis and rank shifts
The universal frequency distribution does not account for the entire developmental process of economic functionalities in cities, because the stochastic model does not say which business types occupy which ranks. Indeed, the economic composition at a given rank differs from city to city, as shown in Fig. 2A in the main text and Figs. 2-6 in the previous section. This dissimilarity can be studied in a historical or regional context, which is the conventional way of understanding urban economics. Here, we instead formalize a functional form for the proportion of each component as a function of economic size. One framework for this task is allometric scaling, commonly used in ecology and paleontology.
Scaling exponents were calculated using ordinary least squares (OLS) regression of log-transformed quantities (the number of establishments in a given sector, for example) against the log-transformed population of cities. Basic urban scaling finds that the total number of establishments $N_f$ scales linearly with population (first panel of Fig. 8). We then break $N_f$ down into business sectors classified by the first 2 digits of NAICS. There are 19 industry sectors in NAICS, whose scaling is shown in Figs. 8, 9, 10, and 11. All allometric scaling exponents, marked in the corresponding figures, are summarized in the histogram of Fig. Because NAICS is hierarchical and can be broken down further into 1164 industry types, we also apply the same method to these 1164 industry types. Some industry types do not have enough samples to estimate the exponents. We therefore include only the 954 industry types, out of 1164, that have more than 100 samples (MSAs in which the type is present). The scaling exponents of all 954 types are summarized in Fig. 12. As labeled in the figure, the conclusion is consistent with Fig. 3 in the main text, in that primary industries shift out while higher industries shift in as city size increases.
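The regression step can be sketched as follows: an OLS fit of log counts against log population gives the scaling exponent as the slope. The function name `scaling_exponent` and the synthetic data are hypothetical; the exponent 1.1 and prefactor 0.02 are planted values used only to confirm that the fit recovers them, not estimates from the NAICS data.

```python
import math

def scaling_exponent(populations, counts):
    """OLS fit of log(count) = log(y0) + beta * log(population).

    Returns (beta, y0). Cities with zero establishments of the type must be
    dropped beforehand, since log(0) is undefined.
    """
    xs = [math.log(n) for n in populations]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope = scaling exponent
    y0 = math.exp(mean_y - beta * mean_x)     # intercept back on the linear scale
    return beta, y0

# Synthetic, noise-free data with a planted superlinear exponent (illustration only).
populations = [5e4, 1e5, 5e5, 1e6, 5e6]
counts = [0.02 * n ** 1.1 for n in populations]
beta, y0 = scaling_exponent(populations, counts)
print(round(beta, 6), round(y0, 6))
```

An exponent above 1 corresponds to a business type that becomes relatively more abundant in larger cities, and an exponent below 1 to one that becomes relatively less abundant.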