Comparing and modelling land use organization in cities

The advent of geolocated information and communication technologies opens the possibility of exploring how people use space in cities, bringing an important new tool for urban scientists and planners, especially for regions where data are scarce or not available. Here we apply a functional network approach to determine land use patterns from mobile phone records. The versatility of the method allows us to run a systematic comparison between Spanish cities of various sizes. The method detects four major land use types that correspond to different temporal patterns. The proportion of these types, their spatial organization and scaling show a strong similarity between all cities that breaks down at a very local scale, where land use mixing is specific to each urban area. Finally, we introduce a model inspired by Schelling's segregation, able to explain and reproduce these results with simple interaction rules between different land uses.


Introduction
Land use patterns appear as a natural result of citizens' and planners' interaction with the urban space. However, in a feedback loop, they also play a major role in the experience that residents and visitors have of a city [1]. Land  an effect on the liveability of neighbourhoods and even on the health of the local residents [2]. On the other hand, land use and transportation display a well-established relation [3][4][5][6]. Transport demand depends on the location of residence and business areas, whereas the presence of new transport lines or facilities such as metro stations can substantially modify the land use mixing in a given area of the city. These ideas lie behind the development of the so-called land use transport interaction models [7,8], which are commonly employed in transport planning around the globe [9].
An important issue regarding land use refers to the methods employed to estimate it. City hall registers, surveys or satellite images have been used in the past to this end [10][11][12][13][14][15]. 1 The emergence of geolocated information and communication technologies introduces extra capabilities to directly measure the use that citizens make of each urban space. The information is exhaustive in terms of spatial and temporal resolution, allowing for the detection of concentrations of people second by second along days, weeks and months. Information from mobile phone call records [16][17][18][19][20][21][22][23][24][25][26][27][28][29], geolocated tweets [14,[30][31][32][33][34], credit card use [35] or FourSquare [21] has been considered in the literature. Different data sources have been compared, finding a consistent agreement among the estimations on human concentrations and mobility obtained from different information and communication technology data [26], as well as between information and communication technology data and more traditional techniques [20][21][22][23]26,28,36].
Such wealth of information together with the ability to process massive data brought by the Internet era allows the systematic comparison of features across cities. This analysis can lead to the discovery and confirmation of properties that have been hypothesized to be common to all cities, and also to laws providing insights into the way a property scales with city size. Some examples of these properties include number of patents filed, unemployment rates, gross domestic product per capita, business diversity, consumption of resources, length of road networks or even crime density [37][38][39][40][41][42][43]. The finding of these laws raises the hope of the existence of a coherent framework for city science [38,[40][41][42][44][45][46].
In this work, we explore land use patterns in the five most populous urban areas of Spain. Land use information is obtained from mobile phone records using a new framework based on network theory and systematic comparisons of land use distribution across the five cities are performed at different scales. Our results reveal common features in the land use types' spatial distributions, which can be understood with a model introduced also here. The similarities break down when the land use type mixing is studied at very short spatial scales, exposing patterns characteristic to each city.

A network approach to detect land use
Our database is composed of aggregated and anonymized call records during 55 days between September and November 2009 in Spain. Every time a user receives or makes a call, the event is registered together with the tower (BTS) providing the service. The positions of the BTSs are georeferenced and so the activity levels of each spatial area can be tracked in time. For this work, we selected the five most populated metropolitan areas of Spain: Madrid (with a population over 5.5 million people), Barcelona (3.2 million), Valencia (1.5 million), Seville (980 000) and Bilbao (900 000). There is no unique definition for the border of an urban area. It may refer, for instance, to official, census or economic delimitation of the cities. Because the focus here is on urban land use, we are interested in identifying the inner zones of each city and, therefore, we use the definition of the metropolitan transportation offices: only areas served by metro or urban buses are considered. This is, nevertheless, an important question, because the selection of borders may influence the scaling analysis when comparing across cities [42,47].
The space of the urban areas is divided following a Voronoi tessellation with the BTS locations as centres. The extension of the areas served by each BTS is very different, because it depends on the expected peaks of demand. To ensure a common geographical framework, the five urban areas are divided into a grid with square cells of 500 × 500 m 2 to which the activity is mapped. This should prevent spurious effects owing to the Voronoi area heterogeneity (see electronic supplementary material for a detailed description of the cities and the division process).
The activity (number of users) in each cell is monitored in time and then processed as illustrated in figure 1. Average activity profiles are estimated over each day of the week hour by hour in every cell. These profiles are normalized by the total hourly activity to subtract the trends introduced by the circadian rhythms. A Pearson correlation coefficient is then calculated between the activities of rsos.royalsocietypublishing.org R. Soc. open sci.    figure S4. In order to remove non-significant and negative correlations, we consider only Pearson correlation coefficients higher than a threshold δ. As a result, we obtain one weighted network per urban area. We first note that variations of the threshold do not produce significant changes in the properties of the resulting network. The results in the main text refer to a value of δ equal to the correlation distribution dispersion. Once the networks are built, their mesoscopic structure is analysed using clustering techniques. The main advantage of community detection algorithms in networks compared with more classical clustering techniques based on dissimilarity matrix is that the number of clusters do not need to be fixed a priori. However, it is important to note that different clustering methods can lead to distinct partitions of  the networks. We report next results obtained with Infomap [48], whereas a systematic comparison with results obtained with other clustering tools is provided in the electronic supplementary material. As mentioned previously, Infomap does not require the input of a predetermined number of clusters. Therefore, it is interesting to find that in the five cities, between 98% and 100% of the cells are covered with only four groups. Figure 2 shows what the activity looks like for each of these four clusters in Madrid (similar plots for Barcelona, Valencia, Seville and Bilbao are included as electronic supplementary material, figures S9 and S10). Each of the clusters can be associated with a main land use: (1) Residential (red), which is characterized by low activities from 08.00 to 17.00-18.00. For the cells composing this group, the activity peaks around 07.00-08.00 and during the evening. In the weekend, the activity is almost constant except for the night hours. (2) Business (blue), where the activity is significantly higher during the weekdays than during the weekends. Furthermore, it concentrates from 09.00 to 18.00-19.00. This land use designation can be related to a wide range of commercial, retail, service and office uses. (3) Logistics/industry (cyan), where, as for business, the activity is higher during the weekdays. We observe a large peak between 05.00 and 07.00 followed by a smaller peak around 15.00. This cluster can be related to transport and distribution of goods: for example, 'Mercamadrid' (the largest distribution area of Madrid) belongs to this cluster. More systematically, we compared the land use patterns obtained with our algorithm with cadastral data. The dataset contains information about land use for each cadastral parcel of the metropolitan area of Madrid and Barcelona (about 650 000 parcels). In particular, we have for each cadastral parcel the net internal area devoted to residential, business and industrial uses. This information can be used to identify the dominant cadastral land use in each grid cell classified as residential, business and industrial uses by the community detection algorithm. To do so, we need to define a rule to determine what is the dominant land use in a cell. Intuitively, one would tend to identify the dominant land use in a cell as the land use class with the largest area. However, the nature of both land use assignments is very different: the cadastral data are based on the net internal area officially devoted to each activity and not necessarily on the number of people performing it. Therefore, residential use is the land use class with the largest area in most of the cell leading to an over-representation of residential cells in the metropolitan area. To circumvent this limitation, we introduce two thresholds to identify business and logistics cells with cadastral data in order to obtain a distribution of the fraction of cells according to the land use type similar to the one obtained with our algorithm (see electronic supplementary material for details). The overall agreement is high: we find a percentage of correct predictions equal to 65% for Madrid and 60% for Barcelona which is consistent with values obtained in other studies, 54% in [20] and 58% in [22]. Furthermore, for both case studies, almost all land use types have a percentage of correct predictions higher than 50% (see electronic supplementary material for details).

Comparison of cities
Once defined the clusters, we can compare the proportion of land use type over the five case studies. Figure 3 shows the fraction of cells and the fraction of mobile phone activity averaged over the time period according to the land use patterns for each metropolitan area. We find similar results for the five case studies: the residential land use type represents on average about 40% of the cells and the mobile phone activity against 30% for the business category and less than 15% for the logistics and nightlife clusters.
We can also study how the cells in each cluster are organized in the city's space. For the sake of comparison, we arbitrarily consider as city centre the location of the city hall and build a histogram with the number of cells at a certain distance from it. Because each city has a different spatial extension, distances are normalized by dividing by the maximal distance in each city so as to produce comparable results. The distributions are shown in figure 4, where average curves over all cities have been superimposed. It is interesting to note a certain similarity in the distribution of cells for all urban areas. City size acts as a natural cut-off in the distributions, although no simple functional shape is found in any of the clusters. Residential cells are well distributed across the cities but with a maximum not very far from the centre. Business cells appear at a similar distance as residential but peaking a little further. Logistics and industry are preferentially located in the periphery, whereas the nightlife cells are well distributed along the urban areas but slightly more concentrated in the central areas.
In order to quantify land use distribution patterns, we use the Ripley K [49] defined as where A is the city area, r the search radius (a geographical scale), the index i runs over the cells in the urban area and n is the total number of cells. N i (r) stands for the number of cells of a given type within a distance r from the cell i. This indicator measures the spatial heterogeneity of a given type of cells. The baseline for homogeneous random systems is a growth K(r) = π r 2 until reaching A. If the value of K(r) is over the random curve for a certain r it implies that the system is clusterized at that scale. Because cities have different sizes, both K(r) and the radius must be normalized by their maximum values (A for K(r) and the maximum distance for r). Curves for the normalized Ripley K for each city and land use type are displayed in figure 5a as a function of the normalized radius. The K(r)/A for each city are always above the green curve corresponding to a random distribution of land use types, indicating coarsening of land use. We find a scaling-like curve for all the land use types with most of the cities following well the general trend with some small deviations for nightlife in Seville.
Deepening the analysis, we can also define an entropy index to characterize the land use spatial organization. Let us consider a frame containing the full urban area, which is, in turn, subdivided into a    where α runs over the four clusters. The entropy E i is then averaged over all the divisions to obtain a global metric for the city at a given scale E(D). E(D) tends to zero if the land use within the divisions becomes unique, as occurs for instance at large D (small spatial scales). On the other extreme, when D → 1, E(D) converges to a fixed value describing the full city. Figure 5b shows how the average entropy behaves with D. The curves are similar across cities, recalling the shape of scaling functions. This is not surprising if the concept of a fractal-like distribution of the city activity applies as has been previously discussed in the literature [37][38][39][40][41][42].

Modelling land use
Urban land use models in the literature are typically built with relatively elaborate mechanisms [50,51]. If basic in the rules, the models typically refer to characteristics of cities such as the population or activity distributions [44][45][46]. The shape of E(D) can be explained, however, by a simple model inspired by Schelling's segregation [52]. It is important to stress that this model is not intended to reproduce all the processes leading to the land use formation, but to explain the scaling of its spatial distribution patterns.
The basic framework is a lattice in two dimensions representing the urban space. Initially, a variable t i with a land use type is assigned to every cell i at random (residential R, business B, logistics L or nightlife N). The global fraction of cells of each type respects the proportions found in the empirical data in such a way that E(D = 1) coincides with the observations. A satisfaction index, S i , is then defined per cell taking into account its type and those of its neighbours. Similar to Schelling's model, we assume that the satisfaction increases when a cell is surrounded by cells of its own type. Otherwise, S i depends on the particular combinations of types. Some land uses attract each other as, for instance, residential and business, whereas others repel such as residential and logistics. The existence of rules of attraction and repulsion between different types of land use has already been considered in the past like for example in [53,54]. To be specific, if p i t is the fraction of neighbours of i of type t, then S i is calculated as the same type, and that cells of other types have zero satisfaction if surrounded by any logistic one. With this rule, we introduce a tendency to locate industry and logistics out of the core areas of the cities. Residential and business cells have a certain tolerance to the R, B and N types with γ acting as a mixing control parameter: if γ = 0, mixing is not favoured. A global satisfaction measure is defined as the sum over all the cell satisfaction indices, S = i S i . The model is updated by choosing random pairs of cells and interchanging their land use if the exchange increases S. This process is repeated until the satisfaction reaches a stationary state.
Calibrating the single parameter γ , we can reproduce the observed K(r)/A and E(D) scaling in the real urban areas (see red curves in figure 5). The value of the mixing parameter at which the best average results are obtained is γ = 0.08 (electronic supplementary material, figure S8). For comparison, we have included a null model in which the land use types are distributed at random, keeping the real proportions (green curves in figure 5). Unsurprisingly, the null model fails at reproducing the curves obtained with the data, mainly because, generally, areas of a given land use type tend to cluster together to form land use zones. More interestingly, assuming that the satisfaction of a cell increases when it is surrounded by cells of its own type and that logistic cells and cells repel each other (i.e. γ = 0) is also insufficient to reproduce the properties observed in the data (for more details, see electronic supplementary material, figure S8). Therefore, taking into account the attraction between residential and business areas seems to be crucial for the reproduction of land use spatial organization in cities.

Mixing of land use types
So far, we have considered that each elementary cell has a unique land use type associated. This condition can be easily relaxed. If an average activity profile is defined for each of the four clusters, then a Pearson correlation coefficient between the activity profile in each cell and the clusters' averages can be calculated. The distribution of correlation values is shown in figure 6a. The highest correlation value corresponds typically to the cluster at which the cell is assigned. Still, in some cases, positive correlation values are found for other or even two other clusters. For every cell, we can quantify the intensity of its relation with each cluster by summing over these positive correlations and normalizing by the total. A map of the Barcelona metropolitan area with the intensity of each cell relation with its assigned cluster is shown in figure 6b. The colours represent the four main type of cells and the colour saturation is related to the correlation: darker if the correlation is high, paler otherwise. Most cells match well with their original assigned cluster, keeping darker colours, whereas some are brighter, implying a higher level of land use mixing.
We arbitrarily define a cell as mixed when the normalized correlations fall within the interval 0.3-0.7 for other clusters besides the assigned one. The fraction of mixed cells as a function of the city population is displayed in figure 6c. Larger cities show lower mixing and contain areas devoted to more specific purposes. Figure 6d illustrates how the land use types combine in each cell. Business and residential integrate well together as in our model, increasingly so for smaller cities. The mixing proportions are citydependent and act as a fingerprint to characterize each urban area. This feature may be used to classify cities with similar land use mixing patterns. Besides population size, causal links between level of spatial mixing and city shape, area, age of the city, function, etc., will be explored in future investigations. The mixing proportion can be either related to the organization of cities as monocentric or polycentric [27,55]. Smaller cities display a more monocentric structure, which can be associated with the mixing of land use types given the most restricted space.

Discussion
In summary, we introduce a method to automatically detect land use from electronic records and apply it to the five largest urban areas of Spain in order to perform a systematic comparison across them on the land use distribution. The urban space is divided into a regular grid to prevent geographic heterogeneity and to maintain the spatial scale under control. The user activity profiles are monitored in each unit cell along time, and then a correlation matrix is established between the profiles of every pair of cells. This correlation matrix encodes the functional network of each city. We analyse them by using network clustering techniques, which ensures that cells showing similar use profiles are grouped together. This method has been applied to the five most populated Spanish cities: Madrid, Barcelona, Valencia, Seville and Bilbao. Because the delimitation of urban areas could affect the results, the definition of the municipal transport offices is employed in each case. Interestingly given that the method is unsupervised, four  groups consistently appear as dominant in all cities. They correspond to activity profiles compatible with main land uses in residence, business, logistics/industry and nightlife. Not only are the types the same across cities, but also the proportions of cells and area devoted to each type are similar.
We also study the distribution of the four land use types at different spatial scales. We define the Ripley K and the entropy index for each land use type and the behaviour of both metrics is explored as the spatial scale varies from the full city (macroscopic scale) to a single cell (microscopic). The five cities show similar scaling curves for the metrics, implying comparable structures regarding how the four types amalgamate at the urban level. The shape of the scaling curves can be explained by a simple model that has been proposed in this work. The model is based on a Schelling-like segregation in which the different land use types interact to generate a spatial distribution in the city. Cells in a given land use type tend to maximize the number of neighbours undergoing equivalent uses. This rule induces a tendency to coarsening in land use types. The different land uses interact by attracting each other, such as services and residential areas, or by repelling like industry and almost any other type. The calibration of a single parameter regulating the intensity of the attraction between services, residential uses and nightlife is enough to reproduce the scaling curves observed in the real cities. Moreover, we also demonstrate that a model without land use type interactions cannot recreate the empirical scaling.
Similarities across cities break down when one focuses on how the land use types mix microscopically within each unit cell. A characteristic mixing profile is detected for every urban area, providing an individual city fingerprint. Further data on other cities could help to elucidate whether different typologies exist at this microscopic mixing level. In conclusion, despite that further data from other countries and sources could be important to confirm our results, we find that a coherent picture emerges in the land use organization of major urban areas and that its origin can be explained with a basic model. Data accessibility. The mobile phone dataset used in this study is proprietary and subject to strict privacy regulations. The access to this dataset was granted after reaching a non-disclosure agreement with the proprietor, who anonymized and aggregated the original data before giving access to other authors.