Segregation and polarization in urban areas

Social behaviours emerge from the exchange of information among individuals—constrained by and reciprocally influencing the structure of information flows. The Internet radically transformed communication by democratizing broadcast capabilities and enabling easy and borderless formation of new acquaintances. However, actual information flows are heterogeneous and confined to self-organized echo-chambers. Of central importance to the future of society is understanding how existing physical segregation affects online social fragmentation. Here, we show that the virtual space is a reflection of the geographical space where physical interactions and proximity-based social learning are the main transmitters of ideas. We show that online interactions are segregated by income just as physical interactions are, and that physical separation reflects polarized behaviours beyond culture or politics. Our analysis is consistent with theoretical concepts suggesting polarization is associated with social exposure that reinforces within-group homogenization and between-group differentiation, and they together promote social fragmentation in mirrored physical and virtual spaces.


S1 Data sources
The Twitter data sets were created using the Stream Application Programming Interface (API). Twitter is an online social network whose users share "micro-blog" posts from smartphones and other personal computers. In total we collected over 87 million geolocated tweets from over 2.8 million users between August 2013 and August 2014. Geolocated tweets provide a precise location of the individuals that post messages, and represent around 3% of the overall Twitter stream [1]. Table S1 shows the total number of tweets collected per city.
Table S1: Twitter data sets. Number of geolocated tweets and users analyzed by city.
In order to further assess how representative our Twitter sample is, we compared the number of Twitter users we identified living in each neighborhood with the neighborhood population taken from Census data. Home locations are inferred through individual patterns of nighttime Twitter activity. The results are presented in Table S2. The correlations are all significant and particularly high for four cities.
Table S2: Pearson correlation (r) of Twitter and neighborhood population. The p-value shows the probability that an uncorrelated system produces a correlation of at least r.
The credit card purchase data set was provided by a major financial institution in Istanbul. Records correspond to purchases with credit cards and include the customer (user) and store identifier. Store and user home locations are provided as metadata, together with users' income. The data have been anonymized in accordance with current Turkish privacy laws. We analyzed a total of 2.4 million records of individual credit card purchases in a three-month period, made by 85 thousand individuals at 54 thousand stores.
In Table S3 we show the ranges of the income quantiles (Q) used to create Figures 1 and 2. The table elements show the upper limit of the specific quantile in each city. The lower limit is given in the previous quantile element.

S2 Interaction matrices
In order to study patterns of segregation, we built networks of social interactions among neighborhoods according to three types of data: shopping, mobility and communication. Neighborhoods represent official sub-urban administrative units defined by national authorities: mahalle in the case of Istanbul and Census tract in the case of American cities. The shopping network was created by linking users' home neighborhoods to the neighborhoods where they shop, as determined from credit card transaction data sets. The mobility network was obtained by first analyzing individual patterns of nighttime Twitter activity to infer their home locations, and linking users' home districts to the districts they visit and tweet from. Similarly, the communications network was created by linking the home locations of users mentioning each other in their tweets.
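The construction of the mobility network described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the tuple layout `(user, neighborhood, hour)`, the function names and in particular the night window `NIGHT_HOURS` are assumptions, since the text does not state which hours count as nighttime.

```python
from collections import Counter, defaultdict

# ASSUMPTION: hypothetical night window (local hours 20:00-05:59);
# the source does not specify the hours used.
NIGHT_HOURS = set(range(0, 6)) | set(range(20, 24))

def infer_home(tweets):
    """Map each user to the neighborhood they tweet from most often at night."""
    night_counts = defaultdict(Counter)
    for user, neighborhood, hour in tweets:
        if hour in NIGHT_HOURS:
            night_counts[user][neighborhood] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in night_counts.items()}

def mobility_links(tweets, homes):
    """Count (home neighborhood, tweeted-from neighborhood) pairs."""
    links = Counter()
    for user, neighborhood, _hour in tweets:
        if user in homes:  # users without an inferred home are skipped
            links[(homes[user], neighborhood)] += 1
    return links

# Toy data: (user, neighborhood, hour of day)
tweets = [
    ("u1", "A", 23), ("u1", "A", 2), ("u1", "B", 14),
    ("u2", "B", 22), ("u2", "B", 1), ("u2", "A", 12), ("u2", "C", 15),
    ("u3", "C", 13),  # no nighttime tweets, so no home is inferred
]
homes = infer_home(tweets)
links = mobility_links(tweets, homes)
```

The communication network could be built the same way by replacing the tweeted-from neighborhood with the home neighborhood of the mentioned user.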
To obtain the interaction matrices in Figures 1 and 2, we first normalize each element of these raw interaction matrices by dividing it by the sum of the corresponding column, and then subtract from it the expected value for a uniform probability of interaction (p_u = 1/q, where q is the number of income quantiles). Patterns of segregation are noticeable in the red diagonal of the matrices.
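The normalization step can be written out in a few lines. The sketch below assumes a square q x q matrix of raw interaction counts stored as nested lists and assumes that every column has at least one interaction; the function name is illustrative.

```python
def normalize_interactions(raw):
    """Divide each element by its column sum, then subtract p_u = 1/q."""
    q = len(raw)
    p_u = 1.0 / q  # expected share under a uniform probability of interaction
    col_sums = [sum(raw[i][j] for i in range(q)) for j in range(q)]
    return [[raw[i][j] / col_sums[j] - p_u for j in range(q)]
            for i in range(q)]

# Toy 3 x 3 matrix of raw counts with a dominant diagonal
raw = [
    [6, 2, 1],
    [3, 6, 2],
    [1, 2, 7],
]
deviation = normalize_interactions(raw)
```

Positive entries (red in the figures) mark pairs that interact more than the uniform expectation, and negative entries (blue) mark pairs that interact less. By construction each column of the result sums to zero.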

S2.1 Statistical test
We measured the statistical significance of segregation patterns in the interaction matrices shown in Figures 1 and 2 by applying a two-sided Kolmogorov-Smirnov test [2] to samples of social interactions. The null hypothesis is that the interaction origins conditioned on their target follow a uniform probability distribution. The KS statistical test estimates the probability that two samples follow the same distribution. The first sample is drawn from the data by (1) selecting a target quantile Q to analyze and (2) sampling the source of N interactions. The second sample is artificially generated by creating a set of N interactions whose target is Q and whose origins equally represent all quantiles. A total of 10 sets of samples are created for each network (one per target quantile); therefore, 10 tests are applied in each case. The results of all tests across all networks and cities are summarized in Table S4, using samples of N = 1000 and N = 5000 interactions. The results show that (1) the segregation patterns are statistically significant, (2) the interactions of the middle quantiles are consistently more similar to a uniform probability distribution than other quantiles, and (3) the highest and lowest quantiles are the most segregated groups.
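As an illustration of the test for a single target quantile, the sketch below computes the two-sample KS statistic (the largest gap between two empirical CDFs) between a synthetic sample of interaction origins and a uniform reference sample. The skewed weights are invented toy data, and the critical value uses the standard large-sample approximation; a library routine such as `scipy.stats.ks_2samp` would return a p-value directly.

```python
import math
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    n_a, n_b = len(sample_a), len(sample_b)
    d_max = 0.0
    for v in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(1 for x in sample_a if x <= v) / n_a
        cdf_b = sum(1 for x in sample_b if x <= v) / n_b
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max

def ks_critical_value(n_a, n_b, alpha):
    """Approximate two-sided critical value for large samples."""
    return math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n_a + n_b) / (n_a * n_b))

q, n = 10, 1000
rng = random.Random(0)

# Reference sample: origins equally represent all quantiles 1..q
uniform_origins = [1 + (k % q) for k in range(n)]

# Synthetic "observed" origins for a high-income target quantile,
# skewed toward high-income sources (toy weights, not real data)
weights = [1, 1, 1, 1, 2, 2, 3, 4, 6, 9]
observed_origins = rng.choices(range(1, q + 1), weights=weights, k=n)

d_segregated = ks_statistic(observed_origins, uniform_origins)
d_null = ks_statistic(uniform_origins, uniform_origins)
d_crit = ks_critical_value(n, n, alpha=0.001)
```

A statistic above `d_crit` rejects the uniform null hypothesis at the chosen level; repeating this for each of the 10 target quantiles gives the 10 tests per network.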

S2.2 Balanced sample
In order to assess the robustness of the segregation patterns shown in the interaction matrices (Figures 1 and 2), we repeated the analysis on balanced samples in which the number of users sampled from each neighborhood is proportional to the neighborhood population. In the sampling rule, s_i is the sample size of neighborhood i, c_i is the population (according to Census data) of neighborhood i, and t_i is the Twitter population of neighborhood i. If s_i > t_i we pick a random sample of unique users in neighborhood i not greater than t_i.
In Figure S1 we present the interaction matrices using balanced samples for all cities. The matrices have been normalized and colored as in Figures 1 and 2 (see Section S2). Assuming the uniform distribution as non-segregated and using it as reference (white), values above it (red) show a higher interaction preference and values below it (blue) show a lower interaction preference among the population of income quantiles. The patterns of segregation are just as noticeable as in previous figures. Matrices show a clear red diagonal indicating strong preferences of income quantiles to interact with their own class. In some cases the diagonal structure is even clearer than in previous figures. In New York City, a balanced sample shows that lower income areas visit richer areas (Manhattan) and not the opposite.
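A balanced sample of this kind could be drawn as sketched below. The exact expression for s_i is not available in this text, so the code assumes a simple proportional rule, s_i = round(S * c_i / sum_j c_j) for a hypothetical overall sample size S, capped at the number of Twitter users t_i. Both the rule and the names are assumptions made for illustration.

```python
import random

def balanced_sample(users_by_nbhd, census_pop, total_sample, seed=0):
    """Sample users per neighborhood in proportion to Census population."""
    rng = random.Random(seed)
    total_census = sum(census_pop.values())
    sample = {}
    for nbhd, users in users_by_nbhd.items():
        t_i = len(users)  # Twitter population of the neighborhood
        # ASSUMED proportional rule; the source equation is not reproduced here
        s_i = round(total_sample * census_pop[nbhd] / total_census)
        # Never draw more unique users than are available (s_i > t_i case)
        sample[nbhd] = rng.sample(users, min(s_i, t_i))
    return sample

# Toy data: neighborhood A is over-represented on Twitter relative to B
users_by_nbhd = {
    "A": [f"a{k}" for k in range(50)],
    "B": [f"b{k}" for k in range(5)],
}
census_pop = {"A": 1000, "B": 3000}
sample = balanced_sample(users_by_nbhd, census_pop, total_sample=40)
```

In the toy data, neighborhood A contributes 10 users, while B would require 30 but only has 5 available, so all 5 are used.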

S3 Dimensionality reduction
In order to study the structure of both hashtag and mobility spaces, we applied dimensionality reduction to matrices in which rows represent neighborhoods and columns indicate either hashtags posted in those neighborhoods or users who tweet from those locations. The former is an indicator of online conversations and the latter is an indicator of the space of human mobility. Before applying dimensionality reduction, we applied a normalization transformation to the matrices weighting terms according to their frequency of appearance [3]. We then applied PCA analysis and topic modeling to these matrices.

S3.1 PCA analysis
As in Figure 4, Figure S2 shows the results of the PCA analysis of the hashtag space versus the mobility space with respect to income, for Chicago, Dallas, Detroit, Los Angeles and Philadelphia. Dots represent neighborhoods, and colors indicate income. The plots show that rich and poor neighborhoods occupy separate regions of the data space.
The geographical distributions of the hashtag usage and human mobility PCA are shown in Figures S3 to S16. There is one figure for each city and feature space (hashtag and mobility). Each panel shows the geographical structure of one of the top 20 principal components, annotated with its correlation with income. The figures show that the main components have clearly defined spatial structures and that one of them (mainly the first or sometimes the second or third) has a high correlation with neighborhood income (also shown in Figures 22 and 23). Higher order components (above the 5th component) explain less variance and show low correlation with income and no clearly defined spatial structure. In the case of human mobility, spatial correlations persist at higher order components, whereas in the hashtag space the components rapidly become random (see Figures S3 to S16).
In Figure S17 we present the correlation of the top 20 components with income for both hashtag (solid blue line) and mobility (solid red line) matrices in Istanbul and New York City. Similar results for other cities are presented in Figure S18. In all cities one main component (or two at most) is highly correlated with income. In the case of Istanbul, this component is the first one. In other cases, it is the second or third. The only exception is Dallas, where it is the 4th component in the human mobility PCA. Higher order components do not have high correlations with income. In the case of Istanbul, we also present the correlation with political preferences, measured as the percentage of votes for the ruling party by neighborhood (dashed lines in Figure S17). The correlation of the first component with political preferences is lower than the correlation with income.
In order to determine the significance of the correlations, we randomly shuffle the hashtag and mobility vectors among neighborhoods, i.e. we randomly assign the hashtag or mobility vector of one neighborhood to another. We then apply PCA to the randomized space and calculate the correlation of each component with income. We repeat this experiment 100 times and analyze the statistical behavior of the correlations. The results are presented in Figure S17 for Istanbul and New York City, and in Figure S18 for the rest of the cities. The green and yellow lines show the correlation of the components after randomizing hashtag usage and mobility vectors among neighborhoods. Error bars show standard deviation. The correlation of main components with income is clearly significant for all cities (p < 0.001). The p-values of the correlation of each component with income are shown in Figure S19. Higher order components are not always significant. Moreover, randomizing the internal structure of hashtag and mobility vectors does not yield any correlation with income in any component.
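The shuffling experiment can be illustrated without running PCA itself. Reordering the rows of a matrix does not change its principal directions and only reorders the component scores, so reassigning vectors among neighborhoods amounts to permuting the scores relative to income. The sketch below uses that shortcut on synthetic data; the data, names and effect size are invented for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def shuffled_correlations(scores, income, n_rounds=100, seed=0):
    """Correlation with income after randomly reassigning scores to neighborhoods."""
    rng = random.Random(seed)
    shuffled = list(scores)
    results = []
    for _ in range(n_rounds):
        rng.shuffle(shuffled)
        results.append(pearson(shuffled, income))
    return results

# Synthetic neighborhoods: component scores that partly track income
rng = random.Random(1)
income = [rng.gauss(50, 15) for _ in range(200)]
scores = [0.05 * inc + rng.gauss(0, 0.5) for inc in income]

r_observed = pearson(scores, income)
null = shuffled_correlations(scores, income, n_rounds=100)
null_mean = sum(null) / len(null)
null_sd = math.sqrt(sum((r - null_mean) ** 2 for r in null) / (len(null) - 1))
```

Comparing `r_observed` with the spread of `null` (mean and standard deviation over 100 rounds) mirrors the error bars of the randomized curves in Figures S17 and S18.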

S3.2 Topic modeling
In order to analyze the significance of the topics' correlation with neighborhood income, we applied the topic model to randomized hashtag spaces. We randomized the hashtag space by shuffling the hashtag vectors across neighborhoods, i.e. randomly assigning the hashtag vector of one neighborhood to another. For each randomization we calculate 20 topics and measure the correlation of each topic with neighborhood income. In Figure S20 we present the distribution of correlations of neighborhood income with the topics obtained from the randomized hashtag space. The black curve shows the distribution of income correlations across all topics, the red curve shows the distribution of minimum correlations and the blue curve shows the distribution of maximum correlations. The vertical blue and red dashed lines show the correlations of the topics used in Figures 5 and 6.
In Table S5 we present the top 10 hashtags related to topics positively and negatively correlated with income in the city of Istanbul. In this city, richer areas are predominantly using hashtags in English that are related to lifestyle. On the other side, in poorer areas, people use hashtags in Turkish related to religion, politics and TV shows.
In Table S6 we summarize the hashtags from American cities by showing the top 30 hashtags that are common in topics positively and negatively correlated with income across all cities. While there are similarities in this case, richer areas continue to predominantly talk about lifestyle, while poorer areas seem to be more interested in horoscopes. The extensive list of hashtags per topic and city in American cities is presented in Tables S7 to S12.

S4 Cultural Distance
Previous studies use the accuracy of prediction tasks in order to measure the distance or difference between cultural features [4]. They assume that low accuracy represents high differentiation among groups and high accuracy represents less differentiation. Using the same methodology, we define differences among income groups via the accuracy of predicting mobility patterns and hashtag usage. Users are characterized by two vectors that respectively represent their mobility and hashtags posted. The non-zero elements of the mobility vector represent city neighborhoods the individual visits and tweets from. The non-zero elements of the hashtag vector indicate the hashtags individuals post. The vectors are transformed using TF-IDF [3]. This transformation is often used to classify documents and highlights local information as opposed to globally used terms.
Topic-1 (r > 0): love, night, sunday, food, summer, photooftheday, breakfast, yummy, huzur, party, fashion, sea, nofilter, istanbul
Topic-2 (r < 0): np, GazzedeKatliamVar, offline, Soma, FF, GazzeSiyonizmeMezarOlacak, beyazshow, SomadakiKardeslerimizeYardimEtALLAHIM, GazaUnderAttack, medcezir, nw, 3Adam, takipedenitakipederim, siirsokakta
Table S6: Top 30 common hashtags from topics positively (r > 0) and negatively (r < 0) correlated with income across all American cities.
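To make the effect of the TF-IDF transformation concrete, the sketch below applies one common variant (term frequency times the natural log of N/df) to toy per-user hashtag lists. The exact weighting used in the analysis follows [3] and may differ, for example through smoothing or normalization, so the formula here is an assumption.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF with tf = count / total and idf = ln(N / df) (one common variant)."""
    n_docs = len(docs)
    # Document frequency: number of users who posted each hashtag at least once
    doc_freq = Counter(tag for tags in docs.values() for tag in set(tags))
    vectors = {}
    for name, tags in docs.items():
        counts = Counter(tags)
        total = sum(counts.values())
        vectors[name] = {
            tag: (count / total) * math.log(n_docs / doc_freq[tag])
            for tag, count in counts.items()
        }
    return vectors

# Toy data: hashtags posted by three users
docs = {
    "u1": ["love", "food", "love"],
    "u2": ["love", "soma"],
    "u3": ["love", "medcezir", "soma"],
}
vectors = tf_idf(docs)
```

A hashtag used by every user ("love" in the toy data) receives zero weight, while hashtags specific to a few users are emphasized, which is the "local versus globally used terms" behavior described above.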
We predict the individual income quantile using a multilayer perceptron (MLP) classifier applied independently to the mobility and hashtag feature spaces. For this purpose, we randomly divide the sample into a training set containing 75% of individuals and a test set containing the remaining 25%. We create multiple sample sets in order to analyze the performance of the predictor as a random variable. Bootstrapping the performance gives a more robust estimate of the prediction quality.
The bootstrap of the prediction accuracy is shown in Figure S21 for the city of Istanbul. The prediction accuracies for both hashtags (orange) and mobility (blue) are significantly higher than the probability of a random guess given the number of possible categories (dashed line). Therefore, people in similar income quantiles also have similar behaviors. Mobility patterns are a better predictor of income quantile than hashtags used. The hashtag space is larger and more diverse. Therefore, aggregation is necessary to observe patterns of collective behaviors of hashtag usage, such as those shown in Section S3.

S5 Distributions
In Figure S22 we present the cumulative distribution function (CDF) of pairwise neighborhood hashtag usage similarity for Istanbul. We measure hashtag usage similarity by the cosine distance of the neighborhoods' hashtag vectors. We grouped neighborhoods by income into 30 quantiles, each represented by one curve. The color of each curve indicates the normalized median income of each group (from red to blue) and the black dashed curve shows the global distribution. Wealthier neighborhoods (blue) present a significantly higher hashtag usage similarity relative to the global distribution, indicating a more coherent behavior. Meanwhile, poorer neighborhoods show a significantly lower similarity relative to the global distribution, indicating more dispersed conversations.
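The pairwise similarity and its empirical CDF can be sketched as follows. The code computes cosine similarity (one minus the cosine distance) between sparse hashtag vectors stored as dictionaries; the vectors and names are toy assumptions.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {hashtag: weight}."""
    dot = sum(w * v.get(tag, 0.0) for tag, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

def pairwise_similarities(vectors):
    """Sorted similarities over all unordered pairs of neighborhoods."""
    return sorted(cosine_similarity(vectors[a], vectors[b])
                  for a, b in combinations(vectors, 2))

def empirical_cdf(values, x):
    """Fraction of values less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

# Toy hashtag vectors for three neighborhoods
vectors = {
    "n1": {"love": 1.0, "food": 1.0},
    "n2": {"love": 1.0, "food": 1.0},
    "n3": {"soma": 2.0},
}
sims = pairwise_similarities(vectors)
share_below_half = empirical_cdf(sims, 0.5)
```

Computing `sims` separately for the neighborhoods of each income group, and once for all neighborhoods, would give the per-group curves and the global curve respectively.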

S6 Aggregation
We analyze the relationship between mutual exposure (measured via inter-neighborhood mobility) and homogenization (measured via the cosine distance between neighborhood hashtag vectors) by means of aggregation and correlation. We aggregate neighborhoods in 10 groups either by income or random association (similar to Figure 7). The interactions within and across groups are represented as matrices. The matrices are normalized as described in Section S2.
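The aggregation step can be sketched as below: neighborhoods are assigned to q equal-sized income groups and the neighborhood-level matrix is summed into a q x q group-level matrix. Summation is an assumption that suits interaction counts; for similarity values an average per group pair may be more appropriate. The function names are illustrative.

```python
def quantile_groups(income, q):
    """Assign each neighborhood to one of q equal-sized income groups (0 = poorest)."""
    order = sorted(range(len(income)), key=lambda i: income[i])
    groups = [0] * len(income)
    for rank, i in enumerate(order):
        groups[i] = rank * q // len(income)
    return groups

def aggregate(matrix, groups, q):
    """Sum a neighborhood-by-neighborhood matrix into a q x q group matrix."""
    agg = [[0.0] * q for _ in range(q)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            agg[groups[i]][groups[j]] += value
    return agg

# Toy data: four neighborhoods aggregated into q = 2 income groups
income = [10, 40, 20, 30]
mobility = [
    [5, 1, 4, 0],
    [1, 6, 0, 3],
    [4, 0, 5, 1],
    [0, 3, 1, 6],
]
groups = quantile_groups(income, q=2)
aggregated = aggregate(mobility, groups, q=2)
```

Replacing `groups` with a random assignment of neighborhoods to groups yields the randomized counterpart used for comparison.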
As in Figure 7, the left panels of Figure S23 show the mobility and hashtag similarity matrices for Chicago, Dallas, Detroit, Los Angeles and Philadelphia aggregated by income, along with scatter plots showing the respective element-wise correlation of the two matrices. Analogously, the right panels of Figure S23 show the corresponding randomized versions of the matrices and respective scatter plots. A strong correlation between the elements from the mobility and hashtag similarity matrices is manifest in the scatter plots (similar to the results presented in Figure 7).

Figure S1: Interaction matrices by type of activity for all cities using balanced samples. Matrices show normalized sum of interactions between each pair of neighborhoods according to mobility (left) and communication (right) on Twitter. Neighborhoods are aggregated into q = 10 income quantiles represented in the axes. Blue regions represent the pairs of income quantiles where probability of interaction is below the expected value of a uniform distribution p_u = 1/q, and red ones above. The samples are weighted proportionally to the Census population.
Figure S2: Principal component analysis (PCA) of hashtag usage (y-axis) and human mobility pattern (x-axis) for Chicago, Dallas, Detroit, Los Angeles and Philadelphia. Dots represent neighborhoods, and colors indicate the median income (increasing from red to blue). Each axis represents the projection of the neighborhood data onto one main component correlated with income (see Table 2 in the main text).
Figure S3: Spatial structure of the 20 principal components of hashtag usage in Chicago. The correlation with neighborhood income r is shown in each panel.
Figure S4: Spatial structure of the 20 principal components of human mobility patterns in Chicago. The correlation with neighborhood income r is shown in each panel.
Figure S5: Spatial structure of the 20 principal components of hashtag usage in Dallas. The correlation with neighborhood income r is shown in each panel.
Figure S6: Spatial structure of the 20 principal components of human mobility patterns in Dallas. The correlation with neighborhood income r is shown in each panel.
Figure S7: Spatial structure of the 20 principal components of hashtag usage in Detroit. The correlation with neighborhood income r is shown in each panel.
Figure S8: Spatial structure of the 20 principal components of human mobility patterns in Detroit. The correlation with neighborhood income r is shown in each panel.
Figure S9: Spatial structure of the 20 principal components of hashtag usage in Istanbul. The correlation with neighborhood income r is shown in each panel.
Figure S10: Spatial structure of the 20 principal components of human mobility patterns in Istanbul. The correlation with neighborhood income r is shown in each panel.
Figure S11: Spatial structure of the 20 principal components of hashtag usage in Los Angeles. The correlation with neighborhood income r is shown in each panel.
Figure S12: Spatial structure of the 20 principal components of human mobility patterns in Los Angeles. The correlation with neighborhood income r is shown in each panel.
Figure S13: Spatial structure of the 20 principal components of hashtag usage in New York City. The correlation with neighborhood income r is shown in each panel.
Figure S14: Spatial structure of the 20 principal components of human mobility patterns in New York City. The correlation with neighborhood income r is shown in each panel.
Figure S15: Spatial structure of the 20 principal components of hashtag usage in Philadelphia. The correlation with neighborhood income r is shown in each panel.
Figure S16: Spatial structure of the 20 principal components of human mobility patterns in Philadelphia. The correlation with neighborhood income r is shown in each panel.
Figure S17: Correlation of the 20 principal components of hashtag usage (blue) and human mobility patterns (red) with income for Istanbul (top) and New York City (bottom). The correlation of income with the 20 principal components of a randomized hashtag (green) and mobility (yellow) space is also shown. Error bars show standard deviation. In the case of Istanbul, the correlation with electoral results is also shown (dashed lines).
Figure S18: Correlation of the 20 principal components of hashtag usage (blue) and human mobility patterns (red) with income for multiple American cities. The correlation of income with the 20 principal components of a randomized hashtag (green) and mobility (yellow) space is also shown. Error bars show standard deviation.
Table 2 Table 3 from the main text.
figure). The black dashed curve shows the global distribution.
Figure S23: Inter-neighborhood social exposure through mobility and hashtag usage similarities after aggregating by income (left panels) or random association (right panels), for Chicago, Dallas, Detroit, Los Angeles and Philadelphia. Neighborhoods have been aggregated into ten income quantiles represented in the axes. Interactions depicted in blue are below the expected value from a uniform distribution and those in red above. Matrices have been normalized relative to the corresponding standard deviation from income aggregation (scale in figure). Scatter plots show the correlation (r) between mobility and hashtag similarity for both types of aggregation. In the random case (right panel), we show average correlation and standard error after 100 realizations.
Table S10: Top hashtags from topics positively (Topic-1, r > 0) and negatively (Topic-2, r < 0) correlated with income in Los Angeles.