Scaling in words on Twitter

Scaling properties of language are a useful tool for understanding generative processes in texts. We investigate the scaling relations in citywise Twitter corpora coming from the metropolitan and micropolitan statistical areas of the United States. We observe a slightly superlinear urban scaling with the city population for the total volume of the tweets and words created in a city. We then find that a certain core vocabulary follows the scaling relationship of the bulk text, but most words are sensitive to city size, exhibiting a super- or a sublinear urban scaling. For both regimes, we can offer a plausible explanation based on the meaning of the words. We also show that the parameters for Zipf's Law and Heaps' Law differ on Twitter from those of other texts, and that the exponent of Zipf's Law changes with city size.


arXiv:1903.04329v1 [physics.soc-ph] 11 Mar 2019

Introduction
The recent increase in digitally available language corpora has made it possible to extend traditional linguistic tools to a vast amount of often user-generated texts. Understanding how these corpora differ from traditional texts is crucial in developing computational methods for web search, information retrieval or machine translation [1]. The amount of these texts enables the analysis of language on an unprecedented scale [2,3,4], including the dynamics, geography and time scale of language change [5,6], social media cursing habits [7,8,9] or dialectal variations [10].
From online user activity and content, it is often possible to infer different socio-economic variables on various aggregation scales. Many studies have attempted to connect online media language and metadata to real-world outcomes: showing correlation between the main language features on Twitter and several demographic variables [11], predicting heart-disease rates of an area based on its language use [12], relating unemployment to social media content and activity [13,14,15], or forecasting stock market moves from search semantics [16]. Various studies have analyzed spatial variation in the text of online social network (OSN) messages and its applicability to several different questions, including user localization based on the content of their posts [17,18], empirical analysis of the geographic diffusion of novel words, phrases, trends and topics of interest [19,20], and measuring public mood [21].
While many of the above cited studies exploit the fact that language use or social media activity varies in space, it is hard to capture the impact of the geographic environment on the words or concepts used. There is a growing literature on how the sheer size of a settlement influences quantities such as the number of patents, GDP or the total road length, in a way that is driven by universal laws [22]. These observations led to the establishment of the theory of urban scaling [23,24,25,26,27,28,29,30,31], where scaling laws with city size have been observed in various measures such as economic productivity [32], human interactions [33], urban economic diversification [34], election data [35], building heights [36], crime concentration [37,38] or touristic attractiveness [39].
In our paper, we aim to capture the effect of city size on language use via individual urban scaling laws of words. By examining the so-called scaling exponents, we are able to connect geographical size effects to systematic variations in word use frequencies. We show that the sensitivity of words to population size is also reflected in their meaning. We also investigate how social media language and city size affect the parameters of Zipf's law [40], and how the exponent of Zipf's law differs from the literature value [40,41]. We also show that the number of new words needed in longer texts, described by Heaps' law [2], exhibits a power-law form on Twitter, indicating a decelerating growth of distinct tokens with city size.

Twitter and census data
We use data from the online social network Twitter, which freely provides approximately 1% of all sent messages via its streaming API. For mobile devices, users have an option to share their exact location along with the Twitter message. Therefore, some messages contain geolocation information in the form of GPS coordinates. In this study, we analyze 456 million of these geolocated tweets collected between February 2012 and August 2014 from the area of the United States. We construct a geographically indexed database of these tweets, permitting the efficient analysis of regional features [42]. Using the Hierarchical Triangular Mesh scheme for practical geographic indexing, we assigned a US county to each tweet [43,44]. County borders are obtained from the GAdm database [45]. Counties are then aggregated into Metropolitan and Micropolitan Areas using the county to metro area crosswalk file from [46]. Population data for the MSA areas is obtained from [47].
There are many ways a user can post on Twitter. Because a large number of posts come from third-party apps such as Foursquare, we filter the messages according to their URL field. We only keep messages that have either no source URL, or whose URL after the 'https://' prefix matches one of the following SQL patterns: 'twit%', 'tl.gd%' or 'path.com%'. These are most likely text messages intended for the original use of Twitter, so that automated texts such as the phrases 'I'm at' or 'check-in' from Foursquare are left out.
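The source filter described above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the function name `is_organic_source` and the representation of the source URL as an optional string are assumptions, the SQL patterns are treated as case-sensitive prefix matches, and URLs that do not start with 'https://' are assumed to be rejected.

```python
from typing import Optional

# Prefixes equivalent to the SQL LIKE patterns 'twit%', 'tl.gd%', 'path.com%'.
ALLOWED_PREFIXES = ("twit", "tl.gd", "path.com")
SCHEME = "https://"


def is_organic_source(source_url: Optional[str]) -> bool:
    """Return True if a tweet should be kept according to its source URL."""
    # Messages without any source URL are kept.
    if not source_url:
        return True
    # Assumption: anything not starting with 'https://' is filtered out.
    if not source_url.startswith(SCHEME):
        return False
    # Compare the part after the scheme with the allowed prefixes.
    return source_url[len(SCHEME):].startswith(ALLOWED_PREFIXES)
```

For example, `is_organic_source("https://foursquare.com")` evaluates to `False`, while a missing URL (`None`) is kept.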
For the tokenization of the Twitter messages, we use the toolkit published on https://github.com/eltevo/twtoolkit. We leave out words that are fewer than three characters long, contain numbers or have the same consecutive character more than twice. We also filter hashtags, characters with high unicode values, usernames and web addresses [42].
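The word-level rules above can be illustrated with a short sketch. It is not the code of the cited toolkit: the function name `keep_token` is hypothetical, recognizing hashtags, usernames and web addresses by their leading characters is an assumption, and the filter on high unicode values is omitted because the threshold is not given here.

```python
import re

# Three or more identical characters in a row, e.g. "sooo".
REPEATED_CHAR = re.compile(r"(.)\1\1")


def keep_token(token: str) -> bool:
    """Return True if a whitespace-separated token passes the word filters."""
    # Assumption: hashtags, usernames and web addresses are recognized
    # by their leading characters.
    if token.startswith(("#", "@", "http://", "https://", "www.")):
        return False
    # Words shorter than three characters are dropped.
    if len(token) < 3:
        return False
    # Words containing digits are dropped.
    if any(ch.isdigit() for ch in token):
        return False
    # Words with the same character more than twice in a row are dropped.
    if REPEATED_CHAR.search(token):
        return False
    return True
```

Applied to the tokens of "omg sooo cool at 4pm", this keeps only "omg" and "cool".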

Urban scaling
Most urban socioeconomic indicators follow a power-law relation within a given urban system:

$$Y = Y_0 \cdot N^{\beta}, \tag{1}$$

where $Y$ denotes a quantity (economic output, number of patents, crime rate etc.) related to the city, $Y_0$ is a multiplication factor, $N$ is the size of the city in terms of its population, and β denotes a scaling exponent that captures the dynamics of the change of the quantity $Y$ with city population $N$. β = 1 describes a linear relationship, where the quantity $Y$ is linearly proportional to the population, which is usually associated with individual human needs such as jobs, housing or water consumption. The case β > 1 is called superlinear scaling, and it means that larger cities exhibit disproportionately more of the quantity $Y$ than smaller cities.
This type of scaling is usually related to larger cities being disproportionately the centers of innovation and wealth. The opposite case, β < 1, is called sublinear scaling, and is usually related to infrastructural quantities such as road network length, where urban agglomeration effects create more efficiency [27].
Here we investigate scaling relations between urban area populations and various measures of Twitter activity and the language on Twitter. When fitting scaling relations on aggregate metrics or on the number of times a certain word appears in a metropolitan area, we always assume that the total number of tweets, or the total number of occurrences of a certain word, $Y_{tot}$, must be conserved in the law. That means that we have only one parameter in our fit, the value of β, while the multiplication factor $Y_0$ is determined by β and $Y_{tot}$ as follows:

$$Y_0 = \frac{Y_{tot}}{\sum_{i=1}^{K} N_i^{\beta}}, \tag{2}$$

where the index $i$ denotes different cities, the total number of cities is $K$, and $N_i$ is the population of the city with index $i$.
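The conservation constraint fixes the multiplication factor once β and the total are given: summing Y_0 · N_i^β over all cities has to return Y_tot. A minimal sketch of this normalization follows; the function names `prefactor` and `expected_counts` and the example numbers are illustrative assumptions.

```python
from typing import List, Sequence


def prefactor(populations: Sequence[float], beta: float, y_tot: float) -> float:
    """Return Y_0 such that the sum of Y_0 * N_i**beta over cities is y_tot."""
    return y_tot / sum(n ** beta for n in populations)


def expected_counts(populations: Sequence[float], beta: float,
                    y_tot: float) -> List[float]:
    """Expected value of the quantity Y in each city under the scaling law."""
    y0 = prefactor(populations, beta, y_tot)
    return [y0 * n ** beta for n in populations]


# Hypothetical example: three cities sharing 5000 units of output.
populations = [1e4, 1e5, 1e6]
counts = expected_counts(populations, beta=1.02, y_tot=5000.0)
```

By construction, the entries of `counts` add up to `y_tot` for any value of β, so β remains the only free parameter of the fit.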
We use the 'Person Model' of Leitao et al. [48], where this conservation is ensured by the normalization factor, and where the assumption is that out of the total number of $Y_{tot}$ units of output that exist in the whole urban system, the probability $p(j)$ for one person $j$ to obtain one unit of output depends only on the population $N_j$ of the city where person $j$ lives, as

$$p(j) = \frac{N_j^{\beta - 1}}{Z(\beta)},$$

where the normalization constant $Z(\beta) = \sum_{i=1}^{K} N_i^{\beta}$ results from summing over people in all of the cities. Formally, this model corresponds to a scaling relationship from (1), with $Y_0 = Y_{tot} / Z(\beta)$. But it can also be interpreted as urban scaling being the consequence of the scaling of word choice probabilities for a single person, which has a power-law exponent of β − 1.
To assess the validity of the scaling fits for the words, we confirm nonlinear scaling if the difference between the likelihoods of a model with a fixed exponent $\beta_W$ (the scaling exponent of the total number of words) and a model with β given by the fit is big enough. This means that the difference between the Bayesian Information Criterion (BIC) values of the two models, $\Delta\mathrm{BIC} = \mathrm{BIC}_{\beta = 1} - \mathrm{BIC}_{\beta \neq 1}$, is sufficiently large [48]: ∆BIC > 6. Otherwise, if ∆BIC < 0, the linear model fits the scaling better, and between the two values, the fit is inconclusive.
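The model comparison can be sketched as below, under several stated assumptions: the counts across cities are treated as a multinomial sample with probabilities proportional to N_i^β (one reading of the Person Model), BIC is computed as k ln n − 2 ln L with n taken as the number of cities, and the search interval [0, 3] for β is arbitrary. All function names are illustrative; this is not the fitting code of [48].

```python
import math
from typing import Sequence, Tuple


def log_likelihood(beta: float, populations: Sequence[float],
                   counts: Sequence[float]) -> float:
    """Multinomial log-likelihood, up to a constant independent of beta."""
    logs = [beta * math.log(n) for n in populations]
    top = max(logs)
    # log of sum_i N_i**beta, computed in a numerically stable way.
    log_z = top + math.log(sum(math.exp(v - top) for v in logs))
    return sum(y * (v - log_z) for y, v in zip(counts, logs))


def fit_beta(populations, counts, lo=0.0, hi=3.0, tol=1e-9) -> float:
    """Maximize the (concave) log-likelihood by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, populations, counts) > log_likelihood(d, populations, counts):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


def delta_bic(populations, counts, beta_fixed: float) -> Tuple[float, float]:
    """Return the fitted beta and BIC(fixed exponent) - BIC(fitted exponent)."""
    beta_hat = fit_beta(populations, counts)
    n_obs = len(populations)  # assumption: one observation per city
    bic_fixed = -2 * log_likelihood(beta_fixed, populations, counts)  # k = 0
    bic_free = math.log(n_obs) - 2 * log_likelihood(beta_hat, populations, counts)  # k = 1
    return beta_hat, bic_fixed - bic_free


def classify(d_bic: float, threshold: float = 6.0) -> str:
    """Apply the decision rule: > 6 nonlinear, < 0 linear, else inconclusive."""
    if d_bic > threshold:
        return "nonlinear"
    if d_bic < 0:
        return "linear"
    return "inconclusive"
```

Passing the exponent of the total word count as `beta_fixed` gives the comparison used for individual words, while `beta_fixed=1.0` tests against strictly linear scaling.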

Zipf's law
We use the following form for Zipf's law, proposed in [49], which fits the probability distribution of the word frequencies apart from the very rare words:

$$p(f) \propto f^{-\beta}, \qquad f > f_{\min}.$$

We fit the probability distribution of the frequencies using the powerlaw package of Python [50], which uses a Maximum Likelihood method based on the results of [51,52,53]. $f_{\min}$ is the frequency for which the power-law fit is the most probable with respect to the Kolmogorov-Smirnov distance [50].
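To illustrate the idea behind such a fit, the sketch below estimates the exponent of a distribution p(f) ∝ f^(−β) above a lower cutoff and selects the cutoff that minimizes the Kolmogorov-Smirnov distance. It is a simplified stand-in, not the powerlaw package: it uses the continuous approximation of the maximum-likelihood estimator, scans every distinct value as a candidate cutoff in quadratic time, and all function names and the `min_tail` parameter are assumptions.

```python
import math
from typing import Optional, Sequence, Tuple


def fit_exponent(freqs: Sequence[float], f_min: float) -> float:
    """Continuous maximum-likelihood estimate of beta for the tail f >= f_min."""
    tail = [f for f in freqs if f >= f_min]
    log_sum = sum(math.log(f / f_min) for f in tail)
    return 1.0 + len(tail) / log_sum


def ks_distance(freqs: Sequence[float], f_min: float, beta: float) -> float:
    """Largest gap between the empirical and the fitted cumulative distribution."""
    tail = sorted(f for f in freqs if f >= f_min)
    n = len(tail)
    worst = 0.0
    for i, f in enumerate(tail):
        model_cdf = 1.0 - (f / f_min) ** (1.0 - beta)
        # Compare with the empirical CDF just before and just after the point.
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst


def fit_power_law(freqs: Sequence[float],
                  min_tail: int = 10) -> Optional[Tuple[float, float, float]]:
    """Return (f_min, beta, KS distance) for the cutoff with the best fit."""
    best = None
    for f_min in sorted(set(freqs)):
        tail = [f for f in freqs if f >= f_min]
        if len(tail) < min_tail:
            break  # too few points left for a meaningful fit
        if max(tail) == f_min:
            continue  # degenerate tail, the estimator is undefined
        beta = fit_exponent(freqs, f_min)
        dist = ks_distance(freqs, f_min, beta)
        if best is None or dist < best[2]:
            best = (f_min, beta, dist)
    return best
```

For word counts, which are integers, a discrete estimator would be more appropriate; the continuous form is used here only to keep the example short.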
A perhaps more common form of the law connects the rank $r$ of a word and its frequency:

$$f(r) \propto r^{-\alpha}.$$

We use the previous form because the fitting method of [50] can only reliably tell the exponent for the tail of a distribution. In the rank-frequency case, the interesting part of the fit would be at the first few ranks, while the most common words are in the tail of the $p(f)$ distribution.
The two formulations can be easily transformed into each other (see [49]), which gives us

$$\alpha = \frac{1}{\beta - 1},$$

where β is the exponent of the $p(f)$ distribution. This enables us to compare our result to several others in the literature.

Results and discussion

Scaling of aggregate metrics
First, we checked how some aggregate metrics, namely the total number of users, the total number of individual words and the total number of tweets, change with city size. Figures 1, 2 and 3 show the scaling relationship data on a log-log scale, and the result of the fitted model (the Person Model of [48]). In all cases, ∆BIC was greater than 6, which confirmed nonlinear scaling. The total counts of tweets and words both have slightly superlinear exponents around 1.02. The deviation from the linear exponent may seem small, but it means that for a tenfold increase in city size, the per-capita abundance of the quantity Y increases by about 5% (10^0.02 ≈ 1.047), which is already a significant change. The number of users, in contrast, scales sublinearly (β = 0.95 ± 0.01) with the city population.
It has been shown in [33] that total communication activity in human interaction networks grows superlinearly with city size. This is in line with our findings that the total number of tweets and the total word count scale superlinearly. However, the exponents are not as big as those of the number of calls or call volumes in the previously mentioned article (β ∈

Individual scaling of words
For the 11732 words that had at least 10000 occurrences in the dataset, we fitted scaling relationships using the Person Model. The distribution of the fitted exponents is shown in Figure 5. The most probable exponent is approximately 1.02, which corresponds roughly to the scaling exponent of the overall word count. This is the exponent that we use as the alternative model for deciding nonlinearity, because a word that has a scaling law with the same exponent as the total number of words has the same relative frequency in all urban areas. The linear and inconclusive cases calculated from ∆BIC values are located around this maximum, as shown in different colors in Figure 5. In this figure, linearly and nonlinearly classified fits might appear in the same exponent bin, because the fitted exponents are similar while the goodness of fit differs. Words with a smaller exponent, which are "sublinear", do not follow the text growth; thus, their relative frequency decreases as city size increases. Words with a greater exponent, which are "superlinear", are relatively more prevalent in texts in bigger cities. There are slightly more words that scale sublinearly (5271, 57% of the nonlinear words) than superlinearly (4011, 43% of the nonlinear words).
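The link between the exponent of a word and its relative frequency can be made explicit with a short calculation. As a sketch, assume that the number of occurrences of a word $w$ scales as $Y_w \propto N^{\beta_w}$ and that the total word count scales as $Y_W \propto N^{\beta_W}$; the symbols $Y_w$, $\beta_w$ and $f_w$ are notation introduced here for illustration. The relative frequency of the word in a city of population $N$ is then

$$f_w(N) = \frac{Y_w(N)}{Y_W(N)} \propto N^{\beta_w - \beta_W}.$$

The relative frequency is therefore independent of city size only if $\beta_w = \beta_W$, it decreases with city size for $\beta_w < \beta_W$, and it increases for $\beta_w > \beta_W$. For instance, an exponent difference of $\beta_w - \beta_W = 0.3$ would correspond to a relative frequency that roughly doubles for a tenfold increase in population, since $10^{0.3} \approx 2$.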
Three example fits from the three scaling regimes are shown in Figure 4.
We sorted the words falling into the "linear" scaling category according to their BIC values, which show the goodness of fit for the fixed β model. The first 50 words in Table 1 according to this ranking are some of the most common words of the English language, apart from some swearwords and abbreviations (e.g. lol) that are typical for Twitter language [11].
These are the words that are most homogeneously present in the text of all urban areas.
Among the first 5000 words according to word rank by occurrence, the most sublinearly and superlinearly scaling words can be seen in Table 2. Their exponent differs significantly from that of the total word count, and their meaning can usually be linked to the exponent range qualitatively. The sublinearly scaling words mostly correspond to weather services reporting
There is a longer tail in the range of superlinearly scaling words than in the sublinear regime in Figure 5. This tail corresponds to Spanish words (gracias 1.41, por 1.40, para 1.39 etc.) that could not be separated from the English text, since the shortness of tweets makes automated language detection very noisy.
Apart from the Spanish words, again some special slang or swearwords (deadass 1.52, thx
Thus, when compared to the slightly nonlinear scaling of the total amount of words, not all words follow the growth homogeneously with this same exponent. Though a significant number remain in the linear or inconclusive range according to the statistical model test, most words are sensitive to city size and exhibit a super- or sublinear scaling. Those that fit the linear model the best correspond to a kind of 'core-Twitter' vocabulary, which has a lot in common with the most common words of the English language, but also shows some Twitter-specific elements. A visible group of words that are amongst the most super- or sublinearly scaling words are related to the abundance or lack of the elements of urban lifestyle (e.g. deer, fitness). Thus, the imprint of the physical environment appears in a quantifiable way in the growth of word occurrences as a function of urban populations. Swearwords and slang, which are quite prevalent in this type of corpus [7,8], appear at both ends of the regime, which suggests that some specific forms of swearing disappear with urbanization, but the share of overall swearing on Twitter grows with city size. The peak consisting of Spanish words at the superlinear end of the exponent distribution marks the stronger presence of the biggest non-English speaking ethnicity in bigger urban areas. This is confirmed by fitting the scaling relationship to the Hispanic or Latino population [54] of the MSA areas (β = 1.31 ± 0.14, see SI), which, despite the large error, is strongly superlinear.
That the relative frequency of some words changes with city size means that the frequency of words versus their rank, Zipf's law, can vary from metropolitan area to metropolitan area.

Zipf's law on Twitter
We found that the exponent of Zipf's law depends on city size, namely that the exponent decreases as text size increases. It means that with the growth of a city, rarer words tend to appear in greater numbers. The values obtained for the Zipf exponent are in line with the theoretical bounds 1.6-2.4 of [55]. In the communication efficiency framework [55,56], decreasing β can be understood as decreased communication efficiency due to the increased number of different tokens, which requires more effort from the reader in the process of understanding. Using more specific words can also be a result of the 140 character limit, which was the maximum length of a tweet at the time of the data collection, and it may be a similar effect to that of texting [57]. This suggests that the carrying medium has a strong impact on the exact values of the parameters of linguistic laws.
The Zipf exponent measured in the overall corpus is also much lower than the β = 2 from the original law [40]. We also do not observe the second power-law regime suggested by [58] and [49]. Because most observations so far hold only for books or corpora that contain longer texts than tweets, our results suggest that the nature of communication, in our case Twitter itself, affects the parameters of linguistic laws.
The decrease in β for bigger cities (or bigger Twitter corpora), suggesting a decreasing number of words with lower frequencies, is thus confirmed. There is evidence that, as languages grow, there is a decreasing marginal need for new words [59]. In this sense, the decelerated extension of the vocabulary in bigger cities can also be regarded as language growth.

Conclusion
In this paper, we investigated the scaling relations in citywise Twitter corpora coming from the Metropolitan and Micropolitan Statistical Areas of the United States. We observed a slightly superlinear scaling with the city population for the total volume of the tweets and words created in a city. When observing the scaling of individual words, we found that a certain core vocabulary follows the scaling relationship of the bulk text, but most words are sensitive to city size, and their frequencies increase with city size at either a higher or a lower rate than the total word volume. At both ends of the spectrum, the meaning of the most superlinearly or most sublinearly scaling words is representative of their exponent. We also examined the increase in the number of words with city size, which has an exponent in the sublinear range. This shows that a decreasing number of new words is introduced in larger Twitter corpora.

Data availability
Owing to Twitter's policy we cannot publicly share the original dataset used in this analy-

Competing interests
The authors declare no competing interests.

Author contributions
E.B. and G.V. designed the study, E.B. and D.K. analyzed the data, E.B., D.K. and G.V. synthesized the results, E.B. and D.K. wrote the manuscript. All authors gave final approval for publication and agree to be held accountable for the work performed therein.

Funding
The authors thank the support of the National Research, Development and Innovation Office of Hungary (grant no. KH125280).

Research Ethics
We were not required to complete an ethical assessment prior to conducting our research.

Animal Ethics
We were not required to complete an ethical assessment prior to conducting our research.

Permission to carry out fieldwork
No permissions were required prior to conducting our research.