In the mood: the dynamics of collective sentiments on Twitter

We study the relationship between the sentiment levels of Twitter users and the evolving network structure that the users created by @-mentioning each other. We use a large dataset of tweets to which we apply three sentiment scoring algorithms, including the open source SentiStrength program. Specifically we make three contributions. Firstly, we find that people who have potentially the largest communication reach (according to a dynamic centrality measure) use sentiment differently than the average user: for example, they use positive sentiment more often and negative sentiment less often. Secondly, we find that when we follow structurally stable Twitter communities over a period of months, their sentiment levels are also stable, and sudden changes in community sentiment from one day to the next can in most cases be traced to external events affecting the community. Thirdly, based on our findings, we create and calibrate a simple agent-based model that is capable of reproducing measures of emotive response comparable with those obtained from our empirical dataset.


Introduction
It was noticed long before the Internet that emotions appear to be contagious [1]. While different mechanisms have been proposed to explain this phenomenon, from complex cognitive processes [2] to automatic mimicry and synchronization of facial, vocal, postural and instrumental expressions with those around us [3], it is not yet clear whether online social media amplifies or inhibits the contagion of emotions. Agent-based modelling has been used to model the dynamics of sentiments in online forums [4,5] and to examine the recent rise of the 15M movement in Spain [6]. It has been shown in [7] that positive and negative affects [8]

Communicability and sentiment
In this section, we investigate how users with the highest potential communication reach tend to use sentiment in their messages. We use dynamic communicability, a centrality measure for evolving networks, to assign broadcast scores to users; these scores are one method of quantifying communication reach that has been investigated in the literature. Our investigation is motivated by the finding, in three small observed social network studies [9], that the individuals with large broadcast scores, in general, had very low levels of negative affect at the beginning of the studies.

Broadcast scores
In this subsection, we briefly describe the measure we used to quantify potential communication reach. The measure, called dynamic communicability [12], is a centrality measure for evolving networks based on Katz centrality [13]. Katz centrality in static networks counts all possible paths from and to each vertex, penalizing progressively longer paths. Let an evolving network be represented by a sequence of adjacency matrices $A_t$, where $t = 1, \ldots, n$ is the time step. Then dynamic communicability counts all the possible time-respecting paths over the evolving network: such a path can make, for example, one hop at time step $t = 1$ and the next hop at time step $t = 3$, but not vice versa. The formal definition we use for the dynamic communicability matrix is

$$Q = (I - \alpha A_1)^{-1}(I - \alpha A_2)^{-1}\cdots(I - \alpha A_n)^{-1},$$

where $I$ is the identity matrix, $\alpha < (\rho(A_t))^{-1}$ is a penalizing factor and $\rho(A_t)$ is the largest eigenvalue of $A_t$. When $\alpha$ is small, short paths in the network are valued highly relative to long paths; when $\alpha$ is larger, long paths are given a relatively larger weight. Here we use one 'snapshot' $A_t$ for each day.

$Q$ is a square matrix, with rows and columns representing vertices or individuals in the network. The $k$th row and column sums each represent a measure of communicability for the vertex (user) $k$: the row sum gives the broadcast index, while the column sum gives the receive index. As the respective names suggest, they measure how well the vertex $k$ is able to broadcast and receive messages over the network.
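To make the computation concrete, here is a minimal sketch of dynamic communicability in Python, assuming the daily snapshots are available as dense NumPy adjacency matrices; the function and variable names are ours, not from any published implementation:

```python
import numpy as np

def dynamic_communicability(daily_adjacency, alpha):
    """Q = (I - aA_1)^-1 (I - aA_2)^-1 ... (I - aA_n)^-1.

    Requires alpha < 1/rho(A_t) for every snapshot A_t, so that each
    factor is invertible and the underlying path-counting series converges.
    """
    n = daily_adjacency[0].shape[0]
    Q = np.eye(n)
    for A in daily_adjacency:
        Q = Q @ np.linalg.inv(np.eye(n) - alpha * A)
    return Q

def broadcast_receive_scores(Q):
    # Row sums give broadcast scores; column sums give receive scores.
    return Q.sum(axis=1), Q.sum(axis=0)
```

For networks of the size studied here one would use sparse matrices and linear solves rather than explicit inverses, but the small dense version shows the structure of the computation.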

Extracting a 'mentions' network to analyse broadcast scores
Using the @-mentions in the tweets we collected, we extracted an evolving social network to use for our investigation. This process was rather involved, for two reasons: (i) Because the snowball sampling data collection process itself took several weeks, and because we collected only the last 200 tweets for each user, the time period for which we had data was not the same for all users. Thus, we needed to balance the desire for an evolving network covering a longer period with the desire to have complete data for as many users as possible for that time period. (ii) We wanted to focus our analysis on ordinary human users of Twitter, so we wanted to screen out outlier users such as celebrities and bots. Celebrity accounts tend to be mentioned by a vast number of users, and some types of bot mechanically mention huge numbers of users. Including these accounts could cause the network structure to become degenerate, with a path of length two existing between most pairs of users via an intermediate celebrity or bot.
We extracted an evolving mentions network for the 7-day period from 9th October to 15th October 2014, consisting of 6 052 615 edges between 285 168 users. These edges came from 4 389 362 tweets (one tweet can mention multiple users, giving rise to more than one edge). Details of the extraction and filtering steps are given in appendix B. We calculated a broadcast score for each user, using a range of values of α: 0.15, 0.3, 0.45, 0.6, 0.75 and 0.9.
The distribution of the (SS) scores for all the tweets in our one-week network is shown in figure 1. The mean sentiment was mildly positive for all three measures: 0.297 for (SS), 0.823 for (MC) and 3.669 for (L). The limitations of the sentiment scoring algorithms explain the high proportion of tweets assigned a zero score (as shown, for example, in figure 1). Some of these are genuinely tweets with a neutral tone, but some are tweets where the algorithm cannot detect any sentiment, so we think of the zero score as indicating 'neutral or not detected' sentiment. At the level of individual tweets, Pearson's correlation coefficients between the three sentiment measures (MC), (SS) and (L) are as follows:

Although the correlations at the individual tweet level are moderate, we will later see in §4.2 that when we aggregate to groups of tweets, such as all the tweets sent within a particular community, the correlations become very strong.

Broadcast scores versus average sentiment
We now compare broadcast scores with users' sentiment use. For this we need user-level sentiment attributes, but the three sentiment scoring algorithms that we used assign a sentiment score to each tweet. Therefore, we aggregated the sentiment scores of each user's outgoing edges within the network, to get the following seven attributes (for each of the three measures); a code sketch follows the list:

- Mean sentiment: the mean of the sentiment scores for the user's outgoing edges.
- Mean absolute sentiment: for (MC) this is the mean of the absolute values of the sentiment scores for the user's outgoing edges; for (SS) and (L), where separate positive and negative components were available, we summed the two components' absolute values for each edge, and then took the mean across the user's outgoing edges.
- Positive sentiment fraction: the fraction of the user's outgoing edges having a sentiment score greater than zero.
- Zero sentiment fraction: the fraction of the user's outgoing edges having a zero sentiment score (indicating a neutral sentiment or that no sentiment could be identified by the scoring system).
- Negative sentiment fraction: the fraction of the user's outgoing edges having a sentiment score less than zero.
- Average positive sentiment strength: the sum of the user's sentiment scores over the outgoing edges with positive scores only, divided by the count of the user's outgoing edges (this count includes all outgoing edges the user sent, not just those with a positive score).
- Average negative sentiment strength: the sum of the absolute values of the user's sentiment scores over the outgoing edges with negative scores only, divided by the count of the user's outgoing edges (this count includes all outgoing edges the user sent, not just those with a negative score).
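The aggregation itself is straightforward; the following sketch computes the seven attributes for the single-score case (as with (MC)), assuming an edge list of (sender, score) pairs, which is our own illustrative format:

```python
from collections import defaultdict

def user_sentiment_attributes(edges):
    """edges: iterable of (sender, sentiment_score) pairs, one per outgoing edge."""
    by_user = defaultdict(list)
    for sender, score in edges:
        by_user[sender].append(score)
    attrs = {}
    for user, scores in by_user.items():
        n = len(scores)
        attrs[user] = {
            "mean_sentiment": sum(scores) / n,
            "mean_abs_sentiment": sum(abs(s) for s in scores) / n,
            "positive_fraction": sum(s > 0 for s in scores) / n,
            "zero_fraction": sum(s == 0 for s in scores) / n,
            "negative_fraction": sum(s < 0 for s in scores) / n,
            # Note: the strengths are divided by the count of ALL outgoing
            # edges, not just those with positive (resp. negative) scores.
            "avg_positive_strength": sum(s for s in scores if s > 0) / n,
            "avg_negative_strength": sum(-s for s in scores if s < 0) / n,
        }
    return attrs
```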
The purpose of the two sentiment strength attributes is to take into account not only how often a user expresses positive or negative sentiment, but also how extreme that sentiment is when it is expressed. Users with no outgoing edges on the first day of our studied 7-day evolving network are at a disadvantage in terms of broadcast scores, because their messages have only six (or fewer) days to propagate through the network, rather than seven. So for the rest of this section we report on just the 153 691 users who tweeted within the network on the first day.
In figures 2 and 3, we compare the means of the above attributes for the top 500, 1000 and 5000 broadcasters with the means over all users, using (SS) and α = 0.75. We see that:

- Top broadcasters send messages with positive sentiment more frequently, and with neutral and negative sentiment less often.
- When we additionally account for the extremity of the sentiment that is used as well as its frequency, top broadcasters use more positive sentiment, and less neutral and negative sentiment.
The differences are most pronounced for the top 500 broadcasters; as we move from the top 500 to the top 1000 and then top 5000, the means for the top broadcasters gradually become closer to the means for the whole population of users. But even for the top 5000 broadcasters there are still substantial differences.
To confirm the statistical significance of this finding, we used randomization testing to estimate (one-sided) p-values, which are shown as annotations in figures 2 and 3. To explain how these are produced, we sketch the calculation of the p-value for one of the attributes: the negative sentiment fraction, as shown in figure 3. The average across all users is 0.142, whereas for the top 500 broadcasters it is only 0.119. We randomly generated 100 000 subsets of size 500 from the 153 691 users and calculated the mean for each subset; from these we estimated how the mean of the attribute is distributed for randomly chosen sets of size 500. From this distribution, we calculate the p-value as the probability that a randomly selected set of 500 users would have a mean equal to 0.119 or more extreme (smaller). This probability is very close to zero (0.00022). Informally, this means we can be very confident that the relationship we have found, that the top broadcasters use negative sentiment less often, has not simply happened 'by chance'; the odds of that are less than 3 in 10 000.

Note that this does not mean that every user in the top 500 has a higher positive sentiment fraction (i.e. uses positive sentiment more frequently) than the average user. Figure 4 shows the distribution of the positive sentiment fraction for the top 500 broadcasters, and for all users, using (SS). The distributions overlap, of course; in particular, there are a few top broadcasters with low positive sentiment fractions. Nevertheless, one can clearly see that the distributions are not the same: the distribution for the top 500 broadcasters is, in general, shifted towards the higher end of the horizontal axis, showing that on average top broadcasters use positive sentiment more often.
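A minimal sketch of this randomization test, with illustrative names:

```python
import random

def randomization_p_value(values, k, observed_mean,
                          trials=100_000, alternative="less"):
    """One-sided p-value: probability that a random size-k subset of `values`
    (a list of per-user attribute values) has a mean at least as extreme as
    observed_mean."""
    hits = 0
    for _ in range(trials):
        sample_mean = sum(random.sample(values, k)) / k
        hits += (sample_mean <= observed_mean) if alternative == "less" \
                else (sample_mean >= observed_mean)
    return hits / trials

# e.g. randomization_p_value(neg_fractions, 500, 0.119) for the example above,
# where neg_fractions holds each user's negative sentiment fraction.
```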
Although we have shown the results for (SS) and α = 0.75, with one exception the same pattern of results was found for all six tested values of α, and also using the other sentiment measures (MC) and (L) (again for all six tested values of α), and the p-values were all less than 0.026. The one exception was that, for (SS) and α = 0.15, the 'negative sentiment strength' and 'negative sentiment fraction' attributes for the top 5000 broadcasters were very nearly equal to the means over all users.

In addition to investigating the sentiment use of the top broadcasters, we looked for general trends relating sentiment use to broadcast rank. Figure 5 plots moving averages of the (SS) sentiment fraction attributes against broadcast rank, using a window of 1000 observations to smooth the noisy data.

Sentiment and evolution of communities on Twitter
In this section, we describe how we identified meaningful communities or 'sub-networks' of Twitter users, and we present the results of our analysis of how these communities evolved over time, including how their sentiment evolved. The existence of communities has been observed in all kinds of real-world networks and identifying them has been the subject of considerable research effort in recent years, much of which can be traced back to a seminal paper of Girvan & Newman [14]. In the vast literature on community detection (e.g. [15]), a community is often taken to be a group of users with two characteristics: (i) The community is densely connected internally, i.e. people within the same community talk to each other a lot. (ii) There are relatively few links crossing from the community to the outside world, i.e. people talk to fellow members of their community more often than they talk to non-members.

How we detected communities and selected a subset for further study
Because we wanted to find communities that would endure over time, we needed to take a longer period of data than the 7 days we analysed in §3. We can imagine online discussions that spring up, rage feverishly for a few days and then largely disappear, and that is not what we wanted to find. Yet, as described in §2, we had only the last 200 tweets per user, so we needed to limit ourselves to a period where the data was most complete. We extracted a mentions network from 22 September 2014 (inclusive) until the end of our snowball-sampled data, 6 November 2014, a period of 46 days. The process for creating the network was the same as for the 7-day network, as described in appendix B. The resulting network consisted of 491 417 users with 31 299 836 edges between them, coming from 22 594 048 tweets. For the first 40 days, the daily average was 776 k edges; for the last 6 days, when data collection was coming to an end, the daily average was only 40 k edges. The network has an average of 63.7 outgoing edges per user, corresponding to 46.0 tweets per user, and each user mentioned an average of 30.9 distinct recipients.

With the dataset chosen, we turn to the question of algorithms. Discovering communities algorithmically requires one first to formulate a precise definition of how 'good' a given division of a social network into communities is. The most widely used formula for quantifying the 'goodness' of a division is called modularity [16]; it compares the fraction of edges that lie within a community with the expected fraction of edges that would lie within the community if the edges were placed at random. Many different versions of modularity have been proposed in the last decade. Because we look at relatively unbalanced divisions (trying to identify small portions of a large network), we considered instead a different measure called conductance [17], which takes values from 0 to 1. Groups of users that are well connected internally but well separated from the rest of the network have values close to 0, and groups with few internal connections but many connections to the rest of the network have values close to 1.
There is also a variant of conductance, called weighted conductance, that takes into account the weights on edges, rather than just their presence or absence. We use the number of messages exchanged between two users (in either direction) as the weight of the edge between them. Thus, weighted conductance depends not only on which users have corresponded with which others but also on how often. If $W_{ij}$ is the weight of the edge from user $i$ to user $j$, $S$ is a community and $\bar{S}$ denotes the remaining users, the weighted conductance of $S$ is

$$\phi(S) = \frac{\sum_{i\in S}\sum_{j\in\bar{S}} W_{ij}}{\min\bigl(a(S),\, a(\bar{S})\bigr)},$$

where $a(S) = \sum_{i\in S}\sum_{j\in V} W_{ij}$ (with $V$ being the set of all vertices, i.e. all users). We used the following three algorithms to identify communities (a code sketch of weighted conductance follows the list):

- The Louvain method on unweighted graphs, described in [18], as implemented in Python in the library [19] and in C++ by Lefebvre and Guillaume.
- The Louvain method on weighted graphs, using the C++ implementation.
- The k-clique-communities method presented in [20], as implemented in the NetworkX Python library.
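Under the reconstruction of the formula given above, weighted conductance can be computed as in the following sketch, assuming an undirected NetworkX graph whose 'weight' attributes hold message counts (NetworkX also ships its own nx.conductance in networkx.algorithms.cuts):

```python
import networkx as nx

def weighted_conductance(G, community):
    S = set(community)
    # Total weight of edges crossing from S to the rest of the network.
    cut = sum(d.get("weight", 1.0)
              for u, v, d in G.edges(data=True) if (u in S) != (v in S))
    # a(S): total weighted degree of the vertices in S (and of the rest).
    vol_S = sum(w for _, w in G.degree(S, weight="weight"))
    vol_rest = sum(w for _, w in G.degree(weight="weight")) - vol_S
    return cut / min(vol_S, vol_rest)
```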
Using these three methods with different parameters, we produced a list of 98 078 candidate communities. For each community we calculated:

- the size of the community (number of nodes);
- the number of internal edges (mentions between users);
- the number of internal edges (mentions) per node (this gives a measure of how much activity there is inside the community);
- the conductance and the weighted conductance of the community within the whole network;
- the mean sentiment of edges within the community, using the (MC) measure;
- whether the community consisted of a single connected component (good candidate communities will of course be connected; however, very infrequently the Louvain method can generate disconnected communities, by removing a 'bridge' node during the iterative refinement of its communities);
- the fraction of internal mentions with non-zero sentiment (some of our candidate communities were composed mainly of users speaking a non-English language, and we used this measure to filter them out; tweets in other languages are likely to be assigned a zero sentiment score, because the sentiment scoring algorithm does not find any English words with which to gauge the sentiment);
- some statistics summarizing the role played in the community by recently registered users; and
- a breakdown of the frequency of participation of users in the community (for each user in the community, we counted how many distinct days they had been active on Twitter in our data, and then calculated the percentage of these days on which they had posted within the candidate community; we calculated the average across all users in the community, and also split the users up into five bins).

Based on the above statistics, we short-listed a subset of communities; for this subset, we manually inspected a sample of the tweets within each community, to assess the topics talked about, and produced a visualization of each community using the program VISONE (http://visone.info/html/about.html).
In the end, we selected 18 communities to monitor and study. Table 1 shows most of the statistics listed above for these 18 communities, in size order. In each numerical column, the highest six values are highlighted in italics and the lowest six values are highlighted in bold (recall that for conductance and weighted conductance, lower values indicate a more tightly knit community). The 'Algorithm' column contains 'L' for the Louvain method, 'W' for the weighted Louvain method and 'K' for the k-clique-communities method. We chose six communities from each algorithm. Table 2 shows the frequency of participation, with communities ranked by the third column, which gives the average user participation. This is expressed as a percentage: the percentage of days on which a user was active on Twitter (in our dataset) that they were also active in the community. The rightmost five columns show, for each community, how the users' participation levels break down into five bins. Bins with disproportionately many users in them (i.e. with values more than 0.2) are highlighted in italics. We can see that, with the exception of community 4 (weddings), every community has at least a 20% 'hard core' of users who are active in the community nearly every day they are active on Twitter.
Once we had selected the communities of interest, we collected a more detailed tweet history for each participating user, as described in §2.

Analysing the endurance of the communities
We analysed how well our communities endured over time. We examined a 28-day period starting on 22 September 2014 (which we will call the 'autumn period') and a 28-day period starting on 2 February 2015 (which we will call the 'spring period'), and compared how many users in each community were active (mentioned or were mentioned by other users) within the community. Would the same users still be tweeting each other in the spring, or would the communities have dissolved over time? Figure 7 shows a log-log plot of the results.
We see that the communities persisted well from autumn to spring. In three of them, communities 14 (human resources), 17 (friends chatting) and 18 (friends chatting), all the original users were still active in the community. These are three out of the four smallest communities. The other 15 communities lost between 6.5% (for community 16, nursing) and 39.3% (for community 7, Islam) of their users, with an average loss of 18.6%. We can see differences in the communities produced by the three algorithms here: the six produced by k-clique-communities lost an average of 3.8% of their users, compared to 16.4% for the Louvain method and 26.3% for weighted Louvain.
We define the user loss factor as the number of users active in the 28-day autumn period divided by the number active in the later 28-day spring period. When the user loss factor is 1, the community has retained all its users; the higher the value, the more users the community has lost. We looked to see whether the conductance, sentiment or size of communities is related to their endurance. In figure 8, one can see that conductance is a predictor of what proportion of users will stop participating in the community, with correlation coefficient 0.42. When conductance is lower (so that the community is more densely connected internally and better separated from the rest of the network), fewer users stopped participating on average.
Similarly, the community sentiment is a predictor of community endurance, as shown in figure 9: the more positive the initial sentiment (measured in the autumn period), the fewer users stopped participating on average. For (SS) (as shown in figure 9) the correlation coefficient is −0.60; for (MC) it is −0.48 and for (L) it is −0.58. On the other hand, community size was not correlated to user loss factor; the correlation coefficient was 0.07.
We noted in §3.2 that the correlations between the three sentiment measures (MC), (SS) and (L) at the individual tweet level were only moderate. The following shows the correlations between the community sentiments produced by the three measures, in the autumn and spring periods:

Dynamics of sentiments in communities
Here, we analyse the changes in sentiment/mood of our communities over time (or the lack thereof, as it generally turns out). Figure 10 plots the mean (SS) sentiment of each community over the autumn period against the mean (SS) sentiment over the spring period. We see that the sentiments persisted very strongly: the correlation between the autumn sentiment and spring sentiment is 0.982. The corresponding correlation under the (MC) measure was 0.982, and under (L) was 0.960. We looked for explanations for the (small) changes in sentiments that did occur. On the vertical axis of figure 11, we show the change in mean sentiment between the autumn period and spring period using (MC); a positive number means that the sentiment became more positive over time. On the horizontal axis, we show the mean sentiment during the autumn period. What we find is that when the sentiment is initially at the negative end of the spectrum, it tends to increase slightly; on the other hand, if the sentiment is initially at the positive end, it tends to decrease slightly.
In fact, the sentiment in 16 of the 18 communities moved slightly towards a moderate (MC) value of 0.4 (which is approximately where the line of best fit cuts the horizontal axis in figure 11). This could be because extreme sentiment in a community is 'whipped up' by external events and then, once those events are over, tends to dissipate naturally with time.
We point out, however, that there is probably also an element of statistical 'regression to the mean' occurring. We did not choose our communities at random: we chose five of them because they were among those with the most extreme sentiment in the autumn period. This makes it more likely for the sentiment in these five communities to become more moderate by the spring period (which it does, in all five cases). This bias is unavoidable when one disproportionately selects communities with extreme sentiment for study. The correlation coefficient in figure 11 is −0.71. The relationship was less apparent using the other sentiment measures, though still present, with corresponding correlations of −0.59 for (SS) and −0.32 for (L). The robustness of the weekly sentiment measures suggests that only a limited amount of data, say two or three weeks' worth, is needed to give a good idea of the sentiment of a Twitter community; if a drastic change in sentiment does occur within a community, this is a rare event and may indicate that something important has happened to or within the community.

Looking at the daily average sentiment in each community, that is, at a higher resolution, more detail becomes evident. Figure 12 shows the daily mean sentiment in community 2 (Indian politics), also for the period 22 September 2014 to 1 March 2015. Large day-to-day variations can be seen, and we have noticed that such abrupt changes can often be traced to real events affecting the community. In figure 12, we have highlighted five dates where the sentiment measures show spikes or troughs. By examining the tweets sent on those dates we identified the significant event that drove each sentiment change:

An agent-based model of sentiments dynamics in communities
It has been discovered time after time that the collective behaviour of populations of interacting individuals is difficult to understand, challenging to predict and sometimes even seemingly paradoxical. In order to be able to predict the likely evolution of sentiment within a community and to explore its dynamics under various change scenarios, such as the departure of particular users or the arrival of a new vocal user, we built an ABM of our Twitter communities. This includes modelling the sentiment of individuals in the network, and how sentiment spreads from one user to another.
The agents in the model represent Twitter users, and they are arranged in a static undirected graph; only pairs of agents connected by an edge are able to exchange messages. The simulation proceeds in discrete time steps; the number of these steps per day is a parameter of the model. At each time step the following things happen:

- Each agent performs an action, which consists of sending a burst of messages to all, some or none of its neighbours, influenced by the agent's current state.
- Each agent evolves into a new state, influenced by the actions of other agents in this step, i.e. influenced by the messages it has received this step.
Specifically, an action by an agent consists of: a subset of neighbours who will be messaged at this time step; for each neighbour messaged, the number of messages sent to them at this time step; and, for each neighbour messaged, a sentiment for the messages sent to them at this time step.

The state of an agent consists of two variables. The first is a real number representing the current sentiment level of the agent, on the same scale as the sentiment scores used for messages. The second is a record of who sent a message to the agent recently: this is the subset of the agent's neighbours who sent the agent a message at the previous time step; these are candidates for the agent to reply to.

In addition to its evolving state, each agent A has a set of constant characteristics that influence its behaviour but do not evolve (see the sketch after this list):

(i) an initiation probability P(init, A), which controls the tendency of the agent to initiate new conversations with other users when it has received no messages recently;
(ii) a reply probability P(reply, A), which controls the tendency of the agent to reply to messages it has received;
(iii) a propagation probability P(prop, A), which controls the tendency of the agent to propagate messages, that is, to message some other user B after being prompted by a message from a different user C in the previous time step;
(iv) a baseline sentiment level S(baseline, A): this is the sentiment level the agent starts off with, and to which it may reset from time to time (as described below); and
(v) a neutral sentiment level S(neutral, A): when the agent receives messages with sentiment higher than this level, the agent's sentiment will be raised, and when the agent receives messages with sentiment lower than this level, the agent's sentiment will be lowered.
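A hedged sketch of this agent structure (the field names are ours, not from the paper's code):

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    # Constant characteristics, fitted from historical data; they do not evolve.
    p_init: float      # P(init, A): tendency to initiate new conversations
    p_reply: float     # P(reply, A): tendency to reply to received messages
    p_prop: float      # P(prop, A): tendency to message B when prompted by C
    s_baseline: float  # S(baseline, A): starting (and occasional reset) sentiment
    s_neutral: float   # S(neutral, A): pivot above/below which sentiment moves
    # Evolving state.
    sentiment: float = 0.0                            # current sentiment level
    recent_senders: set = field(default_factory=set)  # neighbours who messaged
                                                      # us at the previous step
```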
The model also has six global parameters: the number of iterations (discrete time steps) per day, the mean number of messages per burst, a contagion-of-sentiment factor, a sentiment reset probability, a sentiment noise level and a neighbour frequency threshold. The details of the global parameters, of how the agents decide to send messages and of how the agents' sentiments evolve are given in appendix C.

The process of using this model to simulate a real Twitter community is then as follows. First, we construct the graph from the historical data for the community, connecting the users that have exchanged more messages than the neighbour frequency threshold. We set the baseline sentiment S(baseline, A) of each agent A to the mean sentiment of the messages sent by the corresponding user, and we set the neutral sentiment S(neutral, A) of each agent to the mean sentiment of all messages sent in the community. To estimate the initiation probability P(init, A) for each agent A, we split the historical data into windows, with length determined by the number of iterations per day. We count the number of opportunities A had to initiate a conversation (i.e. how many windows there were in which A received no messages), and also how many times out of these A actually initiated a conversation; the sketch below illustrates this estimate. The reply and propagation probabilities P(reply, A) and P(prop, A) are set similarly.
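A sketch of the estimate of P(init, A), assuming the windowed history has been reduced to per-window counts of messages received and sent by A (an illustrative layout):

```python
def estimate_init_probability(received_per_window, sent_per_window):
    """P(init, A) ~ initiations / opportunities, where an opportunity is a
    window in which A received no messages."""
    opportunities = successes = 0
    for received, sent in zip(received_per_window, sent_per_window):
        if received == 0:        # nothing to reply to: a chance to initiate
            opportunities += 1
            successes += sent > 0
    return successes / opportunities if opportunities else 0.0
```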
To perform a simulation run of the model, we set all the agents to their initial state, and then we evolve the system for the required number of steps, recording the messages that were sent for later analysis. The initial state of each agent is that the agent has received no messages to consider replying to, and its current sentiment is equal to its baseline sentiment. The required number of steps is the number of days in the real data multiplied by the number of iterations per day (so that the time period of the simulation matches that of the real data).

Calibration
We now describe how we calibrated our model to our Twitter data. The purpose of the six global parameters is to make our ABM 'tuneable', so that we can fine-tune it to match the behaviour observed in different kinds of online community. Calibrating the model to a particular community means finding the values of the six parameters that maximize the match between the model and the real data, i.e. the parameter values that make the simulation runs of the model most closely resemble the real data. In our case, the specific metrics that we use to compare the simulated data with the real data are: the activity levels (number of messages sent per day) of each individual user, and the day-to-day volatility of this, as well as the sentiment of the whole network, and its day-to-day volatility. Comparing the real data and simulated data in this way is an instance of the method of simulated moments.
We therefore propose the following function ρ to score a particular simulation run (smaller scores mean a better match; a code sketch of ρ appears at the end of this subsection):

$$\rho = \frac{\alpha}{N}\sum_{i=1}^{N}\bigl|C_i - \hat{C}_i\bigr| + \frac{\beta}{N}\sum_{i=1}^{N}\bigl|\mathrm{std}(C_i) - \mathrm{std}(\hat{C}_i)\bigr| + \gamma\,\bigl|E_c - \hat{E}_c\bigr| + \delta\,\bigl|\mathrm{std}(E_c) - \mathrm{std}(\hat{E}_c)\bigr|.$$

Here $N$ is the number of users. We denote by $C_i$ and $\mathrm{std}(C_i)$ the average and standard deviation, respectively, of the number of messages sent each day by user $i$ in the real data, and by $\hat{C}_i$ and $\mathrm{std}(\hat{C}_i)$ the corresponding values in the simulation run. Similarly, $E_c$ and $\mathrm{std}(E_c)$ denote the average and standard deviation of the daily community sentiment, and $\hat{E}_c$ and $\mathrm{std}(\hat{E}_c)$ the corresponding values in the simulation run. The relative sizes of the constants α, β, γ and δ are set to reflect how we prioritize the various aspects of the comparison between the real and simulated data. We used α = 1, β = 0.1, γ = 10 and δ = 100, which means that we put a lot of emphasis on matching the volatility of daily community sentiment, and less emphasis on matching the level of daily community sentiment. Conversely, for the number of messages sent per day by each agent, we prioritize matching the level over matching the volatility.

We chose to model a small community so that we could trace through the simulations, in order to understand them better. We concentrated on modelling community 17 (friends chatting), which has 28 users. We calibrated the model for each of the three sentiment measures (MC), (SS) and (L). Each calibration was performed with an iterative grid search: we used five successive grid searches, each time zooming in on the area of the parameter space that appeared most promising in the previous search. The initial ranges searched for each parameter are given in appendix D. Because the simulation runs are randomized, we performed 50 simulation runs for each combination of parameters tested, taking the mean of the resulting 50 scores as the score for that choice of parameters. The parameters found by the repeated grid search were as follows:

Figure 13 compares the mean daily count of messages sent for each user, in the real data and averaged over 500 simulation runs. As we can see, the match is extremely close. Figure 14 similarly compares the standard deviation (variability) of the daily count of messages sent for each user, in the real data and averaged over 500 simulation runs. The match is less good here, which reflects the fact that when setting the constants α and β in the scoring function, we chose to prioritize matching the means rather than the standard deviations. The sentiment statistics of the real data are matched closely by the simulated data (again averaged over 500 simulation runs), particularly for (MC):

Finally, in figure 15 we plot the initiation probability P(init, A) of each agent against the reply probability P(reply, A); recall that these are set from the historical data of the community. We include this plot to emphasize the lack of correlation between the two. This confirms that users really do appear to play different roles in the community, with some initiating relatively often but not replying much, and others replying readily while rarely initiating.
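The following sketch implements the scoring function ρ as reconstructed above (the use of absolute differences is our assumption), taking per-user daily message counts as a users × days array and daily community sentiment as a days-long array:

```python
import numpy as np

def rho(real_counts, sim_counts, real_sent, sim_sent,
        alpha=1.0, beta=0.1, gamma=10.0, delta=100.0):
    N = real_counts.shape[0]
    # Per-user activity: match the mean level (alpha) and volatility (beta).
    activity_level = np.abs(real_counts.mean(axis=1)
                            - sim_counts.mean(axis=1)).sum() / N
    activity_vol = np.abs(real_counts.std(axis=1)
                          - sim_counts.std(axis=1)).sum() / N
    # Community sentiment: match the level (gamma) and volatility (delta).
    sent_level = abs(real_sent.mean() - sim_sent.mean())
    sent_vol = abs(real_sent.std() - sim_sent.std())
    return (alpha * activity_level + beta * activity_vol
            + gamma * sent_level + delta * sent_vol)
```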

Predicting the effects of introducing a new user
We now consider a scenario where a new user joins the network and becomes the neighbour of any three existing community members that we choose. Which three community members should our new user befriend? For illustration, we explore four possible choices:

(i) befriend the three users with the most positive sentiment;
(ii) befriend the three users with the most negative sentiment;
(iii) befriend the three users with the highest reply probabilities; and
(iv) befriend the three users with the lowest reply probabilities.

For the purposes of this example, we assume that our user will be vocal but with sentiment matched to the prevailing sentiment of the existing community: the new user's initiation (resp. reply, propagation) probability is set to three times the maximum initiation (resp. reply, propagation) probability found in the existing community, and the new user's baseline sentiment level is set to the existing community sentiment level; a sketch of this setup follows. Figures 16, 17, 18 and 19 show how our four choices of neighbours affect four aspects of the community: the activity level, the standard deviation (variability) of the daily activity levels, the community sentiment level and the standard deviation (variability) of the daily community sentiment.
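A sketch of wiring such a new user into the calibrated model, reusing the illustrative Agent class sketched earlier; capping the boosted probabilities at 1 and the choice of S(neutral, A) for the newcomer are our assumptions:

```python
def add_new_user(graph, agents, neighbour_ids, community_sentiment, boost=3.0):
    """agents: dict mapping user id -> Agent; graph: networkx.Graph."""
    newcomer = Agent(
        p_init=min(1.0, boost * max(a.p_init for a in agents.values())),
        p_reply=min(1.0, boost * max(a.p_reply for a in agents.values())),
        p_prop=min(1.0, boost * max(a.p_prop for a in agents.values())),
        s_baseline=community_sentiment,  # matched to the prevailing sentiment
        s_neutral=community_sentiment,   # assumption: neutral level matches too
        sentiment=community_sentiment,
    )
    agents["new_user"] = newcomer
    # Connect the newcomer to the three chosen community members.
    graph.add_edges_from(("new_user", n) for n in neighbour_ids)
    return newcomer
```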

Discussion
Despite the deluge of data on human communication, the dynamics of collective mood is still largely uncharted territory. While different theories of emotion contagion exist in the literature, we are still far from being able to predict the occurrence, intensity and duration of collective compassion, happiness or outrage on Twitter. Here we presented findings from one large Twitter dataset. While we are conscious of some serious limitations of our approach (the lack of representativeness of Twitter users, and the noisy nature of sentiment scores), we believe that our methodology can be generalized to other datasets of human interactions which allow for sentiment scoring.

Looking to wider socio-economic horizons and smart-city opportunities, social media is slowly but steadily becoming an important channel for running policy information and education campaigns on a mass scale. Additionally, it has become an exclusive channel for getting the attention of some socio-demographic groups, especially among the younger population, who consume less and less traditional media such as local newspapers and television.
For these reasons, a data-driven model of collective sentiment captured through social media is one of the most important tools that social data analytics can offer to a city leadership. It allows gauging public opinion on different topics and understanding and predicting the dynamics of that opinion. Most importantly, it can help to uncover public evaluations of local decisions. It also allows one, as mentioned previously, to engage different communities in a conversation and to reach under-represented groups. Our framework can be applied over a wide range of topics: energy, transport, education, tourism, local leadership and so on.
We demonstrated that by using a number of community detection algorithms in combination with sentiment scores, we can identify stable communities of Twitter users. Users within these communities are well connected and send messages to each other frequently compared with how frequently they send messages to users not in the community. The communities and their 'community sentiment' were relatively stable over a time scale of months. More loose-knit communities and communities with more negative sentiment tended to lose more users over time. We find that when the sentiment in a community temporarily shows a large deviation from its usual level, this can typically be traced to a significant identifiable event affecting the community, sometimes an external news event.
We have developed an ABM of online social networks. The model consists of a population of simulated users, each with its own individual characteristics, such as its tendency to initiate new conversations, its tendency to reply when messaged, and its usual sentiment level. The model allows for sentiment contagion. We have demonstrated that this model, when calibrated with the data from a real Twitter community, accurately reproduces the activity levels and sentiment strength of that community. We have shown an example of using the ABM for exploring 'what if...?' scenarios, such as 'What if we encourage a new user to interact with particular users in the community?'. To do this, we fit the parameters of the model to a particular social network and then make the corresponding modifications to the model. By running a large number of simulations on the modified model, we obtain a prediction of the likely effect of the change on the activity levels and sentiment levels of the community.
Ethics. Following the University of Reading's research ethics committee guidelines, because the human data that we analysed is in the public domain, there was no need to obtain ethical approval (https://www.reading.ac.uk/internal/res/ResearchEthics/reas-REwhatdoIneedtodo.aspx).
The curated datasets used for the various analyses described in this article are available at http://dx.doi.org/10.5061/dryad.5302r. These are:

-The 7-day evolving network used for the communicability analysis described in §3.
-The graph used for community detection, as described in §4.1.
-The attributes of tweets within each community that we collected (as described in §2); these data cover the analyses done in §§4.2, 4.3 and 5.
For each tweet in these datasets, we included the following attributes: (i) an anonymized tweet ID, (ii) a timestamp, (iii) who was the sender (an anonymized user ID), (iv) who was mentioned in the tweet (anonymized user IDs); and (v) sentiment scores for the three measures (MC), (SS) and (L).

Appendix B. Extracting a mentions network
To get the best results, we have chosen for analysis the period with the largest possible number of users active in our data. Figure 20 shows the number of users active in the data each day for the period from 22 April 2014 until the end of the snowball-sampled data. Figure 21 shows the same thing but 'zoomed in' to a restricted range of dates. We note that the number of users oscillates between weekdays and weekends, and the weekly total gradually increases and peaks on 15 October 2014. Then the number of users falls off quite rapidly. The shape of figure 20 is largely due to the fact that only the last 200 tweets (from the time of the API request) per user were collected. Thus, those users who tweet frequently do not show up in the first half of the chart as their earlier tweets have not been collected.
We used the Twitter dataset to create an evolving network, where the set of vertices is fixed and the edges between them can change in each time step (a day in our case). In order to choose a fixed set of vertices, we have chosen the week from 9 October to 15 October 2014 inclusive, which is the week with the highest activity measured by the number of tweets.
We then filtered the data using several criteria, in order to focus on 'regular' human users. A number of classes of user with unusual behaviour were filtered out, as they would skew the results of our analyses (and threaten to make the network structure degenerate, as described in §3.2); a code sketch of the resulting filter follows the list:

- Users with a very high tweeting frequency. If a user posts hundreds of tweets in a few hours, then these messages might have been automatically generated. This practice is followed by many companies and organizations for advertising purposes, but the messages are not genuinely representative of human behaviour. Figure 22 shows the number of days (rounded up to the next integer) between the first tweet and the last tweet in our snowball-sampled data, just for those users who had posted at least 200 tweets since account creation. When setting a threshold on tweeting frequency to exclude users, we should of course filter out only a small minority of users. We observe in figure 22 a natural 'gap' in the data at the value 1, with only 48 users appearing in this bin. We select a day difference of 1, equivalently a tweeting frequency of 200 tweets per day, as our threshold, excluding 1153 users with a higher tweeting frequency than this.
- Users who mention themselves very frequently, who may also be bots. We used a threshold of 0.5 for the ratio of the number of self-mentions to the number of all mentions made by a user, chosen because the number of users with a self-mention ratio larger than R drops off rapidly as R increases above 0.5. The 0.5 threshold also seems a reasonable choice because it identifies as outliers those users who mention themselves more often than they mention all other users combined.
- Users with a high ratio of in-degree to out-degree. Examples of these users are celebrities or well-known services which attract a high number of mentions relative to their own activity. Looking at figure 23, we observe that the number of users decreases smoothly as the in-degree to out-degree ratio increases. Since there is no value beyond which the number of users drastically decreases, there is no clear choice of threshold. We set the threshold at 50, meaning that we treat as outliers, and exclude, users with an in-out ratio greater than 50. In other words, we assume that users who are mentioned 50 times more often than they mention others are celebrities, politicians or big organizations that skew the network and should be excluded. Indeed, among the users with an exceptionally high ratio one can find TheEconomist, UberFacts, MayorofLondon, amandabynes, NatGeo, HillaryClinton, Ed_Miliband, BBCPanorama, David_Cameron, JunckerEU, BillGates and YouTube.
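A hedged sketch of the combined filter; the per-user record fields are illustrative, and the thresholds mirror the text:

```python
def is_regular_user(u):
    """u: dict of per-user statistics computed over the chosen week."""
    if u["tweets_per_day"] > 200:        # very high frequency: likely automated
        return False
    if u["self_mentions"] > 0.5 * u["total_mentions"]:  # self-mentioning bot
        return False
    # Celebrities/brands: mentioned far more often than they mention others.
    if u["in_degree"] > 50 * max(u["out_degree"], 1):
        return False
    return True
```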
After these filtering steps, 304 349 users remained. We wanted our evolving network to reflect users' conversations, rather than one-way messaging, so we performed one more filtering step. We formed an undirected network on the remaining users by using only reciprocated mentions; this means that we put an edge between users A and B just when A had mentioned B sometime during the chosen week and B had also mentioned A during the chosen week. Then we found the largest connected component of this graph, which contained 285 168 users (i.e. 94% of the 304 349 users). We took these 285 168 users as our final set of nodes; they form a 'proper' social network in the sense that there is a path of reciprocal mentions connecting any pair of users.
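A sketch of this final node-selection step, assuming the week's mentions are available as directed (mentioner, mentioned) pairs:

```python
import networkx as nx

def reciprocal_core(mention_pairs):
    pairs = set(mention_pairs)
    G = nx.Graph()
    # Keep an undirected edge only where both directions occurred.
    G.add_edges_from((a, b) for (a, b) in pairs if (b, a) in pairs)
    # The final node set is the largest connected component.
    return max(nx.connected_components(G), key=len)
```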
We emphasize that the reciprocal mentions as undirected edges were only used for choosing the final node set; the seven 1-day snapshots that formed the evolving network we studied included all the mentions between the chosen users, even unreciprocated ones.