Interest communities and flow roles in directed networks: the Twitter network of the UK riots

Directionality is a crucial ingredient in many complex networks in which information, energy or influence are transmitted. In such directed networks, analysing flows (and not only the strength of connections) is crucial to reveal important features of the network that might go undetected if the orientation of connections is ignored. We showcase here a flow-based approach for community detection through the study of the network of the most influential Twitter users during the 2011 riots in England. Firstly, we use directed Markov Stability to extract descriptions of the network at different levels of coarseness in terms of interest communities, i.e. groups of nodes within which flows of information are contained and reinforced. Such interest communities reveal user groupings according to location, profession, employer and topic. The study of flows also allows us to generate an interest distance, which affords a personalized view of the attention in the network as viewed from the vantage point of any given user. Secondly, we analyse the profiles of incoming and outgoing long-range flows with a combined approach of role-based similarity and the novel relaxed minimum spanning tree algorithm to reveal that the users in the network can be classified into five roles. These flow roles go beyond the standard leader/follower dichotomy and differ from classifications based on regular/structural equivalence. We then show that the interest communities fall into distinct informational organigrams characterized by a different mix of user roles reflecting the quality of dialogue within them. Our generic framework can be used to provide insight into how flows are generated, distributed, preserved and consumed in directed networks.

Figure S1: Some properties of the UK riots' influential Twitter users network. A: A directed network is created by intersecting each user's friends with the user list. The interest (or attention) is directed at friends, and information travels in the opposite direction (from the friend to the follower). B: Cumulative retweet distribution. The dotted line is the fit $\text{retweets}^{-\alpha}$, where $\alpha = 2.12$. C: Cumulative in-degree (left, in red) and out-degree (right, in blue) distributions. The dashed lines show fits of the data to $e^{-\lambda_{\text{in}} k_{\text{in}}}$ and $e^{-\lambda_{\text{out}} k_{\text{out}}}$, with $\lambda_{\text{in}} = 0.0313$ and $\lambda_{\text{out}} = 0.0312$.
The remaining 86 nodes are either completely disconnected (i.e., they neither follow nor are followed by anyone on the list), or their accounts have since been discontinued. All the subsequent work in this paper uses this 914-node network.

Some statistics of the network
The distribution of the number of retweets of the members of the list (Fig. S1B) is compatible with a power law with exponent α ∼ 2.12 (p = 0.75, using the criterion in Ref. [2]), which is consistent with previous analyses of Twitter data sets [3]. The cumulative in- and out-degree distributions of the connected component are shown in Fig. S1C. Although both distributions appear exponential with parameters $\lambda_{\text{in}} \approx 0.0313$ and $\lambda_{\text{out}} \approx 0.0312$, respectively, a Kolmogorov-Smirnov statistic does not provide statistically significant support for this hypothesis (p < 0.1). The distributions of $k_{\text{in}}$ and $k_{\text{out}}$ in this data set are less skewed than in other published results [4]. Note also that the in- and out-degrees are similarly fat-tailed, in contrast with other studies in which $k_{\text{in}}$ was found to be more skewed than $k_{\text{out}}$ [5]. These differences between our distributions and other published reports may be due to the fact that our relatively small user list corresponds to a subset of users with a high number of retweets in The Guardian's riot tweet database, and may not be representative of the wider network of all Twitter users.

Interest communities at different levels of resolution in the Twitter network of the 2011 UK riots
As a relevant application of current interest, we use the Markov Stability framework to analyse a directed graph derived from a social network: the Twitter network derived from The Guardian list of influential users during the UK riots. In Fig. 1A of the Main Text we show the number of communities, and in Fig. S2 we show the variation of information (VI) obtained after optimising stability for Markov times between $10^{-3}$ and $10^{1.5}$. Below we provide some examples of communities found at different Markov times (marked with red circles on the VI curve in Fig. S2). These examples were chosen because their communities are relatively robust and because they showcase the different types of communities found at different levels of resolution.
The Supplementary Spreadsheet lists all the communities found at all Markov times. The convention we have followed to name the communities is T[Markov time]-C[community number].
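As an illustration of the quantity being optimised, the following minimal sketch evaluates the stability of a given partition of a small directed graph. It assumes the usual autocovariance form of directed Markov Stability, trace of $H^T [\Pi P(t) - \pi^T \pi] H$, for a continuous-time random walk with teleportation. The teleportation value of 0.85, the function names and the toy graph are assumptions made for the example, and the sketch only scores partitions; it does not perform the optimisation used in the analysis.

```python
import numpy as np


def teleporting_walk(A, tau=0.85):
    """Row-stochastic transition matrix with uniform teleportation.

    tau is an assumed teleportation parameter; nodes with zero
    out-degree teleport with probability 1.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    dangling = out_deg == 0
    walk = A / np.where(dangling, 1.0, out_deg)[:, None]
    jump = np.where(dangling, 1.0, 1.0 - tau)
    return tau * walk + jump[:, None] * np.ones((n, n)) / n


def stationary_distribution(M, iters=10_000, tol=1e-12):
    """Left stationary vector of M by power iteration."""
    pi = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        nxt = pi @ M
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi


def expm_taylor(Q, terms=20, squarings=10):
    """Matrix exponential by scaling and squaring a truncated Taylor series."""
    X = Q / (2 ** squarings)
    result = np.eye(Q.shape[0])
    term = np.eye(Q.shape[0])
    for k in range(1, terms + 1):
        term = term @ X / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def markov_stability(A, labels, t, tau=0.85):
    """Stability of the partition `labels` at Markov time t."""
    M = teleporting_walk(A, tau)
    pi = stationary_distribution(M)
    P_t = expm_taylor(-t * (np.eye(len(pi)) - M))
    labels = np.asarray(labels)
    communities = np.unique(labels)
    # H[i, c] = 1 if node i belongs to community c.
    H = (labels[:, None] == communities[None, :]).astype(float)
    R = H.T @ (np.diag(pi) @ P_t - np.outer(pi, pi)) @ H
    return float(np.trace(R))


# Toy graph: two directed 3-cycles joined by the edges 2->3 and 5->0.
A_demo = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3), (5, 0)]:
    A_demo[u, v] = 1.0

by_cycle = markov_stability(A_demo, [0, 0, 0, 1, 1, 1], t=1.0)
alternating = markov_stability(A_demo, [0, 1, 0, 1, 0, 1], t=1.0)
```

In this toy case the partition that follows the two cycles retains more flow than the alternating one, which is the sense in which interest communities contain and reinforce flows.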

Communities at high level of resolution
At short Markov times (t = 0.15), we find a revealing partition with 149 communities. Though granular, the community structure of the partition shows interesting features. Some of the communities in this partition correspond to a precise geographical location in England or within a city (Fig. S3). For example, there are communities from Hackney (where the riots began) and Croydon. The latter includes the account of London's Mayor (mayoroflondon). Other geographically homogeneous communities are from the Midlands, Liverpool, and Manchester (T0.15-C10, T0.15-C11 and T0.15-C28).
Coexisting with those geographical communities at this level of resolution, we also find journalists and media outlets in other communities defined by their affiliation. Figure S4 shows communities from The Daily Telegraph, The Independent, ITV, Sky and the BBC. Other interesting communities in this partition are, for example, formed by UK activists (T0.15-C0), and another formed by the Anonymous Internet activist group (T0.15-C12), see the Supplemental Material.

Communities at medium level of resolution
As the Markov time grows, the communities become coarser. In Fig. S5, we show the partition of the network at Markov time t = 0.5, when there are 48 communities in the riot network. The largest community in this partition (T0.5-C0), with 57 members, is the 'Sports' community. Other examples of communities found at this level of resolution include a community of 'Comedians, writers and presenters'; a 'Parody' community; a community of 'Music journalists and artists'; a community of 'Police forces and crime journalists'; a 'London' community; a community of 'Activists, students and journalists with a focus on the Middle East'; and an 'Online media' community of (mostly) Internet media outlets, individuals and companies.

Communities at a coarser level of resolution
When the Markov time is longer (t = 1.3) we find a partition into 15 communities in the network. In Fig. 6 of the Main Text we present a coarse-grained overview of the communities, their relationships, and their word cloud self-descriptions. This partition, its communities, and the global view of the network it provides are discussed at length in the Main Text.

The change in the community structure when directionality is ignored
As discussed in the Main Text, the community structure detected is significantly different if the directionality of the edges is ignored. This phenomenon was illustrated with two examples: the BBC community (in Fig. 2 of the Main Text, which was heavily affected when directionality was neglected) and the Monbiot community (in Fig. 3 of the Main Text, which remained relatively unaffected).
To complement this view, we show here the comparison of the communities found at all Markov times for the directed riots network and two undirected versions of it: an undirected network obtained by simply ignoring the direction of the edges, and a symmetrised version of the network whose adjacency matrix is $A + A^T$ (i.e., reciprocal edges have twice the weight of nonreciprocal ones). In Fig. S6A-B, we show that the partitions found in the directed network are different from those found in the undirected and symmetrised versions across all Markov times, whereas the partitions of the two undirected versions are similar to each other (Fig. S6C). The differences between the directed and undirected versions are large at small Markov times and, as expected, they become smaller as the Markov time grows (i.e., at lower resolutions). Hence the most prominent effect of ignoring directionality in this network is to blur the fine structure of the flow communities.
For ease of comparison, the full sets of partitions for both the directed and undirected graphs at all Markov times are given in the Supplementary Spreadsheet.
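The two undirected versions of the network, and the comparison of partitions through the variation of information, can be sketched as follows. This is a minimal illustration rather than the code used for Fig. S6: the function names are hypothetical, the edge list is assumed to contain no self-loops, and the VI is returned unnormalised and in nats (a normalisation may have been applied in the figures).

```python
from collections import Counter
from math import log


def undirected_versions(edges):
    """Return (binary, symmetrised) weights keyed by unordered node pairs.

    binary: weight 1 for any pair linked in at least one direction.
    symmetrised: the entry of A + A^T, so reciprocal pairs get weight 2.
    Assumes there are no self-loops in `edges`.
    """
    symmetrised = Counter()
    for u, v in set(edges):
        symmetrised[frozenset((u, v))] += 1
    binary = {pair: 1 for pair in symmetrised}
    return binary, dict(symmetrised)


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A) for two labelings of the same nodes."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi


# Hypothetical follower edges: a and b follow each other, a also follows c.
binary, symmetrised = undirected_versions([("a", "b"), ("b", "a"), ("a", "c")])
```

A VI of zero means the two partitions are identical up to relabelling; larger values indicate that knowing one partition says less about the other.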

Comparison with other community detection methods: Infomap
As a comparison with our directed Markov Stability methodology, we analysed the community structure of the directed Twitter network using Infomap [7,8], as downloaded from http://www.mapequation.org/. Infomap is a well-known method for community detection in directed and undirected networks based on information compression, which has been shown to perform well in some benchmarks with clique-like communities.
In this case, Infomap obtains partitions only at two levels of resolution. The finer of these two partitions consists of 342 communities: 318 communities contain only one node; 50 communities contain only two nodes; and the largest community contains 60 nodes. The coarser of these two partitions has 60 communities: 26 communities still have only one node while the largest community has 342 nodes, i.e., more than a third of the nodes in the network. Hence, in this case, the communities obtained by Infomap lead to an over-partitioned description of our Twitter network.
We provide all the partitions obtained with Infomap in the Supplemental Spreadsheet. These Infomap results are consistent with the analysis presented in Refs. [9,10], where it is shown that Infomap is a one-step method, which is highly efficient for the detection of clique-like communities but which may lead to over-partitioning when the communities are not clique-like. In contrast, Markov Stability makes use of the full transients (i.e. the complete dynamics with paths of all lengths, as shown in Fig. S8A and Fig. 1 of the Main Text) to unfold the community structure across scales. The tendency of Infomap to over-partition some networks is signalled by a large compression gap [9] with respect to the optimal compression. In this particular case, the code-length of the Infomap partition is 8.3977 bits (after 500 trials) and the compression gap is 0.3647, which is more than three times larger than is achieved for clique-like communities [9]. Hence, although in other instances and benchmarks Infomap performs very well, for this network it does not produce the nuanced description at different levels of resolution that our method delivers through the full use of flow transients [9,10]. If the direction of the edges is ignored, this over-partitioning effect of Infomap is even more striking. For the undirected graph, Infomap obtains two partitions (code-length: 9.22 bits, compression gap: 0.455), with 800 communities in the finer partition and 6 communities in the coarser partition, one of which contains 894 nodes.

Figure S7: Example word clouds created from the 50 most frequently used words in two communities from Fig. S5. A: Word cloud from the biographies of an activist community (T0.5-C5). More frequent words appear larger than less frequent ones (function words ignored). B: Word cloud of the Parody account community (T0.5-C11), whose members are linked by their interests but do not use a common vocabulary to describe themselves collectively. In this case the word frequencies do not help establish the nature of the community.

Self-descriptions from Twitter biographies
Interpreting the communities found in the analysis by looking through all their members is mostly impractical. As an aid to assess the quality and intrinsic content of the communities found, we tap into the information contained in the mini-biographies provided by the users. Our premise is that we can learn valuable information about a community from the short texts that the members write about themselves. To do this in a more systematic manner, we collect all the biographies of a community in a single file (removing URLs, emails, numbers, function words, and other nonstandard characters) and count the occurrences of all words. We then compile the most frequently used words as an aid in the characterisation of each community and construct word cloud visualisations [6] of the word frequencies of the 50 most used words in each community. The word frequencies (and their word-cloud representation) act as a 'self-description' of each community. In some cases, the self-descriptions of the users in a community share highly indicative words, which appear prominently in the word clouds. Figure S5 shows that the word clouds of several communities at t = 0.5 represent their character well. For example, the members of the 'Middle-East activism' community describe themselves using a consistent vocabulary representing their interests, as shown in Fig. S7A.
On the other hand, other communities are more heterogeneous in the self-descriptions of their members. For instance, the members of the 'Parody' community do not use a common vocabulary to describe themselves (Fig. S7B), so in this case the word frequencies do not help establish the nature of the community. Indeed, the members of this group do not share a common thematic content but are instead linked by their acting as ironic reflections of a variety of celebrities. This is reflected in their identification as an interest community. If we analyse the membership of the community carefully, we see that it contains many parody accounts (e.g., parodies of the Queens of England and the Netherlands, and of Star Wars and Lord of the Rings characters).
In general, given the small amount of text available in the self-descriptions, the word frequencies and word-clouds must be used judiciously. However, they can be of great aid in providing a simple visual interpretation of the communities, as shown in the figures in this Supplemental Information and the Main Text.
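The word-counting step behind these self-descriptions can be sketched in a few lines. This is only an illustration of the procedure described above: the stop list, the regular expressions and the function name are assumptions, not the ones used to produce the word clouds in the figures.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not the list used in the study).
FUNCTION_WORDS = {
    "a", "an", "and", "the", "of", "for", "in", "on",
    "to", "at", "is", "my", "with", "by",
}

URL_OR_EMAIL = re.compile(r"https?://\S+|www\.\S+|\S+@\S+")
WORD = re.compile(r"[a-z]+")


def community_word_counts(biographies, top=50):
    """Pool the biographies of one community and return its most used words."""
    counts = Counter()
    for bio in biographies:
        # Drop URLs and emails first, then keep alphabetic tokens only,
        # which also discards numbers and nonstandard characters.
        text = URL_OR_EMAIL.sub(" ", bio.lower())
        counts.update(
            w for w in WORD.findall(text)
            if len(w) > 1 and w not in FUNCTION_WORDS
        )
    return counts.most_common(top)


# Hypothetical biographies, for illustration only.
demo_bios = [
    "Journalist at http://example.com covering the Middle East",
    "Activist and journalist, email me: me@example.org",
    "Student activist since 2011",
]
demo_counts = dict(community_word_counts(demo_bios))
```

The resulting (word, count) pairs are what a word-cloud tool would take as input, with font size proportional to the count.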

Using the RMST-RBS similarity graph to uncover roles in the network
The RMST-RBS graph thus constructed is a new graph (undirected and unweighted), which captures geometrically how similar two nodes are, based on their vectors of incoming and outgoing flow profiles. Clearly, this role similarity graph is distinct from the original graph from which it is derived: two nodes are connected in the role similarity graph only if they have similar profiles of incoming and outgoing paths in the Twitter network, regardless of whether they are neighbours in the original network. Figure 5 in the Main Text shows the role similarity graph constructed from the Twitter riot network. We apply a graph-theoretical community detection method (in this case undirected Markov Stability) to the role similarity graph to find whether there are any significant groups of nodes with similar flow profiles, without imposing their number or type a priori. Figure S8A shows that the Markov Stability analysis of the role similarity graph finds a very robust partition into five communities of approximately similar sizes at t = 97.712. At this Markov time, the variation of information is 0, which means that all 100 runs of the community detection algorithm returned exactly the same partition. These five communities in the role similarity graph correspond to classes (or types) of roles in the network. In the Supplemental Material we provide the full classification of nodes according to these roles.
To interpret the five clusters found by our analysis, we examine a posteriori different characteristics of their members. Figure S8B shows the mean in- and out-degree in the original Twitter network of the nodes in each class found. Two of the groups found have higher mean in-degree (i.e., Twitter followers) than out-degree, while for the other three groups the reverse is true (i.e., they follow more users than they are followed by). If we coarse-grain the original Twitter network by lumping together all the nodes with the same role (as in Fig. 5 of the Main Text), a striking pattern of connectivity is revealed, with some classes of nodes mostly receiving attention (sinks of interest or sources of information), other classes mostly behaving as sources of interest (recipients of information), and other classes in between. This leads us to name the role classes as: references, engaged leaders, mediators, diversified listeners, and listeners (see Main Text for more details). Finally, the PageRank distributions shown in Fig. S8C also help illustrate the differences between leader, follower, and mediator roles, but PageRank alone would not be able to discriminate between the roles obtained from our analysis of the role similarity graph.
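The a posteriori degree summary per role class can be sketched as below. The function name, the edge convention (an edge u -> v means that u follows v, so the in-degree counts followers) and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def mean_degrees_by_role(edges, role):
    """Return {role: (mean in-degree, mean out-degree)}.

    edges: iterable of (u, v) pairs meaning "u follows v".
    role:  dict mapping every node to its role label.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1   # u follows one more user
        in_deg[v] += 1    # v gains one follower

    members = defaultdict(list)
    for node, label in role.items():
        members[label].append(node)

    return {
        label: (
            sum(in_deg[n] for n in nodes) / len(nodes),
            sum(out_deg[n] for n in nodes) / len(nodes),
        )
        for label, nodes in members.items()
    }


# Hypothetical example: two listeners follow one reference account.
demo_edges = [("l1", "ref"), ("l2", "ref"), ("l1", "l2")]
demo_roles = {"ref": "reference", "l1": "listener", "l2": "listener"}
demo_means = mean_degrees_by_role(demo_edges, demo_roles)
```

A class whose mean in-degree exceeds its mean out-degree is mostly followed, and the reverse indicates a class that mostly follows.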

Comparison to other classic notions of roles in social network analysis
The notion of node roles in graphs has been studied from different perspectives, especially in social network analysis. A classic example is structural equivalence (SE), in which two nodes have the same role if they share the same neighbours [11,12], i.e., if they are swapped the network remains the same. One can compute how similar two nodes are (in the SE sense) based on the number of common neighbours. SE roles are thus based on computing the number of common immediate neighbours and bear no resemblance to the flow roles detected via the RMST-RBS approach. In particular, our approach allows nodes with no common neighbours to have the same role, counter to the SE definition.
Another classic notion of role in social networks stems from the theory of regular equivalence (RE). RE uses node colorations (or labellings) to find groups of nodes with the same role [13,14,15]. Suppose u is a node in the network with in-neighbours $N_i(u) = \{j : A_{j,u} = 1\}$ and out-neighbours $N_o(u) = \{j : A_{u,j} = 1\}$. The colour of u is $C(u)$, and the colours of its in- and out-neighbourhoods are $C(N_i(u))$ and $C(N_o(u))$, respectively. A coloration is said to be (exactly) regular [14] if for any two nodes u and v

$$C(u) = C(v) \;\Longrightarrow\; C(N_i(u)) = C(N_i(v)) \quad\text{and}\quad C(N_o(u)) = C(N_o(v)).$$

In a regular (or approximately regular) coloration of a network, nodes with the same colour are said to have the same role [15,16]. In its strict sense, the RE definition of role is combinatorial (and thus lacks robustness in many real-world networks). Furthermore, it is based only on the coloration of the immediate neighbourhoods of each node. Hence it leads to a very different classification of roles from that obtained with the RMST-RBS algorithm.
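To make the combinatorial nature of exact regularity concrete, the following sketch checks whether a given coloration is regular. It assumes the usual reading of the definition, namely that nodes with the same colour must see the same set of colours among their in-neighbours and the same set among their out-neighbours; the function name and the toy graphs are hypothetical.

```python
def is_regular_coloration(edges, colour):
    """Check exact regularity of a coloration of a directed graph.

    edges:  iterable of (u, v) pairs for directed edges u -> v.
    colour: dict mapping every node to its colour.
    """
    in_colours = {u: set() for u in colour}
    out_colours = {u: set() for u in colour}
    for u, v in edges:
        out_colours[u].add(colour[v])
        in_colours[v].add(colour[u])

    # All nodes of one colour must share the same pair of colour sets.
    signature = {}
    for u, c in colour.items():
        sig = (frozenset(in_colours[u]), frozenset(out_colours[u]))
        if signature.setdefault(c, sig) != sig:
            return False
    return True


# One node pointing at two leaves: colouring both leaves alike is regular.
fan = [("a", "b"), ("a", "c")]
fan_ok = is_regular_coloration(fan, {"a": 0, "b": 1, "c": 1})

# A chain a -> b -> c: b and c cannot share a colour, since only b has
# an out-neighbour.
chain = [("a", "b"), ("b", "c")]
chain_ok = is_regular_coloration(chain, {"a": 0, "b": 1, "c": 1})
```

Adding or removing a single edge can flip the result of such a check, which is the lack of robustness noted above.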
We have obtained the roles of the nodes in the riots network using RE models, and compared them to the RMST-RBS roles obtained above. To obtain the RE classes we use two well-known algorithms: REGE [17], which obtains similarities between the nodes based on RE, and EXCATRE [15], which produces a sequence of regular (or approximately regular) colorations. We have included the EXCATRE and REGE partitions in the Supplemental Spreadsheet.
The EXCATRE algorithm applied to the riots network finds only one non-trivial coloration with 734 roles. (Two trivial colorations are also obtained: all nodes with the same colour and each node with its own colour.) On closer investigation, the roles identified by EXCATRE correspond trivially to nodes with identical in- and out-degree, and hence it corresponds to considering only immediate neighbourhoods in the graph (i.e., it would be found by RMST-RBS setting both $\alpha$ and $K_{\max}$ equal to 1 so that only paths of length one are considered). In our analysis above, we used $\alpha = 0.95$ (which converges at $K_{\max} = 133$), and the roles obtained by RMST-RBS incorporate global flow information from the graph.
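To make the role of $\alpha$ and $K_{\max}$ concrete, the sketch below builds flow-profile vectors in the spirit of role-based similarity: for each node, the scaled numbers of incoming and outgoing paths of lengths 1 to $K_{\max}$, compared through cosine similarity. The scaling $\beta = \alpha / \lambda_1$ (with $\lambda_1$ the largest eigenvalue modulus of A), the fixed `k_max`, the function name and the toy graph are assumptions for illustration; the RMST step that turns the similarities into a graph is not shown.

```python
import numpy as np


def rbs_similarity(A, alpha=0.95, k_max=10):
    """Cosine similarity between in/out path-count profiles of the nodes."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    lam = float(np.max(np.abs(np.linalg.eigvals(A))))
    if lam < 1e-12:
        raise ValueError("graph has no cycles; the scaling is undefined here")
    beta = alpha / lam

    columns = []
    for M in (A.T, A):            # A.T counts incoming paths, A outgoing paths
        v = np.ones(n)
        for _ in range(k_max):
            v = beta * (M @ v)    # scaled number of paths one step longer
            columns.append(v)
    X = np.column_stack(columns)  # one row of 2 * k_max features per node

    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0       # isolated nodes keep an all-zero profile
    Xn = X / norms
    return Xn @ Xn.T


# Toy graph: nodes 0 and 1 both follow 2, 2 follows 3, 3 follows 0 and 1.
A_roles = np.zeros((4, 4))
for u, v in [(0, 2), (1, 2), (2, 3), (3, 0), (3, 1)]:
    A_roles[u, v] = 1.0
Y = rbs_similarity(A_roles, alpha=0.95, k_max=3)
```

With `k_max=1` the profile reduces to the in- and out-degree of each node, which is the limit mentioned above for EXCATRE; longer paths let nodes that are far apart in the graph obtain similar profiles.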
The REGE algorithm iteratively constructs a similarity matrix between nodes based on the similarities between their neighbours. The similarity matrix is then clustered using hierarchical clustering techniques to obtain roles [16]. We apply the REGE algorithm to the riots network (which converges after 6 iterations) and obtain five clusters (or roles) with 806, 59, 46, 2 and 1 nodes each. These roles are not informative: the cluster with 59 nodes contains all the source nodes (in-degree 0); the cluster with 46 nodes contains all the sink nodes (out-degree 0); while almost all the nodes in the network fall in the cluster with 806 nodes, which lies between them. The two small clusters with 2 and 1 nodes lie in slightly different positions with respect to the sink and source clusters. As with EXCATRE, these roles are the result of an analysis based on immediate neighbourhoods (in- and out-paths of length one).
Although RE-based methods and RMST-RBS both attempt to find roles in networks, their conceptual foundations are different. Both approaches aim to identify the 'types' of nodes that exist in the network (allowing for the possibility that two nodes on opposite ends of the network and with no common neighbours could be of the same type), but they use different information to do so. RE-based methods analyse the similarities between neighbourhoods under colorations, while RMST-RBS identifies similarities in the transient pattern of long-range flows through each node. Because of the use of flows at all scales in the network, from local to global neighbourhoods, the RMST-RBS method provides a more balanced classification into node classes. In addition, RMST-RBS is less sensitive to small changes in connectivity, and thus more robust for the analysis of realistic networks. For example, if we create an edge from a node in the 806-node REGE cluster to the 59-node cluster of sources (e.g., when a Twitter user decides to 'follow' another user), this node would change roles immediately. Hence in this specific context, an RE-based analysis of the network is not robust.

Integrating interest communities and roles: informational organigrams
Our two analyses (interest communities and user roles) can be brought together to classify the informational organisation of communities, as given by the different mix of roles within each community. In Fig. 6 of the Main Text we show that the 15 communities obtained at t = 1.3 can be broadly classified into four organigrams, which go from a purely broadcast community (of references and listeners) to other communities that involve dialogue between engaged leaders, mediators and diversified listeners. The four organigrams were found by using the mix of roles of each community (a five-dimensional vector containing the proportion of nodes in the community that belong to each role class) and performing a simple k-means clustering on the communities. In other words, each community can be represented by a point inside the unit cube in $\mathbb{R}^5$, and with k-means we identify the clusters of communities whose role mixes are similar.
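The role-mix vectors and their clustering can be sketched as follows. This is a minimal illustration: the k-means details (farthest-first initialisation, Lloyd iterations) and the toy communities are assumptions, since only "a simple k-means" is specified above.

```python
from collections import Counter

ROLES = ["references", "engaged leaders", "mediators",
         "diversified listeners", "listeners"]


def role_mix(member_roles):
    """Proportion of a community's members in each of the five role classes."""
    counts = Counter(member_roles)
    return [counts[r] / len(member_roles) for r in ROLES]


def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=100):
    """Plain Lloyd iterations with a deterministic farthest-first start."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centres[c])   # keep the centre of an empty cluster
        if updated == centres:
            break
        centres = updated
    return labels


# Hypothetical communities: two broadcast-like and two dialogue-like.
demo_mixes = [
    role_mix(["references"] * 5 + ["listeners"] * 5),
    role_mix(["references"] * 4 + ["listeners"] * 6),
    role_mix(["engaged leaders"] * 4 + ["mediators"] * 3 + ["diversified listeners"] * 3),
    role_mix(["engaged leaders"] * 3 + ["mediators"] * 4 + ["diversified listeners"] * 3),
]
demo_labels = kmeans(demo_mixes, k=2)
```

Because each role mix sums to one, the points lie on a simplex inside the unit cube, and communities with a similar balance of roles end up in the same cluster.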