Characterizing super-spreaders using population-level weighted social networks in rural communities

Sociocentric network maps of entire populations, when combined with data on the nature of constituent dyadic relationships, offer the dual promise of advancing understanding of the relevance of networks for disease transmission and of improving epidemic forecasts. Here, using detailed sociocentric data collected over 4 years in a population of 24 702 people in 176 villages in Honduras, along with diarrhoeal and respiratory disease prevalence, we create a social-network-powered transmission model and identify super-spreading nodes as well as the nodes most vulnerable to infection, using agent-based Monte Carlo network simulations. We predict the extent of outbreaks for communicable diseases based on detailed social interaction patterns. Evidence from three waves of population-level surveys of diarrhoeal and respiratory illness indicates a meaningful positive correlation with the computed super-spreading capability and relative vulnerability of individual nodes. Previous research has identified super-spreaders through retrospective contact tracing or simulated networks. By contrast, our simulations predict that a node’s super-spreading capability and its vulnerability in real communities are significantly affected by their connections, the nature of the interaction across these connections, individual characteristics (e.g. age and sex) that affect a person’s ability to disperse a pathogen, and also the intrinsic characteristics of the pathogen (e.g. infectious period and latency). This article is part of the theme issue ‘Data science approach to infectious disease surveillance’.

Introduction
Previous studies of biological and sociological features of human social interactions (including the evolutionary biology and genomics of social networks, their physiological implications and their possibly ancient heritage) suggest that natural selection has shaped social network structure and function [1][2][3][4]. One possible function of social networks relates to infectious diseases. Traditionally, the spread of infection in human communities has been analysed using compartmental epidemiological models such as SIR/SEIR models [5]. However, these models generally assume uniform and fully mixed populations, an assumption that often leads to incorrect estimates of predicted infection counts and of R0 (the basic reproduction number) [6,7]. Therefore, we need approaches for characterizing possible heterogeneity in transmission across individuals. This heterogeneity depends, in particular, on a host's social interactions (quantity and quality of interactions); their intrinsic ability to disperse a pathogen (e.g. based on their age and sex); and the transmission characteristics of the pathogens themselves (e.g. infectious periods and probability of transmission).
A previous study has reported a well-illustrated characterization of heterogeneity in transmission based on variation across individuals in their ability to spread a pathogen [8]. Specific evidence from the SARS outbreaks in 2003 (and from other outbreaks) has shown higher levels of super-spreading originating in older than in younger individuals [9][10][11]. Other research has revealed how pathogens affect transmission, documenting that transmission depends on the pathogen's infectious period and on its overlap with host social interactions [12][13][14].
In order to characterize the foregoing heterogeneity in transmission and accurately predict the trajectory of potential outbreaks, we collected sociocentric data from real networks. In this work, we ascertained social networks among 24 702 villagers in 176 villages in Honduras. The data also include the 'weight' of a tie (e.g. as measured by the frequency of interaction as well as the form in which pairs of individuals greet each other, signifying the type of physical contact). We were therefore able to synthesize a social-network-based transmission model using a bottom-up approach, constructing the effective probability of passing on a pathogen from an ego to their alters. We perform Monte Carlo simulations to better predict a node's likelihood of passing on the infection, on the one hand, and of being infected, on the other hand, as a function of the node's individual and network attributes, as well as the node's pathogen-coupled attributes.
We chose a commonly studied and important infectious disease (diarrhoea) to evaluate our model. In Honduras, where our study of social networks is based, currently 2% of the deaths, i.e. 737 deaths per 100 000 population, can be attributed to diarrhoea-related causes [15]. Epidemic investigations in Ghana, which has a socioeconomic background similar to that of Honduras, have reported a high R0 of 2.09 in several diarrhoeal outbreaks [16][17][18]. We also used influenza A as a second example. Prior empirical research has given us sufficient insight to characterize features of these pathogens that are likely to influence how the disease spreads through social interactions [12]. Our results show that network topology, quality of interaction and pathogen-specific dispersion play a crucial role in determining how vulnerable or super-spreading a node is.

Material and methods
We assessed social networks in a population of 24 702 people across 176 villages in the Western highlands of Honduras, as a part of a network-targeting public health intervention [19]. The total number of individuals who consented to participate and provided detailed social network, demographic, socio-economic and health data was 22 1 (2016). We performed sociocentric mapping of the study population using the photographic network census mobile app Trellis 1 (which we developed) to collect various ego-alter connections (using diverse name generators and villagers' photographs to verify the identity of social contacts) [19][20][21][22]. Using the Trellis platform, we also recorded the quality of interactions. We asked our respondents, 'Who do you spend free time with?' This was followed by 'How do you greet each other?' (characterizing levels of contact ranging from a smile, a bow/nod/wave, a verbal salute, a hand-shake/high-five, a pat on the back, a hug, or a kiss on the cheek). The respondents were also asked 'In the last month, how often did you see each other?', thus quantifying the frequency of their interaction. These questions, which define the levels of contact (salutations) and the frequencies of contact, add a qualitative dimension to the social network. Finally, we also used Trellis to collect information regarding subjects' history of diarrhoea and cough (respiratory-related infection).
We developed an agent-based model combining host, pathogen and (observed) social network characteristics. Each agent engages in social interaction with their alters, defined qualitatively (nature and frequency of contact) and quantitatively (number of connections). We then introduce the pathogen to these networks. Together, the pathogen's own characteristics (such as infectious period, incubation time and transmission probability) and host-specific characteristics (related to intrinsic host ability to disperse a pathogen) transform these social edges into temporal probabilistic paths for disease transmission. Using temporally dynamic Monte Carlo simulations, we observe the epidemic evolution of the disease over a period of 100 days in 1-day intervals.

(a) Model parameters
The characteristic equation (ρ(t), electronic supplementary material, tables S1-S6 and highlighted as green in figure 1) captures the transmissibility of the pathogen during the infectious period [23], which for shigellosis (as a cause of diarrhoea) was 7 days, and, on each day, the probability of transmission varies depending on incubation time, symptom progression, and recovery during that time frame. For the disease transmission model, diarrhoea (e.g. due to shigellosis) was chosen as a model, due to its prominence in the developing world [16][17][18][24]. To decouple the effects of the dispersion factor of the pathogen, we also investigate the transmission of influenza A through our network. In contrast to shigella, influenza A has an infectious period of 15 days [25].
The next vital component is a set of social network attributes (factors highlighted as purple in figure 1). The number of connections a node has (degree) determines an equivalent number of possible routes a pathogen can transmit through (to and from). Based on the qualitative attributes of the social interaction (salutations), connections exhibiting a hand-shake/high-five, a pat on the back, a hug, or a kiss on the cheek were considered to be riskier behaviour, and therefore to lead to an increase in probability (p_s); the other salutations were considered as relatively safe. For frequency of contact, everyday contact was considered to have a probability of 1, and lesser frequencies were considered as an equivalent of their fractions (p_f).
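The two edge-level factors described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the numerical values of `P_S_RISKY` and `P_S_SAFE`, the 30-day month and the function names are assumptions, since the actual probabilities are given only in the electronic supplementary material.

```python
# Illustrative sketch only. The real p_s values live in the paper's
# supplementary tables; the two constants below are ASSUMED placeholders.
RISKY_SALUTATIONS = {
    "hand-shake/high-five", "pat on the back", "hug", "kiss on the cheek",
}
SAFE_SALUTATIONS = {"smile", "bow/nod/wave", "verbal salute"}

P_S_RISKY = 1.0  # assumed value for riskier physical-contact greetings
P_S_SAFE = 0.5   # assumed value for relatively safe greetings


def salutation_probability(salutation: str) -> float:
    """Return p_s: higher for greetings that involve physical contact."""
    if salutation in RISKY_SALUTATIONS:
        return P_S_RISKY
    if salutation in SAFE_SALUTATIONS:
        return P_S_SAFE
    raise ValueError(f"unknown salutation: {salutation!r}")


def frequency_probability(days_seen: int, days_in_month: int = 30) -> float:
    """Return p_f: everyday contact maps to 1, rarer contact to its fraction."""
    if not 0 <= days_seen <= days_in_month:
        raise ValueError("days_seen must lie between 0 and days_in_month")
    return days_seen / days_in_month
```

For instance, under these assumed values a pair who hug and meet every other day would have p_s = 1.0 and p_f = 0.5.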
To fully characterize the transmission model, we follow previous research [8] that has pointed out that airborne pathogen transmission is also affected by biological characteristics of the individuals spreading a pathogen (electronic supplementary material, tables S3-S6 and factors highlighted in red in figure 1). An individual can spread the same airborne pathogens differently depending on their breathing patterns, quantity of aerosols during exhalation, and host-pathogen interactions, which might affect viral shedding [8]. Thus, these factors may be allowed to vary with a person's age (p_a) [8,10] and sex (p_g) [11]. Naturally, this factor related to an individual's propensity to disperse a pathogen is different for different pathogens [8]. For our influenza transmission model, the dispersion would account for all the airborne-related transmission factors (as above). Additionally, for diarrhoea, there is also evidence suggesting that individuals with different ages and sexes show varying severity in infection and also transmission [10,11]. Hence, to account for such factors in our model, without loss of generality, we also consider the age and gender of the individual transmitting the pathogen. All above-mentioned model parameters can be found in electronic supplementary material, tables S1-S6.
Figure 1. The pathogen-characteristic equation (highlighted in green) is the pathogen's day-to-day transmission probability, decoupled from the human element in transmission. Salutations and frequency probabilities (highlighted in purple) are the qualitative sociocentric aspects of a node's transmission. Dispersion probabilities (highlighted in red) are the egocentric aspect of a node's propensity to transmit a pathogen. Noise (highlighted in black) was also included to account for variations in the node's interaction pattern. (Online version in colour.)
With respect to the quality of interactions, we assume there is an inherent uncertainty in the frequency of interactions in the social network. For example, a person interacting with their alters once a week may vary this frequency in some weeks. Therefore, to account for this uncertainty, we introduce a small multiplicative Gaussian noise term, N(μ, σ), where μ = 1 and σ = 0.01, so that there is a small probability that the person may interact more or less frequently than their reported interaction.
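As a rough sketch of the noise term, the snippet below multiplies a reported frequency probability by a draw from a Gaussian with μ = 1 and σ = 0.01. Treating σ as a standard deviation and clipping the result to [0, 1] are assumptions made here so that the value remains a valid probability; the function name is hypothetical.

```python
import random


def noisy_frequency(p_f: float, rng: random.Random,
                    mu: float = 1.0, sigma: float = 0.01) -> float:
    """Perturb a reported frequency probability with multiplicative noise.

    sigma is treated as a standard deviation (an assumption), and the
    result is clipped to [0, 1] (also an assumption).
    """
    noise = rng.gauss(mu, sigma)
    return min(max(p_f * noise, 0.0), 1.0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for a reproducible illustration
    print([round(noisy_frequency(0.5, rng), 4) for _ in range(5)])
```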

(b) Model formulation
An agent-based network model is implemented using Monte Carlo simulations for pathogen transmission. As described in figure 2, initially, the model begins by seeding a node with an infection. In the second step, the node's age and sex determine its dispersion factor (in the sense of the likelihood of conveying the pathogen to others). The degree, or the number of connections the node has, determines how many probabilistic transmission routes the pathogen can take. For each connection, the unique quality of interaction, i.e. the salutation and frequency of this edge, defines its corresponding probabilities. The characteristic equation of the disease, which varies from day to day, along with the noise, ultimately results in a combined transmission probability of β_{ρ,s,f,a,g}(t), as shown in equation (2.1).
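Equation (2.1) is not reproduced in this text. One reading consistent with the description above, assuming that the individual factors combine multiplicatively (an assumption, not a form confirmed by the source), would be:

```latex
\beta_{\rho,s,f,a,g}(t) \;=\; \rho(t)\, p_s\, p_f\, p_a\, p_g\, \varepsilon,
\qquad \varepsilon \sim N(\mu, \sigma),\quad \mu = 1,\ \sigma = 0.01
```

Here ρ(t) is the pathogen's characteristic equation on day t of the infectious period, p_s and p_f are the salutation and frequency probabilities of the edge, p_a and p_g are the age- and sex-related dispersion probabilities of the transmitting node, and ε is the noise term.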
The connected nodes have a random chance of getting infected with a probability β_{ρ,s,f,a,g}(t). This probability varies from day to day and also from edge to edge, depending on the quality of interaction, for the same transmitting node. Correspondingly, each neighbouring node has a probability of not getting infected equal to 1 − β_{ρ,s,f,a,g}(t). If the neighbouring node is infected, the newly infected node will have a new β_{ρ,s,f,a,g}(t), depending on its edges, its dispersion and the day. Upon infection, every infected node (including the seeded node) can only transmit infection for a period equivalent to the infectious period of the disease (7 days for shigella and 15 days for influenza A). Thus, this transmission process is finite, leading to the ultimate state of the network. A single Monte Carlo simulation has 100 steps, each representing 1 day, netting a total of 100 days of transmission in a single simulation. This simulation is repeated 10 000 times for every infection-seeded node (red coloured node). Every node in the village also takes a turn in becoming an infection-seeded node. Therefore, for Hacienda San Juan, for example, which has 58 nodes, a total of 580 000 simulations were performed. Overall, for all 176 villages, 2 470 200 000 simulations were performed to characterize all possible transmission scenarios for shigellosis (diarrhoea). In addition to the diarrhoeal system [ρ_diarrhoea(t)], simulations were also repeated for influenza A transmission [ρ_influenza(t)]. Finally, we also repeated the diarrhoeal simulations excluding the effect of quality of interaction [ρ_uniform(t)]. Figure 3a shows one possible final state, where there were 12 new infections (yellow nodes) arising from the initial infection (red node) at the end of the 100-day Monte Carlo simulation. Figure 3b shows four possible final states, as an illustration, by repeating the simulation four times.
The variation in the final states can be explained by the transmission probability β_{ρ,s,f,a,g}(t) giving rise to differing numbers of infection-transmitting agents in each simulation.
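The simulation loop described in this section can be sketched as below. This is a simplified illustration under stated assumptions rather than the authors' code: the adjacency-dictionary representation, the `beta` callable and its signature are hypothetical, and infected nodes are assumed never to become susceptible again within a run.

```python
import random


def simulate_outbreak(adjacency, seed_node, beta, infectious_days, rng,
                      n_days=100):
    """Run one Monte Carlo simulation and return the newly infected nodes.

    adjacency: dict mapping each node to a list of its alters.
    beta: hypothetical callable (ego, alter, day_of_infection) -> probability.
    Assumption: a node infected once is never reinfected within the run.
    """
    infected_on = {seed_node: 0}  # node -> day on which it was infected
    for day in range(1, n_days + 1):
        newly_infected = []
        for ego, start in infected_on.items():
            days_since = day - start
            if days_since > infectious_days:
                continue  # infectious period is over; ego no longer transmits
            for alter in adjacency[ego]:
                if alter in infected_on:
                    continue
                if rng.random() < beta(ego, alter, days_since):
                    newly_infected.append(alter)
        # Record infections after the sweep so they start transmitting tomorrow.
        for alter in newly_infected:
            infected_on.setdefault(alter, day)
    return set(infected_on) - {seed_node}


if __name__ == "__main__":
    # Toy 4-node path graph with a constant, made-up transmission probability.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    rng = random.Random(42)
    runs = [simulate_outbreak(path, 0, lambda e, a, d: 0.1, 7, rng)
            for _ in range(1000)]
    print(sum(len(r) for r in runs) / len(runs))  # mean new infections
```

Repeating such a run 10 000 times for each of the 58 possible seed nodes of a village gives the 58 × 10 000 = 580 000 simulations quoted above for Hacienda San Juan.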

(c) Outputs measured
The ultimate goal of these transmission simulations is to measure every node's vulnerability and also super-spreading capability. The vulnerability is measured by determining how many times a node transforms into a newly infected node (yellow node). The super-spreading capability is measured by the average number of new infections arising from the infection-seeded node in all of the 10 000 simulations. The node's vulnerability will be shown as a relative measure, compared to the rest of the village in quintiles (relative vulnerability, RV). The node's super-spreading capability will be shown as a relative measure in quintiles (relative super-spreading capability, RSS), and also as an absolute measure as a percentage of total transmission in the village (super-spreading index, SSI). To understand the importance of the quality of social connections (i.e. salutations and the frequencies of interactions), we repeated the simulations without considering the quality of interactions, obtaining a new transmission probability of β_{ρ,a,g}(t).
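A minimal sketch of how these outputs might be derived from stored simulation results is given below. The data layout (`results` as a dictionary from seed node to a list of infected-node sets), the rank-based quintile rule and the exact normalization used for SSI are assumptions made for illustration.

```python
from collections import Counter


def summarize(results):
    """Compute vulnerability counts, mean new infections per seed, and SSI.

    results: dict seed_node -> list of sets of newly infected nodes,
             one set per simulation (hypothetical layout).
    SSI is taken here as each seed's share (in %) of the summed mean
    transmissions in the village; this normalization is an assumption.
    """
    vulnerability = Counter()
    spreading = {}
    for seed, runs in results.items():
        for infected in runs:
            vulnerability.update(infected)  # count each time a node is infected
        spreading[seed] = sum(len(r) for r in runs) / len(runs)
    total = sum(spreading.values())
    ssi = {n: (100.0 * s / total if total else 0.0)
           for n, s in spreading.items()}
    return vulnerability, spreading, ssi


def quintiles(scores):
    """Assign each node a quintile 1..5 by rank (5 = highest score).

    Ties are broken by sort order, which is an arbitrary choice.
    """
    ranked = sorted(scores, key=lambda n: scores[n])
    n = len(ranked)
    return {node: (i * 5) // n + 1 for i, node in enumerate(ranked)}
```

Under this sketch, RSS would be `quintiles(spreading)` and RV would be `quintiles` applied to the vulnerability counts of all nodes in the village.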

(d) Real infections from the survey network
In our Honduran social network, we also collect information on whether the individual respondent has had diarrhoea in the last four weeks. This is represented in figure 3 (as square nodes). Simultaneously, we also record if the respondent has been coughing for at least two weeks at the time of survey. Using these, we later analyse their correlations with tendencies to infect (super-spreading capability) and be infected (vulnerability).

Results
This difference can also be visualized by figure 4i, which shows the relative rank difference in the super-spreading capability between the two systems (most nodes have a shift in ranking, visualized as the difference between figure 4b and figure 4e). For example, in this village, only two nodes in the entire village of 58 nodes retained their relative super-spreading rank after interaction quality was taken into account. The relative rank difference can also be visualized in figure 4j, which shows that most nodes change ranking when including the quality of interaction in the model, regardless of their topological position in the network graph.
Essentially, removing the quality of interactions can give rise to a completely different set of super-spreaders (nodes in the top, i.e. 5th, quintile of super-spreading capability) and a completely different set of highly vulnerable individuals. Thus, the quality of interaction plays an important role in predicting super-spreading capability and vulnerability (see below for the analysis across all 176 villages).
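The rank comparison between the two systems can be illustrated with a short sketch. The function names and the use of a strict 1..n ranking (rather than quintiles) are assumptions; the scores in the usage note are made up.

```python
def rank_positions(scores):
    """Rank nodes 1..n by descending score (1 = strongest spreader)."""
    ordered = sorted(scores, key=lambda n: scores[n], reverse=True)
    return {node: i + 1 for i, node in enumerate(ordered)}


def rank_shift(scores_with_quality, scores_uniform):
    """Per-node rank difference between the two models (0 = rank retained)."""
    r_quality = rank_positions(scores_with_quality)
    r_uniform = rank_positions(scores_uniform)
    return {n: r_quality[n] - r_uniform[n] for n in r_quality}
```

With made-up scores `{"a": 3.0, "b": 2.0, "c": 1.0}` for the model with interaction quality and `{"a": 1.0, "b": 2.0, "c": 3.0}` without it, only node "b" has a shift of 0, i.e. retains its rank.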

(b) Role of pathogen
To consider the role of the pathogen in the transmission of infection, we perform analogous analyses with the influenza transmission model. Figure 5a-c shows the infection counts, RV, RSS and SSI of diarrhoeal simulations, whereas panels (d-f) show these quantities for the influenza simulations in this illustrative village.
From this comparison, the influenza and diarrhoeal systems are different in vulnerability (p = 0.05) but not in super-spreading capability (p = 0.284). However, the relative rank of super-spreading capability is completely different for the two pathogens, with no villager retaining their same relative rank upon changing the pathogen. Thus, pathogen characteristics play a vital role in predicting the super-spreading capability and vulnerability.

They also showed a positive correlation of 0.02 with RSS (p = 0.021). Thus, individuals with higher degree and/or those who engage in 'riskier' interactions (salutations and frequencies) are much more likely to be vulnerable to diarrhoeal infection, and slightly more likely to become super-spreaders of diarrhoeal infection. By switching to the influenza model (electronic supplementary material, figure S1), where the only changed model parameters are the pathogen characteristic equation and infection probability with individuals' age and sex (related to their ability to disperse the pathogen), the individuals with persistent cough over two weeks showed a positive correlation of 0.12 with degree (p = 5.77 × 10⁻³), 0.12 with salutations (p = 5.26 × 10⁻³) and 0.17 with frequency (p = 7.93 × 10⁻⁵). Moreover, they also showed a slightly weaker, but statistically significant, positive correlation of 0.09 with RV (p = 0.031), 0.09 with RSS (p = 0.047), and 0.09 with eigenvector centrality (p = 0.028).
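The text does not state which correlation coefficient was used. As one possible illustration, the sketch below computes a Pearson correlation between a binary symptom report (1 = reported, 0 = not reported) and a node-level score such as RSS or RV, which for a binary variable is the point-biserial correlation. The choice of coefficient and the toy data are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    With a 0/1 variable in x this equals the point-biserial correlation.
    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    reported_cough = [0, 0, 1, 1]  # made-up survey responses
    rss_quintile = [1, 2, 4, 5]    # made-up simulated quintiles
    print(pearson(reported_cough, rss_quintile))
```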

Discussion and conclusion
Location-tracking and phone-data surveillance or contact-tracing mechanisms are often considered as ground truth for any epidemic investigation [8][26][27][28][29][30]. But the nature of dyadic social interactions (figure 4), i.e. salutations and frequency, captures the quality of the relationship between nodes, thereby transforming regular social connections into weighted connections. Previous studies on weighted interactions in social networks have shown that weaker ties play an important role in preserving the local structure of the network [31]. Moreover, considering a social network with the quality of interaction has two significant advantages over location-tracking or contact-tracing data. First, surveillance datasets have a fixed observation time-window, i.e. tracked data is recorded and analysed in a specific time frame [8,32]. In our model, the responding individual can assess their own quality of interaction, which is typically not confined to the past few hours or weeks. Second, the quality of interaction can also be used in predicting future pathogen transmission; this cannot be as reliably forecast through location tracking or contact tracing or other conventional epidemiological investigation methods. Predicting individual transmission or infection risk without considering quality of interaction may thus lead to far less accurate predictions (as shown in figures 4 and 6).
When constructing a disease transmission model, dispersion is another key factor to be considered (figure 5). Dispersion can be defined here as an individual's transmitting capability, depending on their age, sex, breathing patterns, vocal activities, and host-pathogen interaction factors [8]. Hence, dispersion of a pathogen arises due to the unique interaction between the pathogen and the hosts' characteristics. As shown in figure 6 and electronic supplementary material, figure S1, switching the model from diarrhoea to influenza also significantly changes an individual's RV (which can also be understood as personal infection risk) and their RSS. Accordingly, there are no unique super-spreaders for all pathogens. Rather, the characteristics of the pathogen also determine who is likely to be more vulnerable or more super-spreading within any population.
In sum, by considering the quality of social network interactions in addition to network structure and pathogen and host characteristics, we constructed a comprehensive social-network-based disease-transmission model with significantly higher accuracy in predicting an individual's vulnerability to infection and their spreading capability. Estimating these variables across all 176 villages reveals that the RV, RSS and SSI of all the villagers show a strong, significant correlation with salutations, on par with the number of connections individuals may have. This further strengthens the need for considering the quality of interaction in any disease transmission model. This general model can be applied to any contact-dependent communicable disease. The quality of social interaction can be incorporated to provide an estimate of future personal risk of either becoming infected or being a super-spreader. This model can be used even when no location-tracked pathogen transmission data are available for a population, or even before the pathogen has a chance to invade a region, in order to identify vulnerable individuals based on structural data combined with interaction data alone.
Ethics. All participants provided informed consent at the time of data collection. Ethics approval was obtained from the Institutional Review Board (protocol no. 1506016012) at Yale University in New Haven, USA.
Data accessibility. Additional figures and tables have been included as a supplement, and the source code is available at: https://github.com/shiv-y23/Super-spreader_paper.git.