Social capital predicts corruption risk in towns

Corruption is a social plague: gains accrue to small groups, while its costs are borne by everyone. Significant variation in its level between and within countries suggests a relationship between social structure and the prevalence of corruption, yet large-scale empirical studies of this relationship have been missing due to lack of data. In this paper, we relate the structural characteristics of the social capital of settlements to corruption in their local governments. Using datasets from Hungary, we quantify corruption risk by suppressed competition and lack of transparency in the settlement’s awarded public contracts. We characterize social capital using social network data from a popular online platform. Controlling for social, economic and political factors, we find that settlements with fragmented social networks, indicating an excess of bonding social capital, have higher corruption risk, and settlements with more diverse external connectivity, suggesting a surplus of bridging social capital, are less exposed to corruption. We interpret fragmentation as fostering in-group favouritism and conformity, which increase corruption, while diversity facilitates impartiality in public life and stifles corruption.


Description of iWiW data
In line with previous work on iWiW, we filtered the data used in our analysis. We use the data from the network at its peak activity in 2012. Out of roughly 4.5 million user accounts, we dropped the roughly 500,000 accounts with a location outside of Hungary. Following Lengyel et al. [6], we also dropped the 193 users with more than 10,000 connections, arguing that such a large number of connections cannot represent social ties. We argue that this cutoff balances two concerns: it excludes accounts with so many connections that the nature of those connections is in question, while it avoids truncating the tail of the distribution of social connectivity too much, allowing sociality to range over several orders of magnitude. Many approaches to detecting "fake" accounts in social networks use the degree of a node as an important input [3].
In Plot A of Figure 1 we plot the sensitivity of fragmentation and diversity to the maximum degree threshold. If we discarded all users with more than 100 connections (compared to the 10,000 connection cutoff we use in our paper), fragmentation would be significantly higher and diversity significantly lower than the versions we use in the paper. However, this is not a reasonable cutoff, as nearly 10% of users have more than 500 connections (see Plot B, Figure 1). The settlement fragmentation and diversity measures are within 5% of the versions we use in the paper if the threshold is set at 500, 1000, or 2000 connections.
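The threshold sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the paper: the function name `filter_by_max_degree` and the toy edge list are hypothetical, and degrees are assumed to be counted once on the unfiltered network. The fragmentation and diversity measures would then be recomputed on each filtered network.

```python
from collections import defaultdict


def filter_by_max_degree(edges, max_degree):
    """Drop every user whose degree exceeds max_degree, along with their edges.

    Degrees are counted once on the unfiltered network (an assumption of
    this sketch); the filter is not applied iteratively.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    kept_users = {user for user, d in degree.items() if d <= max_degree}
    kept_edges = [(u, v) for u, v in edges if u in kept_users and v in kept_users]
    return kept_users, kept_edges


# Hypothetical toy network: "hub" has 4 connections, everyone else at most 2.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]

# Sweep over candidate thresholds, as in the sensitivity analysis.
for threshold in (2, 4):
    users, kept = filter_by_max_degree(edges, threshold)
    print(threshold, len(users), len(kept))
```

With a threshold of 2 the hub and all of its edges are removed; with a threshold of 4 the network is left intact.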
In Figure 2 we show the relationship between settlement population and the number of iWiW users listing their location in the settlement, and the share of the population registered to iWiW.
As mentioned in the text, user privacy is a key concern. The anonymized iWiW data was made available to a consortium of researchers in Hungary, each of whom signed a non-disclosure agreement (NDA) to use the data for research purposes only. As a result, only settlement-level aggregated data can be shared.
Figure 1: A) The sensitivity of diversity and fragmentation to changing the maximum degree threshold, relative to the 10,000 degree threshold used in the paper. Error bars represent 95% confidence intervals. The measures are within 5% of the version we use in the paper for cutoffs at or above 500. B) The distribution of user connections on a log scale. Very few users (193) have more than 10,000 connections, while many (405,337) have more than 500.

Corruption risk indicators
In this section we go into more detail regarding the individual corruption risk indicators. Each indicator quantifies a different way in which bureaucrats have excluded competitors, as documented in qualitative work on ground-truth corruption cases from around the EU [4]. We stress that no individual indicator or composite measure can credibly suggest that an individual contract was awarded by a corrupt process; aggregated over many contracts issued by the same institution, however, these indicators map highly suggestive patterns. This point is an important motivation for filtering out towns awarding fewer than five contracts a year.
• Single bidder (C_singlebid) is an outcome: whether the contract was awarded in a competition attracting only a single offer.
• Closed procedure (C_closedproc) indicates when the contracting authority has decided to award a contract by direct negotiation with a firm or via an invitation-only bidding process. This decision can be used to completely subvert competition.
• No call for bids (C_nocall) indicates when, for a contract awarded via an open competition, no contract announcement or call for bids was published in the official procurement journal. A corrupt official can greatly decrease the chance of non-favored firms participating by limiting access to information.
• Long eligibility criteria (C_eligcrit) captures how bureaucrats can box out specific firms by adding requirements to participation criteria. By including many such restrictions (regarding previous experience, company size, qualifications), a corrupt bureaucrat can systematically exclude non-favored firms.
• Extreme decision period (C_decidetime) highlights suspicious activity between the end of a competition and the decision to award a contract. If the decision period is extremely short, this suggests that the decision to award the contract to a specific firm was premeditated, and that the bids were not carefully checked. If the decision period is very long, it may indicate that legal challenges about the contract are delaying the award decision.
• Short time to submit bids (C_bidtime) indicates that favored firms may have been tipped off about a competition for tenders ahead of the public announcement. By leaving non-favored firms only a short time between the announcement and the award, the corrupt official makes it very difficult for them to submit a bid. It is important to remember that bids are complex legal documents, including at times cost estimates, schematics, and references.
• Non-price criteria (C_nonprice) tracks the share of non-price related or subjective criteria in the evaluation of bids. For instance, a corrupt bureaucrat may reject a lower cost bid if, according to a subjective criterion of bid quality, it is evaluated less favorably than a higher cost bid from a favored firm.
• Call for bids modified (C_callmod) checks whether a call for bids was modified between the initial announcement and the deadline. This potential corruption strategy closely emulates C_bidtime in that a corrupt official can suddenly change the specifications or rules of a tender shortly before the deadline.
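To illustrate how contract-level indicators might be aggregated to the settlement level, here is a minimal sketch. It rests on assumptions not stated in this section: each indicator is treated as a 0/1 flag (in practice C_nonprice is a share), the composite score is an unweighted mean of the flags defined for a contract, and the contracts passed in cover a single year. The names `contract_cri`, `settlement_cri`, `make_contract`, and `min_contracts` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Indicator names follow the subscripts used in the list above.
INDICATORS = ["singlebid", "closedproc", "nocall", "eligcrit",
              "decidetime", "bidtime", "nonprice", "callmod"]


def contract_cri(contract):
    """Unweighted mean of the indicators defined for one contract.

    Assumption: equal weights; an indicator set to None is treated as not
    applicable and skipped. At least one indicator must be defined.
    """
    flags = [contract[name] for name in INDICATORS
             if contract.get(name) is not None]
    return mean(flags)


def settlement_cri(contracts, min_contracts=5):
    """Average contract-level scores per settlement.

    Settlements with fewer than min_contracts contracts are dropped, since
    a handful of contracts cannot reveal a pattern.
    """
    by_settlement = defaultdict(list)
    for contract in contracts:
        by_settlement[contract["settlement"]].append(contract_cri(contract))
    return {name: mean(scores) for name, scores in by_settlement.items()
            if len(scores) >= min_contracts}


def make_contract(settlement, **flags):
    """Build a toy contract record with every indicator defaulting to 0."""
    contract = {name: 0 for name in INDICATORS}
    contract.update(flags)
    contract["settlement"] = settlement
    return contract


# Hypothetical data: settlement "A" awards five contracts, "B" only two.
contracts = [make_contract("A", singlebid=1, closedproc=1)]
contracts += [make_contract("A") for _ in range(4)]
contracts += [make_contract("B", singlebid=1) for _ in range(2)]

scores = settlement_cri(contracts)
print(scores)
```

Settlement "B" is excluded by the minimum-contract filter, while "A" receives a low score because only one of its five contracts raises any flags.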

Relationship between fragmentation and diversity
Fragmentation and diversity, our measures of bonding and bridging social capital respectively, are positively and significantly correlated (ρ ≈ 0.46). Though fragmentation considers only edges within the settlement and ego diversity includes external edges, both variables measure modularity in the network. However, according to our hypotheses, they are expected to capture different kinds of socialization. We found that, despite their positive correlation, these features have opposite relationships with our corruption risk measures: high fragmentation is positively and high diversity is negatively correlated with corruption risk. To test whether the inter-settlement edges or the ego focus of diversity does more to distinguish the measure from fragmentation, we recalculated diversity considering only edges within the settlement. This alternative "internal" diversity measure is weakly correlated with fragmentation (ρ ≈ 0.28) and strongly correlated with diversity (ρ ≈ 0.72). This suggests that both the connections to other settlements and the ego focus of the diversity measure distinguish fragmented settlements from diverse ones.

Model covariates and controls
In this appendix section we present the settlement-level variables used as controls in our models. We also report their summary statistics. Note that in our models, we scale all features to have mean 0 and standard deviation 1. Our controls mostly refer to data from 2011, when the last large scale Hungarian census took place and the data are of highest quality.
• Average income per capita (2011): Wealthier places tend to be less corrupt [7], as competition for limited resources in poorer places is expected to create a greater incentive to cheat. Data on median income or the income distribution at the settlement level were, to the best of our knowledge, not available in Hungary.
• Population (log)(2011): Larger cities may have different contracting needs, different political and social norms, and different network characteristics.
• Number of contracts awarded (log): Settlements contracting more frequently may be more experienced and may follow better practices. As more people are involved in contracting, corruption may become more difficult.
• Rate of iWiW use (2012): The rate of iWiW use both proxies for the economic development of the settlement and controls for differences in observed social network structure resulting from differences in access to the web. Previous work suggests that iWiW users, especially the early adopters, skew young and wealthy [6].
• Average mayoral victory margin: Measured across three elections (2002, 2006, 2010), this variable proxies for the lack of political competition in the settlement. The absence of political competition has been shown to correlate with corruption [1].
• Share of population with at least a high school diploma (2011): Education is typically correlated with better control of corruption [9].
• Share of working-age population inactive and unemployment rate (2011): Counting the long-term and short-term unemployed respectively, these variables quantify economic stagnation. The economic hardship connected with high unemployment is conjectured to worsen political corruption [10].
• The minimum travel distance to Budapest, the capital city: This variable captures the physical isolation of the settlement from the main economic, political, and social hub of the country. Past research has shown that geographic isolation reduces accountability and increases corruption [2].
• Share of population over 60 years old (2011): This variable controls for settlements in which the elderly make up a large share of the population. The elderly are underrepresented on online social networks and tend to use these platforms differently than younger users [8].
• Whether the settlement has a university (2011): This variable controls for the presence of a place of higher education in the settlement, including local branches of universities headquartered elsewhere, which inflates the number of young people, and hence of likely iWiW users, in the settlement.
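The scaling step mentioned at the start of this section (mean 0, standard deviation 1) can be sketched as below. This is an illustrative sketch only: the helper name `standardize` and the toy population values are hypothetical, and the use of the population rather than the sample standard deviation is an assumption.

```python
from math import log
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores).

    Assumption: the population standard deviation is used; the sample
    version would differ slightly for a small number of settlements.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical settlement populations; the model uses their logarithm.
population = [1_000, 10_000, 100_000]
log_population = [log(p) for p in population]

z_scores = standardize(log_population)
print(z_scores)
```

After this transformation a regression coefficient can be read as the change in the outcome, in standard deviations, associated with a one standard deviation change in the predictor.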

Model results, diagnostics, and feature importances
We present the full model results in Table 2. Note that all variables are standardized to have mean 0 and standard deviation 1. This aids interpretation: for example, a one standard deviation increase in a settlement mayor's average margin of victory is associated with an increase in corruption risk of roughly one quarter of a standard deviation. We also present models including only one of the two network measures in Table 3. The effect and significance of each feature are preserved when the other is excluded. The estimated coefficients of the control variables and their levels of statistical significance offer additional insight into the phenomenon of corruption risk. Wealthier settlements are in general less corrupt, though the effect is not significant for CRI. The rate of iWiW use is not related to corruption risk, and this does not change when we include the social capital features. The average mayoral victory margin is a highly significant positive predictor of corruption risk. One potential explanation is that mayors who do not face significant competition do not fear being voted out of office if they are corrupt. Similarly, settlements that are far from Budapest, which our models predict to be significantly more corrupt, may be insulated from investigation by the central authorities simply by being out of the spotlight.
One potential source of bias in the coefficient estimates of multiple regression models is collinearity among the predictors. We test each predictor for multicollinearity using the variance inflation factor (VIF), defined as the ratio of the variance of a predictor's coefficient estimate in the full model to the variance of that estimate in a model containing only that predictor. We run this diagnostic for each predictor used in models (2) and (4) in the main text and report the results in Table 5. A popular rule of thumb is that VIF values under 10 denote acceptable levels of correlation between variables [5]. Because the VIF of the "Share of population inactive" control variable is near this limit, we reran our analyses without it, finding no substantive change in our results. The relevant model tables are available on request.
We show the relative variable importances of Model (6) (column 6 in Table ??), the fully specified model predicting average CRI, using an Analysis of Variance (ANOVA) F-test in Figure 3. We include only terms with a significant ANOVA F-test. Though other features have stronger predictive power, the social network features are more useful in predicting corruption risk than economic variables like unemployment, inactivity, and average income.