Generalized Erdős numbers for network analysis

The identification of relationships in complex networks is critical in a variety of scientific contexts. This includes the identification of globally central nodes and analysing the importance of pairwise relationships between nodes. In this paper, we consider the concept of topological proximity (or ‘closeness’) between nodes in a weighted network using the generalized Erdős numbers (GENs). This measure satisfies a number of desirable properties for networks with nodes that share a finite resource. These include: (i) real-valuedness, (ii) non-locality and (iii) asymmetry. We show that they can be used to define a personalized measure of the importance of nodes in a network with a natural interpretation that leads to new methods to measure centrality. We show that the square of the leading eigenvector of an importance matrix defined using the GENs is strongly correlated with well-known measures such as PageRank, and define a personalized measure of centrality that is also well correlated with other existing measures. The utility of this measure of topological proximity is demonstrated by showing the asymmetries in both the dynamics of random walks and the mean infection time in epidemic spreading are better predicted by the topological definition of closeness provided by the GENs than they are by other measures.


Introduction
The study of complex networks has increased enormously in recent years due to their applicability to a wide range of physical [1,2], biological [3], epidemiological [4,5] and sociological [6] systems. Two basic goals in this regard are to understand and quantify the structure of the network to better characterize the relationship between the interacting members of the network (the nodes), while also characterizing the dynamical processes on the network [6] that may shed light on the processes by which they form [7].
Understanding the topological properties of the network on both a global and local level can be useful in approaching both of these goals. Global properties of interest may include simple measures of the distribution of node properties, such as the [...]
© 2018 The Authors. Published by the Royal Society under the terms of the Creative [...]
[...] of closeness ($D_{ij}$) based solely on the matrix of weights between nodes $i$ and $j$, $w_{ij}$ (with an undirected network where $w_{ij} = w_{ji}$ is assumed throughout this paper).
The proximity or closeness between nodes, $D_{ij}$, will be small for nodes that are close to one another and large for distant nodes, with a simple and common choice being $D_{ij} = w_{ij}^{-1}$ (so strongly connected nodes are 'close', and disconnected nodes are 'far'). Alternatively, in an unweighted network, the length of the shortest path between a pair of nodes is a natural definition [28,29] and is the basis for the classic Erdős numbers in the context of an unweighted collaboration network [30].
Improvements on this simple measure which incorporate the effect of multiple paths between nodes (see figure 1a for a schematic diagram) include the resistance distance [14,31], self-consistent similarity measures [32] and communicability [33], to name only a few. An additional approach to defining similarity between nodes is found by positing a multidimensional 'latent space' of node properties [34], with the assumption that nodes that are close in the latent space are likely to be connected in the network and each node's position in the space inferred from the observed connectivity. Each of these methods incorporates the global topology of the network into a symmetric measure of closeness between pairs of nodes ($D_{ij} = D_{ji}$).

Finite resources and asymmetric measures of proximity
Finite resources are shared in some networks, with examples including collaboration networks (where time with one collaborator reduces the available time for others), multi-core processor components [35] (where finite memory or other hardware must be shared) and random walks (where the walker can only move to a single neighbour at a time, with a transition probability $P_{i\to j} = w_{ij}/W_i$ and $W_i = \sum_k w_{ik}$ the total strength of the node $i$). In the context of these networks of limited resources, closeness measures such as resistance distance may be undesirable [22], because the addition of a new edge in the network should be detrimental to some nodes (those who receive less of the finite resource due to the new edge) and beneficial to others (those who receive more due to the edge). For closeness measures based on the direct weight between nodes (where the 'closeness' between $i$ and $j$ is often taken to be $w_{ij}^{-1}$) or resistance distance between nodes, it is straightforward to see that the newly measured closeness between nodes $i$ and $j$ satisfies $D^{(\mathrm{new})}_{ij} \le D^{(\mathrm{old})}_{ij}$ for all pairs, i.e. the addition of an edge can never cause nodes to become less close to one another. This is not sensible in the context of nodes that share a finite resource with their neighbours, as shown in figure 1b: if a node $i$ has many neighbours, each receives less of the resource than if $i$ had few neighbours.
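The asymmetry of the random-walk transition probability can be made concrete with a minimal sketch. The function name `transition_matrix` and the star-shaped example network are illustrative assumptions, not taken from the paper; only the formula $P_{i\to j} = w_{ij}/W_i$ comes from the text.

```python
import numpy as np

def transition_matrix(w):
    """Row-normalise a weight matrix: P[i, j] = w[i, j] / W[i]."""
    w = np.asarray(w, dtype=float)
    strength = w.sum(axis=1, keepdims=True)  # W_i = sum_k w_ik
    return w / strength

# Hypothetical example: a star with hub 0 and three leaves 1, 2, 3.
star = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
P = transition_matrix(star)
print("hub -> leaf:", P[0, 1])
print("leaf -> hub:", P[1, 0])
```

The hub divides its finite resource (the walker's next step) among three neighbours, whereas each leaf sends everything to the hub, so the two directions of the same edge carry different probabilities even though the weights are symmetric.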
The expectation of the influence of shared resources in figure 1 is satisfied by a number of existing measures of proximity. A quantity such as the transition probability in a random walk, $P_{i\to j}$, is asymmetric and ensures that nodes are closer if they have few neighbours, pictured in figure 1b (so a walker is more likely to pass between them than if they had many connections). However, it is not a global measure of closeness because the transition probability incorporates only the nearest neighbour connections between nodes (so there is no proximity between disconnected nodes, even if multiple paths exist between them). The PageRank matrix [18] $B_{i\to j} = g P_{i\to j} + (1 - g)/N$, with $g$ a teleportation parameter, gives a modified estimate of proximity: a uniform measure of closeness for disconnected nodes independent of the network's geometry.
Figure 1. Two competing requirements for global 'closeness' in a network with shared resources. In (a), many short paths between nodes increase the closeness between them. This is similar to the resistance distance between nodes: additional parallel paths between them reduce their resistance distance. In (b), the finite resources of the high-degree blue node suggest that it should be less close to the red node than for the lower-degree blue node above, as resources are shared also with the other neighbours. This is similar to the transition probability from the blue node in a random walk: the more connections the blue node has, the lower the probability of visiting the red node. Panel annotations: 'many shared neighbours' (a) and 'few unshared neighbours' (b), each contrasting 'blue close to red' with 'blue less close to red'.
rsos.royalsocietypublishing.org R. Soc. open sci. 172281
The more refined non-backtracking matrix [36-38], as the name suggests, captures the transition probability between pairs of nodes with the walker forbidden to retrace the previous step in the reverse direction. The non-backtracking matrix has previously been used to identify a measure of centrality that does not suffer from localization for highly connected nodes [36]. A simple measure of node proximity can be established using the non-backtracking matrix: the probability of a non-backtracking walker moving between pairs of nodes in two steps. Note that in every random-walk-based case, these measures of proximity satisfy the expectations in figure 1b (many unshared neighbours reduce $D_{ij}$) but not figure 1a (many shared neighbours increase $D_{ij}$): a walker on blue moves to red in two steps with 50% (100%) probability using the random walk transition matrix (non-backtracking transition matrix) regardless of the number of shared neighbours. It is useful to develop a measure of closeness that incorporates these two (sometimes seemingly contradictory) aspects depicted in figure 1: nodes are close to one another if there are many paths between them, but popular nodes are less close to their neighbours than unpopular nodes.

The GENs: measuring closeness via a weighted harmonic mean
We have recently shown [23] that the GENs $E_{ij}$, describing the topological closeness from node $j$ to node $i$, satisfy the expected properties for the sharing of finite resources described in figure 1. The GENs on a weighted network of $N$ nodes and $M$ non-zero edges are defined as
$$E_{ii} = 0, \qquad \frac{W_j}{E_{ij}} = \sum_{l} \frac{w_{jl}}{E_{il} + w_{jl}^{-1}} \quad (j \neq i), \tag{2.1}$$
where $W_j = \sum_l w_{jl}$ is the total strength of node $j$ and $w_{jl} = 0$ if nodes $j$ and $l$ do not share an edge. This form is chosen such that the node $i$ is as close as possible to itself and that, if $j$ is connected to only one node $k$, $j$'s closeness to $i$ satisfies $E_{ij} \equiv E_{ik} + w_{jk}^{-1}$. If there are multiple paths between nodes, the closeness from $j$ to $i$ is strengthened if there is a direct connection between them but also includes a contribution from all other neighbours of $j$ weighted by their connection strength. By choosing a harmonic mean for the form of the contribution, we bias our measure of closeness towards neighbours that themselves are close to $i$. There is no possibility of zero-valued $E_{ij}$ for $i \neq j$ due to the offset $w_{ij}^{-1}$, avoiding the possibility of a numerical instability [39] due to a vanishing denominator. $E_{ij}$ is thus always smaller for directly connected than indirectly connected nodes, as the contribution from direct connections in equation (2.1) is $w_{ij}^{2}$, strictly greater than $w_{jl}/(E_{il} + w_{jl}^{-1})$ for indirect connections. The GENs are defined using the global topology of the network, and $E_{ij}$ is finite even for nodes $i$ and $j$ in the same component that share no neighbours (as may not be the case for more local measures of closeness [22]).
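As a worked illustration, the asymmetry of the GENs can be derived by hand on the smallest non-trivial case. This is a sketch that assumes the definition takes the form $W_j/E_{ij} = \sum_l w_{jl}/(E_{il} + w_{jl}^{-1})$ with $E_{ii} = 0$ and $W_j = \sum_l w_{jl}$, a form consistent with the single-neighbour and direct-connection properties stated above. The example network, an unweighted path 1-2-3 with node 2 in the middle, is hypothetical.

```latex
\begin{align*}
&\text{Reference node } i = 1:\\
&\quad E_{11} = 0, \qquad E_{13} = E_{12} + 1
   \quad (\text{node 3 has the single neighbour 2}),\\
&\quad \frac{2}{E_{12}} = \frac{1}{E_{11} + 1} + \frac{1}{E_{13} + 1}
   = 1 + \frac{1}{E_{12} + 2}
   \;\Longrightarrow\; E_{12}^{2} + E_{12} - 4 = 0
   \;\Longrightarrow\; E_{12} = \frac{\sqrt{17} - 1}{2} \approx 1.56,\\
&\text{Reference node } i = 2:\\
&\quad E_{21} = E_{22} + 1 = 1
   \quad (\text{node 1 has the single neighbour 2}).
\end{align*}
```

Under these assumptions the end node is closer to the centre ($E_{21} = 1$) than the centre is to the end node ($E_{12} \approx 1.56$), because the centre divides its attention between two neighbours, in line with the expectation of figure 1b.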
In appendix A, we demonstrate a number of features of the GENs when applied to synthetic networks. For homogeneous networks such as the Erdős-Rényi (ER), whose degree distribution is sharply peaked about the mean, the topological closeness between connected nodes is likewise peaked about the mean, which is proportional to the mean degree of the nodes $\langle k \rangle$, while the closeness between disconnected nodes is dominated by the network size $N$. Networks with heterogeneous topologies, such as the Barabási-Albert networks that have a degree distribution of $P(k) \sim k^{-3}$, likewise have a scale-free distribution of the GENs for connected nodes, indicating that the GENs are indeed able to distinguish between distinct network topologies.
The nonlinear form of equation (2.1) makes analytical work intractable in all but the simplest cases, and we must generally resort to numerical work to determine the topological closeness between nodes in a network. $E_{ij}$ can be computed numerically in an iterative fashion [23], with $E_{ij} \equiv E^{(\infty)}_{ij}$ and the recursive definition $W_j/E^{(t+1)}_{ij} = \sum_l w_{jl}/(E^{(t)}_{il} + w_{jl}^{-1})$ (with $E^{(t+1)}_{ii} = 0$ continually enforced). In this paper, the iteration is halted when $\max_{ij} |E^{(t+1)}_{ij}$ [...] $|$ [...] $e = 0.005$. The method also requires an initial guess, [...] ) for dense networks. This scaling is problematic for large dense networks, but the worst-case scaling of $N^3$ is common for many existing measures of centrality [15]. We note that other pairwise measures of proximity (such as resistance distance or the mean first-passage time, MFPT) will generally require a matrix inversion, at a typical cost of $O(N^3)$ and thus comparable to the cost of evaluating the GENs. We also note that the evaluation of the set $\{E_{1,j}\}$ is independent of the evaluation of $\{E_{2,j}\}$, meaning the calculation of the GENs can be parallelized to provide a significant boost in the speed of evaluation.
In addition to other existing measures of proximity that satisfy the expectations of figure 1, there is a great deal of functional freedom in writing equation (2.1). For example, any measure [...] $E_{il} + w_{lk}^{-1}$) will satisfy the desired behaviour depicted in figure 1 for a monotonically decreasing $g(x)$, with $g(x) = x^{-1}$ in the definition of equation (2.1). Another alternative definition replaces the direct weight between adjacent nodes, $w_{lj}^{-1}$, with the closeness, $E_{lj}$, in the denominator of equation ([...]). While these alternative definitions may be of interest in certain contexts, we continue to use equation (2.1) throughout this paper, due to its simplicity and previously demonstrated successes in prediction algorithms [23] and community detection methods [11].
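The iterative evaluation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the recursion $W_j/E^{(t+1)}_{ij} = \sum_l w_{jl}/(E^{(t)}_{il} + w_{jl}^{-1})$ with $E_{ii} = 0$, an initial guess of 1 for every off-diagonal entry, and a stopping rule based on the largest absolute change between iterates. The function name, the initial guess and the stopping rule are all assumptions, and the dense three-index array limits the sketch to small connected networks.

```python
import numpy as np

def generalized_erdos_numbers(w, tol=1e-8, max_iter=10_000):
    """Iterate W_j / E_ij = sum_l w_jl / (E_il + 1/w_jl), with E_ii = 0.

    E[i, j] is the closeness from node j to node i. The network is
    assumed to be connected, so every node has non-zero strength.
    """
    w = np.asarray(w, dtype=float)
    n = w.shape[0]
    strength = w.sum(axis=1)                      # W_j = sum_l w_jl
    with np.errstate(divide="ignore"):
        inv_w = np.where(w > 0, 1.0 / w, np.inf)  # missing edges contribute 0
    E = np.ones((n, n))                           # assumed initial guess
    np.fill_diagonal(E, 0.0)
    for _ in range(max_iter):
        # denom[i, j, l] = E[i, l] + 1 / w[j, l]
        denom = E[:, None, :] + inv_w[None, :, :]
        S = (w[None, :, :] / denom).sum(axis=2)   # S[i, j] = sum over l
        E_new = strength[None, :] / S
        np.fill_diagonal(E_new, 0.0)              # enforce E_ii = 0
        if np.max(np.abs(E_new - E)) < tol:       # assumed stopping rule
            return E_new
        E = E_new
    raise RuntimeError("GEN iteration did not converge")

# Hypothetical example: unweighted path 0 - 1 - 2.
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
E = generalized_erdos_numbers(path, tol=1e-10)
print(E)
```

Each row of `E` (a fixed reference node $i$) is updated independently of the others, which is what makes the row-wise parallelization mentioned in the text possible.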
Variations in the definition of E ij will certainly change the numerical values of the closeness, but the qualitative behaviour of the closeness between nodes is expected to be robust to perturbations of the definition of the GENs.

Erdős centrality and mean importance
The GENs incorporate a simple idea of what is meant by the 'closeness' between nodes in a network where limited resources are shared, and we expect that a node $j$ that is topologically close to node $i$ (having small $E_{ij}$) considers node $i$ to be 'important' in some sense. We may therefore regard the inverse of the closeness between nodes ($c_{ij} = E_{ij}^{-1}$) as an unnormalized personalized measure of importance, allowing a ranking of all nodes in the network from the perspective of the node $j$. Because $c_{ij}$ measures the importance of $i$ from a particular node $j$ (rather than the network at large), it is not equivalent to a centrality measure.
Having defined a pairwise measure of the importance a node $j$ assigns to $i$ using $c_{ij}$, we naturally expect that we can leverage this definition into a global measure of the importance of node $i$. There already exists a wide variety of methods for measuring centrality from a global perspective, including the degree [15,40,41], PageRank [18,41], random walk [13], betweenness [13,15] and non-backtracking [36] centralities. Each measure tends to rank high-degree nodes above low-degree nodes in complex networks, but each takes the global network topology into account in a different way. The importance of global topology is perhaps most clear in betweenness centrality, where high-degree nodes often have high centrality, but nodes of low degree that act as bridges between components of the network may also have high centrality.
To convert our personalized importance measures into a single global measure for an unweighted network, we define $C_i = \sum_{l \in \mathcal{C}_i} c_{il}$ (with $\mathcal{C}_i$ the set of neighbours of $i$) as the sum of the importance the neighbours of $i$ assign to it (akin to the approach of [32]), which we refer to as an Erdős centrality. In figure 2a, we compare $C_i$ to a variety of other measures of centrality for a single realization of a Barabási-Albert network [7] (generated using the algorithm described in appendix B) with $N = 512$ and $\langle k \rangle = 4$. In all cases, there is correlation between these various measures but with differences between the numerical values of the centrality measures for both central and non-central nodes alike. The clear correlation seen here is consistent with other realizations of the BA network and other values of $\langle k \rangle$, and is also seen in ER networks (not shown). Figure 2b,c shows the same data plotted logarithmically for PageRank (b) and the non-backtracking (c) centralities in comparison with $C_i$ for one realization of the network. The degree of each node can contribute significantly to its centrality depending on the measure, and the clustering of the data in figure 2b is driven by nodes with identical degree but different nearby network topologies that lead to differing values for the GENs. Non-backtracking centrality is less dependent on node degree (as evidenced by the lack of clustering), indicating that other topological features of the network are important using this measure.
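The sum over neighbours can be sketched as follows. The function name `erdos_centrality` is an assumption, and the closeness matrix is entered by hand for a hypothetical unweighted three-node path 0-1-2, using values that follow from equation (2.1) under the reading $W_j/E_{ij} = \sum_l w_{jl}/(E_{il} + w_{jl}^{-1})$; in practice `E` would come from the iterative evaluation of the GENs.

```python
import numpy as np

def erdos_centrality(E, adjacency):
    """C_i = sum over neighbours l of i of c_il, with c_il = 1 / E_il."""
    E = np.asarray(E, dtype=float)
    is_neighbour = np.asarray(adjacency) > 0
    with np.errstate(divide="ignore"):
        c = np.where(E > 0, 1.0 / E, 0.0)   # c[i, l]; the diagonal is excluded
    return (c * is_neighbour).sum(axis=1)

# Closeness values for the path 0 - 1 - 2 (assumed form of equation (2.1)).
x = (np.sqrt(17.0) - 1.0) / 2.0             # E_01 = E_21
E_path = np.array([
    [0.0,     x,   x + 1.0],
    [1.0,     0.0, 1.0],
    [x + 1.0, x,   0.0],
])
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
C = erdos_centrality(E_path, path)
print(C)
```

The middle node receives importance 1 from each of its two neighbours, while each end node receives only $1/E_{01} \approx 0.64$ from the middle node, so the ranking agrees with the degree ordering on this small example.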
The clustering of some measures of centrality tends to occur for predominantly low-degree (and thus low-centrality) nodes, and it is preferable [20,42] [...] [19,20]. We compare the Erdős centrality ordering to the other measures of centrality using the fractional intersection between the top-n orderings [43], $l_{XY}(k)$, with $o_X(k)$ the top-k ordering using method $X$. In figure 2d, $l_{XY}(k)$ is plotted for $X = C_i$ and $Y$ the other centrality measures, averaged over 100 realizations of the network. We see that comparison of other measures of centrality to the Erdős centrality exhibits a high degree of overlap at $n = 1$, with a sharp jump in $l$ for $n$ up to about 10 in all measures. Beyond $n$ of about 10, there is a slow variation, but all top-n lists remain similar above 80-90% with the exception of the non-backtracking centrality. Despite their different formulations, the top-n list for $C_i$ compares best to the list from random walk centrality (dashed turquoise line), above 90% for low- and high-degree nodes, indicating that $C_i$ is most closely related to the random walk centrality over all node degrees.
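A fractional intersection between two top-k lists can be computed as below. The explicit formula is not recoverable from the text, so this sketch assumes the common form $|o_X(k) \cap o_Y(k)|/k$; the function name and the example scores are hypothetical.

```python
import numpy as np

def top_k_overlap(scores_x, scores_y, k):
    """Fraction of nodes shared by the top-k lists of two centrality scores."""
    top_x = set(np.argsort(-np.asarray(scores_x, dtype=float), kind="stable")[:k].tolist())
    top_y = set(np.argsort(-np.asarray(scores_y, dtype=float), kind="stable")[:k].tolist())
    return len(top_x & top_y) / k

# Hypothetical centrality scores for five nodes under two measures.
scores_a = [5.0, 4.0, 3.0, 2.0, 1.0]
scores_b = [5.0, 3.0, 4.0, 1.0, 2.0]
overlap_2 = top_k_overlap(scores_a, scores_b, 2)
overlap_3 = top_k_overlap(scores_a, scores_b, 3)
print(overlap_2, overlap_3)
```

The measure ignores the order within each list, so two rankings that swap the second and third nodes disagree at $k = 2$ but agree fully at $k = 3$.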

Importance eigenvector centrality and teleportation in random walks
The Erdős centrality, $C_i = \sum_{l \in \mathcal{C}_i} c_{il}$ described in the previous section, is a natural definition arising from the pairwise importance $c_{ij}$ assigned to node $i$ by all of its direct neighbours. While well correlated with other centrality measures (suggesting its utility), a significant amount of information regarding the global importance is neglected: the values of the importance assigned by nodes that are not directly connected to $i$ are all ignored. This is true of many centrality measures, which generally count the number of direct paths between nodes to identify an overall measure of importance (degree, random walk and betweenness all proceed solely through direct links between nodes). PageRank centrality differs from a purely random-walk-based measure by accounting for indirect links between nodes through the steady-state probability of a Markov process with transition probability $B_{ij} = g a_{ij}/k_i + (1 - g)/N$. In this process, the random walker moves between connected nodes (randomly) with probability $g$, but jumps between disconnected nodes (again, randomly) with probability $(1 - g)$. Finding the leading eigenvector of the matrix $B$ reduces to solving the coupled equations $\mathrm{Pr}_i = g \sum_{j \in \mathcal{C}_i} k_j^{-1} \mathrm{Pr}_j + (1 - g)/N$, with $\mathcal{C}_i$ the set of nodes connected to $i$ (in a directed network, this is the set of nodes with edges directed towards $i$).
In the limit of $g = 0$, $\mathrm{Pr}_i = N^{-1}$ is uniform, as is expected for pure teleportation. In the limit of $g = 1$ (no teleportation), the PageRank equation reduces to $\mathrm{Pr}_i = \sum_{j \in \mathcal{C}_i} k_j^{-1} \mathrm{Pr}_j$, and it is straightforward to see that the ansatz $\mathrm{Pr}_i = k_i/N$ is a solution (as the equation becomes $\mathrm{Pr}_i = a k_i = a \sum_{j \in \mathcal{C}_i} 1$). A uniform probability of teleporting between distant nodes may be an imperfect model for the dynamics of a random walker on a network, and a number of modifications to the PageRank algorithm have been proposed that account for inhomogeneous teleportation probabilities between nodes [44,45] in a variety of contexts.
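Both limits can be checked numerically on a small network. The sketch below builds the PageRank transition matrix from the adjacency matrix and extracts its leading eigenvector with NumPy; the function name `pagerank`, the normalization of the result to unit sum, and the four-node example (a triangle with one pendant node) are assumptions made for illustration.

```python
import numpy as np

def pagerank(adjacency, g):
    """Steady state of the walk with B_ji = g * a_ji / k_j + (1 - g) / N."""
    a = np.asarray(adjacency, dtype=float)
    n = a.shape[0]
    k = a.sum(axis=1)                          # node degrees
    # M[i, j] is the probability of moving from j to i (columns sum to 1).
    M = g * (a / k[:, None]).T + (1.0 - g) / n
    vals, vecs = np.linalg.eig(M)
    lead = np.real(vecs[:, np.argmax(np.real(vals))])
    return lead / lead.sum()                   # normalise to a probability

# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to 0.
graph = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
degrees = graph.sum(axis=1)
pr_walk = pagerank(graph, 1.0)       # no teleportation
pr_teleport = pagerank(graph, 0.0)   # pure teleportation
print(pr_walk, pr_teleport)
```

With the result normalized to sum to one, the $g = 1$ solution is proportional to the degree, $k_i/\sum_j k_j$, and the $g = 0$ solution is uniform.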
A similar Markov process strongly related to the PageRank algorithm can be defined using personalized importance: a random walk performed with a transition probability matrix $B'$ whose elements are proportional to the personalized importance, $p_{j\to i} \propto c_{ij} = E_{ij}^{-1}$ [...] (with $B'_{ii} = 0$, meaning the walker never remains at $i$). This process has an interpretation similar to that of PageRank: the most probable transition for a walker at node $j$ passes through direct connections (moving to $i$ with $w_{ij} > 0$), but the walker has a non-zero probability of jumping to a disconnected node. Unlike the PageRank methodology, a walker in this process has a non-uniform probability of choosing to move along an edge versus teleportation.
As an example of the heterogeneity of the teleportation in this process, a node $i$ with degree $k = 1$ in an unweighted network will have a most probable transition to its sole neighbour (with the greatest importance $j$ assigns going to $i$, with $c_{ij} = 1$). However, the total probability of teleporting (moving from $i$ to a node without a direct connection) is $p_{\mathrm{teleport}} \equiv$ [...]. In appendix A, we show that the average closeness felt between disconnected nodes in a large network scales as $E_d \sim N^{1/2}$, which suggests that $(\sum_{l \neq i} c_{li})^{-1} \sim N^{-1/2}$. This indicates that walkers at low-degree nodes will usually teleport to more important nodes in the network (as $p_{\mathrm{teleport}} \sim 1 - 1/\sqrt{N} \approx 1$ for large $N$). Teleportation between distant nodes in the network will be highly heterogeneous in this walk, and we expect it to have a significant contribution to the centrality for large networks with low-degree nodes.
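The walk with importance-weighted transitions can be sketched numerically. The code below assumes that a walker at $j$ moves to $i \neq j$ with probability $c_{ij}/\sum_{l \neq j} c_{lj}$, which is one normalization consistent with $p_{j\to i} \propto E_{ij}^{-1}$; the exact normalization used in the paper is not recoverable from the text. The function name is hypothetical, and the closeness matrix is entered by hand for an unweighted three-node path 0-1-2 under the reading $W_j/E_{ij} = \sum_l w_{jl}/(E_{il} + w_{jl}^{-1})$ of equation (2.1).

```python
import numpy as np

def importance_walk(E):
    """Return (T, g): T[i, j] = P(j -> i) proportional to 1 / E_ij, and the
    steady-state probabilities g (leading eigenvector of T, unit sum)."""
    E = np.asarray(E, dtype=float)
    with np.errstate(divide="ignore"):
        c = np.where(E > 0, 1.0 / E, 0.0)      # c[i, j] = 1 / E_ij, c_ii = 0
    T = c / c.sum(axis=0, keepdims=True)       # normalise each column j
    vals, vecs = np.linalg.eig(T)
    lead = np.real(vecs[:, np.argmax(np.real(vals))])
    return T, lead / lead.sum()

# Closeness values for the path 0 - 1 - 2 (assumed form of equation (2.1)).
x = (np.sqrt(17.0) - 1.0) / 2.0
E_path = np.array([
    [0.0,     x,   x + 1.0],
    [1.0,     0.0, 1.0],
    [x + 1.0, x,   0.0],
])
T, g_steady = importance_walk(E_path)
p_teleport_end = T[2, 0]   # walker at end node 0 jumps to unconnected node 2
print(T[1, 0], p_teleport_end, g_steady)
```

Even on this tiny network the end node teleports with a probability of roughly 0.28, whereas the middle node has no unconnected node to teleport to, which illustrates the node-dependent teleportation described above.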
The leading eigenvector of the matrix $B'$ can be compared to that of the PageRank transition probability matrix $B$, which has a uniform probability of teleporting to any node in the network (regardless of the network topology). In figure 3a, we show the steady-state probability of being found at a node $i$ for the random walker in this process, computed from the leading eigenvector of $B'$ with elements $g_i$, termed importance eigenvector centrality in this paper. A clear correlation with the degree centrality is observed, with the solid line indicating a scaling of $g_i \propto k_i^{a_g}$ for $a_g \approx 0.55$. A similar quality of fit is found for larger $N$ (discussed further below) as well as for the ER networks (not shown). Excellent agreement is found for high-degree nodes (as was the case in §3.1 for the Erdős centrality), with deviations occurring primarily for low-degree nodes that are clustered based on the node's degree. For all nodes of a fixed degree $k$, PageRank will tend to give a higher centrality to those nodes that are connected to high-degree hubs. By contrast, importance eigenvector centrality $g_i$ will tend to give a lower centrality, as the hub's attention is divided among many nodes and it assigns a lower importance to its neighbours. This effect produces the downward slope in the clusters of data in figure 3a, and is more pronounced for low-degree nodes.
Figure 3. Importance eigenvector centrality $g_i$ extracted from the transition matrix defined by pairwise importance $B'$. (a) Shown are 10 realizations of BA networks with $N = 512$ nodes: $\langle k \rangle = 20$ (red) and $\langle k \rangle = 4$ (blue). An approximate scaling of $g_i \propto k_i^{a_g}$ is observed, with the best fit of $a_g = 0.55$ for the different ensembles. The behaviour of ER networks is similar, but with greater clustering of the observed PageRank values (not shown). (b) Comparison of the importance eigenvector centrality $g_i$ with PageRank at $g = 1$ (filled circles, pure random walk) and $g = 0.85$ (empty circles, 15% teleportation probability) for the largest connected component of the political blogs network [46]. The dashed line shows a scaling of $g_i^2 \approx \mathrm{Pr}_i$. Disagreement between the two methods in PageRank's teleportation parameter primarily affects the ordering of low-degree nodes, which become more homogeneous for increasing $g$.
The relationship between PageRank and the importance eigenvector centrality $g_i$ persists even for real-world networks with neither a homogeneous nor scale-free degree distribution, such as the lognormally distributed 2004 political blogs network [46]. In this network, each node is a liberal or conservative blog in the lead-up to the 2004 presidential election and each edge indicates a link between the blogs. In order to implement the GENs in equation (2.1) on this network, we converted the network from a directed network (where $w_{ij} \neq w_{ji}$) to an undirected network (where $w_{ij} = \max(w_{ij}, w_{ji})$) and retained only the largest connected component of 1222 nodes. In figure 3b, we see that $g_i^2$ and $\mathrm{Pr}_i$ are both highly correlated with the degree centrality ($R^2 = 0.999$ and 0.982, respectively), indicating that both measures are dominated by node degree rather than other details of the network topology (as was the case in the BA networks in figure 3a). In the case of PageRank, this is due to the fact that hubs are connected to low-degree nodes, so walkers on low-degree nodes tend to move towards high-degree nodes if they do not teleport (occurring 85% of the time). In the case of importance eigenvector centrality, the model is entirely different: with more than 90% probability, walkers on low-degree nodes ($k \lesssim 10$) will teleport, but preferentially teleport to high-degree nodes. Despite the different dynamics in the walks, the steady-state probability of arriving at any node is nearly identical in both cases.

SIR model on an ER network
The spreading of an epidemic has been studied by many authors and in a wide range of contexts [16,17,47-49], with the susceptible-infected-recovered (SIR) model being one of the simplest and most commonly used models. The SIR model assumes that a population of susceptible individuals becomes infected due to interactions with previously infected individuals, and infected individuals may recover and become non-infectious. A simple schematic of the SIR model is shown in figure 4a, with infections occurring at a constant rate, $r_I$, due to direct interactions between individuals, and recovery at a constant rate, $r_R$. A number of more complex models have been considered extensively for a homogeneously mixed population of individuals [49], but non-uniform interactions between individuals, represented by networks, can have a profound impact on the dynamics of epidemic spreading in the SIR model [4,16,17]. The existence of epidemic thresholds [4,50] for homogeneous networks (or the lack thereof for scale-free networks [16]) is a well-studied global quantity of interest [51], while more local quantities such as the probability of a particular node $i$ becoming infected, sparking an epidemic [52], and quarantine or immunization strategies [48,53] have also been examined.
While it is clearly useful to understand the global properties of the epidemic (such as the expected number of infected individuals), a particular individual $j$ may also be interested in its own probability of becoming infected given the origin of the disease, and may reasonably be less concerned if no neighbours are infected than if many neighbours are infected. However, it is not straightforward to analytically calculate how long the disease will take to reach $j$ from any point in the network, and it would be useful to have a measure for how 'close' the epidemic is to an individual node. If the infection begins with a single node $i$, we expect that the disease will more rapidly propagate to nodes for which $i$ is topologically close, and it is therefore worthwhile to compare the pairwise infection times (infection time of node $j$ given an initial infection at $i$) with measures of topological closeness, such as the resistance distance $R_{ij}$, the MFPT in a random walk $t_{ij}$, and the GENs $E_{ij}$. PageRank and betweenness are single-node properties (not properties of a pair) and cannot be used for comparison. The resistance distance and MFPT in a random walk can be computed directly from the graph Laplacian $L$ [14,15].
To see the relationship between infection time and topological closeness, we simulate an SIR epidemic (diagrammed in figure 4a), using Gillespie dynamics [54] on an ER graph (with a uniform probability of connection and mean degree $\langle k \rangle = 4$ or $\langle k \rangle = 20$) and $N = 512$. The infection rate is fixed at $r_I = 1$ and the recovery rate $r_R$ is varied, but always above the epidemic threshold [4,16], $r_I > r_R/\langle k \rangle$. Even above the epidemic threshold, the disease may stochastically die off, and we take the pairwise infection time to be the harmonic mean of the infection time of a node $j$ given an initial infection at $i$ over all of the simulations, $h_{ij}^{-1}$ [...], with $K_i$ simulations initiated at site $i$ for each $r_R$.
To compute the infection time $h_{ij}$ between all nodes, $K_i = 100$ simulations were run for every node $i$ being the sole infected node at $t = 0$.
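The simulation protocol can be sketched with a compact Gillespie loop. This is an illustration under stated assumptions rather than the authors' code: each susceptible-infected edge transmits at rate $r_I$, each infected node recovers at rate $r_R$, and the harmonic mean is taken as $h_{ij}^{-1} = K_i^{-1} \sum_n 1/t^{(n)}_{ij}$, so that runs in which $j$ is never infected contribute zero. The function names and the six-node ring are hypothetical.

```python
import math
import random

def sir_first_infection_times(neighbors, source, r_I, r_R, rng):
    """One Gillespie run; returns each node's first infection time (inf if never)."""
    n = len(neighbors)
    state = ["S"] * n
    state[source] = "I"
    times = [math.inf] * n
    times[source] = 0.0
    infected = [source]
    t = 0.0
    while infected:
        # Every susceptible-infected edge is an independent infection channel.
        si_edges = [(i, j) for i in infected for j in neighbors[i] if state[j] == "S"]
        rate_inf = r_I * len(si_edges)
        rate_rec = r_R * len(infected)
        total = rate_inf + rate_rec
        if total == 0.0:
            break                                  # nothing left to happen
        t += rng.expovariate(total)                # waiting time to next event
        if rng.random() * total < rate_inf:
            _, j = rng.choice(si_edges)            # infection event
            state[j] = "I"
            infected.append(j)
            times[j] = t
        else:
            i = rng.choice(infected)               # recovery event
            state[i] = "R"
            infected.remove(i)
    return times

def harmonic_mean_times(runs):
    """h_j with 1/h_j equal to the mean of 1/t_j over runs (1/inf counts as 0)."""
    result = []
    for j in range(len(runs[0])):
        ts = [run[j] for run in runs]
        if any(t == 0.0 for t in ts):
            result.append(0.0)                     # the source node itself
            continue
        inverse = sum(1.0 / t for t in ts) / len(ts)
        result.append(math.inf if inverse == 0.0 else 1.0 / inverse)
    return result

# Hypothetical example: ring of six nodes, no recovery, 100 runs from node 0.
ring = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
rng = random.Random(0)
runs = [sir_first_infection_times(ring, 0, 1.0, 0.0, rng) for _ in range(100)]
h = harmonic_mean_times(runs)
print(h)
```

Using the harmonic rather than the arithmetic mean keeps the estimate finite when some outbreaks die out before reaching the target node.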

Comparing topological closeness with infection time
The infection time can be compared to a variety of measures of topological closeness, and in this section we focus on the GENs ($E_{ji}$), the MFPT in a random walk ($t_{ij}$) and the resistance distance ($R_{ij}$). Infection that originates at a high-degree node ($i$) will rapidly spread throughout the network, but infections starting at a low-degree node will tend to spread only locally until a high-degree node is encountered. We thus expect the rate of infection of a non-nearest neighbour ($j$) of the initial infection site $i$ to be positively correlated with its topological closeness using all three measures.
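For reference, the two comparison measures can be computed as sketched below. The resistance distance uses the standard pseudo-inverse expression $R_{ij} = L^+_{ii} + L^+_{jj} - 2L^+_{ij}$; the MFPT is obtained here by solving the first-step equations $t_{ij} = 1 + \sum_k P_{i\to k} t_{kj}$ for each target $j$, which is a simple route chosen for clarity and not necessarily the one used in the paper. Function names and the star-shaped example are assumptions.

```python
import numpy as np

def resistance_distance(adjacency):
    """R_ij from the Moore-Penrose pseudo-inverse of the graph Laplacian."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    lp = np.linalg.pinv(laplacian)
    d = np.diag(lp)
    return d[:, None] + d[None, :] - 2.0 * lp

def mean_first_passage_times(adjacency):
    """T[i, j] = mean number of steps for a walker starting at i to first reach j."""
    a = np.asarray(adjacency, dtype=float)
    n = a.shape[0]
    P = a / a.sum(axis=1, keepdims=True)
    T = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        # First-step analysis: (I - P restricted to non-target nodes) t = 1.
        system = np.eye(n - 1) - P[np.ix_(keep, keep)]
        T[keep, j] = np.linalg.solve(system, np.ones(n - 1))
    return T

# Hypothetical example: a star with hub 0 and three leaves 1, 2, 3.
star = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
R = resistance_distance(star)
T = mean_first_passage_times(star)
print(R[0, 1], T[0, 1], T[1, 0])
```

On the star, the resistance distance between hub and leaf is symmetric, while the MFPT is not: a walker on a leaf reaches the hub in one step, but a walker on the hub needs five steps on average to reach a particular leaf.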
In figure 4b-e, we compare $h_{ij}$ in a network with $N = 512$ and $\langle k \rangle = 4$ to $E_{ji}$ (b, d) and $t_{ij}$ (c, e), normalized by $\langle E \rangle = N^{-2} \sum_{ij} E_{ij}$ (since the GENs do not contain any dynamic information and the numerical values are thus arbitrary) and $\langle t \rangle = N^{-2} \sum_{ij} t_{ij}$ (for comparison with the GENs), respectively. The figures show a random sample of 20 target nodes $j$ with $k_j > 4$ (for which there is a consistent relationship for $\langle k \rangle = 4$, discussed further in appendix C). As expected, infection times of non-nearest neighbours are lowest for nodes that are topologically close (low $E_{ij}$ or $t_{ij}$), with the lines showing an empirical power-law fitting of $h_{ij} \propto x_{ij}^{a_x}$ for $x = E$ or $t$. The exponent is non-universal, depending on $N$, $\langle k \rangle$ and the recovery rate. It is apparent that the fit using the GENs is more robust than the MFPT, due to the clustering of $t$ (akin to the degree-driven clustering in figure 2b) with larger variation in $h_{ij}$ for a given value of $t_{ij}$ than is seen for $E_{ji}$. This is driven by the fact that $t_{ij}$ is much more strongly correlated with the degree of the target node $j$ than is $h_{ij}$ (shown in appendix C). The comparison of $h_{ij}$ with $R_{ij}$ has a trend similar to $t_{ij}$, and is not shown in the figure.
The quality of the fit between the infection time $h_{ij}$ and any of the measures of closeness $x_{ij}$ is shown in figure 4f using the standard deviation of the residuals, $s_x^2 = N^{-1} \sum_i (h_{ij} - c x_{ij}^{a})^2$, for the power-law best fit $h_{ij} = c x_{ij}^{a}$. The mean of the residuals $m = N^{-1} \sum_i (h_{ij} - c x_{ij}^{a})$ generally satisfies $|m| \lesssim 10^{-3}$ for all measures at all $r_R$. Figure 4f shows that all closeness measures perform worse when $r_R$ increases, due to the fact that node recovery is independent of the network topology. The figure also clearly demonstrates that the GENs are a significantly better predictor of the infection time than either the MFPT or resistance for spreading on an ER network, indicating that they correspond to a relevant measure of topological closeness that has an impact on the spreading process. For an ER network with $\langle k \rangle = 20$, all nodes have degree $k > 4$ with high probability, and in this case the results are consistent with those pictured in figure 4b-f without restriction on the degree. For $\langle k \rangle = 20$, we find that $s_x$ increases overall for each measure of proximity (all on the order of $s_x \approx 0.3$-$0.4$ for $r_R/r_I \approx 0$), as shown in appendix C. Consistent with the behaviour in figure 4, $s_E$ is lower than $s_t$ and $s_R$ for non-zero $r_R/r_I$, indicating that the GENs remain a better predictor overall than resistance distance or MFPT.
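The residual statistics can be sketched as follows. The fitting procedure used in the paper is not stated at this point, so this example assumes a least-squares straight-line fit in log-log space, which is only one of several ways to fit $h = c x^{a}$; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def power_law_fit(x, h):
    """Fit h = c * x**a in log-log space; return (a, c, mean residual, s_squared)."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    a, log_c = np.polyfit(np.log(x), np.log(h), 1)   # slope, intercept
    c = np.exp(log_c)
    residuals = h - c * x ** a
    return a, c, residuals.mean(), np.mean(residuals ** 2)

# Synthetic, noise-free check: h = 3 * sqrt(x).
x_demo = np.array([1.0, 2.0, 4.0, 8.0])
h_demo = 3.0 * np.sqrt(x_demo)
a_fit, c_fit, m_fit, s2_fit = power_law_fit(x_demo, h_demo)
print(a_fit, c_fit, m_fit, s2_fit)
```

A log-space fit weights relative rather than absolute errors, so its residual statistics can differ from those of a direct nonlinear fit on noisy data.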

Random walks and the GENs
A surprising feature of figure 4 is the significant difference between the accuracy of $E_{ji}$ and $t_{ij}$ in predicting the infection time. Based on the good agreement between the importance centrality $C_i$ and random walk centrality $c_i$ in figure 2d, one might have expected to find consistency between the GENs and the MFPT in a random walk. Random walk centrality is defined based on the differences in MFPT [13], with $t_{ij} - t_{ji} = c_j - c_i$, rather than the particular values of $t_{ij}$ themselves. The MFPTs are asymmetric ($t_{ij} > t_{ji}$ if $i$ is more easily reached than $j$), as it is easier to reach a high-degree node than a low-degree node, with a similar behaviour for the GENs (with $E_{ji} > E_{ij}$ if $i$ is topologically closer to $j$ than $j$ is to $i$). This suggests a comparison of the asymmetry between the two measures that could explain their agreement in figure 2d. In figure 5, we compare $\Delta E_{ij} = E_{ij} - E_{ji}$ to the difference in the MFPT between nodes $\Delta t_{ij} = t_{ij} - t_{ji}$ for an ER network with various $N$ and $\langle k \rangle$. The asymmetry in the MFPT is highly correlated with the asymmetry in the GENs, with an empirical scaling of $\Delta t_{ij} \approx -\Delta E_{ji} \sqrt{aN}$ and $a \approx 4$ (determined using Mathematica's FindFit function [...]

Conclusion
In this paper, we have shown the utility of the GENs in measuring a non-metric topological closeness between nodes in complex networks lacking a well-defined distance metric. Derived from simple principles based on a conceptual picture of nodes sharing finite resources, the GENs incorporate the global topology of the network into a pairwise measure of closeness for connected and disconnected nodes alike. Other non-local pairwise measures can be found in the literature (e.g. the MFPT in a random walk or resistance distance between nodes), and we have shown that the GENs are able to describe the structure of and dynamics on networks in a manner consistent with or outperforming these existing measures.
The utility of the GENs was first demonstrated by identifying two potential measures of centrality derived from the GENs that identify important nodes in heterogeneous networks consistent with existing methods. The Erdős centrality defines centrality in terms of the importance assigned by nearest neighbours and is appropriate for unweighted networks. An alternative measure of centrality that takes the importance assigned between all node pairs $i$ and $j$ into account arose from a novel definition of a random walk with teleportation: the importance eigenvector centrality was defined as the steady state probability of being found in a node $i$ in a walk with transition probabilities $p_{j\to i} \propto E_{ij}^{-1}$. This is conceptually related to the teleportation probability in PageRank, but with our eigenvector centrality having an inhomogeneous teleportation probability depending on the importance of each node. In both cases, we showed that these centrality measures are consistent with existing approaches despite their very different origins.
The GENs were further shown to be useful in quantifying the impact of the network topology on the dynamics of epidemic spreading on an ER network. Nodes that are disconnected but topologically close in a network should more quickly spread the infection between each other than nodes that are distant. While the resistance distance and MFPT in a random walk are both positively correlated with infection time (as expected), the GENs are an overall better predictor for high-degree nodes. We note that the dynamics of the SIR model were not chosen to match the dynamics of the epidemic spreading, as the SIR model does not have a finite resource shared between nodes (as each node can infect all of its neighbours with equal rate). The GENs are expected to perform well on predicting the infection risk of nodes for other disease models in which the process of infecting one node may reduce the infection rate of other neighbours. Taken together, the quality of the centrality measures and the correlation with dynamical processes on networks suggest that the GENs are a meaningful measure of topological proximity and may be of potential benefit in a variety of contexts.

Competing interests. We have no competing interests.

Funding. We do not acknowledge any specific funding source for this work.

Acknowledgements. We thank O. Peleg and G. Strang for their useful comments on the manuscript. We also thank anonymous reviewers whose helpful comments significantly improved the paper.

A.1. Homogeneous networks of small diameter
While equation (2.1) is exactly solvable only for the simplest network topologies, the general properties of the GENs can be explored for sufficiently homogeneous networks. The unweighted ER networks have a degree distribution sharply peaked about the mean ($k_i \approx \langle k\rangle$, where $k_i$ is the degree of the node $i$ in an unweighted network), and we expect the closeness between nodes will still be broadly distributed due to the complex network topology. The mean closeness between nodes can be derived by assuming that $E_{ij} = E_c$ (the 'typical connected' closeness) if $i$ and $j$ are connected, and $E_{ij} = E_d$ (the 'typical disconnected' closeness) if they are not directly connected. In an unweighted regular network, with all nodes having the same degree $k_i = k$, it is possible to examine the mean closeness between connected and disconnected nodes using the GENs. For homogeneous degree distributions such as the ER networks, we expect the approximation $k_i \approx \langle k\rangle$ to be reasonable, with fluctuations in the degree expected to have a relatively minor impact, particularly for high mean degree. For these homogeneous networks, we assume that nodes that are directly connected have a typical closeness $E_c$ between each other, and another closeness $E_d \geq E_c$ to nodes that are not. If $i$ and $j$ are directly connected, they have on average $(k-1)^2/(N-2)$ neighbours in common (since both have exactly $k$ edges, one of which connects to the other), and they have $k^2/(N-2)$ neighbours in common on average if they are not connected. A mean field approximation will treat connected (disconnected) nodes as having a fixed closeness $E_c$ ($E_d$) between each other, and split the sum in equation (2.1) into two parts: a sum over nodes neighbouring both $i$ and $j$, and a sum over nodes only connected to $j$.
This gives the approximate equations (A 1) and (A 2) for an unweighted network of constant degree. It is possible to solve for $E_c$ exactly in terms of $k$, $N$ and the unknown $E_d$ (equation (A 3)).

Substitution of equation (A 3) into equation (A 1) and collecting terms implies that
An exact solution to this is not enlightening, but in the limit of $N \to \infty$ an asymptotic solution can be found. $E_d$ cannot be independent of $N$ in the limit of $N \to \infty$, else $E_d$ would be imaginary. Rather, $E_d$ must be an increasing function of $N$, implying that the highest order terms must have the same scaling, with $E_d^4 \sim N E_d^2$ for large $N$. Then we expect $E_d \sim N^{1/2}$ to leading order, and we find an asymptotic expression for $E_d$ at large $N$. Comparing this expression to the numerical solution of the equation shows less than 1% deviation for $N \geq 1000$ and $k \leq 300$, suggesting the truncation to terms of order $O(N^0)$ is sufficient for large $N$ over a wide range of $k$. A good approximation for $E_c$ can be found by setting $k = \kappa N$ and taking the limit of $\kappa \to 0$. We find an expression for $E_c$ in which the latter form is the scaling for sufficiently large $N \gg k^4$. Note that this scaling does not emerge immediately: even for $N \sim 10^4$, higher order terms can contribute in the series for only moderate values of $k$, and the full expression is required to obtain an accurate estimate for finite size networks.
In an alternative limit of $N \to \infty$ but with $\kappa = k/N$ finite (i.e. a large, densely connected ER network), we find the connected GENs scale as $E_c \sim \sqrt{N} - 1/\kappa + O(N^{-1/2})$, converging on the value for disconnected nodes, $E_d \sim \sqrt{N} + 1$, but remaining closer to zero. This scaling is consistent with that for a fully connected network [23], indicating (unsurprisingly) that a dense random network is structurally similar to a fully connected one.

A.2. Large diameter networks
This simple two-state approximation in equations (A 1) and (A 2) assumes that all nodes not directly connected to $i$ are identical, a reasonable assumption only in the case of networks with a very small diameter. As ER networks have diameter [56] $D \approx \log(N)/\log(k)$, each disconnected node is to a good approximation only a distance 2 away from $i$ for the networks with $\langle k\rangle = 20$ considered in figure 2b,d. The approximation in equations (A 1) and (A 2) is poorly satisfied for $\langle k\rangle = 4$, where the diameter is larger and fluctuations in the degree of each node are of much greater importance due to the smaller mean degree. This heterogeneity in disconnected nodes may be important for networks with small $\langle k\rangle/N$ due to the larger diameter, and in the same spirit as equations (A 1) and (A 2) we define $e_x$ to be the mean value of the GENs from a node $j$ a distance $x$ from node $i$ (so $e_1 \approx E_c$ in equation (A 5)). For $\langle k\rangle \ll N$, we can write an approximate relation for $e_x$ (equation (A 6)), with $e_0 \equiv 0$ and where $n_x$ is the average number of nodes a distance $x$ from node $i$. The first term accounts for the fact that a node a distance $x$ from $i$ must be connected to at least one node a distance $x-1$ from $i$, by definition, and the second term accounts for the other potential connections: those a distance $x-1$, $x$ or $x+1$ from $i$. Note that these are the only possible connections for a node a distance $x$ from $i$, which can be connected to (a) more than one node a distance $x-1$ from $i$, (b) other nodes a distance $x$ from $i$, or (c) any number of nodes a distance $x+1$ from $i$. In the limit of large $N$ for small $k/N$, $n_x \approx N[1 - (1 - k/N)^{n_{x-1}}] \approx n_{x-1}k$, implying that $n_x \approx k^x$ for connected or disconnected nodes with sufficiently small $k/N$. Substitution of $n_x = k^x$ into equation (A 6) and taking the ansatz $e_x \sim \exp(\lambda x)$ readily shows that $\lambda \sim \log(k)$ for sufficiently large $k$ (still constraining $k/N \ll 1$).
The GENs thus grow exponentially for small $x$, a scaling similar to that of the GENs on a tree [23]. We empirically find that the growth of the GENs saturates for sufficiently large $x$ (as was observed in tree networks of finite size [23]), no longer satisfying the exponential growth of $e_x \sim k^x$. For nodes at the diameter of the network ($x = D$) with $n_{D+1} \equiv 0$, equation (A 6) implies that $e_D \approx e_{D-1} + (k+1)/2 + O(e_{D-1}^{-1})$, taking the limit of $e_{D-1} \gg 1$. In order to determine the behaviour of the GENs for a pair of nodes separated by $x = D - l$ for some $l \ll D$, we take the ansatz that $e_{D-l} \approx e_{D-l-1} + \xi_l$, where $\xi_l$, the difference between $e_{D-l}$ and $e_{D-l-1}$, is a function of $l$ assumed small relative to $e_{D-l}$. Substituting into equation (A 6) and in the limit of large $e_{D-l-1}$, we find an asymptotic relationship for $\xi_l$. For large $k$ (but still satisfying $k \ll N$), this implies $\xi_l \approx k(\xi_{l-1} + 1)$, and with $\xi_0 \approx k/2$ we find $\xi_l \approx k^{l+1}/2$. Asymptotically then, $e_x \approx e_{x-1} + k^{D-x+1}/2$ for $x$ sufficiently close to $D$. The exponential growth for small $x$ is therefore converted to a saturation when $k^x \sim k^{D-x+1}$, or when $x \approx (D+1)/2$.
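The step from the recurrence to its closed form can be checked numerically. The sketch below assumes the recurrence reads $\xi_l \approx k(\xi_{l-1} + 1)$ with $\xi_0 = k/2$ (our reading of the text above), and the function name `xi_sequence` is hypothetical. It iterates the recurrence so the result can be compared with the stated large-$k$ form $k^{l+1}/2$.

```python
def xi_sequence(k, l_max):
    """Iterate xi_l = k * (xi_{l-1} + 1) starting from xi_0 = k / 2.

    Assumption: this is the recurrence implied by the text; the closed form
    it is compared against is k**(l + 1) / 2.
    """
    xi = [k / 2.0]
    for _ in range(l_max):
        xi.append(k * (xi[-1] + 1.0))
    return xi

# Ratio of the iterated value to the closed form for a large-degree example.
k = 100
ratios = [value / (k ** (l + 1) / 2.0) for l, value in enumerate(xi_sequence(k, 5))]
```

For this example the ratios stay within a few per cent of 1, with a relative correction of order $2/k$, which is why the closed form is only stated for large $k$.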
For large $N$ and assuming $\log(k) \gg 1$ while still satisfying $k \ll N$, a continuum approximation for the mean value of the disconnected GENs is determined by dividing the predicted GENs into exponential growth for $d \lesssim D/2$ and a constant term for $d \gtrsim D/2$. In the resulting estimate, $D$ is the expected graph diameter for an ER network, $n_l \approx e^{kl}$ is the approximate number of nodes a distance $l$ from $i$, and $\lambda \approx \log(k)$ is the asymptotic growth rate of the GENs before saturation. This leads to a scaling law in agreement with the scaling for the two-state results, even in the limit of large $D$. Equations (A 1) and (A 2) approximate the mean of the nonlinear terms by the function evaluated at the mean. For any sequence $\{x_l\}$ with small variance $\sigma_x$, we expect that the approximation underlying equations (A 1) and (A 2) tends to overestimate the value of the mean of $\langle E^{-1}\rangle = \langle E\rangle^{-1} + \sigma_E^2/\langle E\rangle^3 \geq \langle E\rangle^{-1}$, and thus our predicted value of $\langle E_c\rangle$ is expected to be an underestimate (with a similar argument true for $E_d$). We emphasize that these limits are valid only for $1 \ll \langle k\rangle \ll N$, and these simplified models cannot accurately capture the statistics of low-degree networks for which the neighbour statistics cannot be captured by a simple mean value.
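The second-order expansion $\langle E^{-1}\rangle \approx \langle E\rangle^{-1} + \sigma_E^2/\langle E\rangle^3$ used above can be illustrated on any narrowly distributed sample. The sketch below is only a numerical check of that expansion; the function name `inverse_mean_comparison` and the sample values are hypothetical and are not GENs computed from a network.

```python
import numpy as np

def inverse_mean_comparison(E):
    """Compare <1/E> with 1/<E> and with the second-order expansion.

    Returns (exact, naive, second_order), where
      exact        = <1/E>
      naive        = 1/<E>            (function evaluated at the mean)
      second_order = 1/<E> + var(E) / <E>**3
    """
    E = np.asarray(E, dtype=float)
    mean_E = E.mean()
    var_E = E.var()
    exact = np.mean(1.0 / E)
    naive = 1.0 / mean_E
    second_order = naive + var_E / mean_E**3
    return exact, naive, second_order

# Hypothetical sample: values spread uniformly in a narrow band around 10.
exact, naive, second_order = inverse_mean_comparison(np.linspace(9.5, 10.5, 1001))
```

For positive values the mean of the inverse is never smaller than the inverse of the mean, and for a small variance the gap is well captured by the $\sigma_E^2/\langle E\rangle^3$ term.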

A.3. Simulated distributions of the GENs for ER networks
In figure 6, we show the distribution of the GENs for ER networks with $N = 512$ and $1024$ and with $\langle k\rangle = 4$ and $20$. In figure 6a,b we see that changing $\langle k\rangle$ radically alters the mean values of $E_{ij}$ as well as the shape of the distributions, while changing $N$ only marginally affects the distribution of the connected GENs, shifting the peak a small amount while retaining a similar functional form. For $\langle k\rangle = 4$ the distribution of $E_{ij}$ exhibits multiple peaks in figure 6a, with each local maximum corresponding to a different degree of the node $j$ and with the width of the distribution about the peak coming from differing degrees of the node $i$. Such heterogeneity is less apparent for high-degree nodes (figure 6b), where fluctuations in the degree of $i$ or $j$ have less of an impact on the GENs, and the distributions are unimodal. For disconnected nodes, the distributions have a single dominant peak (figure 6c,d), and the location of the peaks is well predicted by equations (A 4) and (A 5) for $\langle k\rangle = 20$. Owing to the significance of degree fluctuations for the smaller $\langle k\rangle = 4$, there are large differences between the predicted and observed means.
The growth in the mean value for the disconnected GENs for increasing $N$ is due to the increasing sparsity of the network. Each node still has $\langle k\rangle$ neighbours on average, but a pair of nodes has only $\approx \langle k\rangle^2/N$ neighbours in common for large $N$. As the size of the network increases, there will be fewer shared neighbours and the nodes will tend to be less close to one another. This has a marginal effect on the closeness between nodes that share a direct link (for which we expect $E_c \sim k$ for large $N$), but has a significant effect on disconnected nodes (for which $E_d \sim \sqrt{N}$). In the limit as $N \to \infty$ and for fixed $k$, an ER network will drop below the percolation threshold (with $\langle k\rangle/N < 1$) and become a set of small components; in this limit the approximations underlying equations (A 1) and (A 2) break down. For a fixed attachment probability $\langle k\rangle = pN$ with $N \to \infty$, we expect the homogeneity conditions required for equations (A 1) and (A 2) to remain valid, and thus that $E_c \sim \sqrt{N}$.
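The estimate of $\approx \langle k\rangle^2/N$ shared neighbours per pair can be checked on a simulated ER network. The sketch below is illustrative only: the function name `mean_common_neighbours`, the seed and the network size are assumptions, and the graph is sampled by drawing each edge independently with probability $\langle k\rangle/(N-1)$.

```python
import numpy as np

def mean_common_neighbours(N, mean_degree, seed=0):
    """Average number of shared neighbours over all node pairs of an ER graph.

    Returns (observed, predicted) with predicted = mean_degree**2 / N.
    """
    rng = np.random.default_rng(seed)
    p = mean_degree / (N - 1)                      # edge probability
    upper = np.triu(rng.random((N, N)) < p, k=1)   # sample each pair once
    A = (upper | upper.T).astype(int)              # symmetric adjacency, no self-loops
    common = A @ A                                 # common[i, j] = shared neighbours of i and j
    off_diagonal = ~np.eye(N, dtype=bool)          # exclude i == j (that entry is the degree)
    return common[off_diagonal].mean(), mean_degree**2 / N

observed, predicted = mean_common_neighbours(400, 20, seed=1)
```

Doubling $N$ at fixed mean degree roughly halves both numbers, which is the increasing sparsity the paragraph above refers to.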

B.1. Generation of Barabási-Albert networks
The Barabási-Albert model generates a scale-free random network by combining the notions of growth and preferential attachment. Beginning with some small initial network (a kernel), the method works by adding new nodes incrementally, attaching each new node to existing nodes in the network. Attachment to existing nodes is preferential in that a new node has a probability of being attached to an existing node proportional to the degree of the existing node: existing nodes with higher degree will tend to increase in degree, while existing nodes with lower degree will only rarely acquire a new connection. The parameters in the model are
- $n$: number of nodes in initial clique
- $m$: number of edges in initial clique
- $k_{\min}$: degree of new node upon addition (number of new edges added at each step)
- $N$: total number of nodes
- $M$: total number of edges
and the mean degree $\langle k\rangle$ of a node in the final network is given by $\langle k\rangle = 2M/N$. To generate a network with a prescribed $\langle k\rangle$, we need to choose $n$, $m$ and $k_{\min}$ properly. If we require that our initial clique is fully connected, preventing any initial node from being preferred over any other at first attachment, then $m = (n^2 - n)/2$. We can also observe that for $N/n \gg 1$, $M \approx N k_{\min}$, as each new node introduces $k_{\min}$ edges by definition. It is thus natural to choose this limiting case as a constraint to enforce for any network size, meaning that we require $M = N k_{\min}$ and thus $\langle k\rangle = 2k_{\min}$. This fixes $k_{\min}$ and allows us to solve for $n$ (and hence $m$): $M = m + (N - n)k_{\min} = (n^2 - n)/2 + N k_{\min} - n k_{\min}$, so that $M = N k_{\min}$ requires $n = 2k_{\min} + 1$. The algorithm to generate a random Barabási-Albert network can then be sketched as follows. Beginning with a fully connected clique of $n = 2k_{\min} + 1$ nodes, add $N - n$ new nodes incrementally. Each new node is attached to the existing nodes by choosing $k_{\min}$ unique existing nodes, each chosen with probability proportional to the existing node's degree, and adding edges between the new node and this existing set to the network.
Because this algorithm requires beginning with a relatively large initial clique to satisfy the mean degree constraint exactly, the final degree distributions feature heavier tails than typical scale-free graphs, especially for large values of $\langle k\rangle/N$ (figure 7).
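The generation procedure described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `barabasi_albert` is hypothetical, and the $k_{\min}$ unique targets are drawn with NumPy's weighted sampling without replacement, which is one reasonable reading of "chosen with probability proportional to the existing node's degree".

```python
import numpy as np

def barabasi_albert(N, k_min, seed=None):
    """Grow a network from a clique of n = 2*k_min + 1 nodes to N nodes.

    Each new node attaches to k_min distinct existing nodes, sampled with
    probability proportional to their current degree. Returns a list of
    (existing_node, new_node) edges; the total is exactly N * k_min.
    """
    rng = np.random.default_rng(seed)
    n = 2 * k_min + 1
    if N < n:
        raise ValueError("N must be at least 2*k_min + 1")
    # Fully connected initial clique: (n**2 - n) / 2 edges, every degree n - 1.
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    degree = np.zeros(N)
    degree[:n] = n - 1
    for new in range(n, N):
        weights = degree[:new] / degree[:new].sum()
        # Assumption: sequential weighted sampling without replacement.
        targets = rng.choice(new, size=k_min, replace=False, p=weights)
        for target in targets:
            edges.append((int(target), new))
        degree[targets] += 1
        degree[new] = k_min
    return edges
```

Because the clique size is tied to $k_{\min}$, the edge count satisfies $M = N k_{\min}$ for every $N$, so the mean degree is exactly $2k_{\min}$ rather than only approaching it for large networks.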

B.2. Topological closeness in scale-free networks
In contrast to the homogeneous degree distribution of the ER random network model, Barabási-Albert (BA) networks [7] have a scale-free, heterogeneous degree distribution, and figure 8 shows that the distribution of the GENs for BA networks is likewise heterogeneous for directly connected nodes. The distribution of the GENs between nodes that share an edge (shown in figure 8a,b) appears to have a heavy tail and approximately satisfies $\Pr(E_{ij} = E) \sim E^{-\lambda}$, with an empirically determined scaling exponent near 1.5 for $\langle k\rangle = 4$ and around 2.1-2.2 for $\langle k\rangle = 20$ (shown in figure 9, found using Mathematica's LinearModelFit function). This is in comparison to the heavy tailed degree distribution with the $P(k) \propto k^{-3}$ scaling of the BA networks for both values of $\langle k\rangle$. Variations in the scaling exponent for $E_{ij}$ despite the fixed scaling exponent in the degree distribution do not indicate a lack of robustness of the model: as $N$ increases, each node with degree $k > 1$ is connected to a greater number of nodes with degree $k = 1$, thus decreasing the impact of shared neighbours for each node in the network. The eventual scaling of the GENs for BA networks in the limit of $N \to \infty$ is not readily derived analytically, due to the heterogeneity of the networks, which prevents mean field approximations as in equations (A 1) and (A 2) from being appropriate. Low-degree nodes are often linked to high-degree hubs in the BA algorithm, which leads to a significant decrease in the most probable value of $E_{ij}$ seen in figure 9a,b compared to figure 6a,b. This is because randomly selected nodes in the homogeneous ER networks probably have degree $k$, whereas a randomly selected node $j$ will most likely be of low degree in a BA network, and will have a smaller value of $E_{ij}$ to a hub ($i$).
Interestingly, the distribution of the GENs for disconnected nodes does not depend as strongly on the scale-free nature of the degree distribution, with similar qualitative features found in both figure 6c,d for the ER networks and figure 8c,d for the BA networks. While the existence of hubs in the BA networks tends to give a higher probability of finding smaller values of E ij for disconnected nodes in comparison to ER networks, the most likely values of E ij are similar for disconnected nodes in either network topology (in contrast to the radically different distributions for connected nodes). We have considered only unweighted networks in this analysis, and allowing weighted edges further complicates the analysis of the 'typical' GEN between nodes unless a homogeneity assumption on the distribution of weights is likewise made.
Deviations from the best fit power law in figure 9 occur for large $E$ due to the finite size of the network. The scale-free nature of the network does not alter the arguments used to show the saturation of the GENs for nodes at the diameter, and we therefore expect some upper bound on the maximum value of closeness. We expect an exponential growth in the GENs for nodes that are a large distance away from one another (as the network becomes more tree-like, with a low probability of overlap in the neighbours) as was seen for the more homogeneous ER networks. There also appears to be a lower bound on the GENs in figure 9, due to the fact that even neighbours shared between nodes reduce the closeness between them. If we imagine that two nodes with degree $k$ have a direct connection between them and all neighbours are shared, representing the topology producing the lowest closeness between the pair, the GENs will be $E_{\min} \approx \sqrt{k}$. This produces the lower bound at $E \approx 2$ in figure 9a for $\langle k\rangle = 4$ and at $E \approx 4.47$ in figure 9b for $\langle k\rangle = 20$.

B.3. Asymmetry in random walks in Barabási-Albert networks
In the main text, we found that the asymmetry in the MFPT in a random walk on an ER network was highly correlated with the asymmetry in the GENs, with a proportionality constant $\approx \sqrt{4N}$ for a wide range of $N$. In figure 10, we see a similar scaling holds for random walks on BA networks, consistent with the good agreement between the Erdős centrality and the random walk centrality for BA networks in figure 2.
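The MFPT asymmetry $\Delta t_{ij} = t_{ij} - t_{ji}$ discussed here can be computed directly for a small connected network. The sketch below is a generic illustration, not the authors' code: the function name `mfpt_matrix` is hypothetical, the walk is assumed to step to a uniformly chosen neighbour, and each column is obtained by solving the standard first-passage linear system with the target node removed. It does not compute the GENs, which require equation (2.1).

```python
import numpy as np

def mfpt_matrix(A):
    """Mean first passage times T[i, j] from node i to node j.

    A is the adjacency matrix of a connected, undirected network. The walker
    moves to a neighbour chosen uniformly (assumed unweighted walk).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)       # transition probabilities
    T = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        Q = P[np.ix_(keep, keep)]               # walk restricted to non-target nodes
        # First-passage equations: t = 1 + Q t  =>  (I - Q) t = 1
        T[keep, j] = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
    return T

# Path graph 0 - 1 - 2: the central node is the easiest to reach.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
T = mfpt_matrix(path)
delta_t = T - T.T                               # antisymmetric asymmetry matrix
```

In this example it takes longer on average to go from the central node to an end node than the reverse, matching the statement that high-degree nodes are more easily reached.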

B.4. GENs in networks with community structure
The nonlinear importance $c_{ij} = E_{ij}^{-1}$ can be used to rapidly determine meaningful relationships between nodes in complex networks. To illustrate this, we consider the benchmark of Lancichinetti, Fortunato and Radicchi (LFR) [10], which constructs a network of communities of variable sizes $n$ (distributed as $P(n) \propto n^{-\beta}$), with a scale-free distribution of the node degrees (with $P(k) \propto k^{-\gamma}$), and which is characterized by the mixing parameter, $\mu$, the fraction of inter-community edges. We have previously shown [11] that the GENs can be used to detect the community structure underlying this benchmark. When measuring the importance of a node, a global measure of centrality will generally focus on nodes with high degree, but due to the heterogeneous density of edges between communities, we expect a meaningful definition of the importance $j$ assigns to $i$ to differ significantly depending on whether $i$ and $j$ are in the same community.
Note that the determination of the GENs does not require or use knowledge of the community structure. In figure 11, we determine the distribution of importance $c_{ij}$ between nodes $i$ and $j$ that do not share a direct connection ($w_{ij} = 0$) for nodes within $i$'s community (red) and outside of $i$'s community (blue) on an LFR network with $N = 10^3$, $k = 25$, $\gamma = 2$, $\beta = 1$ and $\mu = 0.3$. There is an immediately apparent difference in the distributions, with a greater probability of a high importance if $i$ and $j$ are in the same community due to the increased number of shared neighbours (even in the absence of a direct connection). However, the intra-community and inter-community distributions overlap, indicating that some pairs assign a greater importance across communities than another pair within the same community. This is driven by the heterogeneous node degrees, with high-degree nodes assigning little importance to any node (including within their own community) but receiving high importance from low-degree nodes (including outside of their community). Increasing the LFR parameter $\mu$ (which increases the number of edges between communities) reduces the difference in the distributions, but varying the other system parameters has only a minor impact on the clear distinction between the two distributions (data not shown).

Appendix C. Topological closeness and dynamics on networks
In §4, the infection time of a target node $j$ for a disease originating at node $i$ was shown to be an increasing function of three different models of topological closeness: $R_{ij}$, $t_{ij}$ and $E_{ji}$. The infection time $h_{ij}$ tended to be clustered when compared to $t_{ij}$ (leading to significantly greater variation in the residuals). This is driven by the very strong relationship between $t_{ij}$ and the degree of the target node, depicted in figure 12. To determine the relationship between degree and topological closeness, we computed $\langle t^{(k)}\rangle = \sum_{ij} t_{ij}\,\delta_{k,k_j} / \sum_{ij} \delta_{k,k_j}$, with $\delta_{x,y} = 1$ if $x = y$ and 0 otherwise. This represents the mean MFPT (averaged over all origin nodes $i$ and all target nodes $j$) with the constraint that the degree of the target node is $k$. We find a very strong dependence of the MFPT on the degree of the target node, with the blue line showing $t^{(k)} \propto k^{-1}$. This strong relationship may be unsurprising, as the steady state probability of being found at a node $j$ is proportional to $k_j$ (as discussed in the main text). We can likewise compute $\langle h^{(k)}\rangle$ and $\langle E^{(k)}\rangle$ and find they both have a much weaker dependence on the degree of the target node (error bars are standard deviation of the mean).

Figure 4 was restricted to target nodes $j$ for which $k_j > 4$ for an ER network with $\langle k\rangle = 4$. This is because, while the infection times of low-degree nodes are still correlated with the GENs, with approximately the same exponent in the empirical fit $h_{ij} \propto E_{ji}^{a}$, the value of the coefficient of proportionality $c$ appears to vary with $k_j$ for low-degree target nodes. This is illustrated in figure 13a for $r_R = 0.01 r_I$, with the dashed line the same fitting exponent as in figure 4b but the points corresponding to low-degree nodes (with $k_j \leq 4$; different colours indicate different initial nodes). The best fit for the GENs tends to underestimate the infection time of low-degree nodes. The same qualitative behaviour is seen for $t_{ij}$ as well in figure 13b, with $h_{ij}$ tending to be underestimated by the best fit. The wide variation in figure 13 is consistent with that of figure 4c, and we expect that $t_{ij}$ will be a worse predictor of the infection time than the GENs. In figure 13c, we see the resistance distance is qualitatively similar to the MFPT (more clustered and with greater fluctuations than the GENs), but importantly the predictions for the infection times are not systematically underestimated as they are in figure 13a. We find that the GENs remain a better predictor of the infection time than either $R$ or $t$ for $r_R \approx 0$, but that resistance distance quickly overtakes the GENs as $r_R$ increases. It is important to note the difference in the axes between figures 13d and 4f, with the standard deviation for all three measures significantly higher with the inclusion of low-degree nodes than was seen for solely high-degree nodes.

Figure 11. The GENs applied to the Lancichinetti, Fortunato and Radicchi benchmark [10]. The red shows the distribution of importance for nodes $i$ and $j$ that are in the same community but do not share a direct connection. The blue shows the distribution for those in different communities (and still sharing no direct connection). Owing to the high density of links inside of the communities, the GENs accurately indicate that $c_{ij}$ is likely to be larger if $i$ and $j$ are in the same community.

Figure 12. The dependence of the MFPT $t_{ij}$ (blue circles), GENs $E_{ji}$ (black triangles) and mean infection time $h_{ij}$ (red squares) as a function of the degree of a target node $j$. The MFPT is well fit by $t_{ij} \propto k_j^{-1}$, while $h_{ij}$ and $E_{ji}$ have a much weaker dependence on the target node's degree. All constrained values $\langle x^{(k)}\rangle$ are scaled by $\langle x\rangle$ (the unconstrained average over all nodes) to permit comparison.

Figure 13. Inclusion of low-degree nodes in the prediction of $h_{ij}$ in the fitting reduces the quality of the agreement for all measures of topological closeness for $N = 512$ and $\langle k\rangle = 4$. In (a), low-degree nodes tend to become infected slower than would be predicted by the GENs, leading to significant weight far from the fit. (b,c) The quality of the fit for the MFPT and resistance distance (respectively). In (d), the poorness of the fit is quantified using the standard deviation. At $r_R \to 0$ the GENs perform best, but resistance distance is a better predictor of infection time for larger $r_R$. Note the change in scale from figure 4f.
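The degree-conditioned average used in this appendix, $\langle t^{(k)}\rangle = \sum_{ij} t_{ij}\,\delta_{k,k_j} / \sum_{ij} \delta_{k,k_j}$, can be sketched as follows. The function name `degree_conditioned_mean` is hypothetical, and the matrix in the example is made up for illustration rather than taken from any simulation. The same function applies to any pairwise quantity stored with origin $i$ as the row and target $j$ as the column.

```python
import numpy as np

def degree_conditioned_mean(T, degrees):
    """Average T[i, j] over all origins i and all targets j with degree k.

    Returns a dict mapping each degree k present in `degrees` to the mean of
    the columns of T whose target node has that degree.
    """
    T = np.asarray(T, dtype=float)
    degrees = np.asarray(degrees)
    means = {}
    for k in np.unique(degrees):
        target_has_degree_k = degrees == k      # the Kronecker delta on k_j
        means[int(k)] = T[:, target_has_degree_k].mean()
    return means

# Made-up 3-node example: nodes 0 and 2 have degree 1, node 1 has degree 2.
T = [[0, 1, 2],
     [3, 0, 4],
     [5, 7, 0]]
means = degree_conditioned_mean(T, [1, 2, 1])
```

Dividing each conditioned mean by the unconditioned mean of the matrix gives the scaled values that allow different measures to be compared on a single axis.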