Network comparison and the within-ensemble graph distance

Quantifying the differences between networks is a challenging and ever-present problem in network science. In recent years, a multitude of diverse, ad hoc solutions to this problem have been introduced. Here, we propose that simple and well-understood ensembles of random networks—such as Erdős–Rényi graphs, random geometric graphs, Watts–Strogatz graphs, the configuration model and preferential attachment networks—are natural benchmarks for network comparison methods. Moreover, we show that the expected distance between two networks independently sampled from a generative model is a useful property that encapsulates many key features of that model. To illustrate our results, we calculate this within-ensemble graph distance and related quantities for classic network models (and several parameterizations thereof) using 20 distance measures commonly used to compare graphs. The within-ensemble graph distance provides a new framework for developers of graph distances to better understand their creations and for practitioners to better choose an appropriate tool for their particular task.

In the main text figures, we plot the within-ensemble graph distances of networks with a fixed size. However, one important behavior of graph distance measures is how they change as networks increase in size.
As an example, the Jensen-Shannon divergence between the degree distributions (DJS) of two ER graphs will decrease as n → ∞, since the empirical degree distributions get closer and closer to a binomial distribution. On the other hand, for graph distances that are explicitly accompanied by a size-normalizing term (e.g. HAM), we would expect that the mean within-ensemble graph distance does not change as network size increases.
In Figure 1, we show how the within-ensemble graph distance changes as $n$ increases, both for a fixed density in $G(n,p)$ and for a fixed average degree in $G(n, \langle k \rangle)$. Here, we generate pairs of ER networks with either a fixed density $p$ or a fixed average degree $\langle k \rangle$ as we increase the network size $n$. In each subplot, the mean within-ensemble graph distance is plotted as a solid line with a shaded region for the standard error ($\langle D \rangle \pm \sigma_{\langle D \rangle}$; note that in most subplots above, the standard error is too small to see), while the dashed lines show the standard deviations.
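For readers who want to reproduce this behavior, the following minimal sketch (assuming the netrd package [1] exposes a Hamming distance class with a `dist` method, and using NetworkX for sampling) estimates the mean within-ensemble Hamming distance at fixed density as $n$ grows:

```python
import networkx as nx
import numpy as np
import netrd

# Estimate the mean within-ensemble Hamming distance of G(n, p)
# at fixed density p, for increasing network size n.
dist = netrd.distance.Hamming()
p, trials = 0.1, 50
for n in [100, 200, 400, 800]:
    d = [dist.dist(nx.gnp_random_graph(n, p), nx.gnp_random_graph(n, p))
         for _ in range(trials)]
    print(n, np.mean(d), np.std(d) / np.sqrt(trials))  # mean and standard error
```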

B. Descriptions of graph distance measures
Throughout the appendix, we assume graphs $G$ and $G'$ are undirected and unweighted, so that their adjacency matrices are binary and symmetric. We first consider several projections for distances whose description is the full adjacency sequence or matrix, followed by projections involving statistical and ad hoc descriptions. The list of graph distances used in this work is {JAC, HAM, HIM, FRO, POD, DJS, POR, QJS, CSE, GDD, RES, LSD, LGJ, LLE, IPM, NBD, DNB, DMD, DCN, NES}.

(a) Jaccard Distance
The Jaccard distance is computed from the adjacency matrices $\psi_G = A \in \{0,1\}^{n \times n}$ of two vertex-labeled graphs $G$ and $G'$. Let $S_{ij} = A_{ij} A'_{ij}$ represent the intersection of the edge sets of $G$ and $G'$, and let $T_{ij} = A_{ij} + A'_{ij} - A_{ij} A'_{ij}$ represent their union. Here, $|S|$ is the sum over the $S_{ij}$ and similarly for $|T|$, and the distance is
$$d_{\mathrm{JAC}} = 1 - \frac{|S|}{|T|}.$$
The computational complexity of the Jaccard distance is $O(|E| + |E'|)$ when using unordered sets to obtain the union and intersection sets and their cardinalities. This is what is done in the netrd package [1]. Since two nearly empty graphs likely have almost no edges in common, the ratio $|S|/|T|$ will be nearly zero for $p$ close to 0, so that $d_{\mathrm{JAC}}$ approaches 1 at low $p$.
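As a concrete illustration, here is a minimal NumPy sketch of the definition above, operating directly on binary adjacency matrices (the netrd implementation instead uses edge sets for efficiency):

```python
import numpy as np

def jaccard_distance(A, A2):
    """Jaccard distance between two vertex-labeled graphs given as
    binary adjacency matrices of the same size."""
    S = A * A2              # indicator of edges present in both graphs
    T = A + A2 - A * A2     # indicator of edges present in either graph
    return 1.0 - S.sum() / T.sum()
```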

(b) Hamming Distance
Similarly, the Hamming distance may also be computed from the adjacency matrices $A \in \{0,1\}^{n \times n}$. For two vertex-labeled graphs $G$ and $G'$, the Hamming distance counts the fraction of elementwise differences between $\psi_G = A$ and $\psi_{G'} = A'$:
$$d_{\mathrm{HAM}} = \frac{1}{n(n-1)} \sum_{i \neq j} \left| A_{ij} - A'_{ij} \right|.$$
The computational complexity of the Hamming distance is $O(n^2)$ if one compares the elements $A_{ij}$ for each node pair $ij$. This is what is used in the netrd package [1]. For sparse graphs, one could use unordered sets of edges to only compute the distance on edges $(i,j)$ in the union set $E \cup E'$, leading to a computational complexity of $O(|E| + |E'|)$.
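A matching NumPy sketch of the normalized Hamming distance:

```python
import numpy as np

def hamming_distance(A, A2):
    """Normalized Hamming distance: fraction of mismatched off-diagonal
    entries (the diagonals are zero for simple graphs)."""
    n = A.shape[0]
    return np.abs(A - A2).sum() / (n * (n - 1))
```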

(c) Frobenius
The Frobenius distance $d_{\mathrm{FRO}}$ is simply the Frobenius norm of the difference of the adjacency matrices:
$$d_{\mathrm{FRO}} = \left\| A - A' \right\|_F = \sqrt{\sum_{i,j} \left| A_{ij} - A'_{ij} \right|^2}.$$
Note that for binary adjacency matrices, $|A_{ij} - A'_{ij}|^2 = |A_{ij} - A'_{ij}|$, and $A_{ii} = A'_{ii} = 0 \; \forall i$ given that there are no self-loops. Note that, because the distance operates on the adjacency matrices directly, it implicitly assumes the graphs are vertex-labeled. FRO has the same computational complexity as the Hamming distance due to their similarity: it is $O(n^2)$ if one compares all entries, as in the netrd package, but it could be improved to $O(|E| + |E'|)$.
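And the corresponding one-liner for FRO; note that for binary matrices the identity above implies $d_{\mathrm{FRO}} = \sqrt{n(n-1)\, d_{\mathrm{HAM}}}$:

```python
import numpy as np

def frobenius_distance(A, A2):
    """Frobenius norm of the difference of two adjacency matrices."""
    return np.linalg.norm(A - A2)
```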

(d) Polynomial Dissimilarity
The polynomial dissimilarity, POD, between two unweighted, vertex-labeled graphs is based on the eigenvalue decompositions of the two adjacency matrices of the graphs, $G$ and $G'$ [2]. To compute the polynomial dissimilarity between two graphs, first decompose $A$ as $Q_A \Lambda_A Q_A^T$, where $Q_A$ is an orthogonal matrix and $\Lambda_A$ is the diagonal matrix of eigenvalues. Second, construct descriptions $P(A)$ and $P(A')$ for each graph, where
$$P(A) = \sum_{k=1}^{K} \frac{Q_A \Lambda_A^k Q_A^T}{(n-1)^{\alpha(k-1)}},$$
with the factor $(n-1)^{\alpha(k-1)}$ damping the contribution of higher powers of $A$. The polynomial dissimilarity, then, is calculated as the Frobenius norm between $P(A)$ and $P(A')$:
$$d_{\mathrm{POD}} = \left\| P(A) - P(A') \right\|_F.$$
In this work, we consider a default value of $K = 5$ in order to accommodate potentially informative higher-order interactions in each of the graphs. Here, $\alpha = 1$ by default, though in [2], $\alpha = 0.9$ is commonly considered.
The computational complexity of POD is $O(n^3)$ in practice, which arises from the two $n \times n$ matrix eigendecompositions it requires; these are $O(n^3)$ for general matrices using a method based on the QR algorithm [3], as in the netrd package. Note that recent techniques based on message passing can give fast and exact results for sparse networks with short loops in $O(n \log n)$ [4] and could be used to reduce the computational complexity of spectral graph distances.
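A minimal sketch of the construction, under the reconstruction of $P(A)$ given above (the eigendecomposition form; not the netrd implementation verbatim):

```python
import numpy as np

def poly_description(A, K=5, alpha=1.0):
    """P(A): damped sum of powers of A, built from its eigendecomposition."""
    n = A.shape[0]
    lam, Q = np.linalg.eigh(A)  # A = Q diag(lam) Q^T for symmetric A
    W = sum(np.diag(lam ** k) / (n - 1) ** (alpha * (k - 1))
            for k in range(1, K + 1))
    return Q @ W @ Q.T

def pod_distance(A, A2, K=5, alpha=1.0):
    """Frobenius norm between the polynomial descriptions."""
    return np.linalg.norm(poly_description(A, K, alpha)
                          - poly_description(A2, K, alpha))
```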

(e) Degree Distribution Jensen-Shannon Divergence
A simple graph distance measure is the Jensen-Shannon divergence [5] between the empirical degree distributions of two graphs. In this case, for an $n$-node graph $G$ the descriptor $\psi_G$ is the empirical degree distribution encoded in the set of numbers $p := \{p_k(G)\}_{k \geq 0}$ given by $p_k(G) := n_k(G)/n$, where $n_k(G) = \sum_{i=1}^{n} \mathbb{1}\{k_i = k\}$, with $\mathbb{1}\{\cdot\}$ being the indicator function and $k_i = \sum_{j=1}^{n} A_{ij}$ being the degree of node $i$ in terms of the adjacency matrix $A$ of $G$. The Jensen-Shannon divergence between two such distributions [6] is the degree Jensen-Shannon divergence or DJS distance between the graphs:
$$d_{\mathrm{DJS}} = H[p_+] - \frac{1}{2}\left( H[p] + H[p'] \right),$$
where $p_+ = \{(p_k + p'_k)/2\}_{k \geq 0}$ is a mixture distribution and $H[p] = -\sum_k p_k \ln p_k$ is the Shannon entropy.
The computational complexity of DJS is $O(n)$, which arises from computing two degree distributions (which is $O(n)$) and then comparing them (which is $O(k_+)$, with $k_+ < n$ being the maximum degree in either network).
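A self-contained sketch of DJS using natural-log entropies, padding both degree distributions to a common support:

```python
import numpy as np
import networkx as nx

def degree_jsd(G, G2):
    """JSD between empirical degree distributions (natural-log entropy)."""
    kmax = max(max(d for _, d in G.degree()), max(d for _, d in G2.degree()))
    def dist(G):
        degs = [d for _, d in G.degree()]
        return np.bincount(degs, minlength=kmax + 1) / G.number_of_nodes()
    def H(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    p, q = dist(G), dist(G2)
    return H((p + q) / 2) - 0.5 * (H(p) + H(q))

# e.g. degree_jsd(nx.gnp_random_graph(500, 0.02), nx.gnp_random_graph(500, 0.02))
```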

(f) Portrait Divergence
The portrait divergence, POR, compares, via the JSD, a description of each of two graphs called the network portrait [7,8]. The network portrait is a matrix $B$ with elements $B_{lk}$ such that $B_{lk} \equiv$ the number of nodes that have $k$ nodes at distance $l$.
Alternatively stated, $B_{lk}$ is the $k$th entry of the empirical histogram of $l$-th neighborhood sizes. These elements are computed using a breadth-first search or similar method. The portrait divergence of $G$ and $G'$ is the JSD of probability distributions associated with their portraits, $B$ and $B'$ [8]. Note that each row of $B$ can be interpreted as the probability that a randomly chosen node has $k$ nodes at distance $l$:
$$P(k \mid l) = \frac{1}{n} B_{lk},$$
which can be weighted by the number of paths of length $l$, so that the probability that two randomly selected nodes are at distance $l$ from each other is
$$P(l) = \frac{\sum_k k\, B_{lk}}{\sum_c n_c^2},$$
where $n_c$ is the number of nodes within a connected component $c$. The joint probability of choosing a pair of nodes at distance $l$ apart, one of which has $k$ nodes in total at distance $l$, is then
$$P(k, l) = \frac{k\, B_{lk}}{\sum_c n_c^2}.$$
There is now a $P_B(k,l)$ and $P_{B'}(k,l)$ for each portrait, $B$ and $B'$, as well as a "mixed" distribution for both, specified as $P^* = \frac{1}{2}\left( P_B(k,l) + P_{B'}(k,l) \right)$. The portrait divergence between $G$ and $G'$ is the JSD between their portraits:
$$d_{\mathrm{POR}} = \frac{1}{2} D_{\mathrm{KL}}\!\left( P_B \,\|\, P^* \right) + \frac{1}{2} D_{\mathrm{KL}}\!\left( P_{B'} \,\|\, P^* \right),$$
where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. Note that $\sqrt{d_{\mathrm{POR}}}$ satisfies the properties of a metric (satisfies the triangle inequality, is positive-definite, symmetric) [9].
The computational complexity of POR is $O(n(n + |E|) \log n)$, which comes from the requirement of computing shortest paths between all pairs of nodes in the network. In our implementation, computing the shortest paths between a source and all other nodes is done with Dijkstra's algorithm with a binary heap, which takes $O((n + |E|) \log n)$ operations in the worst case. Constructing the portrait and calculating the JSD between the associated distributions has a lower computational complexity.
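A plain BFS-based sketch of the portrait matrix itself (the subsequent JSD step follows the distributions defined above):

```python
from collections import Counter
import numpy as np
import networkx as nx

def network_portrait(G):
    """Portrait matrix B with B[l, k] = number of nodes having exactly
    k nodes at shortest-path distance l (BFS-based sketch)."""
    n = G.number_of_nodes()
    per_node = [Counter(nx.single_source_shortest_path_length(G, u).values())
                for u in G]
    L = max(max(c) for c in per_node) + 1  # largest observed distance + 1
    B = np.zeros((L, n + 1), dtype=int)
    for c in per_node:
        for l, k in c.items():
            B[l, k] += 1
    return B
```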

(g) Quantum Spectral Jensen-Shannon Divergence
This method compares graphs via the Jensen-Shannon divergence (JSD) between probability distributions associated with density matrices of two graphs $G$ and $G'$ [10-12], denoted $\rho$ and $\rho'$ respectively, defined by
$$\rho = \frac{e^{-\beta L(G)}}{Z},$$
where $L(G)$ is the Laplacian matrix of graph $G$, and the constant $Z \equiv \sum_{i=1}^{n} e^{-\beta \lambda_i(L)}$, with $\lambda_i(L)$ being the $i$th eigenvalue of $L$. The description-distance pair $(\rho, \mathrm{JSD})$ yields the "quantum spectral Jensen-Shannon divergence" (QJS) [13], which compares two graphs by the entropy of the eigenvalue spectra of their density matrices $\rho$. Treating the spectrum $\{\lambda_i\}_{i=1}^{n}$ of a density matrix as a normalized probability distribution, the spectral Rényi entropy of order $q$ is given by
$$H_q[\rho] = \frac{1}{1-q} \log \sum_{i=1}^{n} \lambda_i(\rho)^q,$$
which, in the limit $q \to 1$, reduces to the von Neumann entropy:
$$H_1[\rho] = -\sum_{i=1}^{n} \lambda_i(\rho) \log \lambda_i(\rho).$$
The QJS distance between two graphs is defined to be
$$d_{\mathrm{QJS}} = H_q\!\left[ \frac{\rho + \rho'}{2} \right] - \frac{1}{2}\left( H_q[\rho] + H_q[\rho'] \right).$$
For default parameter values, we use $\beta = 0.1$ and $q = 1.0$, based on the explanations in [13]. QJS requires computation of the Laplacian matrix spectra of the two graphs, and comparison thereof, which yields a computational complexity of $O(n^3)$ (see Appendix (d)).
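A sketch of QJS for the von Neumann case $q = 1$, built from Laplacian eigendecompositions; note that the mixture's entropy requires diagonalizing the mixed density matrix, not averaging the two spectra:

```python
import numpy as np
import networkx as nx

def density_matrix(G, beta=0.1):
    """rho = exp(-beta * L) / Z, via the Laplacian eigendecomposition."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    lam, Q = np.linalg.eigh(L)
    w = np.exp(-beta * lam)
    return (Q * (w / w.sum())) @ Q.T

def vn_entropy(rho):
    """Von Neumann entropy of a density matrix."""
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]
    return -(lam * np.log(lam)).sum()

def qjs_distance(G, G2, beta=0.1):
    rho, rho2 = density_matrix(G, beta), density_matrix(G2, beta)
    return vn_entropy((rho + rho2) / 2) - 0.5 * (vn_entropy(rho) + vn_entropy(rho2))
```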

(h) Communicability Sequence Entropy Divergence
The communicability sequence entropy divergence CSE between two graphs, $G$ and $G'$, is the JSD between the communicability distributions of $G$ and $G'$ [14]. To obtain a communicability distribution, we first construct the communicability matrix, an $n \times n$ matrix whose entries quantify the communicability between pairs of nodes $v_i$ and $v_j$:
$$C = e^{A} = \sum_{m=0}^{\infty} \frac{A^m}{m!}.$$
In other words, the communicability matrix $C$ is computed as the matrix exponential of the adjacency matrix. The elements $C_{ij}$, $i \leq j$, are stored in a vector of length $n(n+1)/2$ and normalized to create the communicability sequences, $P$ and $P'$, for each graph. The Shannon entropy of $P$ is $H[P] = -\sum_i P_i \log P_i$, and the communicability sequence entropy divergence is calculated as the JSD between $P$ and $P'$:
$$d_{\mathrm{CSE}} = H[M] - \frac{1}{2}\left( H[P] + H[P'] \right),$$
where $M$ is the mixed sequence of $P$ and $P'$.
The computational complexity of CSE is $O(n^3)$, with the computationally intensive step being the computation of the matrix exponential of both adjacency matrices $A$ and $A'$. Our implementation uses Padé approximants through the SciPy package to perform this step, which takes $O(n^3)$ operations to get an approximation [15].
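A sketch of the communicability sequence construction using SciPy's `expm`; two such sequences (for graphs of equal size) can then be compared with the JSD exactly as in the DJS sketch above:

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

def communicability_sequence(G):
    """Normalized vector of the upper-triangular (i <= j) entries of exp(A)."""
    C = expm(nx.to_numpy_array(G))
    seq = C[np.triu_indices_from(C)]
    return seq / seq.sum()
```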

(i) Graph Diffusion Distance
The graph diffusion distance [16] GDD between two graphs, $G$ and $G'$, is a distance measure based on the notion of flow within each graph. As such, this measure uses the unnormalized Laplacian matrices of both graphs, $L$ and $L'$, and uses them to construct time-varying Laplacian exponential diffusion kernels, $e^{-tL}$ and $e^{-tL'}$, by effectively simulating a diffusion process for $t$ timesteps (as a default, $t = 1000$), creating a column vector of node-level activity at each timestep.
The distance $d_{\mathrm{GDD}}(G, G')$ is defined as the Frobenius norm between the two diffusion kernels at the timestep $t^*$ where the two kernels are maximally different:
$$d_{\mathrm{GDD}}(G, G') = \max_{t} \left\| e^{-tL} - e^{-tL'} \right\|_F.$$
The computational complexity is $O(n^3)$ since a spectral decomposition of the Laplacian matrices is used (see Appendix (d)).
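A grid-search sketch of GDD (the time grid and its range are illustrative choices, not the reference implementation's):

```python
import numpy as np
import networkx as nx

def gdd_distance(G, G2, ts=np.linspace(0.1, 10, 100)):
    """Graph diffusion distance: maximize the Frobenius norm between
    Laplacian diffusion kernels over a grid of diffusion times."""
    def kernels(G):
        L = nx.laplacian_matrix(G).toarray().astype(float)
        lam, Q = np.linalg.eigh(L)
        return [(Q * np.exp(-t * lam)) @ Q.T for t in ts]
    return max(np.linalg.norm(K - K2)
               for K, K2 in zip(kernels(G), kernels(G2)))
```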

(j) Resistance Perturbation Distance
The resistance perturbation distance RES between two vertex-labeled graphs, $G$ and $G'$, is the $p$-norm of the difference between the two graphs' resistance matrices [17]. The resistance perturbation distance changes if either graph is relabeled (it is not invariant under graph isomorphism), so node labels should be consistent between the two graphs being compared. The distance is not normalized.
The resistance matrix of a graph $G$ is calculated as
$$R_{ij} = L^{\dagger}_{ii} + L^{\dagger}_{jj} - 2 L^{\dagger}_{ij},$$
where $L^{\dagger}$ is the Moore-Penrose pseudoinverse of the Laplacian of $G$.
The resistance perturbation graph distance of $G$ and $G'$ is calculated as the $p$-norm (the $p$th root of the sum of the $p$th powers of elements) of the difference in their resistance matrices:
$$d_{\mathrm{RES}} = \left\| R - R' \right\|_p = \left( \sum_{i,j} \left| R_{ij} - R'_{ij} \right|^p \right)^{1/p}.$$
The default value chosen in experiments is $p = 2$. The computational complexity of RES is $O(n^3)$ for our implementation, since we need to compute the Moore-Penrose pseudoinverse of the Laplacian matrix of both graphs, which is $O(n^3)$. Note that low-rank approximations can be used to reduce the computational complexity [17].
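A direct sketch of RES via the Laplacian pseudoinverse and the standard effective-resistance identity given above:

```python
import numpy as np
import networkx as nx

def resistance_matrix(G):
    """Effective resistances R_ij from the Laplacian pseudoinverse."""
    Ld = np.linalg.pinv(nx.laplacian_matrix(G).toarray().astype(float))
    d = np.diag(Ld)
    return d[:, None] + d[None, :] - 2 * Ld

def res_distance(G, G2, p=2):
    diff = resistance_matrix(G) - resistance_matrix(G2)
    return (np.abs(diff) ** p).sum() ** (1.0 / p)
```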

(k) NetLSD
The NetLSD distance LSD between two graphs, $G$ and $G'$, is the Frobenius norm between the heat trace signatures of the normalized Laplacians $L$ and $L'$ [18]. The heat kernel matrix is calculated as
$$H_t = e^{-tL},$$
whose $ij$th element contains the amount of heat transferred from node $v_i$ to node $v_j$ at time $t$ (by default, 256 log-spaced time intervals between $10^{-2}$ and $10^{2}$). From the heat kernel matrix $H_t$, the heat trace $h_t$ is defined as
$$h_t = \mathrm{tr}(H_t) = \sum_i e^{-t \lambda_i},$$
where the $\lambda_i$ are the eigenvalues of $L$. The heat trace signature of graph $G$ is the set $\{h_t\}$ over the sampled times. Upon computing heat trace signatures of both $G$ and $G'$, they are compared via a Frobenius norm:
$$d_{\mathrm{LSD}} = \left\| h - h' \right\|_2.$$
The computational complexity of LSD is $O(n^3)$ due to the spectral decomposition of the Laplacian matrices of both graphs (see Appendix (d)).
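A compact sketch of the heat trace signature and the resulting distance:

```python
import numpy as np
import networkx as nx

def heat_trace_signature(G, ts=np.logspace(-2, 2, 256)):
    """h_t = sum_i exp(-t * lambda_i) over log-spaced times t."""
    lam = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(G).toarray())
    return np.exp(-np.outer(ts, lam)).sum(axis=1)

def lsd_distance(G, G2):
    return np.linalg.norm(heat_trace_signature(G) - heat_trace_signature(G2))
```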

(l) Laplacian Spectrum Distances
Many distances between two graphs, $G$ and $G'$, use a direct comparison of their Laplacian spectra. For all the methods below, we use the eigenvalues $\{\lambda_1 = 0 \leq \lambda_2 \leq \cdots \leq \lambda_n\}$ of the normalized Laplacian matrices $L$ and $L'$. To perform the comparison, a subset of the whole spectrum can be used, e.g. the $k$ smallest [19] or largest [12,20] in magnitude. Unless specified, we used all eigenvalues for comparison ($k = n$).
The distances compare the continuous spectra $\rho(\lambda)$ and $\rho'(\lambda)$ associated with the graphs $G$ and $G'$. A continuous spectrum is obtained by the convolution of the discrete spectrum $\sum_i \delta(\lambda - \lambda_i)$ with a kernel $g(\lambda, \lambda^*)$,
$$\rho(\lambda) = \frac{1}{Z} \sum_i g(\lambda, \lambda_i),$$
where $Z$ is a normalization factor. Different types of distribution can be used for the kernel, for instance a Lorentzian distribution
$$g(\lambda, \lambda^*) = \frac{\gamma}{\pi \left[ \gamma^2 + (\lambda - \lambda^*)^2 \right]}, \quad \text{(A 24)}$$
or a normal distribution
$$g(\lambda, \lambda^*) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(\lambda - \lambda^*)^2 / 2\sigma^2}.$$
Different types of metrics can then be used to compare the spectra, such as the Euclidean metric
$$d(\rho, \rho') = \sqrt{\int \left[ \rho(\lambda) - \rho'(\lambda) \right]^2 \mathrm{d}\lambda},$$
or the Jensen-Shannon divergence with mixture $\bar{\rho} = (\rho + \rho')/2$. Various combinations of kernels and metrics yield the following distinct distance measures:
• Laplacian spectrum, Gaussian kernel, JSD distance (LGJ)
• Laplacian spectrum, Lorentzian kernel, Euclidean distance (LLE)
For both kernels, we use a half width at half maximum of 0.011775 (which means the standard deviation for the Gaussian kernel is ≈ 0.01).
While we only focus on the two specific distances above, we note again that there is a world of possible descriptor-distance pairs to use for comparing graphs. We selected the two above because their within-ensemble graph distance curves differed the most (e.g. as opposed to including Gaussian kernel / Euclidean distance or Lorentzian kernel / JSD). The computational complexity of this suite of graph distances is $O(n^3)$ due to the spectral decomposition of the Laplacian matrices of both graphs (see Appendix (d)).
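A sketch of LLE (Lorentzian kernel, Euclidean metric) on a fixed evaluation grid; LGJ follows by swapping in a Gaussian kernel and the JSD. The $1/\pi$ prefactor is dropped since the density is renormalized on the grid anyway:

```python
import numpy as np
import networkx as nx

def lorentzian_spectrum(G, grid=np.linspace(0, 2, 1000), gamma=0.011775):
    """Normalized-Laplacian spectrum smoothed with a Lorentzian kernel,
    evaluated on a grid over [0, 2] and normalized to sum to one."""
    lam = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(G).toarray())
    rho = (gamma / (gamma ** 2 + (grid[:, None] - lam[None, :]) ** 2)).sum(axis=1)
    return rho / rho.sum()

def lle_distance(G, G2):
    return np.linalg.norm(lorentzian_spectrum(G) - lorentzian_spectrum(G2))
```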

(m) Ipsen-Mikhailov
The Ipsen-Mikhailov distance [21] IPM between two graphs, $G$ and $G'$, is a spectral comparison of their Laplacian matrices, $L$ and $L'$. This approach treats the nodes of $G$ and $G'$ as molecules with elastic connections between them, which casts the distance measurement between $G$ and $G'$ as the solution to a set of differential equations for the vibrational frequencies of the nodes. The vibrational frequencies $\omega_i$ of each node in $G$ are related to the eigenvalues $\lambda_i$ of $L$ such that $\lambda_i = \omega_i^2$. With this, one can construct a spectral density for each graph as a sum of Lorentz distributions as follows:
$$\rho(\omega) = \frac{1}{Z} \sum_{i=2}^{n} \frac{\gamma}{(\omega - \omega_i)^2 + \gamma^2},$$
where $Z$ is a normalization term, and $\gamma$ is a fixed scaling term that controls the width of the Lorentz distributions (as in [21], we use $\gamma = 0.08$ as a default). The distance between $G$ and $G'$ is then calculated as
$$d_{\mathrm{IPM}}(G, G') = \sqrt{\int_0^{\infty} \left[ \rho(\omega) - \rho'(\omega) \right]^2 \mathrm{d}\omega}.$$
The computational complexity of IPM is $O(n^3)$ due to the spectral decomposition of the Laplacian matrices of both graphs (see Appendix (d)).
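A quadrature sketch of IPM; the evaluation grid here is an assumption and should extend past $\sqrt{\lambda_{\max}}$ for dense graphs:

```python
import numpy as np
import networkx as nx

def ipm_distance(G, G2, gamma=0.08, grid=np.linspace(0, 4, 4000)):
    """Euclidean distance between Lorentzian spectral densities of the
    vibrational frequencies omega_i = sqrt(lambda_i) (quadrature sketch)."""
    def density(G):
        lam = np.linalg.eigvalsh(nx.laplacian_matrix(G).toarray().astype(float))
        omega = np.sqrt(np.abs(lam[1:]))  # drop the zero mode
        rho = (gamma / ((grid[:, None] - omega[None, :]) ** 2 + gamma ** 2)).sum(axis=1)
        return rho / np.trapz(rho, grid)  # normalize to unit area
    return np.sqrt(np.trapz((density(G) - density(G2)) ** 2, grid))
```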

(n) Hamming-Ipsen-Mikhailov
The Hamming-Ipsen-Mikhailov distance HIM between two vertex-labeled graphs, $G$ and $G'$, is expressed as a weighted combination of the IPM (Section (m)) distance and a normalized HAM (Section (b)) distance [22]. The parameter $\gamma$ for the IPM component is fixed such that $d_{\mathrm{IPM}}(E_n, F_n) = 1$, where $E_n$ and $F_n$ are the empty and complete graphs on $n$ nodes. The HIM distance is defined as follows:
$$d_{\mathrm{HIM}} = \frac{1}{\sqrt{1 + \xi}} \sqrt{d_{\mathrm{HAM}}^2 + \xi\, d_{\mathrm{IPM}}^2}.$$
We default to $\xi = 1$, as in [22]. The computational complexity of HIM is $O(n^3)$, with the computationally intensive part being the computation of the IPM distance.
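Given precomputed HAM and IPM values, the combination step is a one-liner (assuming the weighted-quadratic-mean form reconstructed above):

```python
import numpy as np

def him_distance(d_ham, d_ipm, xi=1.0):
    """Combine precomputed normalized HAM and IPM values into HIM."""
    return np.sqrt(d_ham ** 2 + xi * d_ipm ** 2) / np.sqrt(1.0 + xi)
```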

(o) Non-backtracking Spectral Distance
The non-backtracking spectral distance NBD between two graphs, $G$ and $G'$, is a method that compares the eigenvalues of the non-backtracking matrix of each graph, $B$ and $B'$ [23]. This distance is based on the length spectrum and the set of non-backtracking cycles of a graph (i.e., closed walks that never immediately return to the node they just left) and is calculated as the earth mover's distance (EMD) between the eigenvalues of $B$ and $B'$. The eigenvalues of $B$ and $B'$ are expressed as $\lambda_k = a_k + i b_k$ and $\lambda'_k = a'_k + i b'_k$, respectively, and $\mathrm{EMD}(\lambda_B, \lambda_{B'})$ is the solution to an optimization problem finding the minimum amount of work required to move the coordinates of $\lambda_B$ to the positions of $\lambda_{B'}$.
Note that the Ihara determinant formula can be used to obtain the non-backtracking eigenvalues different from ±1 using a 2n × 2n matrix [23].
If one used the whole non-backtracking spectra to compute the distance, the computational complexity would be $O(n^3)$ [23]. Instead of using the whole spectrum of the non-backtracking matrices, for graph $G$ we compute only the $r$ eigenvalues larger in magnitude than $\sqrt{\lambda_1}$, where $\lambda_1$ is the largest eigenvalue of $B$ [23].
The computational complexity of our implementation of NBD is $O(\max(r, r')\, n^2)$ for general graphs, where $r$ and $r'$ are the numbers of eigenvalues larger in magnitude than $\sqrt{\lambda_1}$ and $\sqrt{\lambda'_1}$, respectively, for graphs $G$ and $G'$. To compute these eigenvalues, an implicitly restarted Arnoldi method is used. For sparse graphs the computation is even more efficient.
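A sketch of the spectral step only (the full NBD pipeline additionally requires an EMD solver), using the $2n \times 2n$ block matrix from the Ihara determinant formula mentioned above:

```python
import numpy as np
import networkx as nx

def nb_eigenvalues(G):
    """Non-backtracking eigenvalues (other than +-1) as the spectrum of
    the 2n x 2n block matrix from the Ihara determinant formula."""
    A = nx.to_numpy_array(G)
    n = len(A)
    D = np.diag(A.sum(axis=1))
    M = np.block([[A, np.eye(n) - D],
                  [np.eye(n), np.zeros((n, n))]])
    return np.linalg.eigvals(M)
```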

(p) Distributional Non-backtracking Distance
Similar to the NBD distance, the DNB distance [24] leverages spectral properties of the non-backtracking matrices, $B$ and $B'$, of two graphs, $G$ and $G'$, in order to calculate their dissimilarity.
Unlike the NBD distance, the DNB involves a comparison of the (re-scaled) distributions of eigenvalues of $B$ and $B'$, which are compared using either the Euclidean distance or the Chebyshev distance (here, we use the Euclidean distance). We also use the whole spectrum for this distance. Therefore, the computational complexity of DNB is $O(n^3)$ due to the spectral decomposition of the two $2n \times 2n$ matrices (see Appendix (o)).

(q) D-measure Distance
The D-measure distance [25] DMD between two graphs, $G$ and $G'$, combines three properties of the graphs being compared: the network node dispersion ($NND$), the node distance distribution ($\mu$), and the $\alpha$-centrality. For a full explanation and justification of each of the components involved in this distance, we refer the reader to the original article [25], but we briefly summarize it below.
In order to compute the $NND$ of a graph, each node $v_i$ is assigned a probability vector $P_i$ whose elements are the fractions of nodes connected to $v_i$ at each distance $j \leq d$, where $d$ is the diameter of the network. The $NND$, then, is defined as
$$NND(G) = \frac{\mathcal{J}(P_1, P_2, \ldots, P_n)}{\log(d + 1)},$$
where $\mathcal{J}(P_1, P_2, \ldots, P_n)$ is the Jensen-Shannon divergence of the $P_i$ from the whole network's average node-distance distribution, whose value at each distance $j$ we denote $\mu_j$. The collection of the $\mu_j$ for all distances $j \leq d$ in a graph $G$ we denote $\mu_G$. The final step before the calculation of the D-measure distance is to find the $\alpha$-centrality [26] of each network, $G$ and $G'$, as well as the $\alpha$-centrality of the complement of each network, $G^c$ and $G'^c$. The $\alpha$-centralities of the original networks are denoted $P_{\alpha G}$ and $P_{\alpha G'}$, while the $\alpha$-centralities of their complements are $P_{\alpha G^c}$ and $P_{\alpha G'^c}$. Ultimately, the D-measure distance between two graphs is
$$d_{\mathrm{DMD}}(G, G') = w_1 \sqrt{\frac{\mathcal{J}(\mu_G, \mu_{G'})}{\log 2}} + w_2 \left| \sqrt{NND(G)} - \sqrt{NND(G')} \right| + \frac{w_3}{2} \left( \sqrt{\frac{\mathcal{J}(P_{\alpha G}, P_{\alpha G'})}{\log 2}} + \sqrt{\frac{\mathcal{J}(P_{\alpha G^c}, P_{\alpha G'^c})}{\log 2}} \right),$$
where $w_1 + w_2 + w_3$ must equal 1.0. To calculate the final distance value, we adopt the convention used in [25] such that $w_1 = 0.45$, $w_2 = 0.45$, $w_3 = 0.1$. According to Ref. [25], the computational complexity of DMD is $O(|E| + n \log n)$. However, one needs to compute shortest paths between all pairs of nodes, which suggests a more computationally intensive calculation. We rather have a computational complexity of $O(n(n + |E|) \log n)$ with our implementation using Dijkstra's algorithm with a binary heap (see Appendix (f)).
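A sketch of the $NND$ component alone, assuming a connected graph (the full DMD additionally needs the $\mu$ comparison and the $\alpha$-centrality terms):

```python
import numpy as np
import networkx as nx

def nnd(G):
    """Network node dispersion of a connected graph: JSD of per-node
    distance distributions, normalized by log(diameter + 1)."""
    n, d = G.number_of_nodes(), nx.diameter(G)
    P = np.zeros((n, d))
    for i, u in enumerate(G):
        for v, dist in nx.single_source_shortest_path_length(G, u).items():
            if dist > 0:
                P[i, dist - 1] += 1.0 / (n - 1)
    mu = P.mean(axis=0)
    def H(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    return (H(mu) - np.mean([H(row) for row in P])) / np.log(d + 1)
```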

(r) DeltaCon
The DeltaCon distance DCN between two graphs, $G$ and $G'$, is the Matusita distance between the affinity matrices, $S$ and $S'$, of $G$ and $G'$. The affinity matrices are constructed using fast belief propagation, which is expressed as
$$s_i = \left[ I + \epsilon^2 D - \epsilon A \right]^{-1} e_i,$$
where $I$ is the $n \times n$ identity matrix, $D$ is the diagonal degree matrix, $A$ is the adjacency matrix, $e_i$ is a vector indicating the initial node $v_i$ from which a random walk process is initiated, and $s_i$ is a column vector consisting of the $s_{ij}$, the affinity of node $v_j$ with respect to node $v_i$. The affinity matrices, $S$ and $S'$, are therefore defined as $S = \left[ I + \epsilon^2 D - \epsilon A \right]^{-1}$. The distance between $G$ and $G'$ according to DeltaCon is
$$d_{\mathrm{DCN}}(G, G') = \sqrt{\sum_{i,j} \left( \sqrt{S_{ij}} - \sqrt{S'_{ij}} \right)^2}.$$
The computational complexity of our implementation of DCN is $O(n^3)$ since we obtain $S$ by matrix inversion directly. However, note that it is possible to improve the algorithm to an $O(n^2)$ computational complexity using a power method, or even to $O(|E|)$ by approximating the distance [27].
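An exact (non-approximated) sketch of DCN; the choice $\epsilon = 1/(1 + \max_i D_{ii})$ is a common heuristic from the DeltaCon literature and is an assumption here:

```python
import numpy as np
import networkx as nx

def dcn_distance(G, G2):
    """Matusita distance between fast-belief-propagation affinity matrices."""
    def affinity(G):
        A = nx.to_numpy_array(G)
        D = np.diag(A.sum(axis=1))
        eps = 1.0 / (1.0 + D.max())  # assumed heuristic for epsilon
        return np.linalg.inv(np.eye(len(A)) + eps ** 2 * D - eps * A)
    S, S2 = affinity(G), affinity(G2)
    return np.sqrt(((np.sqrt(S) - np.sqrt(S2)) ** 2).sum())
```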

(s) NetSimile
NetSimile NES is a method for comparing two graphs, $G$ and $G'$, based on statistical features of the two graphs. It is invariant to graph labels and is able to compare graphs of different sizes [28]. It is calculated as the Canberra distance between the $7 \times 5$ feature matrices, $p$ and $p'$, of each graph. To construct the $p$ and $p'$ feature matrices, first a $7 \times n$ matrix is constructed for each graph, with each column $j$ consisting of the following seven node-level quantities: (i) the degree of the node; (ii) its clustering coefficient; (iii) the average degree of its neighbors; (iv) the average clustering coefficient of the nodes in its ego network; (v) the number of edges in its ego network; (vi) the number of edges outgoing from its ego network; and (vii) the number of neighbors of its ego network. These features are then summarized into $p$ and $p'$, which are $7 \times 5$ signature matrices consisting of the median, mean, standard deviation, skewness, and kurtosis of each feature. NetSimile uses the Canberra distance between the flattened signatures to arrive at a final scalar distance.
The computational complexity of NES depends on two parts: feature extraction and feature aggregation. The features are all locally defined, hence their extraction takes $O(qn)$, where $q$ is the average degree of a node obtained by selecting a random edge and choosing an endpoint [29]. Feature aggregation is $O(n \log n)$ [28], hence the overall complexity is $O(qn + n \log n)$.
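An abbreviated sketch of the NetSimile pipeline using only three of the seven features, to show the extract-aggregate-compare structure:

```python
import numpy as np
import networkx as nx
from scipy.stats import skew, kurtosis
from scipy.spatial.distance import canberra

def netsimile_signature(G):
    """Median/mean/std/skew/kurtosis summary of three node features:
    degree, clustering, and mean clustering of the ego network."""
    deg, clust = dict(G.degree()), nx.clustering(G)
    rows = [[deg[u], clust[u],
             np.mean([clust[v] for v in list(G.neighbors(u)) + [u]])]
            for u in G]
    F = np.array(rows)
    return np.concatenate([np.median(F, 0), F.mean(0), F.std(0),
                           skew(F, 0), kurtosis(F, 0)])

def nes_distance(G, G2):
    return canberra(netsimile_signature(G), netsimile_signature(G2))
```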

C. Analytical derivation of within-ensemble graph distances

(a) Jaccard Distance
We can directly calculate $\langle d_{\mathrm{JAC}}(A, A') \rangle_{G(n,p)}$, the expected Jaccard distance between two graphs independently sampled from $G(n,p)$. Both $|T|$ and $|S|$ are distributed binomially, as they are sums over the $\binom{n}{2}$ node pairs with success probabilities $2p - p^2$ (an edge in either graph) and $p^2$ (an edge in both graphs), respectively. Since both distributions are sharply peaked, we have
$$\langle d_{\mathrm{JAC}} \rangle \approx 1 - \frac{p^2}{2p - p^2} = \frac{2(1-p)}{2-p},$$
which, as $p \to 1$, approaches $2(1-p) \approx 2p(1-p)$, exactly the same as the expected Hamming distance. Indeed, we observe this equivalence in Figure 1 of the main text in the region $p \approx 1$. This finding makes intuitive sense because in the region $p \approx 1$, the "union graph" $T$ is likely an essentially complete graph, and $d_{\mathrm{JAC}}$ simply measures the fraction of edges/non-edges that are not in agreement between $G$ and $G'$, which is precisely what $d_{\mathrm{HAM}}$ does for all $p$ given an adjacency description.

(b) Hamming Distance
The Hamming distance is simply the fraction of mismatched entries between $A$ and $A'$. Due to this simplicity, we again can analytically predict the mean within-ensemble graph distance for graphs sampled from $G(n,p)$:
$$\langle d_{\mathrm{HAM}} \rangle = \Pr\left[ A_{ij} \neq A'_{ij} \right] = 2p(1-p).$$
The function $2p(1-p)$ is $n$-independent and has a maximum at $p = \frac{1}{2}$; simulations match it precisely. Interestingly, while this calculation was done for $G(n,p)$, the results in Figure 1 of the main text show an equivalent result for RGGs of the same edge density $p$.
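This prediction is easy to verify numerically; a minimal simulation sketch:

```python
import numpy as np
import networkx as nx

# Check <d_HAM> = 2p(1 - p) by sampling pairs of ER graphs.
n, p, trials = 200, 0.3, 100
d = [np.abs(nx.to_numpy_array(nx.gnp_random_graph(n, p))
            - nx.to_numpy_array(nx.gnp_random_graph(n, p))).sum() / (n * (n - 1))
     for _ in range(trials)]
print(np.mean(d), 2 * p * (1 - p))  # the two values should nearly agree
```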

(c) Frobenius
As a back-of-the-envelope calculation, note that the sum of elementwise differences is binomially distributed with mean
$$\left\langle \sum_{i,j} \left| A_{ij} - A'_{ij} \right| \right\rangle = n(n-1) \cdot 2p(1-p).$$
Using the sharp peakedness of this distribution, and the fact that $d_{\mathrm{FRO}}^2$ equals this sum for binary matrices, we can thus state approximately
$$\langle d_{\mathrm{FRO}} \rangle \approx \sqrt{2n(n-1)\, p(1-p)},$$
which exhibits a maximum at $p = \frac{1}{2}$ for any given $n$ but grows linearly with $n$; the latter two observations are qualitatively borne out in simulations.