Interface Focus
You have accessResearch article

What you see is not what you get: how sampling affects macroscopic features of biological networks

A. Annibale

A. Annibale

Department of Mathematics, King's College London, The Strand, London WC2R 2LS, UK

[email protected]

Google Scholar

Find this author on PubMed

and
A. C. C. Coolen

A. C. C. Coolen

Department of Mathematics, King's College London, The Strand, London WC2R 2LS, UK

Randall Division of Cell and Molecular Biophysics, King's College London, New Hunt's House, London SE1 1UL, UK

London Institute for Mathematical Sciences, 22 South Audley Street, London W1K 2NY, UK

Google Scholar

Find this author on PubMed

    Abstract

    We use mathematical methods from the theory of tailored random graphs to study systematically the effects of sampling on topological features of large biological signalling networks. Our aim in doing so is to increase our quantitative understanding of the relation between true biological networks and the imperfect and often biased samples of these networks that are reported in public data repositories and used by biomedical scientists. We derive exact explicit formulae for degree distributions and degree correlation kernels of sampled networks, in terms of the degree distributions and degree correlation kernels of the underlying true network, for a broad family of sampling protocols that include random and connectivity-dependent node and/or link undersampling as well as random and connectivity-dependent link oversampling. Our predictions are in excellent agreement with numerical simulations.

    1. Introduction

    Networks are popular simplified representations of complex biological many-variable systems. The network representation reduces the complexity of the problem by retaining only information on which pairs of dynamical variables in a given system interact, leading to a graph in which the nodes (or vertices) represent the dynamical variables and the links (or edges) represent interacting pairs. If all interactions are symmetric under interchanging the two variables concerned, the resulting network is non-directed (as e.g. in protein–protein interaction networks (PPINs)). If some or all are non-symmetric, the network is directed (as e.g. in gene regulation networks). Present-day biological databases contain PPINs and gene regulation networks of many species, with typically in the order of N ∼ 103−104 nodes each, measured and post-processed by various different techniques and protocols. However, in biology, the available experimental techniques do not sample the complete system, but only a finite fraction; for the human PPIN this fraction is presently (and inaccurately) estimated to be around 0.5 [1]. Furthermore, the sampling tends to be biased by which experimental method is used [2]. In order to use the available data wisely and reliably, it is vital that we understand in quantitative detail how the topological characteristics of a real network relate to those of a finite (biased or unbiased) random sample of this network. If, for instance, we observe that certain modules appear more often (or less often) than expected in certain cellular signalling networks, we need to be sure that this is not simply a consequence of sampling. The first studies of the effects of false negatives in the detection of links and/or nodes (i.e. bond and/or node undersampling) on network topologies focused on the relation between true and observed degree distributions, either analytically [3,4] or via numerical simulation [5], and found that undersampling changes qualitatively the shape of the degree distribution. Subsequent studies [6,7], based on numerical simulation, revealed the effects of undersampling on topological features other than the degree distribution, such as clustering coefficients, assortativity and the occurrence frequencies of local motifs. More recent publications were devoted to sampling of non-biological networks, such as the Internet [8] and bipartite networks [9]. So far all published studies on the effects of sampling have either been based on numerical simulations, or been restricted to the effects of sampling on a network's degree distribution. Moreover, there are only very few studies that considered connectivity-dependent sampling (e.g. [4]), and none that investigate the effects of false positive (i.e. oversampling). In the present paper, we use statistical mechanical methods from the theory of tailored random graphs to study systematically the effects of sampling on macroscopic topological features of large networks. We extend previous work in several ways. Firstly, we investigate the effect of sampling on macroscopic observables beyond the degree distribution, e.g. the joint degree distribution of connected node pairs from which one calculates quantities such as the assortativity. Secondly, we do this for both random and connectivity-dependent sampling of either nodes, links or both. Thirdly, we investigate not only network undersampling, but also the implications of false positives in the detection of links, i.e. bond oversampling. All our results are obtained analytically, and formulated in terms of explicit equations that express degree distributions and degree correlations of observed networks in terms of those of the underlying true networks. We test our analytical predictions against numerical simulations and find excellent agreement.

    2. Definitions

    2.1. Networks and sampling protocols

    We consider non-directed networks or graphs. Each is defined by a symmetric matrix c = {cij}, with i,j = 1 … N and with cij ∈ {0,1} for all (i,j). Nodes i and j are connected if and only if cij = 1. We exclude self-interactions, i.e. cii = 0 for all i. The degree ki(c) of a node i is ki(c) = ∑jcij, the degree distribution of graph c is p(k|c) = N−1iδk,ki(c), and we abbreviate its degree sequence as k(c) = (k1(c), …,kN(c)). Sampling stochastically an N-node graph c will result in observation of an N node graph c′. The relation between c′ and c depends on the details of the sampling process. We use random variables σi ∈ {0,1} to denote whether a true node i is observed, and τij ∈{0,1} whether a link (i,j) is observed (if nodes i and j are). In studying oversampling λij ∈{0,1} will indicate whether an absent link is falsely reported as present. Thus:

    Display Formula
    2.1

    In a biological context, node oversampling (e.g. detecting a non-existent protein) would be less realistic, so will not be considered in this paper. Note that N′ = ∑iN σi . We take all sampling variables σ = {σi}, τ = {τij} and λ = {λij} to be stochastically independent, with the proviso that τij = τji, λij = λji and λii = 0 (so sampled networks remain non-directed and without self-interactions). In random sampling their probabilities are functionally independent of the site indices; in connectivity-dependent sampling, the probabilities will depend on the degrees of the nodes involved. We conclude that the different types of sampling under equation (2.1) are all special cases of the following unified process:

    Display Formula
    2.2
    with
    Display Formula
    2.3

    Here x(k)∈ [0,1] gives the likelihood that a node with degree k will be detected, y(k,k′) ∈ [0,1] the likelihood that a link connecting nodes with degrees (k,k′) will be detected, and z(k,k′)/N ∈ [0,1] the likelihood that an absent bond will be falsely reported as present (the latter scales as N−1 to retain finite connectivity for large N). For random sampling, the control parameters in equation (2.3) would all be degree-independent, i.e. x(k) = x, y(k,k′) = y and z(k,k′) = z. We note that, since non-existing nodes cannot give false negatives, we may always choose x(0) = y(0,k) = y(k,0) = 0 for all k. For connectivity-dependent sampling, plausible choices for the functional dependence of the control parameters on the local degree would be x(k) = k/kmax and/or y(k,k′) = kk′/kmax2 and/or z(k,k′) = kk′/kmax2, since high-degree nodes and links connecting high-degree nodes are more likely to be reported.

    2.2. Macroscopic characterization of network structure

    To control analytically the topological properties of the networks to which our sampling protocols (2.1) are applied, we consider the following maximum entropy ensemble, tailored for large N, to the production of graphs with prescribed degrees and prescribed degree correlations:

    Display Formula
    2.4
    with p(k) = N−1iδk,ki and Inline Formula, and with ZN the appropriate normalization constant. Graphs generated according to ensemble (2.4) will have k(c) = k, p(k|c) = p(k) and ∑cp(c)W(k,k′|c) = W(k,k′), where Inline Formula is the joint degree distribution of connected node pairs. Apart from the information in k and W(k,k′), the ensemble (2.4) is unbiased; see Annibale et al. [10] for derivations of its information-theoretic properties, Coolen et al. [11,12] for Monte Carlo Markov Chain (MCMC) algorithms via which its graphs can be generated numerically and for a review on the topic. The remainder of this paper is devoted to calculate analytically how in large networks, with given degree sequences and given degree correlations (i.e. as those typically generated via ensemble (2.4)), sampling affects the macroscopic topological characteristics p(k) and W(k,k′). To be specific, we calculate the average connectivity, the degree distribution and the degree correlation function, after sampling from large graphs drawn from ensemble (2.4), in terms of the sampling characteristics {x(k),y(k,k′),z(k,k′)},1
    Display Formula
    2.5
    Display Formula
    2.6
    and
    Display Formula
    2.7
    with cij as defined in equation (2.2) and 〈 · 〉σ,τ,λ denoting averages over the sampling parameters distribution (2.3). The denominators are simplified trivially, using the independence of the sampling variables and the definition of W(k,k′|c), since
    Display Formula
    2.8
    and
    Display Formula
    2.9

    We may therefore write

    Display Formula
    2.10
    Display Formula
    2.11
    and
    Display Formula
    2.12

    In the following sections, we will calculate analytically the observables (2.10), (2.11) and (2.12) and will test our theoretical results against numerical simulations. To this purpose, we will sample from reasonably large graphs (either synthetically generated or real biological PPINs) and we will measure the degree distribution and the degree correlations after sampling.2 These will be compared with the analytically calculated post-sampling degree distribution and degree correlations resulting from averages over graph ensembles asymptotically tailored to the production of graphs with the same degree sequence and degree correlations as the graph instances used for numerical simulations. The extent to which theoretical predictions and numerical simulations agree will provide an indication of how well, for reasonably large graphs, the behaviour of degree distribution and degree correlations under sampling is captured by averages of such quantities over the corresponding maximum entropy ensembles.

    3. Effects of sampling on degree distributions

    3.1. Connection between observed degree distributions and degree correlations

    We note that in the case of connectivity-dependent sampling, the average degree (2.10) in the observed graph will generally depend not only on the degree distribution of the original graph, but also on the latter's degree correlations. Hence, our decision to use the graph ensemble (2.4) for the present study. The observed distributions p(k|x,y,z) and W(k,k′|x,y,z) in expressions (2.11) and (2.12) are connected via a simple identity, as are p(k) and W(k,k′) in the original graph c:

    Display Formula
    3.1

    So for large N, we need to calculate, in principle, only W(k,k′|x,y,z), as p(k|x,y,z) follows via identity (3.1). Alternatively, since for random sampling, p(k|x,y,z) can be found analytically with little effort, the identity (3.1) can be used for verifying the result of our calculation of expression (2.12).

    3.2. Degree distribution for random sampling

    Calculating p(k|x,y,z) is only straightforward for random sampling, irrespective of whether the source graph is generated according to ensemble (2.4), since, in that case, expression (2.11) can be made to factorize over the sampling variables by writing the Kronecker-δ in integral form. In order to appreciate the roles played by the different ingredients of expression (2.11), we first write it in the form p(k|x,y,z) = limN→∞c p(c)pN(k|x,y,z; c), with

    Display Formula
    3.2

    For random sampling protocols, where x(k) = x, y(k,k′) = y and z(k,k′) = z, this expression immediately simplifies to the transparent result

    Display Formula
    3.3
    in which I(S) is the indicator function (i.e. I(S) = 1 if S is true, otherwise I(S) = 0). The observed average degree (2.10) for random sampling is, as expected,
    Display Formula
    3.4
    Formula (3.3) simplifies further for various special cases. For instance:

    — Random bond and/or node undersampling, i.e. z = 0:

    Display Formula
    3.5
    This implies that if we sample from a graph with Poissonian degree distribution, i.e. Inline Formula, then the degree distribution of the sampled graph will be
    Display Formula
    3.6
    i.e. again a Poissonian distribution, but with a reduced average degree Inline Formula. This recovers earlier results of Stumpf and co-workers [3,4]. We note also that equation (3.5) is invariant under exchanging x and y, so sampling all nodes and a fraction x = ξ of the bonds is equivalent to sampling all bonds and a fraction y = ξ of the nodes. We show in §4.1 that this equivalence between bonds and nodes under random undersampling also holds for the degree correlations. In figure 1, we show the predicted degree distributions (3.5) together with the corresponding results of numerical simulation of the sampling process, for synthetically generated networks with size N = 3512 and average connectivity Inline Formula (as in the biological PPIN of Caenorhabditis  elegans [13]) and Poissonian and power-law degree distributions. The agreement between theory and experiment is perfect.
    Figure 1.

    Figure 1. Effect of random node undersampling (a,b) and bond undersampling (c,d) on the degree distribution of synthetically generated networks with size N = 3512, average connectivity Inline Formula and Poissonian degree distribution (a,c) or power-law distributed degrees (b,d). Different symbols correspond to different fractions of sampled nodes (0.5, 0.7 and 1 as shown in the legend) and predicted values (symbols connected by dotted lines) lay on the top of data points from simulations (symbols connected by dashed lines), obtained by averaging over 50 samples.

    — Random bond oversampling, i.e. x = y = 1:

    Display Formula
    3.7
    As with random undersampling, we observe that sampling from a graph with Poissonian degree distribution, i.e. Inline Formula leads to a sampled graph that is again Poissonian, but now with average degree Inline Formula:
    Display Formula
    3.8
    Results from numerical simulations applied to Poissonian and preferential attachment networks are shown in figure 2 together with the corresponding theoretical predictions. Again the agreement between theory and experiment is perfect.

    Figure 2.

    Figure 2. Effect of random bond oversampling on (a) the degree distribution of synthetic Poissonian graphs and (b) synthetic power-law graphs, both with size N = 3512 and average connectivity Inline Formula. Different symbols correspond to different fractions z/N of ‘false positive’ bonds, with z = 0, 2, 5, 10 as shown in the legend. The theoretically predicted values (symbols connected by dotted lines) are found to lay perfectly on top of the data points from simulations (symbols connected by dashed lines), obtained by averaging over 100 samples.

    3.3. Degree distribution for connectivity-dependent sampling

    In the case of connectivity-dependent sampling, where x(k), y(k,k′) and z(k,k′) are no longer all degree-independent, one can no longer evaluate (3.9) without knowledge of the degree–degree correlations in the sources graph c. However, the average (3.9) over the graph ensemble with controlled degree correlations is still feasible. In appendix A, we calculate the marginal (A 24) of the expected kernel W(k,k′|x,y,z) for the sampled network, from which we obtain p(k|x,y,z) via the connection (3.1). One always has p(0|x,y,z) = 0, whereas for k > 0:

    Display Formula
    3.9
    with
    Display Formula
    3.10
    Display Formula
    3.11
    and
    Display Formula
    3.12

    The average connectivity Inline Formula, as given in observables (2.10), is easily obtained from equation (3.9) using normalization of the conditional probabilities 𝒥(k|q) and ℒ(k|q)

    Display Formula
    3.13

    Let us now work out these results for the ‘natural’ types of connectivity-dependent samplings, where the likelihood of observing nodes or links is proportional to the degrees of the nodes involved, with α ∈ [0,1]:

    Connectivity-dependent node undersampling, i.e. x(k) = αk/kmax, y(k,k′) = 1, z(k,k′) = 0:

    Here, we have

    Display Formula
    3.14
    and
    Display Formula
    3.15

    This leads to

    Display Formula
    3.16
    and
    Display Formula
    3.17
    where Inline Formula is the average number of paths of length 3.

    — Connectivity-dependent bond undersampling, i.e. x(k) = 1, y(k,k′) = α kk′/kmax2, z(k,k′) = 0:

    This choice leads again to equation (3.14), while equations (3.15) are now replaced by

    Display Formula
    3.18

    Hence, one gets

    Display Formula
    3.19
    and
    Display Formula
    3.20

    — Connectivity-dependent bond oversampling, i.e. x(k) = y(k,k′) = 1, z(k,k′) = α kk′/kmax2:

    Here, we have

    Display Formula
    3.21
    Display Formula
    3.22
    and
    Display Formula
    3.23
    Substituting into equations (3.9) and (3.13) yields
    Display Formula
    3.24
    and
    Display Formula
    3.25

    In figure 3, we show the predicted degree distribution (3.24) together with the corresponding results from numerical simulations of the connectivity-dependent bond oversampling process.

    Figure 3.

    Figure 3. Effect of connectivity-dependent bond oversampling (i.e. x(k) = 1, y(k,k′) = 1, z(k,k′) = αkk′/kmax2) on (a) the degree distribution of synthetic Poissonian graphs and (b) synthetic power-law graphs, both with size N = 3512 and average connectivity Inline Formula. Different symbols correspond to different values of Inline Formula, as shown in the legend. Theoretically predicted values (symbols connected by dotted lines) are found to lay perfectly on top of the data points from simulations (symbols connected by dashed lines), obtained by averaging over 100 samples.

    3.4. Summary

    We have seen that the degree distributions of large sampled networks can be calculated and written explicitly in terms of the topological characteristics of the true network, for random and connectivity-dependent under- and oversampling. From the resulting equations, we can draw the following conclusions:

    — Sampling generally affects the shape of the degree distribution of a network, with the exception of a Poissonian distribution (as for Erdos–Renyi graphs), where the sampled network will only have a rescaled average degree compared with the original. This result is consistent with findings in Stumpf and co-workers [3,4].

    — The degree distribution observed after random node undersampling of a network is identical to that following random bond undersampling, for any large graph, if the two (node- or bond-) sampling probabilities are identical.

    — In contrast, connectivity-dependent node undersampling (where the probability of observing a node is proportional to its degree) generally leads to a network with a degree distribution that is different from the one that would result from connectivity-dependent bond undersampling (where the probability of observing a bond is proportional to the degrees of the two attached nodes).

    4. Effects of sampling on degree correlation function

    In appendix A, we calculate the degree correlation function W(k,k′|x,y,z) of large networks that are sampled according to the general protocol (2.2), from graphs generated from ensemble (2.4). The resulting, expressed in terms of the topological properties p(k) and W(k,k′) of the true network, is

    Display Formula
    4.1

    with Inline Formula as given in observables (2.10), two conditional distributions 𝒥(k|q) and ℒ(k|q) defined in equations (3.10) and (3.11), and with the short-hands a(q) and b(q) defined in equation (3.12). We will now work out this general result for the most common types of sampling, viz. node undersampling, bond undersampling and bond oversampling, including both random- and connectivity-dependent protocols.

    4.1. Degree correlations for random sampling

    For random sampling protocols where x(q) = x, y(q,q′) = y and z(q,q′) = z, one has a(q) = xz, b(q) = xy and ℒ(k|q) = 𝒥(k|q − 1), so (4.1) simplifies immediately to

    Display Formula
    4.2

    with

    Display Formula
    4.3

    Formula (4.2) simplifies further for various special cases:

    — Random node and/or bond undersampling, i.e. z = 0. Here, we obtain

    Display Formula
    4.4
    so equation (4.2) reduces to
    Display Formula
    4.5

    We note that W(x,y,0), like equation (3.5) previously, is symmetric under exchanging x and y, i.e. node and bond random undersampling lead to the same degree correlations. Therefore, the equivalence between the two samplings is now fully established for large graphs drawn from ensemble (2.4).

    Equation (4.5) clearly shows that sampling from graphs in which degree correlations are present will generally affect those correlations, even in Poissonian networks, in spite of the fact that there the degree distribution is only changed via a reduction of the average degree. Conversely, if we sample from graphs without degree correlations, i.e. for which Inline Formula Inline Formula, equation (4.5) reveals that the degree correlation function in the sampled graph factorizes in the product of its marginals as well, i.e. W(k,k′|x,y,0) = W(k|x,y,0)W(k′|x,y,0). This means that random bond and/or node undersampling from graphs without degree correlations does not generate any degree correlations.

    In order to observe how sampling protocols affect degree correlations, we will monitor, instead of W(k,k′) itself, the normalized kernel Π(k,k′) = W(k,k′)/W(k)W(k′), which will by definition equal unity in the absence of degree correlations. Any deviation from Π(k,k′) = 1 will thus signal the presence of degree correlations. We show the predicted degree correlations in the case of random bond undersampling, together with the corresponding results of numerical simulations, for Poissonian and power-law graphs, in figures 4 and 5, respectively. In figure 6, we show numerical results and theoretical predictions for random node undersampling from Poissonian and power-law graphs. Results for random bond undersampling from the real, biological PPIN of C. elegans are shown in figure 7 (left panels). The agreement between theory and experiment is very satisfactory; deviations are small and consistent with finite size effects. As explained earlier, this confirms a posteriori that the performance of the biological network to be sampled (here C. elegans) is similar to the average behaviour of the maximum entropy ensemble (2.4), where p(k) and W(k,k′) are the degree distribution and the degree correlation function of the biological PPIN, respectively. As a consequence, the biological network can be realistically approximated by a member of such ensemble. As an additional test, we generate synthetically a member of the maximum entropy ensemble asymptotically tailored to the production of graphs with the same degree sequence and degree correlations as the PPIN of C. elegans by using the MCMC algorithm proposed in Coolen et al. [11]. The degree correlations of the resulting graph are shown in the top right panel of figure 7 and are in good agreement with the degree correlations of the PPIN that are being targeted (top left panel). Note that the Hamming distance between the biological PPIN and the synthetically generated graph is 0.93, so similarity in degree correlations is not consequence of similarity in the connectivity matrices.3 Theoretical and numerical results for random bond undersampling from such randomized counterpart of C. elegans are shown in figure 7 (middle and bottom right panels).

    Figure 4.

    Figure 4. Normalized degree correlation function Π(k,k′) = W(k,k′)/W(k)W(k′) of two synthetically generated Poissonian graphs with N = 3512 and Inline Formula and different degree correlations (as shown in panels (a,b) respectively) before (top panels) and after (middle panels) sampling, a fraction y = 0.7 of the bonds of the original graphs (data result from averaging over 104 samples) and their respective theoretical predictions (bottom panels).

    Figure 5.

    Figure 5. Normalized degree correlation function Π(k,k′) of two synthetically generated power-law graphs with N = 3512 and Inline Formula and different degree correlations (as shown in panels (a,b) respectively) before (top panels) and after (middle panels) sampling a fraction y = 0.9 of the bonds of the original graph (data result from averaging over 104 samples) and their theoretical prediction (bottom panels).

    Figure 6.

    Figure 6. Normalized degree correlation function Π(k,k′) of (a) synthetically generated Poissonian and (b) power-law graphs with N = 3512 and Inline Formula before (top panels) and after (middle panels) sampling a fraction x = 0.9 of the nodes of the original graph (data result from averaging over 104 samples) and their theoretical prediction (bottom panels).

    Figure 7.

    Figure 7. Normalized degree correlation function Π(k,k′) for (a) the biological PPIN of C. elegans and (b) one synthetically generated member of its corresponding tailored graph ensemble, before (top panels) and after (middle panels) sampling a fraction x = 0.9 of the bonds of the original graph and their theoretical prediction (bottom panels). For both networks, N = 3512 and Inline Formula and data resulted from averaging over 104 samples.

    — Random bond oversampling, i.e. x = y = 1.

    Here, we have

    Display Formula
    4.6
    so using our earlier result from equation (3.7)
    Display Formula
    4.7
    we may write
    Display Formula
    4.8
    which leads to the transparent expression

    Display Formula
    4.9

    We note for later that substituting equation (4.8) into equation (3.9) and bearing in mind that ℒ(k|q) = 𝒥(k|q − 1), a(q) = z and b(q) = 1, we have

    Display Formula
    4.10
    which yields
    Display Formula
    4.11
    where Inline Formula.

    We now study the effects of oversampling on graphs without degree correlations. Denoting

    Display Formula
    4.12
    which is z-dependent via the function 𝒥, we may rewrite (4.9) as
    Display Formula
    4.13
    If the original graph has no degree correlation, i.e.
    Display Formula
    4.14
    the sampled graph will have degree correlation
    Display Formula
    4.15
    where we have used Inline Formula Inline Formula, in accordance with identity (3.1) and equation (3.4). For z = 0, 𝒥(k|q) = δk,q+1 and W(k|0) = S0(k) = W(k), so the second term in equation (4.15) vanishes; however, for z ≠ 0, this will be generally different from zero: crucially, but not unexpectedly, oversampling from a graph without degree correlations automatically introduces degree correlations. Numerical results and theoretical predictions for random bond oversampling are shown in figures 8 and 9 for the biological PPIN of C. Elegans and synthetically generated Poissonian and power-law counterparts, respectively.
    Figure 8.

    Figure 8. Normalized degree correlation function Π(k,k′) of biological protein interaction network of C. elegans (N = 3512 and Inline Formula) (a) before and (b) after adding a fraction z/N of bonds, with z = 1, and (c) its theoretical prediction. Data obtained by averaging over 104 samples.

    Figure 9.

    Figure 9. Normalized degree correlation function Π(k,k′) of (a) synthetically generated Poissonian and (b) power-law graphs with N = 3512 and Inline Formula before (top panels) and after (middle panels) adding a fraction z/N of bonds, with (a) z = 1 (left) and (b) z = 2, and their respective theoretical predictions (bottom panels). Data obtained by averaging over 104 samples.

    4.2. Degree correlations for connectivity dependent sampling

    Let us now work out equation (4.1) for the types of biased sampling considered above.

    — Connectivity-dependent node undersampling, i.e. x(k) = αk/kmax, y(k,k′) = 1, z(k,k′) = 0

    Here, we have

    Display Formula
    4.16
    and
    Display Formula
    4.17
    so our equations reduce to
    Display Formula
    4.18

    — Connectivity-dependent bond undersampling, i.e. x(k) = 1, y(k,k′) = α kk′/kmax2, z(k,k′) = 0

    For this choice, we obtain

    Display Formula
    4.19
    and
    Display Formula
    4.20

    which leads to

    Display Formula
    4.21

    — Connectivity-dependent bond oversampling, i.e. x(k) = 1, y(k,k′) = 1, z(k,k′) = αkk′/kmax2

    Here, we get

    Display Formula
    4.22
    Display Formula
    4.23
    and
    Display Formula
    4.24

    Hence, we obtain

    Display Formula
    4.25

    Numerical results and theoretical predictions for connectivity-dependent bond oversampling are shown in figure 10 for synthetically generated Poissonian and power-law graphs.

    Figure 10.

    Figure 10. Effects of connectivity-dependent bond oversampling on the normalized degree correlation Π(k,k′) of the synthetically generated Poissonian and power-law graphs (N = 3512, Inline Formula) shown in the panels (a,b), respectively. Middle panels show the result of simulations for (a) Inline Formula and (b) Inline Formula (right) and bottom panels show the corresponding theoretical predictions. Numerical data result from averaging over 104 samples.

    4.3. Summary

    As was the case for the degree distribution, also the degree correlations can for a broad class of sampling protocols be calculated exactly and in terms of fully explicit relations. In contrast to the degree distribution, for which the sampling problem had already been studied partly by other authors, we are not aware of any analytical results for degree correlations. Our equations revealed that:

    — Sampling will always affect the degree correlations of networks, even in the random (i.e. connectivity independent) case, if the original networks had such degree correlations.

    — Uncorrelated networks will remain uncorrelated after sampling only for random node and/or bond undersampling. Bond oversampling will in general introduce degree correlations, even in the connectivity independent case.

    — Random node and bond undersampling both modify the degree correlations (and the degree distribution) in the same way, so they are equivalent for any graph with prescribed topological features p(k) and W(k,k′), as generated from ensemble (2.4).

    — Node and bond undersampling cannot be mapped onto each other in the case of connectivity-dependent sampling; their effects are qualitatively different.

    5. Discussion

    It is well known that the presently available data on cellular signalling networks are incomplete, and often suffer from serious experimental bias, reflecting the highly non-trivial nature of the experimental methods available for their collection. Yet, a significant number of research papers continue to be written in which such data are used to infer statements on the possible biological relevance of local network modules or motifs. In addition, the signalling network data are increasingly used for preprocessing gene expression data in order to derive more robust disease-specific prognostic signatures [1416], and will very soon impact on actual treatment decisions in medicine (e.g. will be used to suggest which cancer patients are likely to benefit from which chemotherapy). Given this situation, it is vital that we understand quantitatively the data imperfections, i.e. the relation between the true biological signalling networks probed and the imperfect network samples of these networks that are reported in public data repositories and presently used by biomedical scientists. To do this, we need mathematical tools; the relevant networks are too large to rely on numerical simulation alone. Moreover, unlike simulations, analytical results can be used in reverse to infer the most probable true networks from the imperfect observed samples.

    Ensembles of tailored random graphs with controlled topological properties are a natural and rigorous language for describing biological networks. They suggest precise definitions of structural features, they allow us to classify networks and obtain precise (dis)similarity measures, they provide precise ‘null models’ for hypothesis testing, and they can serve as efficient proxies for real networks in process modelling. In this paper, we have shown how they can also be used to study analytically the effects of sampling on macroscopic topological properties of large biological networks, under a much wider range of conditions than those considered in previous analytical studies (the latter are recovered as special simple cases). We have obtained explicit expressions for both degree distributions and degree correlation kernels of sampled networks, and have been able to do this for sampling protocols that involve node and/or link undersampling as well as for link oversampling. Our predictions are in excellent agreement with numerical simulations.

    As could have been expected, the most dangerous types of sampling are the connectivity-dependent ones, where the probability to observe bonds or links depends on the degrees of the nodes concerned. Unfortunately, present experimental protocols are quite likely to involve precisely such sampling. We therefore hope that our new analytical tools, which take the form of explicit and transparent equations that connect the topological structure functions p(k) and W(k,k′) of the sampled and the true networks, can prove useful in explaining and decontaminating signalling network data.

    Acknowledgements

    The authors are grateful to L. Fernandes, F. Fraternali, J. Kleinjung and E.S. Roberts for stimulating discussions. A.C.C.C. would like to thank the Engineering and Physical Sciences Research Council (UK) and the Biotechnology and Biological Sciences Research Council (UK) for support.

    Appendix A. Joint degree distribution of connected nodes

    A.1. Path integral representation of W(k,k′|x,y,z)

    Here, we calculate the joint degree distribution of connected nodes (2.12) that will be observed in large networks that are sampled, according to protocol (2.2), from typical graphs with prescribed macroscopic topological features p(k) and W(k,k′), as generated from ensemble (2.4). With the short-hands Inline Formula, k = (k1, …, kN ), and Ω = (Ω1, …, ΩN ) ∈ [ −π, π]N we may write

    Display Formula
    Display Formula
    A 1

    We next introduce the following order parameters

    Display Formula
    A 2
    and insert into equation (A 1) for each (q,Ω) the following integral:
    Display Formula
    A 3

    This converts equation (A 1) into the following path integral, with the short-hand Inline Formula and with ZN′ a new constant that apart from containing ZN absorbs various factors N and constants that are generated when transforming sums over Ω into integrals:

    Display Formula
    A 4
    in which Inline Formula will eventually drop out of our formulae (via normalization) and
    Display Formula
    A 5
    Display Formula
    A 6
    and
    Display Formula
    A 7

    The relevant ω-integrals are of the familiar form

    Display Formula
    A 8
    (unless ℓ < 0, in which case the integral is zero). We also note that by definition we always have the normalization identity Inline Formula. So we arrive at:
    Display Formula
    A 9
    with
    Display Formula
    A 10
    and in which, via the steepest descent argument, the order parameters Inline Formula are the functions that extremize the kernel (A 5).

    A.2. Functional saddle-point equations

    Functional variation of equation (A 5) gives the following saddle-point equations for Inline Formula:

    Display Formula
    A 11
    and
    Display Formula
    A 12

    Equivalently:

    Display Formula
    A 13
    with
    Display Formula
    A 14
    and
    Display Formula
    A 15

    The integrals over Ω in equations (A 14) and (A 15) are again of the type (A 8), from which we derive Inline Formula and Inline Formula. This then converts equations (A 14) and (A 15) into

    Display Formula
    A 16

    Since we know the marginal of the distribution W(k,k′) to be Inline Formula (which follows directly from its definition), we can immediately read off the solution of equation (A 16):

    Display Formula
    A 17

    Insertion into equation (A 13) and using equation (A 8) gives the solution of equations (A 11) and (A 12) in explicit form:

    Display Formula
    A 18

    A.3. Final result for the distribution W(k,k′|x,y,z)

    We can now evaluate the various ingredients of equation (A 9). The function Inline Formula becomes

    Display Formula
    A 19

    Hence

    Display Formula
    A 20
    in which
    Display Formula
    A 21
    and
    Display Formula
    A 22

    Summation over k reveals that Inline Formula for all q > 0, which leads to the final result:

    Display Formula
    A 23
    with Inline Formula as given in equation (2.10). The marginals of W(k,k′|x,y,z) are obtained trivially by summing equation (A 23) over k′, giving
    Display Formula
    A 24

    A.4. Explicit expression for the factors 𝒥(k|q)

    To carry out the integral in equations (A 21) and (A 22), we first write Q(q,Ω) as Q(q,Ω) = a(q)+b(q)qe−iΩ, with

    Display Formula
    A 25

    We note that, owing to Inline Formula, we can be sure that a(q) ∈ [0,1] and b(q) ∈ [0,1]. Substitution into equations (A 21) and (A 22) and integration over Ω, for q > 0 and k > 0, then leads to

    Display Formula
    A 26
    and, similarly,
    Display Formula
    A 27

    Clearly, Inline Formula for all (k,q). Since the factors (A 21) and (A 22) also satisfy the normalization Inline Formula for all q > 0, they can be interpreted as conditional probabilities, as suggested by our chosen notation.

    A.5. Tests

    To test our expression (A 23) we set x(k) = x, y(k) = y and z(k,k′) = z, and try to recover from equation (A 24) via identity (3.1), our earlier results on the degree distribution for unbiased sampling. We now find a(q) = xz, b(q) = xy and Inline Formula, which implies that

    Display Formula
    A 28
    and
    Display Formula
    A 29

    Let us inspect the following cases:

    — Perfect sampling, i.e. x = y = 1 and z = 0.

    Now there should be no difference between the kernel W(k,k′) and the observed kernel W(k,k′|1,1,0) of the sample. Here, we see that equation (A 19) simplifies to Q(q,Ω) = qe−i Ω; hence a(q) = 0 and b(q) = 1 leading to Inline Formula and therefore to the correct identity W(k,k′|x,y,z) = W(k,k′).

    — Unbiased node and/or link undersampling, i.e. xy < 1 and z = 0.

    Now we have Inline Formula and

    Display Formula
    A 30
    which gives
    Display Formula
    A 31
    and therefore we recover the correct expression
    Display Formula
    A 32

     Unbiased bond oversampling, i.e. x = y = 1 and z > 0.

    Now Inline Formula and Inline Formula, which results in

    Display Formula
    A 33
    which is indeed the correct result identified earlier.

    Footnotes

    1 One should expect that macroscopic physical observables such as p(k|c) and W(k,k′|c) are self-averaging, and can therefore be calculated, to leading order in N, in terms of their expectation values (2.5), (2.6) and (2.7) over the ensemble (2.4).

    2 The software used in this paper for generating and sampling from networks is available from the authors upon request (in standard C).

    3 The Hamming distance between two graphs c and c′ of size N and average degree is defined as ρ(c,c′) = (2Nk̄)−1ij|cijcij′| and takes values between 0 (cij = cij′ ∀i,j) and 1 (cijcij′ ∀i,j).

    One contribution of 9 to a Theme Issue ‘Inference in complex systems’.