Multi-model and network inference based on ensemble estimates: avoiding the madness of crowds

Recent progress in theoretical systems biology, applied mathematics and computational statistics allows us to compare quantitatively the performance of different candidate models at describing a particular biological system. Model selection has been applied with great success to problems where a small number of models, typically fewer than 10, are compared, but recent studies have started to consider thousands and even millions of candidate models. Often, however, we are left with sets of models that are compatible with the data, and we can then use ensembles of models to make predictions. These ensembles can have very desirable characteristics but, as I show here, they are not guaranteed to improve on individual estimators or predictors. I show, for the cases of model selection and network inference, when we can trust ensembles and when we should be cautious. The analyses suggest that the careful construction of an ensemble, i.e. choosing good predictors, is of paramount importance, more than had perhaps been realized before: merely adding different methods does not suffice. The success of ensemble network inference methods is also shown to rest on their ability to suppress false-positive results. A Jupyter notebook that allows such an assessment of ensemble estimators to be carried out is provided.


Introduction
In physics, simple and elegant symmetry relationships have often led the way to theoretical models [1]. Most importantly, Emmy Noether's theorem has been pivotal in establishing the correspondence between continuous symmetries and conservation laws [2]; it has been instrumental in the derivation of physical laws of nature. Biology has not been able to draw on such fundamental principles [3], to a large degree because most biological processes are intrinsically dissipative (meaning that energy is continually consumed and dissipated) and hence the conditions under which Noether's theorem holds simply do not apply. Instead, biological models have often had a heuristic element, or have described the rates of change of (often macroscopic) observables (e.g. species of plants or animals, or molecular species).
Writing down the set of equations is an important starting point in modelling, as it forces us to express our assumptions in a precise form. Which form to choose is, however, rarely unambiguous. Instead, we often rely on data to decide between the different options. Statistical model selection [4] provides the tools to make such decisions, balancing the ability of a model to fit the data against the model's complexity. As larger and larger models, even models of whole cells [5-8], are being considered, model selection problems will presumably become the norm, especially when models are constructed exhaustively or automatically [9-16].
In some, probably many, situations model selection will not be able to decide on a single best model. Instead, many models may have comparable support. In such a situation, we may then base analysis or predictions on the ensemble of models that are supported by the data [17]. Each model's contribution to the prediction or estimate is weighted by the relative support it has received. Such estimates or predictions based on ensembles have been described as exploiting the 'wisdom of crowds' [18], the notion that estimates derived from a group of individuals/models are likely to be better than those based on a single individual/model. The concept, however, also recalls much earlier work, Charles Mackay's 'Extraordinary Popular Delusions and the Madness of Crowds' [19], a 19th-century account of how popular sentiment can sustain quite extraordinary and plainly wrong opinions and ideas.
Ensemble estimators have a surprisingly long history, outlined in [20], and aspects such as bagging, boosting and stacking [21] are firmly established in the statistical learning literature; see, for example, [22] for a comprehensive treatment. The idea has also attracted interest in evolutionary biology [23], and, following [18], there have been further developments in systems and computational biology, e.g. [24,25]. In the context of network inference, however, combining different network reconstruction methods has sometimes been viewed as necessarily optimal [26]. Below, we show that this is not automatically the case. I will show that model averaging and ensemble estimation are susceptible to poorly defined sets of candidate models, and that the behaviour of ensemble approaches to network reconstruction depends strongly on the composition of the ensemble. For very good ensembles, the advantage comes mainly from reducing the number of false-positive edges. Both problems share a dependence on the quality of the ensemble, and we map out and quantify these influences; we also provide self-contained Julia code for further in silico experimentation and analysis of ensemble prediction methods.

Model selection and multi-model inference
We assume that we have a universe of models, M, that represent potential mechanisms by which some data, D, have been generated. For simplicity, we consider a finite number of models, N. Furthermore, for each model, M_i, we assume that we have a data-generating function, f_i(θ_i), parametrized by a parameter vector θ_i which is chosen from some suitable continuous parameter space, Θ_i. Coping with the size of the parameter space is one of the essential challenges of parameter estimation and model selection.
We start from a Bayesian framework [27], where we seek to determine the posterior distribution over parameters,

\[
\Pr(\theta_i \mid D) = \frac{\Pr(D \mid \theta_i)\,\pi_i(\theta_i)}{\Pr_i(D)}, \qquad (2.1)
\]

where Pr(D | θ_i) is the likelihood, π_i(θ_i) the prior over the parameters of model M_i, and Pr_i(D) = ∫ Pr(D | θ) π_i(θ) dθ (here, we make the dependence on the choice of model explicit through an index) is known as the evidence.
In the Bayesian framework, model selection is a (relatively) straightforward extension, and the model posterior is given by

\[
\Pr(M_i \mid D) = \frac{\Pr_i(D)\,\pi(M_i)}{\sum_{j=1}^{N} \Pr_j(D)\,\pi(M_j)}, \qquad (2.2)
\]

where, analogously to equation (2.1), we have the model posterior, Pr(M_i | D), and the model prior, π(M_i). The denominator terms in equations (2.1) and (2.2) are notoriously hard to evaluate for all but the simplest cases, and a large amount of ingenuity and work has been invested in computational schemes [27,28], including Markov chain Monte Carlo, sequential Monte Carlo and related approaches. Often even the likelihood is prohibitively expensive to evaluate, and so-called approximate Bayesian computation schemes have been devised to make Bayesian statistical inference possible [29].

Alternatives to the Bayesian framework, such as likelihood-based inference and optimization of cost functions [30], result in point estimates for the parameters, e.g. the value of θ that maximizes the probability of observing the data,

\[
\hat{\theta} = \arg\max_{\theta}\, \Pr(D \mid \theta). \qquad (2.3)
\]

Similarly, we can determine the maximum a posteriori estimate by finding the mode of the posterior distribution [27],

\[
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, \Pr(D \mid \theta)\,\pi(\theta). \qquad (2.4)
\]

Compared to analysis of the full posterior distribution, such local estimates lose a lot of relevant information, but some characteristics can still be recovered by considering the local curvature of the likelihood, i.e. the information matrix, or of the cost-function surface around the (local) extremum identified in this manner [31,32].

Model selection frameworks that are based on likelihood inference rely on criteria to find the optimal model among a set of candidate models. The Akaike information criterion (AIC) [4,33] for model M_i is given by

\[
\mathrm{AIC}_i = -2\log \Pr(D \mid \hat{\theta}) + 2 n_i, \qquad (2.5)
\]

with \(\hat{\theta}\) given by equation (2.3), and n_i the number of parameters of model M_i. The AIC is probably the most widely used model selection criterion, despite the fact that it is biased in favour of overly complicated models as the amount of available data increases. The Bayesian information criterion (BIC) does not suffer in the same way; it is defined as

\[
\mathrm{BIC}_i = -2\log \Pr(D \mid \hat{\theta}) + n_i \log \nu, \qquad (2.6)
\]

with ν representing the size of the data or number of samples. Several other information criteria exist (discussed, e.g., in [4,34]), but they all share the purpose of balancing model complexity against model fit. The BIC can be derived as an approximation to the full Bayesian model posterior probability, which achieves this balance implicitly.

If model selection cannot pick out a clear winner, then either (i) further analysis should be used to design better, more informative experiments that can discriminate among these models [35-37]; or (ii) these models should be considered as an ensemble [4,25]. The former has definite attractions, as it will lead to an increase in our understanding if we can discard some of the models/mechanistic hypotheses.
The latter approach, basing analysis and especially predictions on an ensemble of models, has become a popular concept in machine learning. Most notably, in the context of biological network inference, the 'wisdom of crowds' concept [18] has been important in popularizing inference based on several models. Here, we are considering model averaging, where contributions from different models may be weighted by their relative explanatory performance. In the Bayesian framework, we can use the posterior probability directly. In the context of an information criterion I_i for model M_i, we define [4]

\[
\Delta_i = I_i - \min_j I_j \qquad (2.7)
\]

and then determine the model weight as

\[
w_i = \frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{N} \exp(-\Delta_j/2)}. \qquad (2.8)
\]

The model weights (e.g. the Akaike weights if I_i is the AIC) provide the relative probability of model M_i being the correct model, conditional on the data D and the set of models, M, considered. Model averaging in this framework can serve as a variance reduction technique [4,21].
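As a minimal illustration of equations (2.7) and (2.8), the following Julia sketch (separate from, but in the spirit of, the accompanying notebook) computes model weights from a small set of information criterion values; the AIC values are invented purely for illustration.

```julia
# Minimal sketch: model weights from information criteria, equations (2.7) and (2.8).
# The AIC values below are hypothetical and serve only to illustrate the computation.
aic = [112.3, 114.1, 118.7, 131.0]

delta = aic .- minimum(aic)                       # Δ_i = I_i - min_j I_j, equation (2.7)
w = exp.(-delta ./ 2) ./ sum(exp.(-delta ./ 2))   # model weights, equation (2.8)

for (i, wi) in enumerate(w)
    println("model $i: Δ = $(round(delta[i], digits = 2)), weight = $(round(wi, digits = 3))")
end
```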

Statistical physics of model selection and ensemble estimation
In order to simplify the discussion, we define a relationship between the model probability (always implicitly understood to be either a posterior or relative model probability), p_i, and a cost or energy, ϵ_i, [8] as

\[
p_i = \frac{e^{-\epsilon_i}}{Z}, \qquad (3.1)
\]

with the normalization constant Z (the partition function) given by

\[
Z = \sum_{j=1}^{N} e^{-\epsilon_j}.
\]

With this in hand, we can straightforwardly consider different model selection/averaging frameworks in a similar manner. In general, the true data-generating model (a natural system) will not be represented perfectly by any of the models in M; we denote this true model by @ ∉ M. But if we are interested in finding out whether @ has a certain characteristic, ϕ, we would obtain this from the appropriate ensemble average

\[
\Pr(\phi) = \sum_{i=1}^{N} \mathbb{1}(\phi \in M_i)\, p_i, \qquad (3.2)
\]

where 1(x) is the conventional indicator function, i.e. 1(x) = 1 if x is true, and 1(x) = 0 otherwise.
Equation (3.2) is based on three, potentially strong, assumptions.

1. The model universe, M, is 'complete' in the sense that we expect M to contain any model, M_k, which we might expect to be a reasonable description of @ (always remembering that @ ∉ M).

2. It is decidable whether ϕ ∈ M_i, i.e. whether a given model M_i exhibits the characteristic ϕ.

3. ϕ has played no part in the construction of the model probabilities, p_i, equation (3.1).
The first assumption is arguably the strongest. In principle, we can update M → M′ by adding additional models in light of data; specifying new priors π′ for the models M′_i ∈ M′ will require some care. One important condition that M (and M′) must fulfil is that models must not appear more than once in the set. This is important to keep in mind as we increasingly rely on automated generation or exhaustive enumeration of models. With fixed D and M, equation (3.2) is, however, a good approach for ensemble-based prediction and estimation. It also encompasses Bayesian model averaging and multi-model inference based on information criteria [4].
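The following Julia sketch illustrates equations (3.1) and (3.2) for a toy model universe; the costs and the indicator values for ϕ are invented for illustration and are not taken from the accompanying notebook.

```julia
# Sketch of equations (3.1) and (3.2): the ensemble probability that the system @
# exhibits a characteristic ϕ. Costs and indicator values are invented for illustration.
cost    = [1.0, 1.3, 4.0, 6.5, 2.2]            # model costs/energies ϵ_i
has_phi = [true, true, false, false, true]     # indicator 1(ϕ ∈ M_i) for each model

p = exp.(-cost) ./ sum(exp.(-cost))            # model probabilities, equation (3.1)
pr_phi = sum(p[has_phi])                       # ensemble average, equation (3.2)

println("Pr(ϕ) ≈ $(round(pr_phi, digits = 3))")
```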

The effective model space
We first analyse a simple case where all models in our universe have associated costs drawn from a suitably well-behaved probability distribution, q [8], with positive support and associated density, f_q(ϵ), over the model energies, ϵ_i, such that

\[
\epsilon_i \sim q(\epsilon).
\]

Because ϵ_i is a random variable, the relative weight, ω_i = exp(−ϵ_i), will also be a random variable, and we can obtain its probability density function, f(ω), via a change of variables as

\[
f(\omega) = \frac{f_q(-\log \omega)}{\omega}, \qquad 0 < \omega \le 1.
\]

For different choices of q(ϵ), we can now investigate the distribution over model weights. For example, if ϵ_i ∼ Gamma(α, θ) (where α and θ denote the shape and scale parameters, respectively), then

\[
f(\omega) = \frac{(-\log \omega)^{\alpha-1}\,\omega^{1/\theta - 1}}{\Gamma(\alpha)\,\theta^{\alpha}},
\]

with Γ( · ) the Gamma function. The Gamma distribution is flexible and generic, and is chosen here for its generality rather than any particular property; our discussion does not depend on its specifics. Some representative distributions over ϵ and the corresponding distributions over ω are shown in figure 1a,b. The distribution over model costs, ϵ, affects the distribution over model weights, ω. This is important to realize when deciding how to triage model sets for further analysis and prediction [38]: generally, inference based on all models weighted by ω_i is neither practical nor desirable, as many models with low weight will mask the information available in the good models. If, for example, we only include models with ω ∈ [0.9, 1.0], then the average model cost for these sets will be 0.005 (blue), 0.0015 (green), 6 × 10^-4 (purple) and 9 × 10^-5 (orange); see figure 1c. The (unknown) distribution over costs can thus affect multi-model inference quite profoundly. But for model universes that are enriched for good models (i.e. many models M_i with low values of ϵ_i), selecting a subset of models based on even a fairly conservative threshold for the model weights ω_i can result in a sufficiently large model set for further prediction.
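A short Julia sketch of this construction is given below; it uses the Distributions.jl package, and the shape and scale values are illustrative rather than those underlying figure 1.

```julia
# Sketch: model weights ω = exp(-ϵ) for Gamma-distributed costs, and the effect of a
# weight threshold. Shape and scale are illustrative, not the values used for figure 1.
using Distributions, Statistics

α, θ = 2.0, 0.5
q = Gamma(α, θ)

cost = rand(q, 100_000)              # sample model costs ϵ_i ~ q
ω = exp.(-cost)                      # corresponding model weights ω_i

# density of ω by change of variables: f(ω) = f_q(-log ω)/ω
f_ω(w) = pdf(q, -log(w)) / w
println("f(ω) at ω = 0.95: $(round(f_ω(0.95), digits = 3))")

kept = cost[ω .>= 0.9]               # models passing the weight threshold ω ∈ [0.9, 1.0]
println("fraction of models kept:  $(length(kept) / length(cost))")
println("mean cost of kept models: $(round(mean(kept), digits = 5))")
```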

A simple test case for multi-model inference
Here, we study a very simplistic scenario in which we have three types of models, borrowing and adapting Box's [39] terminology:

Useful Models, which capture important aspects of @ and which have an associated cost ϵ_1.

Useless Models, which are qualitatively and quantitatively poor descriptions of reality and have an associated cost ϵ_2 ≫ ϵ_1.

Nuisance Models, which are qualitatively different from reality, but which can quantitatively capture aspects of @ by chance; their costs are random variables, η.
Nuisance models are here assumed to be models where the quantitative agreement with the data is unrelated to the data-generating mechanism @. Purely machine-learning-based models are one way in which we can realize such nuisance models [40]; for small datasets D, poorly designed experiments, or simply a lack of prior knowledge, there are many ways in which model fit may be only a poor reflection of reality, and this can also give rise to nuisance models [41]. For concreteness, let these three model classes have sizes ν_1, ν_2 and ν_3 = N − ν_1 − ν_2, and assume that η ∼ U[0, ϵ_2], i.e. nuisance models are at worst as bad as useless models. Then the number of nuisance models that have lower associated costs than the useful models is binomially distributed, with expectation

\[
\nu_3 \Pr(\eta < \epsilon_1) = \nu_3\, \frac{\epsilon_1}{\epsilon_2}.
\]

The relative influence of nuisance models can be studied by contrasting three features, ϕ_1, ϕ_2 and ϕ_3, with the following properties:

ϕ_1: represented with the same frequency, ξ_1, among models of all classes;

ϕ_2: represented, with frequency ξ_2, only among useful models;

ϕ_3: represented, with frequency ξ_3, only among 'nuisance' models.
To illustrate this further, we can consider the case where ξ_2 = 1 and ξ_3 = 1 (meaning every useful model exhibits characteristic ϕ_2, and every nuisance model characteristic ϕ_3) and ask: when is Pr(ϕ_3 ∈ @) > Pr(ϕ_2 ∈ @)? We obtain

\[
\nu_3 > \frac{\epsilon_2\,\nu_1\, e^{-\epsilon_1}}{1 - e^{-\epsilon_2}} > \epsilon_2\,\nu_1\, e^{-\epsilon_1}. \qquad (3.10)
\]

Thus, if useful models are sufficiently rare in the model set (say ν_1/N < 0.1), the nuisance models' characteristics will carry high weight in the ensemble average; see also figure 2. None of the parameters ν_1, ν_2, ν_3, ϵ_1, ϵ_2 is, of course, known, and we cannot know a priori which class a model belongs to. Thus model averaging is not a panacea and requires careful consideration of which models are included in the prediction set.
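The toy calculation below sketches this three-class scenario in Julia; the class sizes and costs are invented, and ξ_2 = ξ_3 = 1 as in the text.

```julia
# Sketch of the useful/useless/nuisance test case. Class sizes and costs are
# illustrative; every useful model carries ϕ2 and every nuisance model carries ϕ3.
ϵ1, ϵ2 = 1.0, 10.0
ν1, ν2, ν3 = 50, 500, 5_000

cost_useful   = fill(ϵ1, ν1)
cost_useless  = fill(ϵ2, ν2)
cost_nuisance = ϵ2 .* rand(ν3)                  # η ~ U[0, ϵ2]

Z = sum(exp.(-cost_useful)) + sum(exp.(-cost_useless)) + sum(exp.(-cost_nuisance))
pr_phi2 = sum(exp.(-cost_useful)) / Z           # weight behind the useful-model feature
pr_phi3 = sum(exp.(-cost_nuisance)) / Z         # weight behind the nuisance-model feature

println("Pr(ϕ2) ≈ $(round(pr_phi2, digits = 3)),  Pr(ϕ3) ≈ $(round(pr_phi3, digits = 3))")
println("nuisance models cheaper than useful ones: ",
        count(c -> c < ϵ1, cost_nuisance),
        " (expected ν3 ϵ1/ϵ2 = $(ν3 * ϵ1 / ϵ2))")
```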

Ensemble estimation and network inference
Network inference can also benefit from ensemble methods [18,24,42], but here, too, potential pitfalls arise. We consider directed networks with V nodes and L edges; the adjacency matrix, A, is a convenient way to specify such networks if we indicate the presence and absence of edges by one and zero, respectively, in its entries,

\[
A_{ij} = \begin{cases} 1 & \text{if there is an edge from } i \text{ to } j, \\ 0 & \text{otherwise.} \end{cases}
\]

In network inference, we seek to determine whether the data D suggest statistical relationships among pairs of nodes v_i, v_j that may indicate a functional relationship, represented by an edge connecting the nodes. We consider K different algorithms, O_κ, κ = 1, ..., K, which predict the presence (A^{(κ)}_{ij} = 1) or absence (A^{(κ)}_{ij} = 0) of a link from i to j. If we denote the false positive and false negative probabilities of method κ by

\[
s_\kappa = \Pr\left(A^{(\kappa)}_{ij} = 1 \mid A_{ij} = 0\right) \quad \text{and} \quad t_\kappa = \Pr\left(A^{(\kappa)}_{ij} = 0 \mid A_{ij} = 1\right),
\]

we can assess how beneficial ensemble estimators are for the quality of the inferred networks.
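As a simple illustration of this error model, the Julia sketch below simulates a single inference method acting on a known adjacency matrix; the network size, density and error rates are arbitrary choices for illustration (self-loops are not excluded in this toy example).

```julia
# Sketch: simulate one inference method with false positive rate s and false negative
# rate t on a known adjacency matrix (values illustrative; self-loops not excluded).
function predict(A::AbstractMatrix{Bool}, s, t)
    Ahat = falses(size(A))
    for idx in eachindex(A)
        if A[idx]
            Ahat[idx] = rand() > t      # true edge recovered with probability 1 - t
        else
            Ahat[idx] = rand() < s      # non-edge falsely scored with probability s
        end
    end
    return Ahat
end

V = 200
A = rand(V, V) .< 0.01                  # sparse random "true" network
Ahat = predict(A, 0.05, 0.1)
println("true edges: $(sum(A)), predicted edges: $(sum(Ahat))")
```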

Ensembles of identical estimators
The simplest case, which is already instructive as a baseline, is one where all methods have identical performance, s_κ = s and t_κ = t, for all κ = 1, ..., K. If the performance of the inference methods is statistically independent, then the number of agreeing inference methods is a binomial random variable. If we set a threshold on the minimum number of concordances among inference methods, k_0, we obtain for the overall probability of scoring a true edge from the ensemble

\[
\Pr(\text{edge scored} \mid A_{ij} = 1) = \sum_{j=k_0}^{K} \binom{K}{j} (1-t)^j\, t^{K-j},
\]

while the probability of a false negative is

\[
1 - \sum_{j=k_0}^{K} \binom{K}{j} (1-t)^j\, t^{K-j} = \sum_{j=0}^{k_0-1} \binom{K}{j} (1-t)^j\, t^{K-j}.
\]

To illustrate the outcome of such a simple estimation procedure, we assume a network loosely based on expected Homo sapiens network sizes [43] (22 000 nodes and 750 000 interactions).
In figure 3, we consider ensembles of 10 inference methods and base the ensemble estimator on the minimum number, k_0, of methods that have to predict an edge. If k_0 is too small, then there will be too many false positives, as is the case here; a majority-vote rule, i.e. here k_0 = 6, would do an acceptable job in terms of precision, recall and F1 statistics [21], but this does depend on the false positive rate in particular (as biological networks are sparse), as well as on the size of the ensemble of inference algorithms, K.
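The following Julia sketch reproduces this type of calculation for an ensemble of K identical, independent methods and a human-sized network; the error rates t and s are illustrative and are not the values used for figure 3.

```julia
# Sketch: expected true and false positives for an ensemble of K identical, independent
# methods under a "score an edge if at least k0 methods agree" rule. Error rates are
# illustrative; the network size follows the human-scale example in the text.
binom_tail(K, k0, p) = sum(binomial(K, j) * p^j * (1 - p)^(K - j) for j in k0:K)

V, L = 22_000, 750_000
nonedges = V * (V - 1) - L
K, t, s = 10, 0.1, 0.05

for k0 in 1:K
    tp = L * binom_tail(K, k0, 1 - t)        # expected true positives
    fp = nonedges * binom_tail(K, k0, s)     # expected false positives
    println("k0 = $k0: TP ≈ $(round(Int, tp)), FP ≈ $(round(Int, fp))")
end
```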

Ensemble estimators can be worse than individual estimators
We are interested in ensemble estimators because we know that individual estimators are far from perfect. But ensembles are not guaranteed to always improve on individual estimators. We compare an ensemble of equally well-performing estimators with a single estimator. The ensemble false negative probability, T, is given by

\[
T = \sum_{j=0}^{k_0-1} \binom{K}{j} (1-t)^j\, t^{K-j},
\]

which can be written in closed form in terms of the hypergeometric function 2F1 [47] (see appendix B for an approximation for small arguments). From this, we can determine when the ensemble false negative rate, T, will be greater than that of an individual estimator, i.e. when T > t. Equally, we obtain for the ensemble false positive rate, S,

\[
S = \sum_{j=k_0}^{K} \binom{K}{j} s^j (1-s)^{K-j}.
\]

For low thresholds, k_0, the ensemble error rate can be greater than that of the individual estimator; this is because the stringency of the ensemble prediction is then reduced, as the cumulative probability of a sufficiently small number of estimators 'accidentally' agreeing is greater than the error rate of the individual estimator. We show this for two false negative rates, t = 0.1 and t = 0.05, in figure 4a,b.

[Figure 3 caption, partially recovered: [44,45]; once false positives are over-controlled (by demanding a larger number of methods to score an edge), the recall deteriorates. This is reflected in the number of (a) true and (b) false positives. We generated 1000 random inferred networks, but because of the large network size and the binomial nature of the process the distributions around the mean are tight. (c) The precision, recall and F1 statistics [21,46] (see also appendix A) as a function of the minimum number of methods that need to positively score an edge (again, confidence intervals are tight enough to be effectively hidden by the symbols).]
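A small Julia sketch of the ensemble false negative rate T as a function of the threshold k_0 is given below; it evaluates the binomial sum directly rather than the 2F1 closed form, and K and t are illustrative choices.

```julia
# Sketch: the ensemble false negative rate T as a function of the vote threshold k0,
# computed directly from the binomial sum. T > t marks thresholds at which the
# ensemble is worse than a single method. K and t are illustrative.
binom_cdf(K, kmax, p) = sum(binomial(K, j) * p^j * (1 - p)^(K - j) for j in 0:kmax)

K, t = 10, 0.1
for k0 in 1:K
    T = binom_cdf(K, k0 - 1, 1 - t)   # Pr(fewer than k0 methods recover a true edge)
    println("k0 = $k0: T = $(round(T, digits = 4))",
            T > t ? "  (worse than a single method)" : "")
end
```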

Heterogeneous ensembles of network inference methods
We next focus on the case of a small set of predictors, K = 10, and two classes of methods: a set of good methods with error rates t_1 and s_1; and a set of poor methods with error rates t_2 > t_1 and s_2 > s_1. In figure 5, we show, again for a case modelled on a likely human gene regulation network, the likely true- and false-positive predictions arising from ensembles with different numbers of good versus poor edge-prediction methods. The basic lesson is that good methods have to outnumber bad methods; otherwise the precision, especially, will suffer. Here, we have chosen a simple majority-vote criterion. To bring the precision and F1 statistic up to a satisfying level (say in excess of 0.7) requires essentially purging the ensemble of the weakly performing methods (i.e. k_1 ≳ 8). This points to the extent to which poor methods can compromise the performance of ensemble estimators (and the accompanying Jupyter notebook can be used to explore this further). In sparse networks especially, poor estimators will result in an inflation of false-positive results and lead to overall poor performance: in a directed network there are V × (V − 1) possible interactions, which is vastly greater than the number of existing edges. For example, in the case of the human network we would have of the order of 4.8 × 10^8 potential interactions, of which only some 750 000 are expected to exist. So even for s_2 = 0.5 there would be about 470 000 cases where ten such methods would agree and score an edge, and 1.8 × 10^8 false predictions, if a simple majority-vote rule were applied.
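The Julia sketch below illustrates such a mixed ensemble under a k_0-of-K vote by convolving two binomial distributions; all error rates, and the split into good and poor methods, are illustrative assumptions rather than values taken from figure 5.

```julia
# Sketch: an ensemble of k1 "good" and K - k1 "poor" methods under a k0-of-K vote.
# The probability that at least k0 methods score an edge is obtained by convolving
# two binomial distributions. All error rates are illustrative.
binom_pmf(n, j, p) = binomial(n, j) * p^j * (1 - p)^(n - j)

function vote_prob(k1, p1, k2, p2, k0)
    # Pr(X1 + X2 >= k0) with X1 ~ Binomial(k1, p1) and X2 ~ Binomial(k2, p2)
    total = 0.0
    for a in 0:k1, b in 0:k2
        if a + b >= k0
            total += binom_pmf(k1, a, p1) * binom_pmf(k2, b, p2)
        end
    end
    return total
end

V, L = 22_000, 750_000
nonedges = V * (V - 1) - L
K, k0 = 10, 6                       # ensemble size and majority-vote threshold
t1, s1 = 0.10, 0.01                 # good methods: false negative / false positive rates
t2, s2 = 0.40, 0.30                 # poor methods

for k1 in 0:K
    tp = L * vote_prob(k1, 1 - t1, K - k1, 1 - t2, k0)
    fp = nonedges * vote_prob(k1, s1, K - k1, s2, k0)
    println("good methods k1 = $k1: precision ≈ $(round(tp / (tp + fp), digits = 3))")
end
```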

Discussion
We have shown that ensemble estimators are not as robust as has sometimes been claimed [26], or incorrectly surmised from the success of community average results in the DREAM competition [18]; there, of course, it had already been shown that certain, carefully selected, subsets of estimators give better results than others [24].
For the analysis of multi-model inference from mechanistic models, we can distill two points: (i) ensembles of mechanistic models that are reasonably defined [48] (i.e. their construction incorporates any available mechanistic insights; duplicate models are avoided; each model is predictive and can be used to generate data that can be compared with real observations) can be combined, with the aid of model selection criteria or Bayesian posterior model probabilities, with relative ease and safety; (ii) the inclusion of 'nuisance models' can hamper ensemble approaches if they come to predominate the model universe M. Such situations could become more likely as model spaces are explored exhaustively [16] or automatically [12]. Because of the formalism connecting different model selection criteria, equation (3.1), these are general results and do not depend on the particular model averaging procedure chosen (as is also clear from the analysis in [20]). So, in essence, the construction of the models in M [3,49] will determine the robustness of model averaging or ensemble approaches to prediction and analysis. Little is to be gained by increasing the size of M beyond the (already quite large) set of reasonable models.
In the context of network inference, the situation is similar. We find that the poor performance of some methods can drag down the performance of an ensemble estimator or network predictor. So, as for the construction of the model universe M above, the make-up of the ensemble of network inference methods, O_κ, κ = 1, ..., K, matters considerably (as was also found in the empirical study of [18]). Majority vote will typically be a sound criterion for a set of reasonable estimators, though not necessarily an optimal one from the perspective of precision and the F1 measure. This is because biological networks are sparse, and false positives will dominate the inferred networks unless they are carefully controlled. For a set of statistically similar, powerful inference methods, however, a conservative criterion for scoring edges will improve on the overall performance of the individual estimators.
The problem of network inference has long been known to be challenging. One reason for this (in addition to the large-scale testing problem) is that we do not have a fair way to score and compare the performance of different network inference approaches. The most promising existing approaches are typically computationally expensive, and rigorous in silico assessment of their performance, and of the factors influencing it, is often seen as prohibitively costly. There is also a danger of biasing simulated data in favour of a given method; the DREAM competition has aimed, with some success we feel, to avoid this, and other approaches have followed suit. Clearly, more effort in this domain is needed, and computational cost should not preclude rigorous assessment [50]. This situation is mirrored in other areas of genomics and systems biology; pseudo-temporal ordering approaches, for example, have until recently [51] rarely been rigorously tested. But what is also needed are approaches that allow us to assign confidence to inferred networks or, more specifically, to predicted interactions, without recourse to a gold standard [52]. Here, measures based on biological expectations/knowledge offer promising routes for filtering out poor methods [53] (see figure 6).
One of the initial attractions of using a panel of network inference algorithms is that different methods may capture different aspects of the data and, in concert, may provide a more complete representation of a 'true' network of functional relationships among the genes in an organism under a given scenario. While appealing, this notion needs to be viewed with caution [56]. Combining the most powerful methods by leveraging their individual strengths is possible in principle [20,22,54,55], but requires us to characterize each method O_κ reliably and independently.

Conclusion
In summary, unless we know the constituency of the model universe, M, or the ensemble of predictors, O_κ, we have limited ways of telling whether we are dealing with a madness-of-crowds or a wisdom-of-crowds scenario. However, the present analysis shows that ensemble procedures will be robust as long as the ensembles are carefully constructed. In the context of biological network inference, reduction in false positives is the primary cause of their success. Without a robust and transferable way of assessing the strengths and weaknesses of different methods, we cannot (yet) use tools from decision theory that pool these strengths for diverse and heterogeneous ensembles [54,55]. Currently, the best advice, in light of the analysis carried out here, is to be ruthless in weeding out poorly performing methods for network inference, or models with low weight for multi-model averaging. So there is no need, for example, to include correlation networks, even though they are cheap to calculate: their performance is simply too poor to warrant inclusion in an ensemble. Quality is more important than quantity.

[Figure 6 caption: prediction spaces for three network inference methods. In an ideal scenario, we would also be able to exploit the individual strengths of the predictors to reduce the ensemble false negative rate; the mathematical formalisms exist [20,22,54,55], but we need to be able to quantify the individual predictors' behaviours better.]
Data accessibility. A Jupyter Notebook containing the Julia code to reproduce all the computational results here, and to explore the effects of the madness of crowds in network inference and model averaging is available at https://github.com/MichaelPHStumpf/Madness-Of-Crowds.
Appendix A. Assessing the performance of network inference methods

Real biological networks are sparse [57]; this means that a predictor which scores each candidate edge as absent can have high performance if the number of false positives heavily influences how we quantify the performance of a network inference method. We therefore focus on precision and recall, and on a derived statistic, the F1 statistic [21,57]. We denote the numbers of true and false positive inferences (i.e. scored edges in the context of network inference) by TP and FP, and the true and false negatives by TN and FN.
Then the precision, P, is given by

\[
P = \frac{TP}{TP + FP},
\]

and the recall, R, by

\[
R = \frac{TP}{TP + FN}.
\]

The F1 statistic is given by

\[
F1 = \frac{2 P R}{P + R}.
\]

These statistics are confined to the [0, 1] range, with larger values indicating better performance; the F1 statistic, in particular, becomes maximal at the point where the curves for P and R intersect.
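For completeness, a minimal Julia sketch of these three statistics (with invented confusion-matrix counts) is given below.

```julia
# Sketch: precision, recall and F1 from confusion-matrix counts (counts are invented).
TP, FP, FN = 600_000, 150_000, 150_000

P  = TP / (TP + FP)          # precision
R  = TP / (TP + FN)          # recall
F1 = 2 * P * R / (P + R)     # F1: harmonic mean of precision and recall

println("precision = $(round(P, digits = 3)), recall = $(round(R, digits = 3)), F1 = $(round(F1, digits = 3))")
```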