Modelling science trustworthiness under publish or perish pressure

Scientific publication is immensely important to the scientific endeavour. There is, however, concern that rewarding scientists chiefly on publication creates a perverse incentive, allowing careless and fraudulent conduct to thrive, compounded by the predisposition of top-tier journals towards novel, positive findings rather than investigations confirming the null hypothesis. This potentially exacerbates a reproducibility crisis in several fields, and risks undermining science and public trust in scientific findings. To date, there has been comparatively little modelling of the factors that influence science trustworthiness, despite the importance of quantifying the problem. We present a simple phenomenological model with cohorts of diligent, careless and unethical scientists, with funding allocated by published outputs. This analysis suggests that the trustworthiness of published science in a given field is influenced by the false positive rate and by pressure for positive results. We find that decreasing available funding has negative consequences for the resulting trustworthiness, and examine strategies to combat the propagation of irreproducible science.


Introduction
In academia, the phrase 'publish or perish' is more than a pithy witticism: it reflects the reality that researchers are under immense pressure to continuously produce outputs, with career advancement dependent upon them [1,2]. Academic publications are deemed a proxy for scientific productivity and ability, and with an increasing number of scientists competing for funding, the previous decades have seen an ever-greater emphasis on publication output.

Model outline

Basic model and assumptions
To construct a simple model of publication rewards, we define the total amount of available funding for research as R(t). Per unit of funding in a given field, there is a global discovery rate of D_R, which includes a proportion p_T of true positive/significant results, a proportion p_F of false positives, and a proportion n of null results. Null results in principle can include both true negatives and false negatives, but given the bias towards positive results we will not discriminate between these two in this investigation. The relative proportion of positives and nulls will be inherently field-specific: certain disciplines will be more prone to false positives, while others tend to yield less ambiguous results. As the quantities are proportions, we have that

p_T + p_F + n = 1. (2.1)

In certain fields, the false positive rate may be high, and so diligent researchers take measures to falsify positive results and test their results multiple times. Even when research groups are very diligent, they may occasionally happen upon a false or misleading result which is hard to eliminate, arising from experimental or theoretical difficulty rather than carelessness. For the diligent cohort, this will be as low as can reasonably be achieved, and so we state that they submit a small fraction, ε, of their false positives for publication. Researchers exist on a spectrum, but for simplicity we may broadly sub-divide this spectrum into three distinct classes.
(i) Diligent cohort. This group take pains to replicate experiments and do not dishonestly manipulate results. Their false positive submission fraction is ε, thus as low as reasonably possible. They account for a fraction f_D of the initial total, and parameters relating to them have subscript D.

(ii) Careless cohort. This group do not falsify results, but are much less careful at eliminating spurious positive results. They may also have questionable practices that lead them to false positives. As a result, they have a false positive submission fraction of cε, where c > 1. They account for a fraction f_C of the initial total, and parameters relating to them have subscript C.

(iii) Unethical cohort. This group appear broadly similar to the diligent group, but with one crucial difference in that they may occasionally manipulate data or knowingly submit dubious results at a rate of δ beyond the global discovery rate. For convenience, instead of defining a higher value of D_R in this group to account for the higher 'discovery' rate, we retain the same parameter value of D_R for the unethical cohort but allow p_T + p_F + n + δ > 1, so that their realized 'discovery' rate is higher than the other groups. They account for a fraction f_U of the initial total, and parameters relating to them have subscript U.
The funding held by the diligent cohort at a given time is x(t), with y(t) held by the careless cohort and z(t) by the unethical cohort, so that

x(t) + y(t) + z(t) = R(t). (2.2)

With these assumptions, we can model the theoretical impact of a paradigm where researchers are rewarded with funding and success in direct relation to their publication output. As outlined in the Introduction, there is huge pressure on scientists to submit positive or 'novel' findings, while findings confirming the null hypothesis are frequently side-lined. Under such a selection pressure, all researchers will aim to submit their significant positive results for publication. The respective submission rates of positive results per unit funding for the diligent, careless and unethical cohorts are accordingly

S_D = D_R(p_T + εp_F), S_C = D_R(p_T + cεp_F) and S_U = D_R(p_T + εp_F + δ). (2.3)

The rate at which null results are submitted is less clear: in general, there is a significant bias in publication towards significant results, and null results may never see the light of publication, the so-called 'file drawer' problem. We assume that each cohort submit only a fraction of their null results, in the proportions β_D, β_C, β_U, so that the null submission rates per unit funding are

N_D = D_R β_D n, N_C = D_R β_C n and N_U = D_R β_U n. (2.4)

Equations (2.1)-(2.4) comprise the researcher-specific parameters, and we must further quantify the journal-specific elements also. Competition for space in field-specific top-tier journals is fierce, and we denote the combined carrying capacity of these field-specific top-tier journals as J(t). These journals exhibit a clear bias towards positive results, with a positive-publication weighting B, the fraction of published articles describing significant results. Thus, presuming that more submissions are obtained than can be published, we can quantify the probability that a positive result (ν_P(t)) or a negative result (ν_N(t)) is published.
These probabilities are given by

ν_P(t) = BJ(t)/(S_D x(t) + S_C y(t) + S_U z(t)) and ν_N(t) = (1 − B)J(t)/(N_D x(t) + N_C y(t) + N_U z(t)). (2.5)

From this, we can then yield an expression for the publication rate per unit of funding for the diligent, careless and unethical cohorts, which are, respectively,

Φ_D = ν_P S_D + ν_N N_D, Φ_C = ν_P S_C + ν_N N_C and Φ_U = ν_P S_U + ν_N N_U. (2.6)

The average rate of publications per unit of funding per unit time in top-tier journals for a given field is thus

Φ_avg = (Φ_D x + Φ_C y + Φ_U z)/R. (2.7)

If researchers are rewarded with funding based solely on their published output, we can quantify the impact of this with time by employing a recursive series solution at discrete time steps, corresponding to funding cycles. If funding is allocated to each cohort based upon its output at the beginning of the previous funding cycle, and we assume total funding remains constant (dR/dt = 0), then the funding available for each cohort at each successive time step is

x(t + 1) = RΦ_D x(t)/(Φ_D x + Φ_C y + Φ_U z), y(t + 1) = RΦ_C y(t)/(Φ_D x + Φ_C y + Φ_U z) and z(t + 1) = RΦ_U z(t)/(Φ_D x + Φ_C y + Φ_U z). (2.8)
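The constant-funding recursion (2.8) is straightforward to simulate. The authors' demonstration code is in MATLAB (see Data accessibility); the following is an illustrative Python sketch of a single funding cycle. Only p_T = 0.08, p_F = 0.32, ε = 0.05 and δ = 0.057 are values stated in the text; c, β, B and J here are placeholder values, not the paper's table 1, and the intermediate names (s_D, nu_P and so on) are our own shorthand.

```python
def step(x, y, z, p_T=0.08, p_F=0.32, eps=0.05, c=2.0, delta=0.057,
         beta=(0.35, 0.35, 0.35), B=0.9, J=1.0, D_R=1.0):
    """One funding cycle: reallocate the constant total funding R among
    the diligent (x), careless (y) and unethical (z) cohorts in
    proportion to their published output."""
    n = 1.0 - p_T - p_F                      # null proportion, eq. (2.1)
    R = x + y + z                            # total funding
    # positive-result submission rates per unit funding
    s_D = D_R * (p_T + eps * p_F)
    s_C = D_R * (p_T + c * eps * p_F)
    s_U = D_R * (p_T + eps * p_F + delta)
    # null-result submission rates per unit funding
    n_D, n_C, n_U = (D_R * b * n for b in beta)
    # publication probabilities for positive and null submissions
    nu_P = B * J / (s_D * x + s_C * y + s_U * z)
    nu_N = (1.0 - B) * J / (n_D * x + n_C * y + n_U * z)
    # publication rates per unit funding for each cohort
    P_D = nu_P * s_D + nu_N * n_D
    P_C = nu_P * s_C + nu_N * n_C
    P_U = nu_P * s_U + nu_N * n_U
    total = P_D * x + P_C * y + P_U * z
    # funding reallocated in proportion to published output (eq. 2.8)
    return R * P_D * x / total, R * P_C * y / total, R * P_U * z / total
```

Iterating this map conserves total funding while shifting it away from the diligent cohort, whose publication rate per unit funding is the lowest of the three whenever the null-submission fractions are equal.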

Variable funding resources
We also consider the fact that the total amount of funding may not remain constant, so we may model the impact of changing funding scenarios. For simplicity, we assume it changes at some constant rate G, which can be negative (for diminishing funding, the likes of which might occur with a decrease in NIH or EU funding budgets), zero (for constant funding, as in equation (2.8)) or positive (increasing funding). New funding is allocated at random in proportions reflecting the typical make-up of new researchers, and accordingly the refined equations are

x(t + 1) = R(t)Φ_D x(t)/(Φ_D x + Φ_C y + Φ_U z) + Gf_D, y(t + 1) = R(t)Φ_C y(t)/(Φ_D x + Φ_C y + Φ_U z) + Gf_C and z(t + 1) = R(t)Φ_U z(t)/(Φ_D x + Φ_C y + Φ_U z) + Gf_U, (2.9)

where Φ_D, Φ_C and Φ_U are the publication rates per unit funding of the diligent, careless and unethical cohorts, and R(t + 1) = R(t) + G.
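The growth refinement amounts to adding a G-weighted term per cohort after the output-proportional reallocation. A minimal Python sketch, with the per-unit-funding publication rates passed in directly and the new-researcher proportions f chosen as illustrative values (not taken from the paper):

```python
def step_with_growth(x, y, z, P_D, P_C, P_U, G=0.0, f=(0.7, 0.2, 0.1)):
    """One funding cycle where total funding changes by a constant G per
    cycle, with new funding split among the diligent, careless and
    unethical cohorts in the fixed proportions f. P_D, P_C, P_U are the
    cohorts' publication rates per unit funding."""
    R = x + y + z
    total = P_D * x + P_C * y + P_U * z
    f_D, f_C, f_U = f
    return (R * P_D * x / total + G * f_D,
            R * P_C * y / total + G * f_C,
            R * P_U * z / total + G * f_U)
```

Because the f proportions sum to one, total funding after the step is exactly R + G, recovering the constant-funding case when G = 0.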

Research fraud detection
For unethical researchers, we can look at a slightly more complicated scenario where dubious publications have a probability of detection, η, leading to denial of funding. We further assume this penalization only applies to dubious results which were published rather than merely submitted. Taking this into account, we modify the last part of equation (2.9) so that

z(t + 1) = R(t)Φ′_U z(t)/(Φ_D x + Φ_C y + Φ′_U z) + Gf_U, where Φ′_U = ν_P D_R(p_T + εp_F + (1 − η)δ) + ν_N D_R β_U n (2.10)

is the penalized publication rate per unit funding of the unethical cohort, and Φ_D and Φ_C denote the corresponding rates for the diligent and careless cohorts.
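The effect of detection in (2.10) is that only the undetected fraction (1 − η) of dubious results earns funding. A hedged Python sketch of the penalized rate, using the text's stated values for p_T, p_F, ε and δ and placeholder values elsewhere:

```python
def unethical_publication_rate(nu_P, nu_N, p_T=0.08, p_F=0.32, eps=0.05,
                               delta=0.057, beta_U=0.35, n=0.6, eta=0.0,
                               D_R=1.0):
    """Rewarded publication rate per unit funding for the unethical
    cohort when dubious publications are detected (and denied funding)
    with probability eta. With eta = 0 this reduces to the unpenalized
    rate; with eta = 1 all dubious output goes unrewarded."""
    s_U_eff = D_R * (p_T + eps * p_F + (1.0 - eta) * delta)
    return nu_P * s_U_eff + nu_N * D_R * beta_U * n
```

The rate decreases linearly in η, which is why (as discussed in the Results) detection must be aggressive before the unethical cohort's funding share actually shrinks.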

Rewarding diligence
The diligent cohort have intrinsically lower submission rates than other groups, and consequently are more likely to suffer under a publish or perish regime, despite the importance of their reproducible work.
To address this, it has been suggested that rewarding diligence might counteract this trend [31,32]. We might envision an ideal situation where scientific works are audited for reproducibility by independent bodies, with groups who keep their reproducibility high and error rates below a certain unavoidable threshold (given by εD_R ν_P) garnering a reward of R_W. This in practice could only be achieved by the diligent cohort, and in the most simple case, their funding resources are given by

x(t + 1) = RΦ_D x(t)/(Φ_D x + Φ_C y + Φ_U z) + R_W, (2.11)

where Φ_D, Φ_C and Φ_U are the publication rates per unit funding of the diligent, careless and unethical cohorts.
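The rewarding-diligence update is the ordinary reallocation plus a lump reward R_W. A minimal Python sketch, with the cohort publication rates supplied directly (the values in the test below are illustrative); rearranging the update also gives the smallest reward for which diligent funding does not shrink:

```python
def diligent_funding_with_reward(x, y, z, P_D, P_C, P_U, R_W):
    """Diligent cohort funding after one cycle when an extra reward R_W
    is granted for audited reproducibility. P_D, P_C, P_U are the
    cohorts' publication rates per unit funding."""
    R = x + y + z
    total = P_D * x + P_C * y + P_U * z
    return R * P_D * x / total + R_W

def minimum_reward(x, y, z, P_D, P_C, P_U):
    """Smallest R_W for which the diligent cohort's funding does not
    decrease over the cycle: R_W >= x * (1 - R * P_D / total)."""
    R = x + y + z
    total = P_D * x + P_C * y + P_U * z
    return max(0.0, x * (1.0 - R * P_D / total))
```

Setting R_W to exactly this minimum holds the diligent cohort's funding constant; anything larger grows it.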

Counteracting publication bias
It is also possible to envision a situation where journals do not give any preference to positive results over null results. In this case, we would expect researchers to submit all their results, so that β_D = β_C = β_U = 1. In this case, ν_P and ν_N are replaced by a single function of time, ν, given by

ν(t) = J/[D_R((p_T + εp_F + n)x + (p_T + cεp_F + n)y + (p_T + εp_F + n + δ)z)]. (2.12)
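With no positive-result bias, the single acceptance probability is simply the carrying capacity divided by total submissions from all cohorts. A Python sketch under the same hedged parameter values as before (only p_T, p_F, ε and δ are from the text):

```python
def nu_unbiased(x, y, z, p_T=0.08, p_F=0.32, eps=0.05, c=2.0,
                delta=0.057, J=1.0, D_R=1.0):
    """Single publication probability when journals treat positive and
    null results identically and all cohorts submit every null result
    (beta = 1 for all cohorts), per eq. (2.12)."""
    n = 1.0 - p_T - p_F
    # total submission rates per unit funding, positives plus all nulls
    s_D = D_R * (p_T + eps * p_F + n)
    s_C = D_R * (p_T + c * eps * p_F + n)
    s_U = D_R * (p_T + eps * p_F + n + delta)
    return J / (s_D * x + s_C * y + s_U * z)
```

As expected, ν falls as total funding (and hence total submissions) grows against a fixed carrying capacity J.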

Trustworthiness of published science
Finally, we define a metric for the trustworthiness of published science, the proportion of published results which are reproducible, T(t). This is given by

T(t) = ν_P D_R p_T(x + y + z)/[ν_P D_R((p_T + εp_F)x + (p_T + cεp_F)y + (p_T + εp_F + δ)z)], (2.13)

where the time arguments of x, y, z and ν_P have been excluded for clarity.
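Since the publication probability multiplies both numerator and denominator, it cancels and T(t) depends only on the cohort parameters and funding shares. A Python sketch of the metric under the text's stated values of p_T, p_F, ε and δ (c is a placeholder):

```python
def trustworthiness(x, y, z, p_T=0.08, p_F=0.32, eps=0.05, c=2.0,
                    delta=0.057):
    """Proportion of published positive results that are reproducible,
    given funding x, y, z held by the diligent, careless and unethical
    cohorts. The publication probability cancels from the ratio."""
    num = p_T * (x + y + z)                      # true positives
    den = ((p_T + eps * p_F) * x                 # diligent output
           + (p_T + c * eps * p_F) * y           # careless output
           + (p_T + eps * p_F + delta) * z)      # unethical output
    return num / den
```

When only the diligent cohort holds funding, T reduces to p_T/(p_T + εp_F), the best achievable value for the field; any funding held by the other cohorts pulls T below this ceiling.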

Parameter estimation and assumptions
To simulate the trends that would occur under these assumptions requires that we select appropriate parameters (these are detailed in table 1), which are used in all simulations unless otherwise stated in the text. It can be seen through inspection that the discovery rate per unit resource D_R cancels in the analysis for x(t), y(t), z(t) and T(t), and accordingly it can be ascribed any real positive value without skewing the analysis. When there is no fraud-detection funding penalization (η = 0), the journal carrying capacity J also cancels in the analysis and does not impact results. Initially, we also assume that G = 0, so that funding levels remain constant. Estimation of the fraudulent submission fraction per unit resource δ requires some elaboration, as this is notoriously difficult to ascertain and field-specific. A 1996 analysis by Fuchs & Westervelt [33] extrapolated from known cases to estimate that approximately 0.01% of published papers were fraudulent, though this is considered exceptionally conservative [27]. Empirical estimates of plagiarism vary markedly, from 0.02% to 25% of all publications [26]. The frequency of paper retractions from the PubMed database for misconduct is about 0.02%, suggesting that fraud might be present in 0.02-0.2% of papers therein [34]. An investigation in the Journal of Cell Biology found inappropriate image manipulation occurring in 1% of papers [35]. More alarmingly perhaps, a 1992 data audit by the United States Food and Drug Administration found deficiencies in 10-20% of studies published between 1977 and 1990, with 2% of investigators deemed guilty of severe scientific misconduct [23,36,37].
The value of δ depends on the true/false positive rates of the field, and we initially take the optimistic assumption that the 1% rate of published fraud occurs in fields with high levels of false positives, and will be lower in fields with less ambiguity in their results, so that the same value of δ is used for all simulations. This is calculated assuming p_F = 0.32 and p_T = 0.08, so that δ = 0.057 as per table 1. The reasonable error rate ε is taken from the p-value threshold for significance, as discussed in Colquhoun [9]. Strictly speaking, Colquhoun puts forward an eloquent argument in the cited work that p < 0.05 is a frequently abused metric, leading to false positives. For simplicity, however, we will presume that ε = 0.05 reflects best reasonable practice in this simulation.

Results
Impact of the field-specific false positive rate

Conversely, reducing funding increases the publication pressure and results in increased selection of suspect works and a fall in scientific reproducibility. The implications of this require some elaboration and are considered in the Discussion section. Figure 3 depicts the impact of aggressive fraud detection (η) and punishment. Increased fraud detection seems to improve science trustworthiness, but η has to be very high in practice to have a substantial impact on the proportion of funding allocated to unethical cohorts. Neglecting growth, the funding allocation to this group would only be expected to decrease with time if its penalized publication rate per unit funding falls below the funding-weighted average publication rate across all cohorts.

Impact of rewarding diligence
By inspection, it is straightforward to show that for the amount of funding held by the diligent cohort to stay the same or increase, the condition on the reward is

R_W ≥ x(t)[1 − RΦ_D/(Φ_D x + Φ_C y + Φ_U z)],

where Φ_D, Φ_C and Φ_U are the publication rates per unit funding of the diligent, careless and unethical cohorts, though in practice for most situations R_W will have to be much greater than this minimum value. For the example depicted in figure 4, a large reward for diligence (R_W = 10) substantially increases the funds awarded to the diligent cohort. However, reproducibility still falls slowly if the unethical cohort is not removed. It is possible to both reward diligence and punish fraud, which can improve trustworthiness, as illustrated in figure 4.

Impact of initial unethical funding proportion

Discussion
The model presented is a simplification of a complex ecosystem, but gives some insight into the factors that shape scientific trustworthiness. The model suggests that a fixation in top-tier journals on significant or positive findings tends to drive the trustworthiness of published science down, and is more likely to select for false positives and fraudulent results. In our simulations, the best outcome was obtained by simply paying no heed to whether a result was significant or not. This is akin to the model used by many emerging open-access peer-reviewed journals such as PLoS ONE, which have a policy of accepting any work provided it is scientifically rigorous. Our simulation suggests this model of publishing should improve science trustworthiness, and it is encouraging that many other publishers are taking this approach too, including Royal Society Open Science and Nature Scientific Reports. As of 2017, Scientific Reports has surpassed PLoS ONE as the world's biggest mega-journal [38]. However, there is an important point to consider in the form of the parameter J (the publication carrying capacity). This can be highly field-specific, comprising the top-tier journals in that specific field. In general, these publications are focused on prestige rather than rapid dissemination of science, and it is unlikely these journals would move to replicate the approach of rapid open-access publishers. Accordingly, the suggestion that top-tier journals might aspire to treat all studies, regardless of their results, as equally worthy of publication is likely to be an unworkable ideal.
Indeed, there is still a perception that such journals are for 'trivial' or unimportant results, and that positive or important results should still go to a few journals with extreme competition for space. Empirical evaluations show that small studies published in top-impact journals have markedly exaggerated results on average compared with similar studies on the same questions published in journals of lesser impact factor [39]. This suggests that the pressure to publish in these flagship journals may still be very real, despite the option of publishing in less competitive journals. The analysis here suggests that science trustworthiness is affected too by changes in funding resources: an increase in funding improves the overall trustworthiness of science, as depicted in figure 2. Conversely, when funding is diminished, the increased competition between scientists appears to create conditions in which false positives and dubious results are more likely to be selected for and rewarded. This is a natural consequence of the model, but requires careful interpretation. Crucially, it is important to note that there is no mechanism in the model for unethical or careless researchers to transition into diligent scientists. Rather, decreasing funding increases competition, and amplifies the career advantages of questionable findings. Conversely, if global funding rates are increased, then competition for resources decreases and the advantage of suspect findings is somewhat mitigated. While beyond the scope of this work, such a prediction could be empirically tested by analysing situations when research budgets change markedly, such as the doubling of the NIH budget from 1998 to 2003.
The model presented pivots on the assumption that publication is the dominant metric upon which scientists are rewarded, and elucidates the potential consequences of such a situation. It is important to note that this is a substantial simplification, and there are other metrics by which scientists are assessed, including other measures of impact, awards and citations. However, the number of publications attributed to a scientist has a marked effect on their career success, with more publications associated with principal investigator status and acquisition of funding [40]. The average number of authors per paper is increasing over time, and this is not just due to more interdisciplinary work, but also due to a greater demand for having more papers on one's CV [41]. The model also implicitly assumes that output is an approximately linear function of funding in a given field. The exact applicability of this assumption may vary across fields. For example, wet-lab sciences require a certain threshold of continuous funding just to operate, whereas computational or theoretical sciences may be able to operate with comparatively little funding. Presuming direct comparison of researchers and their teams across a given field, however, the assumption of direct correspondence between resources and outputs is reasonable, although outliers are to be expected.
One curious result persistently seen in the model was that diligent researchers are unfairly affected by careless or unethical conduct, with avoidable false positives or unethical publications garnering disproportionate reward at their expense. Simply increasing fraud detection does little to stop this, as careless researchers benefit from the gap in the market, out-producing their diligent colleagues, as shown in figure 3. This appears to be an unfortunate and seemingly unavoidable consequence of a 'publish or perish' system. However, in good scientific environments carelessness would sooner or later be detected and potentially penalized. We can estimate how large a penalty for carelessness or reward for diligence is needed to reverse the worsening trends we observe, by manipulating the equations in the manner outlined for unethical conduct. However, this approach risks being ruthlessly punitive, punishing honest mistakes with the same severity reserved for the most egregious abuses of scientific trust.
While a penalty for carelessness has intuitive appeal, distinguishing between honest and careless errors is fraught with difficulty. As has been argued elsewhere [31,32], rewarding diligence is perhaps a better way to ensure researchers do not suffer for good conduct. A simple model of this is shown in figure 4, and indeed this suggests rewarding diligence improves the proportion of funding allocated to diligent groups. However, it requires some penalty for bad conduct to keep unethical cohorts from benefiting at the expense of others. In practice, this level of detection appears to have to be relatively high, which of course would require considerable resources to achieve. It should be noted too that the false positive rate of a field has a significant impact on science trustworthiness, as illustrated in figure 1. A high false positive (type I) error rate provides ample cover for the small minority of unethical researchers to cheat without overt fear of detection [23,27], perhaps explaining the elevated prevalence of dubious practice in biomedical science [23] in particular.
The model presented is a much simplified picture of reality, but it allows us to examine how different factors might influence the trustworthiness of published science, and potentially suggest strategies to improve it. As the motivations of and pressures on scientists are incredibly complex, it is important to recognize the limitations in the model too. The three cohorts presented here would in reality constitute a spectrum. The sub-divisions in this work are relatively arbitrary and informed by the available data on researcher populations, though it would be easily possible to extend this to consider more subpopulations if desired. Scientific conduct is notoriously difficult to quantify, and the assumptions we have used in this work reflect the best estimates to date [23].
We can also envision a situation where authors are rewarded solely on the basis of positive findings, so that negative findings carry no funding benefit. We can apply the model to these circumstances too, with the realization that under such a scheme there would be no incentive for authors to submit negative results. In this case, B = 1 and all β terms reduce to zero. Essentially, one then obtains a similar result to the one shown in figure 5a, with an even further reduction in trustworthiness. Finally, measures that could be adopted to begin changing the culture of fixation on novel positive results include the establishment of awards by academic societies designed to recognize methodological rigour rather than positive results, as well as the explicit recognition of material published in online repositories as relevant material in university tenure and promotion guidelines.
It is also worth considering how the positive publication weighting might impact on the 'file-drawer' problem [42]. This was the observation, first articulated by Rosenthal in 1979, that researchers tended not to invest their energy in trying to publish null findings, instead burying them in a file drawer. The great tragedy of this is that essential null results are often disregarded by the scientists who discover them, meaning others labour down fruitless avenues. In the model, we have implicitly assumed a version of this by assuming researchers submit only a portion of their negative findings (β) for consideration. It would be useful to know precisely how much is never submitted, and to gauge the extent of the file-drawer problem. Certainly, estimates have been made in some fields, notably by Franco et al. [43], who determined that in one study of publications in the social sciences, only 35% of the null results were ever written up (in good agreement with our estimates for β in table 1) and ultimately just 20.8% of these findings were published. Similarly, for NIH-funded clinical trials, 32% remained unpublished a median of 51 months after their completion [44]. Whether these patterns apply in other fields remains to be seen. One approach might be to consider the issue from an energy-expenditure or game-theory perspective, which could be coupled with the model to estimate how much vital science never reaches the public domain, though this is beyond the scope of this investigation.
A more sophisticated future analysis might include variables that respond to the available funding. For example, the fraudulent publication rate δ is treated as a constant in this work for the most part, but it is easy to imagine a situation where this increases with shrinking funding, or where the number of investigators willing to engage in such practices is a function of available funding. This is not considered here, but the model presented could be easily adapted to probe this further. Future work with more sophisticated models could explore how best to implement these and other possible interventions designed to improve science trustworthiness. For instance, trustworthiness as a function of positive publication bias (B) and fraud detection rate (η) could be computed and optimization approaches could be applied to determine the optimal combination of B and η to improve science trustworthiness. These parameters can be somewhat influenced by large academic societies, government agencies or independent foundations for instance, who could fund efforts to detect fraud in published work and support research concerning null results. It is also important to note that the model results pivot explicitly on the assumption that scientists are forced to operate under a 'publish or perish' regime, and rewarded solely on output. Thus, there is another way to improve the trustworthiness of published science-while publications are indeed one measure of productivity, they are not necessarily the sole measure. While a much harder aspect to gauge, trustworthiness is more fundamentally important. For their part, scientific journals should realize that issues such as replication and null findings are equally vital to good science as eye-catching 'new' results. This is slowly beginning to be recognized, with some groups coming to the forefront of championing reproducible research methods [45]. 
The consequences detailed in this manuscript only arise when publishing quantity is the dominant measure of an academic's worth, but in reality this should only be one consideration among many. The model suggests that if publishing is the sole criteria under which academics are judged, then dubious conduct can thrive.
We accordingly need to consider alternative ways to assess researchers, and to encourage judicious diligence over dubious publishing. The model outlined here is far from complete, but yields some insights into the factors that shape the trustworthiness of published science. There is already evidence that pressure to publish is driving researcher burn-out and cynicism in published research [46], negatively affecting both research and the researchers themselves [47,48]. Other studies have not found a clear association between some productivity incentives and bias [49], but these incentives may be confounded, in that they sometimes coexist with other features and research practices that tend also to increase the quality of research, rather than just the quantity of publications. Crucially, bogus findings risk undermining public confidence in science. Among notable examples [50][51][52], the fraudulent Lancet MMR-autism paper [53] is especially infamous, remaining a cornerstone of anti-vaccine narratives [54].
Scientific publishing is not intrinsically flawed, and complete, unbiased publication is essential for scientific progress. This work illuminates potential consequences of a system where publication is the dominating measure of academic success, and strongly suggests we should consider the consequences of our incentives, and look at changing how academics are evaluated. This is key not only to appreciating the exceptional pressures wrought upon researchers by a strict publish or perish imposition, but to improving science itself. This would not only benefit those working in the field, but is crucial if public trust in science is to be maintained.
Data accessibility. A demonstration version of the model is available at https://github.com/drg85/Publish-or-Perish, coded for MATLAB and as a Windows stand-alone application.