Factors influencing taxonomic unevenness in scientific research: a mixed-methods case study of non-human primate genomic sequence data generation

Scholars have noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies continue these disparities? Here, using non-human primates as a case study, we identified disparities in massively parallel genomic sequencing data and conducted interviews with scientists who produced these data to learn their motivations when selecting study species. We tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive (SRA). Of the 179.6 terabases (Tb) of sequence data in SRA for 519 non-human primate species, 135 Tb (approx. 75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees and crab-eating macaques. The strongest predictors of the amount of genomic data were the total number of non-medical publications (linear regression; r² = 0.37; p = 6.15 × 10⁻¹²) and number of medical publications (r² = 0.27; p = 9.27 × 10⁻⁹). In a generalized linear model, the number of non-medical publications (p = 0.00064) and closer phylogenetic distance to humans (p = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analysed their responses using grounded theory. Consistent with our quantitative results, authors mentioned their choice of species was motivated by sample accessibility, prior published work and relevance to human medicine. Our mixed-methods approach helped identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies and research teams aiming to align their broader goals with future data generation efforts.

The choice of predictors is exciting, as they are qualitatively diverse and include some that plausibly relate directly to biological laboratory research and some that may justify research outside the laboratory (e.g. conservation). While hinted at, the manuscript does not state clearly what appears to me to be the main conclusion, namely that, among the different plausible causes that could lead to genomic work on primates, medical research is more informative than the other plausible explanations. Possibly, alternative hypotheses could be framed at the beginning of the manuscript.
The violin plots hide a lot of the data, as most values are 0 and violin plots use a kernel estimate to smooth the distribution. It would be useful to have the number of zero and non-zero values indicated clearly (e.g. stated in the legend, or by switching from a violin plot to a non-parametric letter-value plot).

Additional curiosities:
While the manuscript follows a 1:M type of inquiry between genomic information and other predictors, it would be interesting to see all (1+M):(1+M) comparisons between all predictors and genomic information. This would help to see whether there are distinct groups of related features.
When multiple predictors are considered simultaneously, the strength of the models is not visualized (e.g. with a scatter plot of observed versus predicted values), making it difficult to see whether there are systematic mispredictions.
As there are likely non-linearities and possibly interactions or redundancies between predictors, an alternative analysis using Gradient Boosting Regression would appear to offer a better quantification and an alternative way to assess the importance of individual predictors.
It is unclear whether the medical literature performs less well than the entire literature because it is medical, or because there is less medical literature, so that the numbers are smaller and more difficult to fit. A possible control analysis to distinguish between these would be to randomly subsample papers from the entire literature.
Do the findings hold for distinct modalities of omics data? E.g.: presence of a reference build of the genome, exome-sequencing, RNA-sequencing, … Which of the different modalities of omics dominates the data considered in this study?
Would the metadata of the experiments indicate whether the participants in the semi-structured interview used samples stemming from tissues of animals or from cell lines which could be cultured in a lab? How would the types of provided reasons change?
Since this is about genomics: How informative would it be whether there is an official accepted genome sequence of these primates? How informative would the years of the initial genome sequence builds be?

Review form: Reviewer 2
Is the manuscript scientifically sound in its present form? Yes

Do you have any ethical concerns with this paper? No
Have you any concerns about statistical analyses in this paper? No

Recommendation? Accept as is
Comments to the Author(s)
I find the paper "Factors influencing taxonomic unevenness in scientific research: A mixed-methods case study of non-human primate genomic sequence data generation" to be clearly written and the analyses to be rational and competently performed. While I feel the results presented confirm something almost anyone in the field would expect to be the case, it is nice to see it confirmed. Additionally, the grounded theory approach is an interesting way to explore the qualitative factors behind the taxonomic unevenness in scientific research.

Decision letter (RSOS-201206.R0)
We hope you are keeping well at this difficult and unusual time. We continue to value your support of the journal in these challenging circumstances. If Royal Society Open Science can assist you at all, please don't hesitate to let us know at the email address below.

Dear Mrs Hernandez
On behalf of the Editors, we are pleased to inform you that your Manuscript RSOS-201206 "Factors influencing taxonomic unevenness in scientific research: A mixed-methods case study of non-human primate genomic sequence data generation" has been accepted for publication in Royal Society Open Science subject to minor revision in accordance with the referees' reports. Please find the referees' comments along with any feedback from the Editors below my signature.
Both reviewers were very positive about publication. Reviewer 1 raised a number of interesting comments to improve the manuscript further and we invite you to respond to the comments and revise your manuscript. Below the referees' and Editors' comments (where applicable) we provide additional requirements. Final acceptance of your manuscript is dependent on these requirements being met. We provide guidance below to help you prepare your revision.
Please submit your revised manuscript and required files (see below) no later than 7 days from today's date (i.e. 10-Aug-2020). Note: the ScholarOne system will 'lock' if submission of the revision is attempted 7 or more days after the deadline. If you do not think you will be able to meet this deadline, please contact the editorial office immediately.
Please note article processing charges apply to papers accepted for publication in Royal Society Open Science (https://royalsocietypublishing.org/rsos/charges). Charges will also apply to papers transferred to the journal from other Royal Society Publishing journals, as well as papers submitted as part of our collaboration with the Royal Society of Chemistry (https://royalsocietypublishing.org/rsos/chemistry). Fee waivers are available but must be requested when you submit your revision (https://royalsocietypublishing.org/rsos/waivers).
Thank you for submitting your manuscript to Royal Society Open Science and we look forward to receiving your revision. If you have any questions at all, please do not hesitate to get in touch.

Reviewer: 1
Comments to the Author(s)
The manuscript by Hernandez et al. provides a nice overview of distinct biases that exist in non-human primate genomics. The coupling of a quantitative approach with interviews is interesting, as it shows how global trends relate to the experiences of individual researchers. One ambivalent strength of the manuscript is that it repeatedly identifies many points that would warrant further curiosity and closer study, but does not provide more detail itself. I will describe some of these below, but believe that addressing them fully might be outside the scope of this publication. Methodologically, only one statistical mistake, which could be easily corrected, caught my eye. Stylistically, the manuscript could be cleaner.

Statistics:
The manuscript provides averages and a t-test to describe differences between those species with and without genomic information. As hinted at in their Supplemental Figures, however, the data themselves are not normally distributed. Hence the non-parametric alternatives, the median and the two-sided Mann-Whitney U test, would be appropriate.
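
To make the suggested test concrete, the following is a minimal sketch of a two-sided Mann-Whitney U test in plain Python. It is not the authors' code: it uses the tie-corrected normal approximation without a continuity correction, and the publication counts in the usage lines are hypothetical values chosen only to mimic zero-heavy data.

```python
import math
from statistics import mean, median


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    tie_sizes = {}
    for v in combined:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # all values identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Hypothetical per-species publication counts (illustration only).
with_data = [0, 0, 0, 0, 2, 15, 120]
without_data = [0, 0, 0, 0, 0, 0, 1, 3]
u_stat, p = mann_whitney_u(with_data, without_data)
print("medians:", median(with_data), median(without_data))
print("means:", mean(with_data), mean(without_data))
print("U =", u_stat, "p =", round(p, 3))
```

For small samples an exact p-value may be preferable to the normal approximation; an established statistics library would normally be used in practice.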

Code:
The code neither explains how to import the data files (for which there are several possible ways) nor contains lines that read the data files in (which would avoid ambiguity). This prevented me from testing it.

Stylistic:
When referring to elements of a panel, it would be easier for the reader if the order in the text matched the order of the panels.
A table with all individually considered predictors and their associated R² and p-values (for both the presence of sequencing data and the extent of sequencing data for species with at least some data) would be helpful.
The choice of predictors is exciting, as they are qualitatively diverse and include some that plausibly relate directly to biological laboratory research and some that may justify research outside the laboratory (e.g. conservation). While hinted at, the manuscript does not state clearly what appears to me to be the main conclusion, namely that, among the different plausible causes that could lead to genomic work on primates, medical research is more informative than the other plausible explanations. Possibly, alternative hypotheses could be framed at the beginning of the manuscript.
The violin plots hide a lot of the data, as most values are 0 and violin plots use a kernel estimate to smooth the distribution. It would be useful to have the number of zero and non-zero values indicated clearly (e.g. stated in the legend, or by switching from a violin plot to a non-parametric letter-value plot).

Additional curiosities:
While the manuscript follows a 1:M type of inquiry between genomic information and other predictors, it would be interesting to see all (1+M):(1+M) comparisons between all predictors and genomic information. This would help to see whether there are distinct groups of related features.
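
As an illustration of the all-against-all comparison described above, here is a minimal sketch in plain Python that reports zero and non-zero counts per variable and a Spearman rank correlation for every pair of variables. The variable names and values are hypothetical and are not taken from the manuscript's data; Spearman correlation is an assumption chosen because the data are described as non-normal.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def zero_counts(values):
    """Return (number of zeros, number of non-zero values)."""
    zeros = sum(1 for v in values if v == 0)
    return zeros, len(values) - zeros


def pairwise_spearman(table):
    """Correlate every pair of columns in a {name: per-species values} dict."""
    return {(p, q): spearman(table[p], table[q]) for p, q in combinations(table, 2)}


# Hypothetical per-species values (illustration only).
table = {
    "sequence_data": [0, 0, 5, 40, 900],
    "non_medical_pubs": [1, 0, 12, 30, 400],
    "medical_pubs": [0, 0, 0, 3, 250],
}
for name, column in table.items():
    print(name, "zeros / non-zeros:", zero_counts(column))
for pair, rho in pairwise_spearman(table).items():
    print(pair, round(rho, 2))
```

Clustering the resulting correlation matrix would be one way to look for groups of related features.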
When multiple predictors are considered simultaneously, the strength of the models is not visualized (e.g. with a scatter plot of observed versus predicted values), making it difficult to see whether there are systematic mispredictions.
As there are likely non-linearities and possibly interactions or redundancies between predictors, an alternative analysis using Gradient Boosting Regression would appear to offer a better quantification and an alternative way to assess the importance of individual predictors.
It is unclear whether the medical literature performs less well than the entire literature because it is medical, or because there is less medical literature, so that the numbers are smaller and more difficult to fit. A possible control analysis to distinguish between these would be to randomly subsample papers from the entire literature.
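
The subsampling control proposed above could be sketched as follows. This is only one possible reading of the suggestion: all papers are pooled, a fixed number are drawn at random without replacement, and per-species counts are recomputed. The species counts and the target total are hypothetical.

```python
import random
from collections import Counter


def subsample_publications(counts, target_total, seed=0):
    """Draw target_total papers without replacement from the pooled literature.

    counts maps species name -> number of papers in the entire literature.
    Returns a dict with the same keys holding the subsampled counts.
    """
    pool = [species for species, n in counts.items() for _ in range(n)]
    if target_total > len(pool):
        raise ValueError("cannot draw more papers than exist in the pool")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    drawn = Counter(rng.sample(pool, target_total))
    return {species: drawn.get(species, 0) for species in counts}


# Hypothetical counts of all publications per species (illustration only).
all_pubs = {"species_a": 500, "species_b": 40, "species_c": 3, "species_d": 0}
# Hypothetical size of the medical literature to match.
medical_total = 60

subsampled = subsample_publications(all_pubs, medical_total, seed=1)
print(subsampled)
```

Repeating the draw with many seeds and refitting the regression each time would give a distribution of fit values for a literature of matched size, against which the fit for the medical literature could be compared.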
Do the findings hold for distinct modalities of omics data? E.g.: presence of a reference build of the genome, exome-sequencing, RNA-sequencing, … Which of the different modalities of omics dominates the data considered in this study?
Would the metadata of the experiments indicate whether the participants in the semi-structured interview used samples stemming from tissues of animals or from cell lines which could be cultured in a lab? How would the types of provided reasons change?
Since this is about genomics: How informative would it be whether there is an official accepted genome sequence of these primates? How informative would the years of the initial genome sequence builds be?

Reviewer: 2
Comments to the Author(s)
I find the paper "Factors influencing taxonomic unevenness in scientific research: A mixed-methods case study of non-human primate genomic sequence data generation" to be clearly written and the analyses to be rational and competently performed. While I feel the results presented confirm something almost anyone in the field would expect to be the case, it is nice to see it confirmed. Additionally, the grounded theory approach is an interesting way to explore the qualitative factors behind the taxonomic unevenness in scientific research.

===PREPARING YOUR MANUSCRIPT===
Your revised paper should include the changes requested by the referees and Editors of your manuscript. You should provide two versions of this manuscript and both versions must be provided in an editable format:
--one version identifying all the changes that have been made (for instance, in coloured highlight, in bold text, or tracked changes);
--a 'clean' version of the new manuscript that incorporates the changes made, but does not highlight them. This version will be used for typesetting.
Please ensure that any equations included in the paper are editable text and not embedded images.
Please ensure that you include an 'acknowledgements' section before your reference list/bibliography. This should acknowledge anyone who assisted with your work, but does not qualify as an author per the guidelines at https://royalsociety.org/journals/ethicspolicies/openness/.
While not essential, it will speed up the preparation of your manuscript proof if you format your references/bibliography in Vancouver style (please see https://royalsociety.org/journals/authors/author-guidelines/#formatting). You should include DOIs for as many of the references as possible.
If you have been asked to revise the written English in your submission as a condition of publication, you must do so, and you are expected to provide evidence that you have received language editing support. The journal would prefer that you use a professional language editing service and provide a certificate of editing, but a signed letter from a colleague who is a native speaker of English is acceptable. Note the journal has arranged a number of discounts for authors using professional language editing services (https://royalsociety.org/journals/authors/benefits/language-editing/).

===PREPARING YOUR REVISION IN SCHOLARONE===
To revise your manuscript, log into https://mc.manuscriptcentral.com/rsos and enter your Author Centre; this may be accessed by clicking on "Author" in the dark toolbar at the top of the page (just below the journal name). You will find your manuscript listed under "Manuscripts with Decisions". Under "Actions", click on "Create a Revision".

Attach your point-by-point response to referees and Editors at Step 1 'View and respond to decision letter'. This document should be uploaded in an editable file type (.doc or .docx are preferred). This is essential.
Please ensure that you include a summary of your paper at Step 2 'Type, Title, & Abstract'. This should be no more than 100 words to explain to a non-scientific audience the key findings of your research. This will be included in a weekly highlights email circulated by the Royal Society press office to national UK, international, and scientific news outlets to promote your work.

At Step 3 'File upload' you should include the following files:
--Your revised manuscript in editable file format (.doc, .docx, or .tex preferred). You should upload two versions: 1) One version identifying all the changes that have been made (for instance, in coloured highlight, in bold text, or tracked changes); 2) A 'clean' version of the new manuscript that incorporates the changes made, but does not highlight them.
--If you are requesting a discretionary waiver for the article processing charge, the waiver form must be included at this step.
--If you are providing image files for potential cover images, please upload these at this step, and inform the editorial office you have done so. You must hold the copyright to any image provided.
--A copy of your point-by-point response to referees and Editors. This will expedite the preparation of your proof.

At Step 6 'Details & comments', you should review and respond to the queries on the electronic submission form. In particular, we would ask that you do the following:
--Ensure that your data access statement meets the requirements at https://royalsociety.org/journals/authors/author-guidelines/#data. You should ensure that you cite the dataset in your reference list. If you have deposited data etc in the Dryad repository, please only include the 'For publication' link at this stage. You should remove the 'For review' link.
--If you are requesting an article processing charge waiver, you must select the relevant waiver option (if requesting a discretionary waiver, the form should have been uploaded at Step 3 'File upload' above).
--If you have uploaded ESM files, please ensure you follow the guidance at https://royalsociety.org/journals/authors/author-guidelines/#supplementary-material to include a suitable title and informative caption. An example of appropriate titling and captioning may be found at https://figshare.com/articles/Table_S2_from_Is_there_a_trade-off_between_peak_performance_and_performance_breadth_across_temperatures_for_aerobic_scope_in_teleost_fishes_/3843624.

At Step 7 'Review & submit', you must view the PDF proof of the manuscript before you will be able to submit the revision. Note: if any parts of the electronic submission form have not been completed, these will be noted by red message boxes.

See Appendix A.

Decision letter (RSOS-201206.R1)
We hope you are keeping well at this difficult and unusual time. We continue to value your support of the journal in these challenging circumstances. If Royal Society Open Science can assist you at all, please don't hesitate to let us know at the email address below.

Dear Dr Hernandez,
It is a pleasure to accept your manuscript entitled "Factors influencing taxonomic unevenness in scientific research: A mixed-methods case study of non-human primate genomic sequence data generation" in its current form for publication in Royal Society Open Science.
You can expect to receive a proof of your article in the near future. Please contact the editorial office (openscience_proofs@royalsociety.org) and the production office (openscience@royalsociety.org) to let us know if you are likely to be away from e-mail contact --if you are going to be away, please nominate a co-author (if available) to manage the proofing process, and ensure they are copied into your email to the journal.
Due to rapid publication and an extremely tight schedule, if comments are not received, your paper may experience a delay in publication. Royal Society Open Science operates under a continuous publication model. Your article will be published straight into the next open issue and this will be the final version of the paper. As such, it can be cited immediately by other researchers. As the issue version of your paper will be the only version to be published I would advise you to check your proofs thoroughly as changes cannot be made once the paper is published.
Please see the Royal Society Publishing guidance on how you may share your accepted author manuscript at https://royalsociety.org/journals/ethics-policies/media-embargo/.
Thank you for your fine contribution. On behalf of the Editors of Royal Society Open Science, we look forward to your continued contributions to the Journal.

Response to reviewer comments:
We thank both reviewers for their thoughtful feedback on our manuscript. Below we describe how we incorporated the revisions into our manuscript and, if not, why we chose not to do so. Within this document, you will find our responses in red. We have also included the changes to the manuscript in red.

Reviewer 1:
Comments to the Author(s): The manuscript by Hernandez et al. provides a nice overview of distinct biases that exist in non-human primate genomics. The coupling of a quantitative approach with interviews is interesting, as it shows how global trends relate to the experiences of individual researchers. One ambivalent strength of the manuscript is that it repeatedly identifies many points that would warrant further curiosity and closer study, but does not provide more detail itself. I will describe some of these below, but believe that addressing them fully might be outside the scope of this publication. Methodologically, only one statistical mistake, which could be easily corrected, caught my eye. Stylistically, the manuscript could be cleaner.
General Response to Reviewer 1: Thank you so much for your thoughtful feedback on both the quantitative and qualitative portions of the paper. Your feedback has made our manuscript stronger and clearer and we sincerely appreciate the time and effort you spent on reviewing our manuscript.
1. Statistics: The manuscript provides averages and a t-test to describe differences between those species with and without genomic information. As hinted at in their Supplemental Figures, however, the data themselves are not normally distributed. Hence the non-parametric alternatives, the median and the two-sided Mann-Whitney U test, would be appropriate.
Response: Thank you for this recommendation. We have accordingly replaced the t-tests with Mann-Whitney U tests, which we agree are more appropriate in this case. We observed a change in the result for one variable, millions of years since last shared common ancestor with humans, which went from being non-significant to significant using standard probability cutoffs. Please see the revised text below. However, we decided to continue listing mean rather than median values in the manuscript text, because the means are more informative. For example, the median values for number of medical publications for species with and without genomic data were both zero.