Widespread selection for extremely high and low levels of secondary structure in coding sequences across all domains of life

Codon composition, GC content and local RNA secondary structures can have a profound effect on gene expression, and mutations affecting these parameters, even though they do not alter the protein sequence, are not neutral in terms of selection. Although evidence exists that, in some cases, selection favours more stable RNA secondary structures, we currently lack a concrete idea of how many genes are affected within a species, and whether this is a universal phenomenon in nature. We searched for signs of structural selection in a global manner, analysing a set of 1 million coding sequences from 73 species representing all domains of life, as well as viruses, by means of our newly developed software PACKEIS. We show that codon composition and amino acid identity are main determinants of RNA secondary structure. In addition, we show that the arrangement of synonymous codons within coding sequences is non-random, yielding extremely high, but also extremely low, RNA structuredness significantly more often than expected by chance. Taken together, we demonstrate that selection for high and low levels of secondary structure is a widespread phenomenon. Our results provide another line of evidence that synonymous mutations are less neutral than commonly thought, which is of importance for many evolutionary models.


Page 3, Results, second paragraph:
The definition of the DBF-score in this paragraph is not clear. The authors should add the formal definition to the Methods section. What confuses me is the fact that "...DBF-score of 0 means that none of the aORFS exhibits a lower DBF, while a DBF-score of 1 means that none of the aORFs exhibits a higher DBF". Assuming that a set of aORFs behaves exactly as the oORF, what would be the resulting DBF-score? Please clearly state the definition of the DBF-score! 4. Page 3, Results, paragraph 2 + 3: Some of this text should rather be moved to the Methods section as it explains what the PACKEIS software actually does. 5. Page 3, Results, last paragraph: The sentence "The PACKEIS software including ..." appears several times throughout the manuscript. While a little redundancy usually doesn't hurt, it might be better to state the 'Availability' of the software only once, either in the Methods section when you explain in detail what PACKEIS implements, or as a separate section 'Availability' at the end of the manuscript. 6. Page 9, Results: for the amino acid identity analysis you create two sets of 10,000 random peptide sequences, one consisting of amino acids that appear in structured, and the other with amino acids that appear mostly in unstructured ORFs. From the text (even after reading the corresponding part in the Methods section), I don't fully understand what are the oORFs and aORFs in this analysis. After all, both are required to obtain the DBF-score if I'm not mistaken. Please clearly state what constitutes oORF and aORF. 7. Page 9, Results, last sentence before Discussion: The Figure  You mention RNAfold and RNAplfold which are used mutually exclusive depending on the length of the input ORF length. Please state whether these tools are run with default options or not. For RNAplfold you state that it is used with option '-l 100'. While this should probably read '-L 100' I wonder whether there are any other additional options passed to RNAplfold.
If not, I'd suggest adding the '-W' option with a window length at least 50nt larger than the value used for '-L' to level boundary effects for the computed base pair probabilities. Note here, that -L specifies the maximum allowed distance of two pairing nucleotides along the backbone, while the -W option specifies the window that is used to average the pairing probabilities. (see Bernhart et al. 2006, Bioinformatics). 9. Page 11, Methods section, "Quantifying the number of ORFs under structural selection" The textual description of the s_low/s_high scores is a little bit confusing. Please clarify: (i) that you count the lowest/highest quantiles only if they are significant, i.e. p < 0.05 which translates to Z-score >= 1.96.
(ii) what the s_low/s_high scores actually count. If I'm not mistaken, you are measuring the excess of ORFs in the five quantiles with respect to the expectation. Please rephrase this paragraph and clearly denote the parts of the text that correspond to the equation that follows this paragraph, i.e. state what a_{Q k / 100) and z_{Q k / 100} is. 10. Page 12, Methods section, "Generating random peptides and ORFs" As already mentioned for "6. Page 9, Results", the description for the second random peptide set is unclear. Please clarify what are the oORFs and corresponding aORFs that are required for DBF and DBF-score computation and averaging.

Review form: Reviewer 2
Recommendation Accept with minor revision (please list in comments) Are each of the following suitable for general readers?

a)
Title Yes

c) Introduction Yes
Is the length of the paper justified? Yes Should the paper be seen by a specialist statistical reviewer? Yes

Do you have any ethical concerns with this paper? No
Comments to the Author Gelbert et al present evidence that a small percent of protein coding sequences show a nonrandom degree of RNA secondary structure in species from all five kingdoms + viruses. The key method develops a metric for RNA structure based on the number of nucleotide pairs predicted by the ViennaFold algorithm and compares that to the number of nucleotide pairs predicted by three 100x-shuffled variants to estimate the random expectation. The best controlled shuffle variant maintains the same composition of the 61 codons, thus preserving GC content, amino acid content and synonymous codon usage. Other variants allow the effect of amino acid content and synonymous codon usage to be assessed. I find the principle claims to be true and interesting. The Discussion is well balanced, making it clear that it is challenging to determine if there is a selection on say amino acid content that is causing changes in structure or vice versa. The authors may well be correct that the true degree of secondary structure under selection in protein coding regions is larger than their analysis shows.
I hope the authors will consider the following.
1. Results, first sentence. The term "backfolding" is introduced without being defined. Why "back", why not just folding. Please define the term when it is introduced.
2. I am not suggesting changing the DBF-score, but it does not distinguish between genes with many nucleotide pairs above or below random vs those with only one. The thresholding of greater than or less than the DBF of the oORF prevents this. Would it be useful to present in addition the distribution of, say, nucleotide pairs above or below the 100xmodel 2 background for, say, the genes DBF-score <0.05 or >0.95? Is there a Gaussian or, perhaps more likely, a single tailed distribution?
3. It might be worth pointing out that species with large effective population sizes are those where codon preferences correlate strongly with tRNA abundances, whereas those with higher mutational loads have codon usages that reflect neutral selection, such as GC content. e.g. dos Reis and Wernisch 2009 mol biol evol 26, 451. Would it be interesting to test if species with the structure bias correspond to those with higher or lower effective population sizes?
4. Given that synonymous codons for a given amino acid have closely related mono, di, and trinucleotide frequencies and that RNA di-and tri-nucleotides have differing base stacking and base pairing energies, it seems inevitable that any differences in amino acid or synonymous codon usage between genes would result in different degrees of RNA secondary structure. e.g. Mathews et al J mol biol 1999 288, 911. I would be tempted to make that explicit. (note to the editor, shuffle model 2 shows that some of the unexpected structure observed is not due to selection for amino acid or synonymous codon usage).
5. There is an interesting paper from Lai et al showing extensive folding of many mRNAs, including their protein coding region. NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-06792-z.
6. It is off putting that there is such a poor correlation in structure bias between orthologous genes. Codon usage, by contrast, is known to be well conserved among orthologs. This is one reason I am not compelled that the DFB-score is the most sensitive or best metric.

23-Apr-2019
Dear Dr Rosenkranz, We are pleased to inform you that your manuscript RSOB-19-0020 entitled "Widespread selection for high and low secondary structure in coding sequences across all domains of life" has been accepted by the Editor for publication in Open Biology. The reviewer(s) have recommended publication, but also suggest some minor revisions to your manuscript. Therefore, we invite you to respond to the reviewer(s)' comments and revise your manuscript.
Please submit the revised version of your manuscript within 7days. If you do not think you will be able to meet this date please let us know immediately and we can extend this deadline for you.
To revise your manuscript, log into https://mc.manuscriptcentral.com/rsob and enter your Author Centre, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision.
You will be unable to make your revisions on the originally submitted version of the manuscript. Instead, please revise your manuscript and upload a new version through your Author Centre.
When submitting your revised manuscript, you will be able to respond to the comments made by the referee(s) and upload a file "Response to Referees" in "Section 6 -File Upload". You can use this to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response to the referee(s). Please see our detailed instructions for revision requirements https://royalsociety.org/journals/authors/author-guidelines/.
Before uploading your revised files please make sure that you have: 1) A text file of the manuscript (doc, txt, rtf or tex), including the references, tables (including captions) and figure captions. Please remove any tracked changes from the text before submission. PDF files are not an accepted format for the "Main Document".
2) A separate electronic file of each figure (tiff, EPS or print-quality PDF preferred). The format should be produced directly from original creation package, or original software format. Please note that PowerPoint files are not accepted.
3) Electronic supplementary material: this should be contained in a separate file from the main text and meet our ESM criteria (see http://royalsocietypublishing.org/instructions-authors#question5). All supplementary materials accompanying an accepted article will be treated as in their final form. They will be published alongside the paper on the journal website and posted on the online figshare repository. Files on figshare will be made available approximately one week before the accompanying article so that the supplementary material can be attributed a unique DOI.
Online supplementary material will also carry the title and description provided during submission, so please ensure these are accurate and informative. Note that the Royal Society will not edit or typeset supplementary material and it will be hosted as provided. Please ensure that the supplementary material includes the paper details (authors, title, journal name, article DOI Comments to the Author(s) In their article, the authors investigate whether the coding sequence (CDS) of mRNAs is under evolutionary selection with respect to the RNAs structuredness, i.e. base pairing potential. For that purpose, the authors take over one million CDS from a total of 73 species from all domains of life (incl. viruses) and, for each of them, predict base pairing probabilities which in turn give rise to a single measure of structuredness, the degree of backfolding (DBF). To evaluate, whether a DBF for a particular CDS is significant, the authors compare it against alternative open reading frames (aORFs) obtained from three different background models. In particular, the aORFs consist of codons that preserve the encoded protein sequence, but are (i) selected from a uniform distribution, (ii) selected from the global codon usage of the respective organism, or (iii) derived from shuffling the original open reading frame (oORF) of the CDS. This comparison then gives rise to a DBF-score which ranges from 0 to 1 denoting that the oORF is less structured than background, or more structured than background, respectively.
From this analysis, the authors find that indeed many CDS seem to be significantly more (or less) structured than expected by the background model(s) throughout all kingdoms of life. In particular, viral CDS appear to be exceptionally structured. Furthermore, the authors identify two sets of codons that are over-represented in exceptionally structured, and unstructured ORFs. Similarly, two sets of amino acids are determined which, again, lead to structured and unstructured ORFs, respectively.
Along with their manuscript, the authors present a newly create software named PACKEIS that given a set of ORFs creates aORFs for each of the background models mentioned above. PACKEIS then computes the DBF and DBF-scores for the input data by utilizing the programs RNAfold or RNAplfold of the ViennaRNA Package.
The article is very well written and sound. However, I have some minor remarks that still need to be addressed.
1. The term 'high' and 'low' secondary structures that is used in the title and throughout the manuscript sounds rather odd to me. Especially in the title, it is unclear what high and low secondary structures could refer to. Since the authors associate structuredness in terms of the DBF with it, I would suggest adding the term 'structuredness' or something similar.

Page 3, Results, second paragraph:
Here, the definition of DBF appears in textual form. This should be moved to the Methods section, potentially associated with an equation.

Page 3, Results, second paragraph:
The definition of the DBF-score in this paragraph is not clear. The authors should add the formal definition to the Methods section. What confuses me is the fact that "...DBF-score of 0 means that none of the aORFS exhibits a lower DBF, while a DBF-score of 1 means that none of the aORFs exhibits a higher DBF". Assuming that a set of aORFs behaves exactly as the oORF, what would be the resulting DBF-score? Please clearly state the definition of the DBF-score! 4. Page 3, Results, paragraph 2 + 3: Some of this text should rather be moved to the Methods section as it explains what the PACKEIS software actually does.

Page 3, Results, last paragraph:
The sentence "The PACKEIS software including ..." appears several times throughout the manuscript. While a little redundancy usually doesn't hurt, it might be better to state the 'Availability' of the software only once, either in the Methods section when you explain in detail what PACKEIS implements, or as a separate section 'Availability' at the end of the manuscript. 6. Page 9, Results: for the amino acid identity analysis you create two sets of 10,000 random peptide sequences, one consisting of amino acids that appear in structured, and the other with amino acids that appear mostly in unstructured ORFs. From the text (even after reading the corresponding part in the Methods section), I don't fully understand what are the oORFs and aORFs in this analysis. After all, both are required to obtain the DBF-score if I'm not mistaken. Please clearly state what constitutes oORF and aORF. 7. Page 9, Results, last sentence before Discussion: The Figure  You mention RNAfold and RNAplfold which are used mutually exclusive depending on the length of the input ORF length. Please state whether these tools are run with default options or not. For RNAplfold you state that it is used with option '-l 100'. While this should probably read '-L 100' I wonder whether there are any other additional options passed to RNAplfold.
If not, I'd suggest adding the '-W' option with a window length at least 50nt larger than the value used for '-L' to level boundary effects for the computed base pair probabilities. Note here, that -L specifies the maximum allowed distance of two pairing nucleotides along the backbone, while the -W option specifies the window that is used to average the pairing probabilities. (see Bernhart et al. 2006, Bioinformatics).

Page 11, Methods section, "Quantifying the number of ORFs under structural selection"
The textual description of the s_low/s_high scores is a little bit confusing. Please clarify: (i) that you count the lowest/highest quantiles only if they are significant, i.e. p &lt; 0.05 which translates to Z-score &gt;= 1.96.
(ii) what the s_low/s_high scores actually count. If I'm not mistaken, you are measuring the excess of ORFs in the five quantiles with respect to the expectation. Please rephrase this paragraph and clearly denote the parts of the text that correspond to the equation that follows this paragraph, i.e. state what a_{Q k / 100) and z_{Q k / 100} is.
10. Page 12, Methods section, "Generating random peptides and ORFs" As already mentioned for "6. Page 9, Results", the description for the second random peptide set is unclear. Please clarify what are the oORFs and corresponding aORFs that are required for DBF and DBF-score computation and averaging.

Referee: 2
Comments to the Author(s) Gelbert et al present evidence that a small percent of protein coding sequences show a nonrandom degree of RNA secondary structure in species from all five kingdoms + viruses. The key method develops a metric for RNA structure based on the number of nucleotide pairs predicted by the ViennaFold algorithm and compares that to the number of nucleotide pairs predicted by three 100x-shuffled variants to estimate the random expectation. The best controlled shuffle variant maintains the same composition of the 61 codons, thus preserving GC content, amino acid content and synonymous codon usage. Other variants allow the effect of amino acid content and synonymous codon usage to be assessed. I find the principle claims to be true and interesting. The Discussion is well balanced, making it clear that it is challenging to determine if there is a selection on say amino acid content that is causing changes in structure or vice versa. The authors may well be correct that the true degree of secondary structure under selection in protein coding regions is larger than their analysis shows. I hope the authors will consider the following.
1. Results, first sentence. The term "backfolding" is introduced without being defined. Why "back", why not just folding. Please define the term when it is introduced. 4. Given that synonymous codons for a given amino acid have closely related mono, di, and trinucleotide frequencies and that RNA di-and tri-nucleotides have differing base stacking and base pairing energies, it seems inevitable that any differences in amino acid or synonymous codon usage between genes would result in different degrees of RNA secondary structure. e.g. Mathews et al J mol biol 1999 288, 911. I would be tempted to make that explicit. (note to the editor, shuffle model 2 shows that some of the unexpected structure observed is not due to selection for amino acid or synonymous codon usage).
5. There is an interesting paper from Lai et al showing extensive folding of many mRNAs, including their protein coding region. NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-06792-z.
6. It is off putting that there is such a poor correlation in structure bias between orthologous genes. Codon usage, by contrast, is known to be well conserved among orthologs. This is one reason I am not compelled that the DFB-score is the most sensitive or best metric.

01-May-2019
Dear Dr Rosenkranz We are pleased to inform you that your manuscript entitled "Widespread selection for extremely high and low levels of secondary structure in coding sequences across all domains of life" has been accepted by the Editor for publication in Open Biology.
You can expect to receive a proof of your article from our Production office in due course, please check your spam filter if you do not receive it within the next 10 working days. Please let us know if you are likely to be away from e-mail contact during this time.
Article processing charge Please note that the article processing charge is immediately payable. A separate email will be sent out shortly to confirm the charge due. The preferred payment method is by credit card; however, other payment options are available.
Thank you for your fine contribution. On behalf of the Editors of Open Biology, we look forward to your continued contributions to the journal.