Hidden patterns of codon usage bias across kingdoms

The genetic code is necessarily degenerate with 64 possible nucleotide triplets being translated into 20 amino acids. Eighteen out of the 20 amino acids are encoded by multiple synonymous codons. While synonymous codons are clearly equivalent in terms of the information they carry, it is now well established that they are used in a biased fashion. There is currently no consensus as to the origin of this bias. Drawing on ideas from stochastic thermodynamics we derive from first principles a mathematical model describing the statistics of codon usage bias. We show that the model accurately describes the distribution of codon usage bias of genomes in the fungal and bacterial kingdoms. Based on it, we derive a new computational measure of codon usage bias—the distance D capturing two aspects of codon usage bias: (i) differences in the genome-wide frequency of codons and (ii) apparent non-random distributions of codons across mRNAs. By means of large scale computational analysis of over 900 species across two kingdoms of life, we demonstrate that our measure provides novel biological insights. Specifically, we show that while codon usage bias is clearly based on heritable traits and closely related species show similar degrees of bias, there is considerable variation in the magnitude of D within taxonomic classes suggesting that the contribution of sequence-level selection to codon bias varies substantially within relatively confined taxonomic groups. Interestingly, commonly used model organisms are near the median for values of D for their taxonomic class, suggesting that they may not be good representative models for species with more extreme D, which comprise organisms of medical and agricultural interest. We also demonstrate that amino acid specific patterns of codon usage are themselves quite variable between branches of the tree of life, and that some of this variability correlates with organismal tRNA content.

Beanbag models assume that selection acts at the level of codons. Assigning codons to a particular site can be thought of as selecting beans from a bag. Each time a codon is picked, the choice between the codons / bags is made with a fixed probability that does not depend on the previous choices made. For each codon assignment there will be a probability p i that the i-th codon is chosen among all possible |C A | codons C A 1 , . . . , C A |C A | for amino acid A. This type of selection procedure is generally known to lead to a multinomial distribution of codon subsequences. Specifically, the probability to observe a particular gene that has L = L A,g occurrences of a particular amino acid A with m := |C A |, and k i occurrences of the i-th codon is given by Here q i is the global codon usage bias of the i-th codon. In the special case when there are only two codons in the amino acid, i.e. m = 2 then this probability reduces to the binomial distribution. We set k := k 1 , k 2 = L − k, p := q and p 2 = 1 − q.

Mixture models
A variant of the codon model is the multinomial mixture model, whereby selection still happens at the level of the individual subsequences, but the probability to select a codon depends on the particular subsequence. If we assume that there are M different groups of genes that each share the same selection pressure and the fraction of genes belonging to group i is F i then we can calculate the expected frequencies of genes as follows (for the case of only 2 codons): Here the brackets indicate an ensemble average over the M classes. The empirical probability to observe a subsequence with k codons of type one is then given by . ( As the number of classes M grows, the probability P k approaches the uniform probability distribution, since

Codon usage bias as a random walk
An alternative to the beanbag model is a model where not individual codons are selected, but specific subsequences. As an extreme example, consider the scenario where there is no underlying global codon usage bias at all and all codons occur equally often across the genome, but within a particular gene all codons for an amino acid A are strictly always the same. In this case the probability p A i = 1/|C A |, but the probability to observe a particular subsequence would not follow the binomial distribution. In fact, there would be only C A different subsequences of codons for each amino acid of a particular gene.
For the derivation of the full model, it is convenient to consider codon evolution as a random walk. We consider here a model whereby the position of the codons does not matter, and we only care about the number of codons in a particular subsequence. This analysis remains thus insensitive to correlations of codon usage within genes.
For simplicity, we focus here exclusively on amino acids with 2 codons, yet extensions to more codons are straightforward. In the case of 2 codons, codon evolution can be represented as a 1D random walk in discrete space and continuous time. The number of sites is L A,g . Each site corresponds to a particular subsequence composition, i.e. a pair (k A,g 1 , L A,g − k A,g 2 ), which defines a state. In the case of 2 codons the state is entirely characterised by k 1 . States are connected (i.e. accessible to the random walker) when a single synonymous codon change is sufficient to get from one state to the other. In the case of 2 codons this means that there are two states with one connected state (0, L A,g ) and (L A,g , 0)). All other states are connected to two other states states, such that k 1 is connected to (k 1 − 1, L A,g − k 1 + 1) and (k 1 + 1, , L A,g − k 1 + 1). When there are more than 2 codons, then each state is connected to more than two states.
A transition from one state to another always involves that a codon of a particular type is changed to a codon of a different type. The rate of such an event is proportional to the number of codons of the type that is lost. For example, the rate of transitions where codon 2 is converted to a codon of type 1 is proportional to In order to derive a model, we now take a conceptual leap and (i) model each subsequence as a site of the random walk that has a given energy E i , where the index i counts the numbers of codons of type 1 (which we choose arbitrarily). This energy is, at least for the time being, an entropic energy in that it is the (conceptual) consequence of the number of subsequence configurations that have k 1 codons of type 1. This number is lowest when all codons are of the same type and highest when all codons are used exactly equally often. (ii) We posit that the random walker that moves between the sites is in contact with a large heat-bath that remains at a fixed temperature T . This temperature bath exchanges energy with the walker, thus enables it to transition to sites with higher energy, as well as lower energy while extracting/adding energy to the heat bath. We stress here that this idea of energy and temperature are merely conceptual devices and should not be confused with the actual physical temperature that is experience by organisms.
Having made this conceptualisation, we can now apply well established concepts from stochastic thermodynamics [1] to this problem in order calculate the long-term equilibrium probabilities π(k 1 , k 2 , . . .) for various configurations. To do this we impose that the system obeys the detailed balance condition, which means that in equilibrium there are no net-flows of probability between any two states that are connected.
Here r(k 1 , k 2 , . . . → k 1 + 1, k 2 − 1, . . .) is the rate of a mutation from any of the k 1 codons of type 1 to codon 2. This implies that π(k 1 , k 2 , . . .) π(k 1 + 1, We also know that the occupation probabilities of a random walker of the type described here obeys the Boltzmann distribution in equilibrium. This means that the probability to find the walker in a given configuration depends on the energy of the configuration E(k 1 , k 2 , . . .) in the following way: Since we are not interested here in specific units, we will henceforth set the Boltzmann constant k B = 1.
The energy is a priori unknown, but we postulate that the local detail balance condition is fulfilled [2]: where r + and r − are the forwards and backwards transition rates respectively. This relationship then implies a dependence between the two rates, namely: 1.2.1 Calculation of the energy of a specific codon in the completely unbiased beanbag model.
We first calculate the energies associated with a model where there is no selection pressure at all and codons perform a random walk only. This is the beanbag model where all codons have the same probability of being chosen. To calculate the energies of each subsequence, we start from a subsequence where all L codons are of type 1. We define this state as having energy E 0 = 0. A mutation can reach the next state (1) by changing one of the L codons of type 1 to a codon of type 2. When there are no selection forces then this happens with a rate proportional to L. From this state (1), the system can then move further to state (2). This happens now with a rate proportional to L − 1. Alternatively, with a rate proportional to 1 it can move back to state (0). Altogether, the following transitions are thus possible: Using the local detailed balance condition (eq. 8) and because the energy associated with state (0) is by construction E 0 := 0, the energy difference between state (0) and state (1) is ∆E 1 = ln(L/1). Generally, the energy difference ∆E i between state (i) and state (i − 1) is given by ∆E i = ln ((L − i + 1)/i). Consequently, the energy of state (k) is then: This entails that the (Boltzmann-)probability to observe a specific configuration k is proportional to the binomial coefficient, which indicates that this system is distributed according to the binomial distribution eq. 2 with p = 1/2. As such it represents a model with no selection pressure at the level of sequences.
We now establish an energy model of the binomial distribution when there is an underlying bias to the mutations. In this case then the model is (c.f. eq. 10): Following the same reasoning as above, we can establish the energy differences: Hence, it follows that the energyÊ k is given by: The corresponding Boltzmann distribution can be shown to be the binomial distribution eq. 2 with p = q.

Asymmetric bias
For completeness we also consider the case where the transition rates are biased in one direction only. To this end we modify the rates such that there is a constant bias C into one direction. The random walk is then modified as follows: Following the same steps as above, we can calculate energies for this model.
The average codon usage ǫ is then given by

Sequence selection models
We now further modify the rates of the model and assume that there is a selection pressure on the genome such that the rate with which codon 2 mutates to codon 1 and vice versa is now no longer proportional to the number N − k + 1 and k of codon 2, but proportional to a power of this number. This then changes the above model, as follows: The energies then become:Ẽ We conclude that, if the transition rates are as in the unbiased case, but modified by the exponent, then the energies of the model remain unaffected, but the temperature of the system changes according to γ. Note that in this model, on average, the number of codons of type 1 will always be the same as the number of codons of type 2. Thus no global codon usage bias would occur. Selection manifests itself in the way that codons are distributed across subsequences, which is quantified by the inverse temperature γ. Finally, we now consider the most general model with different exponents to the left and to the right to arrive at the full model.
This changes the energies of the model in the following way: It is apparent that this case changes both the inverse temperature of the system and the expected number of codons of type 1 relative to the case of no selection. The corresponding Boltzmann distribution, is a mono-modal distribution. The mean of the distribution depends on the parameters ξ and γ. For certain choices of these parameters, P (Ē k ) can approximate a binomial distribution with some bias p = 1/2.

Further details on methods used
Fitting the data Each gene in the clean sequence files was then split into (up to) 18 OCCs as follows: For each gene g and amino acid A in the dataset we found all codons that code for A and discarded all others. Thus, we reduced the gene g to an OCC of codons of length L A,g . For each OCC thus produced, a control coding OCC was generated by replacing each codon with a random synonymous codon (which could be the same as the one in the original OCC). The probability of choosing a random synonymous codon was biased according to the observed global codon frequency bias of the respective species and amino acid, such that in the control data the codons were distributed according to the multinomial distribution by construction.
For the results presented in the main text we limited the analysis to the 9 amino acids with only 2 codons because the calculation of D becomes statistically unreliable when OCC sets are too small, and this can generate problem in the analysis of amino acids encoded by more than two codons. The number of possible OCCs grows quickly with the number of codon possibilities, whereas the actual number of OCCs represented in a genome does not. As a consequence, there are fewer examples per configuration which increases the statistical error.
The empirical estimates for the parameters of the full model were obtained by fitting genomic data to eq. 6 as follows: 1. For each species we split up each gene into 18 OCCs (or 9 respectively when we focussed on the 2 codon amino acids).
2. We then discarded all OCCs with more than 15 or fewer than 5 codons.
3. From the remaining data, we calculated the normalised frequencies of each codon as the number of occurrences of each codon divided by the total number of codons in the subset. This provided an empirical estimate of the energy E i .
4. Next, we fitted these empirical E i values to the full model eq. 6, thus obtaining estimates for γ and ξ for each length, i.e. 11 different fits per amino acid.
5. Species-wide distances D were calculated by taking occurrence weighted averages over the individual fits of the species.
Fitting was done using the Maple 2018 "NonlinearFit" function. The initial estimates for γ and ξ were set to 1. If an initial fit resulted in a mean-residual > 0.0009999 then the fit was repeated with randomly chosen initial estimates. This was repeated up to 1000 times until a fit was found with a mean-residual < 0.00099999. Fig. 2c shows the residuals obtained from the full model versus the residuals obtained from the binomial model. If the former is low and the latter is high then the corresponding OCC is better explained by SLS than the beanbag model. However, there is an ambiguous region where the beanbag model fits equally well as the full model (or better). There are many ways to define such a region, and we chose the following conservative approach: We first fitted a straight line to the corresponding plot of the control data in fig. 2b. Then we decided that all points in the plot for the full model in fig. 2c that lie below this line are ambiguous.

Global temperatures
For each set of subsequences corresponding to a particular length and amino acid, we next define an empirical energy E as , i.e. plotting the empirical energies againstÊ would result in a straight line with slope 1. In contrast, if the distribution of subsequences was distributed by a multinomial distribution with a higher temperature, then this plot would still result in a straight line, but with a slope corresponding to 1/T . We plotted the empirical energies E againstÊ and found that the two energies are in good approximation, related via a straight line. Fig. 1a shows 4 examples of fungal species for the 9 amino acid with exactly 2 codons. In order to check the temperature of a species, we fitted E versusÊ to a straight line. The slope of this straight line can then be taken as an estimate for the inverse temperature for each of the 462 fungal species in our dataset, see fig. 1a. % par We found that the slopes are significantly and systematically different from 1 and significantly and systematically higher than slopes obtained from random controls; see fig. 1. Seen globally, the model is thus consistent with the full model and thus consistent with the hypothesis of sequence-level selection. At the same time, the significantly altered slopes are not consistent with the assumption of a purely codon-based selection, which would lead to a binomial distribution of data, and hence slopes close to 1.
While this global data on the slope can be used qualitatively to demonstrate the signature of sequencelevel selection acting on codon usage, it cannot be reliably used to quantify this selection pressure. While the result that slopes systematically deviate from the random model is robust with respect to changes of the fitting protocol, the precise numerical values are not robust to such changes. In fig. 1b we reported slopes obtained by fitting the lines to the entire dataset. If instead fits are restricted to energiesÊ < 5, which would exclude the low probability points, then somewhat different numerical results would appear. Moreover, the   There is considerable overlap between the two groups. Density estimates for each group are overlayed on the graph to aid the eye. It appears that fungi are somewhat cooler than protists. Bacteria would lie in-between protists and fungi, but are omitted to aid graph readability. The control data peaks around an inverse temperature of 1, whereas the real data is distributed around a lower inverse temperature. (b) Distribution of inverse temperature for two different species. This shows the temperature for two species including all amino acids and is a subset of (a).4c The distribution of distances D. The control data clearly has a smaller distance on the whole than the non-selection model, indicating that considering only the global codon usage bias underestimates the selection pressure. (d) Same data, but for two species only. global slopes obtained thus are in no clear relationship with the summary statistics of the detailed fits of the full model to subsequences, even though they consist of exactly the same data.
We then repeated the same plot for the same species, but this time for the amino acids that have more than 2 codons. Fig. 2 demonstrates that these also lead to approximately straight lines, but only for the subset of high probability subsequences. For those we obtained slopes that were close to 1. When all subsequences are included then noise dominates and no meaningful fitting is possible; see fig. 2b.

Supplementary plots
Correlation for the parameter ξ for the fungal dataset. We include only those fits where the mean-residual was smaller than 0.0009999. The horizontal axis indicates the fitted value for the subsequence length of 15 and the vertical axis for the subsequences values 5 − 14. The title indicates the Pearson correlation. Same as above but for parameter γ.