Inferring processes underlying B-cell repertoire diversity

We quantify the VDJ recombination and somatic hypermutation processes in human B cells using probabilistic inference methods on high-throughput DNA sequence repertoires of human B-cell receptor heavy chains. Our analysis captures the statistical properties of the naive repertoire, first after its initial generation via VDJ recombination and then after selection for functionality. We also infer statistical properties of the somatic hypermutation machinery (exclusive of subsequent effects of selection). Our main results are the following: the B-cell repertoire is substantially more diverse than T-cell repertoires, owing to longer junctional insertions; sequences that pass initial selection are distinguished by having a higher probability of being generated in a VDJ recombination event; somatic hypermutations have a non-uniform distribution along the V gene that is well explained by an independent site model for the sequence context around the hypermutation site.


Background
Along with T cells, B cells contribute to the large diversity of immune cells that specifically recognize antigens. The diversity of the B-cell repertoire is encoded in the different amino acid composition of B-cell receptors (BCRs) expressed on the surface of these cells. These receptors are formed during the VDJ recombination process in the bone marrow. Before these cells leave for the periphery, they are initially selected for functionality. Later, they undergo further selection depending on their ability to recognize foreign antigen. Additionally, unlike T cells, BCR sequences are subject to point hypermutations during the proliferation that follows successful recognition of an antigen [1]. These hypermutations are selected for antigen binding through the process of affinity maturation. Apart from the possibility of hypermutations, the generation and selection processes are very similar in B and T cells, and involve common enzymes and pathways [2]. Recent advances in sequencing technologies [3,4] make it possible to obtain copious data on immune cell receptor sequences. We work with large samples of human BCR heavy chain DNA sequence 1 and apply advanced statistical techniques to accurately quantify the processes that shape B-cell repertoire diversity-generation, selection and hypermutations. Some of these techniques were developed in [5,6] to describe T-cell repertoires. Here, we also introduce a new probabilistic model to describe hypermutations (specific to B cells), as well as new tools to automatically detect and assign new V, D and J alleles in individuals from their repertoire data, and to infer the distribution of these alleles among their two chromosomes.
Many characteristics of B-cell repertoires have previously been described using a variety of methods. Gene usage was studied by both immunoscope techniques [7,8] and single-cell PCR [9], and the variable length distributions of the complementarity-determining region 3 (CDR3) were reported [10,11]. License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited. [12 -18]. Considerable attention has also been given to quantifying hypermutations and disentangling them from site specific selection, either by comparing synonymous and non-synonymous mutations, or functional and nonfunctional rearrangements [19 -26]. Other studies used lineage trees to describe this mutation -selection process [24,27]. Recently, high-throughput sequencing data combined with sophisticated inference techniques have been used to study selection in the affinity maturation process [19] and the statistics of synonymous hypermutation profiles [20].
Yet a comprehensive quantitative description of the entire generation, selection and hypermutation process of the heavy chain repertoire is still lacking. Here, we self-consistently model all parts of these interlinked processes. VDJ recombination is based on a combinatoric process in which V, D and J genes are chosen from a number of genomic templates and joined together, with additional base pair insertions and deletions at the VD and DJ junctions leading to further randomness in the final sequence [2]. In the case of antigenexperienced cells, receptor sequences further undergo random somatic hypermutations. A difficulty that arises in analysing receptor sequences is that a given sequence can be produced in many ways. This complication makes it impossible to unambiguously retrace the steps (V, D, J gene choices, deletions and insertions at the junctions, hypermutations) that have led to its generation. Our method circumvents this difficulty by employing a probabilistic framework that exploits large sequence datasets to accurately infer the statistical properties of the three processes that are central to the generation and evolution of BCRs-VDJ recombination, functional selection and hypermutations. This analysis allows us to quantify the theoretical diversity of these sequences. Our results suggest that, as in the case of T-cell receptors, the VDJ recombination process is biased towards sequences that are likely to pass functional selection. We also show that the diversity of the human B-cell repertoire is significantly reduced during the initial functional selection step. Our probabilistic framework also allows us to account for the statistics of hypermutations, which are well described by an additive DNA context-dependent binding model.

Analysis approach
We analysed Illumina reads of BCR DNA sequences from human blood samples taken from two individuals (labelled A and B) [16]. Cells from each sample were sorted into naive and memory subsamples. The variable region of the BCR gene was amplified using sequence specific PCR primers resulting in 130 bp DNA sequence reads anchored on a conserved sequence within the J gene (data from H. Robins, whom we thank for sharing it with us).
The VDJ recombination process is not guaranteed to produce in-frame sequences or, even when sequences are in frame, functional proteins. If the receptor gene from the initially rearranged chromosome is not functional, the second chromosome may be rearranged. If this second recombination event leads to a functional receptor, the cell has two rearranged chromosomes-one functional and expressed, and the other one silenced by allelic exclusion. As a result, the DNA sequence dataset we analysed contains a large fraction of non-productive sequences, which are either out-of-frame or contain a stop codon. These sequences experienced no selection and owe their survival to the receptor expressed by the other chromosome. For this reason, they provide us with the raw, unselected product of the generation process. We used such out-of-frame sequences from the naive subsample to infer the statistics of the VDJ recombination process, and the out-of-frame sequences from the memory subsample to learn the statistics of hypermutations. McCoy et al. [19] previously exploited these differences between in-and out-of-frame sequences in human BCR memory repertoire analysis. The naive productive sequences (in frame and with no stop codon) are expected to have passed a selection process before being admitted to the periphery (henceforth called initial selection, to distinguish it from selection following a recognition event). We used this subsample to learn the selective forces acting on amino acids by comparing how their statistics differ from the raw product of VDJ recombination learned from the naive out-of-frame sequences. Figure 1a summarizes the analysis workflow and emphasizes how the three main processes underlying sequence diversity-VDJ recombination, initial selection, hypermutations-are inferred using three subsamples of the sequences. A typical subsample used in our analysis had approximately 200 000 unique sequences.
As we stressed above, one sequence can be the result of a number of scenarios that include different initial gene choices, followed by variable numbers of deleted and inserted base pairs. This problem requires a probabilistic description of the generation process that sums over all the different possible scenarios for producing a given sequence, weighting each scenario by its probability. Each scenario's probability P pre is calculated using a generation

P(V, D, J)P(insVD)P(insDJ)
× P(delV|V)P(dellD,delrD|D)P(delJ|J) × P(s 1 )P(s 2 |s 1 ) ··· P(s insVD |s insVD-1 ) × P(t 1 )P(t 2 |t 1 ) ··· P(t insDJ |t insDJ-1 ) learn Q knowing P pre Figure 1. (a) BCR heavy chain sequences are formed during VDJ recombination according to a probability distribution P pre that we infer from the unproductive naive sequence repertoire. The unproductive memory repertoire is used to infer the rate and sequence dependence of somatic hypermutation. Productive sequences are selected for entry into the naive peripheral repertoire with a sequence-dependent factor Q, resulting in the observed distribution of receptor sequences P post . (b) Recombined sequences arise via a scenario involving independent choices of which gene segments to recombine as well as of numbers of deletions and insertions. The probability distribution of these choices is not known unambiguously from the observed sequences and is estimated probabilistically in an iterative procedure. (c) The selection factor Q is assumed to be a product of factors for V and J gene choice together with factors q i;L (a) for the choice of the specific amino acid a at each position i in a CDR3 of length L. These factors are determined from the naive productive sequence repertoire by an iterative procedure.
model of the form shown in figure 1b. In brief, the various factors account for the probabilities of uncorrelated events leading to a specific VDJ rearrangement: choice of which gene segments to recombine P(V, D, J ), choice of number of nucleotides to insert in a VD or DJ joint P(insVD) and P(insDJ), probability of number of deletions from all four ends of the V, D and J genes at the junctions P(delVjV ), P(dellD,delrDjD) and P(delJjJ ), as well as factors to account for unequal nucleotide preference in the inserted sequences.
Since the recombination machinery is the same for B and T cells, and this model structure captured all the correlations present in T-cell data [5], we expect that it should also capture all correlations present in the B-cell sequence data; we will see below that this is the case. In [5], we described an expectation -maximization method to self-consistently solve for the component probability distributions in P pre by maximizing the likelihood of the set of observed sequences and we apply the same method here. This method is applied to the naive unproductive sequence repertoires that result from the raw generation process. The expectation -maximization algorithm converges after only a few steps (electronic supplementary material, figure  S1). The results of the inference obtained from two disjoint sequence datasets (from the same individual) are almost identical (electronic supplementary material, figure S2), indicating that our results are robust to statistical noise. We validated a posteriori the factorized structure of the model distribution in figure 1b, by checking that correlations between its constitutive elements were consistent with the hypothesized model structure (electronic supplementary material, figure S3). During proliferation, BCR sequences also acquire point hypermutations that occur with probability e i at position i on the sequence. We use the non-productive memory sequences to learn P mem , which is the product of P pre and the probability of a given combination of mutations (given in red in figure 1b). As in the case of non-productive naive sequences, we reasoned that these sequences were not selected at any point (either before or after hypermutations), and that their statistics should reflect the raw product of VDJ recombination followed by random mutations.
Using the (hypermutation-free) generation model as a starting point, we infer selection factors Q acting on each sequence in the naive repertoire, where Q is defined as the fold increase of the probability to see a particular sequence in the functional repertoire (naive, productive) compared with the previously learned generation probability: Q ¼ P post /P pre . To infer those factors, we use a factorized model (figure 1c), where we assume that selection acts independently on the V and J gene choice (through factor q VJ ), the length L of the CDR3 sequence (through factor q L ), and on each of the amino acids a i at positions 1 i L between the conserved cysteine near the end of the V gene and the conserved tryptophan within the J gene (through factors q i;L (a i )). We use an expectation -maximization procedure to update the selection factors until convergence, as previously described by Elhanati et al. [6]. The convergence of log likelihood as a function of iteration number is shown in electronic supplementary material, figure S4, and the reproducibility of the inferred factors across two disjoint datasets in electronic supplementary material, figure S5.
Our inference of recombination scenarios for individual reads requires accurate knowledge of the germline sequence of all the V, D and J genes. These genes have several alleles, and an accurate accounting of sequencing errors and somatic hypermutation events requires knowing which alleles are present on the two chromosomes of each individual. The existing databases do not provide such information, but list all alleles that have been detected, together with an estimate of the population frequency of these alleles [28]. To address this problem, we have developed a method for identifying allelic variation directly from the sequence data for each individual. Working with naive out-of-frame sequence reads (so that hypermutation is not an issue), we accumulate patterns of mismatches between reads and reference genome gene sequences that occur too often to be attributed to chance errors. This procedure usually identifies at most two significant alleles needed to account for a given individual's reads. While the majority of genes are homozygous, a significant number are not (we find that heterozygous V alleles account for 32% of all sequences in individual A). We are of course sensitive only to allelic variation in the region of a gene that can show up within the 130 bp sequence read (while the V gene is almost 300 bp in length). Thus, our 'alleles' may not capture the full extent of allelic variation, although they do capture all the information we need for our analysis. In summary, we infer specific alleles for each individual, and then use that individual's specific alleles when inferring generative models or selection models. This refinement of the genetic information yields a much improved accounting of sequencing errors and hypermutation events. A more detailed description of the method can be found in the electronic supplementary material, and a list of inferred heterozygous alleles is given in electronic supplementary material, tables SI and SII. Given a generative model inferred using this allelic data, we are further able to make a probabilistic assignment of alleles to two chromosomes for each individual. The model includes a joint usage probability distribution P(V, D, J ) of the three sets of genes/alleles (in effect treating alleles as separate genes). Since VDJ rearrangement happens on a single chromosome, the probability of recombining a heterozygous V allele with a heterozygous D allele on different chromosomes should be zero, up to assignment errors (figure 2). To reconstruct the two chromosomes, we consider all possible associations of a chromosome to each heterozygous allele. We then compute the likelihood of each chromosomal association as the sum of the joint probability P(V, D, J ) over all V, D, J combinations that are associated with the same chromosome, and simply choose the chromosomal association that maximizes this likelihood (see the electronic supplementary material for details). In this way, we found a chromosomal organization for the two individuals that accounted for about 90% of all sequences. We can also evaluate the usage probability of the two chromosomes identified using this procedure. For both individuals, it was consistent with equal usage probability between the two chromosomes, within errors.

(a) Distribution of recombination events
The raw distribution of recombination events P pre before any selection takes place was inferred from the naive, nonproductive sequence dataset, as explained above (figure 1). This inference procedure yielded the VDJ gene usage distribution, as well as the distributions of insertions and deletions, and the frequency of inserted nucleotides. It is notable that V and D gene usage (electronic supplementary material, figures S6-S7) is strongly non-uniform (to simplify displays of gene-specific information here and elsewhere, we agglomerate alleles into their associated genes). Figure 3 shows the distributions of the number of inserted nucleotides between the V and D genes (figure 3a), or D and J genes (figure 3b), averaged over both individuals. The figure shows both the distributions after generation, i.e. before any selection, as inferred from the non-productive sequences and, for comparison, the same distributions for the productive sequences in the naive repertoire, i.e. after the initial selection process. They have similar forms-wide distributions with exponential tails. The effect of selection is to favour sequences with fewer insertions, thus reducing CDR3 length.
The identities of the nucleotides inserted during the generation process are well described by a dinucleotide Markov model (electronic supplementary material, figures S9 -S11). As was observed for T-cell receptors [5], the profile for the VD insertions on the sense strand correlates very well with the JD insertions on the antisense strand.
Although the deletion profiles are in fact gene dependent, and we infer a separate deletion profile for each gene (electronic supplementary material, figure S8), for convenience we present here only the weighted mean over all genes of the V and J gene deletion distributions (figure 3c,d). These distributions seem little affected by selection, as evidenced by the similarity between the pre-and post-selection (naive) distributions.

(b) Selection
Armed with the raw recombination model P pre , we can estimate the effect of selection in the naive repertoire by comparing the statistics of the naive, productive sequences, with those predicted by the generation model. The effect of selection is already evident from the CDR3 length distribution [16], as illustrated in figure 4. The pre-selection sequences are longer and have a wider length distribution than the selected ones. This effect of selection on length is in agreement with previous studies [12,16], and in close parallel with recent observations on T-cell receptors [6].
Selection acting on the rearranged sequences is quantified by position-dependent selection factors q i;L (a) that capture the positive or negative effect that each amino acid at each position has on the functionality of the entire sequence, and in addition by gene-dependent selection factors q VJ (figure 1b). We show the amino acid selection factors averaged over both individuals in figure 5 and show the gene-dependent selection factors in electronic supplementary material, figure S12. The q i;L factors are reasonably consistent between the two individuals (figure 6a), even though their sequence repertoires show slightly different CDR3 length distributions ( figure 4). Residue selection patterns show a dependence on the position and length of the CDR3-some patterns are related to either the V or J side of the junction, while other effects are localized in the bulk. This is related to the conserved motifs just outside of the CDR3, which function as anchors for the variable area. Thus, the role of an amino acid is different whether it is close to the V gene conserved motif, the J conserved motif, or far from both.
We also looked at correlation between our selection factors and various biochemical properties of the amino acids. We used quantified numeric properties for every amino acid, such as hydrophobicity or polarity, to look at the Spearman's correlation between these properties and selection factors of the 20 amino acids, for a specific position and CDR3 length. This can be done for every position and length, yielding many correlation coefficients. Figure 6b-l shows the distributions of those correlation coefficients. Selection is not determined by a single property of the amino acid. However, some properties do correlate with selection, namely the tendency of an amino acid to participate in a turn of the protein (figure 6d) and its tendency to be found in the core of the interaction complex (figure 6g). A few other properties, namely amino acid volume (figure 6l ), charge (figure 6h), pH (figure 6i) or hydrophobicity (figure 6k) all have a negative influence on the overall amino acid selection probability. This last negative correlation is consistent with the observation that hydrophobic D segments are selected against after rearrangement [16]. Most of these results are similar to what was observed in the case of T cells [6].

(c) Correlation between generation and selection
What kind of sequences are likely to pass initial selection? Each sequence, no matter in what repertoire we observe it, can be assigned a probability of being generated in the initial VDJ recombination event. Figure 7a shows the distribution of this quantity for sequences in the pre-and post-selection repertoires. Remarkably, we note that most sequences have a very low generation probability (typically less than 10 210 ). The similarity of the naive productive (green) and postselection model prediction (red) curve is a validation of the model, while the difference from the pre-selection (blue) curve highlights the effect of selection. Sequences that had higher probability to be generated are also more likely to be selected, resulting in a shift towards higher generation probabilities after selection. This correlation between generation probability and selection is made more evident by figure 7b, which shows a two-dimensional density plot of the generation probability and the overall selection factor P pre versus Q, evaluated over a large set of generated sequences. There is a clear correlation between generation and selection-higher values of generation probability imply also higher values of selection and vice versa. To put it differently, the generation process anticipates the subsequent somatic selection process.

(d) Repertoire entropy
The diversity of the immune repertoire is one of its key characteristics. The rearrangement and selection models enable us to precisely quantify this diversity and identify its sources. Repertoire diversity is an inherently dynamic property. Upon random rearrangement, the initial diversity is established, but initial selection will modify it. Those changes can be demonstrated by looking at the entropy of the different distributions, calculated from the models we infer. Entropy gives a measure of the number of different sequences we can expect to find at different stages of B-cell development ( figure 8). The generation entropy can be broken down into contributions from the different events that make up the recombination scenario. Most of the diversity comes from insertions. Note that the entropy of rearranged sequences is smaller than the entropy of recombination events. This is due to convergent recombination-the fact that a given sequence can be produced by different recombination scenarios. Individual B has larger generation entropy than individual A, primarily because it has more insertions. The entropy of productive sequences is further reduced (by two bits) by keeping only the in-frame sequences. Subsequent action of selection reduces the diversity of the repertoire by about 10 bits. This reduction is true for both individuals, regardless of their different generation entropy, and follows from the correlation between P pre and Q (figure 7b), which concentrates the distribution towards the most probable rearrangements, thus reducing diversity in the selected repertoire. The numbers that can be associated with these entropies are extremely large: 2 71 2 Â 10 20 for the productive sequences and 2 60 ¼ 10 18 for the naive repertoire-much larger than the number of B cells in the body. These numbers are not estimates of the total number of possible sequencesthis is set by the maximum number of insertions and is much larger-but rather the equivalent number of outcomes in a uniform probability distribution.

(e) Hypermutations
Upon recognition of an antigen, BCRs undergo an affinity maturation process, by which their binding strength to the antigen is increased through the combination of random somatic hypermutations and selection. Thus, receptor sequences from  The pre-selection distribution is derived from a synthetic repertoire of productive sequences drawn from the generative model P pre that has been inferred from naive unproductive data sequences. Notable features include the progressive shortening and narrowing of the distribution as selective pressure is applied, and the close similarity, but not identity, between the two individuals. Hypermutations appear in our sequence reads as mismatches with the genomic sequence. However, because the survival of a sequence in the memory repertoire depends on its affinity for a particular antigen, the statistics of its hypermutations should reflect factors other than just the hypermutation process itself. To overcome this issue, we make the assumption (as in [19]) that when the hypermutation machinery is activated in a cell, it acts on both chromosomes indifferently. This assumption is backed by the comparable number of mismatches seen in the out-of-frame and in-frame memory data (the number of mismatches found in the naive out-of-frame data is smaller by two orders of magnitude). If this assumption is true, out-of-frame sequences will also display the effects of somatic hypermutations, with statistical properties unaffected by any further selection effect. For this reason, we restrict to out-of-frame memory sequences to infer the statistical properties of the somatic hypermutation process.
First, we used the out-of-frame memory sequence reads to infer a model of rearrangement with random hypermutations in the germ line part of the sequence (figure 1b). Such a model allows us to assign a set of likelihood-weighted recombination and hypermutation scenarios to any specific sequence. We thus identified and probabilistically weighted somatic hypermutation events and recorded them, together with their sequence context within a seven-nucleotide window centred on the mutation. We restricted our analysis to V gene-derived segments: D and J gene-derived sequences are much shorter and may not have provided enough statistics for our analysis. We then use statistics of the sequence context of hypermutations to construct a simple additive scoring model for the probability of mutation at any position within a V gene context. Specifically, the probability of observing a somatic hypermutation was assumed to be of the form p SHM (s) / p bg (s) exp[ P i¼23,3 e i (s i )], where s ¼ (s 23 , s 22 , . . ., s þ3 ) is the nucleotide sequence in a sevennucleotide window around the mutated nucleotide s 0 ; p bg (s) is the background probability of the heptamer s occurring in the genomic V gene sequences; and the e i (s i ) are contributions of each position to the motif, which play the same role as binding energies [31]. The values of e i are adjusted so as to maximize the likelihood of the data under the model. The e i factors so inferred are displayed in figure 9a. Since they are defined up to a constant, we impose P s e i (s) ¼ 0 at each position i relative to the mutation site. A positive value of e i (s) means that nucleotide s is enriched at position i relative to a mutation site.
By running p SHM (s) over the genomic V gene sequences, we get a predicted probability of observing a somatic hypermutation at different positions within the V gene. The prediction of this model averaged over all gene profiles (aligned with respect to the conserved Cys), and its comparison with observed hypermutations is displayed in figure 9b (the profiles for individual genes were found to be similar to the average profile). The scatter plot of model predictions versus data (correlation ¼ 0.8) shows that the model gives a good account of the data, and the plot of their position dependence shows that the model accounts well for the presence of hypermutation hotspots. Note that the hypermutation rate displayed in figure 9b is very large: 5-10% per position. For comparison, the rate of mismatches in the naive out-of-frame sequences (attributable to sequencing errors or a leak of memory cells into the naive set) was approximately 0.1%. Finally, we also looked at the dependence of the substituted nucleotides on the immediate context of the mutation. Figure 9c represents the probabilities of substituted nucleotides as a function of the trimer context. We note a clear dependence on the identity of the mutated base, with additional context-dependent variability from the local trimer sequence.

Discussion
Our statistical algorithms allowed us to characterize in detail the generation, initial selection and hypermutation processes that lead to the observed BCR repertoires. Two key ingredients underlie our approach. First, we exploited out-of-frame sequences, assumed to be free of selective pressure, to reconstruct the statistics of the DNA editing processes: VDJ recombination and hypermutations. Out-of-frame sequences have similarly been used as a baseline to study the properties of BCR naive repertoires [16], or to estimate selective    table 3.3 [29]). Residues that are exposed to solvent in protein-protein complexes (following definitions and data from [30]) are divided into three groups: surface (interface) residues that have unchanged accessibility area when the interaction partner is present (e), rim (interface) residues that have changed accessibility area, but no atoms with zero accessibility in the complex (f ) and core (interface) residues that have changed accessibility area and at least one atom with zero accessibility in the complex (g pressures related to affinity maturation in memory B cells [19]. Second, our approach overcomes the degeneracy of the recombination problem (whereby the same sequence may be generated by many different recombination events) by using a fully probabilistic approach. The generation of BCR and T-cell receptor results from similar processes, involving common enzymes and pathways. Thus, perhaps not surprisingly, many of the results we obtain here are similar to what was reported for T-cell receptors using similar methods [6]: statistical independence between the insertion and deletion processes, as well as between gene choice and insertions; the identity of inserted nucleotides following a Markovian probability law. Another similarity with T-cell receptors is that the generation process anticipates the action of selection: sequences that are more likely to be produced are more likely to be retained by selection. This suggests evolutionary adaptation of the generation machinery.
We also noted some differences with T-cell receptors. Because BCRs have more insertions than TCRs, we find that B-cell repertoires are much more diverse than T-cell repertoires, as measured by entropy, even before hypermutations occur. In addition, the selection factors acting on the CDR3 amino acid sequence of BCRs are quite different from those reported for T-cell receptors, consistent with the fact that their cognate epitopes are very different in nature.
Although there is a difference in diversity between our two individuals, this difference is restricted to the generation process and is caused by a slight excess of insertions in individual B. Selection factors on the other hand seem to be well-conserved across individuals, pointing to general biophysical and biochemical properties that are subject to selection. The selection factors correlate with some biochemical properties, including a negative correlation with hydrophobicity (in agreement with a previous report that hydrophobic D segments are selected against [16] after rearrangement).
This study also provides a couple of methodological advances. Thanks to our stochastic framework, we were able to identify heterozygous genes as well as their alleles, even when  Figure 7. (a) The distribution of generation probabilities (as inferred from the pre-selection model P pre ) for the pre-selection model itself (blue), the post-selection model P post (red) and the naive functional sequence repertoire itself (green). The key feature is that sequences in the selected repertoire have systematically higher generation probability. Panel (b) makes the same point via a scatter plot of the primitive generation probability versus the selection factor Q for a synthetic repertoire of sequences generated according to P pre .  Figure 8. Total sequence entropy partitioned into its various elementary contributions for the two individuals. The bottom three horizontal bars in each stack display the partitioning of the entropy of the probability distribution of recombination scenarios. Because multiple scenarios can generate the same sequence, the nucleotide sequence entropy of the sequences directly produced by recombination is smaller than the recombination scenario entropy. Out of those, productivity of the sequence further restricts the diversity by constraining frame and forbidding stop codons appearing, as depicted in the smaller bar above. Finally, as seen in the topmost bar, the initial selection process itself significantly reduces the diversity of those productive sequences. It is worth noticing that while the initial diversity of both individuals is different, consistent with their different CDR3 length distributions, the reduction effect of the selection is quite similar, keeping the same difference in entropy. rstb.royalsocietypublishing.org Phil. Trans. R. Soc. B 370: 20140243 these were not present in existing databases. We could then reconstruct the partition of these alleles into two chromosomes. We found no significant bias in chromosome usage.
Perhaps the most important difference between BCR and T-cell receptors is the existence of hypermutations accumulated during affinity maturation. Hypermutations are expected to add an enormous amount of sequence diversity. With a hypermutation rate estimated to be of the order of 5-10%, hypermutations are expected to contribute an additional %0.4 bit per nucleotide-a huge number if we consider that hypermutations can happen over a region a few hundred nucleotides in length. A common difficulty in studying hypermutations is to disentangle the biases that are inherent to the hypermutation process from the biases that result from affinity-driven selection [32]. While previous efforts have been devoted to the inference of selective pressures in the B-cell memory repertoire [19], here we focused our attention on the raw substitution process. We used outof-frame memory sequences to learn about the statistics of substitutions unperturbed by the effect of selection, as was suggested in [26]. For this purpose, we adapted our probabilistic framework to infer the hypermutation profile across all genes. The hypermutation rate was found to be very variable along the V-gene sequence but similar across genes. We showed that a simple additive model could correctly predict the hypermutation rate from the immediate sequence context (heptamer) around the mutation site. These results confirm earlier reports that hypermutations are context-dependent both in localization and substitutions [23,26,33,34], as well as recent observations from high-throughput sequencing data restricted to synonymous mutations [20]. Our additive model is consistent with an independent site binding energy model for the hypermutation-inducing enzyme activation-induced cytidine deaminase (AID), where each flanking nucleotide contributes independently to its binding energy, as in the case of transcription factor binding sites [31]. The hypermutation rates and substitution matrices inferred by our model could serve as a baseline or a control for future efforts to infer selective pressures from functional sequences, and could be useful in the inference of mutational phylogenic trees in the study of affinity maturation.