The analysis of clonal expansions in normal and autoimmune B cell repertoires

Clones are the fundamental building blocks of immune repertoires. The number of different clones relates to the diversity of the repertoire, whereas their size and sequence diversity are linked to selective pressures. Selective pressures act both between clones and within different sequence variants of a clone. Understanding how clonal selection shapes the immune repertoire is one of the most basic questions in all of immunology. But how are individual clones defined? Here we discuss different approaches for defining clones, starting with how antibodies are diversified during different stages of B cell development. Next, we discuss how clones are defined using different experimental methods. We focus on high-throughput sequencing datasets, and the computational challenges and opportunities that these data have for mining the antibody repertoire landscape. We discuss methods that visualize sequence variants within the same clone and allow us to consider collections of shared mutations to determine which sequences share a common ancestry. Finally, we comment on features of frequently encountered expanded B cell clones that may be of particular interest in the setting of autoimmunity and other chronic conditions.


Introduction
B cell clones comprise the fundamental units of selection in the humoral immune response. But what defines a clone? Clonally related cells derive from a common progenitor cell. All cells in the body initially derive from the zygote, but saying that all of the cells in the body are clonally related is not a satisfying answer for B cells. B cells undergo somatic recombination events and somatic hypermutation (SHM) to create a diverse repertoire of clones. To find a better answer to this question, we need to think of differentiation events and how they limit downstream cell fates. Therefore, in §2, we begin by reviewing basic checkpoints in B cell development where the antibody repertoire is generated, diversified and selected. Clones, or collections of genetically similar cells (as defined by similarities in their antibody heavy chain, light chain or a combination of both), emerge at these different checkpoints. How clones are defined is also profoundly influenced by the kind of experiment that is performed. In §3, we review four basic kinds of repertoire experiments and comment on the advantages and disadvantages of each with respect to clone identification and characterization.
Next, we come to the computational problem of identification and analysis of clones in high-throughput sequencing data. In §4, we describe different methods for defining clones and some of the errors that arise with these different approaches. We emphasize the need for trying different methods and using different parameters for defining clones to determine if clonal lineages are robust. In §5, we comment on clonal expansion, diversity and overlap. Here, we focus primarily on the diversity of the repertoire as a whole (inter-clonal diversity). In §6, we comment on two general metrics of repertoire skewing: heavy chain V gene (VH) usage and third complementarity-determining region (CDR3) length distribution. In § §7 and 8, we turn to diversification within individual clones and discuss some of the computational approaches and challenges when analysing clonal lineages for evidence of selection. We leave some of the details of clonal lineage analysis to the separate article in this theme issue by Steve Kleinstein [1].
Although many clones are unique, one of the most surprising findings in the literature is that certain kinds of antibodies recur-they appear to be independently generated. In §9, we review some examples of recurrent expanded B cell clones and stereotyped receptors. We comment on why such clones may be enriched in the repertoires of individuals with autoimmunity and other chronic conditions, as well as why multireactive clones may serve beneficial functions in healthy individuals. In §10, we discuss some ideas regarding clonal evolution and speculate that many expanded and persistent clones are multireactive in some of their members/ progeny, at one or more times during their lifespans.

Definitions of clones that focus on different stages of B cell development
During somatic B cell development, there are several developmental stages where clones could be separated from other clones. An overview of B cell maturation is provided in figure 1 and is based mostly upon mechanistic studies in the mouse and in cell lines [2,3]. Figure 1 summarizes the corresponding antibody gene rearrangements and other forms of diversification. Heavy chain (H chain) gene rearrangement occurs in the pro-B cell stage [4]. Developing B cells with different H chain gene rearrangements are represented by different colours. In the large cycling pre-B cell stage ( pre-B1), cells with favourable H chains undergo more rounds of cell division than cells with less favourable H chains [5,6]. Favourable H chains could be those that are expressed at higher levels and/or pair more effectively with the surrogate light chain and/or signal better [5]. B cells that fail to make a productive H chain rearrangement altogether die [7]. For simplicity, we show only the fate of B cells that express a particular H chain rearrangement (denoted by dark green shading) that have successfully emerged following the pre-B1 selection checkpoint. After signalling through the pre-B cell receptor ( pre-BCR) and a few rounds of cell division, these B cells are ready to enter the small resting pre-B2 stage of development, in which light chain (L chain) gene rearrangement occurs [8]. Because there is more than one B cell with the same H chain rearrangement entering the pre-B2 stage, we presume that these cells can in principle undergo independent rearrangement to different L chains  that is favourable and B cells with this particular rearrangement undergo multiple rounds of cell division in the pre-B1 stage. In the pre-B2 stage, light (L) chain rearrangement occurs, indicated by different shades of green. Fully formed antibodies are subjected to further selection in the naive B cell stage. In the periphery, antigen-reactive B cells can undergo activation, proliferation and somatic mutation (GC, germinal centre). Somatic mutations are indicated by different coloured circles. Mutations at different branch levels are marked by different coloured circles around each 'cell'. An appropriate lineage is constructed based on the number of common mutations in each of the expressed mutant types, leading, in this particular example, to the suggestion of a lineage with a common trunk (orange inner circle), two branch levels, and two inferred nodes (intermediary mutants).
and have indicated the range of different L chains with different shades of green at the pre-B2 cell stage. The next stage of B cell differentiation is the naive B cell stage, sometimes also referred to as the transitional B cell stage. These cells are the first to display the B cell receptor on the cell surface, in the form of an IgM molecule. Based largely on data in inbred strains of mice, transitional B cells appear to be short-lived and most die [9][10][11]. For illustrative purposes, we have therefore contracted our repertoire from many different H þ L pairs in the pre-B2 stage to a single cell in the transitional stage (that has already survived the selection checkpoint within the transitional cell pool). For many surviving peripheral B cells, the next major round of selection occurs mainly in the germinal centre, where B cells are exposed to antigen and T cell help, and undergo multiple rounds of proliferation, SHM and selection for high-affinity binders [12]. Sequence variants of the clone emerge and either expand or contract, based upon their affinity for antigen(s) [13 -15]. Based upon this simplified overview of B cell maturation, one realizes that there are several checkpoints at which expanded clones could be molecularly defined. If one focuses on the pro-B to pre-B cell transition, clones can be identified on the basis of having shared functional H chain rearrangements (VDJþ), as well as shared non-functional rearrangements (VDJ2) or partial rearrangements (DJ). If instead, one focuses on the pre-B2 to transitional B cell checkpoint, one could use not only the H chain rearrangement(s), but the L chain rearrangements as well. L chain rearrangements can include V-J rearrangement at the kappa and lambda loci as well as rearrangements to the non-coding recombining sequence (RS) on the kappa locus. Finally, if one focuses on the mature B cell repertoire, one could not only use the combinatorial and junctional diversity of the H and L chain gene rearrangements generated in the preceding stages, but also the diversity that arises due to SHM.

Different experimental approaches for
antibody repertoire analysis Table 1 shows four different experimental approaches that are used to analyse B cell clones. There are of course other approaches, such as Southern blotting of bulk repertoires [16,17], V gene arrays [18] and mass spectrometry of secreted antibodies [19], but we chose these four methods because they are the most frequently used.

(a) Hybridoma panels
Hybridomas are non-secretory myeloma cells fused to normal B cells [20]. They provide virtually unlimited amounts of genetic material, and because they secrete the antibody of the normal B cell, they provide a facile means of studying that antibody. Because of the abundance of available genetic material, the gene rearrangements at all of the antibody loci can be studied in hybridomas. The variety of antibody gene rearrangements that can be used to identify and track expanded B cell clones are shown in figure 2. To define clonally related hybridomas, one can readily generate full-length sequences of the expressed H and L chains. The analysis of these sequences can yield information about the hypervariable third complementarity-determining region (CDR3), the H þ L chain pair, and even about SHMs within the variable region genes of the H and L chains. The sizes of the DNA fragments that contain various gene rearrangements by Southern blotting can be used as another clone-specific genetic feature of hybridomas. One can also perform polymerase chain reactions (PCRs) to genotype the expressed H and L chain rearrangements, and the non-productive or deletional rearrangements (such as nonproductive VDJ rearrangements on the non-expressed H chain allele or RS rearrangements at the kappa locus). One can also characterize reciprocal products, such as Vk-Jk rearrangements that are retained on the chromosome when a subsequent inversional rearrangement occurs between an upstream Vk and a downstream Jk gene segment. Although nearly all of these other rearrangement products are not as diverse as the H chain CDR3, finding concordance among them in two different hybridomas is a powerful indication that the hybridomas are clonally related [21]. The major drawback of using hybridomas for antibody repertoire analysis is that they only represent a very small portion of the total B cell repertoire. There are several reasons for this. First, some kinds of B cells are easier to fuse than others. For example, in some cases, B cell mitogens such as lipopolysaccharide (LPS) are used to enhance the recovery of hybridomas and some cells, such as marginal zone cells, are more responsive to LPS than others [22]. Second, the yield of generating hybridomas in our experience is at best several hundred per mouse. Because only one in several thousand B cells successfully fuses with the myeloma cell and is propagated in culture, there is limited sampling of the repertoire. Third, hybridomas are typically selected by binding to antigen. On the one hand, this is an advantage, because Table 1. Repertoire experiment types. Shown are four different types of repertoire experiments. Each experiment is classified on the basis of the type of information it provides on the antibody repertoire starting with whether full (heavy þ light chains) or single chains (heavy or light, unlinked) are studied. The information generated by the different approaches is classified into that which is readily available (yes), not available in all instances or only with difficulty (partial) or unavailable (no), see text. H, antibody heavy chain; L, antibody light chain; Ab, antibody. multiple rounds of selection can be used to not only select for binding to a particular antigen but also to select against binding to other antigens. But a disadvantage of screening clones for the antigens they bind is that one may inadvertently discard members of the same clone that have accumulated somatic mutations that have caused their specificity to drift. Discarding clones or sequence variants within clones that bind to antigens other than the antigen of interest may result in the selective loss of B cells that are multireactive. Another difficulty is that selection for antigen binders further reduces the number of clones from several hundred to at best a few dozen per mouse. Yet even at this level of sampling, it is remarkable how many fundamental insights into clonal selection have been gained from hybridoma analysis [23][24][25].
(b) Phage display DNA that encodes antibodies (typically the single chain variable fragment) can be inserted into the genome of the bacteriophage (a virus that infects bacteria) and expressed on the surface of the virion [26]. Phage particles that express antibodies that bind to the antigen of interest are selected by panning and then propagated in bacteria. Multiple rounds of panning can be used to recover bacteriophages that harbour antibodies with the specificities of interest. Thus, phage display is a powerful tool for recovering antibodies that bind to specific antigens.
Unlike hybridomas, antibody phage display is easily adapted to recover antibodies that have different H chain isotypes and sorted subsets of B cells can be used to generate phage display antibody libraries. However, current methods have not yet resulted in the engineering of phage in which the H and L chains of the antibodies produced by single B cells are associated. Thus, phage display may generate a broader repertoire of antibodies (with random H þ L pairs) than the true antibody repertoire (in which the association of specific H þ L pairs can occur).

(c) Single cell cloning
In single cell cloning, suspensions of single cells of interest (e.g. influenza-specific plasmablasts, [27]) are sorted into individual wells, RNA is extracted, and antibody H and L chain V genes are amplified. The amplified V genes are then cloned into expression vectors, transfected into cell lines and secreted antibodies are purified and analysed (further details of the protocol can be found in Tiller et al. [28]). Drawbacks of the method are that it is low throughput and arduous to go from H þ L sequences to cloning antibodies, expressing them and testing their specificities. However, recently, some groups have developed methods for moderate-to high-throughput sequencing of H þ L chain pairs from the same cell, which may substantially improve future repertoire studies with this method [29,30]. Several recent studies of the antibody repertoire have relied upon high-throughput sequencing of antibodies from populations of B cells [31][32][33][34][35]. The major advantage of this approach is that very large numbers of clones can be studied and large numbers of variant sequences can be evaluated within expanded clones. When coupled with flow cytometry, this technique can also be used to evaluate B cell repertoires in different B cell subsets [36] or in B cells that bind to a particular antigen [37]. The potential to study both antigen-selected and unselected cells from the same individual and survey up to millions of different antibody gene rearrangements provides insights into the repertoire in an unprecedented level of detail. Unlike hybridomas and antibody phage display, antigenic selection of B cells analysed by bulk sequencing is limited to a single round of selection because once the cells are selected, they are destroyed in the process of nucleic acid extraction. If antibody H chains or L chains are sequenced separately, it is impossible to recreate the H þ L pairs that are associated with single cells, although bioinformatic approaches are being used to try to match H chains with L chains based upon their frequencies and other properties [38]. Technical developments using emulsion PCR-based approaches now bring moderate-throughput next-generation sequencing of full antibody (H þ L chain pairs) from single cells within reach [39].

Identification of clones in high-throughput sequencing data
Most single chain bulk sequencing experiments focus on the antibody H chain. H chains are more diverse than L chains, providing a more reliable signature for clonal relatedness. H chains are more diverse than L chains because they have two rearrangement junctions in the CDR3 and these junctions are more diverse because the enzyme terminal deoxynucleotidyl transferase, which creates N additions, is more active during H chain rearrangement [40]. The H chain CDR3 also includes the D gene segment (which L chains lack). D genes can be read in up to six different reading frames and can occasionally undergo D-D fusion [41]. Finally, H chains also may undergo higher rates of SHM than L chains, particularly if there is concomitant peripheral L chain editing [42]. Higher mutation frequencies in H chains can make it easier to establish clonal association of sequences based upon shared mutations using H chain, rather than L chain sequences. After the identification of primer sequences that indicate which sample each read corresponds to, sequences are subjected to quality control. One can use software to trim the ends of the read based upon the Fastq score, and/or introduce Ns into sequences that have likely errors (low-quality scores) in the middle [43]. Next, reads are subjected to V, D and J identification using various tools such as the ImMunoGeneTics high V-QUEST server, Ig-BLAST or others [44][45][46][47][48]. At this point, sequences that are identical (or sufficiently similar, based upon the frequency of sequencing errors, to be considered identical) are grouped together into 'unique sequences'. Each unique sequence has an associated number of copies or times that it has occurred. In samples with large clonal expansions (for example in B cell malignancies), the top copy number clone will be many times more frequent than the next most frequent clone. Conversely, with highly diverse repertoires, such as naive IgM þ B cells in the peripheral blood [37], clones will tend to have more uniform copy number distributions and replicate sequencing of the same DNA sample will yield minimal clonal overlap. With unmutated lymphocyte populations, the number of unique sequences approximates the number of clonotypes. But with B cells that have undergone SHM, this assumption is no longer valid. Expanded B cell clones with mutations may include sequence variants that differ by several mutations from one another. Accurate clonal assignment is critical for understanding inter-and intra-clonal repertoire diversity and selection and involves complex data analysis.
A common definition of clonal relatedness is to consider all sequences with the same germline V and J identification and some similarity in the CDR3 regions as being clonally related. One of the challenges with using the CDR3 sequence for this purpose is that the germline CDR3 sequence is unknown. Not only are the junctional sequences between V and D and between D and J generated somatically, but the D gene itself is often extensively nibbled during the recombination process, making identification of all but the longest D gene segments unreliable [49]. Current methods have 50% or lower D identification rates at 10% somatic mutation frequencies [50], which is the equivalent of approximately two amino acid positions not being identical to the germline at an average CDR3 length of 14 amino acids. For these reasons, features of the CDR3 sequence, such as its length and sequence similarity, are used as indicators of clonal relatedness. CDR3 length is typically identical for clonally related sequences, although sequencing errors and naturally occurring mutations including insertions and deletions [51 -53] can cause clonally related sequences of differing lengths to be misclassified as separate clones. If VH identity is used as a criterion for clonal assignment, sequences with ambiguous VH assignments can be misclassified. VH assignment ambiguities (so-called V ties) are more likely for certain VHs, such as the VH 4 family, and occur more frequently with shorter read lengths. Another challenge with any method of clonal assignment that uses nucleotide or amino acid similarity in the CDR3 or VH or both, is that different thresholds may be required when the CDR3 length is very short and/or there is a high degree of somatic mutation in the rearrangements under study. Shorter CDR3 sequences can arise more easily by chance, leading to erroneous inclusion of multiple clones into a single clone with a short CDR3 sequence.
One further challenge when comparing many sequences to each other is deciding which sequence should be compared first. All clonal identification methods that rely upon one or more metrics of sequence similarity, such as percent sequence similarity, Hamming distances or shared somatic mutations, suffer from this problem. For example, if one randomly starts to identify a clone with the first encountered unique sequence, the 'edges' of the clone may wind up in a different place than if one uses the highest copy number sequence within the clone, which could be closer to the centre of the clone. An alternative approach is to take all the potential members of a clone, i.e. all the sequences with the same germline V gene, J gene and CDR3 length, and to create a lineage of their common mutations. While in this instance we also consider some comparisons to be 'first', i.e. we are comparing all the sequences in a clone to the putative germline root of the lineage, this decision is based on a specific set of assumptions and data. Under these assumptions of clonal relatedness, all those sequences that share some threshold level of mutation are considered to be rstb.royalsocietypublishing.org Phil. Trans. R. Soc. B 370: 20140239 members of the same clone. For example, four or fewer mutations from the germline were empirically shown to be a level of shared mutation that was possible to achieve by chance [38,49]. However, it is important to note that for different types of immune responses the threshold level of mutation that indicates common clonality may need to be changed (i.e. a higher threshold may need to be used in highly mutated repertoires, for instance in those found in people chronically infected with HIV [54]).
In figure 3, we show how different thresholds and methods suggested above can lead to differences in clone assignments. We consider 11 sequences that we cluster by their CDR3s at 85% identity into four clones (figure 3a-clone 1, red; clone 2, blue; clone 3, green and clone 4, black). In every clone, all of the sequences are 85% similar in the CDR3 to all of the others (i.e. they differ from all other members of the clone by up to two mutations in CDR3; figure 3b). If we consistently assign sequences to clones starting with the largest copy number sequence, we get a reproducible order of clones, which we can compare with the sets of clones formed from these 11 sequences by other means. Next, we consider the same sequences under less stringent demands of CDR3 identity (65%) and this leads to a different division of the sequences into two clones ( figure 3a,b). Alternatively, we consider all the mutations from the germline and use them to create a lineage. This method has the advantage that it considers the entire length of the sequence to assess similarity [49]. However, it cannot consider the CDR3 regions (as we do not know to distinguish germline positions in the CDR3). In this example, the lineages suggest three clones that match most of the clones from the 85% CDR3 method (figure 3c). Thus, we can see that decisions on which part of the sequence data to use, and how the threshold of similarity is set, change the clonal assignment of sequences. However, even with different methods, the identity of the big clones remains quite constant (figure 3). One way to potentially overcome sequence comparison ordering issues in clone identification is to empirically perform clone identification using different seed sequences, different cut-offs for similarity and/or different or iterative clustering approaches to determine if robust clonal lineages exist [38,55].

Evaluation of the clonal landscape: clone identification and tracking
Once the sequences have been collapsed into clones, the next step is to study the frequency distribution of clones in the repertoire. For example, histogram plots can be created to show the copy numbers of all of the clones in the repertoire. The copy number refers to the number of times the sequences that comprise the clone appear in the sequencing dataset. In a very diverse repertoire, as occurs in the naive B cell pool within the peripheral blood of healthy humans, the copy number distribution is fairly flat, with the highest copy number clones being minimally more frequent than the next most frequent clone. In contrast, if the repertoire contains one or more very large expanded clones, as occurs in the setting of some autoimmune diseases, such as Sjogren's syndrome [56] or systemic lupus erythematosus (SLE) following rituximab therapy [57], or B cell malignancies such as chronic lymphocytic leukaemia (CLL) [58], there will be a few or even just one highly dominant high copy number clone that is many times more frequent than the next most frequent clone. Because clonal expansion is a fundamental feature of an immune response, we are very interested in learning whether expanded clones can be identified and tracked over time in individuals with immunologic disorders such as SLE. To begin to address these questions, we need to have a reliable means of quantifying the clone copy numbers, so that different sequencing libraries can be compared with each other. Some have approached this problem by using calibrators that are amplified along with the endogenous rearrangements. The ratio of the copy numbers from the calibrators versus the cells in the sample provides a means of normalizing the copy number [59]. But in the case of IgH rearrangements, such normalization procedures may be insufficient because SHM can affect the efficiency of PCR amplification and not every single somatic variant can realistically be captured in the calibrators. Alternatively, one could start from RNA and use primers that are in the constant region to circumvent mutation. But with RNA from bulk populations, one has to correct for differences in transcript abundance in different cell types or perform amplifications on sorted or single cells. In addition to using calibrators or some other normalization technique, one can generate replicate sequencing datasets from the same sample [60]. By performing multiple replicate amplifications from independent DNA aliquots of the same sample, one can determine the maximal expected clonal overlap within a given sample, determine which clones are reliably identified in the replicates (including the reproducibility of their copy number rankings) and, with adequate sampling, perform a rarefaction analysis to empirically determine which clones can be identified reliably [61]. A final method to assess the relative extent of clonal expansion is to count the number of unique sequences in a given clone, counting every variant of mutant sequence in a clone only once. While this is not a direct measurement of all expansion as it depends on mutation, it is nonetheless quite reliable, when mutations are present.

Metrics of repertoire skewing
In addition to the relative or absolute quantification of top copy number clones, one can also study the repertoire as a whole. Global metrics of the repertoire landscape include evaluations of how 'skewed' the repertoire is. One relatively simple and commonly employed metric is VH gene usage. VH usage can be quantified by calculating the percentage of clones that use a particular VH. One can also evaluate VH usage on the basis of clone copy numbers. A comparison of the VH usage of clonotypes (where each clone is only counted once) with VH usage with total copies provides insights into the effects of high copy number clones on the repertoire. VH usage is frequently shown as a heat map, and various types of clustering algorithms can be used to compare VH usage in different sequencing libraries, similar to what investigators use for microarray data.
Another general metric of repertoire skewing is the size distribution of antibody CDR3 lengths. In a diverse repertoire, we expect the distribution of the CDR3 sizes to resemble a truncated and discretized Gaussian distribution [62]. If the size distribution of CDR3s in the repertoire exhibits kurtosis or skewing, this can be an indication of an unexpected over-abundance of clones undergoing abnormal development or perhaps reacting to a specific antigen. For instance, antibody sequences with short CDR3 lengths may derive from B1 B cells, which develop in fetal life and tend to have fewer N additions because the enzyme that creates N additions (terminal deoxynucleotidyl transferase) is less active during fetal life [40]. Conversely, human antibody sequences with long CDR3 lengths may have an increased frequency of JH6 usage (JH6 has a series of tyrosine residues at the 5 0 end that contribute towards the CDR3 length, as was observed in early stage B-cell precursors [63]). Another potential molecular mechanism for CDR3 elongation is VH replacement. In VH replacement, an upstream VH gene invades into a pre-existing VDJ rearrangement on the same allele [64,65]. VH replacement can elongate the CDR3, because the invasion occurs into a cryptic heptamer sequence that is upstream of the V-J junction. Thus, in some cases, the 3 0 end of the VH gene in the replaced allele not only contains the original V-J junction, but also a few nucleotides from the 3 0 end of the VH gene itself, the so-called VH replacement 'footprint' [66].

Distinguishing somatic mutation from sequencing errors
Another general metric of the repertoire is to evaluate the level of SHM, which can provide insights into the general composition of the repertoire. If a high fraction of clones has mutations, this indicates that many of the cells that were sequenced were memory cells and/or may have entered into a germinal centre reaction and received T cell help. The quantification of somatic mutations seems simple, but for this method to be robust, one has to control for technical errors that produce mutations. A significant challenge with bulk high-throughput sequencing approaches is that sequencing has a significant error rate. How do we distinguish somatic mutations from sequencing errors and other mistakes? Most methods for doing this with DNA-based sequencing protocols involve running samples in replicates and looking for the presence of the same sequence variants in more than one replicate. Within the same sample, one can also institute a copy number cut-off. Sequencing errors are less likely to be found in higher copy number sequences, because the same error has to recur. However, with high-depth sequencing experiments, it is surprisingly easy to regenerate the same sequencing errors. Therefore, stringent filtering of highthroughput data is important for reducing spurious mutations. Fortunately, certain types of errors are more common and can rstb.royalsocietypublishing.org Phil. Trans. R. Soc. B 370: 20140239 be filtered from reads using computational approaches, some of which are sequencing platform-specific [43,[67][68][69][70][71][72][73]. Because true somatic mutations are often shared among clonally related sequences, this is another approach that can be used to distinguish real mutations from sequencing errors.
One of the most rigorous experimental ways of addressing sequencing errors is to employ a technique known as molecular barcoding [74]. Barcoding involves using primers that contain stretches of random nucleotides to label amplifications of single molecules. Mutations that are shared in the majority of sequences with the same barcode are likely to be real, whereas mutations found in single sequences are likely to be due to sequencing error. Another promising approach to obtaining high-quality reads is a single molecule circular amplification technique. This moderate-throughput method generates very high fidelity sequences at very long reads up to several thousand nucleotides in length and recently was used to demonstrate novel splicing isoforms in human antibody heavy chain transcripts [75].

Evaluation of inter-and intra-clonal diversity:
where are the boundaries of a clone?
Intra-clonal diversity is generated by and large through SHM. The first step of mutation analysis is the identification of the closest germline source. It is important to identify which sets of sequences and mutations come from the same clonal source, otherwise when studying the mutations in a repertoire we will be mixing both independent and dependent events. For example, if a mutation occurs in five different sequences in a single clone, we can deduce that they are siblings that shared a single mutation in an ancestor, whereas if the same mutations had occurred in five different clones, we would count it as a mutation that appears multiple times in the repertoire. The analysis of unique mutations per clone has been used to detect selection pressure [23,76,77]. The unique mutations in a clone can be considered together to create a lineage of mutations that describes the process of clonal development ( figure 4). Analysing the patterns of these lineages can give insights into the diversification and selection processes that lead to clonal evolution [78,79]. Different lineage analyses have led to the calculation of the mutation rate of SHM [80] and as metrics of the specific selection pressure on a set of mutations in a given branch or at the root of the lineage [81,82].
One issue in creating and analysing clonal lineages is that we cannot be certain of our identification of mutations. Errors in identification can stem from experimental error, but also because of inherent real limitations of our knowledge. While the recombined heavy and light chains are diverse, V genes, D genes and J genes are quite similar and can generate mutants that are closer to germline genes that are not their true origin. In principle, as germline identification is errorprone, it may be better to identify sets of clonally related genes by showing that they minimize some function of cosimilarity rather than aligning every sequence first to the germline and using that characteristic to identify clones. While we cannot do this across all sequences, it may be feasible to do so to verify existing indications of clonality (much as we describe above in our lineage method for clonal identification). In this instance, the comparison with a germline gene is not used just to identify mutated positions in every sequence, but rather is also treated as a context under which we check if the differences between sequences in the clone are minimized [83]. Thus, we can estimate the efficacy of different clonal identifications in an analysed repertoire and potentially identify instances when clones are not well defined and should be further analysed or compared. A recently described algorithm for constructing lineage trees, called ImmuniTree, explicitly models SHMs, sequencing errors and iteratively evaluates internal nodes of the tree in an effort to generate robust clonal lineages [38].
There has been a heightened interest in using metrics from ecology to study the inter-clonal diversity of clones in immune repertoires [84,85]. However, at present such experiments can only reliably evaluate highly expanded clones, for instance by focusing only on the Ig G subset of the response to the hapten (4-hydroxy-3-nitrophenyl)acetyl [60]. Diversity is most commonly measured using indices such as the Shannon entropy or by using the 'true' diversity, which, when counted so as not to be biased by small or large clones, is equivalent to the exponential of the Shannon entropy [60,86]. There are also various statistical methods to ascertain the minimal clone size (in terms of sequence number or fraction of the sequenced repertoire) at which diversity can be estimated reliably. These include estimates of re-sampling likelihood ( per clone of a given size, [87]). When sampling an environment (or the antibody repertoire) one can construct an accumulation curve of the species (or clones) identified. This curve is either based on the tracking of individual sequences, when considering a single sample where each  individual sequence is acquired separately, or by sample if we consider multiple sampling experiments in which a subset of individual sequences are taken at once. In both cases, we can calculate the estimated or rarefaction curve of the order free sampling of the environment in our samples [61,88,89]. If the rarefaction curve plateaus, we can reliably estimate the diversity. Rarefaction is a better and more computationally efficient method for estimating if sampling is sufficient than performing random re-sampling by simulation [87,90], as these latter methods are simply a numerical approximation of the estimate that rarefaction calculates directly.
9. When is a clone really more than one clone?
As the number of independent sequences that are sampled increases, the chances of finding similar sequences that may arise independently increases. Similar to the parlour game where one is asked to estimate the probability of any two people in the room sharing a birthday, we can determine the probability of any two clones sharing a particular H chain rearrangement by chance. To make this calculation, we need to estimate how many different (heavy chain) CDR3 sequences can be generated. If we assume that the whole CDR3 is determined by 49 V, 27 D and 6 J genes alone, that the frequencies of V/D/J gene usage are uniformly distributed, that the same outcome cannot be achieved through multiple combinations of different Vs, Ds or Js, and that D segments can be read in six reading frames (three forward and three reverse), then the probability of having the same heavy chain is 1/49*1/6*1/(27*6). In a single experiment with 10 000 sequences, this translates to an approximately 20% probability of finding at least one instance of the same CDR3 twice by chance. However, the addition of non-templated nucleotides and exonucleolytic nibbling at the junctions between the recombining gene segments makes the probability much smaller. If there is even one amino acid not accounted for by the germline genes, the probability of encountering two different clones with the same CDR3 is reduced to approximately 1% and with two amino acids, it is further reduced to approximately 5 in 10 000. This is probably still an overestimate of how many independently generated similar clones we will find. Statistical estimates of CDR3 sharing have been described for T cell receptor (TCR) sequencing data [91][92][93]. However, it is difficult to extrapolate from T cell repertoire diversity to B cell repertoire diversity because of differences in rearrangement (such as the frequency of D-D fusion events, which occur in approx. 2% of productive TCRb rearrangements [94] but in only approx. 1/800 IgH rearrangements [95]), potential differences in the extent of clonal expansion, and differences in that only B cells undergo SHM. Estimates of BCR diversity have been made indirectly using phage display to provide high-quality DNA libraries for deep sequencing and reveal that not only the hypervariable CDR3 sequence but also somatic mutations in CDR1 and CDR2 of the V gene contribute substantially to the overall BCR repertoire diversity, which was estimated to be at least 3.5 Â 10 10 different clonotypes [96]. More recently, the frequency of shared CDR3 sequences in memory B cells from different individuals was observed to occur at a frequency of approximately one in 4000 clonotypes [74]. Most of these recurrent instances of clones were likely the result of rare recurrent recombination and not selection as they were mostly un-switched, un-mutated and had short CDR3s [74].
These estimates appear to indicate that occurrences of independently generated overlapping CDR3 sequences are quite rare, although if we consider multiple samples from multiple experiments, the number will increase. However, it is important to note two caveats to this low estimate: (i) these calculations assume full knowledge of the source of the CDR3 positions. In reality, owing to sequencing errors and the difficulty in identifying D gene associations [49], we must be satisfied with identifying all sequences that have a CDR3 that is 'close enough'. (ii) To make our calculation, we assume that all combinations are equally likely to occur (that they are uniformly distributed), but this is not true. For example, some VH genes (such as the 3 0 VHs in fetal life, [97]) rearrange more frequently than others. Moreover, repertoires from mature or memory B cells have undergone additional rounds of selection; IgH rearrangements that are positively selected will be more numerous owing to clonal expansion.
The erroneous classification of two independent clones as a single clone can also occur. The generation of clonally distinct yet highly similar antibodies may be more likely under certain circumstances. B1 B cells tend to harbour shorter CDR3 sequences, exhibit skewed VH gene usage and at least a subset of B1 cells can have low levels of SHM [98,99]. All of these properties make it easier for B1 B cells to harbour highly similar antibodies than B2 B cells. A second circumstance could arise if different L chains rearrange in different clones that actually share the same antibody H chain. This hypothetical scenario is possible because pre-B1 cells undergo multiple rounds of cell division prior to the initiation of L chain rearrangement [5]. In a clone with a favourable H chain, there could be as many as seven rounds of rearrangement, generating up to 128 pre-B cells with the same H chain. If L chain rearrangements in B cells with the same H chain gene rearrangement occur independently, then the same H chain could be found in combination with many different L chains. Consistent with this idea, studies with the E12 cell line, which can differentiate from the pro-B stage all the way to B cells in culture, demonstrate that subclones with the same VH rearrangement can have different L chain rearrangements [100]. An interesting prediction of this theory is that, independent of the H chain specificity, the number of rounds of pre-B cell division could dictate how many different L chain partners a given H chain can have. A third setting in which similar or identical H chains could be independently created occurs when there is positive selection for independent B cell clones with shared amino acid sequence motifs that bind to one or more particular antigens. Such sharing has been given a variety of names including 'public CDR3s' and 'stereotyped receptors' and has been documented in many different chronic disorders ranging from infections and autoimmunity to cancer.
A few examples are highlighted in table 2. Table 2 is not an exhaustive list of stereotyped receptors from the literature. Here, we have decided to focus on cancer and autoimmunity as these represent chronic responses to internal or modified self-antigens. However, there are other examples of sharing of antibody sequence motifs, for example in responses to repetitive polysaccharides such as the Haemophilus influenza type B polysaccharide [114,115] or to viruses such as influenza [27,55]. Some of these immune responses could be chronic as well and/or use antibodies that are positively selected for binding to some other (unknown) self-antigen. Some of the most fascinating and well-characterized shared antibody motifs occur in CLL. Up to one-third of patients with CLL have 'stereotyped' antibodies [101,102].
Stereotypy can refer to sharing of the VDJ genes of the H chain V region or VJ of the L chain, or to specific H þ L chain pairs and/or to specific amino acid motifs within Table 2. Shared sequence motifs in clonally unrelated antibodies from humans. Recurrent antibody specificities and/or structural features are summarized in this table and some examples (CLL) are discussed in more detail in the text. The examples are meant to be illustrative rather than exhaustive and represent only a small fraction of shared antibody motifs and heavy and/or light chain restriction that can be found in the literature. ADAMTS13, a distintegrin and metalloproteinase with a thrombospondin type 1 motif, member 13; B-NHL, B cell non-Hodgkin lymphoma; CLL, chronic lymphocytic leukaemia; FR1, framework region 1; HCV, hepatitis C virus; I/i antigens, red blood cell antigens (I has a linear chain of N-acetyllactosamine units); ITP, immune thrombocytopenic purpura; MALT, mucosa-associated lymphoid tissue; NAL, N-acetyllactosamine; LDL, low density lipoprotein; RBC, red blood cell; RF, rheumatoid factor; SLE, systemic lupus erythematosus; TRIM21, tripartite motif-containing protein 21; TTP, thrombotic thrombocytopenic purpura; Wa Id, Wa idiotype. DFDHW motif in CDR3 [113] rstb.royalsocietypublishing.org Phil. Trans. R. Soc. B 370: 20140239 the V, particularly but not exclusively in the CDR3 [103]. The sharing of motifs suggests that there may be common antigens that drive the disease. Consistent with this idea, three-dimensional modelling based upon publicly available structural data on H þ L chain variable region pairs from over 300 CLL patients revealed a restricted series of predicted antigen-binding sites [116]. While this suggests that the repertoire of potential antigens is limited, there is also evidence to suggest that CLL B cells with stereotypic antibodies share many features with B1-like cells, including multireactivity [117]. For example, in TCL transgenic mice (an animal model for CLL) expanded B cell clones are often polyreactive and harbour specificities for phospholipids, lipoproteins and/or polysaccharides [118]. In human CLL patients with subgroup 1 antibodies, the BCRs recognize oxidized LDL, but fail to respond fully to BCR crosslinking in vitro [119].
Other CLL antibodies recognize antigens on apoptotic cells, similar to multi-and autoreactive B cells [104]. In fact, some VH4-34 antibodies in CLL look very similar to VH4-34 antibodies in lupus: they can have a hydrophobic patch in framework region 1 that binds to N-acetyllactosamine (NAL) [110,120]. Also resembling VH-4-34 antibodies in lupus, some of the VH4-34 antibodies in CLL can have positively charged residues (arginine, lysine) in the CDRs, reminiscent of anti-DNA antibodies in lupus [121]. Intriguingly, some CLL cases have more than one functional L chain rearrangement, which could be due to failed feedback inhibition of rearrangement or to active (auto)antigen-induced receptor editing [105]. Many examples of public or stereotyped receptors are multireactive. Why is this so? One possibility is that many antibodies are multireactive, once we fully characterize their specificities. Multireactive antibodies are readily created during primary B cell maturation [63,122], and are also created by clonal selection and SHM [123]. It is also conceivable that B cells that produce multireactive antibodies can rely on several different potential mechanisms for survival, some of which may be co-opted or enhanced by chronic infections, cancer or autoimmunity. For example, in the case of autoimmune disease, multireactive B cells may respond to self-antigens expressed by dead or dying cells, contribute to inflammation and cell injury, thereby creating a feedforward loop of selfantigen, immune stimulation and further self-injury [124]. By binding to complex (auto)antigens that may simultaneously engage pattern receptors such as TLR9 and/or TLR7, multireactive B cells may be able to circumvent traditional tolerance mechanisms. Furthermore, some of the antigens that these cells bind (such as nucleic acid, or other antibodies in the case of rheumatoid factors) can form complexes that can link with a much broader diversity of antigens, permitting the multireactive B cells to present a wider array of peptides that can be used to recruit (inappropriate) T cell help [125].
But all of this begs the question of why multireactive antibodies are not only easily generated, but frequently observed also in healthy individuals with properly regulated immune systems. In health, B cells with multireactive antibodies appear to have protective as well as anti-inflammatory functions [126]. A famous example of an anti-inflammatory antibody is the T15 idiotype (encoded by the murine S107/TEPC15 V1 gene), which is an IgA antibody that binds to Pneumococcus C polysaccharide, and was shown by immunodiffusion to bind to lipids containing phosphorylcholine [127]. B cells expressing the T15 idiotype have very limited junctional diversity and can respond in a T cell-independent manner to bacterial antigens, serving a protective role in early life [128]. But T15 also binds to self-lipids, including oxidized LDL and has amino acid motifs that resemble other self-antigens associated with inflammation including C reactive protein, which is an acute phase reactant [129]. T15 can even bind to itself (it is an 'autobody' [130]), which could have autoregulatory functions and/or allow B cells to undergo BCR signalling in the absence of antigen engagement. In genetically engineered mice, the T15 antibody is anti-inflammatory, promoting apoptotic clearance, reducing atherosclerosis and suppressing TLR responses [126,131,132].
In addition to forming at high frequencies in the primary repertoire, where they are sometimes termed 'natural' autoantibodies, multireactive antibodies can also be created and selected for by SHM. For example, the well-known 3H9 antibody binds not only to DNA, but also to cardiolipin [23] and dioleoyl phosphatidylserine [133]. Phosphatidylserine is normally confined to the inner leaflet of the cytoplasmic membrane, but becomes expressed on the cell surface in apoptotic cells, where it is recognized by phagocytic cells [134]. Increased binding to phosphatidylserine was observed in sequence variants of 3H9 that had arginine residues that had been shown previously to enhance DNA binding, suggesting that DNA binding and phospholipid binding are co-selected [133]. When dysregulated in autoimmunity, multireactive B cells may be very hard to eliminate because of the selected functions of their healthy 'natural' counterparts in apoptotic cell clearance, reduction of inflammation and protective immunity. Multireactive B cells may also benefit from several different means of survival, through interaction with various self-and non-self-ligands and through different activation pathways that may involve or circumvent T cell help.

Markers of clonal evolution/distinctive selection under conditions of autoimmunity
We have described herein how diversification of the genes encoding the BCR can be used for clonal definition and can be analysed to make observations about clonal development.
Here, we would like to discuss the limitations of clonal thinking in terms of its impact on repertoire change and functional specificity. By segregating sequences based upon their history of diversification, it is implied that they share some function or purpose. This assertion need not be true, however. When we measure a repertoire of mutant B cells, we are querying the history of cells that have already survived selection. When thinking of a prototypical example of how a clone develops, we would expect all surviving mutants of the same clone to share a set of common mutations at their origin which would later diverge with a few individual mutations for each mutant (figure 4a). This divergence occurs either because not enough time has passed to select only one outcome or because, following these initial key mutations, survival is not dependent on a gain of affinity, but only on maintaining it, and so a wide range of neutral mutations can sustain it [135]. However, this is not the only possible scenario of how a clone can diverge by mutation. Other options lead to outcomes where the common function of members of the expanded clone is less clear. The first such example arises when there are two sub branches of a clone that share the same germline source but no common mutations (figure 4b). In most cases, we rstb.royalsocietypublishing.org Phil. Trans. R. Soc. B 370: 20140239 would consider this to indicate that there are actually two clones that arose independently and were mistakenly identified as being one. In our final example (figure 4c), we see a clone in which all members share a few origin mutations but then diverge extremely. In this instance, the divergence is exemplified by ten mutations shared by all of the sequences in the left subclone. However, the divergence could be a deletion found in only a subset of sequences in a clone, or a different pattern of mutation types in different regions of the lineage. These differences could be linked to different clonal outcomes, such as cancer or autoimmunity, in which cells either expand or persist without sufficient constraints.
While this last example is extreme, it probably exemplifies some basic characteristic of most large expanded clones in immune repertoires. As we can now observe with highthroughput sequencing, clones can be widely dispersed. Even if they share substantial mutation backgrounds, their end branches can be quite divergent. We would suggest that this is linked to a functional divergence. When we consider how clones expand and improve their affinity with time we can consider two scenarios: (i) each clone has a specific affinity or is multireactive (figure 5, yellow-and pink-versus blueshaded lineages), and mutation and selection improve this affinity or (ii) all clones start being multireactive but have a weak affinity of interaction for one or more specific antigens, and this specific interaction is selected for over time, making the antibody more specific and potentially less multireactive.
As with theories of evolution, it may be naive to think of affinity maturation as being simply a matter of 'survival of the fittest'. Only clones and mutants that have survived previous rounds of central and peripheral selection will participate in further rounds of affinity maturation and so we can predict that the shape space they occupy should be relatively flat, i.e. there is some baseline level of responsiveness that is robust to mutation [136]. How does this translate to antigen receptor repertoire selection? It has been suggested that to avoid negative selection, TCRs should be comprised many weak interacting amino acids that bind to strongly interacting amino acids in the antigen, creating a set of moderate strength interactions that are needed to create a strong overall avidity of binding [137]. Because even the difference of one amino acid in the antigen can degrade this type of binding, it leads to high specificity. In B cells, however, the receptor-antigen interactions are more complex, making it hard if not impossible to predict exact amino acid interactions in the BCR. Yet, in spite of this, the overarching need to avoid negative selection and maintain a repertoire that can change without mutating into an unusable form remains. Thus, although we cannot predict exactly how mutations will change the contact positions of the antibody [138], we can speculate that many successful clones that persist over time may at some point in their lifespan be multireactive.
Ethics. No primary data were generated for this study. There were no human or animal study subjects.
Data accessibility. This is a review paper and does not contain primary data. Authors' contributions. U.H. and E.L.P. wrote the manuscript together, worked on several drafts and revisions, conceived of the original ideas in the review and both authors approve of the final version to be published.  Figure 5. The common affinity of a clone: somatic mutation and selection can be considered as a movement on genotype space that leads to reciprocal changes in a related shape space. Thus, through mutation members of a clone traverse regions of different antigen specificity and affinity, from regions of more multireactive binding (light blue) to areas of greater specificity to particular antigens ( pink or yellow). We can envision two scenarios of clonal evolution. (a) Clones start and continue in the same vein either improving their affinity to specific antigens (more opaque pink nodes on right) or maintaining some level of general (low) cross-reactivity (left lineage with blue nodes). (b) A clone that shifts it affinities and specificity over its lifetime.