Proceedings of the Royal Society B: Biological Sciences
You have accessResearch articles

A genotype network reveals homoplastic cycles of convergent evolution in influenza A (H3N2) haemagglutinin

Andreas Wagner

Andreas Wagner

Institute of Evolutionary Biology and Environmental Sciences, University of Zurich, Building Y27, Winterthurerstrasse 190, Zurich 8057, Switzerland

The Swiss Institute of Bioinformatics, Quartier Sorge, Batiment Genopode, Lausanne 1015, Switzerland

The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

[email protected]

Google Scholar

Find this author on PubMed

    Abstract

    Networks of evolving genotypes can be constructed from the worldwide time-resolved genotyping of pathogens like influenza viruses. Such genotype networks are graphs where neighbouring vertices (viral strains) differ in a single nucleotide or amino acid. A rich trove of network analysis methods can help understand the evolutionary dynamics reflected in the structure of these networks. Here, I analyse a genotype network comprising hundreds of influenza A (H3N2) haemagglutinin genes. The network is rife with cycles that reflect non-random parallel or convergent (homoplastic) evolution. These cycles also show patterns of sequence change characteristic for strong and local evolutionary constraints, positive selection and mutation-limited evolution. Such cycles would not be visible on a phylogenetic tree, illustrating that genotype network analysis can complement phylogenetic analyses. The network also shows a distinct modular or community structure that reflects temporal more than spatial proximity of viral strains, where lowly connected bridge strains connect different modules. These and other organizational patterns illustrate that genotype networks can help us study evolution in action at an unprecedented level of resolution.

    1. Introduction

    The human influenza A virus causes up to half a million deaths annually and infects more than 5% of the world's population [1,2]. The virus evades the human immune system through antigenic change, requiring costly regular updates of influenza vaccines [3,4]. A major target of the immune response is the viral haemagglutinin (HA) protein, a surface glycoprotein that enables the virus to bind and enter host cells via sialic acid residues on cell surface receptors [5]. HA is a homotrimeric membrane protein whose monomers form three globular heads, as well as a helical coil that resides in the membrane [6]. The globular heads contain epitopes—those parts of a foreign molecule recognized by the immune system—which can be bound by antibodies that prevent HA's interaction with host cells [7]. Influenza A subtypes are classified according to the their variants of HA and neuraminidase, a viral protein that is important to release viruses from the cell surface. The H3N2 subtype (HA subtype 3, neuraminidase subtype 2) dominates the seasonal flu that recurs annually in temperate regions. Intense monitoring of influenza epidemiology and evolution [1] make this virus ideal for novel, data-intensive approaches to understand the evolutionary dynamics of pathogens, such as the framework of genotype networks.

    Genotype networks are graphs whose nodes are genotypes with the same broadly defined phenotype. This phenotype could be as coarse-grained as ‘being viable’, or as fine-grained as the enzymatic activity or catalytic site conformation of a protein. Two genotypes are neighbours and connected by an edge in such a network if they differ minimally, e.g. in a single nucleotide or amino acid. Genotype networks and their structure can shed new light on many long-standing problems in evolutionary biology, such as how new evolutionary adaptations originate [8]. Thus far, genotype networks have mainly been characterized in systems where phenotypes need to be predicted computationally from genotypes [912]. High-throughput genotyping technologies can alleviate this limitation and help build genotype networks from experimental data, by characterizing many closely related genotypes and their phenotypes [1317]. Casting such data in the form of a network immediately makes many analytical tools from graph theory available [18]. Their use has contributed to fields as different as ecology, systems biology and the social sciences [1925]. For example, they can help identify ‘modules’ of cooperating molecules, interactions in ecological networks that affect their stability, or network properties that can influence the spreading of traits, such as innovations or diseases [19,24,26].

    The conventional way to study evolutionary processes with genotypic data (but see [27,28]) is to construct phylogenetic trees that reflect the evolutionary relationships among genes, individuals or species [29]. Different genotypes are leaf nodes on such a tree. Internal (non-leaf) nodes correspond to usually extinct ancestors. Modern phylogenetic methods permit a probabilistic reconstruction of such ancestors, in the sense that they can compute a probability that a given genotype was the true common ancestor of the actually observed leaf nodes. They allow one to choose the most likely such genotype as the best candidate common ancestor [29,30]. In both phylogenetic trees and genotype networks, the distance between two sequences reflects their evolutionary relatedness, but genotype networks differ from phylogenetic trees in at least two respects. First, when constructed from a population sample, they do not require reconstruction of ancestors, but contain these ancestors as their internal nodes. Second, as opposed to trees—which are by definition acyclic graphs—they can contain cycles, paths of edges that return to their origin, which can indicate unusual patterns of evolution.

    I here construct a genotype network from a large dataset of viable influenza genotypes to illustrate some of its applications to evolutionary genetics. Specifically, I construct a genotype network based on more than 1000 influenza A HA sequences isolated from H3N2 viruses between 2002 and 2007 [31] to illustrate how graph-theoretical concepts and methods can shed light on the evolution of this pathogen.

    2. Results

    (a) Structure of the protein genotype network

    The protein-based genotype network is a graph whose nodes are HA protein sequences from different viral strains. Edges connect two sequences if they differ in a single amino acid. The network based on the dataset used here [31] has 1565 sequences, 33763 edges, and is organized into 284 components—subgraphs where any two sequences can be connected via a path of edges (electronic supplementary material, figure S1). After removal of all but one identical sequences from this network, the non-redundant network still contains 742 sequences, with 539 edges between them, which are partitioned among 265 connected components of various sizes (figure 1a). Most components (238) consist of a single isolated sequence, 27 components contain two or more sequences and the single largest ‘giant’ component [33] comprises 246 sequences connected by 285 edges. Figure 1b shows this largest component. The distribution of sequence degrees—a sequence's number of neighbours—is highly skewed towards sequences with few neighbours (figure 1c). The network is disassortative [34], meaning that sequences with high degree tend to be neighbours of sequences with low degree (electronic supplementary material, figure S2). It has a highly modular or community structure (electronic supplementary material, figure S3), where a strain's membership in a module is most significantly associated with its year of isolation (electronic supplementary material, Supplementary Results and figure S4). An analogous network, based on DNA sequences rather than protein sequences, is highly fragmented and thus less informative (electronic supplementary material, figure S5).

    Figure 1.

    Figure 1. Protein genotype network without redundant sequences. (a) Component size distribution of the HA protein genotype network, where groups of identical amino acid sequences are represented by only one member. (b) The largest connected component, where circles correspond to sequences, and edges connect sequences that differ in a single amino acid. The component's layout is computed by a force-directed algorithm [32]. Larger circles and darker hues of blue correspond to sequences with higher degree (more neighbours). (c) Degree distribution of the entire genotype network. The left-most bar in the histogram includes isolated nodes with no neighbours as well as nodes with one neighbour. Note the many low-degree nodes.

    (b) Abundant cycles in the genotype network cannot be explained by random homoplasy

    Although cycles cannot occur in a phylogenetic tree by definition [29], they can occur in genotype networks. Such cycles can reflect constrained evolution and in particular homoplasy, parallel or convergent evolution, where two sequences do not diverge or even become more similar over time. The HA protein genotype network contains remarkably many cycles. Specifically, among the 504 sequences in the network that are not isolated, 122 (24%) form part of a cycle. All of those cycles are contained in the six largest components, in which 28.6% of sequences form part of a cycle. In the largest component itself, 79 of 246 sequences (32.1%) are contained in cycles. To better visualize the extent of cycles, one can display the so-called 2-core of the genotype network (figure 2a), defined as the largest subgraph, where every sequence has at least two neighbours. All 122 sequences of this 2-core form part of a cycle. The 2-core thus reveals the extended cyclic structure of the protein genotype network.

    Figure 2.

    Figure 2. The protein genotype network contains many cycles. (a) The 2-core of the protein genotype's largest component shown in figure 1. Apparent differences between the local topologies of the two graphs are a consequence of the embedding algorithm [32]. The 2-core of the entire genotype network contains all 122 nodes that are part of a cycle. Larger circles and darker hues of blue correspond to sequences with higher degree (more neighbours). (b) Venn diagram showing the number of nodes contained in triangles (48 = 28 + 20 nodes, left ellipse), in squares (94 = 20 + 74 nodes) and in both (20 nodes).These numbers refer to the entire protein genotype network, not just the largest component. (c) An example of a triangle in the protein genotype network. Circles correspond to HA sequences from strains whose names are shown above each circle. Edges are labelled with the amino acid difference between two neighbouring strains, e.g. K222I indicates that strain Nagoya/26/2006 contains lysine (K) at position 222, whereas strain Osaka/4/2006 contains isoleucine (I) at that position of the HA amino acid sequence. (d) A square in the genotype network. Note that amino acids at two positions (159 and 226) change in this square. The amino acid differences are read from the left node to the right node in each sequence. (e) The topology of a subtree of the maximum-likelihood HA phylogeny from the electronic supplementary material, figure S13, containing the four different influenza isolates from (d). Each tree branch connecting two nodes corresponds to exactly a single amino acid change (dashed arrow). The dashed line at the root would connect this clade to the much larger HA sequence tree.

    Detailed examination shows that all cycles in the network are decomposable into triangles and squares (electronic supplementary material, tables S1, S2). First, the network contains 24 triangles that involve only 48 sequences, implying that many triangles share sequences and edges (see also figure 2a). Second, the network contains 40 squares that involve a total of 94 sequences which are shared among squares. Twenty sequences are shared between triangles and squares, implying that 28 sequences in a cycle occur only in a triangle and 74 sequences in a cycle occur only in a square (figure 2b). The network contains many longer cycles of five or more edges, but none of them is an elementary cycle (electronic supplementary material, figure S6)—all of them can be decomposed into triangles and squares.

    I next developed an algorithm (see Material and methods) to ask whether the cycles in the genotype network could have arisen by chance alone, i.e. from the limited amount of homoplasy to be expected in independently evolving sequences. With this algorithm, I created 1000 random genotype networks, and counted the number of sequences involved in cycles in them. Among 1000 random genotype networks of the same size as the largest component of the HA genotype network, 999 contained no cycle at all, one contained a triangle and none contained a square. Not a single network contained more than a triangle or square. Similarly, 1000 random networks of the same size as the second and third largest component did not contain a single cycle. The number of cycles observed in the HA genotype network cannot be explained by chance alone.

    (c) Global evolutionary constraints cannot explain abundant cycles

    Because the HA gene is important to the viral life cycle, its evolution is constrained, i.e. not every amino acid can be substituted for any other amino acid. To find out whether such constraints could explain the extent of cycles in the genotype network, I first quantified these constraints. Among the 329 amino acids in the protein coding sequence, 156 vary in this dataset. On average, each of these 156 sites is involved in 3.5 change events (edges), but with a distribution that is broad and ranges from one edge to 16 edges (electronic supplementary material, figure S18a). The total number of different amino acids that occurs at each variable position ranges from two to four (electronic supplementary material, figure S18b), with a mean of 2.67. Positions that are involved in more change events also harbour significantly more amino acids (Spearman's R = 0.74, n = 156; p < 10−17).

    Next, I created random genotype networks that reflect these constraints in a conservative way. That is, they are based on sequences with 156 variable positions, but I assumed that each position can only admit two different amino acids. The distribution of the number of sequences that form cycles in 1000 such random networks is shown in the electronic supplementary material, figure S18c (note the logarithmic vertical scale). A total of 905 of these networks contain no cycle at all, and the maximum number of sequences in cycles is 12 for any of the networks. Not a single random network contains 79 sequences in cycles, as does the largest component of the actual network. These observations suggest that the abundance of cycles in the genotype network cannot be explained from global constraints on sequence evolution.

    (d) Squares and triangles reflect two different kinds of constrained evolution

    All triangles involve change at only a single amino acid site (electronic supplementary material, table S1), as in the example of the triangle between strains Miyazaki/39/2005, Nagoya/26/2006 and Osaka/4/2006 (figure 2c), which differ only in position 222. If the Miyazaki strain is the ancestor of the other two strains, then the arginine (R) at its position 222 gave rise to a lysine (K) in the Nagoya strain and an isoleucine (I) in the Osaka strain. Analogous scenarios hold if either the Nagoya or the Osaka strain are ancestral. This triangular pattern of change reflects strongly constrained sequence evolution: out of all possible positions that could have changed in the common ancestor, only one did change.

    Figure 2d illustrates that squares also reflect a pattern of constrained evolution, but of a different kind that leads to temporary sequence convergence. If Mae Hong Son/33/2003 is the most recent common ancestor of its two neighbouring strains, then it underwent a valine to isoleucine substitution at position 226 (I226V), which created strain Singapore/95/2003, as well as a tyrosine to phenylalanine substitution at position 159 (Y159F), which produced strain Ukraine/UA-2003918069/2003. Thus, unlike in a triangle, two different positions changed in the ancestor. Then either the Singapore strain underwent the same Y159Y change involved in creating the Lipetsk strain or the Ukraine strain underwent a I226V change (or both kinds of changes occurred). Regardless of this order, and regardless of which strain is the common ancestor of the others, the characteristic pattern is that its descendants temporarily become more similar to one another, such as the Lipetsk strain which differs in only one amino acid from the Ukraine strain, whereas its Singaporean ancestor differs in two amino acids from the Ukraine strain. This temporary sequence convergence characterized by the same two substitutions on opposing edges of a square (figure 2d) exists for all but one square (electronic supplementary material, table S2). This only exception is a square where sequence change occurred at only one site (electronic supplementary material, figure S7), but this square is really just a composite of two triangles. I note that a square cannot involve substitutions at more than two positions, because that would make cycle-closure after four edges impossible. Figure 2e illustrates how the group of four strains from figure 2d would appear in a maximum-likelihood phylogenetic tree of the HA sequences considered here (electronic supplementary material, figure S13). The four-strain subtree correctly captures the single amino acid change that separates the pairs of isolates Lipetsk-Ukraine, Lipetsk-Singapore and Singapore-Mae Hong Son, but it incorrectly suggests that Lipetsk-Mae Hong Son are two amino acid changes apart, whereas they actually differ in only one amino acid. Other instances of squares would show the same misleading phylogenetic topology.

    (e) Cycles are enriched with changes in epitopes

    HA sequences are subjected to positive selection of beneficial mutations [3539] that frequently occur in antibody-binding epitopes whose mutational change helps a virus evade the host's immune response [7,40]. Such changes are also involved in the HA genotype network, because amino acid change in this network is almost twice as likely to occur in an epitope than outside an epitope (electronic supplementary material, figure S8; p < 4 × 10−3). More importantly, epitope-affecting changes are especially prevalent in cycles. That is, significantly more amino acid changes in cycles affect epitopes than outside cycles. This difference also holds for triangles and squares when they are considered separately (electronic supplementary material, Supplementary Results).

    (f) Tolerable or beneficial mutations are especially rare in cycles

    A frequently used indicator of positive selection in protein-coding DNA is the ratio dN/dS of non-synonymous changes dN per non-synonymous site and synonymous changes dS per synonymous site [3739,4145]. Using it to identify whether individual amino acid changes have been beneficial is not straightforward. First, only in highly divergent sequences does a ratio dN/dS > 1 indicate positive selection [37,43]. In sequences with low divergences like those considered here, positive selection can be at work even if dN/dS is significantly smaller than one. Second, neighbouring sequences in the genotype network differ in only a single amino acid and often show no synonymous change (electronic supplementary material, figure S9), such that this ratio cannot even be determined for many individual amino acid changes. These considerations show that any analysis of dN/dS needs to focus on groups of edges in this genotype network. If one considers all amino acid changes in the network, one finds more than 100 edges (117 of 539, 21.7%) in the genotype network where an amino acid change occurred without any synonymous change, suggesting that positive selection occurred at least in some of the sequences considered here.

    In closely related sequences, the relative incidence of synonymous changes to amino acid changes can be useful to indicate the average time between occurrences of tolerable or beneficial amino acid changes. Many synonymous changes per amino acid change indicates a long waiting time. Under neutral evolution, the expected number of synonymous changes per amino acid change in the dataset considered here is approximately 0.36 (±0.02 s.e.m; see Material and methods). The mean number of synonymous changes per amino acid change in the whole network is much higher at 2.12 (±0.08 s.e.m, n = 539). What is more, this number is even higher for edges in cycles (2.51 ± 0.16 s.e.m, n = 180), and it is lower for all edges outside cycles (1.93 ± 0.09 s.e.m, n = 359), a difference that is significant (p = 0.0089, Mann–Whitney U-test, electronic supplementary material, figure S10a). Even in the edge with the lowest synonymous distance within any one cycle, this distance is significantly greater than that of comparable edges outside cycles (electronic supplementary material, figure S10). In summary, tolerable or beneficial mutations are rarer in cycles than in the rest of the network. An additional pertinent observation is that cycles are enriched with codons experiencing amino acid change through double or triple nucleotide change, and the HA sequences in which these codons reside also show greater synonymous divergence (electronic supplementary material, figures S11, S12).

    (g) Highly central strains and bridge strains reflect the trunk-like genealogy of haemagglutinin sequences

    Phylogenetic trees of evolving HA sequences have a trunk-like structure: relatively few sequences in the trunk propagate the lineage further, whereas sequences in side branches are evolutionary dead-ends [39,4648]. This topology is responsible for a disassortative network organization (electronic supplementary material, figure S2). One can quantify the ‘trunkness’ or centrality of a sequence through the graph-theoretical concept of a node's betweenness centrality B—the number of shortest paths connecting pairs of nodes that pass through that node. Figure 3a shows the largest component of the HA genotype network where the most central strains (electronic supplementary material, table S3) are highlighted. The most central strain is Taiwan/TW-1554/2004, with B = 20020 shortest paths passing through it, followed by Okayama/15/2005 (B = 17 528 paths) and Osaka/18/2006 (B = 15 213 paths). These strains form part of the trunk of the HA phylogeny, giving rise to many descendants in the HA phylogenetic tree (electronic supplementary material, figure S13).

    Figure 3.

    Figure 3. Central sequences in the genotype network include some bridge sequences with few neighbours. (a) The largest connected component of the genotype network, where circles correspond to sequences, and edges connect sequences that differ in a single amino acid. The topology of this network is the same as that of figure 1b, apparent differences being a consequence of the embedding algorithm [32]. Circles in purple correspond to those 11 sequences in the largest component of the protein genotype network whose betweenness centrality exceeds 2500, i.e. those sequences through which the most shortest paths between other sequences must pass. The component's layout is computed by a force-directed algorithm [32]. (b) Scatterplot of betweenness centrality B for strains where B > 0 (horizontal axis) versus their number of neighbours (vertical axis). Note the logarithmic scale on each axis. A few strains have high betweenness centrality but relatively few neighbours. The names of three such bridge strains are written immediately underneath the three circles representing their centrality and degree. (c) Mean (±s.e.m.) of betweenness centrality for sequences that are part of a cycle (left) and not part of a cycle (right). (d) Mean (±s.e.m.) of betweenness centrality for sequences that are part of a square (left) and not part of a cycle (right). The differences are highly significant in both cases (p < 10−17; Mann–Whitney U-test), showing that sequences in squares tend to be central.

    The more surviving descendants a virus leaves—the higher its number of neighbours in a genotype network—the greater should be the likelihood that it is a trunk strain, because chances are greater that one of its descendants propagates the lineage further. The quantitative analysis of figure 3b supports this assertion by showing that strains with many neighbours are significantly more central (Spearman's r = 0.91, p < 10−17, n = 246). But more remarkable than this rule are its exceptions, because several central strains have few immediate neighbours. These include the strains Osaka/18/2006, ranked third in terms of centrality, but having only six neighbours, as well as Lipetsk/15/2004 (rank 5, eight neighbours), Hong-Kong/2982/2004 (rank 9, two neighbours), Perth/20/2005 (rank 10, three neighbours). Because such unusual strains visibly link major clusters of sequences, I refer to them as bridge strains (see also the electronic supplementary material, table S3). That all these bridge strains are mere artefacts of undersampling in time or space is possible, but made less likely by the observation that: (i) sampling in time for the sequences shown in figure 3a is quite even, ranging from 113 isolates in 2002 to 173 isolates in 2006; and (ii) three of the four bridge strains above come from the top four (out of 53) sampled countries Japan, China and Australia. That such strains bear some biological significance is also supported by the observation that highly central and bridge strains (B > 2500 and degree ≤ 10) are significantly enriched in cycles and, more specifically, squares (figure 3c,d; p < 10−17; Mann–Whitney U-test). Such enrichment does not exist for triangles (p > 0.29). Central and bridge strains thus show an elevated incidence of convergent evolution (see electronic supplementary material, supplementary results for their possible biological significance).

    3. Discussion

    The HA genotype network differs in a major respect from phylogenetic trees—those acyclic graphs used to describe groups of sequences related by descent—because cycles permeate this network. For example, the diameter of the subgraph formed by those strains that are part of a cycle (figure 2a) equals nine edges, only one fewer than the whole network's diameter of 10 edges, which implies that one can traverse almost the entire network along cycles. Most of these cycles are squares, which reflects an extent of sequence homoplasy that can neither be explained by chance alone nor by global constraints on sequence evolution (electronic supplementary material, figure S18). All observed cycles are very short and involve change at only one or two sites, which suggests that sequences in a cycle can experience very few tolerable or beneficial amino acid changes. This is further underscored by the observation that some amino acid sites are involved in multiple homoplastic cycles, such as site 50, which is involved in seven different squares (electronic supplementary material, table S2). The limited overlap between sites that undergo homoplastic change here and in previous studies further highlight this sequence context specificity [39,41,42]. For example, among the 25 sites reported by Kryazhimskiy et al. [41] to have undergone directional evolution, a form of homoplasy, only 13 are implicated in such change here, and among 11 sites reported by Wolf et al. [39], only two are implicated in this dataset. (All of these common sites are part of an epitope.) More generally, the genotype network approach reveals the extent of homoplasy to be more extreme than hitherto realized. Previous analyses indicated homoplastic changes between leaf sequences on different branches of a HA phylogenetic tree, usually separated by multiple further additional amino acid changes [39,41,42], but convergent changes in this dataset occur in the same square, only one amino acid change apart. I note that the amino acid changes that occur in squares are suggestive of epistatic interactions [13,42,4952], which demonstrably exist in influenza HA evolution [42] and can be a source of homoplasy.

    Is the observed homoplasy only caused by extremely few tolerable amino acid changes, or do positive selection and beneficial mutations contribute to it? Positive selection is pervasive in HA evolution [3539,41,44,53]. Unfortunately, one prominent criterion to detect it, a ratio dN/dS > 1 is only valid for sequences much more distantly related than those considered here [43], whose pairwise nucleotide divergence of 3.22 × 10−3 is similar to that within a single influenza outbreak (3.4 × 10−3) [54]. For such lowly diverged sequences that coexist in the same population, the ratio dN/dS can be significantly lower than one, yet positive selection may be rampant [43]. However, at least some of the homoplastic changes observed here are likely to be beneficial, because, first, 32 of 117 edges (27.3%) that involve no synonymous change at all occur in cycles. Second and more importantly, amino acid changes in cycles are significantly more likely to affect epitopes than changes outside cycles (electronic supplementary material, Supplementary Results). Given past observations on the association of beneficial changes with epitopes [37,39,41,53], this observation is not consistent with the notion that homoplasy occurs only because of increased selective constraints.

    In principle, the rate of antigenic evolution in influenza could be limited by the rarity of mutations that cause antigenic change [31,39,47], or by immune-mediated selection in the host. In the latter case, only a limited number of antigenic types exist, which circulate in a population. Temporally changing patterns of host immunity can then cause different types to rise to prominence over time [55,56]. This data can speak the question whether mutation limitation contributes to HA evolution, because the ratio dN/dS can help estimate the waiting times between successive amino acid changes: if synonymous changes occur according to a molecular clock, as is the case for influenza [57], then a higher number of synonymous changes per amino acid change imply a longer waiting time. It is of particular interest to study changes in cycles, because such changes are significantly associated with epitopes (electronic supplementary material, Supplementary Results). Based on the significantly higher average number of synonymous changes inside than outside cycles (electronic supplementary material, figure S10, 2.51 versus 1.93, or 2.54 × 10−3 and 1.96 × 10−3 per nucleotide site), and given an estimated 3.4 (±1.1 × 10−3) [57] silent changes per site per year, the expected waiting time between two amino acid changes is 273 days inside cycles and 210 days outside cycles. Moreover, the HA sequences with non-synonymous codons that carry double nucleotide changes have a higher number of 3.17 (±0.38 s.e.m.) synonymous changes than HA sequences where these codons experienced only a single nucleotide change (2.05 ± 0.08 s.e.m.), a difference that is significant (p = 0.002; Mann–Whitney U-test) and that amounts to an increase in the expected waiting time of 54% (from 223 to 344 days). These observations suggest that mutation limitation plays some role in HA evolution. The episodic occurrence of beneficial mutations [39,47,58] and results from recent epidemiological modelling [47,48] also support this notion.

    Genotype networks and phylogenetic trees are graphs with complementary strengths for the analysis of evolutionary data. First, phylogenetic analysis uses sophisticated algorithms [29] to infer a tree's structure from data on its leaf nodes, whereas the structure of a genotype network immediately follows from the raw data. Second, in a tree the genotype of interior (ancestral) nodes need to be inferred probabilistically [29,30], whereas in a genotype network these nodes are plainly visible. Third, a phylogenetic tree has inherent ancestor-descendant directionality, which would need to be inferred in a genotype network. Fourth and conversely, although phylogenetic analysis can detect homoplasy [39,41,59], homoplasy is a confounding factor in tree reconstruction and not readily visible from a tree itself, whereas cycles make homoplasy plainly visible in a genotype network. Fifth and finally, trees are well suited to study evolutionary relationships of sequences with arbitrarily high divergence, whereas genotype networks would be highly fragmented and thus of limited use for such sequences. But once abundant data on closely related sequences are available, genotype networks become highly useful tools to understand evolutionary processes at fine-grained temporal resolution.

    4. Material and methods

    I obtained DNA sequences and the amino acid sequences they encode for 1565 influenza A HA genes from the National Center for Biotechnology Information (NCBI; www.ncbi.nlm.nih.gov), using database accession numbers published in table S1 of [31]. These sequences come from influenza A (H3N2) isolates obtained worldwide between 2002 and 2007. Their antigenic properties, identified through HA inhibition assays, were reported and analysed in [31]. I aligned DNA and amino acid sequences with the program MUSCLE [60] using default parameters. The resulting alignments contained no gaps. To build a genotype network of amino acid sequences from this alignment, I defined a graph (V,E) whose set of nodes V comprises all genotypes (amino acid sequences), and where two nodes v and w are connected by an edge (v,w) in the set of edges E if they differ in exactly one amino acid. A cycle is a sequence of edges in a graph that returns to the sequence's starting node, but does not visit any node or edge more than once. I computed the number of nodes that are part of a cycle, and performed all other graph computations and visualizations with the aid of the perl package GRAPH (v. 0.94; www.cpan.org), as well as with gephi (v. 0.8.2) [61]. I associated each HA sequence in the network with information about its year and country of isolation, which I extracted from the annotated sequence files in table S1 of [31], and analysed categorical data with the aid of the R package ‘vcd’. Methods are described in greater detail in the electronic supplementary material, Supplementary Methods.

    Acknowledgements

    I thank members of the A*star Bioinformatics in Singapore, and especially Dr Sebastian Maurer-Stroh, for valuable discussions during a sabbatical visit in Singapore.

    Data accessibility

    This work uses only publicly accessible data (see Material and methods).

    Funding statement

    This work has been partially supported through Swiss National Science Foundation grant 315230-129708, as well as by the URPP Evolutionary Biology at UZH.

    Footnotes