Proceedings of the Royal Society B: Biological Sciences
You have accessResearch articles

Predicting the short-term success of human influenza virus variants with machine learning

Maryam Hayati

Maryam Hayati

Department of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6

Google Scholar

Find this author on PubMed

,
Priscila Biller

Priscila Biller

Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6

Google Scholar

Find this author on PubMed

and
Caroline Colijn

Caroline Colijn

Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada V5A 1S6

Department of Mathematics, Imperial College London, London SW7 2BU, UK

[email protected]

Google Scholar

Find this author on PubMed

    Abstract

    Seasonal influenza viruses are constantly changing and produce a different set of circulating strains each season. Small genetic changes can accumulate over time and result in antigenically different viruses; this may prevent the body’s immune system from recognizing those viruses. Due to rapid mutations, in particular, in the haemagglutinin (HA) gene, seasonal influenza vaccines must be updated frequently. This requires choosing strains to include in the updates to maximize the vaccines’ benefits, according to estimates of which strains will be circulating in upcoming seasons. This is a challenging prediction task. In this paper, we use longitudinally sampled phylogenetic trees based on HA sequences from human influenza viruses, together with counts of epitope site polymorphisms in HA, to predict which influenza virus strains are likely to be successful. We extract small groups of taxa (subtrees) and use a suite of features of these subtrees as key inputs to the machine learning tools. Using a range of training and testing strategies, including training on H3N2 and testing on H1N1, we find that successful prediction of future expansion of small subtrees is possible from these data, with accuracies of 0.71–0.85 and a classifier ‘area under the curve’ 0.75–0.9.

    1. Introduction

    Human influenza A virus remains a substantial global public health challenge, causing considerable illness and mortality despite the availability of effective vaccines. Influenza viruses are categorized according to features of two surface glycoproteins, haemagglutinin (HA) and neuraminidase (NA), with types such as H3N2 and H1N1 indicating the variant of HA and NA characterizing the strain. Influenza viruses are prone to variability, both in the form of the so-called antigenic drift, and in the form of reassortment. Reassortments can give rise to new variants with distinct antigenic properties compared to previous strains; resulting pandemic influenza virus strains may be highly pathogenic. In contrast to pandemic strains arising from reassortment, seasonal influenza virus primarily arises through genetic drift, as influenza virus has a high propensity for generating antigenic variation. This allows influenza viruses to evade host population immunity built up through previous exposure. As a consequence, seasonal influenza virus vaccines need to be regularly updated.

    Influenza virus vaccines typically focus on preventing infection by raising antibodies specific to the HA protein. In order to update a seasonal influenza virus vaccine, currently circulating strains must be selected for inclusion. This relies on surveillance and sequencing of circulating influenza virus genotypes, and on measured antigenic properties of circulating strains. These data do not, in and of themselves, describe future circulating strains, and sometimes the strain selection process does not reflect the near-future composition of the influenza viruses well enough to achieve the desired reductions in illness and mortality. Predictive models are now being used in conjunction with sequencing and immunological surveillance in order to improve the strain selection process.

    Phylogenetic trees encode patterns of ancestry and descent in a group of organisms, and so they necessarily include information about differences in these patterns between different subsets of taxa [1,2]. Trees contain both branch lengths and information in the form of the tree topology or shape. Phylogenetic trees have been used in infectious disease to estimate the basic reproduction number [3], parameters of transmission models [4], aspects of underlying contact networks [58] and in densely sampled datasets even person-to-person transmission events and timing [912]. It is therefore natural to hypothesize that phylogenetic tree structures and branching patterns contain information about short-term growth and fitness. Tree information is central in some predictive models for short-term influenza virus evolution and models of fitness [1,2]. However, the mapping between the phylogenetic tree structure and interpretable biological information can be subtle, [7,8,13,14] and trees do not directly reveal the short-term evolutionary trajectories of groups of taxa.

    Improvements in influenza virus surveillance, sequencing, data sharing and visualization [15,16] mean that sequence data over considerable time frames is now available to the community alongside intuitive and interactive displays showing how the population of influenza viruses has changed over time. Computational systems to reconstruct large-scale phylogenetic trees from sequence data have also been developed. Machine learning models are well suited to systematically explore subtle relationships between a suite of features and an outcome. These together present the opportunity to integrate information from different sources to improve short-term influenza virus predictions, using phylogenetic trees as a framework. Here, we use a convolution-like approach to identify small subtrees within a large global H3N2 phylogeny derived from HA sequences sampled between 1980 and July 2018. We fit classification models to detect early signs of growth and hence to predict the short-term success or failure of these subtrees. We validate the predictions on a portion of the data not included in the fit procedure. We relate our predictions to the WHO defined clades [17,18] for sequences sampled from 2015 to 2018. Our approach could be performed in real time, is computationally efficient and can be continually refined to improve the quality of predictions as more data are gathered. We suggest that small groups of closely related influenza virus sequences and the phylogenetic trees that capture their recent shared ancestry patterns can complement other approaches to better predict short-term seasonal influenza virus evolution.

    2. Results

    Briefly, we extract subtrees from the H3N2 phylogeny. Each subtree corresponds to an internal node of the tree and the tip descendants that have occurred within a fixed time frame (1.4 years). The remaining tips occur after the fixed time frame following the relevant internal node, and help to define whether the subtree has successfully grown into the future.

    The approach results in a total of 396 subtrees, overlapping to some extent, containing 7615 of the 12 785 tips in the full phylogeny. We use a wide range of features of the subtrees, focusing largely on tree structure but also including some branch length features, and the number of changes in the epitope sites of HA compared to previous sequences—see Materials and methods. We train supervised machine learning models to use this information to predict whether subtrees will succeed (see electronic supplementary material, figure S1). Figure 1 shows the H3N2 HA phylogeny and highlights in yellow the tips that belong to at least one subtree.

    Figure 1.

    Figure 1. Phylogenetic tree reconstructed from H3N2 subtype sequences using RAxML, with tips highlighted. Each yellow tip is in a trimmed subtree (7615 out of 12 785 tips); grey tips are not. The sequences are downloaded from GenBank with dates from 1980 to May 2018. Long branches in this timed tree did not appear as long branches in the RAxML tree and were not removed (though their tips are not in any trimmed subtree). Inset: illustration of formation of trimmed subtrees: (a) the circled clade contains a subtree; (b) red branches reach tips that occur after the trimming time period and so are pruned out; (c) the resulting trimmed subtree. (Online version in colour.)

    The trained models successfully predict which subtrees will grow sufficiently, as measured within 3.4 years of a subtree’s originating node (2 years after the last possible tip in a subtree). Using support vector machine (SVM) classification with a linear kernel, our overall 10-fold cross-validation accuracy in H3N2 (using the HA sequences) was 74%. As the accuracy of a classifier can be misleading when there are uneven numbers of samples with the different outcomes, we use the area under the receiver operating characteristic curve (AUC) to describe the overall performance of our models. The SMV had an average AUC of 0.82 (range 0.73–0.9); see electronic supplementary material, figure S5. We found an accuracy (portion correctly classified) of 79% and AUC of 0.89 when training on 75% of the subtrees chosen uniformly at random, and testing on the rest (figure 2).

    Figure 2.

    Figure 2. (a) ROC showing performance of linear kernel SVM trained and tested on H3N2, trained and tested on H1N1, trained and tested on B, trained and tested on the merged subtrees of H3N2 and H1N1 and trained on H3N2 and tested on H1N1. Electronic supplementary material, figure S5 shows variation in these curves and their AUCs over 10 models trained on 90% of the test data for each case. (b) ROC showing classification performance for restricted sets of features. Top: tree shape features (no branch lengths). Epi, epitope features; LBI, local branching index; B.Len, some features include tree branch lengths. (Online version in colour.)

    Figure 2 shows receiver operating characteristic curves illustrating the trade-off between sensitivity and specificity. AUC ranges were obtained by training 10 models each on 90% of the subtrees; see electronic supplementary material, figure S5. We obtained 79% accuracy and 0.86 AUC when we trained a linear kernel SVM model on a training portion of the subtrees (75% of the subtrees chosen uniformly at random) obtained from the H1N1 phylogeny, reconstructed using sequences from 2009 to May 2018 (see Materials and methods; we did not use epitope features in any H1N1 analyses as the HA protein differs in H1N1). We performed 10-fold cross-validation on H1N1 subtrees, which resulted in 0.76 average AUC (range 0.52–0.95). We also pooled the subtrees of H1N1 and H3N2 and divided the pooled subtrees into training and test datasets; this resulted in an accuracy of 75% and an AUC of 0.85 (10-fold cross-validation AUC range 0.75–0.88). We also applied our method on a phylogenetic tree reconstructed from the HA gene of human influenza virus B. This results in a 0.82 AUC and 78% accuracy when we train our model on 75% of the data chosen randomly and test on the rest. For the influenza virus B tree, we used only the topological features. We also compared classifier performance using only portions of the data, and find that combining the tree shape features with the epitope and local branching index (LBI) [1] gives the highest quality, with AUC of 0.88 compared to 0.72 for either epitope features [19] or LBI alone, and 0.78 for these combined.

    Our subtrees are based on internal nodes in long-time phylogenies, and these nodes are present as a consequence of the relatedness patterns in all the data that are passed in to the tree reconstruction algorithm (in particular, including sequences from the entire time range). In consequence, a node’s existence and local structure may be conditional on sequences occurring chronologically long after the node. We took several approaches to ensure that our models were not influenced by some such subtle knowledge of the future. We trained models on H3N2 HA phylogeny but tested on an H1N1 HA phylogeny. We obtained an accuracy of 72% and an AUC of 0.75 (range 0.72–0.75 when training 10 models each on 90% of H3N2 subtrees and testing on H1N1; figure 2 and electronic supplementary material, figure S5). The reduced accuracy is natural given that the HA proteins differ between the two types. We also created ‘time slices’ from the H3N2 HA sequences using only tips occurring prior to time i (i ∈ {May 2012, May 2013, May 2014, May 2015, May 2016, May 2017}). We extracted subtrees, and tested their success using trees reconstructed from all the sequences prior to time i + 1, respectively (see Materials and methods and electronic supplementary material). This mimics a ‘real-time’ analysis and ensures that subtrees cannot depend on sequences arising after a set time. This approach performs comparably to our other tests, with accuracy of 70–76% and an AUC of 0.73–0.86.

    Subtrees originating after January 2015 (here called ‘recent subtrees’) did not have enough time to grow into the future, and we do not know whether they are successful, as the predictions are relative to 3.4 years after the initial node. To accommodate for this, throughout our analysis, we only trained and tested models on subtrees that originated prior to January 2015. In this second part, our aim is to make predictions for subtrees whose outcomes are not known. In order to do so, we trained 10 models using 10-fold cross-validation on the non-recent subtrees, as well as a general model, and used these 11 models for predictions on the recent subtrees.

    The recent subtrees were labelled according to the clades defined by the World Health Organization (WHO) (see figure 3a and Materials and methods). Every recent subtree contributed with 11 predictions to its respective clade. The predicted success of a clade in 2019 is the average success of all predictions coming from related recent subtrees. The predictions are presented in figure 3 and electronic supplementary material, figure S11.

    Figure 3.

    Figure 3. Summary of predictions for the recent subtrees (for detailed information, see electronic supplementary material, figure S11). (a) Relations between the clades defined by the WHO, showing the emergence of new clade designations from existing ones; (b) frequency of clades through years, based on 3298 H3N2 HA sequences sampled between 2015 and 2018. Every sequence is associated with a tip of one or more subtrees used in the predictions. The clade designation of the tips determines the clade designation of the subtrees by majority rule. (c) Predictions for the recent subtrees. Among the 120 recent subtrees, the outcome (success/failure) of 63 subtrees is not known. Each subtree was tested on 11 different SVM models (see Materials and methods). The number of subtrees associated with a clade is indicated in parentheses. Colours reflect the combined predictions of the subtrees associated with each clade. (Online version in colour.)

    Our recent subtrees originated between March 2015 and February 2017, and in this period, contained tips as shown in figure 3b, with the majority of tips in clades A1b, A1, A1a, A2 and A2/re. This reflects the sequences in GenBank, and is likely not globally representative [16]. Clades A3 and A1b/135K were most strongly predicted to be successful by our measure (fraction of the clade’s subtrees predicted as successful), but in A1b/135K there are only two subtrees on which to base predictions. In clade A3, we predicted seven out of eight subtrees as successful, with five of those already having shown sufficient growth to meet our success criterion. A2/re has an intermediate signal overall, but has 12 subtrees predicted to be successful. Of these, 10 had already shown sufficient growth to pass our success threshold by the end of our sampling. Indeed the A2/re clade did become very successful, probably due to a re-assortment event [18]. Our model also predicted that other parts of clade A2 (four of seven subtrees in A2 but not in A2/re) may grow. In the time frame we had, there were relatively few sequences in our GenBank data that were mapped into the A4, A1b/135K and A1b/135N clades by the ‘augur’ pipeline [16] so these clades have very few subtrees on which to make predictions.

    3. Discussion

    We efficiently predict the success of individual influenza virus subtrees using machine learning tools applied to phylogenetic trees. Our method allows binary classifiers to be trained to predict which currently circulating subtrees will persist into the future based primarily on a suite of phylogenetic features.

    Our approach is complementary to previous approaches, including the fitness based model proposed by Łuksza & Lässig [19] and the tree-centred work of Neher et al. [1,20]. Our approach requires a reconstructed timed phylogenetic tree, and can accommodate additional data (e.g. we have used epitope mutations) easily. Other approaches often require additional data such as HI titres and estimates of the ancestral sequences, introducing experimental and computational costs and uncertainty. Our approach makes use of the reconstructed phylogeny in two distinct ways, first in obtaining the groups of taxa (subtrees) considered together for analysis, and second in that the tree shape and length features are derived from the phylogeny, and capture features of the complex branching patterns within subtrees, as opposed to their overall rates.

    This work is rooted in the hypothesis that fitness and early success leave signatures in the branching time and structure of phylogenetic trees, which can be complemented with additional relevant information such as epitope diversification. With only slight modifications, our model could be applied to other organisms. We could also extend the approach and train regression models to predict the number of tips arising from subtrees. However, a natural limitation of this (and other tree-based approaches) is that it detects signs of early growth—if an adaptive new mutation arose in the population and was sampled before that early growth could occur and be sampled, then we could not detect early signs of growth and would not see the new adaptive mutation. By contrast, a principled modelling approach based on an understanding of both what makes an influenza virus fit and on the current composition of population immunity might be able to detect fit novel mutations without relying on such viruses already having begun to spread.

    Influenza virus A can be categorized based on the presence of different proteins on the surface of the viruses: HA and NA. We use trees reconstructed from HA sequences; relatedness in the HA tree corresponds to similar HA sequences and hence to similar immunity profiles, as antibodies are induced by HA. Indeed, path lengths in the HA phylogenetic tree provide a good model for antigenic differences modelled by serological assays [1]. Trees describe the relative number of recent descendants of a lineage compared to closely related lineages, the timing and asymmetry in the descent patterns and the short- and long-term future populations that are related to the lineages. Our approach allows this information to be included in predictive efforts.

    We did not include information about proximity of strains to current or recent vaccines, which might have led to false positives in our results, if a subtree showed early signs of success but was later suppressed by vaccination (but this is unlikely, as only a small portion of the global population is vaccinated). We also did not explicitly include immunological assay data, as these are not generally available. We used specific epitope sites following the approach of [19]; a model reflecting the impact of polymorphisms across more locations in HA and in other genes, if this were available, might improve predictions. We do not have good estimates of the current global circulating frequencies of subtrees or strains, and the sequences available in NCBI are not representative of circulating influenza virus. This is perhaps the biggest limitation of this work: are we predicting not what circulates, but what circulates and arrives, sequenced, into NCBI, and how is this to be interpreted? If up-to-date global strain frequencies were available at high resolution it would surely facilitate short-term prediction.

    Seasonal influenza vaccines are designed with surveillance data from over 100 national influenza centres worldwide, together with laboratory and clinical studies, and information about the availability and production feasibility of vaccine viruses [21]. They also must consider H3N2, H1N1 and influenza B together. Our predictive model could be enhanced to incorporate additional data in a combined analysis, in which our tree-based early success signature was incorporated as a variable in the broader vaccine design framework. Alternatively, additional data on immunology, feasibility and surveillance could readily be incorporated into a machine learning framework like the ones used here. The fact that early success of small sets of taxa is detectable in phylogenetic trees suggests that these efforts could lead to improvements in forecasting successful strains.

    We chose the time frames in part to balance the number and size of subtrees and their overlap. Ongoing sampling and sequencing and the natural passing of time will ultimately provide more data—more years, and more samples per year—such that more subtrees, of more consistent size, can be used. We anticipate that more data will increase the quality of predictions. Another next logical step will be to model competition or other interactions between major lineages. We have not explicitly modelled ancestral states, key individual mutations, serological data or estimates of the fitness of sequences, but our approach could easily integrate results from models that include these features. Ideally, all relevant sources of information would be integrated and updated in real time [19,20]. However, while short-term forecasting based on various data sources is feasible and is required to update seasonal vaccines, perfect short-term prediction and accurate long-term prediction are likely not possible because evolutionary events are fundamentally stochastic.

    Direct comparison of our method with previous approaches is challenging as different approaches have various definitions of success and use different evaluation metrics. Neher et al. [1,20] demonstrate that the shape of the reconstructed phylogenetic trees contain information about the fitness of the sample sequences and using this information, they proposed a model to predict the successful strains in the upcoming influenza season, using the LBI. The units of prediction in their approach are strains while we use subtrees. One of the advantages of using subtrees over strains is that strains are typically observed only in a single season, while subtrees have an evolutionary history. Our tests have found that at the subtree level, LBI does not perform as well as our combined feature sets. Łuksza et al. [19] infer the fitness components of influenza strains: adaptive epitope changes and deleterious mutations outside the epitopes for the strains circulating in a given year, using population-genetic data of all previous strains. They then used the fitness and frequency of each strain to predict the frequency of its descendent strains in the following year. They used clades as units of prediction which are close to our units of prediction (subtrees). However, we put a time restriction on the subtrees which enables us to compare subtrees in the same time interval. In order to have a balance of successful and unsuccessful samples, and accordingly a more powerful model, we chose a subtree growth ratio threshold of 1.1 to define success, with model growth over 2 years. These parameters are 1 and 1 year, respectively, in the model proposed by Łuksza et al. We do not have strain frequency data, but a comparison with epitope features as the primary basis of prediction found that our feature sets performed better.

    Recent studies on the viral isolates from vaccinated individuals indicate that they are significantly distinct from the vaccine strain and are broadly distributed on the tree, resulting in accelerated antigenic evolution [22]. Researchers have been working to develop a universal vaccine that would provide broad protection against both seasonal and pandemic influenza. Recent studies have also indicate that universal vaccines could decelerate the speed of evolution [22,23]. If successful, such universal vaccines would eliminate the need for continual updates to seasonal influenza virus vaccines. In the meantime, efforts to make short-term predictions based on surveillance and sequence data gathered over time can yield both practical results and broader insights into short-term patterns of evolution. Our approach indicates that fitness can affect the phylogenetic properties of a tree reconstructed from influenza viruses and that properties of small subtrees can be used as a set of predictors to estimate which groups of sequences are showing signs of success.

    4. Materials and methods

    Our approach is rooted in the hypothesis that fitness—the reproductive rate and capacity of a group of organisms—affects tree structure and branching patterns (including timing) and that this information can be extracted using machine learning tools.

    (a) Definitions

    Given a phylogenetic tree T, a tip (also called an external node) of T is a node of degree one. An internal node of T is any non-leaf node of the tree. A rooted tree is a tree in which a particular internal node, called the root, is distinguished from the others; it is usually considered to be a common ancestor of all the other nodes in the tree. In a rooted tree T, the parent of a node i is the node preceding it on the unique path from the node to the root r of T; all nodes of T except its root have a parent. A child of a node i is a node whose parent is i. A phylogenetic tree is bifurcating if all its internal nodes have two children. We use a rooted bifurcating phylogenetic tree that is reconstructed from the HA sequences of influenza virus A strains (by maximum likelihood).

    (b) Reconstructing influenza virus trees

    We collected full-length HA sequences from human H3N2, H1N1 and influenza virus B on GenBank [24,25]. We used unique sequences of H3N2-HA from human cases for years between 1980 and May 2018, excluding laboratory strains. This results in approximately 12 919 sequences. For influenza virus H1N1, we collected pandemic sequences from 2009 until May 2018, giving 10 652 unique sequences. For influenza virus B, we used sequences from 1980 to May 2018, resulting in 7257 unique sequences. We aligned each set of sequences using MAFFT [26,27] and then we used RAxML [28] to reconstruct maximum-likelihood phylogenetic trees; this is considered a state-of-the-art reconstruction algorithm [29,30]. Due to the large numbers of isolates we did not perform Bayesian Markov chain Monte Carlo (MCMC) tree reconstruction to accommodate tree uncertainty; in addition to each required MCMC run, this would also have required exploration of different priors and assumptions, and it is infeasible for thousands of tips. In order to check the robustness of our approach, we used different training and testing trees including training on H3N2 but testing on H1N1, pooling H3N2 and H1N1 and using distinct time slices and consistently obtained successful predictions (see below). The reconstructed trees using RAxML are neither rooted nor dated; in some cases, they include very long branches, which we removed. We rooted trees with the rtt function in the ape package [31] in R and then converted them to timed trees using the least squares dating (LSD) software [32].

    (c) Subtree extraction

    We use a convolution-style approach to identify subtrees of the global timed phylogeny that serve as units of analysis. For each internal node i in the tree, we find the tips that occur within a fixed time window (1.4 years by default) chronologically following i; this is node i’s ‘trimmed subtree’. We cannot train machine learning models on the subtree descending from every internal node in the tree, because these subtrees will overlap substantially. We use the notion of a node’s ‘relevant ancestor’ (described below) to control the overlap, and select subtrees in a convolution-like way. The choice of 1.4 years ensured that sufficient time elapsed from the subtrees’ origins to the tips that subtrees would have more than eight nodes frequently enough to obtain enough data, and because 1–2 years is a reasonable time frame with circulating strains, and vaccines, updated annually. A longer time frame would have resulted in either far fewer trees, or far more overlap between trees, limiting the analysis capabilities.

    We first initialize each node’s relevant ancestor to be its parent. We traverse the nodes of the tree in a depth-first search order. If a node’s complete subtree is too small (fewer than eight nodes by default), we reject the node and all its descendants, as none of the descending nodes can have a larger subtree than the node itself. If the node’s trimmed subtree is too small but its complete subtree is large enough, we reject the node but not its descendants, since they may have subtrees that are large enough. If the node’s trimmed subtree is large enough, we check the overlap between the node’s subtree and its relevant ancestor’s subtree. The overlap is the portion of node i’s trimmed subtree that is contained in the relevant ancestor’s trimmed subtree. If this overlap is not too large (under 80% of the subtree size by default), the subtree is included in our analysis. If the overlap is too large, we reject the node, and we set the relevant ancestor of the node’s children to be the node’s relevant ancestor. In that way, when we decide whether to accept the subtree of the node’s child, we will control the correct overlap.

    In this way, we obtain subtrees containing tips that are within a specified time window after their originating node, have at least a minimum number of tips, and have a limited overlap with other subtrees. We varied the minimum size, time interval and permitted overlap (see electronic supplementary material, table S1). We obtain a total of 392 subtrees in H3N2, and 198 subtrees in H1N1. After removing subtrees with big size in relation to the average size, and recent subtrees with insufficient growth to determine their outcome, we obtained 329 subtrees for H3N2 and 160 subtrees for H1N1.

    (d) Features

    We use a set of features defined on subtrees, including both tree shape and patterns in the branch lengths. The topological features are summarized in table 1. For the H3N2-HA dataset, we also consider some features derived from the epitope sites of the tips of the subtree. For each subtree, we consider the mean, median and the maximum genetic distances between the epitope sites of the tips of a subtree and the epitope sites of the sequences with dates prior to the subtree. We used the locations of known antigenic epitopes as mentioned in [33], namely 72 sites in the HA1 subunit of HA.

    Table 1. Brief definition for tree shape statistics. Here ri and si are the number of tips of the left and right subtrees of an internal node, respectively. n is the number of tips of a subtree. ni is the number of nodes at depth i, Mi represents the height of the subtree rooted at an internal node i, and Ni is equal to the depth of node i. A ladder in a tree is a set of consecutive nodes with one tip child. We represent the set of all internal nodes of a tree by I, the set of all tips (or external nodes) by L, and the root of a subtree by r. In ‘generalized branching next’ we chose m = 2. Skewness and Kurtosis are two measures to describe the degree of asymmetry of a distribution [35]. The tree shape statistics induced by betweenness centrality, closeness centrality and eigenvector centrality are defined as the maximum values of each centrality over all the nodes of a tree, and distances are in units of number of edges (without branch lengths). Features called ‘natural’ may not have been used as tree features previously but are natural extensions of simple features (e.g. skewness is a natural quantity to compute). The network science properties were computed in R using the treeCentrality package [36] and the tree-wide summaries were primarily obtained using the phyloTop package [37].

    name description reference
    properties from network science
    betweenness centrality max number of shortest paths through nodes [38]
    closeness centrality max total distance to all other nodes [38]
    eigenvector centrality max value in Perron–Frobenius vector [38]
    diameter largest distance between 2 nodes [39]
    WienerIndex sum of all distances between 2 nodes [40]
    mean tips pairwise distance average distance between 2 tips natural
    max tips pairwise distance max distance between 2 tips (with branch lengths) weighted graph diameter
    numbers of small configurations
    cherry number number of nodes with 2 tip children [41]
    normalized pitchforks 3 * (number of nodes with 3 tip descendants)/n [42]
    tree-wide summaries
    normalized Colless imbalance 1n3/2iI{r}|risi| [43]
    normalized Sackin imbalance 1n3/2iLNi [44]
    normalized maximum height the maximum height of tips in the tree/(n − 1) [45]
    maximum width max number of nodes at the same depth [13]
    stairs1 the portion of imbalanced subtrees [46]
    stairs2 the average of min(si,ri)max(si,ri) over all internal nodes [46]
    max difference in widths max i (ni+1ni) [13]
    variance the variance of internal node depth [47]
    I2 iI{r}ri+si>2|risi||ri+si2| [47]
    B1 iIMi1 [47]
    B2 iLNi2Ni [47]
    normalized average ladder the mean size of ladders in the tree/(n − 2) [45]
    normalized ILnumber number of internal nodes with a single tip child/(n − 2) [45]
    branching speed the ratio of the number of tips to the height of the tree new
    measures from edge length
    branching next index mean of indicator: does the next branching event descend from this node new
    generalized branching next number of next two branching events descending from this node new
    skewness the skewness of the internal branch lengths natural
    kurtosis the kurtosis of the internal branch lengths natural

    Our features cover a wide range of global and local structures in trees, expanding considerably over previous approaches which largely focus on tree asymmetry and a few properties of branch lengths [1,2]. Previous authors have noted that fitness leaves traces in genealogical trees [2] by observing in fixed-size populations that increased fitness resulted in increased asymmetric branching and long terminal branch lengths; Neher and colleagues used the LBI, a measure of the total branch length surrounding a node, in their predictive model [1]. We significantly expand on the repertoire of tree features, including asymmetry and measures of local branching but also including features derived from network science that capture global structure of the subtrees, small shape frequencies and others—see table 1.

    For comparison purposes, we implemented Neher et al.’s LBI [1], which is a measure of rapid branching near a node in the tree. In doing this, we noted that there are strong parallels between the LBI and the weighted version of the Katz centrality, a classic measure from network science [34]. Electronic supplementary material, figure S9 shows the correspondence. We performed the main classification task (H3N2) using all our features, only the topological tree features, only the epitope and LBI, only the epitopes and only the LBI (figure 2b). We found that the combined features gave the best performance, followed by the tree features.

    (e) Success and training approach

    We call a subtree of size n ‘successful’ if its root has a total of more than αn tip descendants in the time frame of 3.4 years from the root of the subtree. The threshold of α = 1.1 results in a good balance of successful and unsuccessful subtrees, which facilitates training the machine learning models (see electronic supplementary material). We chose to use fractional growth as our outcome rather than proximity to tips of the following season, because proximity to the following season fluctuates depending on when in the season the subtree originates, the definition of the season (i.e. the cut-off dates) and the subtree’s location (tropical versus temperate). We explore sensitivity to the choice of 1.1 for the threshold (see electronic supplementary material, figure S4).

    The potential overlap between subtrees could induce dependence in the outcome variable (success), i.e. if nodes ni and nj have overlapping subtrees, and ni is successful, then nj may be more likely to be successful. Notice that having some tips in common does not mean that overall subtree features are similar, but it is reasonable to believe that the chance increases as the proportion of shared tips grows. If ni was in the training data and nj in the test data, and if the two subtrees had similar features, then correlations in the success of these two subtrees could result in overfitting the data. Controlling this potential effect is one reason to use a cut-off of 3.4 years for success (meaning that nj could be unsuccessful with ni successful). In one of our experiments, we trained on H3N2 and tested on H1N1 (figure 2a) to ensure that test and training data are completely distinct. We also explored the performance of our approach on a set of pool subtrees from both H1N1 and H3N2.

    Our data are censored, because we cannot observe the future of subtrees beginning in 2018; furthermore, we have limited knowledge of the true success of subtrees beginning in the most recent 3.4 years of our data (since it takes approx. 2 years following the end of the trimmed subtree before its success is known). Our results are not sensitive to this choice, except that if we were to choose a longer time frame, then we would have fewer subtrees whose outcomes are known by the end of the dataset. We only know whether a subtree has been successful if it has already had a sufficient number of tips; other subtrees may yet do so. Accordingly, we could not train our models using the last few years of data. We used 10-fold cross validation on the set of subtrees whose ancestral nodes arose before January 2015. We refer to this as the set of ‘non-recent subtrees’. This results in one ‘out of fold’ prediction for each such subtree. We trained an additional ‘general’ model on the set of non-recent subtrees. We then had 10 cross-validation models and one additional model that we could use to make predictions on the subtrees arising after 2015 (for which the true success is only partially known).

    The structure (and hence internal nodes) of a large phylogeny depends on all tips, not only the tips prior to the node chronologically. To avoid having the ‘future’ tips affect the nodes on which we based our analysis, we also used a ‘time slicing’ approach that is amenable to use in real-time, season by season. Here, we extracted subtrees not from the full phylogeny but from a tree containing only tips prior to a fixed time. We then assessed the success of these subtrees with reference to later tips (see the electronic supplementary material).

    (f) Classification

    We use several different binary classification tools, including SVM with a range of kernel choices [48]. We use R implementations in the packages e1071 [49]. For all the experiments, we randomly chose 75% of our subtrees for training the model and left the rest for testing. We selected a training and a test dataset (75% for training and the rest for testing) and perform 10-fold cross validation on the training dataset alone; this is to find the best gamma and cost parameters without using all the data to do so, see electronic supplementary material, figure S1. Among different binary classification tools that we used, SVM with a linear kernel had the best performance on this 75% training set, so we proceeded with this option for the remaining results. Datasets can have outliers that affect the training process. We use the local outlier factor (LOF) algorithm [50] implemented in the DMwR package [51] in R. For classification on the H3N2 tree, we removed five outliers which mostly were large subtrees, most of whose descending tips were contained in other subtrees. In the H1N1 tree, we had 198 subtrees; we removed the largest 11 as outliers, for the classification. For the experiment on the merged H1N1 and H3N2 subtrees, we removed large subtrees of both H3N2 and H1N1 to have a set of subtrees of approximately the same size. In the experiments on the influenza virus B tree and the combinations of influenza virus B and other types, we remove large subtrees to obtain a set of trees of approximately the same size in the training and testing groups. Our training and testing scheme is illustrated in electronic supplementary material, figure S1.

    (g) Clade assignment

    We implemented a Python script using the same approach as the one found in augur (https://bedford.io/projects/nextflu/augur/) to assign a clade to a sequence. A clade is defined by a set of amino-acid or nucleotide mutations specific to that sublineage. All mutations used as markers occur in the HA1 and HA2 proteins of H3N2-HA sequences, and the detailed list of mutations can be found at https://github.com/nextstrain/seasonal-flu/blob/master/config/clades_h3n2_ha.tsv. A sequence is assigned to a clade only if it satisfies all criteria of the clade, i.e. only if the sequence contains all the specific mutations.

    It is possible for a sequence to satisfy the criteria of more than one clade, since some clades are descendants of previous ones. For instance, the sequence KU289613_A/Corsica/33-02/2015_2015/02/20 satisfies requirements for clades 3c, 3c3 and 3c3.B. In this scenario, the most recent clade is assigned (3c3.B). Among the 12 919 sequences analysed, 1106 sequences were assigned to exactly one clade, 7487 sequences were assigned to more than one clade and 4326 sequences had no clade assigned. Most of the unassigned cases correspond to older sequences (electronic supplementary material, figure S8). Among the sequences assigned to more than one clade, we used the clade relations presented in figure 3a to determine the most recent clade (e.g. A1a is contained within 3c2.A, though this information is not encoded in the clade names). The relations between clades were manually inferred from the tree at https://nextstrain.org/flu/seasonal/h3n2/ha/3y.

    The clade designation of the tips determines the clade designation of the subtrees by majority rule. As a result, in the predictions presented in figure 3, some older clades (3c2 and 3c3) have tips assigned to them (nine and four tips, respectively) but no subtrees and, therefore, no predictions.

    Data accessibility

    Our code and data are freely available at: https://github.com/MarHayat/flu_prediction.

    Authors' contributions

    C.C. conceived the study; C.C. and M.H. designed the subtree selection algorithm and the machine learning approach. P.B. obtained clade designations with augur. All authors wrote the manuscript; M.H. prepared the first draft.

    Competing interests

    We declare we have no competing interests.

    Funding

    C.C. was also funded in part by the Engineering and Physical Sciences Research Council of the United Kingdom (grant no. EP/K026003/1).

    Acknowledgements

    C.C. and P.B. would like to acknowledge the Canada 150 Research Chair programme. M.H. would like to thank Dr Leonid Chindelevitch for providing this visiting scholar opportunity through funding from an NSERC Discovery Grant as well as the CANSSI Collaborative Research Team.

    Footnotes

    Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.4911456.

    Published by the Royal Society. All rights reserved.