Journal of The Royal Society Interface
Open AccessResearch articles

Estimating the fitness cost and benefit of antimicrobial resistance from pathogen genomic data

David Helekal

David Helekal

Centre for Doctoral Training in Mathematics for Real-World Systems, University of Warwick, Coventry, UK

Contribution: Conceptualization, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing

Google Scholar

Find this author on PubMed

Matt Keeling

Matt Keeling

Mathematics Institute and School of Life Sciences, University of Warwick, Coventry, UK

Contribution: Conceptualization, Funding acquisition, Investigation, Supervision, Writing – review & editing

Google Scholar

Find this author on PubMed

Yonatan H. Grad

Yonatan H. Grad

Department of Immunology and Infectious Diseases, TH Chan School of Public Health, Harvard University, Boston, MA, USA

Contribution: Conceptualization, Investigation, Writing – review & editing

Google Scholar

Find this author on PubMed

Xavier Didelot

Xavier Didelot

School of Life Sciences and Department of Statistics, University of Warwick, Coventry, UK

[email protected]

Contribution: Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing – original draft, Writing – review & editing

Google Scholar

Find this author on PubMed


    Increasing levels of antibiotic resistance in many bacterial pathogen populations are a major threat to public health. Resistance to an antibiotic provides a fitness benefit when the bacteria are exposed to this antibiotic, but resistance also often comes at a cost to the resistant pathogen relative to susceptible counterparts. We lack a good understanding of these benefits and costs of resistance for many bacterial pathogens and antibiotics, but estimating them could lead to better use of antibiotics in a way that reduces or prevents the spread of resistance. Here, we propose a new model for the joint epidemiology of susceptible and resistant variants, which includes explicit parameters for the cost and benefit of resistance. We show how Bayesian inference can be performed under this model using phylogenetic data from susceptible and resistant lineages and that by combining data from both we are able to disentangle and estimate the resistance cost and benefit parameters separately. We applied our inferential methodology to several simulated datasets to demonstrate good scalability and accuracy. We analysed a dataset of Neisseria gonorrhoeae genomes collected between 2000 and 2013 in the USA. We found that two unrelated lineages resistant to fluoroquinolones shared similar epidemic dynamics and resistance parameters. Fluoroquinolones were abandoned for the treatment of gonorrhoea due to increasing levels of resistance, but our results suggest that they could be used to treat a minority of around 10% of cases without causing resistance to grow again.

    1. Introduction

    The levels of antimicrobial resistance of many pathogens have risen worryingly over the past few decades. In a report on the threat posed by antibiotic resistance published by the CDC (Centres for Disease Control and Protection), three microorganisms including Neisseria gonorrhoeae are classified as posing an urgent threat level, and twelve more represent a serious threat to public health [1]. A review on antimicrobial resistance estimated that resistance claims at least 700 000 lives per year worldwide and that the death toll could go up to 10 million per year by 2050 if current trends are allowed to continue [2], and a recent study estimated that there were almost 5 million deaths associated with resistance in 2019 [3]. Few new antimicrobials have been developed and deployed since the 1970s, whereas resistance to new drugs often emerges soon after initial introduction [4], so that several pathogens are dangerously close to becoming completely untreatable. Effectively tackling antimicrobial resistance requires greater understanding of epidemiological and evolutionary factors leading to emergence of resistance and the spread of resistance through pathogen populations. Achieving this goal requires development of mathematical models of antimicrobial resistance and robust statistical analysis of epidemiological models with informative observations. This modelling approach to resistance was initiated in the late 1990s [5,6] and has led to the development of many models, appropriate for different organisms, mode of spread, study scale and context [7].

    Resistance brings a clear fitness benefit to pathogens acquiring it in the presence of antimicrobials. The net value of this fitness benefit therefore increases with the frequency with which the specific antimicrobial is employed, either against the pathogen itself or more generally in the case of a pathogen that can be carried asymptomatically. However, resistance also typically comes with a fitness cost to the pathogen [8]. The simplest demonstration of this effect is when discontinued use of an antimicrobial leads to reductions in resistance rates. The fitness costs and benefits of resistance remain poorly understood for many pathogens and antimicrobials [9]. A better quantification of resistance benefits and costs is required to provide a solid basis for evaluating the potential effectiveness of public health intervention measures proposed to exploit fitness costs in the hope of stopping or even reversing the spread of resistance [9]. For example, the numbers of gonorrhoea cases sensitive and resistant to cefixime in England over a decade were recently analysed to quantify the cost and benefit associated with resistance to this antibiotic [10]. These estimates were used to predict that cefixime could be reintroduced to treat a minority (approx. 25%) of gonorrhoea cases without causing an increase in cefixime resistance levels, which would reduce the risk of emergence of resistance to the currently used antibiotics. Moreover, the extent of the fitness cost of resistance can vary by genomic background [11], such that the effect of interventions that seek to capitalize on the fitness costs of resistance may be lineage dependent. Therefore, it is necessary to estimate fitness costs at the per lineage level. The aim of this study is to quantify the contribution that changes in prescription policy have on the population dynamics of particular resistant lineages. This is in contrast to studies that are interested in the overall ecology of resistance or the eventual fate of resistant phenotypes (e.g. [12]).

    Pathogen genomic data have great potential to help us understand the evolutionary and epidemiological dynamics of infectious disease [13]. An important advantage of this phylodynamic approach is that analysis of genomic data is less sensitive to sampling biases, especially when using a coalescent framework which describes the ancestry process conditional on sampling [14]. A few studies have used this approach to shed light on the fitness cost associated with antimicrobial resistance. For example, a study showed the association between the growth rate of a methicillin-resistant Staphylococcus aureus lineage and consumption of beta-lactams [15]. Other studies quantified the relative transmission fitness of resistance mutations in HIV [16] and Mycobacterium tuberculosis [17]. Here, we take a different approach by modelling explicitly the phylodynamic trajectories of the sensitive and resistant lineages as a function of the fitness cost, which is constant, and the fitness benefit, which depends on the antimicrobial consumption. Our method therefore requires three inputs: the amount of antimicrobial being used over time, genomic data from a sensitive lineage and genomic data from a resistant lineage. From this, we disentangle the fitness cost and benefit of resistance, thereby providing the parameters needed to predict phylodynamic trajectories and inform recommendations on how to use antimicrobials without worsening the resistance threat. Overall, the scenario we are interested in is that of overall resistance dynamics at a large population level. In such a scenario, the bulk of incidence is going to be caused by local transmission rather than imports. We do not intend for the methods presented in this paper to be applicable to small populations dominated by imports and complex, heterogeneous routes of transmission, such as nosocomial infections in a hospital setting. For such a scenario, a different approach using birth–death type models would be more appropriate [16,17].

    2. Methods

    2.1. Overall approach

    Pathogen phylogenetic data contain information about past population size dynamics of the pathogen under study [13,18]. Under assumptions of the epidemic process being characterized well enough by a simple compartmental epidemic model, this information about population size dynamics can be translated into epidemic trajectories [19,20]. These epidemic trajectories can be described using an epidemic model which accounts for the effects of a fitness cost and benefit of resistance to a specific antimicrobial. As the use of this antimicrobial changes through time, so will the net fitness of the particular lineage in consideration. This will in turn lead to changes in the behaviour of the epidemic trajectory. However, not all changes in the behaviour of the epidemic trajectory will be due to changes in the fitness of the resistant phenotype. Confounding factors, such as depletion of susceptibles or changes in host behaviour, will also affect the epidemic trajectory. Under relatively mild assumptions detailed below changes in these confounding factors will affect other lineages equally. We can therefore use as ‘control’ some data from a susceptible lineage, ideally closely related and with the same resistance profile to other antimicrobials used in significant amounts as primary treatment. Differences between the trajectories of the sensitive and resistant lineages can then be ascribed specifically to resistance, allowing us to estimate the associated fitness cost and benefit parameters.

    Let us consider a pathogen causing infections at the level of a large population that are or were treated with a certain antimicrobial compound. We assume that at some point in the past one or several lineages with resistance to this antimicrobial compound have arisen. Our aim is to quantify the fitness cost and benefit of the resistance to this antimicrobial for a given lineage as a function of use of the antimicrobial of interest through time. To this end, we need data that quantify the use over time of the given antimicrobial to treat infections caused by this pathogen, as well as a reasonable sample of sequenced case isolates from infections caused by the pathogen over time. Furthermore, we need information that characterizes the resistance profiles of the individual isolates, which can be either obtained by resistance screening in vitro or predicted from the sequences in silico [21]. A dated phylogeny of these samples is estimated, for example using BEAST [22], BEAST2 [23] or BactDating [24]. This phylogeny is then used as the starting point for analysis [25], to identify which samples belong to resistant and susceptible lineages and to select related lineages for further study that are wholly resistant or susceptible to the antimicrobial of interest, but otherwise similar in their resistance profiles. Note that for simplicity resistance is treated as a binary trait, with samples being either resistant or susceptible to antimicrobials, as is usually the case in resistance modelling studies [7].

    2.2. Transmission model derivation

    In order to estimate the fitness cost and benefit of antimicrobial resistance, a transmission model needs to be specified. We focus on estimating the fitness parameters of a particular lineage harbouring a certain treatment resistant phenotype when previous infection does not confer immunity against reinfection. Under the simplifying assumptions that the host population is unstructured and that past infections do not confer any immunity, the multi-lineage susceptible–infected–susceptible (SIS) is a reasonable model [26,27]. This model is more commonly referred to as multi-strain SIS. Fluctuations in the carriage levels of different lineages can also be due to external factors, such as changes in host demography or behaviours. Left unaccounted, such fluctuations would bias estimates of the fitness cost and benefit of resistance to a given antimicrobial. Therefore, we modify the model with time-varying transmission rate β(t) and population size N(t). This leads to an n-lineage model described by a system of the following n-coupled ordinary differential equations (ODEs):

    where Ij(t) denotes the number of people infected with the jth lineage at time t. β(t) is the transmission rate that varies with time due for example to changes that are not specific to any lineage, for example host behaviour. N(t) is the host population size which may also change with time due to demographic factors. γj(t) is the recovery rate of the jth lineage at time t. These may or may not vary with time through their dependency on the antimicrobial usage which changes with time. Finally, S(t) denotes the number of susceptible hosts:
    Typically, this model could simply be reduced to a two lineage model, averaging over all lineages that are phenotypically similar in their resistance profiles. However, this is undesirable, as some of the lineages with the same resistance phenotype could differ in fitness due to different genomic background which would confound our estimates. Furthermore, this sort of model would not be readily tractable in a genomic framework, because phylogenetic data are generally going to be informative about the dynamics of a particular lineage only. Note that this also means that the analysis produced is valid for the lineages being studied, and cannot be extrapolated to the overall dynamics of resistance for a given pathogen.

    We therefore need to focus on the resolution of individual lineages. We note that environmental effects such as fluctuations in host population size or behaviour affect all lineages equally, if the population is well mixed. We denote the combination of these effects as b(t) = β(t)S(t)/N(t). Conditional on the knowledge trajectory of b(t) the ODEs in equation (2.1) become uncoupled, and this allows us to reduce the system to uncoupled equations corresponding to the lineage we will be focusing on. As such, we will treat b(t) as a random object that needs to be inferred. We further assume that for the susceptible lineages the average recovery rate denoted γs does not change over time, whereas for the resistant lineage it takes one of two values: γT = qT + γs if a given patient is treated with the antimicrobial of interest, or γU = qU + γs otherwise. If we also consider the known proportion of registered cases treated with the antimicrobial of interest u(t), this fully determines the average recovery rate of the resistant lineages as


    We can now fully write down the equations of the model we will be using for the sensitive and resistant lineages, respectively:


    In practice, we are interested in the difference in recovery rates between the susceptible and the resistant lineages when every case gets treated with the antimicrobial of interest, and when the antimicrobial of interest is not used at all. We denote these by

    The interpretation is therefore that qT captures the fitness benefit of resistance in the case qT < 0 and qU captures the fitness cost of resistance in the case qU > 0.

    This model can be applied to any number of resistant and sensitive lineages, simply by adding lineage-associated terms to the likelihood and adding required parameters. This is straightforward as the individual lineages are independent conditional on b(t), but for simplicity the remainder of methods description focuses on the case of a single sensitive and a single resistant lineage, with the general case being a straightforward extension.

    2.3. Link to phylogenies

    Having defined the epidemiological model, we can now link it to the phylogenetic process. Based on [19,28], the instantaneous coalescent rates for a single pair of lineages can be derived as

    in the susceptible and resistant populations, respectively. The likelihood of a dated phylogeny g with n leaves at times s1 < · · · < sn and n − 1 coalescent events at times c1 < · · · < cn−1 and A(t) lineages at time t is therefore given by Griffiths & Tavare [29]:
    where λ(t) = λs(t) and λ(t) = λr(t) for the susceptible and resistant phylogenies, respectively. However, in most cases, and indeed in our case, the integral in equation (2.7) is not analytically intractable. Furthermore, the antibiotic use data are unlikely to span the entire phylogeny. Therefore, we define the approximate likelihood for the phylogeny truncated to [tmin, tmax], which is the intersection interval spanned by the antibiotic use data and the phylogenies under study.

    As such we resort to the standard way of approximating coalescent likelihoods [30], partitioning the interval [tmin, tmax] into a fine mesh tmin = t1 < t2 < t3 < · · · < tN = tmax such that titi−1 < Δt and that all sampling and coalescent times between tmin and tmax are included in the mesh:

    We note that the approach of how we treat the relationship between the phylogenies and epidemic is effectively a structured coalescent with no migration and time varying Ne(t) determined by the deterministic epidemic model. Approaches reminiscent of ours have been used to formally study the expected age of a mutation in both the presence or absence of selection [31]. However, in that case the populations correspond to different alleles, and the Ne(t) curves follow the proportion of population with a given allele as determined by Wright–Fisher diffusion forwards in time. Migration between the demes corresponding to individual alleles can also further be added corresponding to recombination [32].

    2.4. Bayesian inference

    We first re-scale time from the interval [tmin, tmax] to [ −1, 1]. Denoting the scale factor D = (tmaxtmin)/2 associated with this re-scaling, we account for this in the model by defining γs¯=γsD.

    The model consists of independent first-order linear homogeneous ODEs for each lineage with time-varying coefficients. The solutions at time t subject to initial conditions Is(0) = Is0 and Ir(0) = Ir0 can be obtained in terms of the integral of the instantaneous rates up to time t:

    As it stands, this model would not be well suited for performing inference under, primarily due to the difficulty in choosing a sensible prior on b(t), and a very complicated dependency structure between the initial conditions and b(t). As such we re-parameterize the model by directly modelling the logarithm of Is(t) as a Gaussian process:
    where C(t) is an appropriately chosen zero mean Gaussian process, and μs is the susceptible intercept which relates to the susceptible initial condition Is0 as follows:
    We use this formulation principally to loosen the coupling between the intercept parameter and the Gaussian process in order to speed up sampling. From this, we can compute b(t) and log Ir(t) as
    Once again we follow the same reasoning for the resistant trajectory intercept μr, relating it to Ir0 as
    Note that (d/dt)C(t) exists as long as the associated covariance kernel is sufficiently smooth such as in the case of the radial basis function (RBF) kernel [33] which we used. Evaluating a full-rank, Gaussian process with differentiable trajectories on the entirety of the mesh would be prohibitively expensive due to the O(n3) computational complexity, where n is the number of grid points. Such a high computational cost would make the model infeasible. Instead, we work with a low-rank representation of C(t) based on the framework introduced in [34]. This leads to the representation of the low-rank projection of C(t), denoted by C^(t)
    This reduces the evaluation complexity of the Gaussian process prior from O(n3) to O(nm). L and m are approximation parameters that need to be specified a priori (see [34] for details). In practice, we used the Hilbert space Gaussian process (HSGP) approximation with parameters L = 6.5 and m = 60. These approximation parameters are appropriate for the 99% interval of the length-scale prior used as per [34]. Here fj are independent and identically distributed random variables following the standard Gaussian distribution, SRBF( · ; · , · ) is the appropriate spectral density for the RBF kernel, ρ is the kernel length scale and α is the marginal standard deviation of the kernel [34].

    Denote by θθ=(γs,γU,γT,Is0,Ir0,C^(t)) the parameters of the pathogen dynamics model. We can now factorize the model posterior π(θ,α,ρ,f1:mgs,gr), suppressing dependency on t where appropriate:

    The first two terms are computed using the coalescent likelihood in equation (2.7). The third term is given by combining equations (2.6), (2.10) and (2.12). The fourth term is obtained by combining equations (2.6), (2.12) and (2.13). Finally, the last term is given by
    where the first term is given by the Gaussian process (equations (2.15) and (2.16)) and the remaining terms correspond to the prior distributions listed below.

    2.5. Choice of prior and parameterization

    The model is parameterized with the priors summarized in table 1. The data are not expected to be very informative about the value of γs. As such, we impose a fairly informative prior on this parameter, centred around a guess γ* which must be known and supplied a priori. σ then governs how informative the prior is. We typically use a value of σ = 0.3, which includes relative fluctuations of over 50% in its 95% interval. The higher the value of σ, the more complicated the geometry and subsequently sampling of the posterior becomes. γT and γU represent the recovery rates for the resistant lineage when the resistant lineage is treated with the focal antibiotic of interest, or another antibiotic, respectively. A normal distribution centred at γs and truncated to positive values only is a natural choice. We choose its standard deviation to be 0.3γ* as this puts greater than 99% of the weight within 2γ* thus making implausibly large fluctuations unlikely. Such large fluctuations are hardly of interest here since they would lead to a very rapid selective sweep or extinction. The recovery rates γT and γU are related to the absolute changes in recovery and therefore fitness parameters using equation (2.5). γU > γs corresponds to faster recovery when the resistant lineage is treated with an antimicrobial it is sensitive to and therefore a cost of resistance. γT < γs corresponds to slower recovery when the resistant lineage is treated with the antimicrobial of interest and therefore a benefit of resistance. If instead a large proportion of posterior probability mass has γU < γs or γT > γs, we conclude that the result is consistent with either the cost or the benefit of resistance not being significantly present. The prior on ρ was chosen so that approximately 1% of mass lies on values of ρ < 0.2 and approximately 1% of mass lies on ρ > 2. The lower bound was chosen to avoid over-fitting, and the upper bound to suppress length scales that exceed the range of data and thus cannot be informed about by the data.

    Table 1. Summary of the parameters and priors used in the model.

    parameter symbol prior
    susceptible lineage recovery rate γs lognormal(logγ,σ)
    resistant lineage recovery rate if treated with focal antibiotic γT normal(γs,0.3γ)1[x>0]
    resistant lineage recovery rate if treated with other antibiotic γU normal(γs,0.3γ)1[x>0]
    initial prevalence of sensitive lineage Is0 lognormal(6,2)
    initial prevalence of resistant lineage Ir0 lognormal(6,2)
    GP kernel marginal variance α gamma(4,4)
    GP kernel length scale ρ inversegamma(4.63,2.21)
    approximate GP functions f1:m N(0,1)

    In practice, due to our choice of a sampling approach we need to parameterize γU and γT on an unconstrained space, and ideally also weaken the dependency on γs. To do so, we introduce parameters q~U and q~T, and define γU and γT to be a deterministic transformation of these:

    The Jacobian adjustment to the likelihood associated with this transformation is proportional to

    2.6. Computational implementation

    The posterior in equation (2.17) is a high-dimensional distribution and we expect many parameters to have a high degree of interdependency. In order to sample from this distribution, we use dynamic Hamiltonian Monte Carlo (HMC), a HMC sampler available in Stan [35]. HMC is a Markov chain Monte Carlo approach that due to possessing energy conserving properties is able to take large steps between individual states while maintaining high acceptance rates. This makes it efficient at sampling from moderately high dimensional posterior distributions with differentiable likelihoods, while requiring a much lower number of iterations. We implemented the model and inference method in an R package which is available at All results shown used four chains with 2000 iterations for warmup and 2000 iterations for sampling. For all model parameters and all analysis, the bulk effective sample size (bulk-ESS) was always greater than 500, and all R^ statistics were lower than 1.05 [36], values that indicate no issues with mixing. We also checked that there were no divergent transitions at least during the sampling phase.

    2.7. Use of simulated and real datasets

    For all simulations, we use a stochastic, discrete state-space version of the multi-lineage SIS in equation (2.1). The system is simulated using tau-leaping [37]. More specifically, we consider a scenario with three lineages simulated over the course of 19 years. Two lineages are set to be susceptible and thus unaffected by antibiotic usage fluctuations and one is set to be resistant. The first lineage aims to represent the unobserved bulk of the population and thus is set to start at much higher prevalence. Conditional on the trajectories of the two lineages, we sample phylogenies under Kingman’s coalescent with varying effective population size Ne(t) following equation (2.6) conditional on the trajectories [28]. The parameters for the simulation were selected as to consistently provide a reasonable range of plausible behaviours so that resistant lineages would reach prevalence with orders of magnitude between 102 and 104.

    A total of 1102 genomes were collected between 2000 and 2013 by the CDC Gonococcal Isolate Surveillance Project (GISP) [38]. A maximum-likelihood phylogeny was computed using PhyML [39], which was corrected for recombination using ClonalFrameML [40] and dated using BactDating [24]. This dated phylogeny is the same as previously used in an analysis of hidden population structure [41]. The distribution of primary antimicrobial drugs used to treat gonorrhoea among participants of the GISP between 1988 and 2019 was obtained from the GISP reports available at Note that usages of ciprofloxacin and ofloxacin were combined into a single fluoroquinolone category. All the data and code used in the simulated and real dataset analyses are available at

    3. Results

    3.1. Detailed analysis of a single simulated dataset

    To validate the performance of this model, we first resort to simulation from a three-lineages stochastic SIS with population size N(t), transmission rate β(t) and antimicrobial usage function u(t) varying over the past 20 years, as illustrated in figure 1. The first two lineages are susceptible and thus unaffected by fluctuations in antimicrobial usage, whereas the third lineage is resistant and therefore affected. The first lineage represents the bulk of the susceptible lineages and is thus left unobserved. The remaining two lineages represent the observed lineages, susceptible and resistant, respectively. The per-day recovery rate of the sensitive lineage was set to γs = 1/60, the fitness cost of resistance to qU = 1.25 and the fitness benefit of resistance to qT = −2.7. From each of these two observed lineages, a dated phylogeny with 200 leaves was simulated. The sampling dates were randomly assigned to one of the first 6 years, with the relative probability of a particular year being chosen proportional to the total prevalence in that year. We performed inference on this simulated dataset; the traces are shown in electronic supplementary material, figure S1, and the posterior distribution of the kernel parameters in electronic supplementary material, figure S2. The prevalence and reproduction number R(t) of both the susceptible and resistant lineages are shown in figure 2. As expected, the inferred values followed the correct values used in the simulation. The inferred values of the susceptible lineage recovery rate γs and the cost and benefit of resistance qU and qT were also found to be close to their correct values, as shown in figure 3. The posterior distribution of γs was almost identical to the prior, which was centred on the correct value 1/60, reflecting the fact that the data are uninformative about this parameter and stressing the importance of using an informative prior. There was a strong negative correlation between the inferred values of qU and qT, as expected since these two parameters play opposite roles in the overall fitness of the resistant lineage relative to the sensitive lineage. Nevertheless, we detected both the cost and the benefit associated with resistance, since the ranges of inferred values for qU and qT were, respectively, above and below one, contrary to their lognormal priors with mean one (figure 3). Finally, we computed the posterior predictive distribution [42] for the number of ancestral lineages through time A(t) and compared this with the input phylogenetic data (electronic supplementary material, figure S3). The data and posterior predictive trajectories were similar, indicating a good fit of the model to the data as indeed would be expected here since the same model was used for simulation and inference.

    Figure 1.

    Figure 1. Host population size function N(t), transmission rate over time β(t) and antibiotic usage function u(t) used in the simulated datasets.

    Figure 2.

    Figure 2. Posterior summary of dynamics for the sensitive (a,c) and resistant (b,d) lineages, showing prevalence (a,b) and reproduction number (c,d). Bold solid red lines indicate simulated values. Posterior median in bold dashed black line. Shaded bands indicate 95% posterior credible intervals. Solid light lines represent posterior draws.

    Figure 3.

    Figure 3. Marginal and joint posterior distributions for the recovery rate of the sensitive lineage (γs), fitness cost (qU) and fitness benefit (qT) of resistance. Bold red solid lines indicate simulation values. Bold blue dashed lines indicate prior density values.

    3.2. Benchmark using multiple simulated datasets

    We repeated the same application of our inference method to data simulated in the same conditions as described above and illustrated in figure 1, except the values of the fitness cost and benefit of resistance were varied. A total of 50 simulated datasets were generated and analysed, with the fitness cost qU increasing linearly from 1 to 1.2, and the fitness benefit qT decreasing linearly from 1 to 0.5. The prevalences of the susceptible and resistant lineages in these simulations are shown in electronic supplementary material, figure S4. The results of inference are illustrated in figure 4 and show that in almost all cases, the posterior 95% credible intervals covered the correct values of the fitness cost and benefit of resistance used in the simulations.

    Figure 4.

    Figure 4. Inferred parameters versus correct values. A total of 50 simulated datasets were generated, with decreasing values of qT and increasing values of qU as shown by the dotted lines in grey and blue, respectively. For each simulated dataset, we applied our inference method. The grey and blue dots show the mean inferred values of qT and qU, respectively, with vertical bars representing the 95% credible intervals for both parameters.

    3.3. Application to fluoroquinolone resistant N. gonorrhoeae in USA

    We demonstrate the use of our model and inferential framework by estimating the cost and benefit of fluoroquinolone resistance in N. gonorrhoeae. Based on the 1102 genomes collected between 2000 and 2013 by the CDC GISP [38], a recombination-corrected tree was constructed using ClonalFrameML [40] and dated using BactDating [24]. As there are two major fluoroquinolone resistant lineages present in this phylogeny [38], we decided to do a comparative study. The two fluoroquinolone resistant lineages and one fluoroquinolone susceptible lineage were selected based on similar resistance profiles against other relevant antibiotics. By inspecting the antibiotic usage data and the resistance profiles for the three lineages (figure 5), we can see that the resistance profiles match for antimicrobials that were in use as primary treatment at significant levels after 1995. As such, this is the year we set as the analysis start date (tmin = 1995) and the end date is the date when the last genomes were collected (tmax = 2013). Note that a subclade within the susceptible lineage that displayed a de novo gain of resistance to cefixime has been removed. The prior mean for the per-day recovery rate for the susceptible lineage was set to γ* = 1/90 based on previous gonorrhoea modelling studies [10,43,44].

    Figure 5.

    Figure 5. Antibiotic usage data and phylogeny used for the application to fluoroquinolone resistant N. gonorrhoeae.

    We performed inference for this dataset; the traces are shown in electronic supplementary material, figure S5, and the posterior distribution of kernel parameters in electronic supplementary material, figure S6. Figure 6 depicts the summary of posterior latent transmission dynamics for the two resistant lineages, whereas electronic supplementary material, figure S7, shows the same for the susceptible lineage. The two resistant lineages have similar dynamics, with a peak in prevalence around 2007, which corresponds to the moment when fluoroquinolone use dropped (figure 5). Figure 7 depicts the marginal and joint posterior distributions for the resistance parameters qU and qT for both resistant lineages. This is consistent with there being both a cost and benefit to fluoroquinolone resistance for both lineages, since both qT and qU are, respectively, localized below 1 and above 1, with high posterior probability. It is noteworthy that while both of these lineages come from distinct genetic background, their resistance profile is qualitatively very similar, indicating both of these lineages faced similar selective pressures and neither seems to have successfully adapted to overcome the fitness cost associated with fluoroquinolone resistance. We used a posterior predictive approach to ensure that the model can explain the data appropriately [42]. Posterior predictive trajectories for the function of ancestral lineages through time A(t) were simulated and found to be very similar to the ones implied by the phylogenetic data (electronic supplementary material, figure S8).

    Figure 6.

    Figure 6. Posterior epidemic dynamics for both fluoroquinolone resistant lineages of N. gonorrhoeae.

    Figure 7.

    Figure 7. Marginal and joint posterior distribution for the cost (qU) and benefit (qT) of both fluoroquinolone resistant lineages of N. gonorrhoeae.

    Under the assumption of perfect competition between lineages, if we want to ensure to that a resistant lineage cannot establish, and its proportion decays sufficiently fast, we fix a decay factor c > 0 and aim to ensure that the growth rate of the resistant lineage is c units lower than that of the sensitive lineage, that is rs(t) − rr(t) > c. Note that r(t) is the growth rate through time, not R(t), the time varying reproduction number. We choose to work with growth rates as these are less sensitive to susceptible recovery rate mispecification. Given that the lineages have the same transmission rate function b(t), this condition is equivalent to γs(t) − γr(t) > c, and using the definition of γr(t) from equation (2.3), this is equivalent to u(t)qT + (1 − u(t))qU > c. We use this to estimate posterior probabilities. The differences in growth rates between the susceptible lineage and each of both resistant lineages exceed c as shown in figure 8. In order to be 95% certain that the resistant lineages remain at a lower fitness than the susceptible lineage, fluoroquinolone should not be prescribed to more than approximately 20% and 15% of infected individuals, for resistant lineages 1 and 2, respectively.

    Figure 8.

    Figure 8. Posterior probabilities of Rr(t)/Rs(t) < C given usage u(t) in the x-axis and value of C in the y-axis, for both fluoroquinolone resistant lineages of N. gonorrhoeae.

    4. Discussion

    A bacterial pathogen lineage that is resistant to a given antibiotic incurs both a fitness cost and a fitness benefit compared to similar susceptible lineages [8]. When the antibiotic is used extensively, the benefit is likely to be greater than the cost. In that case, a resistant lineage has a selective advantage over susceptible lineages, and therefore grows at a faster rate. Conversely, if the antibiotic is used rarely or not at all, the benefit is likely to become smaller than the cost, which will lead to the resistant lineage decreasing in frequency. Estimating these parameters is therefore of primary importance to determine how antibiotics should be prescribed without causing an increase in resistance [9]. Here, we have shown how genome sequencing data coupled with data on antibiotic prescriptions can be used for this purpose, following on previous work that demonstrated the link between epidemic dynamics and phylogenetics [13,19,20,28]. By comparing the phylodynamic trajectories of susceptible and resistant lineages, and relating them with a known function of antibiotic use, we show that it is possible to estimate separately the parameters corresponding to the fitness cost and benefit of resistance. In particular, we reanalysed a large published collection of N. gonorrhoeae genomes [38]. We were able to infer these parameters for two lineages of N. gonorrhoeae resistant to fluoroquinolones, and found similar estimates of cost and benefit in both (figure 7). We were able to use this knowledge to make recommendations on antibiotic stewardship of fluoroquinolones (figure 8).

    Dated phylogenies for both susceptible and resistant lineages are needed as input into our method. Several software tools can be used to produce this either from a sequence alignment, for example BEAST [22] and BEAST2 [23], or from an undated phylogeny, for example treedater [45] and BactDating [24]. Building such a dated phylogeny requires either the population to be measurably evolving over the sampling period [46,47], or a previous estimate of the molecular clock rate [48]. Another input required by our method is the antibiotic usage function over a relevant timeframe and geographical location. This may not always be available in all historical contexts, but efforts are increasingly being made to capture these data [49]. Finally, our method requires an informative prior of the recovery rate for the susceptible lineage (table 1), since this is typically not identifiable from the data, as in many similar compartmental epidemic models [50]. This prior needs to be chosen carefully depending on the infectious disease under study and based on the existing scientific literature.

    Our inferential methodology is based on a well-defined and relatively simple epidemic model (equation (2.4)) which means making a number of assumptions the validity of which was considered before performing our analysis. Our model assumes multiple-lineage pathogen dynamics driven by person-to-person transmission in a well-mixed host population in the absence of any significant population structure, so that there is perfect competition between lineages. It also assumes that individuals become infectious as soon as they are infected, that their infectiousness remains constant until they recover, after which they become susceptible again without any immunity being gained. This list of relatively strong assumptions may seem to preclude application to any real infectious disease, but they are necessary to obtain a model under which inference can be performed. Furthermore, violation of some of these assumptions does not necessarily invalidate the results of inference. For example, if infection causes immunity, this will effectively reduce the number S(t) of susceptible individuals (equation (2.2)), but this number is not assumed to be constant in our model. In fact both the size N(t) of the host population and the number S(t) of susceptible individuals are integrated out as part of our parameterization in terms of the function b(t) (cf. equation (2.4)), so the inference is robust as long as the immunity conferred applies to all lineages under study. Likewise, the assumption of an unstructured population may seem problematic, including in our application to N. gonorrhoeae throughout the USA, but for anything other than small local outbreaks the genomes available for analysis are sparsely sampled from the whole infected population [51]. In these conditions, any effect of the host population structure on phylodynamics is likely to be insignificant as long as an effective rather than actual number of infections is considered [52,53].

    The compatibility of our model with the phylogenetic data under analysis can be tested using posterior predictive distribution checks (electronic supplementary material, figures S3 and S8). If these tests fail, or if the model assumptions are thought to be inappropriate, a solution may be to resort to other methods that postprocess a dated phylogeny [25] but make fewer assumptions, at the cost of not inferring directly the parameters of resistance. Alternative approaches includes non-parametric methods that detect differences in the branching patterns in different lineages [41,54] as well as methods parameterized in terms of the pathogen population size growth rather than underlying epidemiological drivers [15,55]. However, our model-based approach is both general and flexible, so that we expect it to be applicable in many settings using our software implementation which is available at We believe that this methodology, applied to the increasingly large genomic databases on many bacterial pathogens, will help quantify the exact link between antibiotic usage and resistance and therefore provide a much-needed evidence basis for the design of future antibiotic prescription strategies [9,56,57].

    Data accessibility

    All data and code used in this study are available from the GitHub digital repository:

    The data are provided in electronic supplementary material [58].

    Authors' contributions

    D.H.: conceptualization, investigation, methodology, software, writing—original draft, writing—review and editing; M.K.: conceptualization, funding acquisition, investigation, supervision, writing—review and editing; Y.H.G.: conceptualization, investigation, writing—review and editing; X.D.: conceptualization, funding acquisition, investigation, methodology, supervision, writing—original draft, writing—review and editing.

    All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

    Conflict of interest declaration

    We declare we have no competing interests.


    We acknowledge funding from the National Institute for Health Research (NIHR) Health Protection Research Unit in Genomics and Enabling Data. This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) grant no. EP/S022244/1 for the EPSRC Centre for Doctoral Training in Mathematics for Real-World Systems II.


    Electronic supplementary material is available online at

    Published by the Royal Society under the terms of the Creative Commons Attribution License, which permits unrestricted use, provided the original author and source are credited.