Accurate detection of HIV transmission clusters from phylogenetic trees using a multi-state birth-death model

HIV transmission networks are highly clustered, and accurate identification of these clusters is essential for effective targeting of public health interventions. This clustering affects the transmission dynamics of the HIV epidemic, which affects the pathogen phylogenies reconstructed from patient samples. We present a new method for identifying transmission clusters by detecting the changes in transmission rate provoked by the introduction of the epidemic into a new cluster. The method employs a multi-state birth-death (MSBD) model where each state represents a cluster. Transmission rates in each cluster decrease exponentially over time, simulating susceptible depletion in the cluster. This model is fitted to the pathogen phylogeny using a Maximum Likelihood approach. Using simulated datasets we show that the MSBD method is able to reliably infer both the cluster repartition and the transmission parameters from a pathogen phylogeny. In contrast to existing cutpoint-based methods for cluster identification, which are dependent on a parameter set by the user, the MSBD method is consistently reliable. It also performs better on phylogenies containing nested clusters. We present an application of our method to the inference of transmission clusters using sequences obtained from the Swiss HIV Cohort Study. The MSBD method is available as an R package.


Background
Each state i is characterized by a specific initial transmission rate λ 0,i , a transmission decay 114 rate z i , and a removal rate µ i . Each individual produces an additional individual with a state-115 and time-dependent transmission rate λ i (t) (function of λ 0,i , z i as defined below), and is removed 116 with a state-dependent removal rate µ i corresponding to the rate of "becoming non-infectious".

117
The depletion of the susceptible population is modeled by the exponential decay of the trans-118 mission rates in the process. Each state is associated with a specific initial transmission rate λ 0,i 119 and a transmission decay rate z i . Equation 1 shows the transmission rate for a lineage in state 120 i at time t before the present, where t 0,i is the time of introduction into state i. Since time is 121 backwards, we impose z i ≥ 0, so that the transmission rate decreases as the process progresses 122 towards the present. shown in Figure 1. We assume that the state changes correspond to introduction events in newly 129 infected clusters, so that all tips inferred to be in the same state belong to the same cluster in 130 the original transmission network.

Likelihood function
We now derive the probability density of a phylogeny (including the state change times) given 135 the MSBD parameters, i.e. we derive the likelihood of the parameters given a phylogeny.
143 dq i,N dt (t) = −(γ + λ i (t) + µ i )q i,N (t) + 2λ i (t)q i,N (t)p i (t), q i,N (t s ) = µ i σ if N leads to a tip at time t s , q i,N (t t ) = λ i (t t )q i,N (t t )q i,N (t t ) if N undergoes transmission at t t , leading to N and N , q i,N (t c ) = γ n * − 1 q j,N (t c ) if N changes to state j at t c . (3) The probability of a phylogeny starting at root time τ with initial state I is q I,N (τ ) so the full 144 likelihood can be calculated from Eq 3. Rather than writing it recursively as in Eq 3, it can be This likelihood function can be applied to trees with or without a root edge, i.e trees starting 152 with one lineage or two at time τ . Since the real number of clusters in the underlying network n * is unknown, we need to estimate 156 it. However this parameter only appears in the likelihood in the factor γ n * −1 n−1 so maximizing 157 the likelihood is equivalent to minimizing n * . We further assume that each migration enters a 158 previously not visited state, i.e. n * ≥ n. Together, the maximum likelihood estimate will always 159 be n * = n. Thus we fix n * = n in the inference. The equations for p and f N do not have an analytical solution. Numerical integration is compu-162 tationally expensive and can be unstable for certain parameters, so we make the assumption that 163 no state changes happen in the unsampled parts of the tree, meaning we observe all state changes 164 in the final tree. With this assumption, the master equation for p i (t) changes to Equation (5), These equations have an analytical solution for constant transmission and removal rates, but not 166 necessarily for time-dependent rates. To obtain a closed form solution, we use time discretization 167 and assume that in each time step the transmission rate can be considered constant, as described 168 in the next section.

Time discretization
We discretize the time-dependent transmission rates by assuming that they can be considered 171 locally constant on small enough intervals. The grid size used for the discretization is fixed across 172 the tree and needs to be specified by the user. A smaller size will improve the accuracy of the 173 likelihood calculation but also increase the computational cost.

174
Time discretization for p 175 A closed form of the extinction probability and the likelihood function can be obtained for 176 piecewise constant transmission and removal rates. Assuming constant rates in Eqn 5, and a 177 generic initial condition p i (t IC ) = V IC (rather than the initial condition p i (0) = 1), we obtain 178 an analytic solution of Eqn 5, This solution can be verified by differentiating the solution and substituting the result into Eqn 180 5.

181
To obtain p i (t) using this time discretization, we divide the time interval [τ ; 0] into a grid.

182
Starting with p i (0) = 1, we can then evaluate p i using Eq 6 in each grid interval going backwards 183 in time, using as initial value the solution of the previous grid interval. 184 Time discretization for f N

185
A closed form solution of the edge likelihood function f N can now be calculated, for a small 186 time interval [t l ; t l−1 ] on an edge N in state i. This expression uses the value of p i (t l−1 ), which 187 can be calculated as explained in "Time discretization for p". We define f N (t l , t l−1 ) = and obtain This expression for f N (t l , t l−1 ) is a solution of the differential equation 3 with f N (t l−1 ) = 1, 190 8 assuming the rates are constant in interval [t l , t l−1 ] and using the approximate function p i (t) 191 from Eq. 6. This can be easily verified by differentiating Eq. 6 and Eq. 7 and substituting the 192 resulting expressions d dt p i (t) and d dt f N (t l , t l−1 ) into the differential equations 5 and 3. Equations of them can be found in (17). 195 We now describe how to calculate f N and obtain an evaluation of the likelihood provided in We now present an algorithm which identifies the state change configuration and associated 211 parameters that maximize the likelihood in Eq. 4 for a particular phylogeny T .

228
Once a configuration has been found in which no more state changes can be added to improve 229 the likelihood, we will attempt to recursively remove all the states from this configuration. This 230 step is designed to compensate partly for the fact that the greedy approach never goes back on 231 previous state change assignments, and so can end up in sub-optimal configurations.

232
Once no further improvements of the likelihood can be obtained by either adding or removing 233 a state, the method will return the best fitting model found, including the state configuration 234 and the maximum likelihood estimates for all parameters.

235
The full algorithm, including the initial coarse-grained search phase, is as follows:  The model and the likelihood function allow for state changes to be placed anywhere on an edge.

263
The implementation of the algorithm allows for the time positions of changes to be estimated as 264 additional parameters, but this is computationally expensive especially when the number of state 265 changes grow. As a consequence we also provide the option to limit the positioning of changes 266 to predetermined positions on edges: they can be positioned at either 10%, 50% or 90% of the 267 length of the edge they are on. An intermediate option is also available, which will test all three 268 predetermined options and keep the most likely. The algorithm as presented in the previous sections is fast at the beginning of the inference but 271 will progressively slow down as more states are added, due to the increase in the number of 272 parameters that need to be optimized. 273 We have thus added a so-called 'fast optimization' option, which limits the number of pa-274 rameters which are allowed to change during one step of the maximum likelihood optimization.

275
In practice, when adding the n-th state change, only the parameters λ 0,n+1 , λ 0,a , z n+1 , z a , µ n+1 In all types of networks, edges between communities are weighted, with the weight value 0.25, 299 0.5, 0.75, or 1. This means that the rate of transmission on these edges is respectively 25%, 50%, 300 75 % and 100% of the transmission rate on within-community edges.

301
Epidemics were simulated on these networks starting from one random introduction in A 302 networks, one random introduction in the main community in B networks, and two random 303 introductions in C networks. All infected individuals were sampled upon removal and a trans-

328
The correspondance between the real network communities and the clusters inferred from the 329 tree was assessed using the Adjusted Rand Index (ARI) (18; 19). We compare the results from 330 our method to the results obtained by (10) using cutpoint-based clustering methods.  removal and the process was ran until the tree reached 50 sampled tips. Since these trees were not 375 built from network simulations we did not try to assess the quality of the cluster inference, but 376 we focused on the quality of the parameter inference and on whether our method can adequately 377 distinguish between trees that contain several clusters and trees that do not.

15
The results are summarized in Table 3. We can see that although the MSBD method is able to consistently infer multiple clusters when they are present, it will also wrongly detect one 380 additional cluster in around 25% of the trees that only contain one cluster. This may be a problem 381 of noise, where due to the stochasticity of the simulation one subtree is slightly more likely to be 382 attributed different rates than the rest of the tree. This problem can be alleviated by looking at 383 the difference in the inferred transmission rates of each cluster, which are also outputted by our 384 method: a smaller difference is more likely to be indicative of noise. As previously noted, the 385 method also tends to underestimate the number of clusters in multi-cluster trees, mostly because 386 it cannot detect clusters below a certain size.

387
Regarding the parameter inference, the method has a slight bias towards overestimating 388 the transmission rate and underestimating the removal rate. This is potentially due to our

393
In conclusion, the parameter inference from the MSBD method is reliable, although it suffers 394 from noise when applied to trees which contain only one cluster. Since every step of the algorithm involves testing all edges of the tree, the "fast" option is thus 412 necessary to ensure completion of the inference on trees with more than 10 clusters. As with other cutpoint-based methods, the results depend strongly on the user-defined values. 436 We used three different cutpoint values for the genetic distance: 1.5%, 4.5% and 8%. 4.5% is the     (λ 3 ,µ 3 ) (λ 1 ,µ 1 ) 0 t c t s t t Figure 1: Visual representation of the phylogeny under a MSBD model. Each state is represented by a colour: the ancestral state, in black, starts at the root and represent the first cluster infected. The other states, in blue, red and green, start at change points along the tree. These states represent the clusters infected later in the course of the epidemic and the state change points represent the introduction event for each associated cluster.  Figure 2: Comparison of the average ARI obtained by the different clustering methods in function of the set cutpoint on networks A (parts A1,A2), B (parts B1,B2) and C (parts C1,C2). For each network the first column (part 1) shows the results for weight w = 0.25 and the second column (part 2) for w = 1. Our proposed MSBD method is not dependent on a cutpoint. The cutpoint range for Definition 3. is shown on the x-axis on the top, the cutpoint range for all other definitions are shown on the x-axis at the bottom. and likelihood difference (B) on one step of the algorithm when using the 'fast optimization' option compared to the default settings. The dataset used was divided in three categories based on the number of state changes already found by the inference before the test was run.