Probabilistic reconstruction of measles transmission clusters from routinely collected surveillance data

Pockets of susceptibility resulting from spatial or social heterogeneity in vaccine coverage can drive measles outbreaks, as cases imported into such pockets are likely to cause further transmission and lead to large transmission clusters. Characterizing the dynamics of transmission is essential for identifying which individuals and regions might be most at risk. As data from detailed contact-tracing investigations are not available in many settings, we developed an R package called o2geosocial to reconstruct the transmission clusters and the importation status of the cases from their age, location, genotype and onset date. We compared our inferred cluster size distributions to 737 transmission clusters identified through detailed contact-tracing in the USA between 2001 and 2016. We were able to reconstruct the importation status of the cases and found good agreement between the inferred and reference clusters. The results were improved when the contact-tracing investigations were used to set the importation status before running the model. Spatial heterogeneity in vaccine coverage is difficult to measure directly. Our approach was able to highlight areas with potential for local transmission using a minimal number of variables and could be applied to assess the intensity of ongoing transmission in a region.


Introduction
Establishing who infected whom during an outbreak can help inform the design and evaluation of control measures [1][2][3][4][5]. Transmission links can be reconstructed through contact-tracing investigations, whereby cases are asked their movements and contacts during their infectious period. Given that contact-tracing investigations are not always carried out due to the logistical effort and cost involved, inference methods have been developed to use epidemiological data to estimate the probability that a transmission event occurred between any given pair of cases [6][7][8][9][10][11][12]. This makes it possible to establish probabilistic transmission trees that link all observed cases. The ensemble of cases belonging to the same transmission tree is called a transmission cluster.
Wallinga & Teunis [2] first developed a likelihood-based estimation procedure to reconstruct probabilistic transmission trees from a given distribution of generation times and observed symptom-onset dates of each case. Since then, genomic, spatial or contact data have been used to supplement the timing of symptoms, which helped identify determinants of transmission, mixing behaviour, individual dispersion, evaluate control measures, anticipate future developments of outbreaks and study viral evolutionary patterns [5,8,9,[13][14][15][16][17].
As sequencing of pathogens has become more common, the use of such data to infer transmission trees has increased. Methods developed to add genetic distance to a Wallinga-Teunis algorithm, where cases with lower genetic distance are more likely to be grouped in the same transmission group, showed it substantially increased the accuracy of the reconstructed transmission trees [8,[18][19][20][21].
The utility of sequence data depends on the characteristics of the pathogen [22,23]. Based on the highly variable 450 nucleotides region of the N gene (N-450) of the measles virus genome, eight measles genotypes have been detected since 2009 [24,25]; these genotype designations are helpful in linking cases, as linked cases must be infected by a virus of the same genotype [25]; however, the diversity of measles genotypes is decreasing [26]. It has been suggested that further sequencing the M-F non-coding region, or full genome sequencing, could help identify measles virus transmission trees, but so far, extended sequencing during measles outbreaks has been scarce [27,28]. In addition, the evolutionary rate of measles virus is very low [29]; therefore, samples from unrelated cases can be very close genetically and genetic sequences from measles cases are not usually indicative of direct transmission links [27,28].
As measles is highly infectious, under-immunized communities (also called pockets of susceptibles) resulting from local heterogeneity in vaccine coverage can lead to large, long-lasting outbreaks [30][31][32][33][34]. Detecting these pockets of susceptibles can be challenging, as historical local values of coverage throughout a given country are rarely available. The number of cases in the transmission trees resulting from each importation during outbreaks, also called the cluster size distribution, will depend both on individual factors (e.g. age of the imported case which might affect contact patterns) and community factors (e.g. the history of coverage in the area) [35,36]. The size of a cluster can, therefore, reflect the level of susceptibility of individuals directly and indirectly connected to the imported case [37,38].
Here, we introduced a model combining age, location, genotype and rash onset date of cases to reconstruct probabilistic transmission trees. We chose these features to make the model applicable to a wide range of settings as they are commonly reported and informative on transmission. We wrote the R package o2geosocial to conduct inference on individual-level data using this model. It is based on the package outbreaker2 and is designed for outbreaks with partial sampling of cases, or uninformative genetic sequences, such as measles outbreaks [9,39]. We used the likelihood of transmission links between different cases to estimate their importation status. We compared the inferred importation status and cluster size distribution to the transmission clusters identified via contact tracing during measles outbreaks in the USA between 2001 and 2016.

Presentation of the algorithm
Transmission trees are used to represent who infected whom during an outbreak. They are directed acyclic graphs, where nodes are the reported cases and edges show the connection between them. The root of each transmission tree is an imported case, i.e. a case who was infected in a different transmission setting. The cases placed in the same transmission tree form a transmission cluster. We estimated the number of cases per cluster (cluster size distribution) and the importation status of the cases from probabilistic transmission trees inferred using routinely collected epidemiological variables.
We used a Metropolis-Hastings algorithm with Markov chain Monte Carlo (MCMC) to classify a set of cases into a set of transmission trees with associated probabilities quantified using a Bayesian model to combine the epidemiological features of the cases. At every iteration of the MCMC algorithm, we proposed a new set of model parameters, infection dates and connections between cases. These three elements formed a tree proposal. We computed the ratio between the posterior probability of this proposal and the current posterior probability. The posterior probability (up to a multiplicative constant which would cancel out when calculating the ratio) was calculated from the likelihood of the trees, and the prior probability of the parameters. The log-likelihood of each tree was equal to the sum of the log-likelihoods of each case. All the notations are defined in table 1.

Likelihood function and parameter definition
In a tree proposal, each case i was assigned an infector j and an infection date t i . We computed the log-likelihood of each case, L i (t i , j, t j , θ) to calculate the likelihood of the tree. The log-likelihood of i was split in two: (i) the log-probability density of observing the onset date T i if case i had been infected at time t i log(f(t i − T i )) and (ii) the log-likelihood of connection between i and j L ji (t i , t j , θ ), with θ the parameter set of the model (2.1): ð2:1Þ royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200084 The function f represents the distribution of the incubation period. The log-likelihood of connection L ji was computed from five components reflecting the age group, genotype, location, inferred date of infection of cases i and j, and the report ratio (2.2). We allowed for an indirect link between cases due to unreported individuals, κ ji corresponds to the number of generations between i and j. If κ ji = 1, j infected i, whereas if κ ji = 2, an unreported case infected by j infected i, κ ji increases with the number of missing links between i and j We calculated the temporal probability of transmission between i and j from the number of days between t i and t j and the distribution of the generation time of the disease w(t). This probability was quantified by w (k ji ) (t i À t j , k ji ), w (k ji ) ¼ wÃwÃ . . . Ãw, where * is the convolution operator applied κ ji times. We used a geometric distribution p(κ ji |ρ) to quantify the probability of observing κ ji missing generation between i and j, given the conditional report ratio ρ. The conditional report ratio quantifies the probability of missing generations between two connected reported cases. Entire missing clusters, cases infected after the last cases or cases infected before the ancestor of a cluster would not interfere in the connection between two cases and, therefore, would not affect the value of the conditional report ratio. The conditional report ratio can be higher than the overall report ratio of an outbreak. The 'ancestor' is the earliest identified case in a cluster.
a(α i , α j , κ ji ) was defined as the probability of transmission between age groups α i and α j . This probability corresponds to the proportion of contacts to the age group α i that originated from α j and can be deduced from studies such as POLYMOD [36]. We defined G(g i , g tj ) as the probability of observing the pathogen genotype g i in case i in the tree τ j containing case j. There can only be one measles virus genotype per transmission tree, or cases with unreported genotype. The genotype g τj is the genotype contained in the tree τ j and is known if at least one case in τ j had a reported genotype g tj unknown 1 if g i and g tj both known and g i ¼ g tj 0 otherwise s(r i , r j , κ ij ) was defined as the probability of connection from r j to r i , regions of residency of i and j (2.4). We used an exponential gravity model to quantify the connectivity of the different geographical units [40]. This approach showed good performance at modelling short-distance commuting, and was easy to parametrize [40][41][42][43][44]. In the simplest form of the exponential gravity model, the number of connections between r i and r j is proportional to the product of the origin population m rj , the destination population m ri and an exponential decrease of the distance between r i and r j d rjri : n rjri / e ÀaÂdr j r i Â m b rj Â m c ri , with a, b and c parameters adjusting for the impact of distance and population. From this definition, we deduced s(r j , r i ), the spatial probability of transmission from i to j s(r i ,r j ) ¼ n rjri P h n hri Only the parameters a and b were required to compute the spatial probability of transmission. If r i = r j , then (2.4) becomes: Other distributions than the exponential decrease can be used in this framework if transmission follows a different pattern.
The parameters ρ, a and b were estimated. At each iteration of the MCMC, the log-likelihood of the trees was equal to the sum of all individual log-likelihoods L i from equation (2.1). The log-posterior density of the proposed trees was calculated by summing the overall log-likelihood of the trees and the log-priors of the parameters.

Tree proposals
We used a Metropolis-Hastings algorithm with MCMC to sample from the posterior distribution of parameters and transmission trees. To do this, we developed a set of proposal tree updates. These updates were accepted with acceptance probability as defined by the Metropolis-Hastings algorithm [45]. We used eight types of tree proposal to ensure good mixing. Each proposal conserved the overall number of trees, with a maximum of one unique genotype reported per tree.
Five of the proposals had already been implemented in the outbreaker2 package and were adapted to this setting: (i) change the number of generations between two cases; (ii) change the conditional report ratio ρ; (iii) change the time of infection; (iv) change the infector of a case (if the case is not the ancestor of a tree); (v) swap infector-infectee (if none is the ancestor of a tree).
We added two proposals to change a and b, the spatial kernel parameters. For each proposal, the probability of transmission between every geographical unit was recalculated with the new values. The distance matrix had to be computed for each number of generations between cases, which considerably slowed down the algorithm. As we could not use sequence data, assessing whether a case was isolated or whether it was connected to a reported infector with two missing generations would be very challenging using our model alone. Therefore, we limited the maximal number of missing generations to 1 when a or b were estimated (max(κ ji ) = 2). Finally, the last proposal was designed to change the ancestor of the tree while conserving the overall number of trees (figure 1).

Inference of importation status and cluster
Unrelated measles cases stemming from different importations and different regions can be part of the same dataset. Grouping cases and excluding unrealistic transmission links reduces the number of possible trees and speeds up the MCMC runs. To do so, we listed each case's potential infectors using three criteria: (i) the potential infectors must be of the same genotype as the case, or have unreported genotype, (ii) the location of potential infectors must be less than γ km away from the case, and (iii) the potential infectors must have been reported later than δ days before the case. This threshold should be determined from the maximum plausible generation time of the disease. The spatial threshold γ should be defined according to the relevance of long-distance transmissions. Cases with no potential infector were considered as importations. Otherwise, they were grouped together with (i) their potential infectors and (ii) cases with common potential infectors.
After grouping the cases, we estimated their importation status and the cluster size distribution using two runs of MCMC (figure 2). The first run was shorter and aimed at removing the most unlikely connections among each group, as they can reflect unrealistic estimates for incubation periods or generation times and corrupt the estimation of the date of infection. We defined a reference threshold λ, whereby if the individual value of log-likelihood L i was worse than λ, then the connection between i and their infector was considered unlikely. In Outbreaker2, λ was a royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200084 relative value, defined from a quantile of the individual loglikelihoods. In o2geosocial, λ can be a relative value or an absolute value, chosen from the number of components of the likelihood. For each sample saved from the short run, we computed the number of unlikely connections n. If there was no iteration where all connections were better than λ, min(n) new importations were added to the initial tree for the long run ( figure 2).
Finally, we ran a long MCMC chain and obtained samples from the posterior distribution. After removing the burn-in period and thinning the chain, we deleted the unlikely transmission links in each iteration and identified transmission clusters. Therefore, unlike the previous versions of outbreaker2, the number of importations in each sample can vary and the individual probability of being an importation can be computed (figure 2). To evaluate the performance of the model, we inferred the transmission clusters from a dataset that also included information on whether measles cases were part of a cluster based on contacttracing investigations. Measles cases in the USA are reported by healthcare providers and clinical laboratories to their corresponding health department. Each case is investigated by local and state health departments classified according to standard case definitions [46], and linked into clusters epidemiologically (e.g. by establishing a direct contact or a shared location between cases, or when cases are part of a specific community where an outbreak is occurring). Cases are considered internationally imported if at least part of the exposure period (7-21 days before rash onset) occurred outside the USA and rash occurred within 21 days of entry into the USA, with no known exposure to measles in the USA during the exposure period. Confirmed measles cases are routinely reported by state health departments to the CDC. A total of 2098 measles cases were reported in the USA between January 2001 and December 2016. The number of annual cases did not exceed 700 cases during this time period (figure 3; electronic supplementary material, figure S1). The importation status, 5-year age group, onset date, county and state of residence were fully reported for 2077 cases. The 21 cases with missing data were discarded. Twenty-five per cent of the cases were classified as importations. Thirty-nine per cent of the cases had their genotype reported.

Validation
Among cases with complete data, 737 independent clusters, containing 1-380 cases, were reconstructed through contacttracing investigations. Not every identified case could be linked to an importation, and some transmission clusters contained multiple imported cases (e.g. when related individuals travel together to a foreign country and were infected there). Out of the 737 reference clusters, 38 had several cases classified as importations, 256 had none identified.

Model and parameters
The distributions and priors used in the studies are listed in table 2. As no studies quantifying the probability of age-specific contacts have been carried out in the USA, we used the estimates from the POLYMOD study in the UK [36]. The incubation period and the generation time of measles were taken from previous studies [47][48][49]. We used the population centroid of each county to compute the distance matrix [50]. We used a beta distribution as the prior of the conditional report ratio [8]. The mean of the prior distribution was calculated using the number of clusters whose first case was not classified as an imported case, meaning the investigations were not able to trace back to the first case imported. As there was no prior information on the possible values of the spatial parameters a and b, we used uniform distributions between 0 and 5.
For pre-clustering of cases, we set the temporal threshold δ to 30 days, which is above the 97.5% upper quantile of the generation time with a missing generation. We were interested in local transmission to describe the impact of an imported case on a community. But we only had information on the county of residency for each case. Counties are large geographical units: the average county land area is 2911 km 2 and the maximum values reach 50 000 km 2 . Therefore, we set the spatial threshold γ to 100 km to exclude long-distance transmission, while still allowing for cross-county transmission.
Finally, we tested several relative and absolute importation thresholds λ. Absolute values were calculated from a factor k, multiplied by the number of components in L i , excluding the binary genetic component. Tested values were k = 0.05 (λ = log(0.05)*5 = −15) and k = 0.1 (λ = −11). Connections were considered unlikely if the log-likelihood was worse than λ. Relative values were quantiles of all recorded log-likelihoods in the sampled trees (table 2).

Inference of importation status
Using the contact-tracing investigations, we considered three different initial distributions of the importation status. In scenario 1, there was no inference of the importation status of cases, and the first case of each epidemiological cluster was classified as importation (ideal importation). In scenario 2: there was no inference of the importation status of cases, and all cases identified as importation in the contact-tracing investigations were classified as importations (epidemiological importation). Finally, in scenario 3, the importation status of cases was inferred, using different thresholds λ, and using no prior information on the importation status of cases or the importation status from the contact-tracing investigations.

Inference of clusters
In order to compare the inferred and reference clusters, we calculated for each case i: (i) the proportion of cases from the same reference cluster as i that were inferred with i (sensitivity) and (ii) the proportion of cases in the same inferred cluster as i that were part of the reference cluster ( precision). These values were calculated at every iteration, and the median values were used to evaluate the fit obtained with different values of λ. We also compared the inferred cluster size distribution to the reference data. The credibility intervals for each case are reported in electronic supplementary material, figure S2.

Results
We clustered 2077 measles cases reported in the USA between January 2001 and December 2016 using their onset date, age groups, location and genotype. Using the contacttracing investigations, we considered three different initial importation status distributions: (i) only the ancestors of each epidemiological cluster (first case of each cluster) were importations (ideal importation), (ii) all cases classified as importation in the contact-tracing investigations were importations (epidemiological importation), (iii) no prior information on importation status of cases. The importation (3) initial tree Figure 2. Estimating importation status and cluster size distributions in two MCMC runs.
Step 1: initial tree obtained after pre-clustering, with the minimum number of importations (here 2, as there are two reported genotypes).
Step 2: samples from the first short run, with red lines showing connections worse than the arbitrary threshold λ.
Step 3: initial tree for the final run, with one more importation than in step 1, which corresponds to the minimum number of unlikely transmissions at step 2.
Step 4: samples from the long run.
Step 5: final trees used to compute cluster size distribution and importation status of each case. Case 7 is an importation in one-third of the final samples, whereas case 3 is an importation in all of them.
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200084 status of the cases was, therefore, not probabilistically inferred in scenarios 1 and 2. The length of the short preliminary run was 30 000 iterations and the main run was 70 000 iterations. For each run, the trace of the posterior distribution shows the convergence of the algorithm (electronic supplementary material, figure S3).
In scenario 1, we did not infer the importation status of cases. The inferred cluster size distribution matched the contact-tracing investigations (figure 4a); 98% of the reference singletons were also isolated in the inferred cluster. For 94% (95% credibility interval: 91-98%) of cases, the inferred cluster had a sensitivity and precision above 75%, meaning more than 75% of the cases in the inferred cluster were in the reference cluster, and more than 75% of the cases in the reference cluster were in the inferred cluster (figure 4b). For 80% (78-93%) of cases, the inferred clusters were a perfect match with the reference clusters. The cluster size distribution stratified by state was similar to the contact-tracing investigations (electronic supplementary material, figure S4). Therefore, when each ancestor was considered as an importation, the inferred clusters were very close to the reference ones.
In scenario 2, we used the importation status distribution of cases reported in the contact-tracing investigations (539 importations). Pre-clustering highlighted 165 cases with no potential infector, which were also classified as importations. We observed discrepancies between the inferred cluster size distribution and the reference one: among the 704 cases inferred as importation, 61 (9%) were not importations in the reference cluster. Furthermore, 94 cases were the ancestor of a reference cluster and were not classified as importations in the inferred clusters (13%). The overall cluster size distribution matched the reference distribution, but 111 reference singletons were inferred as part of transmission clusters (figure 4a; electronic supplementary material, figure S5). Although the precision of the inferred cluster was above 75% for 93% (88-93%) of the cases, 31% (6-39%) had a sensitivity score below 0.5, meaning they were classified with less than half of the cases from their reference clusters (figure 4c). The discrepancies observed in this scenario are due to inconsistencies between the importation status distribution and the clustering of cases in the contact-tracing investigations, as reference clusters that gathered several importations were split into different inferred clusters.
In scenario 3, we used different threshold λ to infer the importation status of cases. We tested λ = −15, λ = −10 (absolute value), and λ = 95th centile of all recorded log-likelhoods (relative value). For each case i, if the log-likelihood L i was  worse than λ, the connection between the case and its infector was removed and the case was considered imported. Firstly, using an absolute factor λ = −15, 586 (581-593) cases were classified as importations, and 361 (355-369) of them were singletons. These numbers are much lower than the reference dataset that contains 737 clusters, and 539 singletons (figure 5a; electronic supplementary material, figure S6). However, very few cases inferred as importations or singletons were not classified as such in the reference dataset (15 (10-22) misclassified importations, 4 (0-14) misclassified singletons), and the cluster size distribution for clusters including two cases and more was very similar to the reference one. The precision of the reconstructed cluster was very high (above 75% for 88% (85-93%) of cases) ( figure 5b). Overall, the algorithm was not able to accurately identify importations and singletons as the threshold was too low to eliminate some unrealistic connections, but the inferred larger clusters matched their reference counterparts. We then observed the impact of increasing λ on the inferred cluster size distribution. Runs obtained using an absolute threshold with λ = −11 and 95% relative threshold yielded very similar results. The number of cases inferred as importations was higher than in previous runs, while all remaining links showed good connection between cases. The number of importations was closer to the reference dataset, and the number of singletons was greater than the reference. Nevertheless, 11% (10-12%) of the inferred importations was not classified as importation in the reference clusters. Furthermore, the number of two-case chains was overestimated, and bigger clusters were likely to be split because of the removal of weaker connections. Therefore, increasing λ did not improve the cluster size distribution, as many importations in the reference clusters were not identified and the number of mismatches increased (electronic supplementary material, figure S7).
Finally, we combined prior information and inference of importation status to create a scenario where the importation status of only a proportion of the cases is known, because of disparities in the contact-tracing investigations. This scenario is relevant for a dataset combining different outbreaks scattered across a large area or a long period of time. Cases considered as importations in the contact-tracing investigations were set as importations, and we inferred the importation status of the remaining cases. We used a low threshold to remove the least likely transmission links (λ = −15). Including prior information led to some misclassification of importation status due to the inconsistencies between the epidemiological importation status and the reference clusters. As in scenario 2, some cases were classified with only part of their reference clusters because clusters with several importations were split into different clusters. Indeed, the sensitivity royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20200084 score of 34% (7-51%) of cases was below 0.5. Nevertheless, the cluster size distribution observed in the simulation was the closest to the reference clusters. There were 725 (719-731) clusters, 89% of importations were also ancestors of reference clusters and the number of singletons matched the reference clusters (figure 5a-c). The inferred clusters of 88% (86-94%) of the cases had a precision score of 1, showing they were clustered without any false positives. Despite discrepancies in several states (Massachusetts, Ohio), the cluster size distribution stratified by state showed good agreement with the reference clusters (electronic supplementary material, figure S8). The conditional report ratio in the transmission chains ρ and the spatial parameters a and b was estimated in each scenario. The parameter estimates did not depend on the prior importation status distribution or the value of λ. ρ was consistently estimated above 90%, showing a low number of missing generations between cases (electronic supplementary material, figure S9). High values of ρ show that most of the reported cases could be connected without missing generations. This is not representative of the overall report ratio, which is usually much lower [51].
There was little variation in the estimates of the spatial parameters between the different scenarios. The population parameter a was estimated between 0.6 and 1 for every scenario, and the distance parameter b was between 0.08 and 0.12. In every scenario, more than 80% of the inferred transmission were between cases distant of less than 10 km, and few longdistance transmissions were recorded (50-100 km); hence, although most of the reconstructed connections were between cases from the same county, the algorithm was able to identify clusters spreading over several counties or states (electronic supplementary material, figure S10).
We highlighted the added value of including the spatial distance between cases in the likelihood by comparing the cluster size distribution inferred by selecting certain components of L i (electronic supplementary material, figure  S11). The credibility intervals were much wider when the distance between cases is not part of the likelihood, and the number of chains containing 2-10 cases was overestimated. The important impact of the spatial component of likelihood was also due to the widespread American territory, and could be lower in a different setting.
We used the ratio of the number of importations over the number of subsequent cases per state to evaluate the intensity of transmission in each state between 2001 and 2016 ( figure 6). The maps obtained in scenario 1 (ideal scenario) or in scenario 3 (estimation of importation, with epidemiological importations and λ = −15) were very similar. We only observed minor differences, for example, in South Dakota and in Massachusetts, where the ratios were higher in scenario 3. The highest ratio (31.8 in scenario 1) was observed in Ohio, and is mostly due to a 383 case outbreak in 2014 Similarly, we used the inferred transmission chain to compute the inferred reproduction number in each state. According to the model, about 60% cases did not cause future transmission, and about 5% caused more than five subsequent cases (electronic supplementary material, figure S12). These numbers were consistent in each run. The geographical distribution of reproduction number was very similar to the importation-subsequent cases ratio (electronic supplementary material, figure S13).

Discussion
We developed the R package o2geosocial to classify measles cases into transmission clusters and estimate their importation status using routinely collected surveillance data (genotype, age, onset date and location of the cases). As recently observed during the 2018-2019 measles outbreak in New York, delays in childhood vaccination, local susceptibility and increased contacts can lead to large outbreaks following importations [52,53]. Therefore, we were interested in highlighting the effect of imported cases on communities and we focused on short distance transmission to identify areas where they repeatedly caused subsequent transmission chains. Although this is not predictive of future transmission, it highlights communities with potential for large transmission clusters.
We compared the inferred transmission clusters to the contact-tracing investigations of 2077 confirmed measles cases reported in the USA between 2001 and 2016. We were able to produce reliable estimates of known transmission clusters using epidemiological features with only few misclassifications. Estimating the importation status of cases without prior knowledge was challenging and caused uncertainty on the results. We tested different threshold λ to eliminate unlikely transmissions, and we were able to identify most of the imported cases.
Nevertheless, if several cases were imported in the same region at a similar time, we could not find all of them without discarding valid transmission events, and increasing the number of false positives. When we used the importation status as defined in the contact-tracing investigations without probabilistic inference (scenarios 1 and 2), the reconstructed clusters were similar to the reference ones. Results were also conclusive when we combined prior information and importation inference. The reconstruction of transmission greatly depends on the epidemiological investigations to identify measles importations in a community.
We used the genotype to censor connections between cases when it was reported, as there can be only one reported genotype per transmission cluster. Using a simulated dataset (toy_outbreak_long in o2geosocial), we explored the impact of increasing the proportion of genotyped cases on clustering and observed it could help identify the number of concurrent transmission trees when multiple genotypes are co-circulating. Moreover, we introduced a spatial component to the likelihood of connection between cases using an exponential gravity model. Previous studies showed this model was able to capture short-distance dynamics better than other gravity models, and was easy to parametrize. Introducing the spatial component greatly improved the precision and the sensitivity of the reconstructed clusters (electronic supplementary material, figure S11), and the parameter estimates were robust in the different scenarios.
The final results on the clustering of the 2077 cases using o2geosocial were obtained in 7 h for each run of 100 000 iterations on a standard desktop computer (Intel Core i7, 3.20 GHz 6 cores), which is much faster than previous implementations of outbreaker and outbreaker2. With the addition of the pre-clustering step, whereby we reduced the number of potential infectors for each case, the algorithm ran faster. For smaller chains (50 000 iterations), 4 h were needed to estimate the importation status and cluster the cases. The code for the package and the analysis developed in this project is shared on Github (https://github.com/alxsrobert/o2geosocial and alxsrobert/datapaperMO), with an illustrative toy dataset, and can be used to analyse recent outbreaks where contact-tracing investigations were not carried out. Although the results obtained are promising, it should be noted that the dynamics of measles transmission in the USA are likely to be very specific to this location. Indeed, there were less than 700 annual cases between 2001 and 2016. These cases were scattered across a large area, which made the pre-clustering of cases very efficient as we focused on short-distance transmission. In smaller or more endemic settings, the number of potential infectors per cases after the pre-clustering step might be higher, which would increase the running time.
Furthermore, as the location of each case was deduced from the population centroid of counties, we assumed that the distance between cases from the same county was effectively zero. American counties are large and widespread geographical units that can include more than 1 million individuals. For future use of o2geosocial, more accurate information on the location of cases could improve cluster inference by identifying multiple importations in a given county. Because cases are reported by the state of residency, we had to ignore that cases may have been out of the reported county or state during their incubation and infectious period, which has been seen during some outbreaks, such as the 2015 'Disney outbreak' in California [54].
We did not include prior information on the local susceptibility of the different areas affected in o2geosocial, and these could be estimated using historical values of local coverage. However, protocols to estimate local vaccination coverage can differ in time and space and be difficult to compare, or unavailable at the local level. Furthermore, these estimates are cross-sectional in nature, and might not take into account catch-up vaccination campaigns, or immunity induced by previous outbreaks. Local seroprevalence surveys could identify pockets of susceptibles, but they have not been carried out on a subnational scale in most countries [55].
There has been no national quantitative analysis of agespecific contact patterns carried out in the USA, so we relied on a contact matrix between age groups available for Great Britain from the POLYMOD study [36]. Nevertheless, little variation in the contact rates between age groups has been observed between European countries, and a previous projection of the social contact matrix in the USA yielded similar results [56]. POLYMOD data were probably the most reliable source of information we could use to deduce an estimate of the contact matrix in the USA.

Conclusion
Heterogeneity in immunity can cause large outbreaks in countries with high national vaccine coverage, and identifying potential foyers of transmission in post-elimination settings is key for outbreak prevention and control. We have presented a method for estimating the cluster size distribution of past measles outbreaks from routinely collected surveillance data. We found that adding prior knowledge on the importation status of cases improved the inference of the transmission clusters. Although the method was able to identify a proportion of importations, epidemiological investigations on the history of travel and exposure reduced uncertainty on the clustering of cases. We believe these investigations are needed to produce reliable estimates of past transmission clusters. In lieu of the importation status, if multiple genotypes are co-circulating, increasing the proportion of genotyped cases could help discard potential connections and find imported cases. Even with limited information, this method was able to infer probabilistic transmission clusters in a fast and efficient way.
Ethics. This study is a secondary analysis of data collected as part of routine surveillance of measles outbreaks in the USA. The research was approved by the London School of Hygiene and Tropical Medicine Research Ethics Committee (reference number 15735).
Data accessibility. The package we developed is publicly available on Github (https://github.com/alxsrobert/o2geosocial), along with the code used to analyse the data and generate the figures (https://github.com/alxsrobert/datapaperMO). Combinations of variables in the surveillance data used to validate this algorithm may contain sensitive personally identifiable health information which are subject to the Privacy Act and cannot be shared publicly. A toy dataset was attached to the o2geosocial package (in o2geosocial/data). The script analysis_generated_data.R in the datapaperMO repository generates toy datasets with different parameters (distance kernel, number of cases, reproduction numbers etc.) and can be used to re-run the model and test its performance.