Extending statistical models for source attribution of zoonotic diseases: a study of campylobacteriosis

Preventing and controlling zoonoses through the design and implementation of public health policies requires a thorough understanding of transmission pathways. Modelling jointly the epidemiological data and genetic information of microbial isolates derived from cases provides a methodology for tracing back the source of infection. In this paper, the attribution probability for human cases of campylobacteriosis for each source, conditional on the extent to which each case resides in a rural compared to urban environment, is estimated. A model that incorporates genetic data and evolutionary processes is applied alongside a newly developed genetic-free model. We show that inference from each model is comparable except for rare microbial genotypes. Further, the effect of ‘rurality’ may be modelled linearly on the logit scale, with increasing rurality leading to the increasing likelihood of ruminant-sourced campylobacteriosis.


Introduction
Modelling of disease surveillance data to explore patterns of infectious diseases has had a long history in public health. Infectious diseases can cause high economic and medical costs due to morbidity and mortality. In recent decades, the annual number of global deaths caused by infections has levelled off at approximate 15 million and may remain at this level for the next three decades [1,2]. In order for such an enormous health burden to be reduced, preventing and controlling infectious diseases becomes extraordinarily important, and our ability to intervene depends on how much we know about the nature of disease transmission.
For zoonotic diseases, transmission to humans from animal reservoirs may be complex, involving many sources and exposures linked by different pathways, via food, water, through environmental contamination or direct contact with animals. Knowledge of the potential sources and pathways of infection is key to reducing the burden of disease. For instance, infected wild birds may contaminate environmental water and cause disease spread to water users, either humans or other animals [3]. Tracing the source of infection becomes crucial to increasing the ability to implement risk management and intervention [4,5].
Modelling zoonoses requires an advanced approach with the focus changed from just epidemiology to a combination of epidemiology, evolutionary genetics and biology [6]. Some source attribution models have been proposed to estimate the number of cases attributable to different sources by using epidemiological information and the association with genotypes found in humans and sources [7][8][9][10]. The genetic information used in such integrated models is typically derived from molecular genotyping that groups closely related organisms together [11]. A common method used is multilocus sequence typing (MLST) [12 -14], which uses nucleotide sequences of internal fragments of a small set of housekeeping genes. Such sequences have sufficient variation to distinguish differing pathogen lineages, while being relatively stable within lineages. Each unique nucleotide sequence (allele) at each housekeeping gene (locus) is assigned a number, and the set of numbers across all loci (the allelic profile) is then taken as the genotype, which is assigned a sequence type (ST) number.
For the pathogen Campylobacter which causes campylobacteriosis, a worldwide gastrointestinal disease in humans, the commonly used seven-gene MLST scheme consists of housekeeping genes aspA (aspartase A), glnA (glutamine synthetase), gltA (citrate synthase), glyA (serine hydroxymethyltransferase), pgm ( phosphoglucomutase), tkt (transketolase) and uncA (ATP synthase a subunit). An illustrative example of MLST data for Campylobacter is presented in table 1. It shows that the genotypes ST-2026 and ST-474 have different allelic combinations across all seven loci, while ST-403 differs from ST-2026 only at glnA, and ST-2343 differs from ST-474 at gltA and pgm. The different allelic profiles enable comparison of gene similarities or dissimilarities so that an association between sources and infected cases can be made, by comparing the distribution of genotypes from human cases with those from potential reservoirs.
Human campylobacteriosis is caused mainly by C. jejuni and C. coli which are the dominant species associated with approximately 80% and 15% of illnesses respectively [15]. Common symptoms of infection are diarrhoea, abdominal pain and fever; however, a severe complication named Guillain-Barré syndrome may develop, which is a life-threatening disease that weakens the nervous system and leads to paralysis of the limbs and respiratory failure [16]. The pathogen can be spread between animals, or from animals and wild birds to humans. Transmission routes may be via drinking contaminated water, eating undercooked animal food products, or handling animal food products that are already contaminated by faeces.
The first step for attribution models that use genetic information is building the sampling distribution of genotypes among each putative source. This may range from using the proportion of each observed genotype [9] on each source through to using allelic profile information to derive mutation and recombination rates within each source, and migration rates between each source [4]. A key question is whether more complex genetic models yield superior attribution results or whether a significantly simpler model may suffice, but few authors in the literature have addressed this point. This becomes more important as model complexity extends to include epidemiological covariates. We are therefore motivated to develop a simple model in order to assess the additional information that the more complex models provide by using data originating from a study on human campylobacteriosis conducted in New Zealand [17].
In this study, we develop statistical models for source attribution and demonstrate their use on the campylobacteriosis study. We compare the performance of the asymmetric Island model [4], which considers genetic evolution when estimating the genotype sampling distribution on each source, to a simple model that uses only the prevalence of each type to derive the sampling distribution. This comparison brings into sharp focus the contribution of the asymmetric Island model to the overall analysis, enhancing our understanding of the operation of these models and facilitating model checking. We then extend both models in a Bayesian context to incorporate covariates, exploring the effect of human case rurality on attribution results via a linear trend on the logit scale or with separate categories, and performing model comparison.

MLST data
Our data originate from the campylobacteriosis study and comprise microbial genotype information from each observed human case obtained from analysis of stool samples and also from a pool of non-human cases corresponding to potential zoonotic sources of disease. These samples were obtained at a surveillance sentinel in the Manawatu region of New Zealand from March 2005 to December 2014. Further details can be found in [17]. Briefly, the data contain 1460 isolates taken from human cases, and 2128 isolates sampled from chicken carcases, cattle, sheep, environmental water, wild birds and so on, over the same time period and from the same geographical location. The non-human samples were categorized into four groups representing major sources of infection: poultry, ruminants, water and others (consisting of cats, dogs and various wild birds). The total number of unique genotypes from all isolates is 348, with 36% of genotypes found among human cases. Table 2 lists five common genotypes found in human and source isolates, the first four of which are frequently observed in human cases. As found in other studies, ST-45 and ST-474 are detected mainly in poultry, while ST-42 and ST-2026 are detected mainly in ruminants [6,10,13]. The fifth genotype, ST-2381 is not found among human cases, appearing only in the water and other sources, in this case being found in Pukeko and Takahē birds from the Rallidae family [18,19].

Location information of human cases
The data also contain location information, in the form of an ordinal classification of urban and rural areas with seven levels coded from 2 3 to 3: highly rural/remote area, rural area with low urban influence, rural area with moderate urban influence, rural area with high urban influence, independent urban area, satellite urban area and main urban area. Approximately 8% of individuals in the Manawatu dataset have no information about the location, which we assume are missing at random. Table 3

Genotype models with and without microbial genetic information
The goal of attribution models is to estimate the probability that the observed human cases arise from each putative source. Given the genotyping information, we first estimate the sampling distribution of genotypes for each source and then estimate the appropriate combinations of those genotype distributions that most likely give rise to the set of genotypes observed among human cases. Specifying first the sampling distribution of genotypes found on sources is fundamental for the purpose of not only exploring how it affects the source attribution probability but also investigating the difference in attribution effect made between different genetic models. Suppose we have isolates collected from human and nonhuman cases, of which H isolates belong to humans, and the remaining N isolates are categorized in J groups as the major sources attributed to the infection. Let I genotypes be the total number of unique types detected from all isolates and denote n j as the marginal frequency of types found in source j, where P J j n j ¼ N. Typically, the number of detected types I is smaller than the sample size of isolates as multiple isolates will be of the same type.
Each type i, i ¼ 1, . . ., I, may be found in more than one human case and so we model the likelihood of observing human cases with genotype ST i[h] using a multinomial distribution, in which i[h] is the index of the ST found in human case h. The likelihood via the law of total probability may be expressed as is the probability that genotype ST i found in human case h arises from the sampling distribution of source j, and p(source j ) is the attribution probability that a random human case is infected from source j. Given we know p(ST i[h] jsource j), estimation of p(source j ) may be found by optimizing the likelihood (2.1), for example, using a Metropolis -Hastings algorithm within a Bayesian context, with suitable priors on p(source j ). The asymmetric Island model [4] adopted in the source attribution study for human campylobacteriosis [17] uses the allelic profile information for each genotype in an evolutionary    royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 16: 20180534 model, estimating mutation and recombination probabilities within, and migration probabilities between, each source 'island'. It thus estimates p(ST i[h] jsource j) indirectly, by first estimating the evolutionary parameters, and then deriving the sampling distributions. This allows the asymmetric Island model to estimate the likelihood of observing a genotype on a source when it has not been previously observed.
To discover the effect of incorporating genetic information at the allelic profile level as used in the asymmetric Island model, a simple model is developed for the genotype sampling distribution. With the assumption that the observed distribution of genotypes is representative of the true distribution, we model observed genotypes using a multinomial distribution. Let x ij denote the count of genotype ST i found in source j with probability p ij , where i ¼ 1, . . ., I and j ¼ 1, . . ., J. To make inference about p ij , the likelihood is of a multinomial form, where x ij can be 0, indicating genotypes are not observed on the sources; n j ¼ P I i¼1 x ij is the total count of all types found on source j and p ij is subject to P I i¼1 p ij ¼ 1, and 0 p ij 1. As the family of Dirichlet distributions is a conjugate pair for the multinomial distribution, assume the prior for p j follows a Dirichlet density with parameters g j , : Then the posterior for p j takes the form of a Dirichlet probability function with parameters (g j þ x j 2 1), , g ij . 0: To express the belief that every isolate is equally likely a priori, the parameter of the Dirichlet prior is assumed as g j ¼ 1. Therefore, p(ST i[h] jsource j) can be obtained by simulating from the Dirichlet posterior.

Model fitting on rurality scale
Previously we described how to estimate the marginal probabilities that a randomly selected human case is due to a given source. To estimate the attribution probability, 100 posterior samples of p(ST i[h] jsource j) are generated using the asymmetric Island or Dirichlet models. For each posterior sample, we infer p(source j ) using the likelihood (2.1). This has the effect of integrating over the uncertainty in p(ST i[h] jsource j) when estimating p(source j ).
To extend this analysis so as to include individual level covariates, we need to calculate subject-specific attribution (conditional) probabilities, p(source jjcovariates). To that end, let F jh denote the attribution probability of source j for the h th human case, with constraints P J j¼1 F jh ¼ 1 and 0 F jh 1, where h ¼ 1, . . ., 1460 and j ¼ 1, . . ., 4. We model the probabilities F jh using a linear model on the logit scale, that is, where f 4h ¼ 0 is treated as the baseline of f jh . Consider the case where the genotype data for each human case are supplemented by p additional variables. A general model of f jh with linear combinations of the variables, c 1 , . . ., c p , then has the form for the hth individual. Note that if there is a single categorical variable with L levels, then F jh and f jh will take no more than L distinct values. In a slight abuse of notation, we will at times refer to F jh in which the h index refers to the factor level, rather than to a particular subject at that level.
To apply the general model of f jh to the campylobacteriosis data, assume z is the variable ranging from 2 3 to 3 representing the classified rurality of each human case. Then two ways of treating the variable z in model fitting are proposed: one is to treat it as numeric, and the other as categorical. To differentiate the performance between the two fitted models, we link 'the linear model' and 'the categorical model' to the first and the latter fitted model, respectively. Hence, the linear prediction function for source j for each human case given the degree of rurality is a numeric variable and can be written as where z h can be any number of the seven scales if case h was from such a degree of rurality. Conversely, if we treat each of the seven rurality degrees as an indicator with a superscript number d, which corresponds with the position of the category ranged from 2 3 to 3, the model (2.3) can be rewritten as

Posterior attribution probability
Posterior attribution of human cases of campylobacteriosis (F) with 80% credible intervals for each source is illustrated for each rurality grade in figure 2. The graphs are categorized by the types of model (asymmetric Island or Dirichlet) and the manner in which the rurality variable is modelled (categorical or linear on the logit scale).
Overall, the attribution results are relatively stable irrespective of the types of model or how rurality is represented in the attribution model. The majority of human cases are attributed to ruminants and poultry, with more cases attributed to ruminants in rural areas and more cases attributed to poultry in urban areas. For the Dirichlet and asymmetric Island models, the linear and categorical models of rurality show broadly the same trend, suggesting that the additional flexibility given by the categorical model is not required and that the shift in attribution as rurality changes is adequately modelled by a linear trend on the logit scale. The linear model has the advantage of tighter credible intervals as it can share data across the seven levels of rurality, resulting in a clearer separation of ruminant and poultry attribution, particularly in highly rural areas where the data are sparse. There are some small differences between the genotype models, with the Dirichlet model showing a greater attribution to poultry (ranging from 40% in highly rural areas to 75% in main urban centres) than the asymmetric Island model (ranging from 30% in rural areas to 65% in urban centres). This also occurs similarly in the categorical model.
Interestingly, the asymmetric Island model attributes approximately 7% of human cases across all rurality levels to sources other than poultry, ruminants and water and gives a small attribution to water in highly rural areas, while the Dirichlet model indicates that both these sources are unimportant.

Model selection
We use deviance information criterion (DIC) for model comparison. DIC values obtained from our MCMC runs are displayed in table 4. Overall, there is a clear signal that a linear representation of rurality (on the logit scale) is adequate due to relatively small values compared to the categorical model. Note that the asymmetric Island and Dirichlet models are not directly comparable by DIC as the likelihoods are on different scales: the Dirichlet model assumes all potential sequence types have been observed so that P j p(p j ) ¼ 1, whereas the asymmetric Island model allows for unobserved sequence types so that P j p(p j ) , 1.

Investigation between genotype models
The small differences in attribution observed between the asymmetric Island and Dirichlet models may be due to the additional genetic information available to the first model. To illustrate this, figure 3 shows the probability of source given a selection of four genotypes, assuming a priori that each source was equally likely. Genotypes ST-403 and ST-2343 are observed primarily in humans (six cases each), with ST-403 not being observed among the sources, and ST-2343 being observed once in poultry so that the Dirichlet model has little information available to distinguish between sources. The asymmetric Island model, however, can exploit the genetic relationship between genotypes. ST-403 differs at just one locus from ST-2026, a type observed frequently in human cases and ruminant isolates, while ST-2343 differs at two loci from common genotype ST-474, observed frequently in human cases and poultry isolates (tables 1 and 2). Thus, the asymmetric Island model can clearly assign ST-403 to ruminants and ST-2343 to poultry, while the Dirichlet model cannot distinguish between sources. By contrast, both models provide similar probabilities for ST-2026 and ST-474 which are both observed frequently.

Robustness analysis
As noted previously, a major public health initiative in 2007 led to a significant reduction in the number of cases of campylobacteriosis in New Zealand. In order to examine the effects of this change on attribution probabilities, we repeated the analysis by including an interaction, with time period 2005-2007 and 2008-2014. Figure 4 shows that the general trend in attribution by rurality for each of the time periods using a linear trend on the logit scale to incorporate rurality. There is a clear difference, with a significantly lower attribution to poultry (and correspondingly higher attribution to ruminants) in all but the most rural of areas, being strongest in highly urban areas. Thus, although the intervention did not eliminate infection arising from poultry [22], the reduction highlights the significant improvement in contribution of poultry to disease, particularly in urban areas where most cases occurred.

Sensitivity analysis
As with any Bayesian analysis, it is of interest to examine the sensitivity of the results to the choice of prior distributions. We originally used standard normal priors for regression coefficients on the logit scale. We also considered priors with s 2 ¼ 4, and while this meant that f tended to drift further from 0, the resulting attribution probabilities F did not change, largely as the attribution is dominated by poultry and ruminant sources, with the water source in particular being close to zero. Thus, f poultry and f ruminants are positive, while f water is negative and the magnitude of these can increase without making a significant difference to their corresponding F's. The prior on f thus tends to restrict this ill-behaviour rather than acting as a strong constraint on attribution probabilities. The prior g j in the Dirichlet model also makes little difference if kept small, as it most strongly effects genotypes that are rare, which do not contribute significantly to the overall attribution. The prior can be thought of as data augmentation such that g j ¼ 1 is equivalent to adding a single observation of each genotype to source j. Thus, large values of g j will cause the genotype distributions across sources to look more similar, and hence result in equal attribution to each source.

Discussion
Models that determine the source of human infection, particularly for zoonotic pathogens that originate in animal populations, are of considerable value to public health policymakers. However, such models may be complex, particularly when using evolutionary models. An outstanding question is whether such complexity is required, or whether a simpler model may work as effectively. Here we developed a Our results show that the Dirichlet and the asymmetric Island models give largely similar final attribution probabilities, with both models demonstrating a clear effect of rurality on attribution: cases in rural areas are more likely to have originated from ruminants, while those in main urban centres are more likely to be of poultry origin. As most people in the Manawatu region live in urban centres, this highlights the importance of poultry as a reservoir for campylobacteriosis, which is well established in the literature [10,17,23,24].
When we ran models allowing attribution probabilities to differ before and after the intervention in the poultry industry in 2007, we saw a clear difference, with much lower poultry (and higher ruminant) attribution, particularly in main urban centres during 2008-2014. Considering that most people in the Manawatu region live in urban areas, and that case rates in urban areas decreased from 2008 onwards, it is clear that this intervention coincided with a dramatic reduction in poultry attributed illness as reported elsewhere [25]. Given that campylobacteriosis cases in rural areas are mostly attributed to ruminant sources, and that case rates in these areas have been higher than those in urban centres since 2008, there is a clear need for public health interventions to focus on this area.
While the overall attribution was consistent between the Dirichlet and asymmetric Island models, it would be expected that the conditional probabilities for a given genotype might differ markedly. For those genotypes observed infrequently (or not at all) among the sources, the Dirichlet model has little information while the asymmetric Island model can exploit information from cases with similar (but not identical) genetic profiles. In the case of MLST data with just seven loci, the majority of human cases and source isolates come from a relatively small number of sequence types which are observed often. Thus, the Dirichlet model performs well, as it has sufficient observations to estimate the genotype distribution well where the bulk of the data lie. It is only those genotypes that are rarely observed where it performs poorly, but as they are rarely observed, they do not contribute significantly to the overall attribution. In other circumstances, such as where we have many more than seven loci, we would expect to have many more rare genotypes, so that the Dirichlet model might provide little useful information. At the extreme example of whole genome MLST (wgMLST) where each isolate would typically be unique, it would provide essentially no information at all. In such circumstances, however, the asymmetric Island model would be expected to still perform well, assuming that information could still be transferred between similar genotypes.
The Dirichlet model is less complex than the asymmetric Island model that uses the prevalence of genotypes in sources to derive the sampling distribution of genotypes. It is similar to a recently published model, sourceR [8], that jointly models the source and human cases, accounting for uncertainty in the sampling process. However, sourceR is an extension of the Hald [9] and modified Hald [26] models which model human cases using a Poisson distribution rather than royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 16: 20180534 a multinomial, and instead of estimating the proportion of cases attributed to each source directly, model source effects as well as genotype effects [8,27]. Future research might focus on possible extensions to these models. One direction is to adapt the models with additional covariates, which might include age, occupation and other risk factors such as contact with animals. For example, there is evidence that children in rural areas are at higher risk of campylobacteriosis through contact with farm animals [23,28]. Another direction is in expanding the role of water. In these models, we have assumed that water is a source of human campylobacteriosis infection, but water differs from the other food and environmental sources in that it is not an amplifying reservoir for Campylobacter [3]. By contrast, genotypes found in water might be expected to originate in the other sources present here, particularly ruminants and wild birds, but also potentially from humans as well via discharge of unprocessed human waste. Hence, water acts as a transmission pathway from sources to humans, being both an endpoint (reduced water quality from faecal contamination) and a source (human consumption of water, either recreationally or through untreated water supplies). While there is presently little evidence that water is an important source for human campylobacteriosis from the current models, the models are fitted using sporadic cases of campylobacteriosis. However, water is known as a key source of outbreaks of campylobacteriosis, such as the large outbreak in Havelock North, New Zealand in 2016 where an estimated 5500 out of 14 000 residents became ill [29]. Thus, characterizing the source of Campylobacter found in water has important implications for both water quality and public health.
Ethics. This work has been evaluated by peer review and judged to be low risk. Consequently, it has not been reviewed by one of the University's Human Ethics Committees. All authors are responsible for the ethical conduct of this research. If you have any concerns about the conduct of this research that you want to raise with someone other than the authors, please contact Dr Craig Johnson, Director (Research Ethics) at Massey University via email: humanethics@ massey.ac.nz.
Data accessibility. All codes for fitting the genotype distributions, estimating attribution probabilities and producing figures, along with the dataset are available on GitHub [30].