Partially observed epidemics in wildlife hosts: modelling an outbreak of dolphin morbillivirus in the northwestern Atlantic, June 2013–2014

Morbilliviruses cause major mortality in marine mammals, but the dynamics of transmission and persistence are ill understood compared to terrestrial counterparts such as measles; this is especially true for epidemics in cetaceans. However, the recent outbreak of dolphin morbillivirus in the northwestern Atlantic Ocean can provide new insights into the epidemiology and spatio-temporal spread of this pathogen. To deal with uncertainties surrounding the ecology of this system (only stranded animals were observed), we develop a statistical framework that can extract key information about the underlying transmission process given only sparse data. Our self-exciting Poisson process model suggests that individuals are infectious for at most 24 days and can transfer infection up to two latitude degrees (220 km) within this time. In addition, the effective reproduction number is generally below one, but reaches 2.6 during a period of heightened stranding numbers near Virginia Beach, Virginia, in summer 2013. Network analysis suggests local movements dominate spatial spread, with seasonal migration facilitating wider dissemination along the coast. Finally, a low virus transmission rate or high levels of pre-existing immunity can explain the lack of viral spread into the Gulf of Mexico. More generally, our approach illustrates novel methodologies for analysing very indirectly observed epidemics.


S1. Stranding data
S2. Data preparation S2.1. Prediction of background stranding rates. Annual stranding rates in non-epidemic years (1996-2012) were predicted using a Poisson generalized linear model (GLM). Separate models for summer (June-August), fall (September-November), winter (December-February) and spring (March-May) were used to allow for seasonal variability in stranding patterns ( Figure  S2). Each model was defined by where L is a vector representing latitude degrees, N s a vector for the total number of strandings at each latitude degree in a given season, s, and r s a vector for the stranding rate at each latitude during season s.
The resulting predictions for the number of background strandings in each season and latitude degree (in the absence of a disease outbreak) sufficiently capture the observed seasonal patterns in the background stranding data, with spring and summer displaying higher stranding rates in addition to a stronger multimodal pattern in the distribution of rates across latitude points compared to fall and winter ( Figure S4 and Table S1). We therefore use these estimates to separate likely background cases in the 2013 -2014 UME data from cases that are likely due to disease (Section S2.2), and to estimate the seasonal distribution of the total population along the coast (Section S2.3). S2.2. Removing background strandings from the UME dataset. As discussed in Section 2.2, the background cases must be removed from the UME dataset in order to restrict analysis to strandings due to DMV infection. Since we only consider the first year of UME data in our analysis, the number of recorded UME strandings in each season and latitude is analogous to the annual UME stranding rate. These UME stranding rates, in addition to the background stranding rates calculated previously, are used to determine the proportion of cases that should be removed from the UME dataset.
Let r s,l and u s,l be the rate of background and UME strandings in season s and latitude l, respectively. Note that since the UME data includes background strandings, r s,l < u s,l . Assuming the stranding events are independent, the probability that a randomly selected case from season s and latitude l in the UME dataset is a background stranding follows a binomial distribution with 'success' probability r s,l /u s,l . The successful outcomes are then removed from the UME dataset. Figure 1A compares the raw dataset to the resulting epidemic dataset once these background cases have been removed. The pattern shown here is preserved across multiple simulations of the above procedure and we therefore assume that our approximation of the epidemic data is a fair representation of the underlying epidemic process. S2.3. Seasonal distribution of population density. If N s is the the total number of strandings (across all latitude degrees) during a particular season s, and N s,l is the number of strandings at latitude l during that given season, then the proportion of strandings in season s that occur at latitude l is N s,l /N s . The number of individuals that occupy latitude l during season s can then be estimated by multiplying N s,l /N s by the total population size (26,317) of the four coastal stocks (NMCS, SMCS, SCGCS and NFCS). Since it is assumed that estuarine dolphins do not contribute significantly to the overall UME strandings, their stock sizes are not included in the total population estimates or any subsequent analysis. Note: * p <0.1; * * p <0.05; * * * p <0.01 The method described above gives a point estimate for the number of individuals at each latitude that remains constant throughout a given season. In reality, however, the number of individuals is likely to fluctuate continuously throughout the season. Since we lack data on these fluctuations, we incorporate them by linearly interpolating around our point estimates. For example, if D F l , D W l are the point estimates for the population size at latitude l in fall and winter, respectively, then for each day (t = 1, 2, ..., 91) of fall, the number of individuals at l is given by . This procedure is repeated for each season to give a smoother temporal function for the population density at each latitude degree ( Figure S5). These estimates then provide the model inputs for the population size at each day and latitude degree, D t,l , throughout the epidemic.

S3. Log-likelihood function
As presented in Section 2.3.1, the probability of observing N cases in time t ∈ [0, T ] and space l ∈ [L 1 , L 2 ] is defined as [1] The probability of observing case i is λ(t i , l i )/Λ [1], and the probability of observing an ordered sample of cases 1, 2, ..., N is It follows that the likelihood of observing the whole epidemic dataset is given by [1] Pr and therefore the log-likelihood is [1, 2] log(λ(t i , l i )).

S4. Information criterion
Typical Akaike information criterion (AIC) values are calculated as AIC = 2k − 2 log(p(y|θ)), where y represents the observed data, k is the number of estimated parameters in the model, andθ is the maximum likelihood estimate of the model parameters θ [3,4]. Here, the maximum log-likelihood value, log(p(y|θ)), is approximated by taking the median value of the posterior log-likelihood distribution. Watanabe-Akaike information criterion (WAIC) values are calculated as where p P (θ) represents the posterior parameter distribution [4,5]. The first term sums the posterior variance of the log-likelihood for each observed data point, y i , and approximates the number of fitted parameters, and the second term provides a measure of the pointwise loglikelihood [4].

S6. Binomial chain model with susceptible depletion
A binomial chain version of the self-exciting process was developed to investigate the importance of susceptible depletion to the overall epidemic dynamics. In this model the hazard function is defined as for density-dependent transmission, and for frequency-dependent transmission. The number of susceptible individuals at time t and latitude l, X t,l , is calculated by subtracting the cumulative number of strandings at l from the total population size, and then the number of new cases is drawn from a binomial distribution with X t,l trials and probability of success given by p = 1−exp(−λ t,l ). The effective reproduction number is R = βX t,l for density-dependent transmission and R = βX t,l /D t,l for frequencydependent transmission, and the functions f and g, and all other parameters, are as described in the main body of the text. Parameter estimates and information criterion values are given in Table S2, and model fits are shown in Figures S8 and S9. *Reproductive values are calculated using the inferred posterior distribution of β (results not shown). The minimum value refers to R at the lowest estimate of X t,l (X t,l /D t,l ) in the density-dependent (frequency-dependent) case, for any one latitude degree and time point, and the maximum value refers to R at the largest estimate. Figure S8. Simulated predictions of the density-dependent susceptible depletion model. Cases were simulated across latitudes, l, and time, t, as Bi(p = 1 − e −λ t,l , X t,l ) using the hazard function, λ l,t , calculated during parameter estimation. Lines represent median values from 2000 simulations in RStan, shaded regions represent the 2.5 th -97.5 th quantile range and points represent actual data. Latitude bands are distinguished by colour. Figure S9. Simulated predictions of the frequency-dependent susceptible depletion model. Cases were simulated across latitudes, l, and time, t, as Bi(p = 1 − e −λ t,l , X t,l ) using the hazard function, λ l,t , calculated during parameter estimation. Lines represent median values from 2000 simulations in RStan, shaded regions represent the 2.5 th -97.5 th quantile range and points represent actual data. Latitude bands are distinguished by colour.

S7. Transmission networks
S7.1. Generating the networks. As discussed in Section 2.4, transmission networks were reconstructed using the distribution of parameter estimates from the best-fitting model. Stranded individuals were connected with the source of their stranding (i.e. the previously stranded individual that most likely caused their infection) according to the following steps: (1) One sample was taken from the joint posterior parameter distribution i.e. a value was drawn for each estimated parameter (the baseline and additional transmission rates, and the temporal and spatial decay rates). This sample was then used to reconstruct the hazard function (according to Equation 1), and the current contribution of each infected individual to the hazard, across all space and time points. (2) For each individual i (that stranded at latitude l i and time t i ): i) The current contributions at l i and t i of each previously stranded individual (i.e. each potential source) were determined from the previous step. ii) These contributions were divided by the total hazard function at l i and t i to determine the relative contribution of each potential source. A vector, C, representing the cumulative sum of these contributions was then created. iii) Finally, the source was chosen using random number generation: a random number between 0 and 1, r, was drawn from a uniform distribution and the individual corresponding to the first element in C greater than r was identified as the source of infection. (3) Step 2 was repeated for each stranded individual to generate the entire network.
A different transmission network was generated for each random sample of the estimated parameter distributions; 100 samples were taken in total. These transmission networks then served as a tool for visualizing and interpreting the results of the model inference.
S7.2. Visualization. Figure S10. Example of one of the directed transmission networks. Each arrow represents the direction from the source of infection to the resulting stranding. Colours distinguish the different latitudes that generated each transmission event.