Synchronous diversification of Sulawesi's iconic artiodactyls driven by recent geological events

The high degree of endemism on Sulawesi has previously been suggested to have vicariant origins, dating back to 40 Ma. Recent studies, however, suggest that much of Sulawesi's fauna assembled over the last 15 Myr. Here, we test the hypothesis that more recent uplift of previously submerged portions of land on Sulawesi promoted diversification and that much of its faunal assemblage is much younger than the island itself. To do so, we combined palaeogeographical reconstructions with genetic and morphometric datasets derived from Sulawesi's three largest mammals: the babirusa, anoa and Sulawesi warty pig. Our results indicate that although these species most likely colonized the area that is now Sulawesi at different times (14 Ma to 2–3 Ma), they experienced an almost synchronous expansion from the central part of the island. Geological reconstructions indicate that this area was above sea level for most of the last 4 Myr, unlike most parts of the island. We conclude that emergence of land on Sulawesi (approx. 1–2 Myr) may have allowed species to expand synchronously. Altogether, our results indicate that the establishment of the highly endemic faunal assemblage on Sulawesi was driven by geological events over the last few million years.

strictly followed protocols developed by [2,3]. Differences in shape were tested using MANOVA, whereas differences in log-transformed centroid size were tested using Wilcoxon tests and visualized using boxplots. Variation in shape was first visualized using a principal-components analysis (PCA) before between-groups variation was explored using Canonical Variate Analyses (CVA). The resemblance between groups was visualized with a neighbor-joining network calculated on the Mahalanobis distances.
Manova and CVA were performed after a dimensionality reduction of the data following [3].
The variances of the two species on Sulawesi were compared using a Fligner-Killeen test based on the distance between each specimen and the mean shape (or size) of its species. M2 and M3 were analysed separately before being pooled together to produce the synthetic Figure 2a.

DNA extraction
We extracted DNA from 520 Anoa, 251 Babrirusa, and 317 SWPs. We For DNA extraction from bone, we grounded samples of cortical bone to powder in a Mikrodismembrator (Sartorius). We then digested bone powder overnight at 50 °C in 2 mL of buffer (0.425 M EDTA pH8, 1 mM Tris-HCl pH8, 0.05% w/v SDS, 0.33 mg/mL Proteinase K) under constant rotation. The digested solution was concentrated to approximately 500 µL using 30 kDa molecular weight cut-off centrifugal filters (Amicon® Ultra, Millipore). We passed the concentrated solution through a silica column (QIAquick®, Qiagen) following the manufacturer's protocol, and eluted the final extract in 100 µL of TE buffer. We measured DNA concentration using 2 µL of extract on the Qubit® platform (Invitrogen), and stored the extracts at -20 °C.

mtDNA sequence data
From our samples of Anoa, we amplified D-loop and cytb fragments by polymerase chain reaction (PCR) using the primers described in Table S6. Both primers were designed by Dr D. Bradley (Trinity College, Dublin) to amplify the mtDNA of multiple bovine species [4,5]. Numerous samples were not sequenced due to the low quality of their DNA.
Fragments were amplified by PCR using one cycle of denaturation at 96 °C for 3 min, followed by 30 cycles of: denaturation at 96 °C for 30 s, annealing at 50 °C for 20 s, and extension at 60 °C for 4 min. Both primers were run separately with an M13 tail added to the 5'-end. Sequencing was carried out using M13 universal primers and the ABI BigDye 3.1 sequencing kit (Applied Biosystems). Sequences were determined using an ABI 3700 automatic DNA capillary sequencer (Applied Biosystems), OrbixWeb™ Deamon software, 3700 DATA collection software and DATA Extractor software.
From our samples of Babirusa and SWP, we amplified two overlapping d-loop fragments for both species, which were amplified by PCR using primers designed by G. Larson (University of Oxford, UK) [6,7] and described in Negative controls (without DNA) were included in all reactions. The PCR products were analysed using an ABI 373 (Applied Biosystems) DNA fragment analyser. Results were scored with the programs GENESCAN 3.0, GENOTYPER 2.5 or PEAK SCANNER 2.0 (Life Technologies).
For samples of SWP, PCR was performed in an Eppendorf Mastercycler® gradient apparatus. In general, the PCR profile was as follows: the 10 µL reaction mixture  (Table S6), and 72 °C for 1 min. PCR products were visualized on a 1.5% agarose gel (Acros organics) with GelRed Nucleic Acid Gel Stain (Biotium) in order to check the amplification.  Table S6), and 72 °C for 1 min. PCR products were visualized on a 1.5% agarose gel (Acros organics) with ethidium bromide (Merck) in order to check the amplification.  Table S6. cold chain) of 10,000,000 steps each (with samples drawn every 1000 steps). The first 25% of samples were discarded as burn-in. We carried out 4 independent MCMC analyses and combined the samples from the posterior. Convergence was assessed by ensuring that average standard deviation of split frequencies was below 0.01 and that the potential scale reduction factor was close to 1 for all parameters.

Phylogenetic analyses of mitochondrial DNA
For each species we defined haplogroups based on highly supported clades. For each geographic region, the proportion of each haplogroup was plotted on a map using the R package "maps". For each sample, haplogroup membership was transposed to create an ancestry matrix. All samples lacking precise geographic coordinates were removed. The ancestry matrix was then plotted onto a map with a tessellated projection, using the R package "tess3r" [10][11][12]. We then divided Sulawesi and nearby islands into 11 regions based on previous work on amphibians and primates that defined areas of endemism on the island [13][14][15]. We assessed the significance of the difference in haplogroup frequency in each area of endemism using Pearson's chi-squared test, p-values were computed using 2000 simulation replicates, as implemented in R.
To infer the evolutionary timescales of the three species, we performed a Bayesian phylogenetic analysis using a molecular clock in BEAST v1.8.4 [16]. First, we analysed a mtDNA combined data set comprising the sequences of Babirusa, SWP and relatives (S.

cebifrons, S. philippensis, Hylochoerus meinertzhageni, Potamochoerus porcus,
Potamochoerus larvatus, Phacochoerus aethiopicus, and Phacochoerus africanus). This data set comprised 700 aligned nucleotides from 243 samples. To calibrate the molecular clock, we used a normal calibration prior for the age of African suids (mean 10.5 My, standard deviation 2.551 My), based on the estimate from a combined nuclear and mitochondrial data set by [17].
We then analysed the mtDNA sequences of Anoa and related bovids (Bison bison, Bison bonasus, Syncerus caffer, Bos taurus, Bos gaurus, Bos frontalis, and Bos grunniens). This data set comprised 726 aligned nucleotides from 170 samples. We used a normal calibration prior for the age of the root (mean 8.8 My, standard deviation 1.02041 My), based on a fossil calibration used by [18]. Given the use of relatively deep calibrations in both analyses, the date estimates should be regarded as being conservatively old because our approach is likely to produce underestimates of the substitution rates [19].
The Bayesian information criterion was used to select the HKY+G model as the best-fitting substitution model for both data sets, after excluding models allowing a proportion of invariable sites. For each data set we compared two models of rate variation: the strict clock and the uncorrelated lognormal relaxed clock [20]. We also compared three tree priors: constant-size coalescent prior, Bayesian skyline coalescent prior, and birth-death speciation prior. For each combination of clock model and tree prior, the marginal likelihood was estimated using path sampling with 25 power posteriors [21]. Samples were drawn every 2,000 steps from a total of 2,000,000 MCMC steps per power posterior.
Posterior distributions of all parameters, including the tree, were estimated by MCMC sampling, with samples drawn every 5000 steps over a total of 50,000,000 MCMC steps.
To ensure convergence, each analysis was run in duplicate and the samples were compared and combined. Sufficient sampling was confirmed by examining the effective sample sizes of parameters. For both data sets, the strict clock and Bayesian skyline tree prior yielded the highest marginal likelihood (Table S7).

Analyses of microsatellite data
For each species, we used STRUCTURE v2.3.4 [22] to infer population structuring. The maximum number of populations (K) was set to 12 (the total number of region defined on Sulawesi). For each species, we ran 10 independent MCMC analyses, each with 1,000,000 steps, discarding a burn-in of 50,000 steps. We computed ∆K ( Figure S8) to infer the best-fitting K value using structure Harvester [23]. Independent runs were merged using CLUMPP with M=2 [24]. For all samples with precise geographic coordinates, results were plotted onto a map with a tessellated projection, using the R package "tess3r" [10][11][12]. Results were also plotted on a map using the R package "maps" in each region of endemism (see above). To limit the possibility of provenance uncertainty, we excluded all samples that were from zoos or from unknown locations from this analysis (see Table S1).

Geographical origins of population expansions
To infer the location of origin of population expansion for each species, we employed a spatially explicit discriminative modelling approach in which we assume a monotonic decline in diversity with distance from origin of a range expansion. A spatial grid of latitude and longitude values covering the geographic space of Sulawesi, of resolution 0.05 by 0.05 degrees, was explored using a flat kernel of radius 500 km for SWP and Babirusa and 350 km for Anoa. If at any location in the grid we found within the kernel at least 5 sampled individuals for SWP, or 3 sampled individuals for Babirusa and Anoa, the local diversity was calculated using ASD and recorded for that grid location. The grid was then re-explored with each latitude/longitude location treated as a potential origin location, and we recorded the correlation between geographic distance to the accepted kernels and local diversity at those kernels. This provided a grid of correlation values, which was then interpolated and visualized on a map.
Regions with the highest negative correlations were considered the best hypothesized origin locations. To quantify statistical support for inferred origin locations, the data were permuted among sample sites 1000 times, and for each permuted data set the above analysis was repeated. Following this, we plotted only the grid locations where the negative correlation between geographic distance and genetic diversity was more extreme than 99% (98% for Anoa) of those obtained from the permuted data.

Approximate Bayesian computation
For each species, we used both mtDNA and microsatellite data to evaluate the fit of four different models ( Figure S11) and to obtain a posterior distribution of the parameters under the best-fitting model. We compared the fit of models with constant population size ( Figure   S11a), population expansion ( Figure S11b), a bottleneck (Figure 10c), and a bottleneck following an expansion (Figure 10d). The rationale behind these models is to test whether these species have undergone a population expansion due to the uplift of Sulawesi (see main text) and/or if they have undergone a bottleneck due to recent human activities. The prior distributions used for the simulations are summarized in Table S4.
We calculated multiple summary statistics for each data set using arlsumstat [35]. For the mtDNA data, we computed the number of segregating haplotypes K, the number of segregating sites S, Tajima's D [36], Fu's FS [37], and the average pairwise difference π.
For the microsatellite data, we computed the total number of alleles K, the range of the allele size R, the expected heterozygosity H and the Garza-Williamson statistic GW [38].
We ensured that the observed summary statistics fell well within the distribution of simulated summary statistics ( Figure S13-15). . We also computed the average Root Mean Square Error (RMSE) for each parameter using pseudo-observed data to assess our power to infer each parameter in the model (see Table S4).
To estimate parameter values, we ran a total of 2,000,000 simulations under the bestfitting model for each species. We extracted five partial least square (PLS) components             (1)(2)(3)(4) are displayed in Figure S11. Obs. P-value= observed fraction of the retained simulation (2,000) with a marginal likelihood value (marginal lnL) smaller than the observed data. Posterior P. = Posterior probability of the model. Table S4: Characteristics of the prior and posterior distribution of parameters estimated via approximate Bayesian computation. All priors are uniformly distributed. The average root mean square error (RMSE) of the mode of each parameter was computed using 1,000 pseudo-observed data sets. Values close to 1 and 0 indicates little and large power, respectively. 95CI represents the 95% credibility interval. See Figure  S11 for further information about the parameters.      c. Figure S10: Results of the STRUCTURE analysis for K=2 to K=6. a. Anoa b. Babirusa c. SWP.

Figure S15 Observed (red vertical line) and simulated (histogram) of all summary statistics used in the approximate Bayesian computation analysis (SWP).
c.