Speciation over the edge: gene flow among non-human primate species across a formidable biogeographic barrier

Many genera of terrestrial vertebrates diversified exclusively on one or the other side of Wallace’s Line, which lies between Borneo and Sulawesi islands in Southeast Asia, and demarcates one of the sharpest biogeographic transition zones in the world. Macaque monkeys are unusual among vertebrate genera in that they are distributed on both sides of Wallace‘s Line, raising the question of whether dispersal across this barrier was an evolutionary one-off or a more protracted exchange—and if the latter, what were the genomic consequences. To explore the nature of speciation over the edge of this biogeographic divide, we used genomic data to test for evidence of gene flow between macaque species across Wallace’s Line after macaques colonized Sulawesi. We recovered evidence of post-colonization gene flow, most prominently on the X chromosome. These results are consistent with the proposal that gene flow is a pervasive component of speciation—even when barriers to gene flow seem almost insurmountable.

M. hecki (6 individuals), and M. nemestrina from Borneo (4 individuals). We ex-867 cluded data from the east population of M. tonkeana (i.e., M. togeanus) and the M. 868 nemestrina populations from Sumatra and Peninsular Malaysia to reduce the effects 869 of population subdivision on the results. 870 As described in Evans et al. 2014, the first method standardized the ratio of pair-871 wise nucleotide diversity per site (π) on the X over that of the autosomes using the 872 Jukes-Cantor corrected (1969) divergence from baboons for the X and autosomes, 873 respectively, and included a correction for ancestral polymorphism as detailed in 874 Charlesworth and Charlesworth (2010). Because this method required no missing 875 data within a species in order for a site to be considered, we excluded M. maura 876 individual PM613 due to low coverage. The second method estimated the X/A 877 polymorphism ratio (λ) using a model of evolution that included the possibility of 878 a dynamic demographic history and natural selection on GC content  biased gene conversion), as described elsewhere [11,[14][15][16][17]. Thus, N eX =λN eA , where 880 N eX and N eA are the effective population sizes of the X and autosomes, respectively, 881 and λ is not influenced by differences between these genomic regions in mutation 882 rate. This model allowed for missing data, so no individuals were excluded within 883 each of these species or populations. The models to which the data from each species 884 were fitted have several parameters, and include a 'full' model in which all param-885 eters are estimated independently, and several nested models in which one or more 886 parameters are fixed to some constant, or set to be equivalent to one another. The 887 full model included two time intervals with different effective population sizes with 888 instantaneous change between the ancestral and recent population size occurring τ generations ago, and the ratio of the current to ancestral population size being equal 890 to ρ. To account for the possibility that natural selection acted on GC content,891 or the equivalent genomic effects of GC-biased gene conversion, we fitted the data 892 to a model of evolution between two types of nucleotides: those with a weak bond 893 (adenosines and thymines) and those with a strong bond (guanines and cytosines). 894 The polymorphism data were recoded to include only variable sites in which a gua-895 nine (G) or a cytosine (C) nucleotide was segregating with an adenine (A) or a 896 thymine (T). In these models, the parameters θ 01 and θ 10 refer to the mutation pa-897 rameters from G or C nucleotides to A or T nucleotides, and the reverse, respectively, 898 as detailed in Evans et al. 2014. The parameter γ reflects whether GC-biased gene 899 conversion (gBGC) or natural selection on GC content favors an increase in GC con-900 tent (a positive parameter value) or a decrease in GC content (a negative parameter 901 value). In the full model, γ is estimated separately for the autosomes (γ A ) and the 902 X (γ X ), and in some of the nested models γ A and γ X are set to be equivalent and/or 903 equal to zero, which corresponds to no gBGC or neutrality of GC content. The 904 polymorphism data were also fitted to an equilibrium model in which population 905 size is constant and for which there is no X/A polymorphism ratio (λ) parameter.

906
More detailed information and the statistical rationale for these models are avail-907 able elsewhere [11,[14][15][16][17]. If the equilibrium model was poorly supported, weighted 908 parameter estimates were then calculated across all models using AIC weights, as RADseq data from four species, after separating the polymorphism data into cate-914 gories based on genomic position relative to annotated genes in the rhesus genome.

915
Additional polymorphism statistics are presented in Supplementary Tables S3 -S6. 916 As expected, diversity and divergence was similar in M. tonkeana to that previously 917 reported based on an expanded dataset that included paired-end sequences [11], even 918 though there were differences in the bioinformatic analyses of each study.

919
In the four species in which we assessed population genetic variation, M. nemest-920 rina from Borneo was the most polymorphic. The 95% CIs for π and θ W overlapped 921 for the three Sulawesi species with population genetic data from at least 4 individuals 922 (M. tonkeana, M. maura, and M. hecki). In genomic regions far from genes, which 923 presumably are least affected by natural selection, the X/A polymorphism ratio was 924 lower than expectations (Fig. S2). However, in these genomic regions Tajima's D of 925 autosomal DNA was significantly negative (Tables S3 -S6), indicating an excess of 926 low frequency polymorphisms. This could stem from population expansion, although 927 in M. maura, the 95% CI for this parameter was near zero suggesting population size 928 of that species may have varied less than that of the others.

929
That Tajima's D is significantly different from zero provides circumstantial evi-930 dence for a dynamic demography in at least some of these species, and changes in 931 population size are known to influence the X/A polymorphism ratio [19]. Addition-ally, other factors such as gBGC or natural selection on GC content have the potential 933 to affect the X/A polymorphism ratio (e.g., Evans et al. 2014). For these reasons, 934 we fitted the polymorphism data from genomic regions far (>51,000 bp) from genes 935 to several models of evolution to polymorphism data from the X and autosome for 936 each of the four populations or species for which we had data from >3 individuals.

937
For the three species where the equilibrium model (with no change in population size  Table S9 that is over twice as high as that 945 of any other model. The other three populations/species each had evidence for a 946 dynamic demography and zero weight for the equilibrium model.

947
For all species, gBGC and/or selection favoring increased GC content is supported 948 because the the maximum likelihood estimates and/or model averages for the gBGC 949 parameters on the autosomes and the X, γ A and γ X respectively, are greater than 950 zero. This indicates that gBGC and/or selection for GC content favor an increase 951 in GC content (Table S11). Moreover, all models that set γ A or γ X equal to zero 952 had low AIC weights (Tables S7 -S9). If the strength of gBGC and/or selection 953 favoring increased GC is similar on the X and autosomes, we expect γ X = λγ A ; for 954 all four species/populations the AIC weights of these models were high compared to the other models that relaxed this constraint, suggesting that the forces driving GC-biased molecular evolution in these genomic regions were indeed similar. This 957 latter finding was also recovered previously for M. tonkeana [11] using an expanded 958 dataset from that species that also analyzed paired end sequences from RADseq.

959
In this analysis, variable positions are re-coded into two states, A 0 and A 1 , where 960 A 0 refers to G or C nucleotides and A 1 refers to A or T nucleotides (see Methods).

961
Additionally, θ 01 = 4N e µ 01 , where µ 01 represents the mutation rate from A 0 → A 1 , 962 and θ 10 = 4N e µ 10 , where µ 10 represents the mutation rate from A 1 → A 0 . In all 963 species, θ 01 > θ 10 for the X and for the autosomes (Table S11). This suggests that in 964 each of these different species there are more variable positions in which a G or a C 965 is the major (more common) allele than where an A or a T is the major allele. Also 966 of interest is the observation that in each species, θ ijA λ/θ ijX > 1, where ij is 01 or 967 10 and A and X refer to the autosomes and X respectively. This indicates that the 968 mutation rate is higher in the autosomes than in the X, which is suggestive of male 969 driven evolution -a result that is also suggested by pairwise divergence between the 970 three species for which we performed WGS as described below.

971
Thus, while we recovered significantly lower polymorphism on the X than ex-972 pected based on π in four macaque species, in each one this could be accounted 973 for by an evolutionary model that includes a dynamic demography and selection on 974 gBGC/natural selection on GC content. One factor not incorporated in these anal-975 yses and that is beyond the scope of this study is the possibility that hybridization 976 among species via male dispersal could influence the X/A ratio. As discussed above, 977 most papionin monkeys have female philopatry and hybridization has been detected between all parapatric species pairs on Sulawesi. If hybridization were mostly medi-979 ated by male migration, diversity in the autosomes would be increased to a greater 980 extent than the X. This possibility could explain the lower (but not significantly so) 981 levels of diversity on the X. Added to this, other factors such as natural selection in 982 males on deleterious recessive mutations, could also decrease diversity on the X.  Table S1: Information on sequence data analyzed in this study including the species (Species), sample identification number (SampleID), sex (Sex), number of reads before and after trimming (Untrimmed and Trimmed, respectively), read length after trimming if performed or before trimming for the HiSeqX data (Readlength), GC content (GC), and the number of mapped reads (mapped). For the HiSeqX data, trimming was not performed (np). Mapped reads were computed for RADseq and HiSeqX data by the flagstat command of samtools and gatk, respectively.         Table S5: M. maura polymorphism (n = 2 females, 3 males). Labels follow Table S1.  Table S6: M. hecki polymorphism (n = 4 females, 2 males). Labels follow Table S1.       Table S9). As described in the Supplement, θ 01X is the polymorphism parameter for sites on the X where the derived mutation is an A or T, θ 10X is the polymorphism parameter for sites on the X where the derived mutation is an G or C, γ X is the selection parameter for GC content on the X. θ 01A , θ 10A , and γ A are the corresponding parameters for the autosomes. λ, ρ 1 , and τ 1 refer respectively to the X/A polymorphism ratio, the ratio of the current to ancestral population size, and number of generations before the present that the population size changed   Figure S2: X/A polymorphism ratios (X/A) based on RADseq data for four species. Ratios were calculated in three genomic categories: (1) exonic and intronic sequences and flanking regions less than 1000 bp from genes (<1000bp), (2) nongenic regions that are between 1,000 and 51,0000 bp from genes (1000-51000bp), and (3) nongenic regions that are greater than 51,000 bp from genes (>51000bp). Bars indicate 95% confidence intervals and accommodate uncertainty in polymorphism and divergence. A horizontal line indicates the 0.75 expectation.  Figure S3: A Euler diagram of the number of sites in the non-pseudoautosomal region of the X chromosome for which a heterozygous genotype was inferred for one or more of the three individudals used in the analysis of gene flow across Wallace's Line. Numbers in each region of the chart refer to the number of shared or unshared heterozygous genotypes, and is based on~57.5 million genotyped sites in each individual after discarding repetitive regions. Most of the (pseudo-) heterozygous genotypes in each male were shared with the other, whereas most of the heterozygous genotypes in the female were not shared with either male. This analysis is consistent with the proposal that most of the heterozygous genotypes in the female are real.