Metagenomics of the modern and historical human oral microbiome with phylogenetic studies on Streptococcus mutans and Streptococcus sobrinus

We have recently developed bioinformatic tools to accurately assign metagenomic sequence reads to microbial taxa: SPARSE for probabilistic, taxonomic classification of sequence reads; EToKi for assembling and polishing genomes from short-read sequences; and GrapeTree, a graphic visualizer of genetic distances between large numbers of genomes. Together, these methods support comparative analyses of genomes from ancient skeletons and modern humans. Here, we illustrate these capabilities with 784 samples from historical dental calculus, modern saliva and modern dental plaque. The analyses revealed 1591 microbial species within the oral microbiome. We anticipated that the oral complexes of Socransky et al., which were defined in 1998, would predominate among taxa whose frequencies differed by source. However, although some species discriminated between sources, we could not confirm the existence of the complexes. The results also illustrate further functionality of our pipelines with two species that are associated with dental caries, Streptococcus mutans and Streptococcus sobrinus. They were rare in historical dental calculus but common in modern plaque, and even more common in saliva. Reconstructed draft genomes of these two species from metagenomic samples in which they were abundant were combined with modern public genomes to provide a detailed overview of their core genomic diversity. This article is part of the theme issue ‘Insights into health and disease from ancient biomolecules’.


Introduction
Multiple research areas have undergone revolutionary changes in the last 10 years due to broad accessibility to high-throughput DNA sequencing at reduced costs. These include the evolutionary biology of microbial pathogens based on metagenomic sequencing. Studies on Mycobacterium tuberculosis [1,2], Mycobacterium leprae [3,4], Yersinia pestis [5][6][7][8][9][10] and Salmonella enterica [11][12][13] have yielded important insights into the history of infectious diseases by combining modern and historical genomes. In principle, the same approach might also help to elucidate the evolutionary history of both commensal and pathogenic taxa within the human oral microbiome. Periodontitis and dental caries have likely afflicted humans since their origins [14][15][16][17]. They may now be amenable to population genetic analyses because a landmark publication by Adler et al. in 2013 [18] demonstrated that dental calculus (calcified dental plaque) from the teeth of skeletons that were up to 7500 years old could contain relatively well preserved ancient bacterial DNA. That publication was based on 16S rRNA sequences, which are not informative about intra-species genetic diversity. However, subsequent shotgun sequencing from modern and ancient dental calculus [19][20][21] has demonstrated that it should be possible to reconstruct genomic sequences that span millennia of human history from multiple individual species within the oral microbiome.
Reconstructing evolutionary history from the oral microbiome faces numerous technical challenges. Our understanding of the historical evolutionary biology of bacterial pathogens benefitted greatly from existing frameworks for the modern population genomic structure of those bacteria [22][23][24]. However, extensive bacterial population genetic analyses are largely lacking for the modern oral microbiome. The existing literature largely focuses on taxonomic binning into a traditional subset of 40 cultivatable species from periodontitis [25], whose sub-species population structure has not yet been adequately addressed at the genomic level. Instead, most analyses have focused on the 'oral complexes', which consist of groups of multiple species whose co-occurrence is statistically associated with periodontitis [26].
A second barrier to reconstructing evolutionary history is the limits of the currently existing bioinformatic tools. The genetic diversity of metagenomic sequences is usually classified by binning the microbial sequence reads into taxonomic units. Taxonomic assignments can be performed by the de novo assembly of metagenomic reads into MAGs (metagenomic assembled genomes) [27,28], or by assigning individual sequence reads to existing reference genomes. However, most current metagenomic classifiers rely on the public genomes in the NCBI database, whose composition is subject to extreme sample bias and which represents a preponderance of genomes from pathogenic bacteria [29]. Furthermore, shotgun metagenomes often include DNA from environmental sources, which include multiple microorganisms that have never been cultivated and may belong to unknown or poorly classified microbial taxa whose abundance is not reflected by existing databases. Recent evaluations have also demonstrated that current taxonomic classifiers either lack sufficient sensitivity for species-level assignments or suffer from false positives, and that they overestimate the number of species in the metagenome [29][30][31]. Both tendencies are especially problematic for the identification of microbial species which are only present at low-abundance, e.g. detecting pathogens in ancient metagenomic samples.
Over the last few years, we have developed a series of tools which can facilitate comparative metagenomics of modern and ancient samples. SPARSE, a novel taxonomic classifier for short-read sequences in the metagenome, was designed to provide accurate taxonomic assignments of metagenomic reads [32]. SPARSE accounts for the existing bias in reference databases [29,33] by sorting all complete genomes of Bacteria, Archaea, Viruses and Protozoa in RefSeq into sequence similarity-based hierarchical clusters with a cutoff of 99% average nucleotide identity (ANI99%). It subsequently extracts a representative subset from those clusters, consisting of one genome per ANI95% cluster because ANI95% is a common cutoff for individual bacterial species [34,35]. SPARSE then assigns metagenomic sequence reads to these clusters using Minimap2 [36]. However, such alignments are likely to be inaccurate when they are widely dispersed across multiple ANI95% clusters because such wide dispersion reflects either ultra-conserved elements of uncertain specificity or a high probability of homoplasies due to horizontal gene transfer. SPARSE therefore reduces such unreliable alignments by a negative weighting of widely dispersed sequences reads. The remaining metagenomic reads are then assigned to unique species-level clusters on the basis of a probabilistic model and labelled according to the taxonomic labels and pathogenic potential of the genomes within those clusters. Our methodological comparisons demonstrated that SPARSE has greater precision and sensitivity with simulated metagenomic data than 10 other taxonomic classifiers and yielded more correct identifications of pathogen reads within metagenomes of ancient DNA than five other methods [32]. SPARSE is also suitable for classifying reads from metagenomes from modern samples and can extract reads from any ANI95% taxon of interest.
SPARSE assigns sequence reads to taxa, but does not create genomic assemblies from the selected metagenomic reads. That task is performed by EToKi, a stand-alone package of useful pipelines that are used by EnteroBase [5] for manipulations of 100 000s of microbial genomes. EToKi is used to merge overlapping paired-end reads, remove low-quality bases and trim adapter sequences. It then excludes sequence reads with greater sequence similarities to genomes from a related but distinct out-group than to an in-group of genomes from the target taxon of interest. EToKi then masks all nucleotides in an appropriate reference genome and creates a pseudo-MAG by unmasking nucleotides with sufficient coverage among the reads that have passed the in-group/out-group comparisons. Finally, EToKi can create a SNP matrix from pseudo-MAGs plus additional draft genomes and generate a maximum-likelihood phylogeny (RAxML 8.2 [37]), which can be visualized together with its metadata in GrapeTree [38].
Here, we demonstrate the power of this combination of pipelines by examination of the metagenomic diversity of the human oral microbiome from a large number of historical and modern samples from diverse geographic sources. We address the question of which microbial taxa are uniformly present in human saliva, dental plaque and dental calculus, and which are specific to individual niches. We test the associations of oral taxa within the traditional oral complexes and conclude that their very existence needs re-examination. Finally, we examine the population genomic structures of Streptococcus mutans and Streptococcus sobrinus, which are associated with dental caries in some human populations [39][40][41].

(a) SPARSE analysis of oral metagenomes
We identified 17 public archives containing 1016 sets of metagenomic sequences (table 1) from 791 oral samples from a variety of global sources which had been obtained from modern human saliva, modern human dental plaque or historical dental calculus (electronic supplementary material, table S1). Individual sequence reads from those metagenomes were assigned to taxa with SPARSE. The assignments were made according to an upgraded database of 20 054 genomes of bacteria, archaea or viruses, one genome per ANI95% cluster among 101 680 genomes in the NCBI RefSeq databases in May 2018. Seven metagenomes (ancient dental calculus: 5; modern saliva: 2) lacked bacterial reads from the oral microbiome (electronic supplementary material, table S2). These seven metagenomes were ignored for further analyses, leaving assignments to 1591 microbial taxa from 1009 metagenomes (784 samples) (table 2). Electronic supplementary material,  table S3, reports the percentage assignment of the reads in each sample to each of the 1591 taxa, except for assignments with a sequence read frequency of less than 0.0001%, which are reported as 0%. Electronic supplementary material, table royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 375: 20190573 S3, includes a column identifying assignments to the oral microbial complexes defined by Socransky et al. [26]. SPARSE also identified 152 samples containing Archaea from four species, 214 samples containing at least one of four human viruses and 146 samples containing at least one of 12 bacteriophages (table 3). This dataset may represent the currently broadest sample of the oral microbiome from global sources and over time.
(b) Comparisons of microbiomes from saliva, plaque and historical dental calculus We tested whether individual oral taxa were particularly enriched or depleted according to source with multiple quantitative approaches, including UMAP (Uniform Manifold Approximation and Projection), principal component analysis (PCA) and hierarchical clustering. UMAP is a recently described, high-performance algorithm for dimensional reduction of diversity within large amounts of data by nonlinear multidimensional clustering [56]. A UMAP plot of the taxon abundances in each sample showed three clusters (figure 1a). The three clusters are totally discrete (electronic supplementary material, figure S1A) according to a machine learning approach, optimal k-mean clustering of the first three components from the UMAP analysis. With minor exceptions, the three UMAP clusters were also predominantly associated with the source, with one cluster for taxa from modern saliva, a second one for taxa from modern dental calculus and the third for taxa from ancient dental calculus (figure 1a). Similar results were obtained with a classical PCA, except that the clusters were not as clearly distinguished as with UMAP, and the proportion of exceptions was greater (electronic supplementary material, figure S1B). The assignments of source affiliations to cluster were also largely consistent between UMAP and PCA, with occasional exceptions (electronic supplementary material, figure S1C).
For the third approach, we calculated the Euclidean p-distances between each pair of samples and subjected them to hierarchical clustering by the neighbour-joining algorithm with the results shown in figure 1b. Hierarchical clustering also largely separated the samples by source with only a few exceptions. Samples from modern saliva formed one large cluster. Samples from modern dental plaque formed two related but discrete sub-clusters, one of which included a sub-sub cluster of samples from historical dental calculus. These clusters also largely corresponded to the clusters found by k-mean clustering of UMAP data.
Thus, three primary and distinct clusters were consistently identified by three independent methods from the quantitative numbers of reads in individual microbial taxa. The three clusters were largely source specific for modern saliva, modern plaque and historical dental calculus. This finding Table 1. Sources of metagenomic reads. Note: ancient calculus refers to ancient dental calculus from historical samples. Plaque and saliva refer to modern dental plaque and saliva. Sets of short reads were downloaded from GenBank except for Archive 17, which was downloaded from the Online Ancient Genome Repository. Seven metagenomes (electronic supplementary material, table S2; Archive 11:2; Archive 17:5) were excluded from further analyses because they contained too few reads from common microbial taxa in the oral microbiome. predicts that the microbiomes from these three sources contain source-specific taxa.
(c) Source-specific taxa We attempted to identify the most important bacterial taxa for the observed clustering by sample source with a second, powerful machine learning approach. A supervised support vector machine (SVM) [58] classification was used to identify the most optimal of 300 SVM model variants, and the 40 most discriminating ANI95% taxa according to that model are shown in figure 2, together with mini-histograms that summarize the relative abundance of sequences by source. As predicted from the discrete clustering described above, multiple taxa were dramatically more prominent in samples from one source than from either of the two other sources.
The results also show that the most prominent sample source varied with the taxon (figure 2). Eleven of the 40 most discriminatory taxa belonged to the oral complexes that are associated with periodontitis according to Socransky et al. [26]. Seven species from oral complexes (Veillonella parvula, Fusobacterium nucleatum, Capnocytophaga gingivalis, Streptococcus gordonii, Actinomyces naeslundii, Actinomyces viscosus and Capnocytophaga sputigena) were most abundant in modern plaque and two other species (Streptococcus sanguinis, Tannerella forsythia) were most abundant in historical dental calculus. The yellow complex includes Streptococcus mitis, which encompasses over 50 distinct ANI95% clusters [59]. Two of these ANI95% clusters, designated S. mitis s8897 (ANI95% cluster in electronic supplementary material, table S3; MG_43 in [59]) and S. mitis s126097 (MG_56) were included among the 40 most discriminatory taxa, and each of them was more frequent in saliva than in dental plaque or dental calculus.
Seventeen other taxa that were assigned to an oral complex by Socransky et al. [26] are not included in figure 2 because they were not among the 40 most discriminatory taxa. We therefore examined the relative abundances of all 28 taxa from oral complexes in greater detail (figure 3). Three of the four taxa in the blue and purple complexes are very abundant in oral metagenomes, and all four are preferentially found in modern plaque. However, the other oral complexes are heterogeneous in their patterns of relative abundances. For example, within the red complex, both T. forsythia and Treponema denticola were most frequently found in historical dental calculus but Porphyromonas gingivalis is most frequent in modern plaque and is generally much less abundant. Similar intra-complex discrepancies were found for the orange, yellow and green complexes. These inconsistent frequencies by source raise questions about the consistency of the compositions of those complexes in individual samples.

(d) Existence of 'oral complexes'?
Socransky et al. [26] initially treated oral complexes as a hypothesis. However, they have now attained the status of accepted wisdom and even play a prominent role in routine laboratory investigations and treatment of periodontitis. The oral complexes included 28 cultivated bacterial species, whose presence or absence was determined by DNA hybridization against a small number of probes. This technology is now outdated; the number of known oral taxa has increased dramatically and the data presented here are for relative abundance rather than presence or absence. However even after weighting for genome size, we do not find a close correspondence between the frequencies of cells in sub-gingival dental plaque measured by Socransky et al. [25] and the results presented here (electronic supplementary material, text). We therefore re-examined the strengths of association with the oral complexes from the data presented here according to  [26]. The original assignments to the oral complexes depended strongly on results from hierarchical clustering of the pairwise concordance between species for presence or absence in individual samples. The tree in figure 4 shows neighbour-joining clustering of the common microbial taxa in our dataset by the similarities of their abundances over all samples in our dataset according to SPARSE. This tree contradicts the original composition of the oral complexes: the four areas of the tree where oral complex taxa are clustered each contain representatives from multiple complexes, and none of those four clusters corresponds to the original compositions proposed by Socransky et al. [26].
It seemed possible that the discrepancies between figure 4 and the original compositions of the oral complexes might reflect the fact that this study identified many additional taxa, some of which were as common as those used to define the oral complexes (electronic supplementary material, text). We therefore performed cluster analyses of our current data for the original set of 31 cultivatable bacterial species examined by Socransky et al. [26]. We compared the neighbour-joining algorithm used here with the less powerful, agglomerative clustering method (UPGMA, unweighted pair group method with arithmetic mean) that had been used by Socransky et al.
We also compared the abundances across all samples with abundances in plaque, which was the primary source for bacteria tested by Socransky et al. The results (electronic supplementary material, figure S3) show dramatic inconsistencies between independent trees in regard to the clustering of the oral complex bacteria. For example, T. forsythia, T. denticola and P. gingivalis of the red complex cluster together (and also with C. rectus) in electronic supplementary material, figure S3A,C,F,G. However, T. denticola and T. forsythia are separated from P. gingivalis in the four other graphs in electronic supplementary material, figure S3. And none of the three cluster together with each other in electronic supplementary material, figure S3E. Similar, or even greater, discrepancies are visible for the other oral complexes in electronic supplementary material, figure S3. Inconsistencies in clustering patterns across minor differences in sampling and clustering algorithms raise severe doubts about the very existence of the oral complexes as defined by Socransky et al. [26].

(e) Numbers of taxa per source
The rarefaction curves in figure 5a provide a breakdown of taxa by sample source as additional samples are tested. Human alphaherpesvirus 1 royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 375: 20190573 SPARSE detected 1591 microbial taxa over all 784 metagenomic samples: 1389 from modern saliva, 842 from modern plaque and 696 from historical calculus. These estimates will increase as additional samples are added, but at increasingly slower rates because the rarefaction curves seem to be reaching a plateau, except for historical dental calculus where the fewest samples have been evaluated until now. The median numbers of taxa per sample range from 177 (historical dental calculus) to 288 (modern saliva) and were much smaller than the total numbers. These median values reflect a bimodal distribution for numbers of taxa per sample (figure 5b), wherein a few samples had jackpots of large numbers of taxa but all other samples had only a few.
The analyses described above focused on differences in taxon composition by source. However, the Venn diagram in figure 5c shows that 447 taxa were common to all three sources, even if their relative abundances varied. Modern plaque yielded only 34 taxa which were not found in either historical dental calculus or modern saliva. More source-specific taxa were found in historical dental calculus, which may possibly  [57]) hierarchical clustering based on the Euclidean distances between pairs of metagenomes. Euclidean p-distances were calculated between each pair as the square root of the sum of the squared pairwise differences in the percentage of reads assigned by SPARSE to each microbial taxon. Nodes whose cluster location was inconsistent with the UMAP clustering in (b) are highlighted with black perimeters. Tree visualization: GrapeTree [38]. Alternatively, some taxa may be absent in modern dental plaque because historical lineages have become extinct [11]. Saliva yielded 504 unique taxa, some of which might be transient and do not persist long enough to be incorporated into plaque.
(f ) Population genomics of organisms associated with dental caries The microbiome associated with early stages of dental caries is an unresolved topic that remains under active investigation [40,[60][61][62]. However, it is generally accepted that S. mutans and S. sobrinus are routinely associated with caries [63]. Our data confirm that reads belonging to these two taxa are abundant in modern dental plaque and also show that they are even more abundant in modern saliva ( figure 6a,c). However, there was no significant correlation between the relative frequencies of these species across multiple metagenomes (electronic supplementary material, figure S9). Prior analyses based on 16S RNA operational taxonomic units (OTUs) indicated that S. mutans was extremely rare in historical dental calculus and argued that this increase was caused by the introduction of high levels of sugar to human diets in industrialized societies in the last 200 years [18]. Our data show that S. sobrinus was undetectable in historical samples (frequency of less than 0.0001% of reads or less than 10 reads per metagenome; figure 6c). S. mutans was also undetectable in most of these samples, but up to 0.04% of all reads in 10 historical samples spanning the last 1500 years were assigned to S. mutans (figure 6a), in accordance with archaeological findings that dental caries has been common in multiple eras over the last 10 000 years [14]. The few reads from historical samples that were assigned to S. mutans showed increased deamination at their 5 0 -ends when tested by MapDamage2 [64] (electronic supplementary material, figure S4), confirming that they were truly from ancient DNA. We exploited the high frequency of sequence reads from these two Streptococcus species in modern dental plaque and saliva to illustrate how SPARSE and EToKi can be used to extract pseudo-MAGs from metagenomic sequence reads and combine them with genomes sequenced from cultivated bacteria (see Methods). These procedures resulted in a total of 31 pseudo-MAGs for S. mutans and 15 pseudo-MAGs for S. sobrinus in which over 70% of the reference genome had been unmasked (figure 6e,f; electronic supplementary material, table S6). Most of these pseudo-MAGs were from Chinese samples [47]. The pseudo-MAGs were combined with genomes from cultivated bacteria of the same species from Brazil, the US and the UK as well as other countries  Figure 3. Average percentage abundances in 784 metagenomes by oral source (key legend) of 28 species from six oral complexes described by Socransky et al. [26]. The oral sources are indicated by three mini-histogram bars for each species. Species are ordered from left to right by oral complex, whose colour designation is indicated at the top. Within each oral complex, the species order is by decreasing total abundance.

Discussion
Several years ago, we accidentally became interested in comparing historical and modern genomes reconstructed from metagenomic short-read sequences with draft genomes assembled from high-throughput sequencing of cultivated bacteria. Our initial efforts involved the deployment of individual bioinformatic tools, comparisons of multiple publicly available algorithms and compilation of draft genomes from publicly available sequence read archives of short-read sequences [2]. In parallel, we were also involved in developing EnteroBase, a compendium of 100 000s of draft genome assemblies from multiple genera that can cause enteric diseases in humans, including Salmonella [5,24]. These two projects were synergistic for elucidating the evolutionary history of Salmonella enterica based on metagenomic sequences from 800-year-old bones, teeth and dental calculus [11]. In that case, sequence reads from S. enterica were found in teeth and bone, but not in dental calculus. Our attempts to examine further samples of dental calculus quickly demonstrated that optimized pipelines were needed because manual analyses were too time intensive. However, none of the existing tools were both reliable and sufficiently sensitive for assigning sequence reads from historical metagenomes to the tree of microbial life. We therefore took a step back   .
(e and f ) Read coverage (dots; left) and percentage of the reference genome that was unmasked (≥3 reads; ≥70% consistency) (histogram; right) in S. mutans (e) and S. sobrinus ( f ). Ordered by decreasing percentage unmasking.  and developed SPARSE [32] to satisfy our requirements. SPARSE replaces the current reference databases, which are strongly biased to multiple, closely related genomes from bacterial pathogens, by a representative subset consisting of one genome per ANI95% hierarchical cluster within RefSeq, and assigns sequence reads to these clusters using a probabilistic model. That model penalizes non-specific mappings of reads and hence reduces false-positive assignments. SPARSE was more reliable than multiple other taxonomic classifiers, and both more sensitive and more reliable for identifying low numbers of reads from ancient metagenomes than multiple other pipelines [32]. In parallel, we expanded the capacities of EToKi [5], an efficient backend pipeline for genomic manipulations, such that it can accurately identify individual sequence reads sieved through SPARSE that are more similar to an in-group of reference genomes from the target species than to an out-group of genomes from a closely related, but distinct taxon. Those reads are then used to unmask nucleotides in a reference genome and generate a pseudo-MAG for SNP-based maximum-likelihood phylogenies. Finally, we developed GrapeTree [38], which facilitates the graphic visualization and manipulation of phylogenetic trees based on large numbers of genomes. Here, we demonstrate how to combine all three tools in order to obtain an overview of the microbial flora in samples from human oral saliva, modern dental plaque and historical dental calculus. We also reconstructed genomes of two taxa present at moderate concentrations within the oral microbiome and compare them with conventional draft genomes. The experimental procedures for processing 1016 metagenomes consisted of running SPARSE in the background for 2 months (approx. 100 000 CPU h). The pipelines described here permitted all other procedures and evaluations described here to be completed in less than two weeks. Our traditional understanding of oral ecology is largely based on taxonomic assignments of cultivatable bacteria, often performed by checkboard DNA-DNA hybridization [25]. Currently, 756 species have been cultivated from the human oral cavity and respiratory tract [66]. A subset of 40 are used for checkboard DNA-DNA hybridization [25], of which 28 were used to define the oral complexes that were thought to be of importance for periodontitis [26]. Our comparisons of those data with the results from the metagenomic analyses presented here show that the frequencies of individual taxa determined by the checkerboard assay were inconsistent with the frequencies determined by our metagenomic analyses (electronic supplementary material, figures S5 and S6). The checkerboard assays also lacked 17 common taxa from dental plaque and dental calculus that were found by metagenomic analyses. These results are not unexpected because our metagenomic analyses included saliva samples as well as ancient dental calculus and identified 1591 taxa, many of which have not been cultivated. Furthermore, it is now well established that the frequencies of certain supposed members of the oral complexes differ very dramatically with geographical source [67]. However, we had anticipated that we might be able to expand the compositions of the oral complexes to include previously uncultivated organisms. Instead, we were unable to reliably identify their very existence (figure 4) because clustering of taxa was affected by minor changes in choice of samples and the choice of clustering algorithm (electronic supplementary material, figure S3). We therefore conclude that the existence and composition of the oral complexes needs independent verification by modern techniques and new samples.
The data presented here provide an unprecedented comparative overview of the relative proportions of the predominant taxa in public available metagenomes from the modern and historical oral microbiome. Figure 2 identifies 15 taxa, which are particularly common in historical calculus, 14 others that are preferentially found in modern dental plaque and 11 that seem to be specific for saliva. These associations with a particular source in the oral cavity might be used to identify currently undefined ecological complexes of oral taxa that share a common niche. However, species-level OTUs are likely to be conglomerates of multiple microbial populations, each of which may inhabit a somewhat different ecology. For some organisms such as Salmonella or Escherichia, efforts are currently underway to develop hierarchical clustering of such populations in order to categorize their ecological and pathogenic differentiation [5]. A step in this direction for the oral microbiome is the recognition of ANI95% clusters s8897 and s126097, both of which were preferentially found in saliva. A large study of all streptococci [59] identified multiple other ANI95% clusters within S. mitis but their preferential location in the oral cavity has not yet been addressed. Indeed, little is yet known about the sub-species population structure of almost all of the taxa identified here.
Our more detailed investigation of S. mutans and S. sobrinus may represent a forerunner of future studies on sub-species ecological differences within the oral microbiome. S. mutans and S. sobrinus are commonly associated with dental caries and may play a causal role in that disease [63]. However, once again these taxa were more common in saliva than in dental plaque (figure 6). We chose S. mutans and S. sobrinus for more detailed analysis because sufficient reads were found in multiple metagenomes from modern samples to allow the partial reconstruction of multiple genome sequences (pseudo-MAGs). In addition, multiple draft genomes from cultivated bacteria existed in the public domain, which were available for genomic comparisons. We were also intrigued by the claim that S. mutans was rare in historical plaque [18]. Our data support that claim, and we found only a few historical samples of dental calculus that contained any reads of S. mutans, and none with S. sobrinus. Our data also support prior conclusions of a lack of phylogeographic differentiation within S. mutans [65]. However, although the data are still somewhat limited, S. sobrinus from China tend to cluster distinctly from genomes from Brazil (figure 7). Distinct clustering might reflect phylogeographical signals but other causes of clustering cannot currently be excluded because the Chinese genomes were pseudo-MAGs reconstructed from metagenomes from dental plaque and saliva while the Brazil genomes were from bacteria cultivated from dental plaque. Additional genomes of S. sobrinus from other geographical areas would be needed to determine whether the apparent phylogeographical trends are robust. Such analyses could also be facilitated by creating an EnteroBase for Streptococcus, which could be done relatively easily [59] if there were interested curators and sufficient interest in the Streptococcus community.
In summary, we illustrate the use of a variety of reliable, high-throughput tools for determining microbial diversity within metagenomic data, and for extracting microbial genomes from metagenomes. We illustrate these tools with metagenomes from both modern and historical samples, royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 375: 20190573 and release all the data and methods for further use by others.

Methods (a) SPARSE database update
In its original incarnation in August 2017 [32], SPARSE used MASH [68] to assign 101 680 genomes from the NCBI RefSeq database to 28 732 ANI99% clusters of genomes. By May 2018, 21 540 additional genomes had been added to NCBI RefSeq. These were merged into the existing database in the same manner as previously, by merging that each genome into an existing ANI99% cluster or by creating a new cluster containing one genome if the ANI to all existing clusters was less than 99%. An ANI99% representative microbial database was generated which contained one representative genome for each of the 32 378 ANI99% clusters containing bacteria, archaea or viruses plus a human reference genome (Genome Reference Consortium Human Build 38) such that reads from human DNA could also be called. All the representative genomes were assigned to a superset of 20 054 ANI95% clusters, and this was used for species assignments and genomic extractions as described [32].

(b) SPARSE analyses
'EToKi prepare' was used to collapse paired-end reads and trim all sequence reads. Subsequent SPARSE analyses were performed on all the metagenomes in table 1 and additional metagenomes in electronic supplementary material, figure S7, as described in the SPARSE manual (https://sparse.readthedocs.io/en/latest/). The first step was 'SPARSE predict', which identifies ANI95% groups containing greater than or equal to 10 specific reads. Subsequently, 'SPARSE report -low 0.0001' was used to assign taxon designations to the ANI95% groups and produce a table of all metagenome results (electronic supplementary material, table S3) which lists distinct taxa for each metagenome that accounted for ≥0.0001% of all its reads. Electronic supplementary material, table S3, also includes the designations of oral complexes and other known pathogens according to a manually curated dictionary. Sequence reads were extracted from the metagenomes for assembling pseudo-MAGs with 'SPARSE extract'.
For electronic supplementary material, figures S5-S8, the taxonomic assignments were inversely weighted by genome size in order to render them comparable to DNA-DNA Checkerboard data and output from Metaphlan2, which calculate cell counts. To this end, the number of metagenomic reads assigned to each species within a metagenome was divided by the genome size of the SPARSE reference genome for that species. These data were then expressed as a proportion of the summed data for all microbial species within that metagenome.

(c) Metagenomes lacking reads from the oral microbiome
We tested all metagenomes to identify any that might be grossly contaminated by collating the 50 most abundant microbial species over all metagenomes (electronic supplementary material, table S4A). The percentage of reads in these 50 taxa was summed for each metagenome and expressed as a percentage of all microbial reads. Seven metagenomes (ancient dental calculus: 5; modern saliva: 2; electronic supplementary material, table S2) were excluded because the percentages of those top oral microbes constituted less than 15% of their total microbial reads.

(d) Dimensional reduction of frequencies of reads
Two forms of dimensional reduction of diversity were used to detect source-specific clustering within the SPARSE results.
UMAP analysis was performed with its Python implementation [56], using the parameters min_neighbours = 5 and min_dist = 0.0. PCA was performed using the decomposition.PCA module of the scikit-learn Python library [69]. Optimal k-mean clusters of the first three components from the UMAP analysis were calculated with the sklearn.cluster module of the scikit-learn Python library.
(e) Ranking of microbial species by their associations with source Microbial species were ranked by their weighting according to an SVM classification [58]. A supervised SVM classification of samples was performed using the SVM module of the scikit-learn Python library on the raw SPARSE results (electronic supplementary material,  4b,d). The cleaned, species-specific reads generated from these samples as in Methods (b) were processed with the stand-alone version of EToKi as described in electronic supplementary material, fig. S6 of Zhou et al. 2020 [5] and in greater detail in the online manual (https://github.com/zheminzhou/ EToKi). EToKi assemble was then used to identify genome-specific reads after specifying a reference genome, an in-group of related genomes, and a related but distinct out-group of other genomes. For S. mutans, the reference genome was UA159 (accession code GCF_000007465), the in-group was 194 other S. mutans genomes in RefSeq (electronic supplementary material, of Streptococcus downei from RefSeq spanned 1.16 MB and contained 160 863 core SNPs. These alignments were subjected to maximum-likelihood phylogeny reconstruction by EToKi phylo. Both ML trees were then visualized with GrapeTree [38]. (g) DNA damage patterns for ancient Streptococcus mutans reads SPARSE assigned low numbers of sequence reads to S. mutans in 10 metagenomes from ancient dental calculus ( figure 6; electronic supplementary material, table S3). In order to assess their authenticity, these reads were assessed with MapDamage2 [64] for patterns of cytosine deamination that are characteristic of authentic ancient DNA. To this end, all S. mutans-specific reads were extracted with SPARSE. They were aligned to the S. mutans reference genome UA159 with Minimap2 [36] and reads which were at least 95% identical with the reference genome were used to create BAM alignments. SouthAfr2 contained 11 specific reads according to SPARSE, but only eight survived this filtering step. SouthAfr2 was therefore excluded from further analyses because these were too few reads to provide reliable analyses. The BAM alignments from the remaining nine metagenomes consist of both fully aligned reads (46-72%) and others which were 'softclipped', i.e. terminal bases were not aligned to the reference genome. In order to ensure that these soft-clipped reads were also specific, we compared the alignment scores for all reads against UA159 with the alignment scores against the 62 outgroup genomes in Mutans streptococci (electronic supplementary material,  figure S4A-D).
Data accessibility. The pseudo-MAGs reconstructed from metagenomes for S. mutans and S. sobrinus are freely accessible in tar.gz files containing Datasets_S1 and Dataset_S2 at https://github.com/zheminzhou/ OralMicrobiome, respectively. Python scripts that were used to prepare data for figures 1-5 and S1-S3 are available as Dataset_S3 in the same repository. The taxonomic profiling by SPARSE of all 784 metagenomes is available in electronic supplementary material, table S3. Interactive versions of figure 7 are available at http://enterobase.warwick.ac.uk/a/42277 (figure 7a) and http://enterobase.warwick.ac. uk/a/42279 (figure 7b).