A global metabarcoding analysis expands molecular diversity of Platyhelminthes and reveals novel early-branching clades

Understanding biological diversity is crucial for ecological and evolutionary studies. Even though a great part of animal diversity has already been documented, both morphological surveys and metabarcoding analyses have previously shown that some animal groups, such as Platyhelminthes, may harbour hidden diversity. To better understand the molecular diversity of Platyhelminthes, one of the most diverse and biomedically important animal phyla, we here combined data from six marine and two freshwater metabarcoding expeditions that cover a broad variety of aquatic habitats and analysed the data by phylogenetic placement. Our results show that a great part of the hidden diversity is located in early-branching clades such as Catenulida and Macrostomorpha, as well as in late-diverging clades such as Proseriata and Rhabdocoela. We also report the first freshwater record of Gnosonesimida, a group previously thought to be exclusively marine. Finally, we identified two putative novel freshwater Platyhelminthes clades that branch between well-defined orders of the phylum. Thus, our analyses of several environmental datasets confirm that a large part of the diversity of Platyhelminthes remains undiscovered, point to the groups with the most potential novel species and identify freshwater environments as potential reservoirs for novel species of flatworms.


Introduction
To understand past and present biological processes and to make meaningful decisions for the future, it is of pivotal importance to decipher extant biodiversity [1]. Accurate biodiversity assessment is difficult because of sampling biases and the limitations of morphology-based taxonomic identification [2]. Sampling biases include restrictions in sampling site accessibility and preferential collection of specimens due to methodological constraints, both of which lead to a non-representative sample of the community under study. Traditional identification methods based on morphology are low throughput, consume time and resources, require a high level of taxonomic expertise that is particularly rare for poorly studied groups, and fail to cope with cryptic diversity. As a consequence, it is estimated that real extant species diversity probably exceeds the current number of described species by 10-fold [3]. Although the unicellular eukaryotic lineages suffer from this bias more than their multicellular counterparts [4,5], there are groups of animals for which an assessment of diversity is incomplete [6,7].
Platyhelminthes (flatworms) is one of the most diverse [8] and relatively well-studied animal phyla. Initially considered to be an early-branching bilaterian clade because of their simple morphology, flatworms were studied to understand the origin of bilaterian symmetry. However, recent molecular phylogenies nested them inside the superphylum Lophotrochozoa (Spiralia) [9][10][11] and recognized them as secondarily simplified. In addition, some flatworm species serve as model organisms to study whole-body regeneration [12,13] and the evolution of development [14]. Flatworms are also biomedically relevant, as 75% of the described species are obligate parasites of vertebrates [15,16]. Ecologically, they are key meiofaunal taxa in both marine and freshwater environments [17,18]. Nevertheless, flatworms have rarely been taken into consideration in traditional biodiversity studies, given that their morphological identification is tedious, requiring fixation and histological processing [19] or live examination when fixation can destroy their taxonomically informative internal reproductive anatomy [18]. Current estimates for the species richness of this phylum suggest that substantial hidden diversity is yet to be identified [7].
Metabarcoding emerged as a promising solution to unravelling hidden diversity and has been successfully applied in different groups of organisms and habitats [20][21][22][23][24]. However, to our knowledge, a thorough analysis of metabarcoding data on Platyhelminthes has never been done. To fill this gap, we here analysed different environmental datasets from six marine and two freshwater habitats to both explore the diversity of Platyhelminthes at the level of orders and detect potential novel, undescribed molecular diversity.

Material and methods
We collected a total of 1380 representative sequences of clustered operational taxonomic units (OTUs) identified as Platyhelminthes in six marine and two freshwater environmental surveys. Each survey targeted a different hypervariable region of the 18S rRNA gene (table 1, [25][26][27][28][29][30][31]). We assigned sequences to either groups or taxonomic categories based on BLAST 2.6.0 [32] searches of the SILVA 128 SSU reference database [33,34].
We constructed a reference tree of complete 18S rDNA sequences retrieved from the National Center for Biotechnology Information (NCBI) nucleotide database to use as a backbone for phylogenetic placement. We aimed to include all the known extant diversity of Platyhelminthes, focusing on the free-living representatives. To this end, we conducted a bibliographic search [35][36][37] to select complete sequences of all the known families for each order of the phylum. We restricted our taxon sampling to 455 taxa so that the resulting reference tree could easily be visually inspected, while encompassing the diversity of the extant Platyhelminthes according to the latest complete phylogeny of the phylum [16,38].
We filtered our initial metabarcoding dataset both by alignment (using PaPaRa v. 2.5 [39] to align the query sequences to the reference alignments) and by phylogenetic placement (using the EPA [40] as implemented in RAxML [41]). All trimming was done with trimAl [42]. We removed all sequences that (i) did not align in the correct hypervariable region, (ii) were placed in the outgroup of the reference tree, (iii) had extremely long branches in the best-hit placement tree, making the rate of nucleotide substitutions greater than 1, or (iv) had placement hits in both Platyhelminthes branches and outgroup branches. The filtered dataset (843 OTUs) was placed onto the reference tree using two different phylogenetic placement algorithms, RAxML-EPA [40] and pplacer [43]. We compared the resulting jplace files using compare_jplace_files.cpp as implemented in genesis (http://genesis-lib.org/) to confirm that, for each query, the placement with the highest likelihood-weight ratio (top placement) was located on the same branch in both analyses.
We constructed maximum-likelihood trees for all the queries with top placement outside the known Platyhelminthes orders: (i) a tree in which we combined full-length reference sequences with short queries and (ii) a tree in which we manually trimmed the reference alignment to the length of the short queries. The trees were built (i) in RAxML [41] under the GTR+GAMMA substitution model with 1000 rapid bootstrap replicates and (ii) in IQ-TREE [44] under the TN+F+R8 substitution model with 1000 ultrafast bootstraps, with branch support additionally tested by SH-like aLRT with 1000 replicates. All trees were visualized in iTOL [45].
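The top-placement comparison can be illustrated with a minimal Python sketch. This is not the genesis program used in the study; it only assumes the standard jplace layout (a "fields" list and a "placements" list whose entries carry "p" rows and "n" or "nm" names). It also assumes that both files number the edges of the reference tree identically, which is not guaranteed across tools, and the toy data are invented for illustration.

```python
def top_placements(jplace):
    """Map each query name to the edge_num of its highest-LWR placement."""
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")
    tops = {}
    for pquery in jplace["placements"]:
        best = max(pquery["p"], key=lambda row: row[lwr_i])
        # Names are stored under "n" (plain names) or "nm" (name, multiplicity).
        names = list(pquery.get("n", []))
        names += [pair[0] for pair in pquery.get("nm", [])]
        for name in names:
            tops[name] = best[edge_i]
    return tops


def disagreeing_queries(jplace_a, jplace_b):
    """Return shared queries whose top placement sits on different edges."""
    tops_a = top_placements(jplace_a)
    tops_b = top_placements(jplace_b)
    shared = tops_a.keys() & tops_b.keys()
    return sorted(q for q in shared if tops_a[q] != tops_b[q])


# Toy data (hypothetical): two placement runs over the same reference tree.
FIELDS = ["edge_num", "likelihood", "like_weight_ratio",
          "distal_length", "pendant_length"]
run_a = {"fields": FIELDS, "placements": [
    {"p": [[3, -100.0, 0.9, 0.01, 0.02], [7, -102.0, 0.1, 0.01, 0.02]],
     "n": ["otu1"]},
    {"p": [[5, -90.0, 0.6, 0.01, 0.02], [8, -91.0, 0.4, 0.01, 0.02]],
     "n": ["otu2"]},
]}
run_b = {"fields": FIELDS, "placements": [
    {"p": [[3, -100.5, 0.8, 0.01, 0.02]], "nm": [["otu1", 1]]},
    {"p": [[8, -90.5, 0.7, 0.01, 0.02]], "nm": [["otu2", 1]]},
]}

print(disagreeing_queries(run_a, run_b))
```

In practice the two dictionaries would come from json.load on the RAxML-EPA and pplacer output files; queries returned by disagreeing_queries would need manual inspection.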
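As a rough illustration of these two tree searches, the Python sketch below assembles command lines that would request comparable analyses. The exact invocations used in the study are not given in the text, so the executable names (raxmlHPC, iqtree), the random seed, and the file and run names are assumptions; the flags follow the RAxML 8 and IQ-TREE 1.x command-line conventions.

```python
import shlex


def raxml_rapid_bootstrap_cmd(alignment, run_name, replicates=1000, seed=12345):
    """RAxML: rapid bootstrapping plus best-ML-tree search under GTR+GAMMA."""
    return [
        "raxmlHPC",              # assumed binary name; varies by installation
        "-f", "a",               # rapid bootstrap analysis and ML search
        "-m", "GTRGAMMA",
        "-p", str(seed),         # parsimony random seed (assumed value)
        "-x", str(seed),         # rapid bootstrap random seed (assumed value)
        "-#", str(replicates),
        "-s", alignment,
        "-n", run_name,
    ]


def iqtree_ufboot_cmd(alignment, model="TN+F+R8", replicates=1000):
    """IQ-TREE: fixed model, ultrafast bootstrap and SH-like aLRT support."""
    return [
        "iqtree",                # assumed binary name; varies by installation
        "-s", alignment,
        "-m", model,
        "-bb", str(replicates),    # ultrafast bootstrap replicates
        "-alrt", str(replicates),  # SH-like aLRT replicates
    ]


# Hypothetical file names, printed as shell-ready strings.
print(shlex.join(raxml_rapid_bootstrap_cmd("ref_plus_queries.phy", "flatworms")))
print(shlex.join(iqtree_ufboot_cmd("ref_plus_queries.phy")))
```

Either list could be passed to subprocess.run once the corresponding program is installed.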

Results and discussion
We used 18S rDNA metabarcoding data from aquatic environments to expand our understanding of the molecular diversity of Platyhelminthes and to search for novel lineages. To this end, we compiled the most comprehensive flatworm metabarcoding 18S rDNA dataset to date, comprising 1380 query sequences from six marine and two freshwater environmental surveys (table 1). We first checked how many of those sequences corresponded to taxa already described and sequenced and how many represented novel taxa. We performed BLASTn searches against the SILVA 128 SSU database to confirm the identity of these potential Platyhelminthes. Among the 1245 query sequences that returned a flatworm sequence as the first hit, 60% had less than 97% sequence identity with the reference sequences (figure 1a). All groups, both parasitic and free living, showed high percentages of sequences with low BLAST identity (less than 97% sequence identity), with Proseriata, Prolecithophora and Trematoda being the groups with the highest percentages. For example, 95% of the 248 Proseriata sequences had less than 97% sequence identity with reference sequences.
[Figure 1a labels, with the number of query sequences per group: Platyhelminthes (1245), Catenulida (137), Macrostomorpha (244), Polycladida (52), Prorhynchida (6), Gnosonesimida (1), Rhabdocoela (373), Proseriata (248), Fecampiida (2), Prolecithophora (47), Tricladida (23), Bothrioplanida (0), Cestoda (18), Monogenea (19), Trematoda (75); percentage axis from 0% to 100%.]
The quality of the reference database is of pivotal importance to evaluate the number of novel species inside a clade. In biodiversity assessments based on sequence similarity methods, a good reference database includes as many sequences as possible. By contrast, to evaluate the number of novel species using a phylogeny-driven approach, the number of taxa in a good reference tree should be small enough to allow the visual inspection of the results and broad enough to encompass all existing diversity. To this end, we inferred our Platyhelminthes 18S rDNA reference tree based on a broad taxon sampling of 455 complete 18S rDNA sequences including representatives from all major flatworm clades. Although our reference tree did not recover the same topology of orders as in multigene phylogenetic analyses [16,38], most orders were monophyletic. This is important for the subsequent placement analysis, as all the queries that fall into the known delimited orders expand the diversity inside these orders and all the queries that fall between well-defined orders represent completely novel molecular diversity. Given that each study targeted different hypervariable regions of the 18S rRNA gene, our initial dataset was a mixture of V4, V7, V8-V9 and V9 18S rDNA queries (table 1). Queries from a hypervariable region must map to a full-length 18S rDNA with minimal ambiguity to serve as a reliable phylogenetic marker. However, in many cases, queries do not align unambiguously because of the fast-evolving nucleotide sites, resulting in unreliable trees. To overcome this pitfall, we refined the unfiltered dataset of 1245 query sequences by alignment.
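The identity summary described above can be sketched in Python as follows. The sketch assumes BLAST tabular output (outfmt 6, where the first three columns are query ID, subject ID and percent identity) with hits listed best-first per query; the sequence identifiers and values are invented, and this is not the authors' own script.

```python
def first_hit_identities(blast_tab_lines):
    """Return {query_id: percent identity of its first-listed hit}."""
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        qseqid, pident = cols[0], float(cols[2])
        # Keep only the first hit reported for each query.
        best.setdefault(qseqid, pident)
    return best


def fraction_below(identities, threshold=97.0):
    """Fraction of queries whose first-hit identity is below the threshold."""
    if not identities:
        return 0.0
    low = sum(1 for pident in identities.values() if pident < threshold)
    return low / len(identities)


# Hypothetical tabular hits: qseqid, sseqid, pident, then the remaining columns.
example = [
    "otu1\trefA\t99.1\t380\t3\t0\t1\t380\t560\t939\t0.0\t680",
    "otu1\trefB\t95.0\t380\t19\t0\t1\t380\t560\t939\t1e-170\t590",
    "otu2\trefC\t92.4\t375\t28\t1\t1\t375\t561\t934\t1e-150\t530",
    "otu3\trefD\t96.9\t130\t4\t0\t1\t130\t1650\t1779\t1e-55\t220",
]
ids = first_hit_identities(example)
print(ids, fraction_below(ids))
```

Grouping the identities by the taxonomic order of the first hit before calling fraction_below would give per-group percentages of the kind reported for Proseriata.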
More than one-quarter of the initial sequences were removed because of misalignment (table 1). The majority of V7 queries were removed during this filtering step, indicating that the V7 hypervariable region was too short and variable to be useful as a molecular marker for Platyhelminthes. By contrast, all V4 and V9 queries were retained, showing that these variable regions can serve as quality molecular markers for Platyhelminthes.
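One way to picture the alignment-based filter is to ask what fraction of a query's aligned bases falls inside the columns of its expected hypervariable region. The Python sketch below does this; the column coordinates, the cut-off and the toy sequence are hypothetical, and the study does not state how misalignment was scored, so this is only one plausible reading.

```python
def fraction_in_region(aligned_query, region_start, region_end, gap_chars="-."):
    """Fraction of non-gap characters lying in columns [region_start, region_end)."""
    positions = [i for i, ch in enumerate(aligned_query) if ch not in gap_chars]
    if not positions:
        return 0.0
    inside = sum(1 for i in positions if region_start <= i < region_end)
    return inside / len(positions)


def aligns_in_region(aligned_query, region_start, region_end, min_fraction=0.9):
    """Keep a query only if most of its bases sit in the expected region."""
    return fraction_in_region(aligned_query, region_start, region_end) >= min_fraction


# Toy aligned query (hypothetical): bases occupy columns 4-7 and 10-11.
query = "----ACGT--AC----"
print(fraction_in_region(query, 4, 12))   # all six bases inside columns 4..11
print(aligns_in_region(query, 0, 6))      # only two of six bases inside
```

The min_fraction default of 0.9 is an arbitrary illustrative choice and would need tuning against manually inspected alignments.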
Our phylogenetic placement analyses showed that the majority of OTUs grouped with free-living taxa (figure 1b). We detected phylogenetic placements in the internal nodes of early-branching clades such as Catenulida, Macrostomorpha, Prorhynchida and Polycladida that potentially indicate novel groups yet to be described. Many OTUs grouped within Polycladida, a well-described clade with more than 800 described species, of which only 30 have representative sequences. Thus, these placements probably reflect a lack of molecular data in the reference database rather than real novel diversity. Many other OTUs were placed within Proseriata and Rhabdocoela, probably representing novel diversity, given that these two clades are well sampled for the 18S rRNA gene.
We then inquired whether we could detect marine OTUs in groups considered freshwater and vice versa (figure 1b,c). As expected, marine groups were only detected in marine samples and freshwater groups in freshwater samples, except for one clade, Gnosonesimida. We recovered the first freshwater record of Gnosonesimida, a group formed by only six described species thought to be exclusively marine [16,18]. Polycladida and Prolecithophora have both freshwater and marine representatives, but we could only detect them in our marine datasets. Even though Neodermata is formed by obligate parasites, we recovered OTUs in the marine water column. Those OTUs were placed inside the three orders of Neodermata, suggesting that free-living stages may have been sampled.
We then analysed potential novel clades within Platyhelminthes. In our best-hit placement tree (figure 1c), we localized OTUs that were placed outside the limits of known flatworm orders (figure 2a) with high likelihood-weight scores and characterized them as 'interesting placements'. We inferred a maximum-likelihood tree from the alignment of the full-length reference sequences and those OTUs with phylogenetically interesting placements (figure 2b). The phylogenetic tree revealed two novel freshwater clades, clade 1 and clade 2, branching as monophyletic clades in between major Platyhelminthes orders. Clade 1 was formed by three sequences that had 92% sequence identity with sequences of the genus Castrada (Typhloplanidae) and clade 2 by 19 sequences with sequence identity to Otomesostoma auditivum and Invenusta aestus (Coelogynoporidae) that ranged between 91% and 94%. We also inferred a maximum-likelihood tree using only the V4 hypervariable region of the reference alignment and the short queries; this tree was not informative because of the weak phylogenetic signal. Even though the exact phylogenetic position of the new clades within Platyhelminthes remains unclear, they certainly form two separate, well-defined groups, probably in early-branching positions.
Phylogenetic placement outperforms the conceptually problematic but often used practice of reconstructing de novo phylogenies from short reads that do not contain sufficient phylogenetic signal to reproduce a reasonable tree. It is a reliable method to classify short DNA sequences, a common output of metabarcoding and metagenomic studies, and has been extensively used for taxonomic assignment in diversity studies. Overall, our analyses show a high diversity of Platyhelminthes in both marine and freshwater environments, with the latter habitat likely containing as yet unnamed taxa. We found that Proseriata and Rhabdocoela are the two flatworm groups with the most potential novel species. Our data also show a high novelty of molecular data in Catenulida and Macrostomorpha that may correspond either to described taxa that have not yet been sequenced or to new taxa. Moreover, we identified, in freshwater environments, two novel clades that group outside the well-known Platyhelminthes orders. While our data demonstrate the utility of metabarcoding analyses in the search for novel diversity, we emphasize the need for more traditional taxonomic efforts to achieve a good understanding of animal diversity.
Data accessibility. Alignments, trees and jplace files are available as electronic supplementary material.
Authors' contributions. K.M. collected the data, conducted the analyses, interpreted the data, designed the figures and wrote the draft and the revised manuscript. A.S.A. collected the data and revised the manuscript. I.R.-T. conceived and designed the study, supervised the work and revised the manuscript. All authors approved the final version of the manuscript and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.