Aligning evidence: concerns regarding multiple sequence alignments in estimating the phylogeny of the Nudibranchia suborder Doridina

Molecular estimates of phylogenetic relationships rely heavily on multiple sequence alignment construction. There has been little consensus, however, on how to properly address issues pertaining to the alignment of variable regions. Here, we construct alignments from four commonly sequenced molecular markers (16S, 18S, 28S and cytochrome c oxidase subunit I) for the Nudibranchia using three different methodologies: (i) strict mathematical algorithm; (ii) exclusion of variable or divergent regions and (iii) manually curated, and examine how different alignment construction methods can affect phylogenetic signal and phylogenetic estimates for the suborder Doridina. Phylogenetic informativeness (PI) profiles suggest that the molecular markers tested lack the power to resolve relationships at the base of the Doridina, while being more robust at family-level classifications. This supports the lack of consistent resolution between the 19 families within the Doridina across all three alignments. Most of the 19 families were recovered as monophyletic, and instances of non-monophyletic families were consistently recovered between analyses. We conclude that the alignment of variable regions has some effect on phylogenetic estimates of the Doridina, but these effects can vary depending on the size and scope of the phylogenetic query and PI of molecular markers.

JMH, 0000-0003-4147-4037 Molecular estimates of phylogenetic relationships rely heavily on multiple sequence alignment construction. There has been little consensus, however, on how to properly address issues pertaining to the alignment of variable regions. Here, we construct alignments from four commonly sequenced molecular markers (16S, 18S, 28S and cytochrome c oxidase subunit I) for the Nudibranchia using three different methodologies: (i) strict mathematical algorithm; (ii) exclusion of variable or divergent regions and (iii) manually curated, and examine how different alignment construction methods can affect phylogenetic signal and phylogenetic estimates for the suborder Doridina. Phylogenetic informativeness (PI) profiles suggest that the molecular markers tested lack the power to resolve relationships at the base of the Doridina, while being more robust at family-level classifications. This supports the lack of consistent resolution between the 19 families within the Doridina across all three alignments. Most of the 19 families were recovered as monophyletic, and instances of non-monophyletic families were consistently recovered between analyses. We conclude that the alignment of variable regions has some effect on phylogenetic estimates 2017 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

Introduction
The debate regarding multiple sequence alignment (MSA) construction [1][2][3][4][5][6][7][8] and molecular marker selection [9] for use in phylogenetic estimates has been well established in the literature. There is little consensus, however, on the most accurate and replicable approach when considering MSA assemblies [10]. The crux of the matter is how best to align and represent homology in highly variable regions of sequence data. Most approaches employ a combination of a mathematical algorithm and manual curation of variable regions. Unfortunately, in most instances manual curation is loosely definedwhich makes subsequent studies difficult to replicate-and even though mathematical algorithms allow for studies to be easily replicated, they do not take into account evolutionary history and consider homology exclusively as similarity [3]. In addition, they often fail to accurately align variable regions [11].
The estimation of evolutionary relationships relies heavily on the ability to determine homology between any given set of characters. The wrongful designation of characters as homologous can confound interpretations of evolutionary relationships by creating homoplasies, which results in the loss of phylogenetic signal [12]. Thus, the determination of homology in either morphological character matrices or MSAs (computational or manually curated) has a direct impact on the accuracy and ability to replicate phylogenetic estimates. This has led to differing positions concerning variable regions and their importance in increasing [13,14] or decreasing [15][16][17] phylogenetic signal. These highly variable loop regions become increasingly more difficult to align as sequences become more divergent. This confounds the ability to determine accuracy in MSAs and may result in contradictory estimates of deeper evolutionary relationships. It has been proposed that the large sections created by highly variable loop regions in MSAs should be excluded because of the uncertainty in determining reliable estimates of positional homology [18]. Conversely, it has also been suggested that effects from MSAs should be explored in determining phylogenetic estimates due to their likelihood of containing phylogenetic signal [19,20].
The Doridina suborder currently consists of 19 families and more than 2000 described species. The classification has been divided into five superfamilies (Bathydoridoidea, Doridoidea, Onchidoridoidea, Phyllidioidea and Polyceroidea) defined by morphological variation in gill and feeding structures [21,36,58,59]. Species that possess the ability to retract their gills into fully formed gill pockets, Eudoridoidea (=Cryptobranchia), are divided into two clades based on the presence or absence of radula, Labiostomata (=Doridoidea) and Porostomata (=Phyllidiodea), respectively [36]. Dorid nudibranchs that lack a fully formed gill pocket represent the Anthobranchia (=Phanerobranchia), which is subdivided into the Suctoria (=Onchidoridoidea) and the Non-Suctoria (=Polyceroidea) based on the presence or absence of a buccal pump. Even though morphological [36,59] and molecular [22,23,60,61] analyses have shown the Doridina to be monophyletic, no study has focused on a complete sampling of all currently recognized families. Thus, relationships between families are poorly understood due to conflicting phylogenies and inadequate sampling.
In this study, we use three common MSA methodologies: (i) strict mathematical algorithm; (ii) exclusion of variable or divergent regions and (iii) manually curated. We then discuss issues pertaining to the lack of resolution in phylogenetic estimates and limitations in resolving the base of the Doridina. We also address the possible effects alignment construction may have on phylogenetic informativeness (PI), and how phylogenetic signal may shift given alignment methodology. Lastly, we re-evaluate taxonomic classifications that were consistent between all three analyses.

Taxon sampling
We sampled (121 taxa) representatives from the five superfamilies and all 19 families that currently comprise the Doridina, and mined GenBank for specimens that had at least two sequences. To limit the amount of missing data, we obtained extractions or tissue samples from published specimens to sequence molecular markers that were previously unattained. Pleurobranchaea meckeli and Berthella martensi were chosen as outgroups because the Pleurobranchomorpha has been suggested to be sister to the Nudibranchia [23]. Additional members from the Aeolidina and Dendronotina were also used in testing the monophyly of the Doridina. Specimens, voucher numbers, collection sites and GenBank accession numbers are listed in table 1. Voucher specimens are located in the collections at California Academy of Sciences (CASIZ) and the Museum of National Scientific Center of Marine Biology, Vladivostok (MIMB).

Extraction, PCR protocols and sequencing
Genomic DNA was extracted using the Qiagen DNeasy Blood & Tissue extraction kit (Valencia, CA). Amplification and sequencing for targeted gene fragments 16S, 18S, 28S and cytochrome c oxidase subunit I (COI) followed protocols used by Hallas & Gosliner [62]. Amplified fragments were sequenced at the Center for Comparative Genomics located at the California Academy of Sciences and National Scientific Center of Marine Biology.

Multiple sequence alignments
We assembled and edited sequences in Geneious Pro v. 9.1.7 [63] and BioEdit [64]. We aligned rDNA fragments (16S, 18S and 28S) using three different methodologies to examine conflicts regarding estimated phylogenies. For our first method, we used a computer algorithm E-INS-i in MAFFT (MA) [65]. The E-INS-i algorithm is designed to handle sequence data with several conserved regions embedded in long unalignable gapped regions. Our second method excluded all variable regions from the initial MAFFT alignment of rDNA using Gblocks (GA) [11]. We determined blocks using a less stringent selection by allowing for smaller final blocks, gap positions within the final blocks and less strict flank positions. For our third method, we used the initial MAFFT alignment from method one, but manually curated variable regions (CA). This was done by hand to correct for possible inappropriate alignment of sequence regions. For each alignment method, we concatenated all four targeted molecular markers into single MSAs that resulted in three separate concatenated datasets.

Phylogenetic informativeness
We estimated PI for each molecular marker per alignment [9] through the PhyDesign web interface [66], and estimated each PI profile using the site-rates model HyPhy [67]. First, we generated ultrametric trees by converting our concatenated Bayesian inference (BI) phylogenetic estimates in Mesquite [68]. Owing to the lack of a fossil record for the Doridina, our ultrametric trees are not in known time units. However, it has been shown that PI profiles can be used effectively even if divergence time estimates are absent [69]. Then, we analysed alignments in relation to each of their resulting estimated phylogeny to gain a greater understanding of how PI can change based on the alignment approach.

Phylogenetic analyses
We determined evolutionary models using PartitionFinder v.1.1.1 [70], and partitioned our concatenated datasets by rDNA fragment as well as codon position for COI. We also analysed our nDNA and mtDNA fragments separately to investigate possible conflicting evolutionary histories. We did this by aligning both nDNA and mtDNA datasets using the algorithm E-INS-i in MAFFT, and used the same partitioning scheme as in our concatenated datasets.
We analysed all our datasets using BI and maximum-likelihood (ML). BI searchers were run using MrBayes v. 3.2.1 [71], and convergence was checked in TRACER v.1.5 [72]. The datasets were run for 5 × 10 7 generations with Markov chains sampled every 1000 generations. The standard 25% burn-in was calculated and remaining tree estimates were used to create a 50% majority rule consensus tree

Molecular data
Sequences obtained for phylogenetic analyses and PI profiles are labelled in table 1, and all alignments have been deposited in TreeBASE (http://purl.org/phylo/treebase/phylows/study/TB2:S20396). As expected, our concatenated alignments (MA, GA and CA) varied in length as well as number of parsimony informative characters (table 2). Surprisingly, there were more parsimony informative characters in the MA than the CA. The GA had far fewer parsimony informative characters than the other two alignments generated. Evolutionary model GTR+I+I− was selected by PartitionFinder v. 1.1.1 for each partition based on the Akaike information criterion. Interestingly, specific species of Dendrodoris (D. nigra, D. arborescens, D. guttata and D. fumata) had mtDNA regions that were highly divergent.

Phylogenetic informativeness
The net PI profiles depict the amount of signal through time of each molecular marker in relation to the respective estimated phylogeny (figure 1a-c). Regardless of the alignment, net PI profiles show most phylogenetic signal resides towards the tips of the trees. The amount of overall phylogenetic signal was lowest in our GA estimate, which suggests that information was lost when variable regions were excluded. Lastly, our MA appeared to have slightly more PI than our CA.

Phylogeny
BI posterior probability estimates generally resulted in higher overall support of relationships than ML non-parametric bootstrap values. Branch lengths for basal nodes were extremely short ( figure 2a-c), especially in regions of the tree that resulted in low resolution for relationships between superfamilies (figure 3a-c). There were, however, some general similarities between topologies of both phylogenetic searches and among alignments. The Doridina suborder was recovered as monophyletic with high support in all analyses estimates (CA: pp = 1.00, bs = 91; GA: pp = 1.00, bs = 86; MA: pp = 1.00, bs = 86; figures 4-6). GA was the only analysis that recovered the Aegiridae sister to all other members of the Doridina, but there was no support. Both MA and CA recovered Prodoris sister to the rest of the Doridina.
Onchidoridoidea and Phyllidioidea were the only two superfamilies recovered as monophyletic in any of the three analyses ( figure 3a-c). Onchidoridoidea was recovered as monophyletic only in the CA (pp = 0.99, bs = 46). Phyllidioidea was monophyletic with little to no support in CA (pp = 0.94; bs = 47) and MA (pp = 0.75, bs = 34) analyses. Most families, however, were recovered as monophyletic with high pp and bs support across all three alignments analysed. The Phyllidiidae and Calycidorididae were the only two families that were not supported in all three analyses. In addition, Chromodorididae, Dendrodorididae, Dorididae and Polyceridae were consistently recovered as not monophyletic.  There were only two instances where long branches were evident in the Doridina; Vayssierea sp. and a clade formed by Dendrodoris arborescens, D. fumata, D. guttata and D. nigra. Analysis of nDNA (electronic supplementary material, figure S1) and mtDNA data (electronic supplementary material, figure S2) suggests that the long branches may be an artefact of mtDNA data. The nDNA tree depicts no significantly long branches within the Doridina and recovers a monophyletic Dendrodoris clade, which contradicts our mtDNA tree that recovered a polyphyletic Dendrodoris.
The large spikes observed in rDNA fragments for our MA and CA are probably a result of highly variable regions or ambiguous sequence calls, which ML is poor at estimating [66], thus overestimating PI towards the tips of their respective tree. Furthermore, PI profiles suggest that information was lost when these regions were excluded from our analyses, and resulted in surprisingly high loss of parsimony informative characters (25%-55%) for rDNA markers. The removal of these variable regions has been shown to negatively affect phylogenetic estimates [9] and may explain why some relationships were not consistently recovered. This supports the position that highly variable loop regions can be vital in resolving some phylogenetic relationships [8]. Unfortunately, we were unable to increase the resolution of the dorid tree by any of our three MSA construction methods. All three PI profiles were fairly consistent in depicting similar curves, and as we have been able to show, these markers are more appropriate for phylogenetic estimates at family-or higher-level classifications. Another issue we encountered was the potential noise that was incorporated into our estimates by additional taxon sampling. Even though PI profiles suggest these markers were informative at family level, our increased taxonomic sampling may have hindered our ability to recover consistency across analyses. For example, we only recovered a monophyletic Onchidoridoidea in our CA phylogenetic        [62] recovered a mostly resolved monophyletic Onchidoridoidea with significant pp and bs support. Their taxonomic sampling, however, was much more focused and included histone 3 as an additional molecular marker. These contradictory estimates illustrate issues that can result from inappropriate taxonomic sampling [93] and noise incorporated into analyses with inclusion of highly divergent taxa.
In a few instances, however, our expansive taxonomic sampling has illuminated the relationships of some problematic groups, specifically Aphelodoris, Cadlinidae, Cadlinella, Hexabranchidae and Polyceridae, but in relation to the Onchidoridoidea our estimates contradicted previous highly supported hypotheses. Even though we were able to include Onchimira cavifera, it is relevant to state that we were unable to procure other morphologically unique species that may have affected our ability to resolve some family relationships in the Doridina (e.g. Colga, Goslineria, Hoplodoris, Kalinga, Murphydoris, Otinodoris). These findings illustrate that each phylogenetic query has its own set of challenges and optimal sampling strategy [94], and that the focus for each investigation should be carefully calculated.

Doridina relationships
We were not able to confidently investigate patterns of biogeographical, morphological or chemical evolution due to the lack of resolution at the base of our phylogenetic estimates. The present study, however, offers some consistent new insights into Doridina relationships, in part due to our increased taxonomic sampling.
This work reinforces the conclusion from previous studies that traditional phanerobranch and cryptobranch groupings are not monophyletic [36,59,62]. Even though there was only moderate support in our GA, Cadlinidae does appear to be closely related to at least some members of the Porostomata, despite the ambiguous position of some porostomes such as Dendrodoris and Mandelia. In addition, we consistently recovered Gymnodorididae and Okadaiidae nested within the Polyceridae, which together are closely related to the Chromodorididae and Hexabranchidae. Unexpectedly, Cadlinella was recovered sister to the Hexabranchidae in our phylogenetic estimates. Cadlinella was originally included into the Chromodorididae based on morphological similarities [95] and further supported by molecular studies [84,93]. Our broader taxon sampling, however, consistently recovered Cadlinella sister to Hexabranchus and that both of these taxa, together with the Polyceridae, are closely related to the Chromodorididae. This also is supported by the sperm ultrastructure of Cadlinella, which has been shown to be divergent from members of the Chromodorididae [96].
In addition, the yellowish northeastern Pacific species of Doriopsilla were thought to represent a species complex of closely related taxa, but this assumption was not tested by including any species from outside the complex, other than the outgroup taxon D. spauldingi [97]. In our analysis, we included D. miniata from Japan and D. janaina, another eastern Pacific species that has divergent colouration. In all three of our analyses, the 'species complex' suggested by Hoover et al. [97] includes members of two separate lineages, rather than a single radiation. This suggests that these species with yellowish colouration and white spots evolved similar colouration convergently rather than by means of radiation from a single common ancestor.
Lastly, the evolution of the gill pocket is further confounded by the recovery of Onchimira cavifera nested within the Onchidorididae. Onchimira cavifera was described as having both cryptobranch and phanerobranch characteristics, and hypothesized as a missing link in the current understanding of gill reduction [37,62,98,99]. Onchimira possess all the characteristics of a phanerobranch: buccal pump, rectangular rachidian tooth and hooked shaped first lateral tooth, but also possesses a fully formed gill pocket and retractable gill, which is typical of cryptobranch dorids. Surprisingly, Onchimira is not closely related to the only other two members of the Onchidorididae that possess similar gill structure to the Cryptobranchia, Calycidoris and Diaphorodoris [62], but instead nested within the Onchidorididae. Based on our estimates, it is unclear how or under what conditions the gill pocket might have evolved or was lost throughout the Doridina because of the lack of resolution at the base of the tree.

Molecular evolution
It is unclear why there are such large inconsistencies between mtDNA and nDNA phylogenies regarding Dendrodoris. To confirm if there was sequencing error, we examined additional specimens of D. fumata and D. nigra to compare to those on GenBank [61,100]. Surprisingly, all sequences collected were identical. Our inclusion of D. atromaculata and D. denisoni, however, suggested that there are possible highly divergent regions among mtDNA sequences. We were unable to compare other species of Dendrodoris from Hirose et al. [100] because they only analysed COI. A complete sampling of Dendrodoris is needed to fully comprehend the discrepancies between mtDNA and nDNA sequences. Furthermore, there appears to be no molecular distinction between Aegires citrinus and A. serenae. Both species are clearly defined by morphological characteristics [101], but both the mtDNA and the nDNA suggest they are in fact the same species. Much like in the Dendrodorididae, it is unclear what molecular mechanisms might have influenced our observations. Further investigations are needed, but are beyond the scope of our data.

Conclusion
We decided to take an approach that used three common methods used in MSA construction for Nudibranchia phylogenetics. As expected, our findings suggested that MSA methodology affected phylogenetic estimates of the Doridina, especially regarding how we decided to align highly variable rDNA regions. We were able to show that the most commonly sequenced molecular markers for the Nudibranchia lacked the robustness to resolve the base of the dorid tree, and manipulation of highly variable regions affected our ability to recover consistent phylogenetic estimates. This effect, however, is most probably dependent upon the size and scope of the phylogenetic query and amount of missing data. These markers are better suited for higher-level classifications as suggested by our PI profiles. Even though the base of the Doridina was unresolved, family-level classifications were mostly supported across our three analyses, and families that were recovered as non-monophyletic were consistent between alignments. Our analyses suggest that the exclusion of variable regions may have weakened our ability to resolve the base of the Doridina, but previous studies that used much larger datasets have benefited from removing these regions (e.g. [102,103]).
Even though the focus of the present study was to understand MSA construction, our estimates of the Doridina also give a frame of reference for allowing more intensive queries into specific family evolutions. For example, the evolution of caryophyllidia in the Discodorididae, molecular evolution in Aegiridae and Dendrodorididae, or the relationships pertaining to the Polyceridae, Okakaiidae, Gymnodorididae, Hexabranchidae and Chromodorididae, which are some of the most morphologically unique and chemically distinct families.
Nudibranch studies unquestionably suffer from a lack of abundant and diverse molecular markers. Studies have argued that increasing molecular markers could resolve problematic relationships [94,104], but an increase in molecular markers does not resolve issues regarding homology and variable region alignments. Automated filtering protocols allow for MSAs to be easily replicated and eliminate the uncertainty of manual curation of alignments; however, these methods are not without error. In addition, there has been little consensus on the soundest method of increasing signal to resolve phylogenies [6]. Genomic tools, which only have recently been used to investigate nudibranch [105] and larger opisthobranch phylogenetics [106,107], have potential of resolving dorid relationships. However, genomic applications also suffer from alignment and homology issues [108,109]. Phylogenetic resolution of the Doridina can greatly benefit from a genomic approach, but it is important to emphasize the critical role MSAs and homology have on phylogenetic studies. Owing to the varying size and scope of molecular and taxonomic sampling, we strongly recommend the exploration of multiple MSA construction methods that can aid in the selection of an approach that best suits the data.