Deep gastropod relationships resolved

Gastropod mollusks are arguably the most diverse and abundant animals in the oceans, and are successful colonizers of terrestrial and freshwater environments. Here we resolve deep relationships between the five major gastropod lineages - Caenogastropoda, Heterobranchia, Neritimorpha, Patellogastropoda and Vetigastropoda - with highly congruent and supported phylogenomic analyses. We expand taxon sampling for underrepresented lineages with new transcriptomes, and conduct analyses accounting for the most pervasive sources of systematic errors in large datasets, namely compositional heterogeneity, site heterogeneity, heterotachy, variation in evolutionary rates among genes, matrix completeness and gene tree conflict. We find that vetigastropods and patellogastropods are sister taxa, and that neritimorphs are the sister group to caenogastropods and heterobranchs. With this topology, we reject the traditional Archaeogastropoda, which united neritimorphs, vetigastropods and patellogastropods, and is still used in the organization of collections of many natural history museums. Several traits related to development and life history support our molecular results. Importantly, the time of differentiation of the embryonic 4d cell (mesentoblast, responsible for mesoderm formation), differs between the two major clades, highlighting the degree of conservation and significance of development in the evolution of gastropods, as it is also known for spiralian animals more broadly.


Introduction
Heterogeneity in the stationary frequency of amino acids among samples is one such issue that can artificially 48 group taxa that are actually not closely related based on convergent amino acid composition [13]. Within-site 49 rate variation through time (heterotachy) is another likely violation [14]. Some genes with slow rates of 50 evolution (e.g., ribosomal protein genes) have also been shown to bias phylogenetic inference [15,16], while 51 genes with fast rates and high levels of saturation can cause long-branch attraction [17]. An additional 52 model violation comes from gene tree discordance, not accounted for by concatenation methods, that can 53 be caused by incomplete lineage sorting and be particularly relevant in areas of the tree with short internal 54 branches [18][19][20], such as the radiation of crown gastropods at the Ordovician [12,21]. More commonly 55 considered issues include rate heterogeneity between sites and missing data. 56 Our goal was to resolve between the three hypotheses for the early divergences of gastropods. We present an 57 extended sampling of Neritimorpha and Patellogastropoda by producing new transcriptomes, and complement 58 the dataset with the latest published gastropod transcriptomes and with increased representation for the 59 closest outgroups -bivalves, scaphopods and cephalopods. We build a series of matrices and employ a variety 60 of methods and models to account for the most common and relevant potential sources of systematic error in 61 large datasets, namely compositional heterogeneity, site heterogeneity, heterotachy, variation in evolutionary 62 rates among genes, matrix completeness and gene tree conflict. LG4M,LG4X,LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60 -mrate G,R,E). The search for the 118 models LG+C60 (Matrices 1 and 3) and LG+C50 (Matrix 1) required more memory than available, and 119 these models were disregarded for the respective matrices. PhyloBayes was run with the CAT-GTR model 120 on a subset of the concatenated alignments (matrices 1, 2 and 4), discarding constant sites to speed up 121 computation. Our main goal was to resolve the deep nodes of the gastropod tree and distinguish between three hypotheses of 125 the relationships among its five main lineages. All but one of our inference methods and matrices congruently 126 support a clade uniting Vetigastropoda and Patellogastropoda, and Neritimorpha as the sister group to 127 Apogastropoda ( Figure 2). The only exception is the coalescent-based analysis on the smallest dataset of 149 128 genes (Astral, Matrix 4), in which these two key nodes were left unresolved (all tree files are available in the 129 Supplementary Material). Accordingly, the few analyses with lower support on these nodes also refer to the 130 smaller Matrix 4, which is unsurprising given that it comprises fewer informative sites in concatenated analyses, 131 and less genes in the coalescent-based analysis [48]. In summary, the resulting topology is congruent based 132 on an array of analyses testing for the major common sources of systematic error in phylogenomic datasets, 133 including gene tree discordance, compositional heterogeneity, heterotachy, site heterogeneity, variation in 134 evolutionary rates, and missing data.

135
This exact topology for gastropod relationships has been previously recovered by a few molecular [12, 49] and total evidence [6] analyses, with numerous alternatives proposed in the literature [e.g. 5, 6, 10, 137 49-51], even within the same studies. With 17 analyses (combinations of four subsampled matrices, two 138 data types -amino acids and Dayhoff recoding, and four inference methods/models), for the first time 139 we find strong congruence and support for the tree presented in Figure 2. With that we reject the clade  reproductive behaviors and anatomy [57]. For these latter groups, eggs are usually encapsulated, with either 156 direct development or the release of a feeding veliger larva in the water, while embryos of vetigastropods 157 and patellogastropods first develop into a non-feeding trochophore larva in the plankton [58]. In addition, 158 neritimorphs and apogastropods have invaded freshwater and terrestrial environments several times. All of 159 these traits are likely connected, but order of appearance and causal relations remain to be investigated.

160
Important questions that remain regarding major gastropod relationships include the position of 161 Cocculiniformia, Neomphalina and the hot-vent taxa, smaller deep sea clades that have been considered 162 somehow related to vetigatropods, neritimorphs, patellogastropods, or as independent branches in the 163 gastropod tree. They are yet to be sampled in a phylogenomic analyses.

164
Regarding overall mollusk relationships, we recover a well supported clade of gastropods, bivalves and 165 scaphopods in all analyses; however, as in previous phylogenomic efforts [59,60], relationships between these 166 three groups are unstable ( Figure 2). The Dayhoff datasets and most of the ML analyses with the profile mixture model result in a clade of gastropods and scaphopods; while most coalescent-based trees recover a 168 clade of bivalves and scaphopods; and finally, the ML partitioned analyses produce a clade of gastropods and 169 bivalves. Perhaps a way ahead to resolve such hard nodes will be to use other types of data, such as genomic 170 rearrangements and presence/absence of genes from complete genomes.

172
While PhyloBayes converged on Dayhoff-recoded datasets, analyses on the more complex amino acid matrices 173 did not converge for all parameters. The problem was especially pronounced for the large matrices (a summary 174 table for all analyses is given in the Supplementary Material). We observed that convergence issues were 175 mostly due to small differences between chains regarding the position of one or few derived terminals within 176 the outgroups or within apogastropods, whose relationships were not the goal of this study. We suspect this 177 may be caused by a problem in topology proposals for these derived nodes, leading some of the chains to 178 get stuck in local maxima. One example comes from the Dayhoff analysis of Matrix 1: the initial two chains 179 seemed to be very far from topological convergence (maxdiff=1) even after more than 20,000 generations.  Neritimorphs had mostly congruent phylogenies recovered from 28S rRNA data [69] and mitogenomes [70].

Figure1:
Matrices and phylogenetic methods used to infer gastropod relationships. With 50% taxon occupancy, Matrix 1 is the largest, with 1059 genes. Matrix 4 is the subset of the best sampled 149 genes, with 70% taxon occupancy. Genes and species are sorted with the best sampling on the upper left. Matrix 2 is the subset of 635 genes after ordering all genes by evolutionary rate and removing the 20% slowest and 20% fastest evolving genes. Matrix 3 includes the 962 genes that are homogeneous in amino acid composition; genes are ordered by p-value of the homogeneity test. Black cells indicate genes present for each species. Check Methods for details.