Cementing mussels to oysters in the pteriomorphian tree: a phylogenomic approach

Mussels (Mytilida) are a group of bivalves with ancient origins and some of the most important commercial shellfish worldwide. Mytilida consists of approximately 400 species found in various littoral and deep-sea environments, and are part of the higher clade Pteriomorphia, but their exact position within the group has been unstable. The multiple adaptive radiations that occurred within Pteriomorphia have rendered phylogenetic classifications difficult and uncertainty remains regarding the relationships among most families. To address this phylogenetic uncertainty, novel transcriptomic data were generated to include all five orders of Pteriomorphia. Our results, derived from complex analyses of large datasets from 41 transcriptomes and evaluating possible pitfalls affecting phylogenetic reconstruction (matrix occupancy, heterogeneity, evolutionary rates, evolutionary models), consistently recover a well-supported phylogeny of Pteriomorphia, with the only exception of the most complete but smallest data matrix (Matrix 3: 51 genes, 90% gene occupancy). Maximum-likelihood and Bayesian mixture model analyses retrieve strong support for: (i) the monophyly of Pteriomorphia, (ii) Mytilida as a sister group to Ostreida, and (iii) Arcida as sister group to all other pteriomorphians. The basal position of Arcida is congruent with its shell microstructure (solely composed of aragonitic crystals), whereas Mytilida and Ostreida display a combination of a calcitic outer layer with an aragonitic inner layer composed of nacre tablets, the latter being secondarily lost in Ostreoidea.

unigene with a custom Python script, thus removing the variation in the coding regions of Trinity assemblies due to alternative splicing, closely related paralogs, and allelic diversity.
Filtered peptide sequences with all final candidate ORFs were retained as multifasta files.
Asterisks, indicating stop codons, as well as blank lines between individual contigs, were deleted from each file.
(c) Orthology assignment and matrix construction Orthology assignment for the data set assemblies was performed using stand-alone OMA v0.99z.2 [6,7]. The parameters.drw file specified retained all default settings with the exception of "MaxTimePerLevel," which was set at 3600. The all-by-all local alignment process was parallelized across 128 CPUs once all the input pre-processing steps were achieved on a single core (to avoid risk of collision).
The orthogroup selection based on minimum taxon occupancy was executed using a custom Python script. Selected orthogroups for each matrix were aligned individually using MUSCLE version 3.6 [8]. Divergently aligned positions were culled by a probabilistic character masking approach with ZORRO [9], using default parameters and FastTree 2.1.4 [10] to construct guide trees. In all of the alignments, positions that were assigned a confidence score below the threshold of 5 by ZORRO were discarded, using a custom Python script. Trimmed orthogroups were then concatenated using Phyutility 2.6 [11].
In order to assess the effects of rate of molecular evolution and heterotachy on tree topology ten additional matrices were constructed by selecting subsets of Matrix 2 based on evolutionary rate, for which percent pairwise identity was employed as a proxy.
Accumulated conservation values were generated for each locus using Trimal 1.2b (-sct flag). Values were then normalized to account for length. Loci were sorted; the first being the slowest evolving genes (most conserved) and the last being the fastest evolving genes GBlocks v. 0.91b [15]. This ensured that regions with large indels did not affect the calculation of percent pairwise identity.
The package BaCoCa v.1.1r was used to estimate relative composition frequency variability (RCFV) in Matrix 2. RCFV is a measure of the absolute deviation from the mean for each amino acid and for each taxon summed up over all amino acids and all taxa [16].
The higher the RCFV value, the higher the degree of compositional heterogeneity present in that partition. RCFV values were plotted in a heatmap using the R package gplots.
Bootstrap resampling was conducted for 100 independent replicates using RAxML v. 8.2 [18], specifying a model of protein evolution with corrections for a discrete gamma distribution using the WAG model, and were thereafter mapped onto the optimal tree from the independent ExaML searches.
The initial three matrices (Matrices 1, 2 and 3) were analyzed using Bayesian inference with ExaBayes version 1.21 with openmpi version 1.64. ExaBayes uses a sampling approach similar to the one implemented in MrBayes [19], but it is better adapted for large data sets by its ability to parallelize each independent run, each chain, and the data (i.e., unique site patterns of the alignment). ExaBayes implements a Markov chain Monte Carlo (MCMC) sampling approach. We used the amino acid model prior (aaPR), a discrete model prior, which mixes a combination of 18 models of evolution. Four independent Markov chain Monte Carlo chains (MCMC) were run for 1,000,000 generations, sampling every 500 th generation. The first 2,000 trees (25%) were discarded as burn-in for each MCMC run prior to convergence (i.e., when maximum discrepancies across chains < 0.1). Convergence of parameters between runs was estimated using the command postProcParam available in exabayes software. Most parameters showed convergence except for the analysis conducted on Matrix 1. Trees from all runs were combined after burn-in to obtain the consensus majority rule tree topology and node marginal probability (i.e. posterior credibility values of node or confidence support) with the command consense implemented in ExaBayes.
To test for putative gene incongruence, we inferred individual gene trees for each orthogroup included in each of the three initial matrices (Matrices 1, 2 and 3) using RAxML 7.7.5 [20]. PROTGAMMALG4X was selected as the best model of aa substitution. All individual best-scoring trees were concatenated for each matrix and fed into SuperQ v1.1 [21] in order to visualize inter-gene conflicts. SuperQ decomposes all gene trees into quartets to infer a super-network where edge lengths are assigned based on quartet frequencies; it was run using the 'balanced' edge-weight optimization function with no filter. The resulting super-networks were visualized with SplitsTree v.4.13.1 [22].