Proceedings of the Royal Society B: Biological Sciences
Research article

Untangling the early diversification of eukaryotes: a phylogenomic study of the evolutionary origins of Centrohelida, Haptophyta and Cryptista

Fabien Burki
Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada

Maia Kaplan
Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada

Denis V. Tikhonenkov
Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada
Institute for Biology of Inland Waters, Russian Academy of Sciences, Borok, Russia

Vasily Zlatogursky
Department of Invertebrate Zoology, St Petersburg State University, St Petersburg, Russia

Bui Quang Minh
Center for Integrative Bioinformatics, Max F. Perutz Laboratories, University of Vienna, Medical University of Vienna, Vienna, Austria

Liudmila V. Radaykina
Institute for Biology of Inland Waters, Russian Academy of Sciences, Borok, Russia

Alexey Smirnov
Department of Invertebrate Zoology, St Petersburg State University, St Petersburg, Russia

Alexander P. Mylnikov
Institute for Biology of Inland Waters, Russian Academy of Sciences, Borok, Russia

Patrick J. Keeling
Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada
Canadian Institute for Advanced Research, Integrated Microbial Biodiversity Program, Toronto, Ontario, Canada

    Abstract

    Assembling the global eukaryotic tree of life has long been a major effort of Biology. In recent years, pushed by the new availability of genome-scale data for microbial eukaryotes, it has become possible to revisit many evolutionary enigmas. However, some of the most ancient nodes, which are essential for inferring a stable tree, have remained highly controversial. Among other reasons, the lack of adequate genomic datasets for key taxa has prevented the robust reconstruction of early diversification events. In this context, the centrohelid heliozoans are particularly relevant for reconstructing the tree of eukaryotes because they represent one of the last substantial groups that was missing large and diverse genomic data. Here, we filled this gap by sequencing high-quality transcriptomes for four centrohelid lineages, each corresponding to a different family. Combining these new data with a broad eukaryotic sampling, we produced a gene-rich taxon-rich phylogenomic dataset that enabled us to refine the structure of the tree. Specifically, we show that (i) centrohelids relate to haptophytes, confirming Haptista; (ii) Haptista relates to SAR; (iii) Cryptista share strong affinity with Archaeplastida; and (iv) Haptista + SAR is sister to Cryptista + Archaeplastida. The implications of this topology are discussed in the broader context of plastid evolution.

    1. Introduction

    Reconstructing the tree of life is a challenging task, because the long evolutionary history since the origin of life has often confounded the phylogenetic signal that can be recovered today. Nevertheless, molecular-based phylogenies have made possible profound rearrangements in the tree, most recently using phylogenomics (i.e. the use of genomic-scale datasets with stronger phylogenetic power) [1]. Accordingly, the global tree of eukaryotes has been reshuffled once again, leading to a better understanding of the relationships between the largest assemblages, or supergroups, and the origins of some ‘orphan’ lineages [2]. However, contentious nodes between supergroups remain, as well as a few lingering ‘orphans’. Resolving the positions of these orphans is necessary for understanding their evolution, but also impacts the tree as a whole, because poorly sampled ‘orphan’ groups may lead to instability in the tree.

    One such group lacking proper genomic data is Centrohelida, a monophyletic group of free-living predatory protists mainly found in freshwater and soil habitats, but also increasingly recognized to occur widely in marine environments [3]. With about 90 described species and a vast diversity of environmental sequences [4], centrohelids have traditionally constituted the core of the original phylum Heliozoa, which included a subset of microbial eukaryotes characterized by a special type of pseudopodia, the axopodia. Heliozoa was shown to be a polyphyletic assemblage, and today several relatively minor lineages are scattered across the tree [5].

    Centrohelids, however, have remained one of the last substantially diverse groups of eukaryotes that has eluded phylogenetic placement in the tree of life. Different analyses of the 18S rRNA and a small number of protein-coding genes (actin, α-tubulin, β-tubulin, EF2, HSP70, HSP90) led to placement in various regions of the tree, but never with good statistical support. For example, centrohelids were weakly inferred to branch close to members of the Archaeplastida, specifically glaucophytes [6] or red algae [7]. Other studies showed the centrohelids to share affinities with haptophytes [4,8], or were inconclusive [9]. Even a larger-scale multigene analysis involving 127 genes was unsuccessful at the task, placing centrohelids with low confidence as sister to either haptophytes or the enigmatic telonemids [10]. More recently, the partial transcriptome sequencing for the tiny centrohelid Oxnerella marina was included in a 187-gene dataset, which resulted in a less ambiguous monophyletic grouping with haptophytes [11], reinforcing the phylum Haptista originally proposed on weaker evidence [4,12].

    Beyond their intrinsic interest as a large group of eukaryotes with unknown evolutionary origin, centrohelids also hold some of the clues to better understand a larger and a priori unrelated evolutionary mystery. Owing to their possible link to haptophytes [11], centrohelids may help to shed light on one of the most puzzling aspects of plastid evolution: the origin and evolution of complex red plastids [13,14]. Centrohelids are heterotrophs, and no permanent plastid has ever been observed [15], although kleptoplasty has been reported [16]. Haptophytes, on the other hand, are phototrophs and possess complex plastids derived from an endosymbiotic event with a red alga [17]. Haptophytes represent one of four lineages harbouring such plastids, the others being ochrophytes (photosynthetic stramenopiles), myzozoans (alveolates with plastids: apicomplexans, dinoflagellates and chrompodellids) and cryptophytes (belonging to Cryptista, which also includes goniomonads, katablepharids and Palpitia). Whereas the origins of stramenopiles and alveolates are better understood [18,19], haptophytes and Cryptista have notoriously remained challenging to place in the tree. They are sometimes grouped together, along with telonemids and centrohelids [10,11,20–23], which resulted in the establishment of Hacrobia [24]. However, haptophytes and Cryptista have also been shown to have polyphyletic origins in several recent multigene analyses [25–27]. Thus, untangling the controversial phylogenetic positions of these two groups, along with their closely related plastid-lacking lineages such as centrohelids, is a much-needed step to better explain the observed distribution of red plastids in the eukaryotic tree.

    In this study, we used a phylogenomic approach including a broad sampling of diversity to investigate the deep evolutionary relationships among eukaryotes, with particular focus on centrohelids, haptophytes and Cryptista. For that purpose, we filled an important gap in genome datasets by sequencing high-quality transcriptomes for four centrohelid species, and combined those with recent transcriptomes for a very large diversity of marine microbial eukaryotes (the MMETSP initiative [28]). Cultures for four species were established, each representing a different centrohelid family, namely Raphidiophrys heterophryoidea (Raphidiophryidae), Raineriophrys erinaceoides (Pterocystidae), as well as two undescribed species: Acanthocystis sp. (Acanthocystidae) and Choanocystis sp. (Choanocystidae). Our analyses unambiguously confirm that centrohelids share a common origin with haptophytes. More generally, we present compelling evidence for the phylogenetic position of the centrohelid–haptophyte group and Cryptista, altogether bringing us one step closer to a fully resolved eukaryotic tree of life.

    2. Methods

    Details of experimental procedure for culturing, molecular work, sequencing, assembling and gene preparation are described in the electronic supplementary material.

    (a) Phylogenomic datasets construction

    Following the preparation of 263 genes for phylogenomic analysis (see electronic supplementary material), all taxa were listed with SCaFoS [29], which amounted to 274 taxa. This list was first reduced to 234 taxa after removing all taxa with more than or equal to 20% missing genes. A 234-taxa, 263-gene (234/263) supermatrix was then constructed to infer an initial maximum-likelihood (ML) tree with IQ-TREE v. 1.3.0 [30] under the LG + Γ model. Based on this initial tree, a phylogeny-driven taxon selection approach was applied to reduce further the number of taxa by retaining only representative sequences within strongly supported monophyletic groups (100% bootstrap support), discarding the longest branches and/or least complete sequences. Chimeric concatenated sequences were also allowed by pooling highly incomplete taxa of the same genus (see electronic supplementary material, table S1 for details). This approach led to a final taxon sampling composed of 150 operational taxonomic units (OTUs). Because removal of ambiguously aligned sites is directly influenced by the proportion of gaps, we then re-extracted from the unaligned and untrimmed fasta files the 150 OTUs corresponding to our final selection, which were re-aligned with MAFFT-LINSI v. 7 and trimmed with BMGE v. 1.1 [31] using conservative settings (removal of sites with more than 20% gaps, minimum block size of 8, substitution matrix BLOSUM 75). Finally, from our starting dataset of 263 seed genes, only 250 were retained to enter the final concatenated alignment (55 554 aa positions), which corresponded to genes with less than 50% missing OTUs. See electronic supplementary material, table S1 for details about missing data, and electronic supplementary material, table S2 for complete gene names. These 250 genes, containing up to 150 OTUs, were concatenated into a supermatrix (150/250) with SCaFoS [29].
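
    The two completeness filters described above (dropping taxa that miss too many genes, then keeping only genes present in enough OTUs) can be sketched as follows. This is a minimal illustration, not the SCaFoS workflow itself: the presence/absence table, the function names `filter_taxa` and `filter_genes`, and the toy data are hypothetical; only the two thresholds are taken from the text.

```python
def filter_taxa(presence, max_missing):
    """Keep taxa whose fraction of missing genes is strictly below max_missing.

    presence maps a taxon name to the set of gene names found for that taxon.
    """
    all_genes = set().union(*presence.values())
    kept = {}
    for taxon, genes in presence.items():
        missing = (len(all_genes) - len(genes)) / len(all_genes)
        if missing < max_missing:
            kept[taxon] = genes
    return kept


def filter_genes(presence, max_missing):
    """Return the genes that are absent from fewer than max_missing of the taxa."""
    all_genes = set().union(*presence.values())
    n_taxa = len(presence)
    kept = []
    for gene in sorted(all_genes):
        absent = sum(1 for genes in presence.values() if gene not in genes)
        if absent / n_taxa < max_missing:
            kept.append(gene)
    return kept


# Toy table (hypothetical): four taxa, ten genes named g1..g10.
def first_genes(n):
    return {f"g{i}" for i in range(1, n + 1)}

toy = {
    "taxonA": first_genes(10),  # 0% missing
    "taxonB": first_genes(9),   # 10% missing
    "taxonC": first_genes(8),   # 20% missing, removed by the >= 20% rule
    "taxonD": first_genes(3),   # 70% missing
}

taxa = filter_taxa(toy, max_missing=0.20)     # remove taxa with >= 20% missing genes
genes = filter_genes(taxa, max_missing=0.50)  # keep genes with < 50% missing OTUs
```

    In this toy case only taxonA and taxonB survive the first filter, and g10 is then dropped because it is absent from half of the remaining taxa.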

    From the full 150/250 dataset, two reduced datasets were considered. First, Telonema subtilis and Picomonas sp. were removed (see Results and Discussion for the justification), leading to the 148/250 dataset. This dataset was reduced further by eliminating the 19 047 fastest-evolving positions corresponding to bin10, according to the tree-independent method described in [32]; this dataset was named 148/250-slow.
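
    The idea of binning positions by evolutionary rate and discarding the fastest bin can be sketched as below. This is a deliberately simplified stand-in: the rate proxy used here (number of distinct residues in a column) is an assumption for illustration and is not the tree-independent similarity measure of [32]; the function name `drop_fastest_bin` and the toy columns are hypothetical.

```python
def drop_fastest_bin(columns, n_bins=10):
    """Return indices of alignment columns that are NOT in the fastest rate bin.

    columns is a list of strings, one per alignment position, each holding the
    residues observed at that position across taxa ('-' or '?' = missing).
    The rate proxy (count of distinct residues) is a crude assumption, not the
    method of [32].
    """
    def proxy(col):
        return len(set(col) - {"-", "?"})

    scores = [proxy(c) for c in columns]
    lo, hi = min(scores), max(scores)
    if lo == hi:  # no rate variation: nothing to remove
        return list(range(len(columns)))
    width = (hi - lo) / n_bins
    kept = []
    for i, s in enumerate(scores):
        # Bins are numbered 1..n_bins; the maximum score falls in bin n_bins.
        b = min(int((s - lo) / width) + 1, n_bins)
        if b < n_bins:
            kept.append(i)
    return kept


# Toy alignment columns (hypothetical), four taxa each.
cols = ["AAAA", "AAAG", "ACGT", "AA-A", "ACCA"]
kept_positions = drop_fastest_bin(cols)
```

    Here only the third column (four distinct residues) lands in the last bin and is removed. For scale, removing the 19 047 bin10 positions from the 55 554-position supermatrix leaves 36 507 positions in the 148/250-slow dataset.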

    (b) Phylogenetic analyses

    Our supermatrices were analysed by ML and Bayesian tree reconstruction methods. ML analyses were performed with IQ-TREE v. 1.3.0–1.3.10 [30]. Gene-partitioned and unpartitioned alignments were analysed; in all cases, the model that best fits the data was determined by IQ-TREE according to the Bayesian information criterion (BIC). The partitioned analysis was applied to the 150/250 and 148/250 datasets, where the best-fit model was chosen according to a greedy strategy that sequentially merges genes from the fully partitioned alignment (250 partitions) until the model fit stops improving. We opted for the new model selection procedure (-m TESTNEW), which additionally implements the FreeRate heterogeneity model, in which site rates are inferred directly from the data instead of being drawn from a gamma distribution [33]. Owing to the large size of the partition schemes, only the top 20% was checked using the relaxed clustering algorithm (-rcluster 20), as described in [34]. For both datasets, the best-fit partitioning scheme contained the original 250 partitions, i.e. no merging was deemed necessary. This partitioning scheme was then used to specify a model for each partition, allowing each gene to have its own rate (-spp). For the unpartitioned analyses of both 150/250 and 148/250 supermatrices, the best-fit model corresponded to the LG matrix with relative rates estimated from the data using the non-parametric FreeRate model with 10 categories and empirical amino acid frequencies (LG + R10 + F). The best-fit model for the unpartitioned analysis of the 148/250-slow dataset was LG + R6 + F. A more complex empirical mixture model not evaluated by the selection strategy in IQ-TREE was also tested on all datasets: following the recommendation in [35], the LG matrix was combined with an amino acid class frequency mixture model with 60 frequency component profiles plus a class of empirical amino acid frequencies of the alignment, and four gamma categories to take into account the across-site rate heterogeneity (LG + C60 + F). To assess branch support, all IQ-TREE analyses used the ultrafast bootstrap approximation (UFboot) with 1000 replicates [36] and the SH-like approximate likelihood ratio test (SH-aLRT), also with 1000 replicates [37].
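
    Because model choice in the ML analyses rests on the BIC, a short sketch of the criterion may help. It uses the standard definition BIC = −2 lnL + k ln(n), where k is the number of free parameters and n the number of alignment sites; the candidate names, log-likelihoods and parameter counts below are hypothetical values for illustration, not results from this study.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion; lower values indicate a better fit."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


def best_by_bic(candidates, n_sites):
    """candidates maps model name -> (log-likelihood, number of free parameters)."""
    scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical values for illustration only; NOT results from the study.
candidates = {
    "model_A": (-1000.0, 10),
    "model_B": (-960.0, 15),
    "model_C": (-955.0, 30),
}
best, scores = best_by_bic(candidates, n_sites=500)
```

    In this toy comparison model_C has the highest likelihood, but its extra parameters are penalized and model_B obtains the lowest BIC.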

    Bayesian analyses were performed with PhyloBayes MPI v. 1.5a [38], under a site-heterogeneous mixture model combining infinite profile mixtures and exchange rates inferred from the data with the rates across sites drawn from a discrete gamma distribution (CAT + GTR + Γ4). Constant sites were removed to decrease computational time (-dc). Three independent Markov chain Monte Carlo (MCMC) chains were run, for at least 3000 generations but up to 7000 for the smaller 148/250-slow dataset. The burnin period was determined after plotting the evolution of the log-likelihood (Lnl) across the iterations, removing the generations anterior to the stabilization of the Lnl. Convergence between the chains was assessed by examining the difference in frequency between all bipartitions (maxdiff). Owing to the large size of our taxon sampling, convergence was generally not globally achieved (maxdiff ≥ 0.46), an issue that has been reported in other taxon-rich phylogenomic studies [11,22]. The discrepancies between the chains mostly concerned nodes not under active discussion in this study, except for the monophyly of Archaeplastida, which was accordingly labelled unsupported; electronic supplementary material, figures S2 and S5 show the trees inferred from each individual chain to allow visual assessment of the discrepancies.
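
    The maxdiff statistic mentioned above can be illustrated with a small sketch: for every bipartition, take the largest difference in posterior frequency between chains, then report the maximum over all bipartitions. This is a re-statement of the idea under simplifying assumptions, not PhyloBayes code; the function name, the dictionary layout and the toy frequencies are hypothetical.

```python
def maxdiff(chain_freqs):
    """Largest between-chain discrepancy in bipartition frequency.

    chain_freqs is a list with one dict per chain, mapping a bipartition
    label to its posterior frequency in that chain (absent means 0.0).
    """
    all_bips = set().union(*(set(c) for c in chain_freqs))
    worst = 0.0
    for bip in all_bips:
        freqs = [chain.get(bip, 0.0) for chain in chain_freqs]
        worst = max(worst, max(freqs) - min(freqs))
    return worst


# Toy frequencies (hypothetical) for three chains.
chains = [
    {"AB|CD": 1.0, "ABC|D": 0.75},
    {"AB|CD": 1.0, "ABC|D": 0.25},
    {"AB|CD": 1.0},
]
disagreement = maxdiff(chains)
```

    A value near 0 means the chains agree on every bipartition; a single node sampled at very different frequencies in two chains is enough to push the statistic up, which is why a high maxdiff can coexist with agreement on most of the tree.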

    3. Results

    (a) Improved dataset and model selection

    To place the centrohelids in a broad eukaryotic framework, we took special care to include a very large diversity for all known main lineages. Building on previously published datasets [25,39], we more than doubled the taxon sampling, mostly using recently released high-quality transcriptomes for marine microbial species [28] (electronic supplementary material, table S3). Our carefully curated taxon sampling contained 150 OTUs for 250 genes (55 554 aa positions), globally characterized by only 21% of missing data (electronic supplementary material, table S1). Importantly, the four new centrohelid sequences missed only between 7.4% and 12.9% positions, which corresponded to at least 48 399 aa positions included, representing many fold improvements compared with the 76.3% missing data for the older Polyplacocystis contractilis dataset [10].
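
    The relationship between the two figures quoted for the new centrohelid data (at most 12.9% missing positions, at least 48 399 positions present) can be checked with a line of arithmetic. The sketch below only uses numbers given in the text; the function name is hypothetical.

```python
TOTAL_POSITIONS = 55_554  # length of the 150/250 supermatrix (from the text)
MIN_PRESENT = 48_399      # fewest positions present in a new centrohelid (from the text)


def percent_missing(present, total):
    """Percentage of alignment positions with no data for a taxon."""
    return 100.0 * (total - present) / total


worst_case = percent_missing(MIN_PRESENT, TOTAL_POSITIONS)
```

    Rounded to one decimal place, `worst_case` reproduces the 12.9% upper bound reported for the least complete of the four new transcriptomes.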

    In total, four models of evolution were tested on the different datasets: ML analyses employed a partition approach with 250 gene partitions allowing each gene to have its own model, the LG + Rx + F model and the LG + C60 + F model (electronic supplementary material, table S4); Bayesian analyses were run under the CAT + GTR + Γ4 model. To select the best-fitting model in ML, we followed the BIC score selection criterion, which showed that the LG + C60 + F model consistently achieved better scores than the other two models (electronic supplementary material, table S4). In Bayesian framework, the CAT + GTR + Γ4 model has been repeatedly shown to have a better fit than simpler models based on empirical exchangeability matrices such as LG, or even CAT + Γ4 alone [40,41]. However, the size of our datasets makes comparing the fit of these complex models computationally prohibitive, and thus topologies corresponding to the best-fitting LG + C60 + F model (ML) and the CAT + GTR + Γ4 model (Bayesian) are discussed in the following sections.

    (b) Evolutionary relationships among major eukaryotic lineages

    The LG + C60 + F and CAT + GTR + Γ4 analyses of the complete dataset (150/250) recovered with maximal support (100% UFboot and SH-aLRT; 1.0 PP) a monophyletic assemblage including centrohelids and haptophytes (figure 1; electronic supplementary material, figure S1). More generally, these analyses recovered many of the major eukaryotic groups, namely Obazoa, Amoebozoa, Excavata and the SAR assemblage (stramenopiles, alveolates, Rhizaria). The association previously suggested between cryptomonads, katablepharids and the marine biflagellate Palpitomonas bilix into the Cryptista clade was supported with 100% UFboot and SH-aLRT and 1.0 PP [25,27,43]. In the LG + C60 + F tree, the Archaeplastida lineages (i.e. green algae and land plants, glaucophytes and red algae) were paraphyletic, with Cryptista branching with green plants and glaucophytes (96% UFboot; 88% SH-aLRT). In the CAT + GTR + Γ4 analysis, the position of Cryptista among the Archaeplastida lineages was unresolved owing to incongruent nodes in the independent MCMC chains (electronic supplementary material, figure S2a–c). Telonemids were recovered as sister to SAR (93% UFboot; 99% SH-aLRT; 0.78 PP) and Picozoa as sister to the red algae (93% UFboot; 100% SH-aLRT; 1.0 PP).

    Figure 1. Phylogenetic tree of eukaryotes inferred from the complete dataset (150/250). The topology shown corresponds to the ML tree under the LG + C60 + F model, with both ML and Bayesian support values reported. Black dots on branches mean maximal support (i.e. 100% UFboot and SH-aLRT, and 1.0 Bayesian PP; the Bayesian CAT + GTR + Γ4 topology is shown in electronic supplementary material, figure S1). When not maximal, values are indicated only if deemed robust as follows: UFboot ≥ 95%/SH-aLRT ≥ 80%/PP ≥ 0.9. The tree is drawn rooted between Obazoa, Amoebozoa, Collodictyon, Malawimonas and the rest of eukaryotes after [42], though we note that the position of the root is under active debate.

    Following the inference of a close evolutionary link between centrohelids and haptophytes, the next important question is where Haptista goes in the global tree. The analyses of the 150/250 dataset placed Haptista as sister to SAR, a relationship that received no support under the LG + C60 + F model, but 1.0 PP under the CAT + GTR + Γ4 model (figure 1). To investigate this and the deeper structure of the tree in more detail, we reduced our dataset in two successive steps. First, we removed two orphan lineages, T. subtilis and Picozoa, leading to the 148/250 dataset. These enigmatic taxa mirror in many ways the problems we sought to solve here for centrohelids. They are still extremely poorly represented in genomic databases, being the sole representatives of a much higher lineage diversity [44,45], which translates into high proportions of missing data (67% for telonemids and 86% for Picomonas sp.; electronic supplementary material, table S1). Second, we removed from the 148/250 supermatrix the 19 047 fastest-evolving positions using the similarity between characters as an estimate of the evolutionary rates [32], leading to the 148/250-slow dataset. Fast-evolving positions are more likely to concentrate undetected multiple substitutions, even by advanced models of evolution such as the mixture models used here. Removing these positions from large alignments diminishes the amount of undetected multiple substitutions, but maintains enough phylogenetic information to reconstruct even ancient events, so this approach has shown great potential in other phylogenomic studies [26,46].

    The resulting topologies were similar to those based on the full dataset. However, whereas the analyses of the 148/250 dataset did not improve the general statistical support of the tree (electronic supplementary material, figures S3, S4 and S5a–c), the reconstructions based on the 148/250-slow dataset led to consistent and more robust topologies (figure 2; electronic supplementary material, figure S6). Here, Haptista received strong support (98% UFboot; 91% SH-aLRT; 1.0 PP) for its position as sister to SAR, and the Archaeplastida lineages and Cryptista were strongly inferred to share a common ancestor (100% UFboot; 98% SH-aLRT; 1.0 PP). Archaeplastida remained paraphyletic, but this was still unsupported and should be further tested (69% UFboot; 80% SH-aLRT; 0.89 PP). In these analyses, the Archaeplastida–Cryptista grouping branched with SAR + Haptista to the exclusion of all other eukaryotes with maximal support (100% UFboot; 100% SH-aLRT; 1.0 PP).
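
    To make the reading of the three support values concrete, the sketch below applies the robustness thresholds given in the figure 1 legend (UFboot ≥ 95%, SH-aLRT ≥ 80%, PP ≥ 0.9) to two nodes reported above. Treating each measure separately and labelling sub-threshold values 'weak' are assumptions made for this illustration, not conventions stated in the text; the function name is hypothetical.

```python
THRESHOLDS = {"UFboot": 95.0, "SH-aLRT": 80.0, "PP": 0.9}  # from the figure 1 legend
MAXIMAL = {"UFboot": 100.0, "SH-aLRT": 100.0, "PP": 1.0}


def assess(support):
    """Label each support measure as 'maximal', 'robust' or 'weak'.

    Evaluating the measures one by one is an assumption of this sketch.
    """
    labels = {}
    for measure, value in support.items():
        if value >= MAXIMAL[measure]:
            labels[measure] = "maximal"
        elif value >= THRESHOLDS[measure]:
            labels[measure] = "robust"
        else:
            labels[measure] = "weak"
    return labels


# Support values quoted in the text for the 148/250-slow dataset.
haptista_sar = assess({"UFboot": 98, "SH-aLRT": 91, "PP": 1.0})
archaeplastida_paraphyly = assess({"UFboot": 69, "SH-aLRT": 80, "PP": 0.89})
```

    Under these assumptions the Haptista + SAR node passes all three thresholds, whereas the paraphyly of Archaeplastida fails the UFboot and PP thresholds and only just reaches the SH-aLRT cut-off, consistent with it being described as unsupported.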

    Figure 2. Schematic of the new backbone for the eukaryotic tree, highlighting the relationships among the main groups. The topology is based on the 148/250-slow supermatrix, and corresponds to both ML and Bayesian reconstructions under the LG + C60 + F and CAT + GTR + Γ4 models, respectively. The complete tree is presented in electronic supplementary material, figure SX. Black dots on branches mean maximal support (i.e. 100% UFboot and SH-aLRT, and 1.0 Bayesian PP). When not maximal, values are indicated as follows: UFboot/SH-aLRT/PP. All supergroups indicated by the triangles received maximal support, with the exception of the grouping of Viridiplantae and glaucophytes, which was unsupported (shown by dashed lines). The size of the triangles roughly represents the diversity of taxa included in our analyses, as well as the length of the longest branch in each group. The root is placed in the same position as in figure 1.

    4. Discussion

    (a) Towards resolving the eukaryotic tree

    Over the past decade, several phylogenomic studies have attempted to resolve the deep-level relationships among the main lineages of eukaryotes [18,19,25–27,47]. These studies have greatly improved our model for the tree of eukaryotes, but several questions remain unsolved owing to the lack of data from poorly studied groups. Among these unsolved questions, the relationships between centrohelids, haptophytes, Cryptista and the main Archaeplastida lineages (green plants, glaucophytes and red algae) have all proved to be refractory to robust phylogenetic inferences. A combination of three important sources of artefact is most likely to explain the poor resolution for the placement of these lineages: (i) lack of data; (ii) too few representative species with genomic datasets; and (iii) models of evolution that fail to account for homoplasic positions. In this study, we addressed these possible sources of incongruence by (i) sequencing the transcriptome of four centrohelid lineages, (ii) using a considerable amount of newly available taxon diversity, and (iii) reducing non-phylogenetic signal by removing fast-evolving sites and applying site-heterogeneous models of evolution in both ML and Bayesian frameworks.

    Our analyses recovered Haptista with maximal support, regardless of the dataset or the model used, strongly confirming that centrohelids share a direct common ancestry with haptophytes [4,11]. For the deeper relationships among eukaryotic groups, we found that a greater taxon diversity together with the systematic use of site-heterogeneous models, allowing us to take into account site-specific substitution patterns (C60 mixture and CAT models), improves the general statistical confidence of the tree. When combined with a less noisy dataset (removal of the fastest-evolving sites), these models converged towards a similar picture in both ML and Bayesian frameworks (figure 2). In this tree, Haptista are closely related to the SAR assemblage with high support, in agreement with weaker results based on lower taxon diversity and different models [25,48]. Another relationship to receive strong support for the first time is the grouping of Archaeplastida with Cryptista. This affinity between Archaeplastida and Cryptista has been noted before in several nuclear [25–27,48] and mitochondrial-based [42] phylogenomic investigations, as well as in many 18S rDNA molecular studies [6], but unlike here, it never received significant support. Taken together, the affinities of Cryptista to Archaeplastida and of Haptista to SAR further diminish the support for Hacrobia, which was initially a less controversial assemblage when poorer taxon sampling was available [10,20,21,24]. Even though recent phylogenomic analyses continued to show a monophyletic Hacrobia, this was with no support [11,22], or with better confidence only when a large part of the diversity was removed [11].

    One group of Cryptista (the cryptomonads) includes lineages with plastids of red algal origin (see below), which may confound our ability to discriminate vertically inherited genes from endosymbiotically derived ones. Indeed, it is at face value possible that the phylogenetic relationship between Cryptista and Archaeplastida observed here and elsewhere [25–27,48] is due to undetected red algal genes in phylogenomic datasets, rather than common ancestry. This is formally possible, because endosymbiotic gene transfer (EGT) is common during endosymbiosis, but there are several reasons to suggest this is not affecting our results. First, if it were the case that large numbers of unrecognized red algal genes invaded eukaryotic genomes after endosymbiosis, then one would expect all red algal plastid-containing lineages to contain many such genes, and accordingly, all be attracted to Archaeplastida, not only cryptomonads. Second, large-scale investigations of EGT in various eukaryotes (including the whole genome of the cryptomonad Guillardia theta) have shown that the endosymbiotic contribution to the host genome, although real, is probably less substantial than originally envisioned [49–52]. Third, phylogenomic datasets usually consist of highly expressed housekeeping genes that show no sign of widespread red algal signal. Careful inspection of our dataset allowed us to detect various contaminations in different lineages, but not specifically from red algae, and suspicious topologies were not included, as in the case of the translation elongation factor 2 [53]. Overall, we observed no genes in our dataset that individually showed a strong affinity between Cryptista and red algae, suggesting that this relationship is a reflection of vertical inheritance rather than owing to a cryptic contamination of endosymbiont genes.

    (b) Implications for plastid evolution

    Beyond these taxonomic considerations, the positions of centrohelids, haptophytes and Cryptista in the tree of eukaryotes have important implications for how we interpret some major evolutionary and ecological transitions in eukaryotic history. The groups investigated here and their relationships to the SAR and Archaeplastida supergroups represent a complex mixture of photosynthetic and heterotrophic eukaryotes, as well as lineages for which we have little evidence as to whether they harbour a plastid or not [13,54]. Many of these lineages possess plastids bounded by three or four membranes, which are the result of eukaryote-to-eukaryote endosymbioses where heterotrophic organisms acquired plastids from red algae [55]. What makes the evolution of complex red plastids so hard to decipher is the apparent discrepancy between plastid and host phylogenies. Plastid phylogenies have generally been consistent with the notion that all red plastids are the product of a single secondary endosymbiosis [17,56–58]. This idea of a single origin was first formalized in the chromalveolate hypothesis, which posited that there was a single engulfment of a red alga in a common ancestor of stramenopiles, haptophytes, cryptophytes and alveolates [59]. Host-derived phylogenies, on the other hand, have generally failed to provide any strong evidence that all red-algal-containing lineages (and their associated plastid-lacking relatives) are monophyletic, which is required under the single endosymbiotic origin scenario. However, host phylogenies have thus far not provided any convincing alternative topologies either, making it difficult to see how plastid and host data can be best reconciled.

    In this context, our work can help us understand the evolution of red plastids. Specifically, the strongly supported grouping of Archaeplastida and Cryptista de facto rules out the scenario of a single red plastid origin in a hypothetical ancestor of a unified chromalveolate assemblage (figure 3a). As stated above, the lack of support for the monophyletic origin of red plastids from host data is not new, but this is the first time, to the best of our knowledge, that a phylogenetic tree strongly argues against it. Indeed, had Cryptista branched elsewhere in the tree, a single origin of chromalveolate plastids could be explained by positing additional plastid loss events, however unlikely that may be. However, because Cryptista branches with the same lineage from which the plastid is derived (i.e. Archaeplastida), a single origin of red plastids is formally impossible, because those plastids would have needed to travel backwards in time to result in this topology.

    Figure 3. Scenarios for the origin and evolution of complex red plastids. These scenarios do not refer to any specific taxa, but rather illustrate the various possibilities discussed in the text, and show that the same diversity of plastid types can be generated by different combinations of events. (a) A single secondary endosymbiosis in the ancestor of all red plastid-bearing eukaryotes was followed only by descent with modification, as formalized in the chromalveolate hypothesis; this scenario is not supported by the current analyses. (b) Multiple independent secondary endosymbioses take place with different red algal symbionts, followed by descent with modification; this is compatible with current phylogenetic evidence from hosts, but not with evidence from plastids. (c) A single secondary endosymbiosis takes place, but is followed by serial eukaryote-to-eukaryote endosymbioses; several versions of this scenario have been proposed (see text for references), and they are consistent with current phylogenetic data.

    With what are now robust relationships for both plastids and hosts, how can we best reconcile their apparently conflicting topologies? Two main scenarios exist to explain the origin and present distribution of complex red plastids: (i) independent secondary endosymbioses (figure 3b) and (ii) a unique secondary endosymbiosis followed by additional layers of endosymbioses (i.e. tertiary or quaternary; figure 3c). Even though the first scenario of independent endosymbioses involving different red algae could explain the tree topologies, such a model is unlikely in the light of several other pieces of evidence showing that all or substantial subsets of the ‘chromalveolate’ plastids trace back to a single secondary red algal endosymbiont [23,60–62]. Lately, the second scenario of serial endosymbioses (figure 3c) has received increased attention, being now supported by a growing body of empirical data [48,63]. Several versions of this serial endosymbiotic framework for red plastid evolution have been proposed, all involving the idea of one secondary endosymbiosis with a red alga, followed by subsequent eukaryote-to-eukaryote endosymbioses [48,62,64,65]. Recently, an explicit model was devised using regression analyses to measure the expected similarity between genomes of various ‘chromalveolate’ lineages [63]. This approach resulted in a model where cryptophytes first engulfed a red alga, which was then transferred to the ochrophytes by tertiary endosymbiosis, and to the haptophytes by quaternary endosymbiosis [63].

    In this context, our results are compatible with such a ‘cryptophyte-first’ model, although we note that phylogenetic lines of evidence are not compelling by themselves. More generally, our results will need to stand the test of time, as even strongly supported trees can be shown to be misleading with additional data. Moreover, the breadth of plastid genome data has now been far exceeded by that of nuclear data, so changes to the plastid tree are likely to occur after the addition of new sequences, as recently demonstrated [58]. All of this could ultimately affect our interpretation, but more importantly various kinds of data, not only phylogenetics, will be needed to validate a particular model. Plastids are cellular structures of great complexity that have integrated with their hosts in many ways [54,66]. Serial endosymbiosis is currently known for certain only in a few dinoflagellate lineages, whose endosymbionts display peculiar ways of integrating that are very different from what we observe in lineages like haptophytes, ochrophytes or most alveolates [67–69]. Thus, an integrative model of plastid evolution will need to explain many aspects to be comprehensive, from phylogeny to genetics to fine cellular processes.

    5. Concluding remarks

    Our centrohelid transcriptomes fill an important diversity gap in genomic sequencing. In the near future, effort should be made to provide better-quality datasets for taxa that are still evolutionary mysteries but are essential to further resolve the tree; telonemids and Picozoa represent obvious targets close to the organisms studied here, but many other enigmatic microbial eukaryotes probably affect other parts of the tree in similar ways. More work is also necessary to determine the position of Cryptista relative to the Archaeplastida lineages in order to assess the monophyletic origin of the primary plastids.

    Data accessibility

    Raw reads are available through GenBank sequence read archive: SRR2170621, SRR2170625, SRR2170626, SRR2170627, SRR2170634.

    Assembled transcriptomes: Dryad data depository (http://datadryad.org) accession http://dx.doi.org/10.5061/dryad.rj87v.

    Untrimmed sequences, trimmed alignments and single-gene trees: Dryad data depository (http://datadryad.org) accession http://dx.doi.org/10.5061/dryad.rj87v.

    Authors' contributions

    F.B. designed the study, participated in the dataset construction, carried out the phylogenetic analyses and drafted the manuscript; M.K. participated in the dataset construction; D.V.T. and V.Z. collected field samples, established cultures, carried out molecular laboratory work and drafted the manuscript; L.V.R. collected field samples and established cultures; B.Q.M. carried out phylogenetic analyses and drafted the manuscript; A.S. and A.P.M. participated in the design of the study and critically revised the manuscript; P.J.K. designed the study and drafted the manuscript. All authors gave final approval for publication.

    Competing interests

    The authors declare no competing interests.

    Funding

    This work was supported by a grant from the Natural Sciences and Engineering Research Council of Canada, and by a grant from the Tula Foundation to the Centre for Microbial Diversity and Evolution. This work was also partially supported by the Russian Foundation for Basic Research (nos. 14-04-00554, 15-34-20065, 15-29-02518, 15-04-18101_a) and by a grant from the President of the Russian Federation (MK-7436.2015.4). The work of D.V.T. was supported by the Russian Science Foundation (no. 14-14-00515). B.Q.M. acknowledges financial support to Arndt von Haeseler from the University of Vienna and the Medical University of Vienna.

    Acknowledgements

    We thank Compute/Calcul Canada for computing resources and assistance, in particular WestGrid's Orcinus and Calcul Quebec's Guillimin and Colosse facilities.

    Footnotes

    Published by the Royal Society. All rights reserved.

    References