Single cell ecogenomics reveals mating types of individual cells and ssDNA viral infections in the smallest photosynthetic eukaryotes
Abstract
Planktonic photosynthetic organisms of the class Mamiellophyceae include the smallest eukaryotes (less than 2 µm), are globally distributed and form the basis of coastal marine ecosystems. Eight complete fully annotated 13–22 Mb genomes from three genera, Ostreococcus, Bathycoccus and Micromonas, are available from previously isolated clonal cultured strains and provide an ideal resource to explore the scope and challenges of analysing single cell amplified genomes (SAGs) isolated from a natural environment. We assembled data from 12 SAGs sampled during the Tara Oceans expedition to gain biological insights about their in situ ecology, which might be lost by isolation and strain culture. Although the assembled nuclear genomes were incomplete, they were large enough to infer the mating types of four Ostreococcus SAGs. The systematic occurrence of sequences from the mitochondria and chloroplast, representing less than 3% of the total cell's DNA, intimates that SAGs provide suitable substrates for detection of non-target sequences, such as those of virions. Analysis of the non-Mamiellophyceae assemblies, following filtering out cross-contaminations during the sequencing process, revealed two novel 1.6 and 1.8 kb circular DNA viruses, and the presence of specific Bacterial and Oomycete sequences suggests that these organisms might co-occur with the Mamiellales.
This article is part of a discussion meeting issue ‘Single cell ecology’.
1. Introduction
Planktonic photosynthetic eukaryotes of the class Mamiellophyceae (Chlorophyta) include the smallest eukaryotes, with bacterial-sized cells of less than 2 µm diameter. Environmental surveys based on the sequence of the highly conserved 18S ribosomal gene (18S) eukaryotic barcode [1] revealed their worldwide distribution [2–5]. Their ubiquity, high abundance and turnover suggest they sustain the marine ecosystems in many coastal areas [6,7]. Many flagellate species in this class were described in the past century before the advent of molecular biology techniques and their first application to these picoalgae [8]: Monomastix 1912 [9], Micromonas pusilla 1951 [10] and Mamiella gilva 1964 [11], followed by Crustomastix and Dolichomastix 2005 [12]. A non-flagellate member was described 1990 as Bathycoccus prasinos [13] and the non-scaled coccoid Ostreococcus tauri was described in 1995 [14]. A thorough phylogenetic analysis of the nuclear and plastid encoded rDNA operon led Marin and Melkonian to define the class Mamiellophyceae, divided into three orders Mamiellales, Dolichomastigales and Monomastigales [15]. While 22 species are presently described in AlgaeBase [16], recent environmental sequencing suggests that many more species have not been yet isolated [5,17] so that the number of species within this class remains an open question. The relative ease of culture of strains of the picoalgae Bathycoccus, Micromonas and Ostreococcus in Keller's and related media [18] fostered their development as new models for cell biology [19,20] and environmental studies [21,22]. Complete and annotated nuclear genomes have been obtained for eight species to date: O. tauri RCC4221 [23], O. lucimarinus [24], O. sp. RCC809, O. mediterraneus [25], Micromonas pusilla and M. commoda [26], B. prasinos RCC1105 [27] and O. tauri RCC1115 [28].
Despite the progress enabled by laboratory experiments and sequence analyses, the sexual life cycles of these haploid picoalgae, as well as their interactions with bacteria [29] and viruses [30] in situ remain enigmatic. Single amplified genomes (SAGs) assemblies, produced by multiple displacement amplification (MDA) after extraction from sorted single cells have been previously reported to contain foreign DNA [31,32]. They may be viewed conceptually as metagenomes, provided they preserve a molecular signature of the ecological context of the cell. SAGs of cells directly sampled from the environment may not only foster knowledge on diversity in microorganisms [33,34], but also open opportunities for the identification of all kinds of novel biological associations, from predation to parasitism [31,35].
Here, we investigated 12 Mamiellales' SAGs sampled in the Indian Ocean during the Tara Oceans expedition [36] to explore the biological insights these data provide within an ecogenomic framework. Therefore, we analysed the taxonomic affiliations of the assembled sequence data to discuss the power and challenges of using SAGs to retrieve in situ ecological information (i) from the target sequence data, e.g. Mamiellales sequences, about the mating type of the cells and (ii) from the non-target sequences identified.
2. Material and methods
(a) Data
Single-cells were collected during the Tara Oceans expedition [37] from surface waters in the Indian Ocean at stations 39, 41 and 46 [38] and cryopreserved using glycine betaine. Flow cytometric cell sorting, single cell lysis and whole genome amplification by multiple displacement amplification (MDA) [39] were performed at the Bigelow Single Cell Genomics facility (Boothbay, Maine USA), as previously described [40–42]. The resulting SAGs were screened using universal eukaryotic 18S rDNA PCR primers [42]. The 12 SAGs we analysed here were selected for sequencing based on their 18S identities and were sequenced using Illumina HiSeq technology at the Genoscope [33]. All SAGs were multiplexed; SAG1–6 and SAG7–12 were sequenced on two different runs. The raw Illumina data were processed as previously described [36]: adapters and primers used during library construction were removed from the whole reads and both ends were trimmed for low quality nucleotides (quality value less than 20). The longest sequence without adapters and with fewest low quality bases was kept. Sequences between the second unknown nucleotide (N) and the end of the read were also trimmed. Reads shorter than 30 nucleotides after trimming were discarded. These trimming steps were achieved using fastx_clean, an internal software based on the FASTX library. The reads and their mates that mapped onto run quality control sequences (Enterobacteria phage PhiX174 genome, NC_001422.1) were removed using SOAP aligner. The total number of paired-end reads from the processed raw reads ranged between 19 and 36 millions, with an average of 26 million reads per SAG.
(b) SAG assembly and taxonomic affiliation of contigs
Individual pair-end read files were assembled using SPAdes v. 3.9.0 [43] and the final assembly statistics were generated with Quast v. 4.6.3 [44]. Contigs smaller than 400 bp were discarded from downstream analysis. Genome completeness was assessed with BUSCO v. 3.0.2 using the 303 eukaryotic single copy orthologues dataset [45] and compared to the values obtained on the available closely related sequenced genomes (table 1).
sample—station | Bathycoccus—39 |
Micromonas—46 |
Ostreococcus—41 |
|||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
SAG | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
assembly size (Mb) | 3.25 | 4.42 | 3.36 | 3.72 | 0.89 | 0.56 | 0.58 | 0.53 | 0.85 | 3.46 | 1.54 | 2.71 |
N50 (kb) | 13.6 | 17.4 | 19.5 | 20.2 | 3.17 | 2.47 | 9.67 | 6.91 | 6.44 | 37.5 | 17.8 | 36.4 |
GC% | 45.7 | 47.0 | 46.6 | 46.5 | 40.0 | 39.3 | 41.6 | 39.1 | 52.2 | 57.1 | 52.1 | 56.6 |
contigs | 3157 | 3129 | 3461 | 2913 | 2864 | 1894 | 1043 | 1788 | 2914 | 2479 | 2207 | 2364 |
contigs (>400 bp) | 1048 | 1241 | 990 | 991 | 691 | 470 | 272 | 275 | 410 | 737 | 546 | 655 |
reference genome | B. prasinos RCC1105 | M. commoda RCC299 | O. tauri RCC4221 | |||||||||
nuclear (GC%) | 15 Mb (48.0) | 21 Mb (63.8) | 13 Mb (59.0) | |||||||||
mitochondria (bp) (GC%) | 43 614 (40.1) | 47 425 (34.6) | 44 237 (38.2) | |||||||||
chloroplast (bp) (GC%) | 72 700 (43.3) | 72 585 (38.8) | 67 681 (39.9) |
The SAG assemblies were screened for contigs containing the 18S using a blastn homology search using a custom database of 18S extracted from GenBank complemented with 18S genes of Mamiellales extracted from the genome sequence (electronic supplementary material, table S1). Contigs with a 18S hit were analysed with RNAmmer v. 1.2 (http://www.cbs.dtu.dk/services/RNAmmer/) and the predicted 18S sequences were aligned with Mamiellales 18SrDNA sequences using MAFFTv. 7.305b [46]. Alignments were trimmed with trimAI v. 1.2 (-automated1 method) [47] and maximum-likelihood phylogenetic trees were built under GTR model using RAxML v. 8.2.9 [48] and plotted with FigTree v. 1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).
Taxonomic affiliations to separate non-target contigs (foreign DNA) from target (Mamiellophyceae) were performed as follows. Each contig was aligned against GenBank using DIAMOND (blastx option e-value 1 × 10−5 maximum aligned sequences 100) and the 10 best hits for each contig were kept. The taxonomic affiliation of each contig was retrieved using a custom perl script that fetched the GenBank accession number from the DIAMOND output, and retrieved the taxonomical nodes from the GenBank taxonomy database for each hit (script available upon request). Manual inspection of taxonomic affiliation of contigs containing several open reading frames (ORFs) led to the correction of the taxonomic affiliation for a few contigs affiliated to Archaea on the base of the best blast hit, while most ORFs along the contig had a best hit against bacterial sequences. As the complete genome and organellar sequences of Ostreococcus, Bathycoccus and Micromonas species are present in GenBank, contigs were considered as target if at least one of the 10 best hits belonged to a Mamiellophyceae. We considered following high-order taxonomic ranks: Archaea, Bacteria, Fungi, Mamiellophyceae, Metazoa, Oomycetes, Other (when several taxa were mixed in uneven proportions within the 10 best hits), Other protists, Plants (Streptophyta), Unassigned contigs and Virus. To estimate the number of reads recruited to each taxonomic rank in each SAG assembly, we aligned the reads against each assembly with BWA [49] and extracted the read coverage of each contig estimated from the SAMtools mpileup and depth suite [50].
The sum of contig lengths affiliated to a given group was plotted with ggplots in the R environment [51].
(c) Cross-contamination filtering
To remove contigs derived from cross-contamination during the sequencing step, all SAG assemblies from the same Illumina run were compared against all others with blastn (e-value cut-off 1 × 10−5). This led to the inclusion of one SAG from a MAST lineage (TOSGAG23-1, GenBank accession ERX1271190) that was multiplexed with SAGs 7–12. Each contig was flagged as candidate contaminant if its alignment length with a contig from another SAG was equal to or higher than 95% of the length of its best hit, with a nucleotide identity higher than 99.5%. The number of reads recruited to each candidate contig was used to define the source of the contamination: the SAG containing the contig with the highest read coverage was defined as the source SAG, and the corresponding contigs from the other SAGs were considered as contaminants and discarded.
(d) Annotation of non-target contigs
Because the virus contigs detected in SAGs 8 and 12 had best hits with circular viruses, the contigs were first circularized with the geneious program v. 10.0.5 [52], and ORFs were predicted with Glimmer [53].
The conserved motifs of the viral replicase gene [54,55] were manually searched by inspection of the predicted amino acids sequence located in the N-terminal endonuclease domain and in the C-terminal superfamily 3 helicase domain using MAFFT v. 7.305b (--localpair –maxiterate 1000). The best blastp hits to these ORFs were complemented with sequences retrieved from GenBank and the Ocean Atlas genes database [56] (electronic supplementary material, table S2). Multiple sequence alignments were performed with MAFFT v. 7.305b (--localpair --maxiterate 1000) [57], and trimmed with trimAI v. 1.2 (−automated1 method) [47]. For each alignment, protein evolution models were selected using ProtTest v. 3.4.2 [58] with the Akaike Information Criterion (AIC) (LG+I+G for ORF1 and RTREV+I+G+F for ORF2). Phylogenies were built by the maximum-likelihood method implemented in RAxML v. 8.2.9 [48], and plotted in FigTree v. 1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).
(e) Screening assemblies for information about mating type
In Ostreococcus species, population genomics analyses have revealed two anciently diverged 500 kb genomic regions with supressed recombination, which are candidate mating type regions [28]. As in many bipolar mating type systems, orthologous genes on this mating type locus (MT) cluster phylogenetically by mating type, not by species [28]. This is as expected if the origin of mating types predates speciation within this genus, as previously observed in the Volvox genus [59]. We took advantage of this deep divergence between 23 orthologous gene families of the two mating types, gametologs [60], to screen the SAG assemblies for homologous sequences. Briefly, orthologous gene families were obtained following an ‘all-against-all’ protein sequence similarity search, performed with blastp (maximum e-value 1 × 10−4) and gene families were delineated using OrthoFinder (v. 2.1.2) [61]. The dataset consists of 23 genes in the eight sequenced Mamiellophyceae genomes [28] and contains gene families consisting exclusively of gametologs in the eight sequenced Mamiellales genomes: O. tauri RCC4221 [23], O. tauri RCC1115 [28], B. prasinos RCC1105 [27], M. commoda RCC299, M. pusillla CCMP1545 [26], O. lucimarinus [24], O. spp. RCC809 and O. mediterraneus RCC2590 [25]. Protein coding sequences of these genes in the Ostreococcus SAG assemblies were extracted manually from the tblastn nucleotide alignments.
(f) Biogeographic occurrence of Mamiellales SAG-associated sequences
The non-targeted bacterial sequences (electronic supplementary material, table S3) were searched against 957 manually curated partial assemblies (metagenome-assembled genomes, MAG) reconstructed from 93 TARA Oceans metagenomes [62], including those stations from the North Indian Ocean where the Mamiellophyceae SAGs were isolated. A match against a MAG was defined as an alignment of 90% of the length of SAG-associated bacterial contig with a nucleotide identity higher than 98%. Nucleotide identity and total alignment length for each SAG-associated bacterial contig were computed from blastn alignments [63].
The geographical occurrences and distribution of the non-target virus sequences were obtained by searching for homologous sequences to the two predicted protein-coding genes, a capsid and a replicase, in the OCEAN GENE ATLAS [56] using blastp (e-value 1 × 10−5). Amino acid sequences for each hit were downloaded from the database (electronic supplementary material, table S2). This atlas contains a gene catalogue from 243 metagenomes and 441 metatranscriptomes from the Tara Oceans expedition, as well as the metagenomes from the Global Ocean Sequencing (GOS) expedition [64].
3. Results
(a) Completeness of target assemblies
Phylogenetic analysis of 18S data retrieved from the de novo assembly was consistent with prior taxonomic affiliations. Ostreococcus SAG 18S sequences retrieved from the de novo assembly clustered with O. spp. RCC809. However, while SAG9, 10, 11 sequences were 100% identical to RCC809 (and RCC143), the 18S sequence of SAG11 had one difference from RCC809. The 18S of the Bathycoccus SAGs were 100% identical to the 18S sequence from the Bathycoccus prasinos RCC1105 reference genome and the 18S found in Micromonas SAGs were 100% identical to the recently described M. bravo species [65] (figure 1). We could not retrieve the 18S rDNA sequence from the assemblies in SAG4 and SAG6–7.
Figure 1. Maximum-likelihood phylogeny (under GTR model) of 18S rDNA sequences retrieved from the SAG assemblies of Bathycoccus sp. (SAG1–3), Micromonas sp. (SAG5 and 8) and Ostreococcus sp. (SAG9–12). Mamiellales reference genomes are marked with a red star, and SAGs are marked with orange star. Rapid bootstrap support values are indicated in the key (100 replicates), represented by coloured circles in branches. The final alignment contained 1750 nucleotides.
The SAG target assemblies recruit the overwhelming majority of reads, from a minimum of 82% of reads in SAG8 to 99.96% in SAG9 (table 2). Target assemblies are fragmented and summed up to a maximum of 4.2 Mb (SAG2), corresponding to a maximum of one quarter of the expected genome size (SAG2, Bathycoccus, SAG10, Ostreococcus). This modest genome representation is also reflected by the BUSCO analysis, with a maximum of 28.7% of represented genes (SAG9), to 1% in SAG6, where the target sequences are overwhelmingly represented by the chloroplast and mitochondrial sequences (table 2). As a comparison, the same BUSCO analyses of the available O. tauri, B. prasinos and M. commoda genomes gave 86%, 84% and 89% respectively, suggesting that BUSCO analysis gives a conservative estimation of genome completeness in the Mamiellales. The relatively low target nuclear genome recoveries are in line with previous studies, which showed that the combination of sequences obtained from several SAGs may increase target genome recovery from 20% to 70% [42].
T.-O.a |
SAG | target sequences |
non-target sequences |
BUSCO | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
assembly total length (% of reads recruited) |
total length (% of reads recruited) |
|||||||||||
total |
mitochondria |
chloroplast |
total |
filtered |
(%) | |||||||
×103 bp | (%) | bp | (%) | bp | (%) | bp | (%) | bp | (%) | |||
39 | SAG1 | 3001 | (99.89) | 40 481 | (1.46) | 33 423 | (0.03) | 101 751 | (0.09) | 29 645 | (0.03) | 11.9 |
39 | SAG2 | 4166 | (99.47) | 49 421 | (0.05) | 21 954 | (0.01) | 89 862 | (0.28) | 22 815 | (0.07) | 17.2 |
39 | SAG3 | 3130 | (99.91) | 46 597 | (0.11) | 35 348 | (0.05) | 90 824 | (0.06) | 46 751 | (0.03) | 15.2 |
39 | SAG4 | 3509 | (99.96) | 54 044 | (0.08) | 38 960 | (0.04) | 118 507 | (0.02) | 32 028 | (0.005) | 16.8 |
46 | SAG5 | 441 | (90.32) | 54 577 | (3.36) | 51 112 | (65.97) | 228 342 | (2.15) | 122 933 | (1.16) | 2.0 |
46 | SAG6 | 271 | (98.26) | 37 972 | (20.66) | 64 962 | (48.70) | 135 783 | (0.26) | 33 796 | (0.06) | 1.0 |
46 | SAG7 | 255 | (96.12) | 34 404 | (0.28) | 53 356 | (73.30) | 136 620 | (2.02) | 26 962 | (0.39) | 2.3 |
46 | SAG8 | 187 | (82.95) | 44 825 | (14.67) | 42 081 | (0.21) | 144 189 | (6.09) | 31 813 | (1.34) | 3.0 |
41 | SAG9 | 712 | (99.86) | 43 157 | (1.58) | 40 737 | (0.15) | 41 906 | (0.08) | 6307 | (0.01) | 28.7 |
41 | SAG10 | 3180 | (99.88) | 47 961 | (1.11) | 93 684 | (0.79) | 55 576 | (0.01) | 11 764 | (0.002) | 22.8 |
41 | SAG11 | 1225 | (92.74) | 28 820 | (19.66) | 101 901 | (7.98) | 142 887 | (1.78) | 20 493 | (0.25) | 7.6 |
41 | SAG12 | 2250 | (98.85) | 43 041 | (0.15) | 84 203 | (10.60) | 79 669 | (1.11) | 11 876 | (0.16) | 15.5 |
There was a large variation in the proportion of reads recruited to the mitochondrial (0.05% to 26%) and chloroplast (0.01% to 73%) genomes. Importantly, the assembly lengths were always large enough (greater than 20 kb, table 1) to annotate several complete protein-encoding genes. These sequence data are sufficient for the unambiguous identification of contigs from both the mitochondrial and plastid genomes. However, because the percentage of recruited reads also depends on the copy number of the chloroplast and mitochondria genomes within a cell, we first estimated the nuclear to plastid or to mitochondria ratios in Ostreococcus from available data. In O. tauri, this relative copy number can be inferred from the ratio of the nuclear [28] versus organellar genome coverage [66] obtained from the same Illumina run of a haploid culture. Mitochondrial genome copy number was estimated to range between 2 and 5, chloroplast copy number ranged between 3 and 7, using the available Illumina datasets obtained from 13 independent O. tauri cultures (electronic supplementary material, table S4). The organellar genomes represent less than 0.5% of the nuclear genome sizes in Mamiellophyceae (table 1), but may reach 3.5% with the maximum organellar genome copy number.
If the proportion of reads reflects the proportion of DNA in the sampled cell, the per cent of reads recruited to the organellar genomes may thus be expected to range from 0.01% to 3.5%. This is the case in all Bathycoccus SAGs and two Ostreococcus SAGs (table 2).
The complete mitochondrial (mtDNA) or chloroplastic (cpDNA) genomes could not be assembled from individual SAG data (geneious inbuilt assembler with minimum contig overlap 100 bp and a minimum overlap identity 90%). However, running the assembler on all contigs affiliated to cpDNA or mtDNA within each genera led to large (greater than 30 kb) assemblies of 3 mtDNA and 2 cpDNA. Manual assembly guided by the synteny with the available reference genomes (table 1) led to two complete assemblies, and three partial assemblies, summarized in table 3. To compare the mitochondrial genome assembly obtained for Bathycoccus SAGs with the available reference mitochondrial genome of Bathycoccus prasinos RCC1105 [27], the amino acid identity was calculated along a concatenation of six orthologous genes outside the inverted repeat region (nad5, nad4, coc3, cox2, atp6 and nad1), as previously described [1]. Intraspecific diversity within Mamiellophyceae organelles is as yet unknown for most species, but in O. tauri intraspecific amino acid divergence is below 0.1% [66]. The amino acid identity over the 1789 amino acid-long alignment was 81%, supporting the previously published conclusion based on the analysis of the combined nuclear genome assembly from SAG1–4, that the Bathycoccus sequenced belong to a novel, as yet uncultured, Bathycoccus species [33].
chloroplast genome (cpDNA) |
mitochondrial genome (mtDNA) |
|||
---|---|---|---|---|
assembly size (Kb) | assembly name | assembly size (Kb) | assembly name | |
SAG1–4 | — | — | 48.8 | Ba_mt_TOSAG39 complete |
SAG5–8 | 54.7 | Mi_cp_TOSAG46 partial | 33.3 | Mi_mt_TOSAG46 partial |
SAG9–12 | 68.9 | Os_cp_TOSAG41 complete | 39.1 | Os_mt_TOSAG41 partial |
(b) Mating type inference in Ostreococcus SAGs
Contigs from the target fraction in the SAGs assemblies were searched for genes located within the candidate MT– and MT+ gene families defined in the available reference genomes. Since the sequences of the two mating types have not yet been identified in Bathycoccus and Micromonas, the inference of mating types is not yet possible in these genera. The amino acid sequence of orthologous genes could be retrieved for six out of the 23 gene families (figure 2a). This low gene number is not surprising given the low nuclear genome coverage of each SAG (table 1), but gene family 16 (GF16—encoding for a diacylglycerol glucosyltransferase) had matches with three out of the four Ostreococcus SAGs. Comparison of the amino acid sequence of these genes with genes from Ostreococcus sp. RCC809 indicates that the four Ostreococcus SAGs have the same mating type as RCC809, that is MT+ (figure 2b).
Figure 2. Mating type inference from amino acid identity with MT+ and MT− orthologous gene families (GF) in Ostreococcus SAGs. (a) Expected taxonomic affiliation for MT− and MT+ strains. (b) Presence/absence matrix of hits to gene families in each SAG. (c) Maximum-likelihood phylogeny (LG+G model) of the SAGs sequence corresponding to GF16 with orthologous genes from Ostreococcus species.
(c) Taxonomic diversity of non-target contigs
The non-target assemblies recruited 0.01–6% of the reads (table 1). Taxonomic affiliation of total non-target contigs is provided in figure 3a and varies greatly between SAGs. For eight SAGs, most non-target reads were recruited to Bacteria, for two SAGs, most non-target reads were recruited to Oomycetes, and for two SAGs, most reads were recruited to sequences of viral origin. Sequence comparison between non-target assemblies revealed 100% identical sequences shared between SAGs coming from different stations but sequenced on the same Illumina lane, a suspicious signature suggesting cross-contamination during the amplification and sequencing step [67]. This led us to define a cross-contamination filter procedure for all SAGs sequenced within the same run. To find out which of the SAGs was the most likely source of the contamination, we used the number of reads recruited to non-target contigs. The SAG containing the most sequenced reads was considered as the source SAG of the non-target contig. The taxonomic affiliation of non-target contigs following filtering is presented in figure 3b. Basically, this led to a two- to sevenfold reduction of the total non-target assembly length depending on the SAG (table 2). Following cross-contamination filtering, the maximum per cent of reads recruited to non-target sequences dropped from 6% to 1% in SAG8, and became almost negligible (0.002%) in SAG10. The taxonomic affiliation of these filtered non-target sequences could be divided into three groups: Bacteria, Oomycetes/Fungi and Viruses, in order of decreasing frequency.
Figure 3. Taxonomic affiliations of non-target sequences. (a) Per cent of raw read number. (b) Per cent of reads following cross-contamination read filtering.
Most non-target reads were recruited to Bacteria for 11 SAGs and to virus for one SAG (SAG12) (figure 3b). The non-target bacterial contigs had a large phylogenetic spread (figure 4a) and while the most common phylum belonged to the Proteobacteria, there was no evidence of any specific bacteria—SAG association from this analysis. To gain more information about the origin of the 83 non-target bacterial sequences, we estimated their similarity with 957 manually curated partial assemblies (MAGs) reconstructed from 93 TARA Oceans metagenomes [62]. Over one third (32 out of 83) of the bacterial contigs had more than 98% identity over 90% of their length with a TARA MAG, suggesting indeed that these bacteria belong to the most abundant bacteria sequenced within the TARA Ocean sampling (figure 4a). However, only three bacterial contigs had a hit against a MAG from the North Indian Ocean, while most hits were against MAGs from the Mediterranean Sea and South East Pacific Ocean. Therefore, except for the three bacterial contigs assembled from SAG3 and SAG4, the bacteria sampled together with the SAG do not represent the most dominant bacteria at the sampling site.
Figure 4. Taxonomic affiliation inferred from best blast hits (1 × 10−5 cut-off) of Bacterial contigs (a) at level of the phylum; Oomycetes contigs at level of family (b) (sum of contig lengths per taxa in each SAG is indicated by bubble size). Length of contig recruited to TARA MAGs from TARA metagenomes is indicated by bubble colour.
The cross-contamination filtering removed most of Oomycetes affiliated non-target contigs (figure 3b), and the source was one uncultured marine Stramenopile (MAST) SAG [42] sequenced in the same lane as SAG7 to SAG12. The class Oomycetes is also from the Stramenopiles group and forms a distant, yet phylogenetically related clade to MASTs, which currently lack representation in reference databases [68]. Following the cross-contamination filtration step, non-target Oomycete contigs were affiliated to four families (figure 4b): Pythiaceae (two contigs), Albuginaceae (three contigs) and the more abundant Peronosporaceae (30 contigs) and Saprolegniaceae (33 contigs). Given the available genome size estimations in Oomycetes [69] or MASTs [42], which are larger than 20 Mbp—the size of the largest target Mamiellales, Micromonas (table 1)—the quantity of Oomycetes DNA present within the SAGs precludes any evidence of co-sampling, but rather suggests random capture of DNA during the cell sorting process.
The least frequent non-target sequence group was affiliated to viruses. Analysis of the contigs affiliated to viruses in SAG8 (Micromonas) and SAG12 (Ostreococcus) led us to assemble two novel circular viral genomes, which we named ‘Mamiellales SAG-associated circular virus’ (MACV). Genome sizes varies between 1586 bp (SAG12—Ostreococcus figure 5a) and 1788 bp (in SAG8—Micromonas), both coding for two ORFs, a replicase gene (ORF1) and putative capsid gene (ORF2). The two genomes are 100% identical, except for a 202 bp deletion in the shorter version, leading to smaller ORF1 sequence. The experimental analysis performed would not discriminate whether these sequences arose from single or double stranded DNA. The read coverage of the SAG8 and SAG12 viruses is 50× and 7893× respectively. Assuming MDA amplification did not induce coverage biases between the virus and nuclear genomes, the same average coverage would be expected for viral and nuclear nucleotides. Compared to the average coverage of the nuclear genome, the SAG8-associated virus had a 10-fold lower coverage, while the SAG12-associated virus had a fivefold higher coverage. This supports the notion that the virus was present in multiple copies (five copies) within the cytoplasm of SAG12, suggesting an infected state, while there were fewer copies, and may have been under-amplified, in SAG8. The majority of circular DNA viruses were ‘circular replication-associated protein encoding single-stranded DNA’ (CRESS) viruses [70]. They have been found within diverse hosts and marine metagenomes, even though there is not much information regarding their infection strategies. To explore the similarity of the Mamiellales-associated circular virus to other CRESS DNA viruses, we performed maximum-likelihood phylogenetic analyses of their predicted capsid (figure 5c) and predicted replicase (figure 5d) proteins, together with their retrieved blast hits (e-value cut off 1 × 10−5) both in GenBank and in the TARA-Oceans Gene Atlas. Capsid homologous sequences were found in seven TARA-Oceans metagenomes (figure 5b), including station 41 in the Indian Ocean, the same station where the Ostreococcus SAG12 was sampled. The phylogenies of both genes suggested that Mamiellales-associated circular viruses formed a well-supported monophyletic clade, which included circo-like viruses, metagenomic sequences and plant ssDNA viruses (Yerba mate virus and Banana bunchy top virus). Sequences from other ssDNA viruses clades, such as genes encoded by Bacilladnaviruses (from Diatoms) belonged to divergent branches, suggesting that Mamiellales-associated circular viruses might form a divergent group of a novel ssDNA virus family.
Figure 5. Genome structure, biogeography and phylogeny of Mamiellales-associated circular virus. (a) Genome assembly of 1586 bp virus encoding ORFs; ORF1 is a replicase and ORF2 a putative capsid protein. (b) Biogeographic distribution of putative homologous sequences encoding to ORF2 (putative capsid) in the Ocean Gene Atlas (OGA) database per TARA station (triangles) and number of hits per station (bubble size) (b). (c) Maximum-likelihood phylogeny of ORF2 (putative capsid—RTREV+I+G+F protein evolution model) and (d) ORF1 (replicase—LG+I+G protein evolution model) with homologous sequences retrieved from GenBank and OGA database.
4. Discussion
SAGs have so far been previously used to increase our knowledge of the uncultivable multitude of marine microorganisms [40,68]. The use of SAGs to discover novel species does not necessitate high genome coverage as only a handful of highly conserved genes is informative to delineate a species. However, partial genome coverage may constitute a greater challenge for the use of SAGs in evolutionary studies. To increase the genome coverage of a target species, data from genetically close SAGs may be merged together [33,42]. This approach has been successfully applied by Vannier et al. [33] to the four Bathycoccus SAGs and resulted in an assembly estimated to contain 64% of the complete genome. The amino acid identity between this novel assembly and the reference B. prasinos RCC1105 genome led the authors to conclude that these four SAGs belonged to a cryptic novel Bathycoccus species, TOSAG39, which is supported here by the phylogenetic divergence of their mitochondrial genome sequences. Although combining SAGs produces a more complete genome, individual information about the environment of the sorted single cells, which could indicate potential biological interactions, or even stages of infections in the case of parasites, may be lost.
Although the individual SAGs were estimated to represent a maximum of 27% of the complete genome sequence in Ostreococcus, a search for the presence of 23 gene families located in the mating-type region in Mamiellales enabled us to retrieve orthologous genes for three Ostreococcus SAGs. Amino acid identity between the SAG sequences and the sequences from available genomes indicated that the candidate mating type of the three Ostreococcus SAGs was MT+, like the candidate mating type of the closest reference genome from strain O. sp. RCC809 [28]. The relative prevalence of the MT+ and MT− in natural Ostreococcus blooms remains an open question. To the best of our knowledge, the co-occurrence of strains of the two mating types from the same sample has not yet been reported, but the maximum number of strains isolated from the same sampling point is less than 3 [71]. The minimal frequency of sexual reproduction has been estimated indirectly in O. tauri from the population polymorphism spectrum, 1 sexual meiosis for 100 000 clonal divisions [28]. This low rate may be explained either by a low encounter rate of MT+ and MT− strains in their natural environment, or by a low rate of outcrossing. The identification of informative sequences from the MT locus in three of the four SAGs demonstrates the utility of the single cell genomics approach for estimating the frequencies of the two mating types within a bloom, provided that a sufficient number of cells is analysed. While the presence of identical MTs in three SAGs is consistent with clonal reproduction within a bloom, the sample size was too small for a proper estimation of the level of clonality within a bloom.
The estimations of the percentages of reads recruited to the mitochondrion and chloroplast genomes in each SAG attest the ability of single cell sequencing to identify novel candidate associations in the smallest eukaryotic microalgae of the order Mamiellales. Indeed, while organellar copy numbers may reach several thousands of copies in humans [72] or Arabidopsis [73], bacterial-sized Ostreococcus have a reduced number of organellar genomes, with an average copy number per cell estimated at 4 and 5 for the mtDNA and cpDNA, respectively. As a consequence, the systematic identification of mtDNA and cpDNA sequences from the SAGs provides support to the ability of this strategy to detect genomic sequences of any intracellular symbionts, provided there is no DNA amplification bias as a consequence of differences in genome architecture or cell wall structure between the symbiont and the organellar genome. Alternatively, these sequences might arise from random capture of amplified dissolved DNA in the marine seawater around the isolated picoalgal cells. In a typical sorting run the droplet containing a single cell may contain about 2.8 nl [74], or 2.8 × 106 µm3. The sample volume within a drop is around 1/1000 of the total volume, or 2.8 × 103 µm3, whereas a single Ostreococcus cell contains about 0.5 µm3 of cytoplasm, giving about 6 × 103 fold more seawater than cell volume, ample space for the ‘phycosphere’ [75] or dissolved DNA [76] that may be sequenced with the targeted algal cell.
To distinguish between random capture of dissolved DNA and intracellular symbionts, relative read coverage of the non-target to the target genomes provides a rule of thumb about the likely origin of the non-target sequences. In the present data, we found that most of the non-target sequence data can be traced back to cross-contamination from another SAG, sequenced in the same Illumina run. Filtering out cross-contaminants, the amount of non-target DNA identified—except for the virus sequence—is not compatible with co-sampling of a Mamiellales and a bacteria or Oomycete cell, and rather suggests random capture of dissolved DNA. These associated non-target sequences may nevertheless provide information about the diversity of microorganisms present in the cell's environment. Indeed, Mamiellales are auxotrophic for B vitamins [77] and have been found to co-occur with diverse bacteria in laboratory cultures [29,78], though no obligate-specific association has been yet observed. The taxonomic diversity of the candidate-associated bacterial contigs retrieved from the SAG data is thus not surprising.
The presence of a novel CRESS-like DNA virus came as a surprise. Single-stranded DNA (ssDNA) viruses with circular genomes, with sizes from approximately 1700 bases, are the smallest viruses known to infect eukaryotes [79]. Several previous metagenomic studies have identified many novel genomes similar to ssDNA circular viruses through data-mining of public marine metagenomes [79,80], but with no hints about their putative hosts. Previously described CRESS viral hosts range from plants [81] (Geminiviridae, Nanoviridae), bleached kelp macroalgae [82], vertebrates [83] (Circoviridae), fungi, insects (Genomoviridae) [84] and crustaceans [85]. In protists, CRESS viruses were discovered in SAGs targeting Picobiliphytes [31] and in Bacillariophyceae diatoms, where they have been isolated and their lytic effect could be demonstrated [86]. The Mamiellales-associated circular viruses described here are among the smallest viral genomes described to date infecting the smallest photosynthetic eukaryotes.
In conclusion, sorting of these 12 Mamiellales SAGs by flow cytometry relied on the choice of appropriate volumes to favour isolation of one cell, and the ‘phycosphere’ around cell had been subjected to turbulent mixing with the cytometer sheath fluid, thus limiting the suitability of this technique for identification of cell-to-cell interactions. Independent ways of finding such associations are necessary, one of which may be to pick off single cells using micromanipulation (see [35] in this issue), but it remains very technically challenging to perform this for picoeukaryotes that are less than 2 µm in size, such as those in the Mamiellales. Conventional culture-based approaches of strains and viruses are required to demonstrate the outcome of interactions by co-culture [87] and remain an essential tool for deciphering both intercellular and virus–host interactions in aquatic environments and interpreting environmental sequencing data. Nevertheless, we showed here that—provided careful filtration of contaminations between datasets is performed—SAGs provide in situ ecological information that would be lost in conventional isolation and culture processes that select for the fastest growing cells in laboratory culture medium.
Data accessibility
SAGs assemblies analysed in this manuscript have been submitted to GenBank under BioProject PRJNA549236 with accession numbers VIAS00000000 (SAG1); VIAT00000000 (SAG2); VIAU00000000 (SAG3); VIAV00000000 (SAG4); VIAW00000000 (SAG5); VIAX00000000 (SAG6); VIAY00000000 (SAG7); VIAZ00000000 (SAG8); VIBA00000000 (SAG9); VIBB00000000 (SAG10); VIBC00000000 (SAG11); VIBD00000000 (SAG12). Mitochondrial assemblies have been submitted to GenBank under accession numbers MN243825 (Os_mt_TOSAG41), MN243826 (Mi_mt_TOSAG46), plastidial assemblies under accession numbers MN243827 (Os_cp_TOSAG41) and MN243828 (Mi_cpTOSAG46). Mamiellales associated circular virus have accession numbers MN365966 (MACV1_SAG12) and MN365967 (MACV2_SAG8).
Authors' contributions
L.F.B. and G.P. performed bioinformatics analyses. N.P. and M.E.S. performed single cell sorting and prior species' taxonomic affiliation. K.L. performed SAG sequencing. G.P., L.F.B. and N.G. designed the study. All authors contributed to manuscript writing and editing.
Competing interests
We declare we have no competing interests.
Funding
L.F.B. was funded by the EU Horizon 2020 research and innovation programme, under the Marie Skłodowska-Curie grant agreement no. H2020-MSCA-ITN-2015-675752.
Acknowledgements
We thank Tom Delmont and Olivier Jaillon from the Genoscope for information about the SAGs, Lilian Caesar for help with the figures, and the Genotoul platform (www.genotoul.fr) for cluster availability for bioinformatics analysis and Genoscope for sequencing. We are grateful to Fabien Burki for making a perl script for extracting taxonomic affiliation from DIAMOND outputs available. Tara Oceans (which includes both the Tara Oceans and Tara Oceans Polar Circle expeditions) would not exist without the leadership of the Tara Expeditions Foundation and the continuous support of 23 institutes (http://oceans.taraexpeditions.org). We further thank the commitment of the following sponsors: CNRS (in particular Groupement de Recherche GDR3280 and the Research Federation for the study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans-GOSEE), European Molecular Biology Laboratory (EMBL), Genoscope/CEA, The French Ministry of Research and the French Government ‘Investissements d'Avenir’ programmes OCEANOMICS (ANR-11-BTBR-0008), FRANCE GENOMIQUE (ANR-10-INBS-09-08), MEMO LIFE (ANR-10-LABX-54), and PSL* Research University (ANR-11-IDEX-0001-02), Gordon and Betty Moore Foundation (award #3790) and the US National Science Foundation (awards OCE#1536989 and OCE#1829831) to M.E.S., as well as the Ohio Supercomputer for computational support. We are also grateful for the support and commitment of agnès b. and Etienne Bourgois, the Prince Albert II de Monaco Foundation, the Veolia Foundation, Region Bretagne, Lorient Agglomeration, Serge Ferrari, Worldcourier, and KAUST. The global sampling effort was enabled by countless scientists and crew who sampled aboard the Tara from 2009–2013, and we thank MERCATOR-CORIOLIS and ACRI-ST for providing daily satellite data during the expeditions. We are also grateful to the countries that graciously granted sampling permissions. The authors declare that all data reported herein are fully and freely available from the date of publication, with no restrictions, and that all of the analyses, publications and ownership of data are free from legal entanglement or restriction by the various nations whose waters the Tara Oceans expeditions sampled in. The results and conclusions of this study do not represent the opinions or position of the U.S. National Science Foundation. This article is contribution number 91 of Tara Oceans. We also acknowledge three anonymous referees for constructive comments on a previous version of this manuscript.