Exploring microbial dark matter to resolve the deep archaeal ancestry of eukaryotes

The origin of eukaryotes represents an enigmatic puzzle, which is still lacking a number of essential pieces. Whereas it is currently accepted that the process of eukaryogenesis involved an interplay between a host cell and an alphaproteobacterial endosymbiont, we currently lack detailed information regarding the identity and nature of these players. A number of studies have provided increasing support for the emergence of the eukaryotic host cell from within the archaeal domain of life, displaying a specific affiliation with the archaeal TACK superphylum. Recent studies have shown that genomic exploration of yet-uncultivated archaea, the so-called archaeal ‘dark matter’, is able to provide unprecedented insights into the process of eukaryogenesis. Here, we provide an overview of state-of-the-art cultivation-independent approaches, and demonstrate how these methods were used to obtain draft genome sequences of several novel members of the TACK superphylum, including Lokiarchaeum, two representatives of the Miscellaneous Crenarchaeotal Group (Bathyarchaeota), and a Korarchaeum-related lineage. The maturation of cultivation-independent genomics approaches, as well as future developments in next-generation sequencing technologies, will revolutionize our current view of microbial evolution and diversity, and provide profound new insights into the early evolution of life, including the enigmatic origin of the eukaryotic cell.

A total of 118 Mbp, 506 Mbp, and 21 Mbp of raw Illumina MiSeq amplicon data were generated for Loki's Castle, Yellowstone, and LHC4 samples, respectively. The 5' and 3' regions of the amplicons were extracted by scanning for respective forward and reverse primers in read pairs using a custom Python script. Amplicon reads with average Phred quality below 30 were discarded from the analysis.
As the amplicon size was roughly 900 bp, the read pairs could not be merged and were analyzed separately. In this analysis, only the 5' end of the amplicon extracted from both read pairs was used for estimation of OTU and phylum-level diversity. UPARSE pipeline [4] was used for further quality filtering, clustering of OTUs and chimera filtering. Resulting OTUs were classified by searching against Silva 16S rRNA data (release 119) [5] using BLASTn [6] with maximum E-value cutoff of 1e -5 and minimum identity threshold of 85%.

Analysis of SAG data
SAG data from YNP was generated by first obtaining cell fractions from hot spring sediment samples using Nycodenz gradient centrifugation, FACS for sorting individual cells into 384-well microtiter plates, alkaline lysis and multiple displacement amplification (MDA) at the Bigelow Single Cell Genomic Center. Sequencing libraries and reads were generated by the SNP&SEQ sequencing facility at Uppsala University. Total raw Illumina HiSeq data (paired-end 2x100 bp) generated for the two SAGs ranged from 1.5 Gbp for MCG SAG (10Y13-A3) to 2.0 Gbp for MCG SAG (10Y13-F10).
Detailed methods for single-cell isolation and generation of SAG data for the Korarchaeon from LHC4 sample has previously been described in a study by Dodsworth et al [7]. One microliter of the original SAG DNA of the Korarchaeon was used to perform another MDA reaction to generate additional SAG DNA for library preparation. MDA reaction was performed using Qiagen REPLI-g Mini kit and purified with QIAamp DNA Mini kit (Qiagen, Hilden, Germany

Generation of Loki's Castle metagenome
Loki's Castle metagenomic reads were generated from sediment samples collected via gravity coring method [1]. Briefly, community DNA from the 7.5 g of raw sediment materials was extracted using FastDNA spin kit and quantified using fluorescent nanodrop ND-3300 instrument (Thermo Scientific) to measure dsDNA concentration. Sequencing library for Illumina HiSeq instrument was then generated using Nextera library preparation kit from Illumina. Amount of raw data generated by the HiSeq 2500 instrument was about 13.1 Gbp. Detailed assembly and further downstream analysis of the Loki's Castle metagenomic data was described in Spang et al [1].

Binning of metagenomic contigs using PhymmBL
Contigs belonging to target organisms were identified using supervised binning with PhymmBL [10], which was trained with available reference genomes (for instance clean contigs from Korarchaeon SAG was used to train PhymmBL). Additional manually added training sets included manually curated, contamination-free SAG assemblies. A training set comprising contigs of Lokiarchaeota present in Loki Castle sediment samples was generated using the following approach: Phylogenetic trees of individual taxonomic marker genes were constructed and manually inspected. Comparison with the archaeal species tree allowed identification of contigs belonging to Lokiarchaeota (e.g. placement of genes from metagenomic contigs as deep-branching members of TACK superphylum indicated that these originated from Lokiarchaeota). Following several rounds of verification of this training set (see Spang et al for details [1]) these contigs were then used for supervised binning using PhymmBL. Only contigs larger than 1 kbp were analysed.

Co-assembly of new Korarcheota
A total of 15.2 million quality-trimmed and adapter-removed Illumina MiSeq reads from the SAG data and 237974 Illumina HiSeq reads mapped against contigs classified as Korarcheota by PhymmBL were assembled with SPAdes [8] using a set of k-mer sizes (21,33,55,75,101,125) and the "--sc" and "--careful" flags to handle single-cell data.

Estimation of genome completeness
Using a set of 162 marker genes known to be present in single copies in most archaea [11], genome completeness and redundancy was estimated by counting these marker genes in the three SAGs and three metagenomic bins in this study. This was accomplished by running PSI-BLAST of alignment profiles of marker genes from a representative set of archaea including all major lineages against the proteomes of these SAGs or metagenomic bins.

Phylogenetic analyses
Phylogeny of 16S rRNA genes. A total of 338 16S rRNA gene sequences representing major taxonomic clades from the TACK superphylum and members of deeply branching archaea from Silva release 119 were selected. Five taxa from Euryarchaeota were chosen as outgroup clade. Sequences were aligned with MAFFT L-INS-i [12] and columns with gaps if present in >50% of the taxa were removed with trimAl tool [13]. Maximum likelihood phylogeny was constructed using RAxML version 8.0.22 [14], using GTRGAMMA model of nucleotide substitutions and 100 bootstrap replicates.

Phylogenetic analysis of concatenated marker proteins.
A set of 36 conserved single-copy marker genes were identified in the SAGs using PSI-BLAST and aligned with those from a select group of archaea, bacteria, and eukaryotes using the MAFFT L-INS-i tool. Alignments were trimmed using trimAl to remove column positions containing gaps in more than 50% of the taxa and subsequently concatenated. Maximum likelihood phylogeny was inferred using RAxML with 100 bootstrap replicates. Bayesian phylogenetic analyses were performed using Phylobayes [15], utilizing CAT+GTR model of amino acid substitution and 4 independent chains for ~10,000 generations.
Consensus tree was generated with bpcomp tool from Phylobayes package to discard the first 2000 generations from the chains and sampling trees once every 50 generations. Bootstrap values from RAxML maximum likelihood tree were mapped on to corresponding positions in the Phylobayes consensus tree using "sumtrees.py" tool from DendroPy package [16].