Using viromes to predict novel immune proteins in non-model organisms

Immunity is mostly studied in a few model organisms, leaving the majority of immune systems on the planet unexplored. To characterize the immune systems of non-model organisms alternative approaches are required. Viruses manipulate host cell biology through the expression of proteins that modulate the immune response. We hypothesized that metagenomic sequencing of viral communities would be useful to identify both known and unknown host immune proteins. To test this hypothesis, a mock human virome was generated and compared to the human proteome using tBLASTn, resulting in 36 proteins known to be involved in immunity. This same pipeline was then applied to reef-building coral, a non-model organism that currently lacks traditional molecular tools like transgenic animals, gene-editing capabilities, and in vitro cell cultures. Viromes isolated from corals and compared with the predicted coral proteome resulted in 2503 coral proteins, including many proteins involved with pathogen sensing and apoptosis. There were also 159 coral proteins predicted to be involved with coral immunity but currently lacking any functional annotation. The pipeline described here provides a novel method to rapidly predict host immune components that can be applied to virtually any system with the potential to discover novel immune proteins.


Introduction
Comparative immunology uses a variety of invertebrate and vertebrate species to better understand the evolution and origin of immunity [1]. These model organisms have provided fundamental insights into evolutionarily conserved immune mechanisms. For example, the initial concept of self versus non-self recognition was formed by experimentation with starfish, and studies in chickens led to the identification of separate B and T cell lineages [2,3]. While these discoveries have been invaluable in understanding immune mechanisms, the majority of comparative immunology is based on data from only 3 of the 30 extant animal phyla [2]. A broader representation of phyla will help to fully appreciate the complexity of immunity.
Currently, the establishment of molecular tools is infeasible in terms of time and money for most of the metazoan tree of life. For example, continuous cell lines have been a long-term goal of coral reef biologists for decades. While there have been reports of primary cell cultures, standardized cell lines have yet to be established and genetic manipulation of corals remains elusive [4][5][6][7]. To investigate the immune systems of corals, as well as the remaining twenty-seven other animal phyla, a broad and rapid approach is required [2]. Here, we propose a comparative genomic approach in which host viral communities are used to predict host immune proteins. Viruses are dependent on a host cell to complete their life cycle. Therefore, they must replicate and package their genomes intracellularly while avoiding the host immune response. To establish and maintain control of host cells, viruses mimic host proteins [8][9][10][11]. For example, herpes viruses encode proteins that possess sequence similarity to human cytokines and can modulate cytokine signalling in host cells during infection [12]. If a herpesvirus-encoded cytokine was compared with the human proteome using BLAST, then the human-encoded cytokine would be identified without any a priori knowledge of the human immune system. Expanding this observation, we hypothesized that an in silico comparison of all viral gene products to the predicted host proteome would identify both known and unknown immune-associated proteins.
As a proof of principle, in silico constructed viromes were used to successfully identify known human immune proteins. This method was then applied to reef-building corals to identify proteins potentially involved in coral immunity. Most of these proteins were homologues to human proteins, but a subset of 159 coral proteins were predicted to be involved in novel coral immune functions. This method generates putative host immune proteins to be further tested with the goal of discovering new types of immunology.

Results
(a) Known viruses versus the well-characterized human immune system As a test, the proposed method was applied to viruses known to manipulate the human immune system (figure 1). A mock virome was created by fragmenting 16 fully sequenced viral genomes into 200 base pair segments. The resulting in silico virome consisted of 8542 DNA segments (figure 1; electronic  supplementary material, table S1a). This mock human virome was then compared with the human proteome using tBLASTn, with an e-value of 1 Â 10 24 as a cutoff for significance yielding 36 human proteins matching viral DNA segments (electronic supplementary material, file S1). The function of these proteins includes cell cycle control, cytokines, cytokine receptors, chemokines, chemokine receptors, apoptosis and complement activation (figure 2; electronic supplementary material, table S1b).

(b) Coral derived viromes versus predicted coral proteomes
Fourteen viromes were generated from four species of coral; Porites rus (four colonies), Acropora sp. (three colonies), A. yongeii (three colonies) and Pocillopora verrucosa (four colonies) [13]. Sequence data quality control was performed using PRIN-SEQ and DeconSeq, resulting in 1 048 627 good-quality sequences with an average size of 315 base pairs (electronic supplementary material, table S2) [14,15]. For this analysis, the viromes were combined in silico into one coral virome and compared with the predicted proteome of A. digitifera using tBLASTn. An e-value of less than 1 Â 10 24 was used as the cutoff for significance [16] (see electronic supplementary material, file S2, for full tBLASTn results). This resulted in a total of 60 191 virome-derived DNA sequences that significantly matching 5 863 predicted coral protein sequences. To reduce the complexity of this dataset, only coral proteins that matched to at least five of the virome fragments were considered for further analyses (2 503 proteins total; figure 3; electronic supplementary material, table S3) [17]. The potential functions of this group of proteins was predicted by comparing to the human proteome using BLASTp (electronic supplementary material, file S3), as well as a functional domain analysis using the conserved domain database (CDD) (electronic supplementary material, file S4) [18]. This showed that 83 proteins related to PAMP sensing and apoptosis (figure 4; electronic supplementary material, figure S1 and table S4a) [18]. Electronic supplementary material, table S5, provides descriptions of protein domains classified as PAMP sensing [19][20][21][22][23] and apoptosis [24][25][26][27][28][29][30]. Compared with the remainder of the coral proteome, this group of host proteins with homology to viral proteins was significantly enriched for PAMP sensing and apoptosis-associated domains (figure 4b; electronic supplementary material, (c) Bioinformatic prediction of novel immune proteins The pipeline described in figure 3 resulted in 159 coral proteins that could not be annotated using either the human proteome or the CDD. To determine the predicted cellular localization (i.e. membrane bound or cytoplasmic) transmembrane regions were predicted using the TMHMM server resulting in 25 proteins possessing 1-14 transmembrane regions (electronic supplementary material, figure S2 and file S5) [31]. Based on the pipeline used in this study a general experimental approach is proposed to identify novel immune components of non-model organisms ( figure 5). Viral nucleic acid is isolated from host tissue and compared with the host transcriptome to identify protein candidates that are predicted targets of viral proteins and therefore predicted to be involved with host immunity. This candidate group of immune-associated proteins is then compared with a protein domain database and the proteome of an established model organism to remove proteins whose function can already be predicted. Proteins that lack similarity to either database represent the final group of candidate immune genes with completely unknown function that can be further explored using in vitro and in vivo experimentation (figure 5).

Discussion (a) Summary
The pipeline described here combines domain-based protein annotations with novel annotations generated by viral communities to predict the host immune repertoire. In animals domain-based methods are biased towards the three major phyla investigated thus far (i.e. Chordata, Nematoda and Arthropoda); therefore, we cannot exclusively rely on those strategies to elucidate the immune systems of uncharacterized phyla [2]. Viral communities provide a domain-independent approach to predict host immune proteins that supplements existing domain-based protein annotation methods.

(b) Viral manipulation of apoptosis and intracellular pathogen sensing
The domain-based annotation of the coral proteome indicates the coral immune system is highly complex and in some instances more complex than humans [16,32,33]. As expected, many of the components predicted to be involved with coral immunity based on the presence of conserved domains are also predicted targets of the viral community. With the rise of metagenomics, it has become clear that many viruses are persistent and do not cause any known pathologies. For example, the virome of apparently healthy humans includes members of Herpesviridae, Polymovaviridae, Papillomavirade, Adenoviridae, Anelloviridae and Parvoviridae [34][35][36][37]. To maintain viral infection over time, persistent viruses will often express multiple proteins that possess sequence similarity to host proteins involved with programmed cell death (apoptosis) and PAMP sensing [9,38]. Figure 4 and electronic supplementary material, figure S1 suggest that, similar to humans, the viral communities associated with apparently healthy coral interact with apoptosis and PAMP sensing. Specifically, the tumour necrosis factor (TNF) signalling pathway and nod-like receptors (NLRs) are predicted targets of coral-associated viruses.
The TNF signalling pathway acts as a central mediator of apoptosis and appears to be functionally conserved from corals to humans [33,39]. TNF-receptor-associated factors (TRAFs) are critical adaptor proteins that bind to the intracellular portion of TNF receptors regulating cell survival and cytokine production [40]. Based on the total number of viral sequences matching to coral proteins homologues of TRAF6, TRAF4 and multiple TNFs were in the top 10% of all host proteins, suggesting that the TNF-signalling pathway is a major target of the coral virome (figure 4; electronic supplementary material, figure S1). In humans, viruses belonging to Herpesviridae, Adenoviridae, Reoviridae and Retroviridae have been shown to interact with TNF-mediated apoptosis [8]. Metagenomic evidence in corals indicates that similar to humans, the abundance of herpes viruses changes in response to physiological stress and is associated with bleaching events [41][42][43][44][45]. Future work should focus on which families of viruses interact with the coral TNF-signalling pathway and whether related viral manipulation strategies exist in corals and humans.
NLRs are intracellular signalling molecules that play a central role in the detection of PAMPs [46]. The genome of the reef-building coral A. digitifera encodes 500 predicted NLRs compared with only 22 found within the human repertoire [32]. In total, 398 viral hits were distributed across 25 NLRs including many containing a glycosyl transferase domains (electronic supplementary material, figure S1). Corals are the first metazoans described thus far to contain   the NLR-glycosyl transferase domain combination; however, the function any coral NLR has yet to be determined. The pipeline described here provides a starting point in the selection of specific NLRs that warrant additional investigation. Taken together, domain-based protein annotation provides a predicted framework of the coral immune system and, based on those annotations, coral-associated viruses are modulating pathogen sensing and the host apoptotic response.

(c) Beyond conserved domains-novel predictions from viral communities
Domain-based annotation allows for the rapid prediction of host immune proteins. However, the databases used to generate those annotations are largely based on experimental evidence from only three animal phyla (i.e. Chordata, Nematoda and Arthropoda) [2]. While many protein domains are expected to be conserved across phyla, relying exclusively on domainbased annotations will fail to identify immune proteins that lack previously characterized domains. The pipeline described here identified 159 predicted coral immune proteins that could not be annotated using the CDD and lack homology to any human protein. In addition to missing completely novel immune domains, comparing the host proteome to the CDD will also fail to identify proteins involved with immunity that lack canonical immune domains (electronic supplementary material, table S5). In total, 275 coral proteins could be annotated with the CDD but did not contain any PAMP sensing or apoptotic domains or possess any similarity to human proteins (electronic supplementary material, file S5). This group of proteins, as well as the 159 proteins that do not match any domain in the CDD, may be involved in coral-specific immunological processes. The pipeline described here provides a more comprehensive prediction of the host immune repertoire by combining the power of existing databases with new predictions generated from resident viral communities.

(d) Caveats and future work
One limitation of the proposed bioinformatic pipeline is its reliance upon amino acid alignments long enough to produce a significant e-value using tBLASTn. For example, some viruses hijack cell regulation using short elements termed eukaryotic linear motifs (ELMs) that are only two to eight residues in length [47,48]. Therefore, the short alignments between viral ELMs and their target host proteins would probably not produce significant e-values and fail to be identified. It is also important to remember that this method generates hypotheses that require direct biological and biochemical validation. This validation would ideally be performed within the organism of interest; however, if molecular techniques remain unavailable then a hybrid approach may be taken to provide a more rapid turnaround of hypothesis testing. For example, the established model organism Nematostella vectensis (sea anemone), a fellow anthozoan related to corals, has well-developed molecular methods that could be used to test the function of predicted coral immune proteins [16,49]. While this study has focused on characterizing metazoan immunity, the pipeline proposed here could also elucidate interactions between bacteria and their resident bacteriophage communities to identify novel bacterial immune proteins. Combining these data with metazoanvirus predictions would provide a broad understanding of immune processes occurring within the entire holobiont. The pipeline presented here predicts immune system structure by combining model organism-based databases with a domainindependent approach based on resident viral communities. This method can be applied to any holobiont with the potential to discover novel immunological processes.  Figure 3. Bioinformatic pipeline used to identify viral gene segments with sequence similarity to host proteins: uncharacterized viruses versus uncharacterized immune system (e.g. corals). Coral viromes extracted from four species of coral were compared with the predicted Acropora digitifera proteome through tBLASTn analysis resulting in 2503 protein hits. Coral proteins were classified as 'hits' if five or more unique viral sequences matched with e-value less than 1 Â 10 24 (see electronic supplementary material, table S3 for a summary of viral hits to coral proteins and electronic supplementary material, file S2 for full tBLASTn results). To determine whether the function of the 2503 coral proteins could be predicted based on sequence similarity to human proteins they were compared with the human proteome using BLASTp with an e-value cutoff of 1 Â 10 25 resulting in 2018 coral proteins with positive matches to human proteins (see electronic supplementary material, file S3 for full BLASTp results). To determine whether any of the remaining 585 proteins function could be predicted based on the presence of conserved domains they were analysed using the conserved domain database (CDD) resulting in 426 coral proteins with positive matches to a known domain (see electronic supplementary material, file S4 for full results of CDD analysis). Within this group of proteins 138 were considered 'viral-like' based on the presence of domains associated with viral proteins. In total 159 coral proteins possessed no similarity to either database and therefore represent predicted immune proteins with unknown function. rspb.royalsocietypublishing.org Proc. R. Soc. B 283: 20161200 samples of A. yongeii were generously donated by the Birch Aquarium, San Diego, CA. Sampling locations included the back reef of the Richard Gump Research Station (LT) and Tema'e beach (TA). Briefly, coral tissue was homogenized in the field with a mortar and pestle and 12 ml of 0.02 mm filtered seawater was added until all coral tissue was removed from the skeleton. Next, the coral-tissue slurry was placed in a 15 ml falcon tube followed by the addition of 1 ml of chloroform and stored at 48C for three weeks. To isolate VLPs from coral tissue, established ultracentrifugation protocols were used [13,50]. Briefly, 600 ml of 50 nmol l 21 dithiothreitol was mixed with 8 ml of coral slurry by vortexing, incubated at 378C for 1 h and centrifuged for 15 min at 3000 r.p.m. In total, 7 ml of the supernatant was then transferred to the top of a CsCl step gradient containing 1.0 ml of each CsCl solution at densities of 1.7 g ml 21 , 1.5 g ml 21 and 1.35 g ml 21 .

Material and methods
The step gradient containing the coral slurry was then spun for 2 h at 22 000 r.p.m. using an SW-41 Ti rotor (Beckman Coulter) and 1.5 ml was extracted from the 1.35-1.5 interface using a 24gauge needle with an upward-facing tip. Next, the sample was treated with DNase at a final concentration of 100 units ml 21 and incubated at 378C for 1 h. Finally, DNA was extracted from the VLPs using standard CTAB/formamide methods [13]. Two quality control approaches were taken to ensure viral purity of the final DNA product. First, VLPs were visualized before and after ultracentrifugation using standard SYBR-gold staining techniques, followed by a 16S and 18S PCR to ensure the isolated DNA was not contaminated with cellular genetic material [13].
To obtain sufficient nucleic acid for library preparation viral DNA was amplified using a modified linker amplified shotgun library (LASL) method [51]. Briefly, initial DNA concentrations were determined using the Qubit Fluorometric Assay (Life Technologies) and 5 ng of total DNA was combined with PCR-grade water to a final volume of 50 ml. Diluted DNA samples were briefly vortexed and transferred into 50ml Covaris tubes for shearing using   , table S4a for summary table, table S4b for statistical  analysis and table S5 for domain descriptions).  Figure 5. Proposed pipeline to predict immune proteins in uncharacterized systems. Viruses are extracted from the non-model organism of interest using a virus-like particle extraction protocol of choice and nucleic acid is sequenced. If a gene model is not available total RNA is also extracted from host tissue and an assembled transcriptome is prepared. Viral gene segments are compared with the translated transcriptome through tBLASTn to identify matches to host proteins. These proteins are further analysed through comparison with a well-characterized immune system using BLASTp (e.g. human or mouse) and domain prediction database (e.g. conserved domain CDD, Pfam). Proteins that lack hits to either database represent predicted immune genes with unknown function and are selected for further biochemical investigations using in vitro and in vivo experimentation.
rspb.royalsocietypublishing.org Proc. R. Soc. B 283: 20161200 the Covaris M220 Focused Ultrasonicator (40 s at 9 W). Sheared DNA was then end-repaired and ligated to LASL Linker A (Linker A FWD: 5 0 -P-GTATGCTTCGTGATCTGTGTGGGTGT-3 0 , Linker A REV: 5 0 -CCACACAGATCACGAAGCATAC-3 0 ) followed by size selection with a target size of 500 base pairs using Pippin Prep (Sage Science) [51]. For each biological sample, four PCR reactions were prepared using a barcoded primer and 17 cycles of PCR were performed (PfuTurbo Cx Hotstart DNA Polymerase, Agilent). Replicates of the same samples were combined, purified and separated into four aliquots for further amplification with three cycles of reconditioning PCR. Finally, all barcoded samples were quantified and pooled for library preparation and sequencing on MisSeq platform using the MiSeq Reagent Kit v3 600 cycles chemistry (Illumina).

(b) Quality control of sequence data
Preceding downstream analysis quality control of sequence data was performed using a variety of bioinformatic tools. Paired end reads were first joined using COPE [52], low-quality sequences (less than 100 base pairs in length, mean quality score less than 25, or containing more than 10% N's) were removed using PRIN-SEQ and any remaining bacterial or human sequences were removed with DeconSeq [14,15] (see electronic supplementary material, table S1 for the number of sequences removed at each pre-processing step).

(c) Bioinformatic pipeline used to analyse viromes
Viromes were compared with the predicted coral proteome [53] using standalone tBLASTn analysis with computational-based statistics turned on (D) and low complexity regions of proteins masked (seq) [17]. Predicted coral proteins were classified as 'hits' if five or more viral sequence fragments matched a coral protein with an e-value less than 1 Â 10 24 . Predicted coral protein hits were then compared with the human proteome using BLASTp analysis with an e-value cutoff of less than 1 Â 10 25 and the CDD with an e-value cutoff of 0.01 [18].
Ethics. Permission to conduct this research was granted by the