Genome-wide analyses of the bHLH superfamily in crustaceans: reappraisal of higher-order groupings and evidence for lineage-specific duplications

The basic helix-loop-helix (bHLH) proteins represent a key group of transcription factors implicated in numerous eukaryotic developmental and signal transduction processes. Characterization of bHLHs from model species such as humans, fruit flies, nematodes and plants have yielded important information on their functions and evolutionary origin. However, relatively little is known about bHLHs in non-model organisms despite the availability of a vast number of high-throughput sequencing datasets, enabling previously intractable genome-wide and cross-species analyses to be now performed. We extensively searched for bHLHs in 126 crustacean species represented across major Crustacea taxa and identified 3777 putative bHLH orthologues. We have also included seven whole-genome datasets representative of major arthropod lineages to obtain a more accurate prediction of the full bHLH gene complement. With focus on important food crop species from Decapoda, we further defined higher-order groupings and have successfully recapitulated previous observations in other animals. Importantly, we also observed evidence for lineage-specific bHLH expansions in two basal crustaceans (branchiopod and copepod), suggesting a mode of evolution through gene duplication as an adaptation to changing environments. In-depth analysis on bHLH-PAS members confirms the phenomenon coined as ‘modular evolution’ (independently evolved domains) typically seen in multidomain proteins. With the amphipod Parhyale hawaiensis as the exception, our analyses have focused on crustacean transcriptome datasets. Hence, there is a clear requirement for future analyses on whole-genome sequences to overcome potential limitations associated with transcriptome mining. Nonetheless, the present work will serve as a key resource for future mechanistic and biochemical studies on bHLHs in economically important crustacean food crop species.

WHC, 0000-0001-5832-2000; AGL, 0000-0001-8960-8095 The basic helix-loop-helix (bHLH) proteins represent a key group of transcription factors implicated in numerous eukaryotic developmental and signal transduction processes. Characterization of bHLHs from model species such as humans, fruit flies, nematodes and plants have yielded important information on their functions and evolutionary origin. However, relatively little is known about bHLHs in non-model organisms despite the availability of a vast number of high-throughput sequencing datasets, enabling previously intractable genome-wide and cross-species analyses to be now performed. We extensively searched for bHLHs in 126 crustacean species represented across major Crustacea taxa and identified 3777 putative bHLH orthologues. We have also included seven whole-genome datasets representative of major arthropod lineages to obtain a more accurate prediction of the full bHLH gene complement. With focus on important food crop species from Decapoda, we further defined higherorder groupings and have successfully recapitulated previous observations in other animals. Importantly, we also observed evidence for lineage-specific bHLH expansions in two basal crustaceans (branchiopod and copepod), suggesting a mode of evolution through gene duplication as an adaptation to changing environments. In-depth analysis on bHLH-PAS members confirms the phenomenon coined as 'modular evolution' (independently evolved domains) typically seen in multidomain proteins. With the amphipod Parhyale hawaiensis as the exception, our analyses have focused on crustacean 2018 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

Background
The basic helix-loop-helix (bHLH) transcription factor superfamily is among one of the most ancient gene families shared by eukaryotic organisms [1][2][3]. Members of this superfamily possess two conserved yet functionally distinct domains made up of a basic DNA binding domain (E box) at the aminoterminal followed by an HLH domain at the carboxy-terminal where the latter confers the possibility for hetero-or homodimerization with other proteins [4,5]. Both domains are required to form active DNA binding complexes whereby the functionality and precision in regulating gene expression networks are determined by the formation of multiple dimer combinations through the HLH domains [6]. This dimerization potential is, arguably, an effective mechanism for gene regulation considering that each dimer pair probably has specific genetic targets. For instance, bHLHs in animals are intimately linked to the regulation of a multitude of physiological and developmental processes. Some of these include cell cycle regulation, neurogenesis, haematopoiesis, myogenesis, differentiation, apoptosis, juvenile hormone signalling in arthropods and sensing of extracellular stimuli [1,2,[7][8][9][10][11][12][13][14][15].
Given their importance as critical transcriptional regulators in major biological processes, it is perhaps not surprising that many studies have focused on the identification and characterization of bHLH orthologues across various plant and animal species. From the first bHLH motif identified, the murine E12 and E47 transcription factors [16], full genetic complements of bHLH proteins have since been reported in various model organisms owing to the availability of complete genome sequences [3,[17][18][19][20][21][22][23][24][25]. Phylogenetic analyses on bHLH orthologues from model bilaterian species (Homo sapiens, Drosophila melanogaster and Caenorhabditis elegans) and early branching metazoans (the demosponge Amphimedon queenslandica and cnidarians Nematostella vectensis and Hydra magnipapillata) have convincingly demonstrated that diversification of metazoan bHLHs could have arisen in two steps: the first occurring before the divergence of demosponges from other animals and the second before the divergence of cnidarians and bilaterians [3,21]. Moreover, the increasing repertoire of bHLH proteins have enabled the classification of bHLH orthologues into six higher-order groups (A, B, C, D, E and F) based on distinct structural features [1,2,19,26,27].
Major efforts have thus far focused on the classification of bHLH proteins in model organisms. Beyond canonical model species, some headway has been made in elucidating bHLH genes in major metazoan lineages, which include bilaterians (deuterostomes and protostomes) and basal metazoans (hydra and sponge) [3]. Yet, relatively little is known about bHLHs in one of the most important groups of animals that represent a significant portion of aquatic sources of proteins, i.e. crustaceans. To date, studies in crustaceans are not only limited to just a few species but also to specific bHLH groups [13,14]. Systematic and cross-species characterization of crustacean bHLH proteins focusing on major food crop species from the order Decapoda is therefore necessary. In fact, numerous studies have begun to shed light on the importance of group C members, collectively known as bHLH-PAS, as environmental sensors [28]. For example, members of the crustacean bHLH-PAS family have been shown to regulate growth and reproduction through juvenile hormone signalling (Methoprene-tolerant, MET) [15,29], locomotion through circadian timekeeping (aryl hydrocarbon receptor nuclear translocator-like protein 1, BMAL1) [30,31] and response to changes in oxygen tension (hypoxia inducible factor 1, HIF-1) [32,33]. This holds much relevance for shrimp farming industries that are constantly facing the pressures of fluctuating environmental conditions that could potentially compromise aquaculture yield [32,34,35].
Here, we perform a cross-species characterization of the crustacean bHLH superfamily to address this major deficit in the field. The large number of recent crustacean transcriptomic datasets in public repositories has now permitted in-depth studies on key food crop species from the order Decapoda and other species across the broader Crustacea, hence offering important insights into the diversity of crustacean bHLHs (electronic supplementary material, table S1). Using sequence, motif and domain similarity-based approaches, we have conservatively identified 4113 bHLH orthologues, which include 3777 orthologues from 126 crustacean species and 336 orthologues from six other non-crustacean arthropod species (electronic supplementary material, tables S1 and S2 and file S1). Within this key dataset, we define higher-order bHLH groups in three major decapod species using phylogenetic-based approaches and annotate bHLH-PAS proteins in decapods and basal crustaceans (branchiopods and copepods).

Transcriptome datasets and query sets
We retrieved complete transcriptome datasets for 126 crustacean species available at the time of manuscript preparation from the European Nucleotide Archive (https://www.ebi.ac.uk/ena). Six noncrustacean arthropod proteomes were retrieved from Uniprot (http://www.uniprot.org/). A complete list of accessions used in this study is provided in the electronic supplementary material, table S1. We retrieved a list of query sequences used in subsequent homology searches from Uniprot and GenBank.

Identification of bHLH orthologues
Based on a previously published workflow [36], we used multiple Basic Local Alignment Search Tool (BLAST)-based approaches such as BLASTp and tBLASTn with varying Blocks Substitution matrices to identify bHLH orthologues. The BLAST results were filtered by e-value of <10 −6 , best reciprocal BLAST hits against the GenBank non-redundant (nr) database and redundant contigs having at least 95% identity were collapsed using CD-HIT (https://github.com/weizhongli/cdhit). We then used HMMER employing hidden Markov models (HMM) profiles [37] to scan for the presence of bHLH Pfam domains [38] on the best reciprocal nr BLAST hits to compile a final non-redundant set of crustacean and arthropod bHLH orthologues. Pfam annotations and associated e-values are provided in electronic supplementary material, table S2.

Multiple sequence alignment and phylogenetic tree construction
Multiple sequence alignments of bHLH protein sequences were performed using MAFFT [39]. Phylogenetic trees were built from the MAFFT alignment using RAxML WAG + G model to generate best-scoring maximum-likelihood trees [40]. Bayesian inference trees were constructed using MRBAYES [41]. GENEIOUS was used to generate graphical representations of Newick trees [42].

Assignment of bHLHs into higher-order groups/families
Considering the large number of putative bHLHs identified, independent analyses on genes from individual species are required. We therefore selected three decapod species for further analyses and assignment of bHLHs into higher-order groups. These species represent three dominant families of economically important food crops: Pacific whiteleg shrimp Litopenaeus vannamei (Penaeidae, 87 genes), freshwater crayfish Cherax quadricarinatus (Parastacidae, 65 genes) and mud crab Scylla olivacea (Portunidae, 66 genes). Maximum-likelihood trees were generated from multiple sequence alignments with bHLHs from Homo sapiens to define orthologous groups [21]. Overall Amphipoda (64) Isopoda (27) Mysida (1) Euphausiacea (2) Decapoda (19) Notostraca (1) Cladocera (2) Diptera (3) Geophilomorpha (1) Ixodida ( Regier et al. [44] Scorpiones (1) Siphonostomatoida (2) Cyclopoida (2) Harpacticoida (3) Calanoida ( bHLHs could be confidently assigned to previously described higher-order groups (A, B, C and E) (figure 2; electronic supplementary material, figure S1). Group A (proteins that bind CACCTG or CAGCTG E boxes) and group E (proteins that bind CACGCG or CACGAG N boxes) bHLHs are monophyletic in all three decapod species (figure 2a-d; electronic supplementary material, figure S1); a reappraisal of previous observations on bHLHs from human [21] and sponge [3]. Group B (proteins that bind CACGTG or CATGTTG E boxes) and group C (proteins that contain PAS domains) bHLHs are probably paraphyletic as they constitute members from other groups (figure 2b-d) [21]. We observed no evidence of group F bHLHs (figure 2a) and only one instance of a group D member from C. quadricarinatus (figure 2d,e). This could suggest that group D and F bHLH members are present but not represented in the transcriptome datasets, that members we have found have evolved convergently and/or a complex phenomenon of loss among certain crustacean taxa.

Evidence for lineage-specific duplications of bHLHs in Branchiopoda and Copepoda
Our analyses of the full bHLH gene complement from the genomes of seven species have revealed that arthropods/crustacean, on average, possess 56 bHLH genes: Parhyale hawaiensis (57 genes), Ixodes scapularis (  underpinning its intriguing phenotypic plasticity and remarkable ability to adapt to major ecological challenges [45]. The expansion of bHLHs as an adaptation mechanism to adverse environmental conditions can similarly be explained in copepods. These marine planktonic species serve as models for ecotoxicology and environmental genomics due to their high tolerance to a wide range of salinity and temperature [46][47][48][49]. As many bHLH members function as sensors of the environment, expansion through gene duplication would serve as an important evolutionary force for functional divergence to genetically adapt and cope with environmental stressors [50].

Monophyly of β-class bHLH-PAS proteins in decapods and basal crustaceans
The and circadian regulation. Many well-known PAS proteins have been shown to also possess a bHLH domain where they are collectively termed as bHLH-PAS [11,28]. The bHLH-PAS family is further classified into two subgroups: α-class and β-class [11,51]. The α-class proteins act as archetypal sensors of tissue-specific or environmental signals, e.g. single-minded (SIM-development), HIF (oxygen sensing), neuronal PAS domain proteins (NPAS-development) and circadian locomotor output cycles protein kaput (CLOCK-circadian timekeeping) and MET (juvenile hormone signalling) [11,[51][52][53]. Unable to dimerise among themselves, α-class members require β-class proteins, aryl hydrocarbon receptor nuclear translocators (ARNT), as their broad-spectrum binding partners [11,51]. We perform phylogenetic analysis on bHLH-PAS members identified from decapods and basal crustaceans (branchiopods and copepods; figure 3; electronic supplementary material, figures S4 and S5). We have also included bHLH-PAS genes identified from other arthropod species (insects, tick, scorpion and centipede) and previously annotated proteins from Uniprot (figure 3; electronic supplementary material, figures S4 and S5) [21]. Consistent with other reports, we demonstrate that crustacean bHLH-PAS orthologues form two distinct phylogenetic groups, α and β (figure 3; electronic supplementary material, figures S4 and S5). The β-class ARNT proteins (including the BMAL family) form a well-supported monophyletic group, whereas α-class proteins are polyphyletic: HIF and SIM members form a single cluster while MET and CLOCK form separate monophyletic groups (figure 3; electronic supplementary material, figures S4 and S5). Our observation is consistent with previous reports suggesting that bHLH-PAS paraphyly could be explained through modular evolution [54]. Modular evolution by domain shuffling is commonly seen in bHLH proteins that also possess other highly conserved domains such as PAS. As evidenced by the lack of sequence similarity in flanking regions, it is likely that the association between PAS and bHLH domains occurred multiple times independently through domain duplication or insertion [19,54].

Conclusion
Although a majority of our analyses have focused on transcriptome datasets that are dependent on tissue-type and the differential expression of transcripts at the point of sample collection, the power of detection is supported by the fact that (i) the transcriptomes were sequenced to considerable depth (electronic supplementary material, table S1) and (ii) we do find consistent detection of bHLH gene families in whole-genome sequences (electronic supplementary material, table S2). We are unable to entirely rule out the possibility that some genes would remain undetected, hence potential caveats associated with transcriptome mining must be taken into consideration. In summary, we identified 3777 putative bHLH orthologues from 126 crustacean species representing major Crustacea taxa. We also annotated 336 bHLHs from six additional non-crustacean arthropods sampling across broad taxonomic range to include genomes of emerging model species (centipede and scorpion). We observed evidence for lineage-specific gene expansions in branchiopod and copepod suggesting a mechanism for genetic adaptation to adverse environmental factors commonly encountered by these species. Phylogenetic analyses on decapod bHLHs recapitulated the evolutionarily conserved higher-order orthologous groupings seen in other metazoans. Further analysis on group C bHLH-PAS members revealed that although β-class members are monophyletic, this is not true for the remaining α-class members.