Exosomes in mammals with greater habitat variability contain more proteins and RNAs

Factors determining habitat variability are poorly understood despite possible explanations based on genome and physiology. This is because previous studies only focused on primary measures such as genome size and body size. In this study, we hypothesize that specific gene functions determine habitat variability in order to explore new factors beyond primary measures. We comprehensively evaluate the relationship between gene functions and the climate envelope while statistically controlling for potentially confounding effects by using data on the habitat range, genome, body size and metabolism of various mammals. Our analyses show that the number of proteins and RNAs contained in exosomes is predominantly associated with the climate envelope. This finding indicates the importance of exosomes to habitat range expansion of mammals and provides a new hypothesis for the relationship between the genome and habitat variability.


Introduction
The evolution of species habitat variability is an important topic for a wide range of research fields, particularly in the context of predictions related to biodiversity and climate change [1][2][3]. Thus, understanding factors that determine habitat use is relevant for advancing these fields. In particular, it is important to identify the molecular mechanisms that contribute to determining a species habitat range, because the behaviour of a species may result from complex biological systems.
Genetic studies are useful in this context. Specifically, previous studies have shown the importance of genomic properties. For example, genome size and the number of total genes increase with environmental variability because organisms need more functional (e.g. metabolic) genes in order to adapt to changing environments (e.g. nutrient variability) [4,5].
2017 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

Habitat variability
Our method was similar to that reported in a previous study [8] which considered the climate envelope as a score of habitat variability. The climate envelope indicates the temperature range, precipitation and other climatic parameters in the habitat area of mammals and is calculated based on 19 bioclimatic variables (www.worldclim.org/bioclim) using principal component analysis (PCA). The spatial data on the habitat areas of terrestrial mammals were downloaded from the International Union for Conservation of Natural Resources Red List website (www.iucnredlist.org/technical-documents/spatial-data) on 26 August 2016. Climate data within habitat areas at a spatial resolution of 10 min of a degree were downloaded from the WorldClim database (v. 1.4, release 3) [26] (www.worldclim.org) using R software (v. 3.3.1) (www.R-project.org) and R-package raster (v. 2. [5][6][7][8]. The PCA showed that the first two principal components (PCs) explained 99.6% of the total variance; the contributions of PC1 and PC2 were 98.1% and 1.5%, respectively. We considered 14 715 (=981 × 15) cells to weigh the relative contributions of PC1 and PC2 when estimating the climate envelope. The climate envelope was defined as the number of overlapping points in the 14 715 cell grids.
We also considered a different definition of habitat variability: habitat diversity. According to a previous study [8] habitat areas using the Brillouin Index, a diversity measure which is robust to sample size. The Köppen-Geiger climatic classification map (data observed between 1976 and 2000) required to calculate the habitat diversity was downloaded online (koeppen-geiger.vu-wien.ac.at/shifts.htm) on 3 January 2016.

Genome, metabolic rate and body mass
We selected mammalian species whose genomes were available in the Kyoto Encyclopaedia of Genes and Genomes (KEGG) database [27] for use in the KEGG BRITE Functional Hierarchy (see §2.3). The genome statistics (i.e. genome size, G (bp) and the number of total genes, N g ) and protein sequence data of the species were obtained from the KEGG database on 25 August 2016. The number of duplicated genes, N d , was estimated according to a previous study [8]. We performed an all-against-all Basic Local Alignment Search Tool search for all protein sequences. Homologous genes in the same species whose E < 10 -5 and query coverage > 30% were defined as duplicated genes. We collected data on mass-specific metabolic rates B c (W g −1 ) and body mass M (g) from our previous studies [12,16]. The data for 32 mammalian species are available in the electronic supplementary material, table S1.

Functional categories of genes
In our previous studies [12,16], we used the KEGG BRITE Functional Hierarchy [27] (www.kegg.jp/ kegg/brite.html) for calculating the number of genes in functional categories (NOGFs). In this study, we did not consider Gene Ontology (GO) [15] as a definition of functional category because there were fewer organisms whose GO annotations were completed, compared to the KEGG BRITE Functional Hierarchy. We obtained data on the relationships between functional category and gene identifiers of species, S, from the KEGG FTP website (ftp.bioinformatics.jp/kegg/brite/organisms/S/) on 25 August 2016, where S corresponds to the KEGG organism identifier (www.genome.jp/kegg/catalog/org_list.html). In total, we considered 384 functional categories (electronic supplementary material, table S1).

Statistical analysis
To evaluate the contribution of each factor to the climate envelope, we performed a phylogenetic multivariate analyses using R (v. 3.3.2). Ideally, we would have used a direct phylogenetic regression model (e.g. phylogenetic generalized least squares) for all 389 explanatory variables. However, we could not perform such an analysis in this study because of the combinatorial explosion in model selection and the multicollinearity that mainly arises from gene overlap among functional categories or the hierarchical organization of functional categories [12,16]. In fact, we were unable to perform the phylogenetic generalized least squares using the gls function in the R-package nlme (v. 3.1-128) because of singularities in the regression model. As mentioned in our previous study [12], this is a particular problem with GO analyses. To avoid this problem as much as possible, we considered parameter selection using a phylogenetic version of the least absolute shrinkage and selection operator (LASSO) method and completed the statistical analysis using phylogenetic generalized least squares.
To identify candidates of functional categories associated with the climate envelope, we first considered the LASSO method that would be useful both for parameter selection and for regularization in order to increase the interpretability of the regression model for finding significant variables [28]. To remove any phylogenetic effects from the association between biological variables, phylogenetically independent contrasts (PICs) of the variables were computed from phylogenetic trees using the pic function in the R-package ape (v. 3.5). According to previous studies [8,16,29] (and also for comparison with the previous study [8]), the mammalian phylogenetic tree (electronic supplementary material, figure S1) used in the analyses was constructed using the matrix extracellular phosphoglycoprotein precursor gene, downloaded from the KEGG database on 29 August 2016. We performed the parameter selection using the LASSO method using the cv.glmnet and glmnet functions in the R-package glmnet (v. 2.0.5).
To control for confounding effects, we directly performed a phylogenetic regression analysis using the gls function of the R-package nlme and the phylogenetic tree. Specifically, we constructed full models encompassing number of genes in exosomes (NOGFs) of the identified candidates, N g , N d , G, M and B c , and selected the best model using the sample-size-corrected version of the Akaike information criterion (AICc). To avoid model selection bias, we also adopted a model-averaging approach [30,31]. We obtained averaged models in the top 95% confidence set using the model.avg function in the R-package MuMIn (v. 1.15.6). Climate envelope was square-root transformed for all analyses. The contribution (i.e. non-zero estimate) of NOGF to the climate envelope was completed only when the associated p < 0.01.

Results
After performing the LASSO method, only the functional category of Exosome was selected (electronic supplementary material, table S2). In particular, we found that the number of genes in exosomes (NOGF Exosome ) was positively associated with the climate envelope ( figure 1; the coefficient of determinant in the linear regression R 2 = 0.54, the associated p-value = 2.4 × 10 -6 ). As shown in figure 2, however, the climate envelope was still positively correlated with the number of total genes (figure 2a;    These results indicate that the spurious contribution of exosomal genes to the climate envelope remains possible. To control for potentially confounding effects, we performed a phylogenetic multiple regression analysis. The full model, best model and averaged model indicated that the number of genes in the functional category of Exosome (i.e. exosomal genes) were the main contributor to the climate envelope (table 1).

Discussion
The main result of this study is the association between the number of genes that code for proteins and RNAs in exosomes and habitat variability in mammals. The effect of the exosome is independent of the other genomic factors, physiological parameters and phylogenetic relationships. It is the number of exosomal genes rather than the other genomic and physiological factors that affects species habitat variability. A previous study [8] emphasized the importance of gene duplication to habitat variability, but limited to Euarchontoglires. However, the contribution of exosomes to habitat variability was concluded for all mammals. This fact also indicates a greater importance of exosomes to habitat range expansion.
We also confirmed that habitat variability is positively associated with the number of total genes although the number of total genes was a confounding variable. These results are near-consistent with the findings of previous studies [4,5]: that the redundancy of genes enhance biological robustness and that it increases species habitat variability. However, we could not conclude any relationship between genome size and habitat variability. This may be because the previous studies focused on prokaryotes whereas this study focused on mammals. On the other hand, we concluded that metabolic rate and body size hardly influence species habitat variability. This result suggests that the hypothesis that metabolic rate and body size determine habitat variability because of the wider (relative) space use of animals with smaller body sizes and faster mass-specific metabolic rates is barely accepted.
Although the role that exosomes play in biology is not yet well understood [32], the observed association may be because exosomes act as carriers for intercellular communication [33]. Exosomes are small endocytic vesicles that can enhance communication by transporting bioactive molecules (e.g. proteins and RNAs, including non-coding RNAs) between different cells. Cell-to-cell communication allows for the coordination of cell functions, thus plays an important role in the development and environmental adaptation of multicellular organisms [34]. The number of proteins and RNAs involved in exosomes is linked to the strength and promptness of cell-to-cell communication. That is, more regulatory genes involved in exosome circulation help to fine-tune communication between cells in a wider spectrum of environmental conditions. From these facts, we can hypothesize that the number of exosomal genes influences habitat variability in mammals. The contribution of exosomal genes can also be discussed in a different context. Exosomes play an important role in diseases (cancer, in particular) by transporting bioactive molecules such as non-coding RNAs [35,36]. Specifically, microRNAs, which are a type of non-coding RNA, help to confer robustness to biological processes by maintaining transcriptional processes [37]. Such biological robustness may be linked to habitat variability. This also indicates the importance of exosomes to habitat variability because exosomes also transport microRNAs between cells.
Our study has some general limitations. For example, we only considered organisms for which genome sequences were completed. Moreover, our findings depended significantly on the quality of genome annotation. In particular, there are limitations to our phylogenetic comparative analysis. A phylogenetic comparative analysis assumes a Brownian motion-like evolution of biological traits on a phylogenetic tree with accurate branch lengths, which may result in a misleading conclusion. For example, statistical power decreases when a dataset is reduced in size following phylogenetic corrections [38]. In particular, our dataset contained only a few samples for mammals, so it may fall into that situation. Thus, the continued sequencing of genomes from a wide range of organisms will be important. In addition to this, the effects of genetic variation were not considered in this study because of the limited amount of available data. The development of high-throughput sequencing techniques may resolve these problems.
The definition of habitat variability is also controversial. In accordance with previous studies [7,8], we also considered climate envelope as a definition of habitat variability. However, it was limited to the context of climatic niche width. We may need to consider alternative definitions. We obtained similar conclusions when using habitat diversity [8] as defined by the Köppen-Geiger climate classification, which is based not only on temperature and precipitation but also on vegetation, instead of the climatic envelope. This was particularly true for the positive relationship between habitat diversity and the number of exosomal genes (R 2 = 0.42, p = 8.7 × 10 -5 ; electronic supplementary material, figure S2). This result reinforces the importance of exosomes to habitat variability.
To avoid inevitable technical problems [12,16] (e.g. combinatorial explosion and singularities in the regression model), we used the LASSO regression. However, our analysis might not be complete. We might have overlooked important gene functions because LASSO may pick only one or a few of the variables that are related to some extent to the response variable and then shrinks the rest to 0 when the explanatory variables are highly correlated [28]. To avoid this limitation, for example, we may have considered the elastic-net and relaxed LASSO. We did not use these methods because they required higher computational costs (i.e. more parameters needing to be tuned) and because the interpretation of the model can become difficult. Alternative statistical methods may be required for further investigation.
Despite these limitations, we believe in the importance of our finding (i.e. exosome hypothesis). This finding enhances our understanding of the evolution of habitat ranges in mammals. Moreover, it indicates the possibility of estimation and evaluation of not only disease but also species habitat variability by using exosome sequencing. Exosome sequencing is well used in medical sciences [39] (e.g. medical diagnosis). Similarly, we may be able to perform a diagnostic on ecosystems (e.g. biodiversity and extinction risk) through this sequencing. Sequencing analyses are now beginning to be applied in ecology (e.g. in population ecology [40] and identifying species-species interactions [41]). We suggest that the finding of this study be applied to these research fields.
Data accessibility. The datasets supporting this article have been uploaded as the electronic supplementary material. The datasets are also deposited at Dryad (http://dx.doi.org/10.5061/dryad.15500) [42].
Authors' contributions. K.T. conceived and designed the study. K.T. and M.I. prepared the data and performed data analysis. K.T. and M.I. interpreted the results. K.T. drafted the manuscript. K.T. and M.I. gave the final approval for publication.