Hidden in plain sight: what remains to be discovered in the eukaryotic proteome?

The first decade of genome sequencing stimulated an explosion in the characterization of unknown proteins. More recently, the pace of functional discovery has slowed, leaving around 20% of the proteins even in well-studied model organisms without informative descriptions of their biological roles. Remarkably, many uncharacterized proteins are conserved from yeasts to human, suggesting that they contribute to fundamental biological processes (BP). To fully understand biological systems in health and disease, we need to account for every part of the system. Unstudied proteins thus represent a collective blind spot that limits the progress of both basic and applied biosciences. We use a simple yet powerful metric based on Gene Ontology BP terms to define characterized and uncharacterized proteins for human, budding yeast and fission yeast. We then identify a set of conserved but unstudied proteins in S. pombe, and classify them based on a combination of orthogonal attributes determined by large-scale experimental and comparative methods. Finally, we explore possible reasons why these proteins remain neglected, and propose courses of action to raise their profile and thereby reap the benefits of completing the catalogue of proteins’ biological roles.

VW, 0000-0001-6330-7526; MAH, 0000-0003-4148-4606 The first decade of genome sequencing stimulated an explosion in the characterization of unknown proteins. More recently, the pace of functional discovery has slowed, leaving around 20% of the proteins even in well-studied model organisms without informative descriptions of their biological roles. Remarkably, many uncharacterized proteins are conserved from yeasts to human, suggesting that they contribute to fundamental biological processes (BP). To fully understand biological systems in health and disease, we need to account for every part of the system. Unstudied proteins thus represent a collective blind spot that limits the progress of both basic and applied biosciences. We use a simple yet powerful metric based on Gene Ontology BP terms to define characterized and uncharacterized proteins for human, budding yeast and fission yeast. We then identify a set of conserved but unstudied proteins in S. pombe, and classify them based on a combination of orthogonal attributes determined by large-scale experimental and comparative methods. Finally, we explore possible reasons why these proteins remain neglected, and propose courses of action to raise their profile and thereby reap the benefits of completing the catalogue of proteins' biological roles.

Slow progress in characterizing unknowns
When the first eukaryotic chromosome (chromosome III of Saccharomyces cerevisiae) was sequenced in 1992, the most surprising discovery was that previously undetected protein-coding genes outnumbered mapped genes by a factor of five [1,2]. Researchers had generally assumed that few proteins remained to be discovered, especially in an organism as intensively studied as yeast. The completion of the S. cerevisiae genome sequence in 1996 confirmed that more than half the genes lacked any indication of their biochemical activity or broader biological role [2]. Over the ensuing two decades, complete genome sequences have become available for over 6500 eukaryotic species [3]. At first, characterization of newly discovered genes progressed rapidly in model species, as researchers supplemented classical biochemistry and forward genetics with reverse genetics, homology modelling and large-scale systematic techniques to study novel genes. Complete genome sequences also allowed the deployment of large-scale systematic techniques [4]. More recently, however, progress has slowed, even in well-studied species such as the budding yeast S. cerevisiae and the fission yeast Schizosaccharomyces pombe. Figure 1 shows protein characterization over the past 28 years for these two yeasts. Notably, the proportions characterized in fission yeast (84%) and budding yeast (82%) have only slightly increased in the past decade (from 80%, as noted by Peñ a-Castillo & Hughes [11], and 77%, respectively). Across all studied eukaryotic species, the proportion of characterized proteins has & 2019 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
reached a plateau around 80%, and exhibits a long-tailed distribution, with the biological roles of the remaining 20% still elusive.
Here, we use a simple yet powerful metric based on Gene Ontology (GO) biological process (BP) terms to define characterized and uncharacterized proteins for human and the two model yeasts. We then combine our GO-based classification with information about taxonomic conservation using fission yeast to identify a set of broadly conserved, but unstudied, proteins. We classify the fission yeast conserved but unstudied protein set based on a combination of orthogonal attributes (e.g. taxonomic conservation, mutant viability, protein sequence features, localization). Finally, we explore possible reasons why these proteins remain neglected, propose courses of action to raise their profile among bench researchers and bioinformaticians, and posit the benefits of completing catalogue of proteins' biological roles.

Defining unknown metrics: what counts as 'known'?
To estimate more precisely the proportion of a proteome that is characterized, and to provide inventories of uncharacterized gene products, the 'known' category must be rigorously defined. However, the gradual accumulation of data of many different types, from diverse experimental and computational methods and multiple sources, makes it challenging to draw a clear line between 'known' and 'unknown'. For example, in 2004 Hughes et al. [12] observed that the then-current Yeast Proteome Database (YPD) listed 80% of S. cerevisiae genes as 'known', but also noted that by more stringent criteria based on GO annotation then in the Saccharomyces Genome Database (SGD), 30-40% remained unknown and others only poorly understood. Knowledge acquisition is necessarily a continuum-different experiments are performed at different scales (e.g. high-versus lowthroughput) and yield results at different levels of biological detail (e.g. detecting DNA repair versus distinguishing mismatch repair from base excision repair) and confidence (stemming from variation in the quality of assays and the number of replicates performed). For these reasons, the characterization status of gene products does not fall on a simple linear scale. Biologists often make qualitative judgement calls to designate individual gene products as 'novel', 'barely characterized' or 'relatively well characterized'. While this serves the purposes of individual researchers working on a gene-by-gene basis, a more quantitative and objective approach is required to summarize the status of functional characterization for an entire proteome, and to facilitate cross-species comparisons.

Metrics to describe functional characterization levels
To develop workable metrics for the status of the functional annotation of a given proteome, we have exploited GO annotation [13], and illustrate this scheme using the proteomes of S. cerevisiae, S. pombe and human as examples. Since the functional attributes of gene products of diverse species are routinely described using GO, these metrics are widely applicable. The GO molecular function (MF) ontology describes molecular-level activities of gene products (such as catalytic, transporter and receptor activities). GO BP refers to ordered assemblies of MFs representing physiological roles of gene products (e.g. involvement in cytokinesis or DNA replication). GO cellular component (CC) provides the locations of gene products (organelles, complexes, etc.). Determining each annotation type relies on different experimental techniques, yielding complementary results and insights. We might know a gene product's MF or its localization (CC), but not the physiological context in which that product acts (BP).

Annotation coverage for proteins by Gene Ontology aspect
One simple way to quantify the degree to which the function of a given gene product has been characterized is to report annotation to one, two or all three GO aspects (MF, BP, CC). GO aspect coverage provides a simple metric that is accessible for any species by counting gene products with (or without) annotation to each aspect. Figure 2 shows coverage by ontology aspect for human, fission yeast and budding yeast proteins. From this viewpoint, we can quickly assess the number of gene products annotated to all three aspects of GO (usually well characterized), or none at all (uncharacterized), and assess what types are absent for a species.

Known physiological function: Gene Ontology slim process coverage
Although informative, the activities captured by MF and the localizations described by CC make only a limited contribution to knowledge about a gene product's characterization if taken in isolation. Although some MF terms (such as those describing transcription factor activities, substrate-specific transmembrane transporter activities, and some specific catalytic activities) implicitly refer to physiological roles, biologically informative gene characterization usually requires additional data to place the activity into a broader physiological context. By contrast, BP annotation provides this context, and thereby reveals more about the role of a gene product in an organism's biology, and provides a useful benchmark for preliminary characterization. As an example, knowing that a protein is a kinase (MF) is not very informative until we find that a deletion mutant of the gene encoding that protein is defective in meiotic nuclear division (BP) and the protein itself localizes to chromatin (CC). To use this information as a measure of the progress in a protein's functional characterization, we use the annotation overviews provided by tailored GO term subsets known as 'GO slims' [17]. We created a BP slim set covering as many annotated gene products as possible, while remaining informative about the physiological context in which they operate. As our starting point, we took the fission yeast GO BP slim developed at PomBase over 8 years [18], which provides excellent coverage of informative cellular processes for fission yeast (99.4% of annotated proteins with a known process), and minimizes overlap between terms. The PomBase slim aims to demonstrate the distribution of processes within distinct 'modules' of biology (cytokinesis, tRNA metabolism, DNA replication etc.) [19,20], and therefore excludes overly general BP terms, such as 'metabolism' or 'cellular component organization', that would increase coverage at the expense of specific context. Terms that recapitulate activities in the MF ontology (e.g. 'protein phosphorylation') or describe phenotypic observations but do not correspond to a specific physiological role for a gene product (e.g. 'response to chemical') are also excluded. Following the same principles, we extended the 53-term PomBase slim into a generalized process slim of 117 terms to use in cross-species analysis, as summarized in figure 3a.
For any annotation-based metric, it is important to distinguish unknown (or unstudied) from unannotated gene products. Here, unknown gene products are defined as those that have been evaluated by curators and have no annotation to any BP slim terms (these gene products are annotated to the root term 'BP' with the evidence code 'no data (ND)' [22,23]). Unannotated are those not explicitly indicated as unknown but which, nevertheless, have no annotation from experiment or inference. Because all fission yeast and budding yeast genes have been systematically assessed using all available data, any gene products lacking specific GO annotations can confidently be deemed to have unknown biological roles. For the human proteome, manual inspection of the unannotated proteins revealed that many can actually be annotated to a BP based on experimental data in the literature or by homology-based inference, and thus classified as characterized. To make this knowledge available, we manually curated 931 GO annotations for 502 human proteins from 310 publications, including BP assignments for 238 previously unannotated proteins. These annotations will be submitted to the Gene Ontology Consortium for inclusion in the human GO annotation dataset. Figure 3b shows the proportions of the S. pombe, S. cerevisiae and human proteomes that are known (i.e. annotated to informative BP terms), unannotated, annotable, or unknown. See electronic supplementary material data for GO slim term IDs (electronic supplementary material, table S1), input protein lists (electronic supplementary material, tables S2 -S4), GO slim outputs (electronic supplementary material, tables S5 -S7), unknown gene lists (electronic supplementary material, tables S8 -S10), GO terms considered for the slim but not included (electronic supplementary material, table S11) and human GO annotations generated by this study (electronic supplementary material, table S12).

Why do these proteins remain unstudied?
Our GO slim-based characterization metric confirms the impression from simpler metrics that, for the two model yeasts and human, about 20% of proteins lack physiologically informative descriptions. Why do so many proteins, many of them conserved, remain unstudied? Below, we consider biological and sociological/cultural factors that contribute to the apparent lack of interest in these unknown proteins. royalsocietypublishing.org/journal/rsob Open Biol. 9: 180241

Biological bias
One factor influencing gene characterization is simply how easily one gene's contribution to an organism can be detected. Deletion mutants of essential genes have a clear phenotype that indicates an important function-for yeasts, the failure to grow on rich media. As a consequence, these genes, and the core processes in which they participate, are well characterized. For example, only 24 of the genes in the fission yeast unknown set are essential in rich media. Changes in cell morphology are also readily identifiable phenotypes. Visual screening and analysis of the fission yeast genome-wide deletion collection for morphology phenotypes under standard laboratory conditions found obvious abnormalities for only 10% of 643 genes of unknown function [24]. The most commonly used experimental conditions, designed as they are to maximize cell growth, can hide environment-dependent roles. Many more of the 676 currently uncharacterized ( per figure 3b) fission yeast genes are associated with growth or viability phenotypes upon specific chemical challenges (26.1%) than under standard laboratory conditions (3.6%) (PomBase [19,20], queries run 11 November 2018). In budding yeast, only 34% of all deletion mutants display a growth phenotype under standard laboratory conditions, whereas 97% of all genes are essential for optimal growth in at least one condition when assayed under multiple chemical or environmental perturbations [25].
Protein characterization has traditionally emphasized core BP over those that reflect interactions with the environment. However, analysis of the proteins that have recently (2016-2018) been removed from the fission yeast 'conserved unknown' set because their functions have been determined reveals that they most often participate in environmentresponsive processes such as signalling, detoxification, proteostasis, lipostasis and mitochondrial organization (table 1). Many of these functions are associated with the age-related accumulation of damaged or misfolded proteins, which become debilitating over time. In humans, such functions are implicated in neurodegenerative diseases, such as Alzheimer's and motor neuron diseases [26], which underscores the importance of making strenuous efforts to elucidate the functions of the remaining 'conserved  Figure 3. GO slim analysis of budding yeast, fission yeast and human proteins. (a) Generic GO biological process slim set creation flowchart. The fission yeast GO biological process slim [18] was applied to human and S. cerevisiae protein sets, and then iteratively extended to improve coverage by adding terms. All evidence codes were included except 'reviewed computational analysis' (RCA), which yields a higher rate of false positives than the others. Some processes were swapped (e.g. 'cytoplasmic translation' in the fission yeast slim for more general 'translation') to accommodate the less specific annotation available in other species. The fission yeast slim also omits overly broad terms (e.g. 'metabolism') and terms representing activities (molecular functions) in the biological process ontology (e.g. 'phosphorylation') because they do not add information about physiological roles; these terms were also excluded from our generic slim set even if inclusion would have increased coverage. (Terms specifically considered but omitted from the generic slim are listed in electronic supplementary material, table S11). At convergence (the point where no additional informative terms could be identified for gene products with biological process annotations), proteins annotated to slim terms were classified as 'known' (4393 S. pombe; 4936 S. cerevisiae; 16354 human). The remaining proteins with uninformative processes were classified as unknown, along with those already identified as unknown by annotation to the root node with evidence code ND (no data). Manual assessment of the remaining human proteins with no GO biological process annotation added 266 proteins, bringing the 'known' total to 16620. Final 'unknown' protein totals are 676 in S. pombe, 978 in S. cerevisiae and 3117 in human. The set of GO slim terms is available in electronic supplementary material, table S1. (b) Proportions of proteins with known GO slim biological role. For all three species, 'known' proteins have annotation to at least one term from the GO slim set (see A), and 'unknown' proteins do not. Because the human proteome includes some proteins that lack annotation in the GO database, the proportions of unannotated proteins that we found to be known (i.e. annotatable) and unknown are indicated separately. All protein datasets exclude dubious proteins and transposons. Analysis was performed using GOTERMFINDER [21], with GO data from 25 September 2018 and the GO slim created as described in A. Input protein lists are available in electronic supplementary materials, tables S2 (S. pombe), S3 (S. cerevisiae) and S4 (human). GOTERMFINDER output is available in electronic supplementary material, tables S5 (S. pombe), S6 (S. cerevisiae) and S7 (human).
royalsocietypublishing.org/journal/rsob Open Biol. 9: 180241 unknowns'. It should also be pointed out that the ecology of yeasts is poorly understood, and we postulate that many unknown proteins function in aspects of life that are not normally probed in the laboratory (e.g. interactions with pathogens, or survival within insect vectors). We anticipate that a greater variety of experimental conditions will supply more information about the unknown gene products catalogued here to reveal the roles of many gene products in processes fundamental for human health.

Research bias
In fission yeast, the majority of new knowledge over the past decade provides increasing detail for previously described proteins. This bias towards studying already-known proteins is not peculiar to yeast research. In their essay 'Too many roads not taken' [27], Edwards et al. observe that 70% of human protein research still focuses on the 10% of proteins known before the human genome was sequenced. Although few studies have explored the causes of the observed emphasis on known proteins, we can identify a number of plausible contributing factors, which are largely borne out by a recent large-scale analysis of publications on human genes [28]. First, the complexity of biology demands that investigators narrow their study targets to a manageable range. Researchers with established interests in specific topics thus naturally focus their work where they have deep knowledge, and extend their studies to novel genes only if a strong lead emerges, for instance, from work in another species or from a data-mining approach that implicates them in BP already under investigation. Indeed, Stoeger et al. [28] find that research in model organisms strongly influences the initial study of individual human genes. Both papers also highlight pragmatic considerations, notably the availability of research tools, and socio-political factors including career timelines, funding priorities, and peer review, all of which exacerbate the tendency to avoid the wholly unknown. Risk-averse funders and reviewers tend not to favour long-range strategies aimed at genes without an existing functional context for fear of diverting resources towards targets whose significance is not guaranteed. Without a shift in perspective, proteins without any existing functional annotation will continue to be neglected, to the detriment of basic and applied biomedical research. Stoeger et al. [28] note that current research is not only slow to cover novel genes but also 'can significantly deviate from the actual biological importance of individual genes'.

Classifying the conserved unknown proteins, or: what lies undiscovered?
The stubborn core of remaining proteins of unknown function are often dismissed as species-specific, but we have often found otherwise, and we can no longer afford to sweep these proteins under the carpet. Therefore, to provide further insight into why some gene products elude physiological characterization, we present a case study using the set of 410 fission yeast proteins of unknown physiological role that have orthologues outside the Schizosaccharomyces clade (the 'conserved unknowns'). We classify these 410 proteins according to a range of orthogonal biological attributes (including taxonomic conservation, identification of a catalytic fold or domain, cellular localization, viability). Figure 4 presents the subset of 200 proteins in this group which are conserved outside fungi (in vertebrates, archaea or bacteria; details for the full set of 410 proteins are provided in  figure S1 and table S13). The number of conserved unknown proteins that play an essential role under permissive growth conditions is disproportionally low (4.4%; 18/410, versus 1278/4376 (29.2%) known, inferred or published), and almost half of these 18 essential proteins are localized to the mitochondria or the endoplasmic reticulum. Only 8 essential genes are conserved in vertebrates (all are organellar), and only a single protein out of 53 conserved between yeast and prokaryotes (but not vertebrates) is essential. A substantial proportion of the 200 proteins conserved outside the fungi (76) are absent from Saccharomyces due to the well-documented lineage-specific gene losses in its evolutionary history [34]; characterized gene products with this taxonomic distribution are most highly enriched for chromatin organization and mRNA metabolism. Unknown mitochondrial and endomembrane system proteins are enriched for proteins with transmembrane domains (59/114). Unknown nuclear proteins are predicted to include more transferases than other enzymatic activities; the set of nuclear proteins shared only with eukaryotes includes 19 domains of unknown function (DUFs) with no currently identifiable catalytic domain, and 12 proteinprotein interaction domains (e.g. WD, ankyrin or TPR) that frequently function as scaffolds for protein complexes [30]. This multi-factorial classification can support the prediction of likely physiological roles that can be experimentally tested.

Strategies to link unknown genes to broad cellular roles
Despite almost a century of gene product-specific biochemical and genetic interrogation, and two decades of postgenomic research, a large number of proteins conserved from yeast to human still have no known biological role. Broadly conserved unknown eukaryotic proteins can be assumed to have important cellular roles conserved over 500 million years of evolution. It is, therefore, remarkable that this gene set has hardly reduced over the past decade. In this period, familiar genes have been studied in ever-greater depth, presumably at the expense of the characterization of genes of hitherto unknown function (e.g. over     [20,29]. PomBase also maintains manually curated lists of orthologues between S. pombe and S. cerevisiae, and between S. pombe and human, three eukaryotic species separated by approximately 500 -1000 million years of evolution. In combination, these inventories can be used to identify conservation across taxonomic space at different levels of specificity. Of the fission yeast 'unknown' proteincoding genes, 410 are conserved outside the Schizosaccharomyces clade. Of these, 210 are present either in fungi and vertebrates, or in fungi and prokaryotes (data from PomBase manual assignments, queried on 31 July 2018). Proteins were classed as catalytic (i.e. having an identifiable catalytic fold) or non-catalytic (no currently identifiable catalytic fold) based on protein domain, fold, clan or superfamily membership, using InterPro [30] and GO [13] assignments. Cellular locations using GO annotation are available for most of the unknown proteome based on a genome-wide localization study and inference from other models [31]. Viability data come from large-scale screens reported by Kim et al. [32] and Chen et al. [33]. The fission yeast 'conserved unknown' protein set [18] is reviewed continually for new functional data.
royalsocietypublishing.org/journal/rsob Open Biol. 9: 180241 33 000 papers with 'p53' in the title have been published since 2007). Assuming a diminishing return for studies on highly characterized proteins, investigations on unstudied proteins will have a relatively higher impact that is likely to outweigh the considerable initial efforts required to place them within the context of current knowledge.
To jump-start renewed progress in unknown gene characterization, two major stumbling blocks must be overcome. One is to identify the cohort of genes of unknown function, and the other is to develop mechanisms to bring the proteins that they encode to the attention of interested researchers. Here, we provide a framework that uses a generally accessible set of criteria based on manually curated data to identify and classify unstudied proteins, which could easily be extended with additional criteria for further annotation specificity in future iterations. The construction of inventories of unknown proteins will ultimately depend on accurate and complete functional annotation of all genes of the major model species.
Commentators on genome-scale research have long recognized that, in order to fully describe an organism's protein complement, it will be necessary to deploy parallel experimental and computational methods, at both large and small scales [4,12]. Understanding how investigatory biases and the characteristics of particular gene sets have converged to prevent characterization will help us to identify the most promising routes to uncovering unknown functions. These, and other factors that contribute to the neglect of the characterization of conserved gene products that are likely to have novel biological roles, deserve further in-depth consideration. It is likely that, to fill the persistent knowledge gaps represented by the roughly 20% of proteins that remain uncharacterized, a creative combination of existing and emerging experimental and in silico methods will be required, as well as an increased awareness among the scientific community of the value of a full proteome description. Because basic knowledge at the cellular level provides the building blocks of translational research, drug discovery, personalized medicine, metabolomics and systems biology, comprehensive proteome characterization underpins the success of numerous and diverse endeavours across all of the biological and medical sciences.
Data accessibility. This article has no additional data.