Term Matrix: a novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns

Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes probably reflects errors in literature curation, ontology structure or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biological processes which are unlikely to be correctly co-annotated to the same gene products (e.g. amino acid metabolism and cytokinesis), and traced erroneous annotations to their sources. To date we have generated 107 quality control rules, and corrected 289 manual annotations in eukaryotes and over 52 700 automatically propagated annotations across all taxa.


Introduction
The Gene Ontology (GO; http://geneontology.org) is the most widely adopted resource for systematic representation of gene product functions [1][2][3]. The core of the GO resource consists of two components: the Gene Ontology itself, and a set of annotations that use the ontology to describe gene products.
The ontology is a structured vocabulary that defines 'terms' that represent biological structures or events, and the relations between them, in three interconnected branches: molecular function (MF; molecular-level activities of gene products), biological process (BP; larger-scale biological 'programs' accomplished by multiple molecular activities) and cellular component (CC; the cellular locations in which a gene product performs a function). The ontology is structured as a graph, with class-subclass (is_a) relationships within each branch, and relationships of additional types ( part_of, regulates, occur-s_in, etc.) [4] within and between the three branches. Every GO term has a human-readable text definition, and a growing number have logical definitions that explicitly refer to terms in GO and other Open Biomedical Ontology (OBO) ontologies [2,3,5,6]. (More formally, logical definitions use equivalence axioms expressed in OWL, the Web Ontology Language [7], to 'specify necessary and sufficient conditions for class membership' for an ontology term.) Such definitions facilitate ontology structure maintenance and quality control.
GO annotations associate gene products with GO terms, with supporting evidence, a citation, additional metadata and optional annotation extensions [8,9]. (Note: annotations may use identifiers for genes as proxies for their products, and we use 'genes' for simplicity in the remainder of this report.) The GO annotation corpus is widely used for a variety of genome-scale analyses, including broad characterization of whole genomes, interpretation of high-throughput transcriptomic and proteomic experiments, network analysis, and more [10][11][12][13][14]. In many cases, functional studies use subsets of the ontology (sometimes known as 'GO slims'), that exclude highly specific terms and take advantage of the fact that annotations are propagated over transitive relations (e.g. is_a, part_of ) in the ontology.
Manual curation of primary literature describing experimental results supplies the most precise annotations. Experiment-based annotations are then propagated to genes from additional species by methods that include manual phylogeny-based transfer as well as computational methods using sequence models, orthology inferences or keyword mappings [2,[15][16][17].
As a human endeavour, manual literature curation is imperfect, prone to errors in interpreting published experimental results or in choosing applicable ontology terms. In particular, the language used in publications is often less specific, or more prone to multiple interpretations, than the precisely defined ontology terms used in annotations. Furthermore, because manually curated annotations are widely propagated to support computed annotations, any inaccuracy in core manual annotation risks being transferred and amplified. Efficient ways to identify and correct errors are therefore highly valuable.
GO and model organism database (MOD) curators have developed a set of best practices to guide manual annotation, encompassing recommendations for interpreting various experimental results, selecting appropriate GO terms, applying evidence, and using annotation extensions [8,18,19]. Once created, annotations are subject to a series of automated quality control (QC) checks that flag errors for correction, such as incorrect term-evidence combinations, missing metadata or file format problems [3]. Nevertheless, there is still ample scope for additional QC measures to improve the accuracy of the GO annotation corpus. Accordingly, we have developed a novel approach to annotation QC based on our observations of patterns of co-occurrence of different biological process terms used to annotate the same genes.
In biology, each gene may be involved in a wide variety of processes, and some have multiple functions; these are represented in GO as multiple annotations for a single gene. Owing to spatial, functional or temporal constraints, however, certain combinations of functions or processes are not likely to be carried out by the same genes. We can therefore identify pairs of GO terms that are unlikely to be correctly annotated to the same genes, and thus provide a flag for potential misannotation. This work describes the development and initial implementation of a protocol that generates co-annotation QC rules from the identified pairs of GO terms to which the same gene should not be annotated, and then applies the rules in QC procedures to detect and correct annotation errors, and to prevent new occurrences of similar errors, thus yielding a higher quality annotation corpus.

Term Matrix annotation query tool
For each pair of GO terms analysed, annotations were retrieved from the GO database by querying for gene products annotated to 'Term1 AND Term2' directly or by transitivity (i.e. inferred over transitive relations in the ontology; by default, is_a and part_of are included). We developed a new tool, Term Matrix, which queries all pairwise combinations of a specified set of GO terms. Users can filter annotations by organism or annotated entity type (gene, protein, ncRNA, etc.) and can opt to include or exclude the regulates relations (regulates, positively_regulates, negatively_regulates) when traversing the GO graph to retrieve annotations inferred by transitivity. Results are displayed in a grid-based view (the 'matrix') that shows the number of gene products annotated to each pair of GO terms. Clicking the annotation count retrieves the annotation details for manual inspection.
Term Matrix uses the JavaScript D3 library. The code is released under the BSD 3-Clause 'New' or 'Revised' License, the same as the parent AmiGO application [20]. The tool works by querying the AmiGO Solr index, which uses precomputed graph closures that enable fast calculation of intersection counts for any term pair. Term Matrix is available directly [21], and accessible from the GO tools menu on the GO website [22].

Annotation and ontology review
For pairs of GO terms with few co-annotated gene products-defined operationally as cases where shared gene products represent only a small fraction of the total annotated to either single term-annotations and the cited sources were manually inspected to identify errors in manual literature curation or in mappings used to generate automated annotation. Where specific annotations appeared correct, the ontology was inspected for erroneous relationships.
We conducted several case studies, described below, to assess the effectiveness of our annotation validation process; the outcomes of the studies are discussed in the Results section.  [40][41][42][43][44][45][46][47][48][49][50] selected to optimize coverage of informative cellular-level processes-were retrieved and assessed as described above. Before the Term Matrix tool became available, annotations to 'Term1 AND Term2' combinations were retrieved by querying fission yeast annotations locally in PomBase [24] (or its predecessor, GeneDB). Queries included annotations propagated over the is_a, part_of and regulates relations in the go-basic version of the ontology [25]. Annotations were corrected, and queries re-run, iteratively.

Cohesin complex annotation case study
We retrieved annotations to the GO CC term 'cohesin complex' (GO:0008278), a complex required for chromosome cohesion, combined with each of 35 GO BP terms, for all species in the GO database. Queries included the is_a, part_of and regulates relations for BP ontology traversal.

Cross-species Gene Ontology subset case study
For cross-species analysis, we combined each of five selected terms ('amino acid metabolism' (GO:0006520), 'cytoplasmic translation' (GO:0002181), 'ribosome biogenesis' (GO:0042254), 'tRNA metabolism' (GO:0006399) and 'DNA replication' (GO:0006260)) with each term in a subset of 40 of the fission yeast GO slim BP terms. For six species (S. pombe, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens), we retrieved annotations in Term Matrix for each GO term combination, as described for fission yeast annotations. To avoid inclusion of genes involved in processes that have indirect effects, the regulates relations were not used for ontology traversal in this case study.

Rule generation for annotation validation
Co-annotation QC rules were generated using the annotations retrieved in Term Matrix for the cross-species case study described above. After the correction of annotation errors, term pairs with no annotated gene products in common (mutually exclusive processes) were used to establish a set of rules capturing 'Term1 is not usually coannotated with Term2' statements. Rules are expressed in a simple tab-delimited text format, as described in table 1. The set of rules that have been incorporated into GO's annotation validation pipeline [3] is available at GO's GitHub site [26], which also includes the runner code and additional tests and documentation [27]. Reports for the currently deployed GO release are available from 9 August 2018 onwards.

Fission yeast genome-wide annotation evaluation
Schizosaccharomyces pombe has 5067 protein-coding genes, of which 4337 have annotation to a GO BP term more specific than the 'root' term ('biological process', GO:0008150) [28]. The depth and breadth of fission yeast annotation make it an excellent system for studying co-occurrence of BP annotations. PomBase, the S. pombe MOD, maintains the fission yeast GO slim, a BP GO subset that classifies 99% of fission yeast protein-coding genes of known biological process into broad categories. Pairs of terms from the fission yeast GO slim with co-annotations were evaluated over time and visualized as described in the Methods. We thus identified term pairs that are rarely used to annotate genes in common, and then inspected the annotations individually.
We observed that the number of annotated genes shared by GO term pairs reflected biology: whereas large intersections between gene sets such as those annotated to 'transcription' (GO:0006351) and 'chromatin organization' (GO:0006325) are readily explained, biologically unrelated processes such as 'tRNA metabolic process' (GO:0006399) and 'protein folding' (GO:0006457) tended to yield few or no shared genes. Figure 1 illustrates the scale of annotation changes over time using annotation matrix 'snapshots' based on data from 2012 and 2020 for 21 of the term pairs studied (before 2012, individual annotation error corrections were not systematically recorded). Individual annotation corrections derived from this analysis since 2012 are included in the electronic supplementary material, table S1. Table 1. Rule file format. (Two mandatory columns contain the GO identifiers (IDs) for the pair of mutually exclusive terms, and remaining columns allow optional identifiers for exceptions to the rule (see 'Allowable annotation overlaps' in the main text). For example, line 1 consists only of 'GO:0006399 GO:0006457' in columns 1 and 2, and states that the GO terms 'tRNA metabolic process' (GO:0006399) and 'protein folding' (GO:0006457) should not both be associated with a single gene. Column 3 may contain one or more pipe-separated IDs for GO terms that allow correct use of an otherwise mutually exclusive pair. In line 2, 'GO:0006399 GO:0006310 GO:0045190' states that genes may be annotated to both 'tRNA metabolic process' (GO:0006399) and 'DNA recombination' (GO:0006310) only if they are annotated to 'isotype switching' (GO:0045190). Similarly, column 4 allows identifiers for individual gene products or for specific PANTHER families that cover entire orthologous groups, where annotation to both terms in a pair has been confirmed as accurate. In line 3, 'GO:0002181 GO:0006605 WB:WBGene00006946' states that C. elegans prx-10, but not other genes, may be annotated to both 'cytoplasmic translation' (GO:0002181) and 'protein targeting' (GO:0006605) owing to a tandem gene fusion in C. elegans.)

Annotation error types
Our work correcting fission yeast annotation errors led us to identify several classes of systematic error.
(1) Annotation of indirect effects: in manual curation, incorrect annotations often arise when a phenotype is taken to mean that a missing/altered gene product normally participates directly in the process assessed, or measured, by the analysis, but is later shown to reflect a downstream effect of the mutation. For example, fission yeast Brr6 was originally thought to be involved in nucleocytoplasmic transport on the basis of the phenotype of the S. cerevisiae gene. Subsequent work showed that Brr6 in fact acts directly in nuclear envelope organization, and effects on nuclear transport lie downstream [28,29]. In light of the most up-to-date knowledge, annotating Brr6 to 'nucleocytoplasmic transport' (GO:0006913) would be misleading. Likewise, perturbed DNA replication (GO:0006260) can indirectly lead to problems with chromosome segregation (GO:0007059), owing to the presence of DNA structures that cannot be separated (e.g. [30]). A chromosome segregation phenotype alone therefore does not suffice to confidently annotate a gene product as involved in chromosome segregation. In more extreme cases, downstream effects of mutations can sometimes lead to erroneous annotation of genes that do not normally influence a process, even indirectly. Cell cycle arrest phenotypes often give rise to this type of error, because the arrest may result from mutations that cause problems that a functioning checkpoint can detect but not correct. For example, decreased expression of the ribosome processing protein SLBP results in slowed cell growth and an accumulation of cells in S phase. From these phenotypes, SLBP was erroneously annotated to terms 'DNA replication' (GO:0006260) and 'cell cycle phase transition' (GO:0044770), despite playing no role in either process in a normally functioning cell. Indirect effects are frequently seen, and most at risk of misinterpretation, in high-throughput datasets where candidate genes may be annotated without data from follow-up validation experiments. (2) Term interpretation and usage: errors in manual curation can arise from misinterpretation of experimental results or the meaning of a GO term. For example, we found 12 examples of genes annotated to 'transmembrane transport' (GO:0055085) where 'nucleocytoplasmic transport' (GO:0006913) would instead be correct, because during nucleocytoplasmic transport the lipid bilayer is not traversed. Occasionally enzyme activities are misinterpreted; e.g. S. cerevisiae KTI1, an oxidoreductase involved in tRNA wobble uridine modification, was annotated to 'electron transfer activity' (GO:0009055). This MF term specifically represents the action of an electron acceptor and electron donor in an electron transport chain, and is linked directly to the biological process 'electron transport chain' (GO:0022900); it is more specific than the oxidoreductase activity of KTI1. (3) Mappings: because manually assigned experimental annotations provide the main source of data to create automated annotations, all types of annotation error described above can result in the incorrect association of GO terms to InterPro signatures (InterPro2GO mapping) [2,31] or UniProt keywords [2,32]. In addition, other error types specifically affect annotation derived from automated mappings. First, irrelevant terms can be propagated, either via matches to domains found in proteins from species in which a process, activity, or cellular location does not exist, or via transfer of a very specific GO term instead of a less precise, but more broadly applicable, GO term (in our study, 13 families had mappings which were only true for a subset of entries; these were excluded from the annotation error count). Second, mappings derived from protein family fission yeast annotation intersections January 2012 January 2020 royalsocietypublishing.org/journal/rsob Open Biol. 10: 200149 membership can be affected by false positive family assignments. (4) Ontology structure: incorrect paths in the ontology can cause erroneous inferences to 'ancestor' terms from correct annotations to 'descendant' terms. For example, the parent 'citrulline biosynthetic process' (GO:0019240) has been removed from 'protein citrullination' (GO:0018101), because protein citrullination describes the conversion of an amino acid residue in a protein into citrulline, not the synthesis of free citrulline. (5) Advances in biology: older findings can be supplanted by new knowledge, especially paradigm shifts in biology. For example, the Elongator complex was long thought to act as a histone acetyltransferase (GO:0004402), based on assays that have since been shown to be artefacts (e.g. [33]). Instead, the Elongator complex is actually involved in tRNA modification (GO:0006400), and all observed phenotypes can be attributed to this role [34,35]. Annotations need to be adjusted to reflect this new knowledge.
For errors affecting automated annotations based on mappings (InterPro and SPKW), we report two counts. One represents errors found in the entire set of 800 million annotations provided by the Gene Ontology Annotation (GOA) project at EBI (hereafter the 'Gene Ontology Consortium (GOC) database'). The second refers to a filtered set of roughly 8 million annotations (the 'GOC database') with narrower taxonomic coverage used by the GO Consortium's tools, including AmiGO (the GOA database has annotations for 150 million gene products from almost 1.2 million species, whereas the GOC database covers 1.5 million gene products from 4600 species).

Allowable annotation overlaps
After correcting errors in manual annotations, mappings used for automated annotation, and ontology relationships, many GO term pairs had no annotated gene products in common. The exceptions all fall into one or more of the following types:

Cohesin complex
We next conducted two case studies to investigate whether the use of co-annotation analysis for annotation QC would hold for species other than the well-annotated fission yeast. In the first, we examined co-annotations to the GO CC term 'cohesin complex' (GO:0008278) with each of 35 selected GO BP terms. Erroneous annotations fell into the same categories identified for fission yeast annotations. Figure 2 shows co-annotation counts before and after corrections, and a breakdown of annotation errors by type and database. Across multiple MODs plus UniProt, 35 experimental annotations were deleted (listed in the electronic supplementary  material, table S2). Finally, one InterPro2GO mapping affecting over 7000 computationally inferred annotations was removed (see electronic supplementary material, table S3).

Gene Ontology biological process subset
In the second cross-species study, we narrowed the number of species to six (fission yeast, budding yeast, worm, fly, mouse and human; see Methods), but broadened GO coverage to pairwise combinations of five core cellular level biological processes ('amino acid metabolism' (GO:0006520), 'cytoplasmic translation' (GO:0002181), 'ribosome biogenesis' (GO:0042254), 'tRNA metabolism' (GO:0006399) and 'DNA replication' (GO:0006260)) against a set of 40 core cellular level biological process GO terms (electronic supplementary material, table S4). As a result, 182 manual annotations were corrected or removed directly. Correcting 19 ontology paths affected over 45 000 annotations in the GOC database and over two million annotations in the GOA database. Fifty-four InterPro2GO mappings were corrected, affecting over 5900 annotations in the GOC database and over 380 000 inferred annotations across all species in the GOA dataset (calculated from family size in Inter-Pro version 77). Over 1800 annotations present in both the GOC and GOA sets that are phylogenetically inferred for key GO species using the PAINT annotation transfer system [16]

A workflow for annotation quality control
Following successful detection and correction of annotation errors in our case studies, we have developed shared coannotation rules that form the basis of a pipeline for annotation QC. The 'Matrix QC' workflow is a multi-step, ongoing, and iterative process, summarized in figure 3.
(1) A set of GO term identifiers is used as input for the Term Matrix tool to provide visualization and access to genes with annotations shared between pairs of GO terms (annotation intersections). Early iterations use selected GO terms, and use the Term Matrix option that excludes regulates relations when traversing ontology paths (i.e. gene products annotated to a term that is connected to one or both queries via the regulates relation will not appear in the intersection set). Annotation outliers, defined as intersecting sets with low numbers of annotated gene products, are critically inspected for validity. Annotation errors are identified and corrected, usually by assessing the original experimental data. Table 2 summarizes the corrections we made to manual annotations, mappings and ontology paths as part of establishing the Matrix QC workflow. Annotation intersections which yield empty sets can royalsocietypublishing.org/journal/rsob Open Biol. 10: 200149 be used to generate co-annotation QC rules of the form 'genes annotated to process A are not usually annotated to process B'. (2) New and existing annotations that violate annotation co-annotation QC rules are reported to contributing databases via standard GO Consortium QC pipelines. (3) Upon reviewing reported errors, the contributing database may either make corrections or provide evidence that validates annotations. For valid annotations in intersections, the co-annotation QC rules are extended to include additional specifications that will allow only valid annotations to pass (exceptions may be specified at the level of species, protein families or individual gene products). Rules can also be modified to account for new biological knowledge, for example by allowing co-annotation of specific sub-processes, or where the annotated gene product matches additional criteria such as being annotated to a particular molecular function or cellular component.

Using co-annotation for quality control
Biological data can sometimes be subject to variable interpretation, and the state of biological knowledge is constantly changing, posing challenges for the accurate and up-to-date characterization and curation of genes and their products. We have developed a QC pipeline for GO BP annotation based on observed patterns of co-occurrence of GO terms used to annotate the same genes. Annotation to both of a selected pair of GO terms, designated 'annotation intersection', should occur only where the processes actually overlap, genes are shown to have multiple functions, or the same function is used in more than one process. We have corrected numerous errors in annotations and ontology relationships, generated rules describing annotation intersections expected to be null, and incorporated the rules into an iterative QC pipeline. The new system provides for the detection and correction of existing annotation errors, as well as prevention of similar errors entering the GO annotation corpus.
Our work demonstrates that the inspection of annotations co-annotated to multiple processes can identify annotation outliers, systematic mapping errors, and ontology problems for validation or correction. The incremental creation of coannotation QC rules covering all annotation space will create a robust mechanism for the validation and improvement of the annotation corpus over time, because potential errors will be identified, and the flagged annotations validated or corrected upon submission.

Propagating error correction
Errors in ontology relationships and in mappings between ontology terms and other classification systems such as Inter-Pro can introduce systematic errors in annotation datasets, as royalsocietypublishing.org/journal/rsob Open Biol. 10: 200149 often every gene product annotated to a misplaced term, or associated with a particular domain or keyword, is affected. Systematic errors may also originate from experimentally supported annotations produced by curators, because these manually curated data are then used to develop mappings between GO and protein families, and in phylogenetic propagation. For example, experimental annotations assigned by MOD and UniProt curators are used to establish InterPro2GO mappings, and routinely propagated to orthologues across thousands of species by Ensembl Compara pipelines [37] and PAINT. Furthermore, widely propagated annotation errors affect common uses of GO data. For example, it has been demonstrated that new annotations can produce more informative results in enrichment analyses [12,38,39], so we expect that the removal of many incorrect annotations to a given term will similarly improve enrichments. Correcting widely propagated annotations thus results in improvements throughout the GO annotation corpus-affecting hundreds or even thousands of annotations-and correcting ontology errors also benefits analyses that use annotation datasets.

Future directions
In the present study, we created co-annotation QC rules for pairwise term combinations involving five GO BP terms that already had low numbers of annotations in intersections. We aim to extend the rules to cover more GO term pairs, and to accommodate all experimentally verified annotations found in intersections using increasingly specific rule exceptions. Although rule construction and the accompanying error correction procedure is time-consuming-because large numbers of potential violations need to be traced back to the original publication and evaluated, and the reasons for the apparent violation are often obscure-maintenance overhead is low once rules are established. Adapting co-annotation QC rules and exceptions to accommodate new biology takes comparatively little effort, and provides annotation quality benefits indefinitely. Next, we will extend co-annotation QC rules to the MF and CC branches of GO, adding rules for pairwise combinations of terms within the MF and CC branches, and for pairs of terms from different branches. For example, a rule could identify cytosolic (CC) proteins that are annotated to DNA recombination (BP), or that a DNA-binding transcription factor activity (MF) is not a general transcription initiation factor activity (MF). We will also explore applications of co-annotation QC rules beyond error detection in existing annotations. For example, machine learning function prediction exercises may use our QC rules to constrain  Step 2: based on known biology, create co-annotation QC rules that disallow simultaneous annotation to term pairs ('NO OVERLAP' between annotation sets for the indicated terms).
Step 3: re-run Term Matrix to find annotations that violate the rules; report to contributing databases for validation.
Step 4: correct annotation errors, or amend rules to allow specific biologically valid exceptions. predictions such that annotations that would violate rules are excluded. We also anticipate that combining Term Matrix-based QC with novel GO annotation protocols will yield synergistic benefits. Our results to date indicate that, owing to downstream or pleiotropic effects, it is difficult to assign a direct role in a biological process to a gene product from a mutant phenotype without additional information. Additionally, it can often be challenging to discern when a gene product is directly involved in a biological process as opposed to having an impact on a process by perturbing an upstream process. Two recent innovations in GO annotation show great promise for minimizing such errors. First, the introduction of new relations to describe how a gene product is connected to a term (involved_in, acts_upstream_of , etc.), will allow curators to capture indirect annotations explicitly, and simultaneously provide a mechanism to filter when a set of direct annotations is desired. Second, the new gene product-GO term relations form part of the Gene Ontology Causal Activity Modelling (GO-CAM) [40] system, which uses OWL to represent how the molecular activities of gene products interconnect to carry out, and regulate, biological processes.

Conclusion
We envisage that the co-annotation rule-based QC procedure will help direct researchers to outstanding questions in molecular or cellular biology: annotations that appear to violate rules may indicate areas where available experimental results are inconsistent, requiring further experiments to resolve discrepancies, but some may identify interesting areas of biology where evolution has co-opted a single gene product for more than one task. Our work has built a co-annotation QC system into GO procedures that can readily be more widely implemented, thereby enabling curators and researchers to distinguish between new annotations that provide additional support for known biology and those that reflect novel, previously unreported connections between divergent processes. The co-annotation QC pipeline thus enhances GO not only by detecting and preventing annotation errors, but by highlighting advances in our understanding of biology.
Data accessibility. The GO ontology and annotation datasets are freely available from the Gene Ontology website (see the main downloads page [41]). All other data supporting this article have been uploaded as part of the electronic supplementary material.