Disease classification via gene network integrating modules and pathways

Disease classification based on gene information is significant as a foundation for achieving precision medicine. Previous works focus on classifying diseases according to the gene expression data of patient samples, and on constructing disease networks based on the overlap of disease genes, as many genes have been confirmed to be associated with diseases. In this work, the effects of diseases on human biological functions are assessed from the perspective of gene network modules and pathways, and the distances between diseases are defined to build the classification models. In total, 1728 diseases are divided into 12 and 14 categories by the intensity and scope of effects on pathways, respectively. Each category is a mix of several types of diseases identified based on congenital and acquired factors as well as diseased tissues and organs. The disease classification models based on the gene network run parallel to the traditional pathology classification based on anatomic and clinical manifestations, and enable us to look at diseases from the viewpoint of commonalities in etiology and pathology. Our models provide a foundation for exploring combination therapy of diseases, which in turn may inform strategies for future gene-targeted therapy.


Introduction
Characterizing disease in the biological big data era of the twenty-first century has been of significance [1], based not only on pathological analysis and clinical syndromes but also on molecular-level information, including gene data. The genes of an organism play a vital role in the regulation of cellular processes, as well as in disease development. Much effort is directed at the possibility that, with the growing body of omic datasets in health and disease, human disease will be defined precisely, with optimal sensitivity and specificity [1]. This also starts a rigorous analytical process that can lead to defining prognostic determinants and better-individualized therapeutic responses [1]. For a long time, cancer has been classified mainly according to its site of origin in the body. However, a collaborative project called the Pan-Cancer Initiative [26], launched in 2012, set out to study cancer from another perspective. A preliminary analysis shows that cancers originating from different organs have features in common at the molecular level, and that cancers originating from the same tissue may have very different genomic characteristics [27]. The project was completed in 2018. Li et al. take a comprehensive perspective on oncogenic processes based on Pan-Cancer Atlas analyses, giving prominence to the complex impact of genome alterations on the signalling and multi-omic profiles of human cancers as well as their influence on the tumour microenvironment [9]. Hoadley et al. [28] show that the 33 tumour types analysed can be reclassified into 28 different molecular types based on their cellular and genetic composition, rather than their origin, which would inform strategies for future therapeutic development. Sanchez-Vega et al. [29] make an integrated analysis of genetic alterations in 10 signalling pathways across 9125 tumour samples profiled by TCGA and point out a significant representation of individual and co-occurring actionable alterations in these pathways, suggesting opportunities for targeted and combination therapies.
In this work, disease classification models based on the gene network are proposed, as we consider not only individual genes, but also gene network modules and pathways. The largest connected component (LCC) of the human gene network is divided into 10 modules by using the fast unfolding algorithm. We integrate disease genes, topological modules and biological pathways to assess the influence of diseases on the human body, and perform different classifications of diseases for different interests. In total, 1728 diseases collected from KEGG are divided into 12 categories by the intensity of effects on pathways, and into 14 categories by the scope of effects on pathways. Among the 15 types of diseases, cancers are spread over the smallest number of categories, which suggests that cancers are alike in being complex and in having a great impact on pathways. Each category is a mix of several types of diseases identified based on congenital and acquired factors as well as diseased tissues and organs, which implies that the human gene network offers a new perspective on disease classification, and can guide future gene-targeted therapy and combination therapy of diseases.
KEGG PATHWAY [6] is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction, reaction and relation networks. It includes not only the normal states but also the perturbed states of the biological systems, divided into seven types of pathways as follows: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases and drug development. We select 317 human pathways apart from drug development pathways which include gene information.
Diseases are viewed as perturbed states of the biological system in KEGG DISEASE [6,30]. Each disease is represented by a list of known disease genes, any known environmental factors at the molecular level, diagnostic markers and therapeutic drugs, which may reflect the underlying molecular system. Diseases are divided into 15 primary classifications: cancers, immune system diseases, nervous system diseases, cardiovascular diseases, respiratory diseases, endocrine and metabolic diseases, digestive system diseases, urinary system diseases, reproductive system diseases, musculoskeletal diseases, skin diseases, congenital disorders of metabolism, congenital malformations, other congenital disorders and other diseases. In total, 1728 diseases (with known disease genes) of 67 secondary classifications are screened out (https://www.kegg.jp/kegg-bin/get_htext?htext=br08402_gene.keg).

Module partition based on the fast unfolding algorithm
A module reflects the local characteristics of individual behaviours in the network and their interrelationships. The modules of the network under study play a vital role in understanding the structure and function of the entire network, and can help us analyse and predict the interactions between its elements. A key step was taken when Girvan and Newman popularized graph-partitioning problems by introducing the concept of modularity [31]. In its original definition, an unweighted and undirected network that has been partitioned into communities has modularity
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$
where $A$ is the adjacency matrix of the network, $m$ is the total number of edges, $k_i = \sum_j A_{ij}$ is the degree of node $i$, $c_i$ is the community to which node $i$ is assigned, and $\delta(c_i, c_j)$ equals 1 if $c_i = c_j$ and 0 otherwise. The indices $i$ and $j$ run over the $N$ nodes of the graph.
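As an illustration of the modularity definition, the following minimal sketch computes it for a toy graph in plain Python. The function name `modularity`, the edge-list input format and the toy data are assumptions made for this example, not part of the original analysis.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph without self-loops.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Sum of A_ij over ordered node pairs in the same community:
    # every intra-community edge is counted twice.
    intra = sum(2 for u, v in edges if community[u] == community[v])
    # Sum of k_i * k_j over ordered pairs in the same community equals
    # the sum over communities of (total degree of the community)^2.
    degree_total = defaultdict(int)
    for node, k in degree.items():
        degree_total[community[node]] += k
    expected = sum(d * d for d in degree_total.values()) / (2 * m)
    return (intra - expected) / (2 * m)


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split, q_merged)
```

Separating the two triangles gives a positive modularity, whereas placing every node in one community gives zero, which is the behaviour a modularity-optimizing method exploits.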
The fast unfolding algorithm is a multistep method based on a local optimization of modularity in the neighbourhood of each node and implemented as follows [32]:

Adjusted cosine similarity measures the correlation of gene distribution deviation between pathways
To assess the overlap of functional pathways and topological modules, we compute the proportion vector $\lambda_i = (\lambda_i^1, \lambda_i^2, \ldots, \lambda_i^{10})$ for each pathway $i$ ($i = 1, 2, \ldots, 317$) in the 10 modules. As each module has a different number of genes, we apply the adjusted cosine similarity to measure the correlations of pathway gene distributions across the network modules.

The tacit assumption in network medicine is that the topological, functional and disease modules overlap, so that a topological module corresponds to a functional module, and a disease is a result of the breakdown of some particular functional modules [32,33]. We therefore assume that disease genes affect pathway functions through topological modules. The number of diseases is $N_d = 1728$, the number of modules is $N_m = 10$, and the number of pathways is $N_p = 317$. Each disease $k$ ($k = 1, 2, \ldots, N_d$) is represented by the disease gene group $D_k = \{g^k_1, g^k_2, \ldots, g^k_{n_k}\}$. Each topological module $M_l$ ($l = 1, 2, \ldots, N_m$) is represented by the gene group $M_l = \{g^m_1, g^m_2, \ldots, g^m_{n_l}\}$, and each functional pathway $P_t$ ($t = 1, 2, \ldots, N_p$) is represented by the gene group $P_t = \{g^t_1, g^t_2, \ldots, g^t_{n_t}\}$.

To weight the influence of diseases on pathways, we first define the access efficiency $AE(D_k, M_l)$ of each disease within each topological module. In this definition, $|M_l| = n_l$ is the size of $M_l$, $d_{ij}$ is the length of the shortest path between $g_i$ and $g_j$ in the LCC, and the indicator $\delta(g_i, M_l)$ has the value 1 for all $g_i$ in $M_l$ and the value 0 for all $g_i$ not in $M_l$. $AE(D_k, M_l)$ describes the summation of the closeness centrality of the disease genes within the module $M_l$. The more disease genes lie within module $M_l$, the more significant a role $M_l$ plays in developing the disease. We then define the relevance of modules and functional pathways by the Jaccard similarity coefficient:
$$JSC(M_l, P_t) = \frac{|M_l \cap P_t|}{|M_l \cup P_t|},$$
where $|M_l \cap P_t|$ is the size of the intersection of $M_l$ and $P_t$, and $|M_l \cup P_t|$ is the size of their union. $JSC(M_l, P_t)$ describes the similarity between the finite gene groups $M_l$ and $P_t$. This measure is selected mainly because it provides an intuitive way to characterize set similarity. The impact score of disease $D_k$ on functional pathways is built on these two quantities, with $IS(D_k) = (IS(D_k, P_1), IS(D_k, P_2), \ldots, IS(D_k, P_{N_p}))$, where $IS(D_k, P_t)$ is the impact score on pathway $P_t$, and $IS(D_k)$ is the impact score vector on pathways by disease $D_k$.

To classify diseases by the intensity of effects on pathways, we use the normalized vector $IS_N(D_k)$ as the impact score vector of disease $D_k$, where
$$IS_N(D_k, P_t) = \frac{IS(D_k, P_t) - \min_s IS(D_k, P_s)}{\max_s IS(D_k, P_s) - \min_s IS(D_k, P_s)}$$
and $IS_N(D_k) = (IS_N(D_k, P_1), IS_N(D_k, P_2), \ldots, IS_N(D_k, P_{N_p}))$. The distance between two diseases is defined by the Euclidean distance to describe the difference in the intensity of effects on pathways:
$$\mathrm{distance}_{\mathrm{euclidean}}(D_i, D_j) = \sqrt{\sum_{t=1}^{N_p} \left( IS_N(D_i, P_t) - IS_N(D_j, P_t) \right)^2}.$$

To classify diseases by the scope of effects on pathways, we use the binary vector $IS_B(D_k)$ as the impact score vector of disease $D_k$, where
$$IS_B(D_k, P_t) = \begin{cases} 1, & IS(D_k, P_t) > \frac{1}{N_p} \sum_{s=1}^{N_p} IS(D_k, P_s), \\ 0, & \text{otherwise}, \end{cases}$$
and $IS_B(D_k, P_t)$ puts emphasis on the pathways that are affected beyond average. The distance between two diseases is defined by the Manhattan distance to describe the proportion of pathways that are affected differently by the two diseases:
$$\mathrm{distance}_{\mathrm{manhattan}}(D_i, D_j) = \frac{1}{N_p} \sum_{t=1}^{N_p} \left| IS_B(D_i, P_t) - IS_B(D_j, P_t) \right|.$$

royalsocietypublishing.org/journal/rsos R. Soc. open sci. 6: 190214
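The chain from disease genes through modules to pathways can be sketched in a few lines of plain Python. The exact forms of the access efficiency and of the impact score are not recoverable from the text here, so two assumptions are made and flagged in the code: the closeness centrality of a disease gene within a module is taken as $(|M_l| - 1)$ divided by the sum of its shortest-path lengths to the other module genes, and the impact score on a pathway is taken as the sum over modules of access efficiency times Jaccard similarity. All function names and the toy network are hypothetical.

```python
from collections import deque
from math import sqrt


def bfs_distances(adj, source):
    """Shortest-path lengths from source in an unweighted, connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def access_efficiency(adj, disease_genes, module):
    """ASSUMED form: sum, over disease genes inside the module, of
    (|M| - 1) / (sum of shortest-path lengths to the other module genes)."""
    total = 0.0
    for g in disease_genes:
        if g not in module or len(module) < 2:
            continue  # indicator delta(g, M) is 0, or closeness is undefined
        dist = bfs_distances(adj, g)
        total += (len(module) - 1) / sum(dist[h] for h in module if h != g)
    return total


def jaccard(a, b):
    """Jaccard similarity coefficient of two gene sets."""
    return len(a & b) / len(a | b)


def impact_scores(adj, disease_genes, modules, pathways):
    """ASSUMED combination: IS(D, P_t) = sum_l AE(D, M_l) * JSC(M_l, P_t)."""
    ae = [access_efficiency(adj, disease_genes, m) for m in modules]
    return [sum(a * jaccard(m, p) for a, m in zip(ae, modules)) for p in pathways]


def min_max(scores):
    """Rescale so the most affected pathway scores 1 and the least scores 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy network (hypothetical): a path 1-2-3-4-5-6 with two modules, three pathways.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
modules = [{1, 2, 3}, {4, 5, 6}]
pathways = [{1, 2}, {3, 4}, {5, 6}]

norm_a = min_max(impact_scores(adj, {2}, modules, pathways))  # disease with gene 2
norm_b = min_max(impact_scores(adj, {5}, modules, pathways))  # disease with gene 5
print(norm_a, norm_b, euclidean(norm_a, norm_b))
```

In this toy case the two diseases sit in different modules, so their normalized score vectors peak on opposite pathways and the Euclidean distance between them is large.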

Classification of diseases
Given a partition $P = \{P_1, P_2, \ldots, P_s\}$ of diseases, we construct a symmetric matrix $Distance_{s \times s}$, where $Distance(i, i)$ is the average distance between diseases in $P_i$ and $Distance(i, j)$ is the average distance between diseases in $P_i$ and $P_j$. Let $Difference_P$ represent the difference between the average distance of diseases within partitions and the average distance of diseases between partitions.
Using the Ward.D2 method [34], we obtain a hierarchical clustering of diseases by distance. A given dendrogram is regarded as a series of partitions $\{P_1, P_2, \ldots, P_{N_d-1}\}$ in the order of merging. We consider the partitions $\{P_r, P_{r+1}, \ldots, P_{N_d-1}\}$ in which every disease has been merged at least once, and cut the dendrogram at $P_m$, $r \le m \le N_d - 1$, where the corresponding $Difference_{P_m}$ is the minimum.
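The cut criterion can be illustrated with a small sketch. The exact formula for the difference is not given in the text here, so the code assumes one plausible reading: the mean of the diagonal entries of the group distance matrix minus the mean of its off-diagonal entries. The helper names, the convention that a singleton group has within-distance 0, and the toy data are all assumptions for illustration.

```python
from itertools import combinations


def avg(values):
    """Mean of an iterable; 0.0 for an empty one (e.g. a singleton group)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def group_distance_matrix(groups, dist):
    """Symmetric s-by-s matrix of average within- and between-group distances."""
    s = len(groups)
    matrix = [[0.0] * s for _ in range(s)]
    for i in range(s):
        matrix[i][i] = avg(dist(a, b) for a, b in combinations(groups[i], 2))
        for j in range(i + 1, s):
            value = avg(dist(a, b) for a in groups[i] for b in groups[j])
            matrix[i][j] = matrix[j][i] = value
    return matrix


def difference(groups, dist):
    """ASSUMED form: mean within-group distance minus mean between-group distance."""
    matrix = group_distance_matrix(groups, dist)
    s = len(matrix)
    within = avg(matrix[i][i] for i in range(s))
    between = avg(matrix[i][j] for i in range(s) for j in range(i + 1, s))
    return within - between


# Toy data (hypothetical): four items on a line, forming two obvious groups.
position = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0}


def dist(x, y):
    return abs(position[x] - position[y])


good = [["a", "b"], ["c", "d"]]
bad = [["a", "c"], ["b", "d"]]
print(difference(good, dist), difference(bad, dist))
```

Under this reading, a partition that keeps close items together yields a smaller (more negative) value, so scanning the candidate partitions of a dendrogram and keeping the one with the minimum value selects the cut.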

Results
Different from previous works such as classifying diseases according to the gene expression data of patient samples [28], or understanding disease gene and phenotype associations by constructing a bipartite graph consisting of diseases and genes [22], we assess the effects of diseases on human biological functions and define the distances between diseases to carry out the classification models from the perspective of gene network modules and pathways. The genes and interactions are collected from NCBI to construct the human gene network. Then the human pathways and diseases associated with human genes are collected from KEGG, which are considered together with gene network modules to develop the disease classification models. Two disease classification models are proposed for different interests.

Genes and interactions: underlying framework of the human gene network
We obtain human gene information from NCBI, in which 16 types of information on 60 674 human genes are listed, including GeneID, symbol, synonyms, type of gene, etc. Gene-gene interaction information is also acquired from NCBI (see details in Methods). Each record is supported by one or more publications (unique PubMed IDs). As shown in table 1, a total of 321 715 interactions including at least one human gene are screened out; specifically, 17 430 human genes are involved in at least one interaction, of which 17 309 human genes are recorded in 289 946 interactions with human genes and 7590 human genes are recorded in 31 736 interactions with 4960 non-human genes from 83 species.

We select 317 human pathways that include gene information from KEGG (see details in Methods). For each of the six types of pathways, one pathway subject to significant influence by diseases is selected as an example to illustrate the subnetwork consisting of pathway genes (figure 1). In total, 6547 of 7409 human pathway genes belong to the LCC (figure 2c). The number of pathways in which a gene is involved follows a power-law distribution (figure 2b) with exponent $\gamma = 1.966$. This suggests that a minority of genes are involved in a large number of pathways to participate in cellular activities. In the 85 metabolism pathways, the majority of genes do not interact with non-human genes. However, in the other 232 human pathways, most genes interact with non-human genes (figure 2d,e). As expected, human pathway genes that interact with both human and non-human genes (component IV) are more crucial. The degree of human genes that interact with both human and non-human genes is much larger than that of human genes that interact only with human genes. In particular, component IV contains many hubs (table 3), since its median, average and maximum degrees are all much larger than those of components I, II and III. Besides, the genes in component IV interact with more non-human genes and species, which shows genetic diversity.
For example, TRIM25 is the maximum-degree gene, and its protein is a bona fide RNA-binding protein that is associated with many proteins involved in RNA metabolism and interacts with numerous coding and non-coding transcripts [35]. TRIM25 plays a key role in the signalling pathway of RIG-I, a cytosolic pattern recognition receptor that senses viral RNA [36]. Additionally, TRIM25 is involved in normal development and diseases in association with the estrogen response [37]. Moreover, interactions with 589 non-human genes of 23 non-human species, including mammals and viruses, provide a broad understanding of UBC. The UBC gene is one of the two stress-regulated polyubiquitin genes (UBB and UBC) in mammals and plays a key role in maintaining cellular ubiquitin levels under stress conditions [38,39]. Ubiquitination has been associated with protein degradation, DNA repair, cell cycle regulation, kinase modification, endocytosis and regulation of other cell signalling pathways [40][41][42]. Cells require either UBB or UBC for survival [43]. Furthermore, MAPK1 and MAPK3 are involved in 98 and 97 human gene pathways, respectively, more than any other gene. The proteins encoded by MAPK1 and MAPK3 are involved in a wide variety of cellular processes such as proliferation, differentiation, transcription regulation and development [44].
In total, 1728 diseases (with known disease genes) of 15 primary classifications are screened out in KEGG DISEASE (see details in Methods). Table 3 shows that the proportion of disease genes is larger in pathway genes (components III and IV) than in non-pathway genes (components I and II), and that disease genes that appear in pathways are involved in more diseases. The average degree of disease genes is larger than that of all genes; however, most disease genes are not hubs, as indicated by their low median degree, which is consistent with the idea that the majority of disease genes are non-essential and do not encode hub proteins [22].

Module partition: bridge between individual genes and pathways
Modularity, proposed by Newman, is one measure of the structure of networks [31]. A module reflects the local characteristics of individual behaviours in the network and their interrelationships. The modules of the network under study play a vital role in understanding the structure and function of the entire network, and can help us analyse and predict the interactions between its elements. Biological networks exhibit a high degree of modularity. To understand the network-based position of disease genes, Barabási et al. have reviewed three modularity concepts: topological modules, functional modules and disease modules [32].
In this part, the LCC is considered. The fast unfolding algorithm proposed by Blondel et al. in 2008 is recognized as one of the fastest and most accurate non-overlapping community discovery algorithms, especially for networks of unprecedented sizes [45]. The LCC is divided into 10 modules. Most hubs are assigned to different modules, where they play a role in implementing some function. For example, in module 1, ELAVL1 has been implicated in a variety of biological processes and has been linked to a number of diseases, including cancer. It is highly expressed in many cancers, and could be potentially useful in cancer diagnosis, prognosis and therapy. In module 2, mutations in APP have been implicated in autosomal dominant Alzheimer's disease and cerebroarterial amyloidosis. Besides, EGFR is a cell surface protein that binds to epidermal growth factor and is associated with cell proliferation. In module 3, the tumour suppressor gene TP53 and the ubiquitin gene UBC are associated with cell cycle regulation, apoptosis, senescence, DNA repair, protein degradation or changes in metabolism.

To assess the overlap of functional pathways and topological modules, we compute the proportion vector $\lambda_i = (\lambda_i^1, \lambda_i^2, \ldots, \lambda_i^{10})$ for each pathway $i$ ($i = 1, 2, \ldots, 317$) in the 10 modules, and apply the adjusted cosine similarity (see details in Methods) to measure the correlations of pathway gene distributions across network modules.
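A minimal sketch of an adjusted cosine similarity between pathway proportion vectors is given below. It assumes the common form of the adjustment, in which the mean proportion of each module over all pathways is subtracted before the cosine is taken; whether the original analysis centres on this mean or on another per-module baseline is not stated here. The function name and the toy proportions are hypothetical.

```python
from math import sqrt


def adjusted_cosine(vectors, i, j):
    """Cosine similarity of rows i and j after centring every column.

    vectors: list of equal-length proportion vectors, one per pathway.
    ASSUMPTION: the adjustment subtracts, for each module (column),
    the mean proportion over all pathways.
    """
    n = len(vectors)
    dims = len(vectors[0])
    col_mean = [sum(v[l] for v in vectors) / n for l in range(dims)]
    a = [vectors[i][l] - col_mean[l] for l in range(dims)]
    b = [vectors[j][l] - col_mean[l] for l in range(dims)]
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Toy proportions (hypothetical): three pathways over three modules.
props = [
    [0.7, 0.2, 0.1],  # pathway 0, concentrated in the first module
    [0.6, 0.3, 0.1],  # pathway 1, similar profile to pathway 0
    [0.1, 0.2, 0.7],  # pathway 2, concentrated in the last module
]

sim_01 = adjusted_cosine(props, 0, 1)
sim_02 = adjusted_cosine(props, 0, 2)
print(sim_01, sim_02)
```

Centring by column means that two pathways count as similar only when they deviate from the typical module distribution in the same direction, which is the point of adjusting for unequal module sizes.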
Figure 3 shows the hierarchical clustering of the correlations between the 317 pathways, which yields six groups. Most metabolism pathways gather in Group 1 and Group 3, most genetic information processing pathways gather in Group 4, while the other four types of pathways fall into more diverse groups (Groups 2, 5, 6). This means that vital differences exist between the metabolism and genetic information processing pathways on the one hand and the other four types of pathways on the other. The metabolism pathway genes are mainly distributed in modules 1, 2 and 7, with above-average proportions in modules 1, 7 and 8. The genetic information processing pathway genes are mainly distributed in modules 1, 3 and 4, with above-average proportions in modules 3 and 4. The genes of the other four types of pathways are mainly in module 2 as well as modules 1 and 3, with above-average proportions in modules 1, 2, 3 or 8 (table 4). The tacit assumption in network medicine is that the topological, functional and disease modules overlap, so that functional modules correspond to topological modules and a disease can be viewed as the breakdown of some particular functional modules [32,33].

Table 3. Statistical indicators of human genes and disease genes in four types of genes (I, II, III, IV). The proportion of disease genes is larger in pathway genes (III and IV) than non-pathway genes (I and II), and disease genes appeared in pathways involved in more diseases. The average degree of disease genes is larger than that of all genes; however, most disease genes are not hubs, resulting from low median degree. Values in italics are significantly increased in the corresponding rows.

Table 4. Module division obtained by the fast unfolding algorithm, and the proportion of six types of pathway genes in each of the 10 modules. The metabolism pathway genes have above-average proportions in modules 1, 7 and 8. The genetic information processing pathway genes have above-average proportions in modules 3 and 4. The genes of the other four types of pathways have above-average proportions in modules 1, 2, 3 or 8. Values in italics are those greater than the proportion of genes in each column.

A topological module represents a locally dense neighbourhood in a network, such that nodes have a higher tendency to link to nodes within the same local neighbourhood than to nodes outside it. A functional module represents the aggregation of nodes of similar or related function in the same network neighbourhood, where function captures the role of a gene in defining detectable phenotypes. Finally, a disease module represents a group of network components that together contribute to a cellular function and disruption of which results in a particular disease phenotype [32].
In this work, 10 topological modules are identified by network clustering algorithms; 317 pathways represent the function modules. For each of 1728 diseases, the collection of disease genes is regarded as a disease module. We perform different classification models of diseases for different interests. In the past, classification of diseases was mainly based on congenital and acquired factors as well as diseased tissues and organs, which did not help people to realize the influence on integrated pathway functions. Here, we consider disease genes, topological modules and functional pathways to assess the influence of diseases on pathways of human body (see details in Methods).

Classification by the intensity of effects on pathways
For each of the 1728 diseases, we calculate and normalize a score vector $IS_N(D_k)$ corresponding to the 317 pathways as the intensity of effects on pathways, and use the Euclidean distance to measure the distance between two diseases. Hierarchical clustering of the 1728 diseases by the intensity of effects on pathways yields 12 categories (see figure 4; detailed clustering results are available in electronic supplementary material), as the curve of $Difference_P$ reaches its minimum when 1716 merges have been conducted (figure 6a). The distance matrix between the 1728 diseases arranged in the clustering order corresponding to the dendrogram is illustrated in figure 6c. Each of the 12 categories is a fairly heterogeneous mix of several types of diseases, which suggests that diseases of different pathological classifications may have a similar intensity of effects on pathways. Cancers are mainly grouped in CATG 1, as well as in CATG 2, CATG 3 and CATG 4. Most cancers in CATG 2 and CATG 4 are of haematopoietic and lymphoid tissues, and of soft tissues and bone. Only three cancers (anaplastic large-cell lymphoma, nasopharyngeal cancer and neuroblastoma) are grouped in CATG 3. In CATG 1, many other diseases, such as Fanconi anaemia and type II diabetes mellitus, are grouped together with many cancers, showing that these diseases affect human pathways in a similar way to, and as seriously as, cancers. Fanconi anaemia (FA) is a genetic disorder that is characterized by bone marrow failure, developmental abnormalities and predisposition to cancer. Monoallelic inactivation of some FA genes, such as FA complementation group D1 (FANCD1, also known as the breast and ovarian cancer susceptibility gene BRCA2), leads to adult-onset cancer predisposition but does not cause FA, and somatic mutations in FA genes occur in cancers in the general population. Studies of FA have revealed opportunities to develop rational therapeutics for this genetic disease and for malignancies that acquire somatic mutations within the FA pathway [46,47]. Type II diabetes mellitus is a long-term metabolic disorder, and the relationship between type II diabetes mellitus and cancers has long been studied. Several studies have suggested that diabetes mellitus may alter the risk of developing a variety of cancers, and the associations are biologically plausible [48].
Coughlin et al. [48] suggest that diabetes is an independent predictor of mortality from cancer of the colon, pancreas, female breast, and, in men, of the liver and bladder. Yang et al. [49] show that chronic insulin therapy significantly increases the risk of colorectal cancer among type II diabetes mellitus patients. Huxley et al. [50] conduct a meta-analysis whose results support a modest causal association between type II diabetes and pancreatic cancer.

Diseases in CATG 1, CATG 3 and CATG 4 mainly affect functions such as the regulation of cell proliferation, survival, growth, migration, differentiation and adhesion, by affecting the Ras signalling pathway, Rap1 signalling pathway, MAPK signalling pathway, Jak-STAT signalling pathway and PI3K-Akt signalling pathway. Diseases in CATG 6, most of which are metabolic diseases, mainly affect the oxidative phosphorylation pathway, neuroactive ligand-receptor interaction pathway, thermogenesis pathway, Alzheimer's disease pathway, Parkinson's disease pathway and Huntington's disease pathway. However, Alzheimer's disease and Huntington's disease are in CATG 4 and Parkinson's disease is in CATG 3. Compared with CATG 6, diseases in CATG 8 also affect many genetic information processing pathways and environmental information processing pathways. Diseases in CATG 2 are associated with virus infection and carcinogenesis, as they mainly affect the cell cycle pathway, cellular senescence pathway, transcriptional misregulation in cancer pathway, viral carcinogenesis pathway, HTLV-I infection pathway, Epstein-Barr virus infection pathway and human papillomavirus infection pathway. Diseases in CATG 7 mainly affect metabolism pathways and genetic information processing pathways, such as the glycolysis/gluconeogenesis pathway, purine metabolism pathway, protein processing in endoplasmic reticulum pathway and ubiquitin-mediated proteolysis pathway. Moreover, the endocytosis pathway is strongly affected. Diseases in CATG 10 mainly affect the RNA transport pathway, mRNA surveillance pathway, Hippo signalling pathway, endocytosis pathway, oocyte meiosis pathway, tight junction pathway, adrenergic signalling in cardiomyocytes pathway, dopaminergic synapse pathway and human papillomavirus infection pathway. Diseases in CATG 9 mainly affect genetic information processing pathways including the spliceosome pathway, ribosome pathway and RNA transport pathway. Diseases in CATG 5 mainly affect the complement and coagulation cascades pathway. Diseases in CATG 11 mainly affect the Ras signalling pathway, tight junction pathway and microRNAs in cancer pathway. Diseases in CATG 12 mainly affect the ubiquitin-mediated proteolysis pathway.

Classification by the scope of effects on pathways
For each of the 1728 diseases, we use a binary vector $IS_B(D_k)$ corresponding to the 317 pathways as the scope of effects on pathways, and use the Manhattan distance to measure the distance between two diseases. Hierarchical clustering of the 1728 diseases by the scope of effects on pathways yields 14 categories (see figure 5; detailed clustering results are available in electronic supplementary material), as the curve of $Difference_P$ reaches its minimum when 1714 merges have been conducted (figure 6b). The distance matrix between the 1728 diseases arranged in the clustering order corresponding to the dendrogram is illustrated in figure 6d.
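The scope-based representation can be sketched as follows: each disease's impact scores are thresholded at their own average, and two diseases are compared by the fraction of pathways on which the resulting binary vectors disagree. The function names and score values are hypothetical, and the strict "greater than average" threshold is an assumption drawn from the description of pathways affected beyond average.

```python
def binary_scope(scores):
    """Mark with 1 the pathways whose score exceeds the disease's own average."""
    mean = sum(scores) / len(scores)
    return [1 if s > mean else 0 for s in scores]


def manhattan_proportion(u, v):
    """Manhattan distance divided by the dimension: the fraction of
    pathways on which two binary vectors differ."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


# Hypothetical impact scores of two diseases on four pathways.
scores_a = [0.9, 0.1, 0.4, 0.0]
scores_b = [0.2, 0.8, 0.7, 0.1]

scope_a = binary_scope(scores_a)
scope_b = binary_scope(scores_b)
print(scope_a, scope_b, manhattan_proportion(scope_a, scope_b))
```

Because the threshold is relative to each disease's own average, the binary vector records which pathways stand out for that disease rather than how strongly they are affected in absolute terms.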
Diseases in CATG I mainly affect nucleotide metabolism pathways, signal transduction pathways, cellular community pathways, cell motility pathways, development pathways, cancers pathways, viral infectious diseases pathways. Diseases in CATG II mainly affect nucleotide metabolism pathways, replication and repair pathways, cancers: specific types pathways, viral infectious diseases pathways. Diseases in CATG III mainly affect signal transduction pathways, signalling molecules and interaction pathways, cellular community pathways, cell motility pathways, immune system pathways, development pathways, cancers pathways. Diseases in CATG IV mainly affect nucleotide metabolism pathways, development pathways, viral infectious diseases pathways. Diseases in CATG V mainly affect signalling molecules and interaction pathways, cellular community pathways, cell motility pathways, circulatory system pathways, development pathways, viral infectious diseases pathways. Diseases in CATG VI mainly affect xenobiotics biodegradation and metabolism pathways, signalling molecules and interaction pathways, cell motility pathways. Diseases in CATG VII mainly affect membrane transport pathways, signalling molecules and interaction pathways. Diseases in CATG VIII mainly affect carbohydrate metabolism pathways, nucleotide metabolism pathways, metabolism of terpenoids and polyketides pathways, xenobiotics biodegradation and metabolism pathways, cell motility pathways, ageing pathways. Diseases in CATG IX mainly affect translation pathways. Diseases in CATG X mainly affect nucleotide metabolism pathways, transport and catabolism pathways, cell motility pathways. Diseases in CATG XI mainly affect membrane transport pathways, cellular community pathways, cell motility pathways, development pathways, ageing pathways, cancers pathways. 
Diseases in CATG XII mainly affect nucleotide metabolism pathways, cellular community pathways, cell motility pathways, circulatory system pathways, development pathways, ageing pathways, environmental adaptation pathways, cancers pathways. Diseases in CATG XIII mainly affect translation pathways, cell motility pathways, circulatory system pathways. Diseases in CATG XIV mainly affect nucleotide metabolism pathways, xenobiotics biodegradation and metabolism pathways, membrane transport pathways, cell motility pathways.

Associations and differences between the two classifications
In this paper, we integrate disease genes, topological modules and functional pathways to assess the influence of diseases on the pathways of the human body. As for the associations between the two classifications, the first is that both of them are based on the impact score, which is the basic metric in this paper (more details can be found in §2.4). The second is the criteria for classification (more details can be found in §2.5 and figure 6). The third is that most disease pairs are grouped together in both classifications (illustrated in grey numbers in table 5, as the maximum either in the row or in the column); however, many of them are not in the same group in the KEGG DISEASE database (figure 7a). For example, the middle of figure 7a shows that many cancers have common disease genes and are also classified into one group in our classification. What is more, giant cell tumour of bone (#699, classified as a musculoskeletal disease in KEGG DISEASE, see the electronic supplementary material) is closely connected with these cancers. Giant cell tumour of bone is a relatively uncommon tumour of the bone. Malignancy in giant cell tumour is uncommon; however, if malignant degeneration does occur, it is likely to metastasize to the lungs [51]. So it is reasonable to group giant cell tumour of bone with many cancers including non-small cell lung cancer (#19) and small cell lung cancer (#20).
On the right of figure 7a, congenital muscular dystrophies (#711 in musculoskeletal diseases), muscular dystrophy-dystroglycanopathy type A (#717 in musculoskeletal diseases, #950 in congenital disorders of metabolism, #1197 in congenital malformations), muscular dystrophy-dystroglycanopathy type B (#718 in musculoskeletal diseases, #951 in congenital disorders of metabolism), muscular dystrophy-dystroglycanopathy type C (#719), Fukuyama congenital muscular dystrophy (#720), congenital muscular dystrophy type 1C (#721 in musculoskeletal diseases, #952 in congenital disorders of metabolism) and congenital muscular dystrophy type 1D (#722 in musculoskeletal diseases) are closely connected with each other. The fact that one disease may be classified into more than one category in the KEGG DISEASE database motivates us to find a method for understanding the relationships between different categories of disease at the genetic level.
Figure 6. Corresponding to the dendrogram in figure 4, partitions $\{P_{1659}, P_{1660}, \ldots, P_{1727}\}$ are considered, in which every disease is merged at least once. Difference P attains its minimum when 1716 merges are conducted, revealing 12 categories (a), and the distance matrix between 1728 diseases arranged in the clustering order is illustrated in (c). Corresponding to the dendrogram in figure 5, partitions $\{P_{1658}, P_{1659}, \ldots, P_{1727}\}$ are considered. Difference P attains its minimum when 1714 merges are conducted, revealing 14 categories (b), and the distance matrix between 1728 diseases arranged in the clustering order is illustrated in (d).
When it comes to differences between the two classifications, the first is that they emphasize the impact of disease on pathways from different perspectives. To classify diseases by the intensity of effects on pathways, we use the normalized vector $IS_N(D_k)$ to measure the difference in intensity between the pathways. The score of the most affected pathway is 1, and the score of the least affected pathway is 0. To classify diseases by the scope of effects on pathways, we use the binary vector $IS_B(D_k)$ to mark pathways with a score exceeding the average. The second is the distance of diseases. In the first classification, the distance of diseases $D_i$ and $D_j$ is defined as the Euclidean distance of $IS_N(D_i)$ and $IS_N(D_j)$. In the second classification, the distance of diseases $D_i$ and $D_j$ is defined as the Manhattan distance of $IS_B(D_i)$ and $IS_B(D_j)$. Note that for two binary vectors, the following equation holds: $\mathrm{distance}_{\mathrm{manhattan}} \times \mathrm{dimension}_{\mathrm{vector}} = (\mathrm{distance}_{\mathrm{euclidean}})^2$. The third is that there exist some disease pairs that are grouped differently in the two classifications (figure 7b,c).
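The two vectors and their distances can be illustrated with a minimal Python sketch. It rests on several assumptions that are not confirmed by the text: the normalization is taken to be min-max scaling, the average used for binarization is taken over the pathways of a single disease, and the impact scores below are hypothetical. The factor "dimension" in the identity above suggests that the Manhattan distance is normalized by the vector length, so both the raw and the normalized forms are checked.

```python
import math

def normalize(scores):
    """Min-max scale: most affected pathway -> 1, least affected -> 0 (assumed form of IS_N)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # assumption: a flat vector maps to all zeros
    return [(s - lo) / (hi - lo) for s in scores]

def binarize(scores):
    """Mark pathways whose score exceeds the average (assumed form of IS_B)."""
    mean = sum(scores) / len(scores)
    return [1 if s > mean else 0 for s in scores]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical impact scores of two diseases over four pathways.
is_i = [0.0, 2.0, 4.0, 1.0]
is_j = [3.0, 0.0, 1.0, 6.0]

# First classification: Euclidean distance between normalized vectors.
d_intensity = euclidean(normalize(is_i), normalize(is_j))

# Second classification: Manhattan distance between binary vectors.
b_i, b_j = binarize(is_i), binarize(is_j)
d_scope = manhattan(b_i, b_j)

# For binary vectors each coordinate difference is 0 or 1, so |x| = x^2
# and the raw Manhattan distance equals the squared Euclidean distance.
assert math.isclose(d_scope, euclidean(b_i, b_j) ** 2)

# Assumption: if the Manhattan distance is divided by the dimension,
# multiplying back by the dimension recovers the same identity.
d_scope_normalized = d_scope / len(b_i)
assert math.isclose(d_scope_normalized * len(b_i), euclidean(b_i, b_j) ** 2)
```

Because the binary vectors discard magnitudes, two diseases can be close in one distance and far apart in the other, which is why some pairs are grouped differently in the two classifications.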
If diseases are grouped together only in the first classification, their impact on function is concentrated in certain shared pathways, but each disease also uniquely affects several other pathways. The pair of type II diabetes mellitus (#536) and basal cell carcinoma (#23) is an example. Other disease pairs grouped together only in the first classification are hypertrophic cardiomyopathy (#387) and dilated cardiomyopathy (#389); retinitis pigmentosa (#284) and Leber congenital amaurosis (#306); and deafness, autosomal dominant (#346) and deafness, autosomal recessive (#347) (figure 7b).
If diseases are grouped together only in the second classification, they may affect similar pathways, but with different intensities. For example, laryngeal cancer (#6), fallopian tube cancer (#44), chronic myeloid leukaemia (#77) and myelodysplastic syndrome (#506) are four of the nine diseases that are grouped together in CATG 2 and CATG I (figure 7c and table 5), which marks them as a special case. In treating these diseases, attention should be paid to the pathways they affect, although the focus differs from disease to disease. Additionally, glycogen storage diseases (#835) and hepatic glycogen storage disease (#836), and progressive external ophthalmoplegia (#330, #1145) and mitochondrial DNA depletion syndrome (#1138), are other disease pairs that are grouped together only in the second classification (figure 7c).
Table 5. The overlap of diseases between the two classifications. The numbers in grey are the maximum either in the row or in the column, representing the correspondence between the two classifications. Other positive numbers imply that a minority of diseases are grouped differently in the two classifications.

Discussion
In 2016, the precision medicine initiative was announced to help enable a new era of personalized care through cooperative efforts by researchers, clinicians and patients. Since then, researchers have been trying to determine which set of technologies and disciplines would afford the highest level of efficacy in the development of precision medicine [52]. Genomics and other molecular analyses at the omics level are rapidly growing fields with the potential to have a profound impact on medical practice. Many studies focus on the underlying molecular mechanisms of diseases [26,29,53], provide predictive, prognostic, diagnostic and surrogate markers of diverse disease states [54,55], or classify patient samples based on molecular data [27,28].
Most of the successful studies building on these new approaches have focused on a single disease or a class of diseases to gain a better understanding. Recent progress in genetics and genomics has led to an appreciation of the effects of gene mutations in virtually all disorders and provides the opportunity to study human diseases all at once rather than one at a time. Under the key hypothesis that a disease phenotype is rarely a consequence of an abnormality in a single effector gene product but reflects various pathobiological processes that interact in a complex network [32], network-based approaches offer the possibility of discerning general patterns and correlations of human disease not readily apparent from the study of individual disorders [22]. To address a fundamental challenge of modern biomedical research, namely understanding how diseases that are similar at the phenotypic level are similar at the molecular level [4], we focus on classifying diseases based on the pathways they impact and on providing evidence to support combination therapy between diseases. The distances between diseases based on the human gene network and pathways are defined to evaluate the similarity of diseases, and even the similarity of gene-targeted therapies.
It is generally accepted that disease genes affect pathway functions through topological modules [32,33]. Therefore, we derive the influence of diseases on pathways by calculating the inner product of the following two vectors. The first vector measures the propagation efficiency of specific disease signals in each of the modules. Each component of the vector is the sum of the closeness centrality of the disease genes within the corresponding module. Mathematically, the technique of network propagation is simplifying and unifying. It is a powerful data transformation method of broad utility in genetic research, since it greatly improves the power of genetic association, providing a universal amplifier for genetic analysis [56]. The second vector measures the relevance between the modules and a pathway. Each component of the vector is the Jaccard similarity coefficient of the module and the corresponding pathway, indicating the extent of overlap of two large gene sets. This measure is selected mainly because it has been shown to perform well in comparing two sets of nodes of different sizes [57]. Moreover, it provides an intuitive way to characterize set similarity.
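The inner product described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the closeness centrality values are supplied as a precomputed dictionary (how they are computed, for example on the whole network or within a module, is not specified here), and the gene names, modules and pathways are hypothetical toy data.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def impact_scores(disease_genes, modules, pathways, closeness):
    """Impact score of one disease on each pathway.

    disease_genes: genes associated with the disease
    modules:       list of gene sets (topological modules)
    pathways:      dict mapping pathway name -> gene set
    closeness:     dict mapping gene -> closeness centrality (assumed precomputed)
    """
    disease_genes = set(disease_genes)
    # First vector: per module, the sum of closeness centrality of the
    # disease genes that fall inside that module.
    propagation = [
        sum(closeness[g] for g in disease_genes & set(m)) for m in modules
    ]
    scores = {}
    for name, genes in pathways.items():
        # Second vector: Jaccard similarity between each module and this pathway.
        relevance = [jaccard(m, genes) for m in modules]
        # Inner product of the two vectors.
        scores[name] = sum(p * r for p, r in zip(propagation, relevance))
    return scores

# Hypothetical toy data.
modules = [{"A", "B", "C"}, {"D", "E"}]
pathways = {"p1": {"A", "B"}, "p2": {"C", "D", "E", "F"}}
closeness = {"A": 0.5, "B": 0.4, "C": 0.3, "D": 0.2, "E": 0.1}

scores = impact_scores({"A", "D"}, modules, pathways, closeness)
```

In this toy case the disease has one gene in each module, so a pathway scores highly when it overlaps strongly with a module in which the disease gene is central.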
Hierarchical clustering has been the dominant approach to constructing classification schemes, and much early work on hierarchical clustering was in the field of biological taxonomy from the 1950s and more so from the 1960s onwards [58]. The dendrogram expresses many of the proximity and classificatory relationships in a body of data. To answer the question 'how many groups are there?', we defined a parameter that measures the difference between the average distance of diseases within groups and the average distance of diseases between groups, playing the same role as modularity in module partition.
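The idea of cutting the dendrogram where such a parameter is smallest can be sketched as follows. The details are assumptions made for illustration: average linkage is used for merging, the parameter is taken to be the mean within-group distance minus the mean between-group distance, and every partition along the merge sequence is considered. The exact definition and the set of partitions considered are those given in §2.5 and figure 6.

```python
from itertools import combinations

def average_linkage_partitions(dist):
    """Yield the partition obtained after each agglomerative merge (average linkage)."""
    clusters = [[i] for i in range(len(dist))]

    def mean_distance(pair):
        a, b = clusters[pair[0]], clusters[pair[1]]
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest mean pairwise distance.
        a, b = min(combinations(range(len(clusters)), 2), key=mean_distance)
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        yield [list(c) for c in clusters]

def difference(dist, partition):
    """Mean within-group distance minus mean between-group distance (assumed form)."""
    label = {i: g for g, members in enumerate(partition) for i in members}
    within, between = [], []
    for i, j in combinations(range(len(dist)), 2):
        (within if label[i] == label[j] else between).append(dist[i][j])
    if not within or not between:
        return float("inf")  # the parameter is undefined for degenerate partitions
    return sum(within) / len(within) - sum(between) / len(between)

def best_partition(dist):
    """Partition along the merge sequence that minimizes the difference parameter."""
    return min(average_linkage_partitions(dist), key=lambda p: difference(dist, p))

# Hypothetical one-dimensional positions forming three tight pairs.
points = [0, 1, 10, 11, 20, 21]
dist = [[abs(p - q) for q in points] for p in points]
groups = sorted(sorted(g) for g in best_partition(dist))
```

On this toy input the minimum is reached at the partition into three pairs, which mirrors how the minimum of the parameter selects the number of categories from the dendrogram.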
In total, 1728 diseases collected from KEGG are divided into 12 categories by the intensity of effects on pathways, and into 14 categories by the scope of effects on pathways. Each category is a mix of several types of diseases identified based on congenital and acquired factors as well as diseased tissues and organs. Among the 15 types of diseases, cancers are spread over the smallest number of disease categories, which suggests that cancers are similar in having a great impact on pathways, because almost all cancers involve multiple genes. For monogenic diseases, the disease gene itself is regarded as the disease module, so that the distance between two diseases is zero when their disease genes belong to the same topological module, and the two diseases are grouped together. Otherwise, the two diseases are divided into different categories. In general, the number of disease categories will be larger than the number of topological modules. Our results imply that the human gene network offers a useful perspective on disease classification.
The method for deriving the results of this paper is based on the topological structure of the gene interaction network, especially the LCC and the module division results. After the completion of the Human Genome Project, the number of new genes discovered in the future should be very small. However, the topological structure of the gene interaction network will inevitably change as new gene-gene interactions are explored. We take the publication date of the literature as the time of discovery of an interaction, and on this basis obtain the LCC of the human gene network at the end of each year (from 2003 to 2017). The number of genes in the LCC increases; however, the number of modules does not change much (more details are available in the electronic supplementary material). This may be due to a property of scale-free networks, namely that most newly discovered interactions connect a hub gene and an isolated gene. Although the human gene network will inevitably undergo local changes with the discovery of new genes and gene-gene interactions, the overall organization and layout of the network will not change significantly, because hub nodes continue to play a major role in the modules. As gene-gene interactions are constantly explored, gene interaction networks are evolving. To describe how much a gene has been studied, an approach called gene saturation, which is based on a logistic model for each gene, has been proposed recently [59]. This approach may provide some guidance for experimental researchers in choosing their research objects and discovering new gene-gene interactions efficiently.
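The yearly tracking of the LCC can be sketched with a union-find structure that adds interactions in order of publication year. This is a minimal illustration rather than the procedure used to produce the reported results: the function name, the triple format of the interactions and the toy genes are all hypothetical, and module detection is not included.

```python
from collections import Counter

def lcc_size_by_year(interactions, years):
    """Number of genes in the largest connected component at the end of each year.

    interactions: iterable of (gene_a, gene_b, publication_year) triples
    years:        years at which the cumulative network is evaluated
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    pending = sorted(interactions, key=lambda e: e[2])
    sizes, k = {}, 0
    for year in sorted(years):
        # Add every interaction published up to and including this year.
        while k < len(pending) and pending[k][2] <= year:
            a, b, _ = pending[k]
            parent[find(a)] = find(b)
            k += 1
        # Count genes per component and keep the largest.
        counts = Counter(find(g) for g in list(parent))
        sizes[year] = max(counts.values(), default=0)
    return sizes

# Hypothetical interactions with publication years.
interactions = [
    ("A", "B", 2003),
    ("C", "D", 2003),
    ("B", "C", 2004),
    ("E", "F", 2005),
]
sizes = lcc_size_by_year(interactions, range(2003, 2006))
```

In the toy data the LCC grows when an interaction bridges two existing components and stays unchanged when a new interaction only links genes outside it.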
Disease classification is a step towards precision medicine, which requires precise patient characterization, currently based on clinical phenotypes but in future augmented by laboratory-based tests [60]. As illustrated in figure 8, researchers at hospitals and institutions obtain molecular data from patient samples for the diseases of concern. Their studies have provided valuable guidance to precision medicine in many respects, such as providing predictive, prognostic, diagnostic and surrogate markers of diverse disease states, informing on the underlying molecular mechanisms of diseases and allowing for classification of patients based on molecular data. Our work (the blue part) extends this basis. Firstly, the data we need, for example disease genes, are identified and summarized in previous research. Secondly, the direct application of genetic information can be considered as first order, while the human gene network is an integration of genetic information, which is high order. Thirdly, our classifications of diseases are at the system level, designed to provide novel insights for clinical practice at the sample level, for example, a repositioning of our understanding of diseases and exploration of the potential of combination therapy. Our disease classifications may also complement the classification of complications, since a clear definition of complications is essential in medicine, mainly to improve the quality of patient care. The lack of a uniform method for reporting complications, in terms of both definition and grading, prompted the authors of [61] to propose a classification system of complications based on combining outcome and severity of sequelae. The integration of such work will play a role in guiding combination therapy.
Figure 8. Molecular data of patient samples for the diseases of concern yield good results in many studies, which have provided valuable guidance to precision medicine in many respects, such as providing predictive, prognostic, diagnostic and surrogate markers of diverse disease states, informing on the underlying molecular mechanisms of diseases and allowing for classification of patients based on molecular data. Our work (the blue part) extends this basis. Firstly, the data we need, for example disease genes, are identified and summarized in previous research. Secondly, the direct application of genetic information can be considered as first order, while the human gene network is an integration of genetic information, which is high order. Thirdly, our classifications of diseases are at the system level, designed to provide novel insights for clinical practice at the sample level, for example, a repositioning of our understanding of diseases and exploration of the potential of combination therapy.
Moreover, the enormous complexity of common diseases and the resulting problems, such as the fact that many patients do not respond to treatment and the increasing costs of drugs and drug development, provide strong motivation for new and complementary strategies for research and clinical practice [62]. Network-based approaches that classify diseases based on the pathways impacted have the potential to substantially enable the elaboration of a network-based view of drug discovery and repositioning (the application of known drugs to new indications), which are challenging issues in pharmaceutical science [63,64]. Prior to clinical implementation, major challenges must be addressed from a clinician's perspective, including understanding how network-based genomic science is generated and linked to patient-oriented science, which is the framework for evaluating genomic studies, as an evidence base for providing effective precision medicine to patients in the future [65]. Our work covers far more diseases than other studies and is not limited to cancers or certain diseases. This breadth may make our results general rather than specific; however, it may also draw attention to some unexpected points.
A comprehensive description of the associations between pathways and diseases requires identification of not only multiple pathways associated with a specific disease but also pathways associated with multiple diseases. In 9125 tumour samples, Sanchez-Vega et al. [29] point out significant representation of individual and co-occurring actionable alterations in 10 signalling pathways. Meanwhile, we find that most of these 10 signalling pathways are subject to significant influence by diseases, including the Hippo signalling pathway, PI3K-Akt signalling pathway, Notch signalling pathway, p53 signalling pathway, cell cycle pathway, Ras signalling pathway, TGF-beta signalling pathway and Wnt signalling pathway. The current understanding of the Hippo pathway, for example, has been reviewed with an emphasis on the effects of this pathway on basic biology and human diseases, including cancers, immunity and cardiovascular diseases [66]. Another review shows that individuals with RASopathies share many overlapping characteristics, including cardiac malformations, short stature, neurocognitive impairment, craniofacial dysmorphy, cutaneous, musculoskeletal and ocular abnormalities, hypotonia and a predisposition to developing cancer [67]. These results suggest a resemblance between cancers and non-cancer diseases, and imply opportunities for targeted and combination therapies. Integrative analysis methods have been proposed to improve power and reproducibility for identifying genes and prognosis markers associated with multiple cancers, which may lead to the discovery of novel therapeutic targets for cancer therapies [68,69]. In addition, such analysis estimates the effect (either positive or negative) of the same gene in different diseases, which helps predict possible side effects of a drug.
Supported by our classification of diseases, similar applications can be implemented between cancer diseases and non-cancer diseases to discover biomarker targets and improve drug development for multiple diseases that are classified in the same category.
Data accessibility. The data calculated in results are provided as electronic supplementary material. Authors' contributions. Z.M., B.G. and Z.Z. conceived the idea for this study. Z.M. performed the theoretical and