Data-driven identification of complex disease phenotypes

Disease interaction in multimorbid patients is relevant to treatment and prognosis, yet poorly understood. In the present work, we combine approaches from network science, machine learning and computational phenotyping to assess interactions between two or more diseases in a transparent way across the full diagnostic spectrum. We demonstrate that health states of hospitalized patients can be better characterized by including higher-order features capturing interactions between more than two diseases. We identify a meaningful set of higher-order diagnosis features that account for synergistic disease interactions in a population-wide (N = 9 M) medical claims dataset. We construct a generalized disease network where (higher-order) diagnosis features are linked if they predict similar diagnoses across the whole diagnostic spectrum. The fact that specific diagnoses are generally represented multiple times in the network allows for the identification of putatively different disease phenotypes that may reflect different disease aetiologies. At the example of obesity, we demonstrate the purely data-driven detection of two complex phenotypes of obesity. As indicated by a matched comparison between patients having these phenotypes, we show that these phenotypes show specific characteristics of what has been controversially discussed in the medical literature as metabolically healthy and unhealthy obesity, respectively. The findings also suggest that metabolically healthy patients show some progression towards more unhealthy obesity over time, a finding that is consistent with longitudinal studies indicating a transient nature of metabolically healthy obesity. The disease network is available for exploration at https://disease.network/.


Introduction
Traditionally, medical research and practice follow a reductionist [1] specificdisease approach that often neglects that hospitalized patients usually suffer from a variety of diseases [2], a phenomenon called multimorbidity [3]. Interactions between multiple diseases may have severe consequences both for diagnosis and treatment and often form decisive features of the clinical picture of the patient [4]. For clinical practice, it is important to disentangle these complex disease relationships, which are far from fully understood [5].
One approach to better understand the relationships between diseases are comorbidity networks. Comorbidity networks formalize disease co-occurrences by representing diagnoses as nodes and linking diagnoses that tend to co-occur in patients [6][7][8]. In comorbidity networks, the health state of a multimorbid patient is characterized by multiple diagnoses that appear in network clusters (i.e. groups of nodes with more links within the group than to nodes outside of the group) of e.g. mental, metabolic or cardiovascular diseases [9,10]. So far, this strand of literature on comorbidity networks typically focused on pairwise co-occurrence patterns. This approach may miss complex interactions between diseases. The investigation of multimorbidity patterns has accordingly been defined as a priority in this line of research [8].
Besides the comorbidity network approach, recent years have brought increased activity in disease risk modelling using a plethora of data mining and machine learning approaches [11][12][13][14][15]. Machine learning is primarily dedicated to the prediction of individual diseases or groups of closely related diseases [13,15]. Different from network approaches it applies a broad spectrum of diagnostic information. These methods have yielded high predictive performance, but their transparence and interpretability are often limited [16,17]. Even with strategies such as post hoc interpretation [17], these approaches to the best of our knowledge cannot be used to address complex interactions across the whole disease spectrum which we pursue in the present study.
Previous work has described approaches for gathering phenotypic descriptions of patients and discovering correlations between diseases, see e.g. Roque et al. [18]. These studies typically use a variety of categories of medical records to analyse relationships between diseases [19]. To date, however, studies in the area do not allow to explicitly study interactions between more than two diseases, so-called higher-order interactions, and often use data from only one hospital [18]. In this study, we use the terms higherorder and complex interactions interchangeably. We further distinguish two types of interaction. In synergistic interactions, the combined effect of two or more diseases is higher than what would be expected from the individual diseases; in redundant interactions, the overall combined effect is reduced.
We aim to identify disease phenotypes from diagnosis data in population-wide electronic health records. For a specific index disease, there might be several disease phenotypes which differ regarding their comorbidity context, i.e. their location and neighbourhood in the disease network. A specific location in the network is related to a specific risk of acquiring further diseases (see below).
Obesity is controversially discussed in the medical literature to form different phenotypes. Using obesity as an example index disease, we evaluate if and to what extent the patients assigned to the two major obesity clusters on the generalized disease network differ in terms of co-morbidities and prognosis. For this purpose, we use a case-control design with cases and controls matched by age, sex and place of residence. We find that the main obesity clusters are largely consistent with what has been discussed in the medical literature as metabolically healthy and unhealthy obesity, respectively. Different from clinical studies in the topic area, which typically follow obese and non-obese patients over time, we show for the first time, that obesity clusters in ways that are somewhat consistent with the discussed clinical phenotypes.
Here, we integrate approaches from network science, machine learning and computational phenotyping to assess complex higher-order interactions between diseases. The proposed method is transparent and comprehends diseases across the full diagnostic spectrum in a way that leads to clinically meaningful results.

Definitions
The following terms are used throughout the paper with a specific meaning. To improve readability, we provide these definitions here: -Disease: A condition that impairs normal functioning and is typically manifested by distinguishing symptoms and coded by a diagnosis. -Diagnosis: A code in the International Classification of Diseases, typically encoding a disease, injury or symptoms. -Diagnosis feature: One or more diagnoses with similar co-occurring diagnoses (i.e. comorbidities) and similar progression. Diagnosis features are of 'higher order' if they consist of more than a single diagnosis. In this work, they are the basis for capturing the higher-order interactions between diagnoses and the corresponding diseases. They allow a more fine-grained picture of the human disease network when compared with single diagnoses. -(Optimal) feature set: A (meaningful) set of higher-order diagnosis features. The optimal (meaningful) feature set is characterized by maximizing the predictive performance of a cross-validated multinomial naive Bayes model through the tuning of few hyperparameters. The model inputs are the diagnosis features from the feature period as predictors, and the single diagnoses from the subsequent target period as predicted targets. -Generalized disease network: A network where the nodes are diagnosis features representing one or more diagnoses and with weighted links representing the similarity of the model coefficient vectors associated with the linked nodes. Stronger links represents higher similarity of the disease features with respect to their predicted target diagnoses. -(Putative) disease phenotype: A cluster of diagnosis features of a given index disease on the generalized disease network. A specific disease phenotype might indicate shared aetiological factors between the diseases reflected in the included diagnoses. Furthermore, a disease phenotype better reflects the original diseases than single diagnoses with respect to aetiological factors. -Index disease: A specific diagnosis code of interest to explore comorbidities and identify disease phenotypes.

Data and design summary
A hospital population of 9 M patients was considered. The criteria for selection into the study were having no in-or outpatient hospitalization for a duration of 6 years, followed by at least one hospitalization during the next 3 years (the feature period), followed by at least one further hospitalization during the following 3 years (the target period). Half a million patients were selected into the study population. The goal of this study was to identify disease phenotypes through the use of diagnosis information in electronic health records. The first step was to identify meaningful diagnosis features, i.e. features maximizing the predictive power, capturing higher-order interactions between diagnoses. A diagnosis feature is understood as a set of one or more (co-occurring) diagnoses. A meaningful set of diagnosis features was identified from the data using the following approach: (a) we mapped the patients' single diagnoses of the feature period to diagnosis features; (b) we used them to predict the patients' single diagnoses-the ( prediction) targets-of the target period, and, royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20201040 finally, (c) we selected a set of features that maximized the predictive performance with respect to the target diagnoses. Using this feature set, a final predictive model was fit to the whole dataset, referred to as all-data model. The non-trivial information contained in the correlation structure of the model coefficients was used to produce a comorbidity network where the nodes are sets of diagnoses (the diagnosis features) instead of single diagnoses. We call this the generalized disease network. There, disease phenotypes could be identified using clusters of diagnosis features. This was demonstrated at the example of the index disease obesity, where we propose to interpret clusters of obesity features as different disease phenotypes of obesity. Indeed, a matched comparison of patient cohorts corresponding to these obesity clusters showed that two of these clusters showed characteristics of metabolically healthy and unhealthy obesity, respectively. Figure 1 demonstrates that the use of increasingly complex (i.e. higher order) features indeed improves the average predictive performance of a multinomial naive Bayes model; see §5. Figure 1a shows the model quality in terms of the total average F1 score as a function of the maximal number of diagnoses that may make up a feature, i.e. the model order. We computed models up to model order four. Figure 1a also shows that the number of features in the optimal set increases with model order. Figure 1b shows that precision and recall vary substantially across different diagnoses, almost over the entire range of possible values between zero and one. Individual F1 scores range between 0.04 and 0.78. This means that some diagnoses can be predicted very well from the feature information, others less. Among those diagnoses with the highest prediction accuracies, we find diagnosis codes (ICD-10) starting with M (diseases of the musculoskeletal system and connective tissue) and N (diseases of the genitourinary system), whereas the lowest scores are found for codes starting with O (pregnancy, childbirth and the puerperium) and C (malignant neoplasms).

Identifying meaningful diagnosis features
In figure 1c, diagnoses are grouped by the first character of their respective ICD-10 code. Higher model orders clearly increase precision while keeping the recall approximately constant. Overall, the resulting increase in F1 is therefore mostly due to the increase in precision. For example, diagnoses starting with M (diseases of the musculoskeletal system and connective tissue), L (diseases of the skin and subcutaneous tissue) or N (diseases of the genitourinary system) show high precision on average, whereas diagnoses starting with P (conditions relating to the perinatal period), O (pregnancy, childbirth and the puerperium) or F (mental and behavioural disorders) have the lowest. We find a recall close to one for codes of group P while most other groups have recalls around 0.5.
The high recall of group P very likely is an artefact because newborns with P-diagnoses are preferentially selected into the feature period rather than the target period; see §5. A low number of new P-diagnoses in the target period together with the monotonicity assumption leads to high recall.
For an overview of all diagnoses and diagnosis groups with their respective scores, electronic supplementary material, Scores.xlsx.

Quantifying disease interactions
An immediate by-product of the computation of the model coefficients is the quantifiability of synergistic effects between diagnoses, presented in table 1. Synergistic and redundant effects are measured through their lift. Lift is computed from multiple models of orders one to four. It is the difference of a measured model coefficient and its expectation based on the coefficients of its respective lower-order models. For example, the lift of feature 'E11 s E14 s E66s' is computed from its coefficient of model order three and the coefficients of features 'E11s to E14s' from model order two and 'E66s' from model order one, respectively. Like the model coefficients, the lift is computed for specific feature-target combinations. The model coefficients, and thus lift, correspond to log-odds and are herein measured in decibans (db, see §5). A positive lift means an increase in log-odds compared to its expectation. Table 1 presents feature-target combinations that occur in at least 50 patients, thus reducing fluctuations of the model coefficients, and shows combinations with coefficients of at least 8 db, reflecting a relatively large effect size.
Note that findings in table 1 should be considered as exploratory only and do not include testing for statistical significance. Positive (negative) lifts indicate synergistic (redundant) interactions. We present the features with the largest synergistic effects across all feature-target combinations. Our model contains trivial predictions where a feature is used to predict a target already contained in that feature's diagnosis set. Note that the model incorporates a monotonicity assumption: once a patient has received a diagnosis, we assume the condition remains positive throughout. Disregarding those trivial associations, the top 10 features of at least two diagnoses with highest lifts, grouped by feature and averaged over all targets to which the respective feature contributes, are shown in table 1. In general, many of the synergistic effects contain specific components of the metabolic syndrome and cardiovascular target variables-associations that are well established in the literature. For instance, disorders of purine and pyrimidine metabolism (E79), specifically hyperuricaemia, have frequently been described as a marker of metabolic syndrome [20] and are part of several synergistic effects. For Table 1. Highest-ranking (by median lift) feature-target combinations. This table shows the highest-lift features and their corresponding target diagnoses. Included are diagnosis features comprising two or more diagnoses and where the feature-target combination occurs in at least 50 patients with a coefficient of at least 8 db. The median value of the lift is shown if there is more than one target fullfiling the selection criteria for a given feature. Note that some features with three or more diagnoses can be split into features with one or two diagnoses, respectively; these constituent features are separated by vertical bars. They provide the basis for the lift computation for the feature f. example, the combination of E79 with N18 (chronic kidney disease), if also combined with N39 (other disorders of the urinary system) is associated with an average lift of 1.9 db for the targets E11 (diabetes mellitus) and I25 (chronic ischaemic heart disease). This corresponds to 1.5-fold increased odds of positive target diagnoses E11 and I25, when N39 is additionally diagnosed together with E79 and N18 in the feature period. This combination of diagnoses constitutes the most complex example in table 1 and a full explanation is out of scope of this work. Regarding some of the included features, for example, hyperuricaemia (part of E79) has been shown to contribute to the progression of kidney disease in diabetes [21], with serum uric acid being either merely a marker of kidney damage or having a causal pathogenic role [22]. Furthermore, hyperuricaemia is associated with both chronic kidney disease and chronic ischaemic heart disease [22]. Accumulating evidence points to a possible aetiologic role of increased uric acid in the pathogenesis of cardiovascular disease [8]. A recent meta-analysis further found that every 1 mg dl −1 increase in serum uric acid was related to a 12 % increase in cardiovascular mortality in chronic kidney disease patients [23]. Several other noteworthy synergistic effects are present, e.g. a combination of N18 (chronic kidney disease) and I50 (heart failure) with regard to E78 (disorders of lipoprotein metabolism; 1.9 db, i.e. change in odds of 1.5). This is consistent with literature that shows a link between chronic kidney disease with heart failure, and with hyperlipidaemia, respectively [24,25]. Pathophysiological features indicate that heart failure can cause a reduction in cardiac output and decrease in renal perfusion which are primary drivers of renal dysfunction in heart failure. Among the various confounding factors for renal dysfunction in heart failure, dyslipidaemia has received increasing attention, and its role has not been fully understood [25].
Other displayed synergistic effects include a combination of two diagnoses for Alzheimer's disease (F00, G30) which increases the odds of fracture of femur (S72) and delirium (F05) (1.3 db, i.e. change in odds of 1.3). Accordingly, Alzheimer's disease has been shown to be an important risk factor for serious falls, including pelvic and femur fracture [26][27][28]. Furthermore, a review found that superimposed delirium among populations with dementia was highly prevalent [29]. Delirium and Alzheimer's disease are frequent causes of cognitive impairment among older adults and share a complex relationship in that delirium and Alzheimer's disease can occur independently, concurrently, and interactively, for example, delirium can alter the cause of an underlying Alzheimer's disease. Models for a shared pathophysiology of delirium and Alzheimer's disease have recently been proposed, including common baseline risk biomarkers and outcome biomarkers [30].
There is a synergy between hyperuricaemia (E79) and hypertensive heart disease (I11); which are both risk factors for chronic ischaemic heart disease (I25; 1.3 db). While hypertension is a well-established risk factor for chronic ischaemic heart disease, also hyperuricaemia has been established as an independent risk factor [31,32]. To the best of our knowledge, the combined risk of these risk factors has not yet been evaluated.
A further synergistic effect is found between smoking (F17) and acute myocardial infarction (AMI, I21) with regard to angina pectoris (I20, 1.0 db, i.e. change in odds of 1.3). Studies show accordingly that smoking after AMI is associated with a considerably increased risk of more angina [33].

The generalized disease network
The diagnosis features which capture higher-order interaction effects are used to inform the construction of the generalized disease network. See figure 2 for an overview. There, nodes correspond to diagnosis features and links between features (omitted in the figure) indicate that these features predict similar target diagnoses; see §5. There are well discernible clusters of features roughly corresponding to the chapters of the ICD-10 classification and summarized in figure 2. Specific examples for the distribution of obesity (E66), chronic ischaemic heart disease (I25), osteoporosis (M81) and asthma (J45) are presented in figure 3.

Detecting obesity phenotypes
To detect disease phenotypes, we perform a network community detection, i.e. clustering. Clusters are identified such that nodes (diagnosis features) are more strongly connected within their respective cluster than with nodes outside their clusters. We expect that different disease phenotypes present themselves in different comorbidity contexts and different prognoses. We, therefore, expect them to be located in different clusters of the network. This means that for a given index disease we expect to find distinct disease phenotypes in different clusters (colours) of the generalized disease network.
We find the index disease obesity (E66) distributed across four different clusters in the network. We use the two largest clusters to construct a matched case-control design, see §5. We compare them with non-obese controls. Individuals in one of the clusters show characteristics of metabolically unhealthy obesity, whereas the other cluster is metabolically more healthy [34][35][36][37] (figure 3a). In particular, patients in the cluster of metabolically unhealthy obesity (MUHO candidates), show a considerably higher prevalence of diabetes, hyperlipidaemia, when compared with both, patients in the cluster of metabolically healthy obesity (MHO candidates) and non-obese controls. Hypertension, ischaemic heart diseases and other forms of heart disease already have higher prevalences among MHO candidates, compared to controls. With regard to these diseases, MHO candidates have an intermediate position between control and MUHO candidates. MHO candidates, however, are similar to controls in terms of diabetes and hyperlipidaemia prevalence as well as mental and behavioural disorders due to psychoactive substance use (including nicotine dependence), which are all increased in MUHO candidates. Table 2 displays metabolic and selected further diagnoses with significant differences in prevalence for obesity phenotypes and controls. Note that, based on approx. 200 blocks in the dataset, significant differences as indicated with ** or *** in table 2 remain significant after adjustment for multiple testing. The phenotypes do not differ in terms of number of diagnoses or days in hospital, with both groups having more diagnoses and longer stays when compared with controls.
Regarding the incidence of new metabolic and cardiovascular diseases during follow-up, again MHO patients have a position in-between MUHO and control patients. Metabolic diseases, most importantly diabetes, are more frequently newly diagnosed among patients that already were in the MUHO cluster.
In-hospital mortality over a follow-up period of 9 years did not significantly differ between MHO and MUHO royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20201040 patients, and mortality of MHO patients was again inbetween that of non-obese controls and MUHO patients.
It is straight-forward but beyond the scope of this work to extend the above computational approach to other index diseases than obesity. In particular, we show in figure 3b-d that also (b) chronic ischaemic heart disease, (c) osteoporosis and (d) asthma might show distinct disease phenotypes in multiple clusters. The phenotypes of more than 100 different index diseases can be explored online at https://disease.network/.

Discussion
The goal of this study was to develop a method to automatically identify disease phenotypes through the use of diagnosis information in electronic health records. In the first step, we identified diagnosis features, i.e. meaningful sets of diagnoses, to capture higher-order interactions of co-occurring diagnoses. We showed that taking higher-order interactions into account improved the performance of predicting disease progression.
The second step was to display these higher-order interactions by constructing a network of diagnosis features with the best predictive performance and tessellate it into clusters. As diagnosis codes are generally part of multiple diagnosis features, the way how specific diagnoses are distributed across the network in clusters might indicate different disease phenotypes.
There is, to the best of our knowledge, no previous literature which has investigated the impact of higher-order interactions between diagnosis codes across the full diagnostic spectrum. Previous studies for network construction [7] have only looked at pairwise co-occurrence patterns of diagnoses.  In those networks, each diagnosis is only represented once. This single representation cannot differentiate between multiple phenotypes of the respective index disease. Complex diseases, however, have in common that they (a) cannot be captured by one single specific diagnosis (e.g. metabolic syndrome) and (b) that a specific diagnosis can present itself as part of multiple disease phenotypes (e.g. hypertension). The approach developed here allows us for the first time to investigate multiple roles of specific diagnoses in disease phenotypes, thereby filling a methodological chasm in the current literature on multimorbidity [8].
At the examples of obesity (E66), chronic ischaemic heart disease (I25), osteoporosis (M81) and asthma (J45), we show that many complex diseases are indeed distributed across multiple clusters on the disease network ( figure 3). For example, osteoporosis (figure 3c) occurs in clusters of cardiovascular  Comparison of putative metabolically healthy obesity (MHO) and metabolically unhealthy obesity (MUHO) cohorts with non-obese controls. The effect size (ES) is measured in decibans (db) computed as ES i,j : = 10(log 10 n j − log 10 n i ) db, where i, j are the group indices with 0 = control, 1 = MHO cand., 2 = MUHO cand., and n k is the relative frequency of patients in group k with at least one diagnosis of the corresponding diagnosis block (shown as percentages). A value of ES i,j > 5 db can be considered substantial evidence, ES i,j > 10 db as strong, ES i,j > 15 db as very strong and ES i,j > 20 db as decisive evidence. † Asterisks indicate the p-value of the effect size (G-test): *p < 10 −2 ; **p < 10 −4 and ***p < 10 −6 . and metabolic diseases and also in the neurodegenerative disease cluster, suggesting different (disease) phenotypes of osteoporosis. In accordance, cardiovascular [39,40], but also depression [41] and Parkinson's disease [42] have been identified as risk factors for osteoporosis. Shared pathways of both neurodegeneration with osteoporosis [43] and cardiovascular disease with osteoporosis [39] have been discussed, supporting the identified disease phenotypes. The latter has been hypothesized with common features between bone mineralization and atherosclerotic calcification [39]. Similarly, asthma is part of two clusters (figure 3d), suggesting the existence of two distinct asthma phenotypes. The first cluster is the one of mainly cardiovascular diseases (I,E in dark blue). Asthma has been identified as a risk factor for cardiovascular diseases [44]. In this cluster in the generalized disease network, asthma is primarily connected to chronic obstructive pulmonary disease which further connects to a phletora of cardiovascular diseases. Asthma is a chronic inflammatory disease, with inflammatory processes being key in the pathophysiology of atherosclerotic diseases [44]. The second cluster is closely related to childhood diseases (P,Q in green), whereas a specific asthma phenotype, particularly as an early-onset allergic type of asthma, has been discussed for children [45]. Accordingly, there is a link in the generalized disease network between the features J45p (asthma) and J30s (vasomotor and allergic rhinitis).

Metabolically healthy versus unhealthy obesity phenotypes
Specifically for obesity (E66), we used the network to investigate obesity-related phenotypes. Obesity is a major public health problem associated with increased morbidity and mortality [46]. It represents a remarkably heterogeneous condition with different obesity-related comorbidities, impairment of functional status and varying cardiometabolic outcomes [47]. The concept of metabolically healthy obesity (MHO), in comparison to the metabolically unhealthy obesity (MUHO) phenotype is under debate and its relevance to clinical practice still unclear [46,47]. We could show that obesity features were indeed distributed across multiple clusters on the disease network. Depending on their cluster membership, patients showed some characteristic specific features-in terms of present and future comorbidity-of metabolically healthy and unhealthy obesity, respectively. MUHO represented the largest cluster with more than two thirds of the obese patients, corresponding very well to the general picture of this phenotype including metabolic comorbidities like dysglycaemia, dyslipidaemia, hypertension and/or hyperuricaemia and high cardiovascular risk. The second representative cluster (≈ 15 %) comprised obese patients with musculoskeletal problems and degenerative changes who feature hypertension but were otherwise metabolically healthy (MHO), again in accordance with epidemiological evidence [47,48]. Interestingly, before matching of the cohorts, we found female preponderance in the MHO cluster in accordance with other studies [49]. This may be ascribed to differences in sex hormones, body fat distribution, adipokines, immunological parameters, the microbiome and better insulin sensitivity of women compared to men [49]. Additionally, small clusters represented obese patients with gastrointestinal disorders and cancers or liver disease (including fatty liver, another candidate of the metabolic syndrome) or patients with reproductive or urogenital problems in our analysis, all wellknown comorbidities based on obesity-related hormonal imbalance, inflammation and insulin resistance [50]. The identified diagnosis features contribute to the currently ongoing discussions regarding phenotype definitions of MHO [34][35][36][37]51]. In particular, we showed for the first time that diagnosis features of obesity are part of different clusters on a comorbidity network, and that these assignments are associated with strong differences in metabolic health, amplifying epidemiological evidence. A phenotype with low prevalence of diabetes and hyperlipidaemia seems to be distinct from a phenotype with high prevalence of hypertension, hyperlipidaemia and diabetes, which is linked to an increased risk of developing new metabolic and cardiovascular risk factors.
Furthermore, not only the prevalence of metabolic disorders was significantly higher in MUHO versus MHO but also that of mental disorders. It can be speculated that shared psychosocial factors are underlying pathophysiological mechanisms starting a vicious circle between unhealthy lifestyle, metabolic and psychological disturbances, which again are linked and mediated by obesity [49]. During follow-up, MUHO also had the highest incidence of diabetes which doubled in comparison to that of the MHO group. On the other hand, MHO was characterized by hypertension, highlighting that increased BMI is one of the most prominent causes of heart failure and ischaemic heart disease [52,53]. This might explain why both clusters showed a comparable incidence of ischaemic heart disease at follow-up. Also, mortality was increased in both clusters at follow-up without significant differences between the MHO and MUHO groups. This finding corroborates previous findings from cohort studies and surveys questioning the value of MHO for the determination of the overall prognosis of patients [54,55].
In total, the patterns identified give a mixed picture with regard to the discussions of a benign form of MHO, that is not at increased risk of negative cardiovascular prognosis and mortality [36,37,56]. Consistent with earlier research, an evaluation of the incidence of new metabolic and cardiovascular diseases during follow-up corroborates the picture that MHO patients have a position in-between MUHO and non-obese control patients, indicating a transient state [36]. The MHO disease phenotype identified is thus consistent with a more benign obesity phenotype, showing a slower progression to a more unhealthy obesity over time than the MUHO disease phenotype. Nevertheless, the MHO disease phenotype cannot be considered completely benign in terms of prognosis. This is corroborated by a meta-analysis of 22 prospective studies over a follow-up of 3.6-30 years, which did not identify any combinations of obesity and components of metabolic syndrome that were not at increased risk of cardiovascular events and mortality when compared with non-obese patients. The risk for MHO particularly increased for studies with longer-follow-up [36] suggesting that the majority of MHO patients converge towards the MUHO phenotype over time. Thus it is necessary to raise public awareness and to initiate lifestyle intervention in all MHO patients.
This elaborated example on obesity (E66) demonstrates how to use the developed method to obtain meaningful novel information about complex diseases. Different from clinical studies in the topic area, which typically follow obese and non-obese patients over time, we show for the first time, that the disease phenotypes of obesity identified in this purely data-driven approach are somewhat consistent with the discussed clinical phenotypes of obesity.

Strengths and limitations
The two unique characteristics of our approach are (i) the computational identification of higher-order diagnosis features that (ii) simultaneously predict more than a thousand diagnoses, rather than focusing on one or a couple of diseases [15,[57][58][59]. The present approach allows to describe the overall health state of inpatients across the boundaries of diseases and disease groups, providing a more holistic picture of their health-state. This facilitates the identification of phenotypes across diagnostic boundaries.
The following strategies were applied to improve validity: First, we performed fivefold cross-validation in the search for diagnosis features. This leads to meaningful diagnosis features. Second, the naive Bayes classifier is not prune to overfitting. We find comparable levels of model performance in the training and the test data, which suggests that overfitting is not an issue in our approach.
A study limitation is that only hospital diagnoses were available. Other clinical information such as prescriptions or performed medical procedures were not available. This work pertains exclusively to pattern detection in large datasets of diagnosis data which has been defined as a research priority [8]. The present study design was exploratory and appropriate to identify associations between diagnoses but not to infer causal relationships. Furthermore, cross-validated results are reported but cross-validation does not provide proper metrics for evaluation of prediction models. In the light of the study aim to identify disease phenotypes based on the contextualization of co-occurring diseases rather than predicting new diseases, this should, however, not pose a problem to the present analysis.
Because of the monotonicity assumption used and due to the discrepancies of disease onset and the timing of diagnosis which vary widely between diseases, it is not possible to make accurate timely predictions of new diagnoses in the target period based on the diagnosis features in the feature period. Based on the findings from other network analyses and the current patterns which suggest that diagnosis features are often related to diagnoses that represent later stages of the same or similar diseases it is however likely that many of the identified correlations reflect disease progression. We also show this explicitly for the index diagnosis of obesity, where metabolically health obesity approaches unhealthy obesity over the observation period. We found that without applying monotonicity, the target data were too sparse for the naive Bayes classifier to successfully capture higher-order associations. Also, we have been interested primarily in chronic diseases where this assumption is more valid.
Regarding computational cost, there are noteworthy methodological aspects which suggest that the computational cost is relatively moderate. In particular, the main point of our analysis method is that we identify statistical higherorder correlations in the data by applying (a) very efficient frequent itemset mining [60], (b) the generative classifier we use (naive Bayes) can be computed efficiently [61] and (c) the hyperparameter search applies the Bayesian optimization (BO) technique which is especially suitable for optimizing few variables efficiently with relatively few objective function evaluations; see [62] and references therein. The bottlenecks of the analysis are (a) the algorithm of mapping patient-diagnoses to patient-features, figure 6, and (b) the computation of the generalized disease network from the final (optimal) model coefficient matrix. Furthermore, a disadvantage of frequent itemset mining is that it can only detect a subset of possible higher-order associations, particularly the subset of positive associations. For higher-order associations, the positive-only associations mean a considerable saving in terms of computational cost, and a restriction to associations that are putatively most relevant in a sparse dataset.

Conclusion
In conclusion, based on a multinomial naive Bayes model, we demonstrate that health states of inpatients can be better characterized by the inclusion of higher-order features of multiple diseases. Diseases show a multitude of complex interaction effects among each other that generally impact a patient's disease progression in a non-additive way. The method developed allows to identify distinct disease phenotypes made up of clusters of diagnosis features which might correspond to different disease aetiologies that cannot be captured by single diagnoses and their interactions. This enables us to discover and analyse novel disease phenotypes based on different comorbidity contexts of diagnoses. The resulting differentiation has strong implications for a better understanding not only of disease aetiology but also for the meaning of a specific given diagnosis in terms of treatment and prognosis.

Material and methods
First, we describe the main modelling process from the input dataset to the final generalized disease network, see figure 4 for an overview. Second, the matched cohort comparison of metabolically and unhealthy obese patients is described.

Dataset
A pseudonymized administrative dataset containing all in-and outpatient stays in privately and publicly funded hospitals in Austria over a time period of 18 years was used, covering patients with exit dates in 1 January 1997-31 December 2014. The dataset consisted of 45 M recorded stays of 9 M patients. Each stay record included a patient pseudonym, entry and exit dates, one primary diagnosis, zero or more secondary diagnoses, home region (34 categories), sex (two categories) and age group at the time of stay (5-year bins, 19 categories). The diagnoses were encoded as three character categories from the Austrian adoption [63] of the World Health Organization's international statistical classification of diseases and related health problems, 10th revision [64] (ICD-10). A transfer from one department to another resulted in a new stay record. In-hospital deaths were recorded together with a code for the declared reason of death.

Study population/preprocessing
The whole time period of 18 years was split into four parts for the purposes of predictive modelling and matched cohort analysis: the 6-year no-admission period from t 0 : = 1 January 1997 00:00 to t 1 : = 1 January 2003 00:00, the 3-year feature period from t 1 to t 2 : = 1 January 2006 00:00, the 3-year target period from t 2 to t 3 : = 1 January 2009 00:00 and the 6-year mortality follow-up period from t 3 to t 4 : = 1 January 2015 00:00 (figure 5).
The criteria for selecting patients into the study were: (i) no recorded stay within the no-admission period, (ii) at least one recorded stay during the feature period, and (iii) at least one further recorded stay during the target period. This resulted in N P = 478 575 patients who have been selected into the study. By the virtue of these criteria, we aim to capture patients that have no serious conditions (i.e. requiring hospitalization) during the last 6 years before entering the study, and are all alive at least until t 2 .

Identification of (meaningful) diagnosis features
In identifying the phenotypes of complex diseases in diagnosis records, we follow a two-step approach. The first step is to qualitatively (i.e. structurally) and quantitatively capture the higher-order statistical interactions between diseases. This step is described in this section. The second step is to construct a generalized disease network, described in §5.4.
We introduce higher-order diagnosis features as sets of diagnosis codes that serve as features in a predictive multitarget classification model. The features are computed from the patients' accumulated diagnosis codes that occurred up to t 2 and are used to predict the accumulated diagnoses (monotonicity assumption) up to t 3 , i.e. the targets. The diagnosis codes are formed from the three-character ICD-10 codes, suffixed with the letter p or s if the code was used as a primary or secondary diagnosis, respectively.
Our aim is to identify a meaningful set of higher-order diagnosis features (henceforth called the feature set) in a computationally feasible way, by (I) applying a heuristic to construct the higher-order diagnosis features and (II) by selecting a feature set that maximizes the predictive performance (wrapper approach [65]). These steps are detailed as follows.
(I) Heuristically constructing higher-order diagnosis features. The patient diagnoses of the feature period are represented by the N P × N D indicator matrix X 0 , with entries X 0 ( p, d ) for patient index p ∈ {1, … , N P } and diagnosis code index d ∈ {1, … , N D }, N D being the total number of primary and secondary diagnosis codes. If patient p is diagnosed with d at least once in the feature period, we set X 0 ( p, d ) = 1, otherwise X 0 ( p, d ) = 0. The data matrix X 0 is transformed into the feature matrix X by mapping the N D diagnosis codes from the feature period to N F higher-order diagnosis features by (i) constructing a candidate feature set, (ii) greedily assigning features to patients, (iii) pruning features from the candidate feature set that occur in less than support min = 30 patients, and repeating with (ii) until all features are supported at least by support min patients. (i) Constructing a candidate feature set. A feature candidate is a set of diagnoses that co-occur in patients. The candidate feature set is the ordered (see next paragraph) set of all candidate features. It is constructed heuristically as follows: We consider all combinations up to a specific number of combined diagnoses (the model order) and that occur at least in support min patients. Furthermore, candidates are only included if their diagnoses occur more likely together than what would be expected from random chance. This is quantified using a generalized measure of mutual information, the 'minimum information difference to prior' (minIDP) measure which has to be above a threshold of minIDP min (see electronic supplementary material, text 1). All of these steps are efficiently computed using a free open source frequent itemset mining software by Borgelt [66,67].    same cardinality, from high to low minIDP and, (iii) within same (i) and (ii) from high to low support. (ii) Greedily assigning features to patients. Patientdiagnoses are mapped to patient-features by applying a greedy heuristic whose algorithm is shown in figure 6. (iii) Pruning the candidate feature set and repeating. After running the respective algorithm (figure 6), some (candidate) features may be weakly supported, i.e. their support drops below support min . Following the ordering of F, the first of those features is then removed and the algorithm is repeated using the newly pruned F, until the support of all assigned features is greater or equal support min . Nota bene, we remove only the first (versus all) of the weakly supported features because some of the features further down in F may become supported again simply by remapping the patients of the first weakly supported feature to other features. We will refer to the final ordered set of the well-supported higher-order features as final feature set. The patient-features corresponding to the final feature set are numerically represented as N P × N F matrix X with entries X( p, f ) = 1 if feature f has been assigned to patient p, and X( p, f ) = 0 otherwise. See electronic supplementary material, text 2 for more information. (II) Selecting the optimal feature set using the wrapper approach. The selection of the optimal feature set using the wrapper approach consists of these components: (i) a predictive multi-target classification model that allows to score a given feature set, (ii) and a routine to select the tunable hyperparameters such that the score is maximized. (i) The multi-target classification model. Based on a patient's features (computed from the accumulated diagnoses up to t 2 ), we seek to predict his or her diagnoses obtained up to t 3 (the prediction targets). The targets are represented by the N P × N D data matrix Y. If patient p was diagnosed at least once with target diagnosis t up to t 3 , we set Y( p, t) = 1, otherwise Y( p, t) = 0. As the target variables are binary, we require the choice of a classifier to be used in a multi-target predictor. We chose to use the multinomial naive Bayes classifier (MNB). Although MNB is often used for count data (e.g. word counts in natural language processing), it can also be used successfully for binary target variables.
For each feature f and target t and model order m, we obtain a model coefficient where π(t) are the prior log-odds, The target diagnosis t is classified present for patient p whenever Δ( p, t) > 0, otherwise is classified absent. For target t, we evaluate prediction quality using precision (i.e. the probability that a predicted diagnosis actually occurred; the type 1 error rate), recall (i.e. the probability that an occurring diagnosis was correctly predicted; the type 2 error rate), and the F1 score (i.e. the harmonic mean of precision and recall). The overall model score (total F1 score) is computed from the target-specific F1 scores as their support-weighted average. The support of a target equals its number of true positives. This gives more emphasis in the scoring function to higher prevalent target diagnoses.
Inputs : : the ordered candidate feature set, X 0 : the indicator matrix of accumulated patient-diagnoses at t 2 . Output: X: the indicator matrix of patient features. royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 18: 20201040 (ii) Selecting score-optimal hyperparameters. While we fix the minimum support required for each feature, support min , we perform a hyperparameter search by applying BO to find optimal values for the remaining hyperparameters, the smoothing parameter α (a regularizer in the MNB estimator) and the threshold minIDP min for our mutual information measure. Fivefold cross-validation was applied (a) to mitigate overfitting and (b) to reduce the noise in the total F1 score. During cross-validation, a target is selected for scoring if it occurs with at least 200 patients in every fold. A change of folds by chance in repeated model computations, leads to slightly fluctuating numbers of selected targets. To avoid a scoring bias due to changing numbers of scored target diagnoses, we fix the targets to be included in the model scoring to the top-supported 1000 targets. We refer to features in the final model, after the hyperparameters have been fixed, as the optimal feature set, F Ã . See electronic supplementary material, text 3 and associated supporting figures for a more detailed algorithmic breakdown.

The generalized disease network
From the optimal feature set F Ã and the optimal hyperparameters from §5.3, a final predictive model was fit to the whole dataset, the all-data model. See electronic supplementary material, text 4. The non-trivial information contained in the correlation structure of the model coefficients of the all-data model was then used to produce a network of diagnosis features as follows. There, nodes correspond to diagnosis features and links indicate that the mutual information of their coefficient vectors is statistically significant. The corresponding null model takes transitive relations between features into account, i.e. links are included if they (a) tend to predict similar target diagnoses and (b) are not explained through other associations between features.
Trivial associations here mean, that they can be explained transitively through other associations in the network. For example, if the three features A, B and C have identical pairwise association strengths, then each of their pairwise associations can be trivially (transitively) explained through the remaining other two associations.
The unfiltered disease network Φ quantifies the similarity of two features f 1 and f 2 in terms of the diagnoses they predict. Entries in the N F × N F weighted adjacency matrix Φ( f 1 , f 2 ) are given by the Gaussian approximation to mutual information Fðf 1 ,f 2 Þ :¼ À1=2 log[1 À r 2 ðf 1 ,f 2 Þ], computed from the Pearson correlation coefficient ρ( f 1 , f 2 ) of features f 1 and f 2 , computed from the row vectors C( f 1 , ·) and C( f 2 , ·), respectively.
To adjust for multiple testing and class imbalance, we filter Φ using a network backboning approach based on Gemmetto et al. [69] where we proceed as follows: (i) Initialize the weights matrix W with elements w i,j from Φ, omitting self-loops, i.e.
(ii) Discretize the weights matrix W by binning each entry into one of 30 bins in the range [min i,j w i,j , max i,j w i,j ]. Note that by construction, Φ is non-negative and symmetric. The number of bins is chosen such to balance the performance of subsequent filtering procedure (decreases with increasing bin count) and the resolution of the discretization. (iii) Compute the irreducible maximum entropy backbone from the corresponding weighted configuration model (WCM) of the network as follows. The WCM is the maximum entropy ensemble of weighted networks that results from fixing the node strengths. The strength of node i is s i :¼ P j w i,j .
(a) Vector y ¼ (y 1 , . . . , y N f ) (N f is the number of features) is initialized with random numbers in [0, 1). (b) We compute y w :¼ argmin y X j ( j = i) y i y j 1 À y i y j À s i ..,N f , using an extended Levenberg-Marquardt minimizer [70]. At the minimum, the term in parentheses (i.e. the residual) is very close to zero. (c) Compute the probability (under the null model) of generating a link between nodes i and j with a weight equal to or greater than the observed weight as g i,j :¼ (y w i y w j ) wi,j . This corresponds to the 'local filter' from Gemmetto et al. [69, eqn 15] (but with p i,j ¼ y w i y w j for the WCM instead of the enhanced configuration model where also the degree sequence is fixed). (d) Filter the network Φ by keeping only edges (i, j ) with γ i,j < 0.05.
(iv) Apply Louvain modularity detection [71] on the backbone and present its giant component with coloured clusters as the resultant generalized disease network (figure 2).

Quantifying synergistic interactions
The interaction strength of sub-features that make up a feature is quantified by the lift, L( f,t). For a feature f and a target t, the lift is given by the difference of the corresponding model coefficient and its expected value L( f, t) :¼ C( f, t) À C( f, t): Let F be the ordered optimal feature set and F(f ) be the set of diagnoses of feature index f. The expected coefficient is given by F C( f, t) :¼ P g[G C jFj (g, t) , where C jFðgÞj (g,t) is the coefficient from the model of (lower) order jFðgÞj. G is a partition of the set of diagnoses of feature f, constructed as follows. Given f, G is constructed by starting at feature index f and walking down F and collecting all feature indices g into G until U g [ G FðgÞ ¼ FðfÞ. The ordering of F guarantees ⋂ g ∈ G F(g) = ∅.
The coefficient C( f,t) can be thought of as the expected coefficient based on the linearity of the model under the assumption that f itself would not have been included in the optimal feature set. For non-interacting diagnoses C( f,t) ¼ C( f,t). Positive lifts indicate synergy; negative lifts redundancy.

Matched cohort comparison
The two largest clusters of obesity (E66), in terms of number of features, are selected for comparison with non-obese controls. We select patients from the study population who are positive in any of the cluster's features during the feature period into the corresponding cluster cohort. The controls are all patients who are non-obese during the feature period. All patients were uniquely assignable to one of the three cohorts. To compare these cohorts while adjusting for the factors age, sex and place of residence, we take the smaller of these groups, in terms of patient number, and randomly match each patient therein to three individuals from the larger group and to 10 controls, with same sex, region and a maximum age difference of 2 years. A patient could only be used once for matching. If there are not sufficient matchable patients in the larger group, then the patient is excluded. This affected 41 patients (4.4%).
We compared the cohorts in terms of their prevalence of metabolic disorders and selected further disorders during the feature period. We also analysed the incidence of new metabolic and cardiovascular diseases in the 3-year target period. With regard to mortality, we compared the cohorts over a 9-year follow-up period (2006-2014).