Precision global health and comorbidity: a population-based study of 16 357 people in rural Uganda

In low-income countries, complex comorbidities and weak health systems confound disease diagnosis and treatment. Yet, data-driven approaches have not been applied to develop better diagnostic strategies or to tailor treatment delivery for individuals within rural poor communities. We observed symptoms/diseases reported within three months by 16 357 individuals aged 1+ years in 17 villages of Mayuge District, Uganda. Symptoms were mapped to the Human Phenotype Ontology. Comorbidity networks were constructed. An edge between two symptoms/diseases was generated if the relative risk greater than 1, ϕ correlation greater than 0, and local false discovery rate less than 0.05. We studied how network structure and flagship symptom profiles varied against biosocial factors. 88.05% of individuals (14 402/16 357) reported at least one symptom/disease. Young children and individuals in worse-off households—low socioeconomic status, poor water, sanitation, and hygiene, and poor medical care—had dense network structures with the highest comorbidity burden and/or were conducive to the onset of new comorbidities from existing flagship symptoms, such as fever. Flagship symptom profiles for fever revealed self-misdiagnoses of fever as malaria and sexually transmitted infections as a potentially missed cause of fever in individuals of reproductive age. Network analysis may inform the development of new diagnostic and treatment strategies for flagship symptoms used to characterize syndromes/diseases of global concern.

reporting of the same symptom, were manually assigned to the same HPO primary id, e.g. 'eye problem' and 'eye disease' was assigned to 'abnormality of the eye', id HP: 0000478. We assigned 86 unique HPO primary ids to the 106 symptoms.
Second, we ensured that each assigned HPO primary id was distinct from another assigned HPO primary id. To be distinct, the manually assigned HPO primary id could not be a subclass of another assigned HPO primary id. These nested reported symptoms occurred when different levels of detail were provided about a similar problem. For example, 'ear pain' and 'ear problem' were both reported. The latter was classified as an 'abnormality of the ear' and assigned to HP:0000598; it is also a superclass of the HPO term 'ear pain', id HP:0030766. Formally, the full HPO text was parsed using Python v2.7 (www.python.org) so that a directed edge was generated between each primary HPO term and their direct superclasses. This procedure generated one directed acyclic graph (DAG), which is the design of the HPO, with 211 nodes and 242 edges and the root as HPO term 'All', id HP:0000001 (External Database S1). The DAG is presented in Figure S2 and it is shown again in Figure S3 with a hierarchical layout. These figures were generated using Cytoscape v3.3.0 [3]. Using the NetworkX library in Python [4], we iterated over each of the matched 86 HPO primary ids to find all ancestors (parents) in the DAG. The ancestors were then searched to find any overlap with the other 85 HPO primary ids. This search revealed that 27 of the 86 manually assigned HPO primary ids were descendants (children) of another assigned HPO primary id in the DAG. Four of the 27 nested HPO primary ids had two ancestors. Of these terms, 3 of 4 were straightforward choices amongst the two parents; one parent was a superclass of the other parent. For 1 of the 4, a choice was made to classify 'gastrointestinal hemorrhage' as an 'abnormality of the digestive system' as opposed to an 'abnormality of the cardiovascular system'. This choice was due to the known morbidity attributable to intestinal helminth infections in the study villages [5]. This classification also resulted in 1 of the 27 HPO primary ids no longer being a subclass of another assigned HPO primary id. For each manually assigned HPO primary id that was a subclass of another manually assigned HPO id, the superclass HPO id was used in place of the HPO primary id. Ultimately, there were 60 unique and distinct HPO ids assigned (86 original -26 nested ids) to 106 reported symptoms. Together with the 20 diagnosis groupings, there were 80 total diagnoses and symptoms in which to explore comorbidity relationships. With these groupings, if an individual listed two symptoms and one of those symptoms was a subclass of another symptom, e.g. 'ear pain' is a subclass of 'ear problem' (HPO term 'abnormality of the ear'), then these two symptoms would be counted as one reported symptom.

Comorbidity network construction
All comorbidity networks consisted of a maximum of 80 nodes, corresponding to the 60 symptoms mapped with the HPO and 20 disease categories. Accordingly, for each network, there was a maximum of 3,160 pairwise combinations (edges) of the 80 symptoms/diseases. In the sub-group analyses (see next section), some symptoms/diseases may not be present in the subgroup of interest. Isolates are not presented. Undirected network edges were generated in Stata v13.1 (StataCorp LLC) as follows. If two symptoms/diseases were reported together in the population more often than expected by chance then an edge was generated between those symptoms/diseases. Hence, only symptoms/diseases with a ≥1 coocurrence with at least one other symptom/disease were investigated in the correlation analyses. Relative risk (RR) [6] was used to determine the edge strength. Fisher's exact 2-sided p-value was calculated [7][8][9]. Though relative risk tends to overestimate rare events, Fisher's exact p-value is conservative for rare events [10]. RR 95% confidence intervals and Z-statistics were calculated as described in Katz [11] and Altman [12], respectively. We also calculated the ɸ correlation, its 95% confidence interval, and approximate p-value [6,8]. The ɸ correlation, with possible values between [-1,1], is a complementary measure to RR, but ɸ underestimates associations between two symptoms/diseases of vastly different prevalence [6,8].
An edge was generated between two symptoms/diseases if the RR>1, ɸ>0, RR q-value ≤0.05, and ɸ q-value ≤0.05. The q-value is a measure of the false discovery rate (FDR) and a correction for multiple comparisons. The FDR is a directly interpretable measure of the percentage of false positives amongst our significant results and is less prone to researcher bias than the Bonferroni correction (family-wise error rate) [13]. For each network, we estimated the q-values as described in Storey and Tibshirani [14] and using the Qvalue package in R v.3.2.3 [14,15]. Specifically, the q-value extends the Benjamini-Hochberg FDR correction [16] to fully automate the estimation of the number of truly null hypotheses and to convey the percentage of significant findings that may be false positives around a particular statistical test of interest. There were 8,001 correlations run in total to investigate comorbidity associations; the q-values were calculated using the results from all tests ( Figures S4-S6).

Comorbidity networks of population sub-groups
Six divisions in the study population were examined to understand how comorbidity structure varies by social and environmental contexts [17]. The division types and definitions are provided below. For the descriptive statistics of each division, refer to Table 4 and Table S1. All network statistics are provided in External Database S2.

Full population.
We provided the comorbidity structure of all individuals in the study area as an overall summary of community health and reference category for population sub-group comparisons. This information is useful for untargeted, en masse health interventions. For example, mass drug administration (MDA) is an en masse treatment programme that provides preventive chemotherapies to treat populations at risk of neglected tropical diseases (NTDs), including schistosomiasis and soil-transmitted helminths [18]. Information about the comorbidities affecting the population of interest can be used to design integrated MDA programmes, e.g. the development of treatment packages for comorbid NTDs [19].

Gender and age.
This sub-group analysis provides a cross-sectional view of disease onset and progression across life [20] in a rural sub-Saharan African (SSA) population. Four age groups (1-4, 5-14, 15-49, and 50+) were analyzed against gender. The percentage of the study population in each gender-age group is provided in Table 1. The age cutoffs were chosen to represent groups of international importance. In low-income countries, mortality risk is highest amongst children less than 5 years of age [21]. The under-5 mortality rate is highest in SSA [21]. The age group of 5-14 years captures school-aged children, who are the focus of MDA programmes, and the population subset with the heaviest infection intensities of schistosomiasis [5,22]. Individuals aged 15-49 represent the population of reproductive and working age [21,23]. Hence, the comorbidity of females in this group is a profile of risks affecting maternal health. With only 7.19% of individuals aged 50+ years (Table 1), this category represents the elderly in our study population. It also is one age group used to calculate global life expectancy from the Global Burden of Disease Study [21].

Socioeconomic (SES) index.
SES is a direct measurement of health inequalities, which affect the lifetime risk of disease and access to resources for medical care [24]. Globally, mortality is highest amongst the poorest of the poor [24]. For NTDs, socioeconomic status is a key determinant of the risk of infection and access to medicines through MDA [5,25]. To reduce dimensionality and to account for any confounding (correlation) amongst SES variables, principal components analysis (PCA) was used to develop a SES index [26,27]. Specifically, polychoric PCA was used to account for the several binary SES indicators and was run at the household level in Stata v13.1 (StataCorp LLC) [28]. The first 3 eigenvalues were 2.35, 1.67, and 1.19 and explained, respectively, 21.38%, 15.14%, and 10.86% of variation. The variation explained by the first component is in the standard range for SES indices [26,27,29]. The SES score for each household was generated from the first component and assigned to all members of the household. All SES variables entered in the direction as expected for the first component (Table S2). The highest weights were assigned to council membership (0.73), other status (0.71), and home ownership (if renter, weight=-0.65). SES scores were normally distributed across households ( Figure S7). Accordingly, SES scores were split into 3 categories to develop an ordinal index of low (SES score cutoff=-0.47, 33.33 th %ile), middle (SES score cutoff=0.33, 66.67 th %ile), and high SES. The 3-category SES index was internally coherent [27]; a summary of the mean value of each SES variable by SES category is provided in Table 2.

Water, sanitation, and hygiene (WASH) index.
WASH may arguably be the most widely advocated health intervention for low-income countries [30]. The promotion of WASH is essential to achieving many of the United Nation's Global Goals-a set of development targets-for the year 2030 [31]. For NTDs, WASH is a critical component in efforts to reduce pathogen transmission [32] (http://washntds.org). Many NTDs persist due to the lack of access to safe water and sanitation [32]. The components of WASH are highly interdependent. We developed a WASH index to account for these interdependencies and to measure the quality of WASH engagement.
Tetrachoric PCA with the seven binary WASH indicators was run in Stata v.13.1 (StataCorp LLC) [28]. The first component explained a remarkable 41.97% of variation (2.94 eigenvalue) in WASH. The second and third components explained 18.67% and 13.70% of variation with 1.31 and 0.96 eigenvalues, respectively. WASH scores were generated at the household level from the first principal component and assigned to all members of the household. The scores were roughly uniformly distributed across households ( Figure S7) with no clear evidence of clumping [27]. Accordingly, WASH scores were split into 3 categories to develop an ordinal index of poor (WASH score cutoff=-0.63, 33.33 th %ile), moderate (WASH score cutoff=0.10, 66.67 th %ile), and high WASH quality.
Interestingly, all WASH indicators did not enter as expected in the first principal component (Table S3). All variables used were, as dictated by the WHO/UNICEF JMP definitions, positive indicators of WASH quality. The variables that entered as expected were improved drinking water (0.25), improved cooking water (0.67), improved bathing water (0.80), water availability/quantity (0.08), and improved sanitation (0.19); the weights for when each variable =1 are listed in parentheses. Positive values (=1) for water treatment (-0.18 weight) and soap availability (-0.30 weight) did not have a positive contribution to the WASH quality index. This negative contribution was likely due to crowding out of WASH behaviours. We discovered that multiple WASH behaviours might produce a crowding out effect, i.e. when a household engaged in one WASH behaviour then a complementary behaviour might have been perceived as not necessary. In our study population, this crowding out effect occurred with a trade off between improved sanitation and hygiene/soap availability as well as water treatment and improved ! 7! drinking water. Water treatment was negatively associated with the use of improved drinking water sources (Spearman !!-0.20, p-value<0.0001, obs. 3491). Soap availability was negatively correlated with improved sanitation (Spearman ! -0.12, p-value<0.0001, obs. 3491). Future studies should further investigate possible crowding out effects as an unintended consequence of promoting multiple complementary WASH interventions. A summary of the mean value of each WASH indicator by WASH category is provided in Table 3.

Health care source (quality).
To understand how the type of care affects comorbidity, we asked households to list where they "usually go to get drugs and other medical care." The household response was assigned to all members of the home. An index was developed to measure the source and quality of care. Two sources were possible: private and government. Private care was indicative of more informal, unregulated care and included private clinics, drug shops, herbalists or witch doctors, and other unspecified options not noted in government care. Government care measured formal, regulated medical care and included government health centres and village health teams. Households could list multiple sources. Accordingly, three categories were developed for the health care index: private only, mixed source, and government only. These categories, respectively, are thought to capture low, moderate, and high quality medical services. Private care was viewed as suboptimal to government care because individuals in our study area [33] and in other SSA countries, e.g. Kenya [34], are able to purchase drugs at shops without formal diagnostic testing, which can result in mistreating diseases [35], and private medical sources lack formal regulation.

Village intestinal helminth prevalence.
An index of intestinal helminth prevalence was constructed to not only capture NTD coinfections in the study area, but also to represent varying environmental conditions and infection risk exposure. Our study area is endemic for intestinal schistosomiasis and hookworm [5]. These infections are geographically focused and have distinct transmission routes; schistosomiasis depends on contact with freshwater sources containing competent intermediate snail hosts and hookworm depends on contact with contaminated soil in the appropriate microclimatic conditions. Consequently, the village prevalence of each infection captures distinct village ecology and infrastructure [5]. Village prevalence was calculated as the percentage of all individuals sampled with at least one egg per gram of stool. A stratified random sample of ~60 people per village was used. Prevalence was measured before MDA and the study villages had not received MDA for 1.5 years prior to the sampling. Detailed methods for the participant sampling, parasitology, and measurement of village ecology and infrastructure are described in Chami et al [5]. We classified villages by prevalence of S. mansoni and hookworm infections, utilizing WHO categories of prevalence for MDA treatment [18]. High, moderate, and low S. mansoni village prevalence was >50%, ≤50% and ≥10%, and <10%, respectively. For hookworm, high, moderate, and low prevalence was >50%, ≤50% and ≥20%, and <20%, respectively. It should be noted that for low prevalence of hookworm, the WHO does not recommend MDA [18]. All individuals within a village were assigned the village prevalence category. There were five unique categories of prevalence for the 16 Figure S4: P-value distributions and estimated proportion of true null hypotheses for q-values. The q-values, local false discovery rate (FDR), and estimated proportion of true null hypothesesp 0 were estimated for the RR and Spearman r correlations, see Comorbidity Network Methods for a detailed description. A) Histogram of p-values for RR. RR p-values display a U-shaped distribution because a two-sided significance test was run, though RR can only be 0. To account for the U-shaped distribution, the bootstrap method was used to estimatep 0 as described in [36]. B) Histogram of p-values for Spearman r. The smoother method was used to estimatep 0 as described in [14].      Figure S9: Trends in global network density and avg. shortest path. Density A) and average shortest path B) against age and gender comorbidity networks. Density C) and average shortest path D) against socioeconomic and behavioral contexts. Health care quality: low (private clinics only), middle (mixed care), and high (government health centres only). SES: socioeconomic status index. WASH: water, sanitation, and hygiene index.