Prediction of cholesterol ratios within a Korean population

Cholesterol ratios (total cholesterol (TC)/high-density lipoprotein cholesterol (HDL-c) and triglyceride (TG)/HDL-c) have been suggested as better indicators for predicting various clinical features such as insulin resistance and heart disease. Therefore, we aimed to build a single nucleotide polymorphism (SNP) set to predict constitutional lipid metabolism. The genotype data of 7795 samples were obtained from the Korea Association Resource. Of these 7795 samples, 7016 were used to perform 10-fold cross-validation, and the remaining 779 were used as the final validation set. We selected the SNPs that were consistently significant in all 10 cross-validation sets. After the 10-fold cross-validation, six SNPs (rs4420638 (APOC1), rs12421652 (BUD13), rs17411126 (LPL), rs6589566 (ZPR1), rs16940212 (LOC101928635) and rs10852765 (ABCA8)) were finally selected for predicting cholesterol ratios. The weighted genetic risk scores (wGRS) were calculated based on the regression slopes of the six selected SNPs. Our results showed upward trends of wGRS for both the TC/HDL-c and TG/HDL-c ratios within the 10-fold cross-validation. Similarly, the wGRS of the six SNPs also showed upward trends in analyses using the SNP selection set and the final validation set. The six selected SNPs can be used to explain both the TC/HDL-c and TG/HDL-c ratios. Our results may be useful for the prospective prediction of cholesterol-related diseases.


HDS, 0000-0003-1732-7838

Introduction
Blood cholesterol and lipids are well-known heritable risk factors for cardiovascular diseases, including heart attacks and stroke [1,2]. Therefore, numerous large-scale genetic studies have been conducted to identify cholesterol- and lipid-associated markers, and these efforts have revealed many markers that are significantly associated with lipids. For example, one recent genome-wide association study (GWAS) found new lipid-associated markers such as CD163-APOBEC1, NCOA2, NID2-PTGDR and WDR11-FGFR2 [3].
It has been suggested that blood cholesterol ratios based on total cholesterol (TC), triglyceride (TG) and high-density lipoprotein cholesterol (HDL-c) are more effective indicators for the prediction of various cardiovascular diseases than traditional lipid levels [4]. For example, TC and serum lipoprotein ratios were associated with blood pressure [5]. Other previous studies have also reported that the TC/HDL-c ratio was a more effective marker for coronary heart disease risk [6,7]. In addition, the TG/HDL-c ratio was an important marker for insulin resistance, which was related to type 2 diabetes mellitus, particularly in a rural Korean population [8]. Several other previous studies have supported the implications of TG and HDL-c in insulin resistance [9][10][11]. Moreover, TG/HDL-c ratios were reported to be possible indicators of low-density lipoprotein cholesterol particle size in patients with type 2 diabetes and normal HDL-c levels [12].
Considering the effect of cholesterol ratios on clinical features, predicting cholesterol ratios could help improve quality of life. However, previous studies have focused on finding markers for traditional lipid levels. Indeed, only one GWAS has reported significant markers for cholesterol ratios in the Korean population [13].
In the present study, we investigated a single nucleotide polymorphism (SNP) set to predict cholesterol ratios with the weighted genetic risk score (wGRS) method, using the genotype data from the Korea Association Resource (KARE). The wGRS method is a simple, widely used method for building a set of SNPs for prediction. Several previous studies have already shown the usefulness of wGRS as a prediction model for various diseases [14][15][16]. Moreover, we used only SNPs previously reported as significant in GWAS to increase the validity of our study. A further 10-fold cross-validation process was also performed to select SNPs that were consistently significant in all analysis sets.

Study subjects
The present study used the genotype data from the KARE project. This study was approved by the Public Institutional Bioethics Committee as designated by the Ministry of Health and Welfare (P01-201502-31-002). For quality control of the genotype data, we removed samples and SNPs with a call rate lower than 98%, and SNPs with a minor allele frequency (MAF) of less than 0.05 were also excluded from further analyses. Finally, 7795 samples in total (3675 males and 4120 females) were used for the statistical analyses. The 7795 samples were divided into a set of 7016 samples (3308 males and 3708 females), used as the SNP selection set for 10-fold cross-validation, and the remaining 779 samples (367 males and 412 females), used as the final validation set. The statistical power of this study was obtained using the G*Power Version 3.1 software (Universität Kiel, Germany) [17]. The calculated power was over 95% for both the test set (n = 702) and the final validation set (n = 779). Details about the number of samples are shown in table 1.
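The partition described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes a simple random permutation without sex stratification, and the function name `split_samples` and the seed are hypothetical. Because 7016 is not divisible by 10, a strict partition yields folds of 702 or 701 samples, so the 6314/702 split reported in the text corresponds to the larger folds.

```python
import numpy as np

def split_samples(n_total=7795, n_validation=779, n_folds=10, seed=0):
    """Randomly hold out a final validation set, then partition the rest into folds.

    Assumption: plain random shuffling; the paper does not state the exact scheme.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_total)          # shuffled sample indices
    validation = order[:n_validation]         # final validation set
    selection = order[n_validation:]          # SNP selection set
    folds = np.array_split(selection, n_folds)
    return selection, validation, folds

selection, validation, folds = split_samples()

# Each fold serves once as the test set; the other nine folds form the training set.
pairs = [(np.setdiff1d(selection, test), test) for test in folds]
```

Each element of `pairs` holds the training and test indices for one round of the cross-validation.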

SNP pruning for statistical analyses
First, we collected 351 significant SNPs that had been reported in previous cholesterol-related GWAS with a secondary replication study to identify reliable SNPs for cholesterol ratio prediction [18][19][20][21][22]. Then, we obtained the genotype data of the collected GWAS catalogue markers, including other markers in nearby regions (±100 kb from the GWAS markers), from the KARE data (7103 SNPs). The linkage disequilibrium (LD) coefficients (r² > 0.2) of all pairs of SNPs were calculated using the Haploview software to prevent the issue in the wGRS method that is caused by high LD [23]. Among the 7103 SNPs, a set of 691 markers remained after the LD calculation. Then, we excluded SNPs that were not linked (r² < 0.98) to previously reported GWAS catalogue SNPs. Finally, we obtained 134 SNPs for further analyses. The p-values of the SNPs were obtained via regression analyses using the training set (n = 6314) to identify the most significant SNPs. Regression analysis was conducted using the GoldenHelix SVS8 software (Bozeman, MT, USA). Three clinical values (age, sex and body mass index, BMI) were used as covariates. The most significant SNPs in the same LD were selected for each training set. To improve the validity of the present study, we used only SNPs that showed p-values lower than 0.01 in the statistical analyses. The wGRS was calculated as the sum of the number of cholesterol ratio-increasing alleles multiplied by the regression slope across all variants in each set, as previously described [24]: $\mathrm{wGRS} = \sum_{i=1}^{n} (\text{number of risk alleles in SNP}_i) \times \text{weight}_i$, where $n$ is the number of SNPs and $\text{weight}_i$ is the regression slope of SNP $i$. Then, we divided the cholesterol ratios of each set into quartiles and calculated the average wGRS. After 10-fold cross-validation, we selected six SNPs that overlapped across all training sets (electronic supplementary material, table S1).
To observe the variation in wGRS, we applied the wGRS to the quartiles of the validation set that had the same cholesterol ratios as those of the SNP selection set.
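The wGRS calculation and the quartile averaging described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the weights are the TC/HDL-c regression slopes reported in the Results for the SNP selection set, the genotypes are assumed to be coded as counts (0, 1 or 2) of the ratio-increasing allele, and the function names and the toy genotypes are hypothetical.

```python
import numpy as np

# TC/HDL-c regression slopes reported in the Results (SNP selection set, n = 7016).
TC_HDL_WEIGHTS = {
    "rs4420638": 0.239, "rs12421652": 0.139, "rs17411126": 0.118,
    "rs6589566": 0.136, "rs16940212": 0.079, "rs10852765": 0.051,
}

def wgrs(risk_allele_counts, weights):
    """Sum over SNPs of (risk allele count x regression slope).

    risk_allele_counts: (samples x SNPs) array, columns ordered as in `weights`.
    """
    w = np.array(list(weights.values()))
    return np.asarray(risk_allele_counts, dtype=float) @ w

def mean_wgrs_by_quartile(scores, ratio):
    """Average wGRS within each quartile of the cholesterol ratio."""
    scores = np.asarray(scores, dtype=float)
    ratio = np.asarray(ratio, dtype=float)
    edges = np.quantile(ratio, [0.25, 0.5, 0.75])
    quartile = np.searchsorted(edges, ratio, side="left")  # 0 (lowest) .. 3 (highest)
    return np.array([scores[quartile == q].mean() for q in range(4)])

# Toy genotypes (hypothetical): no risk alleles, all risk alleles, and one risk
# allele each at rs4420638 and rs10852765.
toy = [[0, 0, 0, 0, 0, 0], [2, 2, 2, 2, 2, 2], [1, 0, 0, 0, 0, 1]]
toy_scores = wgrs(toy, TC_HDL_WEIGHTS)
```

For the toy genotypes, the scores are 0, 2 × 0.762 = 1.524 and 0.239 + 0.051 = 0.290, respectively.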

Results
Overall, the average age, BMI, TC and HDL-c were higher in female subjects than in male subjects (age, 51.8 and 52.7; BMI, 24.3 and 24.9; TC, 197.9 and 199.3; HDL-c, 47.7 and 50.9 in men and women, respectively). Similar results were observed in the SNP selection set and the final validation set. By contrast, TG was higher in male than in female subjects (171.1 for men and 138.2 for women overall). Detailed information about the clinical characteristics is shown in table 1.
The analysis process for the cholesterol ratio prediction is summarized in figure 1. Among all GWAS catalogue and nearby SNPs (around 100 kb), twelve SNPs (rs4420638, rs6589566, rs12421652, rs17411126, rs16940212, rs10852765, rs12229654, rs1250252, rs12686004, rs164212, rs2297194 and rs496311) reached our p-value threshold (p < 0.01) for both the TC/HDL-c and TG/HDL-c ratios (table 2). We performed 10-fold cross-validation by randomly dividing the 7016 samples of the SNP selection set into 6314 samples as a training set and 702 samples as a test set. The 10-fold cross-validation process identified that only six SNPs (rs4420638 (APOC1), rs12421652 (BUD13), rs17411126 (LPL), rs6589566 (ZPR1), rs16940212 (LOC101928635) and rs10852765 (ABCA8)) consistently showed significance in all 10 training sets (the highest p-value was 0.01 for rs10852765 in sets 8 and 9). Information on the selected six SNPs is listed in table 3 with their location, allele information and genotype data with their average cholesterol ratios. Based on the results of the regression analyses of the training sets (n = 6314) during the 10-fold cross-validation, we calculated the wGRS using the six SNPs and applied the wGRS to the corresponding test sets (n = 702). The regression slopes of the six SNPs in the training sets are listed in electronic supplementary material, table S1 with their p-values. After performing 10-fold cross-validation, we observed the relationship between wGRS and the cholesterol ratios. Our results showed upward trends for wGRS with increases of the TC/HDL-c ratio in both the training set (R² = 0.8864, p < 0.0001) and the test set (R² = 0.8279, p < 0.0001) (electronic supplementary material, figure S1a). The TG/HDL-c ratios also showed similar results, with an R² value of 0.8033 for the training set and 0.8248 for the test set (electronic supplementary material, figure S1d).
In addition, we also found upward trends in all other subgroup analyses using male and female subjects (R² > 0.5, p < 0.0001) (electronic supplementary material, figure S1b,c,e and f).
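The selection rule applied during the 10-fold cross-validation (keeping only SNPs that pass the p-value threshold in every training set) can be sketched as a set intersection. This is a minimal sketch: the function name, the input layout and the toy p-values are hypothetical, and the strict comparison p < 0.01 follows the threshold stated in the Methods.

```python
def consistently_significant(pvalues_by_fold, threshold=0.01):
    """Return the SNPs whose p-value is below `threshold` in every training set.

    pvalues_by_fold: list of dicts {snp_id: p_value}, one dict per training set.
    A SNP missing from any fold is treated as not significant in that fold.
    """
    selected = None
    for pvals in pvalues_by_fold:
        hits = {snp for snp, p in pvals.items() if p < threshold}
        selected = hits if selected is None else selected & hits
    return selected or set()

# Toy p-values (hypothetical) for three training sets.
toy_folds = [
    {"snpA": 0.001, "snpB": 0.02, "snpC": 0.009},
    {"snpA": 0.005, "snpB": 0.001, "snpC": 0.0099},
    {"snpA": 0.0001, "snpC": 0.01},
]
kept = consistently_significant(toy_folds)
```

In the toy input, only `snpA` passes in all three sets: `snpB` fails in the first set and is absent from the third, and `snpC` fails the strict threshold in the third.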
Finally, regression slopes were calculated using the SNP selection set (n = 7016) to apply the wGRS to the final validation set (n = 779) (rs4420638, 0.239 and 0.00249; rs12421652, 0.139 and 0.00233; rs17411126, 0.118 and 0.00227; rs6589566, 0.136 and 0.00274; rs16940212, 0.079 and 0.00095; rs10852765, 0.051 and 0.00066 for the TC/HDL-c and TG/HDL-c ratios, respectively). As expected, the wGRS showed upward trends with increases of both the TC/HDL-c and TG/HDL-c ratios, similar to the results from the 10-fold cross-validation (figure 2). Although the wGRS of the third quartile for the female subjects was lower than that of the second quartile (figure 2d), the analyses for male and female subjects showed a generally upward trend in wGRS as the TC/HDL-c and TG/HDL-c ratios increased.

Table 3. Information on the markers used for cholesterol ratio prediction. Gene name, location and position of the SNPs are listed based on the NCBI database. C/C, C/R and R/R represent the homozygote of the major allele, the heterozygote and the homozygote of the minor allele, respectively. LD information was obtained from 1000 Genomes project data (http://www.internationalgenome.org/). GWAS, genome-wide association study; MAF, minor allele frequency. Kim et al. [19]. Coram et al. [20].

Discussion
To our knowledge, this is the first attempt made to build an SNP set for the prediction of both the TC/HDL-c and TG/HDL-c ratios in a Korean population using the wGRS method. The analysis scheme of the present study was designed based on previous studies [25,26]. Our results consistently showed an upward wGRS trend with increasing cholesterol ratios in all analyses, including final validation. These results indicated that the selected six SNPs (rs4420638 (APOC1), rs12421652 (BUD13), rs17411126 (LPL), rs6589566 (ZPR1), rs16940212 (LOC101928635) and rs10852765 (ABCA8)) could be used for the prediction of both TC/HDL-c and TG/HDL-c ratios in a Korean population.
In the present study, we established various sample sets using a total of 7795 subjects. To confirm the cholesterol values of our sample sets, we consulted a previous large-scale report on cholesterol in Korean subjects [27]. According to the previous report, the TC and HDL-c levels of Korean men were slightly lower than those of women. By contrast, TG was higher in men than in women. Similar differences were also found in all of our sample sets, indicating that our sample sets were suitable for a cholesterol prediction study in the Korean population. According to the results (electronic supplementary material, figure S1), our SNP set for cholesterol ratio prediction showed good prediction ability in analyses using total subjects (R² > 0.8). However, the prediction ability for men and women was slightly lower than that for the total samples in the analyses for both TC/HDL-c and TG/HDL-c (R² = 0.5050 and 0.6763 for men; R² = 0.6162 and 0.7700 for women). Further sex-specific analyses might be helpful for more precise cholesterol prediction.
Several previous studies have shown the importance of the six selected SNPs and genes for cholesterol metabolism and various diseases. The rs4420638, which is located in the APOE-APOC1-APOC4-APOC2 cluster, showed a protective effect on LDL-cholesterol levels [28]. The rs4420638 was also associated with the risk of coronary heart disease in an Asian population [29]. The association of LPL with lipid variables and coronary artery disease has been reported many times [30][31][32]. One recent study has demonstrated that the rs17411126, which is linked to rs326 in LPL (r² = 1.00), was implicated in the increase of HDL-c and APOA1 after a high-carbohydrate and low-fat diet in males of the Han Chinese population [33]. In addition, several studies have suggested that the rs6589566 could be a marker for the risk of coronary artery disease [34][35][36]. Moreover, the apolipoprotein A5 haplotypes, including rs6589566, were implicated in the elevation of the TG/HDL-c ratio and the risk for metabolic syndrome in a Korean population [37]. The exact roles of rs12421652 (linked to rs11216126 in BUD13, r² = 1.00), rs16940212 (LOC101928635) and rs10852765 (linked to rs4148008 in ABCA8, r² = 0.98) in lipid metabolism are not fully understood yet. A strong association between rs16940212 and blood cholesterol levels (TG and HDL-c) was reported in a previous study using a Korean population [38]. In addition, previous studies have found several pieces of evidence linking these genes to lipid metabolism. ABCA8 might function as a transporter of lipophilic substrates such as the bioactive lipid leukotriene C4 [39]. In addition, a differential lipid response to statins was observed in a previous association study that used SNPs in the BUD13-APOA5 gene region [40]. Further studies may be needed to understand the effects of the SNPs on genes and lipid metabolism.
A recent study suggested a marker set for the prediction of cholesterol levels using various models, such as Ridge Regression, Lasso and Hyper-Lasso, with a Caucasian population [41]. Another study identified 19 of the most significant SNPs among the markers in 17 lipid-related genes in a Hispanic population [42]. Unfortunately, we failed to find our selected six SNPs in either of these previous studies. This inconsistency may be caused by differences in genetic background between Koreans and other populations, and indicates that our SNP set may not be suitable for the prediction of cholesterol ratios in other populations.
In summary, we composed an SNP set to predict cholesterol ratios using six markers. Using these markers, the wGRS increased with both the TC/HDL-c and TG/HDL-c ratios during the 10-fold cross-validation process. These results were also replicated in further analysis using the final validation set, as predicted. Although the exact role of the six SNPs in lipid metabolism has not been fully elucidated, the SNPs explained the cholesterol ratio variation well for a Korean population. Our results might provide valuable information for the prevention of various diseases, including cardiovascular diseases.
Ethics. The present study used the genotype data from the KARE project. This study was approved by the Public Institutional Bioethics Committee as designated by the Ministry of Health and Welfare (P01-201502-31-002). This study was provided with biospecimens and data from the Korean Genome Analysis Project (4845-301), the Korean Genome and Epidemiology Study (4851-302) and Korea Biobank Project (4851-307, KBP-2015-035), which are supported by the Korea Center for Disease Control and Prevention, Republic of Korea.