Enhancing selection of alcohol consumption-associated genes by random forest

Chenglin Lyu; Roby Joehanes; Tianxiao Huan; Daniel Levy; Yi Li; Mengyao Wang; Xue Liu; Chunyu Liu; Jiantao Ma

doi:10.1017/S0007114524000795

Published online by Cambridge University Press: 12 April 2024

Chenglin Lyu: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA Department of Anatomy and Neurobiology, Boston University Chobanian & Avedisian School of Medicine, Boston, MA 02118, USA
Roby Joehanes: Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Tianxiao Huan: Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Daniel Levy: Affiliation:
Framingham Heart Study and Population Sciences Branch, NHLBI, Framingham, MA 01702, USA
Yi Li: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Mengyao Wang: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Xue Liu: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Chunyu Liu*: Affiliation:
Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
Jiantao Ma*: Affiliation:
Nutrition Epidemiology and Data Science, Friedman School of Nutrition Science and Policy, Tufts University, Boston, MA 02111, USA
*Corresponding authors: Chunyu Liu, email: liuc@bu.edu; Jiantao Ma, email: jiantao.ma@tufts.edu

Abstract

Machine learning methods have been used in identifying omics markers for a variety of phenotypes. We aimed to examine whether a supervised machine learning algorithm can improve identification of alcohol-associated transcriptomic markers. In this study, we analysed array-based, whole-blood derived expression data for 17 873 gene transcripts in 5508 Framingham Heart Study participants. By using the Boruta algorithm, a supervised random forest (RF)-based feature selection method, we selected twenty-five alcohol-associated transcripts. In a testing set (30 % of entire study participants), the AUC (area under the receiver operating characteristic curve) values of these twenty-five transcripts were 0·73, 0·69 and 0·66 for non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers and moderate drinkers v. heavy drinkers, respectively. The AUC of the transcripts selected by the Boruta method were comparable to those of transcripts identified using conventional linear regression models; for example, the AUC of 1958 transcripts identified by conventional linear regression models (false discovery rate < 0·2) were 0·74, 0·66 and 0·65, respectively. With Bonferroni correction for the twenty-five Boruta method-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, three transcripts with type 2 diabetes and one transcript with hypertension. For example, we observed that alcohol consumption was inversely associated with the expression of DOCK4, IL4R and SORT1, that DOCK4 and SORT1 were positively associated with obesity, and that IL4R was inversely associated with hypertension. In conclusion, using a supervised machine learning method, the RF-based Boruta algorithm, we identified novel alcohol-associated gene transcripts.

Keywords

Alcohol consumption; Gene expression; CVD; Machine learning; Random forest; Boruta

Type: Research Article
Information: British Journal of Nutrition, Volume 131, Issue 12, 28 June 2024, pp. 2058-2067

DOI: https://doi.org/10.1017/S0007114524000795
Copyright: © The Author(s), 2024. Published by Cambridge University Press on behalf of The Nutrition Society

Alcohol consumption is an important lifestyle factor that has been associated with cardiovascular health. Excessive alcohol consumption leads to hypertension, dyslipidemia and type 2 diabetes^{(Reference Emanuele, Swade and Emanuele1,Reference Chait, Mancini and February2)}. In contrast, moderate alcohol consumption may improve cardiovascular health, although several recent studies suggest no beneficial relationship with reduction of CVD^{(3–Reference Stockwell, Zhao and Panwar5)}. High-throughput transcriptomic analysis has played a significant role in investigating the pathogenesis of CVD^{(Reference Huan, Esko and Peters6–Reference Benton, Lea and Macartney-Coxson8)}. In our previous study, using conventional linear regression models, we examined associations between alcohol consumption and transcriptomic markers in the community-based Framingham Heart Study (FHS)^{(Reference Ma, Huang and Yan9)}.

‘Big Data’ applications such as machine learning approaches provide new tools to discover novel biomarkers for better understanding of molecular mechanisms underlying diseases and to increase accuracy of disease predictions^{(Reference Luo, Wu and Gopukumar10)}. Random forest (RF) is a supervised machine learning method that scores the importance of the features in a dataset^{(Reference Breiman11,Reference Hu and Szymczak12)}. RF is a promising approach for prediction and classification that can reduce bias^{(Reference Breiman11,Reference Hu and Szymczak12)}. RF has been successfully applied in analysing different types of omics biomarkers^{(Reference Degenhardt, Seifert and Szymczak13–Reference Swan, Mobasheri and Allaway15)}. Boruta is an extension of RF that evaluates the importance of original features by comparing them with their randomised copies^{(Reference Kursa, Jankowski and Rudnicki16)}. In essence, the Boruta method is an automatic feature selection method. The Boruta method has been used in over 100 studies in selecting omics biomarkers related to diseases or traits^{(Reference Degenhardt, Seifert and Szymczak13)}. A recent study showed that, using simulated and published datasets, the Boruta method was a stable RF-based feature selection approach^{(Reference Acharjee, Larkman and Xu17)}.

Analysis using conventional linear regression may experience issues with multiple testing and cannot effectively handle high-order interactions among tested biomarkers^{(Reference Liu, Ackerman and Carulli18)}. Compared with conventional linear regression, the RF method offers alternative analytical models that may have several advantages such as model flexibility^{(Reference Steyerberg, van der Ploeg and Van Calster19)}. RF-based approaches may improve the handling of high-dimensional data by decorrelating the classifiers and minimising the influence of over-fitting^{(Reference Polewko-Klim, Lesinski and Golinska20)}. However, it is unclear whether using RF with automatic feature selection algorithms such as the Boruta method can identify additional alcohol-associated transcriptomic markers. To address this research question, we aimed to use RF with the Boruta method to improve the identification of alcohol-associated gene transcripts and to examine the associations of these gene transcripts with CVD risk factors in the FHS.

Methods

Study participants

The FHS participants included in the present study are those who attended the eighth examination (2005–2008) of the Offspring cohort or the second examination (2008–2011) of the Third Generation cohort^{(Reference Feinleib, Kannel and Garrison21,Reference Splansky, Corey and Yang22)}. The study sample was the same as that used in our previous analysis of alcohol-associated gene transcripts using conventional linear regression^{(Reference Ma, Huang and Yan9)}. Briefly, after excluding participants with missing data on alcohol consumption and gene expression, we included 5508 participants, 2381 from the Offspring cohort and 3127 from the Third Generation cohort. This study was conducted according to the guidelines laid down in the Declaration of Helsinki, and all procedures involving human subjects were approved by the Institutional Review Board for Human Research at Boston University Medical Center (IRB number: H-41461). Written informed consent was obtained from all participants.

Alcohol consumption

Participants’ alcohol consumption was measured by a technician-administered questionnaire during the physical examination in the FHS clinic. The frequency of standard servings of beer, wine and spirits consumed in a typical week or month was documented. We calculated the grams (g) of ethanol consumed each day using the following conversion factors: one 12 oz. beer has 14 g of ethanol, one 4–5 oz. glass of wine has 14 g of ethanol, and one 1·5 oz. serving of 80 proof liquor has 14 g of ethanol^{(Reference Liu, Marioni and Hedman23)}. Based on the estimated daily alcohol consumption, we categorised our study participants into three groups: non-drinkers (n 1729), moderate drinkers (0·1–28 g/d in women and 0·1–42 g/d in men; n 3427) and heavy drinkers (> 28 g/d in women and > 42 g/d in men; n 352). We also split the moderate drinkers into light drinkers (0·1–14 g/d in women and 0·1–28 g/d in men; n 2806) and at-risk drinkers (14·1–28 g/d in women and 28·1–42 g/d in men; n 621) and conducted sensitivity analyses separately for the two groups.
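The conversion and grouping rules above can be illustrated with a minimal Python sketch. This is not the authors' code (the study used R); the function names, the 'F'/'M' sex coding and the treatment of any intake above 0 g/d as drinking are assumptions made for illustration.

```python
# Illustrative sketch only: names and the sex coding ('F'/'M') are hypothetical.
GRAMS_PER_SERVING = 14.0  # per the text: each standard serving carries 14 g ethanol


def grams_per_day(beer_per_week, wine_per_week, liquor_per_week):
    """Convert weekly standard servings into grams of ethanol per day."""
    servings = beer_per_week + wine_per_week + liquor_per_week
    return servings * GRAMS_PER_SERVING / 7.0


def drinking_category(g_per_day, sex):
    """Three-level grouping: heavy is > 28 g/d in women and > 42 g/d in men."""
    heavy_cutoff = 28.0 if sex == "F" else 42.0
    if g_per_day <= 0:  # assumption: zero intake defines a non-drinker
        return "non-drinker"
    if g_per_day <= heavy_cutoff:
        return "moderate"
    return "heavy"


def moderate_subgroup(g_per_day, sex):
    """Split moderate drinkers into light and at-risk drinkers."""
    if drinking_category(g_per_day, sex) != "moderate":
        return None
    light_cutoff = 14.0 if sex == "F" else 28.0
    return "light" if g_per_day <= light_cutoff else "at-risk"
```

For instance, one beer per day corresponds to 14 g/d, which falls in the light-drinker range for both women and men under these cut-offs.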

Gene expression profiling

We analysed gene expression levels that were measured using the GeneChip Human Exon 1.0 ST Array as described previously^{(Reference Joehanes, Ying and Huan24)}. Briefly, fasting peripheral whole blood samples, from the same examinations at which alcohol consumption was assessed, were collected in PAXgene^TM tubes. Standard operating procedures were followed to isolate RNA using a KingFisher^® 96 robot, and 50 ng RNA was amplified to create the cDNA library. The Affymetrix 7G GCS3000 scanner was used to measure gene expression levels, and the Human Exon 1.0 ST Array probeset was used to annotate gene transcripts. The final gene expression profiles were residuals of 17 873 transcripts of autosomal genes generated using linear mixed models with adjustment for technical covariates and other factors as fixed effects and batch as a random effect^{(Reference Joehanes, Ying and Huan24)}.

CVD risk factors

Obesity, hypertension and type 2 diabetes status, assessed at the same examinations as alcohol consumption and gene expression, were analysed in the present study^{(Reference Sun, Ho and Gao25)}. Obesity was defined as BMI ≥ 30 kg/m². Hypertension was defined as systolic blood pressure (SBP) ≥ 140 mm Hg or diastolic blood pressure (DBP) ≥ 90 mm Hg or taking antihypertensive drugs for high blood pressure. As an alternative definition, we also defined hypertension as SBP > 130 mm Hg or DBP > 80 mm Hg or taking antihypertensive drugs^{(Reference Czuriga-Kovacs, Czuriga and Kardos26)}. Type 2 diabetes was defined as fasting blood glucose level ≥ 126 mg/dl or taking antidiabetic drugs.
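As a minimal sketch of how these definitions translate into binary outcomes, the Python functions below encode the thresholds stated above. The function and argument names are hypothetical and are not taken from the study's code.

```python
# Illustrative sketch only: function and argument names are hypothetical.

def is_obese(bmi):
    """Obesity: BMI >= 30 kg/m^2."""
    return bmi >= 30.0


def is_hypertensive(sbp, dbp, on_bp_drugs, lower_threshold=False):
    """Primary definition: SBP >= 140 or DBP >= 90 mm Hg or on treatment.

    With lower_threshold=True, the alternative definition is applied:
    SBP > 130 or DBP > 80 mm Hg or on treatment.
    """
    if lower_threshold:
        return bool(sbp > 130 or dbp > 80 or on_bp_drugs)
    return bool(sbp >= 140 or dbp >= 90 or on_bp_drugs)


def has_type2_diabetes(fasting_glucose_mg_dl, on_diabetes_drugs):
    """Type 2 diabetes: fasting glucose >= 126 mg/dl or on antidiabetic drugs."""
    return bool(fasting_glucose_mg_dl >= 126 or on_diabetes_drugs)
```

A participant with blood pressure 135/85 mm Hg and no medication is classified as normotensive under the primary definition and as hypertensive under the alternative one.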

Statistical analysis

We performed three main statistical analyses (Fig. 1): (1) using the Boruta method to select alcohol-associated gene transcripts, (2) using RF to examine the prediction capability of Boruta-selected transcripts for alcohol consumption categories and (3) examining the cross-sectional associations of Boruta-selected transcripts with three CVD risk factors (obesity, hypertension and type 2 diabetes). These analyses were performed in RStudio (version 4.1.2).

Fig. 1. Study flow chart. FDR, false discovery rate; FHS, Framingham Heart Study; MSigDB, Molecular Signatures Database.

Use Boruta algorithm for gene selection

The RF method evaluates the importance of variables in the models by the mean decrease in accuracy and the Gini index^{(Reference Breiman11)}. However, the regular RF method does not provide cut-off values for these parameters for the purpose of variable selection. The Boruta algorithm extends the regular RF method by reporting the level of the predictors as ‘Confirmed’, ‘Tentative’ and ‘Rejected’^{(Reference Kursa, Jankowski and Rudnicki16,Reference Kursa and Rudnicki27)}. We therefore used the Boruta method, implemented with the R Boruta package^{(Reference Kursa and Rudnicki27)}, to facilitate automatic selection of alcohol-associated gene transcripts. In this analysis, alcohol consumption (g/d) was treated as the outcome variable and gene transcripts were the main predictors, with sex and age as covariates. We used parameter doTrace = 2 to obtain ‘confirmed’ attributes, that is, alcohol-associated gene transcripts. To achieve biological and statistical relevance of the transcripts determined by the Boruta algorithm, we applied two filtering approaches, data-driven and pathway-based, to choose the transcripts to be tested. The first two sets were selected using the data-driven approach. The first set included 15 146 gene transcripts with absolute pairwise Pearson’s r < 0·6 and the second set included 1958 gene transcripts with false discovery rate (FDR) < 0·2 in the meta-analysis from our previous alcohol-associated gene transcript analysis using conventional linear regression models^{(Reference Ma, Huang and Yan9)}. The third to the fifth sets of gene transcripts were determined based on well-established gene pathway databases, including Wikipathways (n 6890), Molecular Signatures Database (MSigDB) hallmark gene sets (H; 4003 genes) and MSigDB immunological signature gene sets (C7; 14 580 genes)^{(Reference Martens, Ammar and Riutta28–Reference Subramanian, Tamayo and Mootha30)}. One at a time, we ran Boruta models for these five sets of transcripts.
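The core idea behind Boruta, comparing each real feature against permuted 'shadow' copies, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the Boruta package: the importance score here is the absolute Pearson correlation with the outcome (a stand-in for RF importance), and the formal statistical test that yields ‘Confirmed’, ‘Tentative’ and ‘Rejected’ labels is replaced by a plain count of rounds won. All names are hypothetical.

```python
import math
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation; stand-in importance score (not RF importance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def shadow_feature_hits(features, outcome, n_rounds=20, seed=0):
    """Count, per feature, the rounds in which it beats the best shadow feature.

    features: dict mapping feature name -> list of values (one per participant).
    A shadow feature is a shuffled copy of a real feature, so any association
    it shows with the outcome arises by chance alone.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        shadow_scores = []
        for values in features.values():
            shuffled = list(values)
            rng.shuffle(shuffled)
            shadow_scores.append(abs_pearson(shuffled, outcome))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_pearson(values, outcome) > best_shadow:
                hits[name] += 1
    return hits
```

A feature that beats the best shadow in nearly every round is a candidate for selection, whereas one that rarely does behaves like noise. In the actual algorithm, importance comes from repeatedly refitted random forests, which lets the comparison capture non-linear and interaction effects that a correlation score would miss.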

Gene ontology analysis

A web-based gene ontology (GO) analysis (http://geneontology.org/) was performed to evaluate the biological processes relevant to the Boruta method-selected transcripts^{(Reference Thomas, Ebert and Muruganujan31)}. Fisher’s exact tests were conducted using the default reference gene list. GO terms with FDR < 0·05 were considered statistically significant.
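For readers unfamiliar with enrichment testing, the following Python sketch shows the one-sided (over-representation) form of Fisher's exact test computed from the hypergeometric distribution. It is an illustration only; the web tool's exact settings are not described in the text, and the argument names and any counts passed in are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: genes in the reference list
    K: reference genes annotated to the GO term
    n: genes in the selected list
    k: selected genes annotated to the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

The P value shrinks as more of the selected genes fall into a given term than expected by chance; correcting the resulting P values across all tested terms gives the FDR used for the significance threshold.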

Examine prediction capability of selected gene transcripts

We used RF models to examine whether the Boruta method-selected gene transcripts can distinguish different levels of alcohol consumption. Three comparisons were performed: non-drinkers v. moderate drinkers, non-drinkers v. heavy drinkers, and moderate drinkers v. heavy drinkers. The R randomForest package was used to perform these comparisons^{(Reference Liaw and Wiener32)}. We randomly divided our study participants into a training set, which included 70 % of all participants, and a testing set, which included the remaining 30 %. The training data were used to train the RF model with default parameters: ntree (number of trees to grow) = 500 and mtry (number of variables randomly sampled as candidates at each split) = square root of the number of attributes tested. The out-of-bag error rate in the training set was used to determine the performance of the RF model, and the area under the receiver operating characteristic (ROC) curve (AUC) derived from the testing set was used to evaluate the prediction capability of the selected predictors.
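A minimal Python sketch of the 70/30 split and the default mtry value is given below. It is illustrative only: the seed, the rounding rule for the training-set size and the function names are assumptions, and the study's own split was performed in R.

```python
import math
import random


def split_train_test(participant_ids, train_fraction=0.7, seed=2024):
    """Randomly split participant IDs into training and testing sets.

    The seed and the use of round() for the training-set size are
    assumptions made for this illustration.
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def default_mtry(n_predictors):
    """Classification default described in the text: floor(sqrt(p)), at least 1."""
    return max(1, math.isqrt(n_predictors))
```

With 5508 participants this yields 3856 training and 1652 testing IDs, and with twenty-five transcripts plus sex and age (27 predictors) the default mtry is 5.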

Four sets of predictors were analysed: 1958 transcripts with FDR < 0·2 in meta-analysis (set 1) and twenty-five alcohol-associated genes with significant Bonferroni-corrected P values (set 2) in our previous alcohol-associated gene transcript analysis^{(Reference Ma, Huang and Yan9)}, Boruta method-selected gene transcripts (set 3) and 144 alcohol consumption-associated CpG (DNA methylation sites) identified in a previous epigenome-wide association study and meta-analysis (set 4)^{(Reference Liu, Marioni and Hedman23)}. We examined these four sets of predictors one at a time. In addition to these omics predictors, sex and age were covariates in all models. To determine the optimal threshold value for AUC calculation and avoid over- or under-sampling misclassification, we iterated each model ten times. The first iteration used default values. In the second iteration, using the coords function in the R pROC package^{(Reference Robin, Turck and Hainard33)}, we calculated the maximum value of the sum of specificity and sensitivity using the Youden method based on the initial AUC calculation. This maximum value was used to derive the threshold for AUC calculation in this iteration. This process was repeated in the remaining iterations. We reported the AUC corresponding to the lowest out-of-bag error rate after the initial iteration. We also compared the AUC calculated for the four different sets of predictors using the DeLong algorithm, implemented using the R pROC package. Code for the Boruta method and the RF-based AUC calculation is provided in the Supplementary material.
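The two quantities used here, the AUC and the Youden-optimal threshold, can be computed directly from predicted scores, as in the Python sketch below. It is a plain reimplementation for illustration, not the pROC code used in the study; the function names and the toy scores in the usage note are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control.

    labels are 1 (case) or 0 (control); tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A participant is called positive when the score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

For toy scores [0.1, 0.4, 0.35, 0.8] with labels [0, 0, 1, 1], three of the four case-control pairs are ordered correctly, so the AUC is 0.75, and the Youden criterion selects 0.35 as the threshold.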

Association analysis between the expression level of selected genes with CVD risk factors

We performed cross-sectional analyses between the Boruta method-selected transcripts and obesity, hypertension, and type 2 diabetes. Covariates included age, sex, current smoking status, cohort (Offspring or Third Generation cohort), estimated blood cell compositions^{(Reference Joehanes, Ying and Huan24)} and BMI (only in analyses for hypertension and type 2 diabetes). Generalised estimating equations were used to account for familial relationships. Bonferroni correction (i.e. 0·05 divided by the product of the number of transcripts selected and the three CVD risk factors) was applied to determine statistical significance.
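The Bonferroni thresholds quoted in this article follow from simple arithmetic, reproduced in the short Python sketch below (the helper name is hypothetical).

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Twenty-five transcripts tested against three CVD risk factors: 75 tests.
cvd_threshold = bonferroni_threshold(0.05, 25 * 3)

# Twenty-five transcripts tested once each (interaction analyses).
interaction_threshold = bonferroni_threshold(0.05, 25)

# 17 176 genes in the previous transcriptome-wide meta-analysis.
transcriptome_threshold = bonferroni_threshold(0.05, 17176)
```

These evaluate to approximately 6·7e-4, 0·002 and 2·9e-6, respectively, matching the thresholds reported in the text.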

Interaction analyses and stratification analyses

We examined potential interaction between alcohol consumption and sex and age (on a continuous scale) in relation to gene expression for transcripts identified by the Boruta method. Linear mixed regression was performed accounting for family structure in FHS. A product term of alcohol consumption and sex or of alcohol consumption and age was added to the models. Covariates included sex, age, current smoking status, the FHS cohort index (Offspring v. Third Generation) and blood cell counts (counts of white cells, red cells and platelets, and proportions of neutrophils, lymphocytes, monocytes, basophils and eosinophils)^{(Reference Ma, Huang and Yan9)}. We also performed interaction analysis between transcripts selected by the Boruta method and sex and age in relation to the three CVD risk factors. In these analyses, we used the same generalised estimating equation modelling described above in the main effect analysis to test the statistical significance of the product term of transcripts and sex or age. Further, we stratified our study participants by sex and age (below or above the median age of 55 years) and reran the association analysis between transcripts and CVD risk factors in each stratum.

Results

Study participants

About 54·3 % of participants were women, and the average age of the participants was 55·4 years (Table 1). We classified the participants into three categories based on alcohol consumption levels: non-drinkers, moderate drinkers and heavy drinkers. Non-drinkers tended to be the oldest, followed by heavy drinkers and moderate drinkers. Men tended to drink more alcohol than women. More heavy drinkers were current smokers (19 %) compared with non-drinkers (9 %) and moderate drinkers (7 %). The proportion of participants with obesity and type 2 diabetes was highest in non-drinkers (38 % and 16 %, respectively), while the proportion of participants with hypertension was highest in heavy drinkers (53 %).

Table 1. Participant characteristics

Values are represented as mean ± sd or n (%); alcohol consumption is presented as median (IQR).

Use Boruta algorithm for gene selection

The Boruta method selected six gene transcripts (SORT1, ODC1, CTSG, IL4R, MPO and CYTH1) from the Wikipathways set, ten transcripts (IFI44L, P2RY14, PLAGL1, DOCK4, GAPVD1, IFITM1, UTP20, MPO, ATP5F1D and RBM38) from the MSigDB hallmark pathway set and eleven transcripts (FCGR1A, IFI6, ABCA13, DOCK4, LCN2, DDX58, OLFM4, CTSG, MPO, CEACAM8 and BPI) from the MSigDB immunological signature sets (Table 2). Among transcripts that were associated with alcohol consumption at FDR < 0·2 in our previous analysis using linear regression models^{(Reference Ma, Huang and Yan9)}, the Boruta method selected four transcripts (OLFM4, CTSG, MPO and CEACAM8). From those with absolute pairwise r < 0·6, the Boruta method selected three transcripts (SORT1, DOCK4 and TNFSF13B). After removing duplicated transcripts (Table 2), we found twenty-five alcohol-associated transcripts using the Boruta method. We compared the differences in gene expression levels in moderate and heavy drinkers relative to non-drinkers (online Supplementary Fig. 1). We found no substantial evidence supporting non-linear relationships between alcohol consumption and these twenty-five transcripts. Also, we found no statistically significant interaction between the twenty-five transcripts and sex or age at P < 0·002 (Bonferroni correction for twenty-five transcripts; online Supplementary Table 7).

Table 2. Boruta algorithm-selected genes

MSigDB, Molecular Signatures Database; FDR, false discovery rate.

×: transcripts have been identified using conventional linear regression models (see ref. 9).

P values are from meta-analysis in ref. 9.

Transcription start and stop positions are based on GRCh37.

Among these twenty-five Boruta method-selected transcripts, twelve (MEIS1, ODC1, ABCA13, OLFM4, CTSG, CEACAM8, LCN2, UTP20, DOCK4, IL4R, MPO and BPI) had P < 2·9e-6 (Bonferroni correction for 17 176 genes) in our previous meta-analysis based on linear regression models^{(Reference Ma, Huang and Yan9)}. Of these twelve transcripts, six (MEIS1, ODC1, ABCA13, OLFM4, CTSG and CEACAM8; Table 2) were also among those (n 25) significant using the discovery and replication strategy (P < 8e-4 in the discovery analysis and P < 1·9e-4 in the replication analysis)^{(Reference Ma, Huang and Yan9)}. The correlation between the thirteen unique transcripts identified by the Boruta method and those identified by the conventional linear models (either using discovery and replication or meta-analysis; n 101) was largely modest, with 97 % of pairs having Pearson’s |r| < 0·3 (online Supplementary Fig. 3). The pairwise correlation of the twenty-five Boruta method-selected transcripts ranged from 0 to 0·84 (Pearson’s |r|) (online Supplementary Fig. 2). There were 240 pairs of transcripts with |r| < 0·3, 38 pairs with |r| between 0·3 and 0·6, and 22 pairs with |r| > 0·6. Among these twenty-two pairs with |r| > 0·6, there were three clusters of transcripts (online Supplementary Fig. 2): (1) IFI6, DDX58 and IFITM1, (2) MPO, CTSG, LCN2, BPI, CEACAM8, ABCA13 and OLFM4, and (3) ODC1 and RBM38.
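The pair counts above can be checked with simple combinatorics: twenty-five transcripts form 25 × 24 / 2 = 300 unordered pairs, which equals 240 + 38 + 22. The Python sketch below bins pairwise |r| values in the same way; the function names, the handling of values exactly at 0·3 or 0·6 and the toy expression data in the usage note are assumptions for illustration.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bin_pairwise_correlations(expression):
    """Count transcript pairs by |r| band.

    expression: dict mapping transcript name -> list of expression values.
    Assumption: values exactly at 0.3 or 0.6 fall in the middle band.
    """
    counts = {"<0.3": 0, "0.3-0.6": 0, ">0.6": 0}
    for a, b in combinations(sorted(expression), 2):
        r = abs(pearson_r(expression[a], expression[b]))
        if r < 0.3:
            counts["<0.3"] += 1
        elif r <= 0.6:
            counts["0.3-0.6"] += 1
        else:
            counts[">0.6"] += 1
    return counts
```

Applied to three toy transcripts, two of which are perfectly correlated and a third that is uncorrelated with both, the function returns one pair in the highest band and two in the lowest.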

Gene ontology analysis

We found that the twenty-five Boruta method-selected transcripts were enriched in ten GO biological processes (online Supplementary Table 1). The ancestor charts of these significant GO terms are shown in online Supplementary Fig. 4. These significant GO terms are primarily related to defence response to bacterium (GO:0042742; P = 2·9e-5; FDR = 0·04) and immune response (GO:0006955; P = 1·4e-6; FDR = 0·004). We observed that several transcripts with |r| > 0·6 were among the enriched genes, for example, IFI6 and DDX58 from the first cluster (online Supplementary Fig. 2).

Examine prediction capability of selected gene transcripts

Fig. 2 shows the ROC curves for the four sets of predictors derived from the present analysis and our previous studies: 1958 transcripts with FDR < 0·2 based on conventional regression^{(Reference Ma, Huang and Yan9)}, twenty-five transcripts using the discovery and replication strategy based on conventional regression^{(Reference Ma, Huang and Yan9)}, the twenty-five Boruta method-selected transcripts and 144 alcohol-associated CpG^{(Reference Liu, Marioni and Hedman23)}. In addition, we integrated predictors from the latter three sets to test whether additively combining transcripts and CpG might improve prediction. We calculated the AUC based on the lowest out-of-bag error rate and the largest AUC from the ten iterations (online Supplementary Table 2). For all predictors, the AUC based on the lowest out-of-bag error rate was slightly better in the analyses for non-drinkers v. heavy drinkers (0·73–0·77) compared with that for non-drinkers v. moderate drinkers (0·66–0·70) and moderate drinkers v. heavy drinkers (0·65–0·70). In the analysis comparing non-drinkers and heavy drinkers, the AUC of the twenty-five Boruta method-selected transcripts (0·73) was comparable to that based on conventional linear regression (0·74 for the 1958 transcripts and 0·73 for the twenty-five transcripts) and lower than that using the 144 CpG (0·77). We found that the combined-predictors approach had a slightly better AUC than the transcript-based approaches and an AUC similar to that for CpG. However, no statistically significant difference was detected between the twenty-five Boruta method-selected transcripts and other sets of predictors using DeLong tests in the above comparisons (online Supplementary Table 3). The AUC from analyses based on light drinkers was not substantially different from that in the primary analyses combining light and at-risk drinkers (online Supplementary Table 4).

Fig. 2. ROC of selected predictors. (1) Boruta method was based on the twenty-five Boruta method-selected transcripts; (2) 1958 transcripts and (3) twenty-five transcripts were from alcohol-gene expression analyses using conventional linear regression (see ref. 9); (4) 144 CpG were from meta-analysis of alcohol-associated DNA methylation markers (see ref. 23); (5) combined predictors from sets 1, 3 and 4. ROC, receiver operating characteristics.

Cross-sectional association with CVD risk factors

With Bonferroni correction for the twenty-five Boruta-selected transcripts and three CVD risk factors (i.e. at P < 6·7e-4), we observed that thirteen transcripts were associated with obesity, one transcript with hypertension and three transcripts with type 2 diabetes (Table 3). In the analysis for hypertension defined as SBP > 130 mm Hg or DBP > 80 mm Hg, the association was largely consistent. Nonetheless, two transcripts, RBM38 (P = 1·7e-4) and DOCK4 (P = 1·7e-4), remained significant at P < 6·7e-4. Thus, taken together, nineteen transcript–CVD risk factor pairs were observed. Among these nineteen pairs, five pairs have been reported in our previous study^{(Reference Ma, Huang and Yan9)}, and the other fourteen pairs were unique to the present study (Table 3; online Supplementary Table 5). In the FHS, we have observed that alcohol consumption was inversely associated with the risk of obesity and type 2 diabetes and positively associated with the risk of hypertension^{(Reference Sun, Ho and Gao25)}. Therefore, if a transcript is positively associated with alcohol consumption, we expect that this transcript is inversely associated with obesity and diabetes and positively associated with hypertension, or vice versa. For the fourteen novel pairs, the directions of the associations for four transcript–obesity pairs and one transcript–hypertension pair were consistent with our hypothesis. The associations between alcohol consumption and these five transcripts are shown in online Supplementary Table 6. For example, alcohol consumption was inversely associated with the expression of DOCK4, IL4R and SORT1, and regression coefficients were −0·0017 (95 % CI: −0·0024, −0·0011; P = 1·8e-7), −0·0016 (95 % CI: −0·0021, −0·0011; P = 1·3e-10) and −0·0007 (95 % CI: −0·0011, −0·0003; P = 0·0003) per 10 g/d higher alcohol consumption, respectively. Consistently, DOCK4 and SORT1 were positively associated with obesity and IL4R was inversely associated with hypertension (Table 3).
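The direction-consistency rule described above can be written as a small sign check. The Python sketch below is an illustration with hypothetical names: signs are coded as +1 for a positive and −1 for an inverse association, and the expected signs encode the FHS observations cited in the text.

```python
# Sign of the alcohol-risk factor association reported for the FHS:
# inverse for obesity and type 2 diabetes, positive for hypertension.
ALCOHOL_RISK_SIGN = {"obesity": -1, "type 2 diabetes": -1, "hypertension": +1}


def is_direction_consistent(alcohol_transcript_sign, risk_factor, transcript_risk_sign):
    """True when the transcript-risk factor sign matches the expected sign.

    Expected sign = (alcohol-transcript sign) x (alcohol-risk factor sign).
    """
    expected = alcohol_transcript_sign * ALCOHOL_RISK_SIGN[risk_factor]
    return transcript_risk_sign == expected
```

For example, DOCK4 is inversely associated with alcohol consumption (−1) and positively associated with obesity (+1); because alcohol is inversely associated with obesity, the expected sign is (−1) × (−1) = +1, so the pair is consistent. The same reasoning applies to IL4R and hypertension.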

Table 3. Cross-sectional analysis of Boruta method-selected genes with CVD risk factors

FHS, Framingham Heart Study.

Generalised estimating equations with adjustment for age, sex, current smoking status, FHS cohorts (the Offspring or Third Generation cohort), estimated blood cell compositions and BMI (only in analyses for hypertension and type 2 diabetes).

We found no significant interaction between the twenty-five transcripts and age (online Supplementary Table 8). We observed significant interaction between sex and three transcripts, DOCK4 (P = 5·5e-5), RBM38 (P = 2·9e-4) and MPO (P = 2·9e-5), in relation to obesity. Stratified analyses by sex and age are presented in online Supplementary Tables 9–12. For all three transcripts, their association with obesity was in the same direction in both sexes; however, the association strength differed between male and female participants. In male participants, the OR for obesity was 1·30 (95 % CI = 1·03, 1·64; P = 0·03) for DOCK4, 1·66 (95 % CI = 1·38, 2·00; P = 7·9e-8) for RBM38 and 1·46 (95 % CI = 1·09, 1·96; P = 0·01) for MPO. In female participants, by contrast, the OR was 2·48 (95 % CI = 1·98, 3·11; P = 2·0e-15) for DOCK4, 2·65 (95 % CI = 2·17, 3·23; P = 7·9e-22) for RBM38 and 0·65 (95 % CI = 0·42, 1·00; P = 0·05) for MPO.
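Odds ratios and their confidence intervals of the kind reported above are conventionally obtained by exponentiating a log-odds coefficient and its Wald interval. The Python sketch below illustrates that relationship under the assumption of a logistic-type model; the function names are hypothetical, and the standard error is back-calculated from the published interval rather than taken from the study's output.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and 95 % Wald CI from a log-odds coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_ci(lower, upper, z=1.96):
    """Recover the log-scale standard error from a reported 95 % CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Worked example using the reported DOCK4 estimate in male participants:
# OR 1.30 (95 % CI 1.03, 1.64).
beta = math.log(1.30)
se = se_from_ci(1.03, 1.64)
estimate, lower, upper = odds_ratio_ci(beta, se)
```

Rounded to two decimals, the reconstructed values reproduce the published OR and interval, which shows that the interval is symmetric on the log scale, as expected for a Wald interval.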

Discussion

In the present analysis, we used the Boruta method and demonstrated that twenty-five gene transcripts were associated with alcohol consumption in FHS participants. Compared with our previous study based on conventional linear regression analysis, the present study identified thirteen additional alcohol-associated transcripts. Several of the thirteen transcripts, such as FCGR1A and SORT1, were further linked to CVD risk factors. We also showed that the Boruta method-selected transcripts had prediction capabilities comparable to those of the transcripts identified by conventional linear regression analysis in the testing set (30 % of entire study participants). Taken together, the present analysis expanded the candidate list of gene transcripts for future validation studies, suggesting that the Boruta method can contribute to a better understanding of alcohol-associated transcriptomic changes.

RF is a commonly used supervised machine learning method for transcriptomic data^{(Reference Kursa34)}. The RF-based Boruta method has been used in studies analysing both array- and RNA-sequencing (RNA-seq)-based transcriptomic data^{(Reference Kursa34–Reference Lin, Jo and Luebeck36)}. We used the Boruta method because of its stable feature selection capability relative to other approaches; for example, a study reported that the Boruta method could identify important genes and achieved the highest ratio of self-consistent selections^{(Reference Acharjee, Larkman and Xu17)}. However, a recent study compared three feature selection algorithms, Boruta, Vita, and AUC-RF, and showed that the three approaches had comparable performance regarding identification of transcriptomic signatures predicting colorectal cancer^{(Reference Long, Park and Anh37)}. Another recent study compared several machine learning methods and showed that the LASSO method identified more transcripts predicting asthma than the Boruta method^{(Reference Dessie, Gautam and Ding38)}. It is difficult to directly compare these studies because of different study designs, data distributions and phenotypes. Future studies comparing multiple machine learning methods are needed to explore under what conditions a given method performs better.

Because of the high dimensionality of the transcriptomic data, we applied two filtering approaches, data-driven and pathway-based, before running the Boruta algorithm. Overall, the pathway-based approach performed better than the data-driven approach in the sense that it identified more transcripts. This suggests that embedding biological knowledge may lead to a better performance of the Boruta method. To the best of our knowledge, machine learning approaches (such as RF with the Boruta method) have not been extensively examined to study alcohol consumption-related transcriptomic changes. The present study contributes novel information to the current literature; however, future studies are needed to establish a critical process for using machine learning methods in this research area, such as performing data harmonisation and transformation, selecting appropriate machine learning methods, and conducting external validation.

In our previous study using conventional linear regression models^{(Reference Ma, Huang and Yan9)}, we reported significant associations between twenty-two alcohol-associated transcripts and three CVD risk factors. The present study showed several additional transcript–CVD risk factor pairs; in particular, five pairs (for five transcripts; online Supplementary Table 6) were in line with our previous observations on alcohol consumption and CVD risk factors^{(Reference Sun, Ho and Gao25)}. Three of the five transcripts (FCGR1A, IFITM1 and SORT1) are among the thirteen unique transcripts identified by the Boruta method. The three transcripts had low to moderate correlation with those identified by our previous study using conventional regression models^{(Reference Ma, Huang and Yan9)}. GO analysis showed that FCGR1A (Fc gamma receptor Ia) and IFITM1 (interferon-induced transmembrane protein 1) were enriched in nine GO terms related to defence or immune response (online Supplementary Table 1), suggesting that alcohol consumption may trigger chronic inflammation and then affect CVD risk. A genetic variant (rs4970843-C) in an intron of SORT1 (sortilin 1) was associated with height^{(Reference Yengo, Vedantam and Marouli39)}, which is consistent with the present observation on SORT1 and obesity (i.e. increased BMI). However, an analysis in the Danish PRISME study showed that heavy alcohol drinking was associated with increased sortilin levels, which is opposite to the present observation of a negative association of alcohol consumption with SORT1 expression levels (online Supplementary Table 6). This discrepancy may arise because most of our study participants (93 %) were non-drinkers or moderate drinkers. Nonetheless, because of the cross-sectional and observational nature of the present analysis, we cannot infer causality. Future studies with large sample sizes and in diverse populations are warranted to validate the present findings.

In approximately 30 % of our study participants (i.e. the testing set), we tested the prediction capabilities of the twenty-five Boruta method-selected transcripts. Compared with the transcripts identified by conventional regression models, the twenty-five Boruta method-selected transcripts had a comparable prediction capability. Although the difference was not statistically significant, the overall prediction capabilities of the selected gene transcripts were somewhat weaker than those of DNA methylation markers (AUC 0·73 v. 0·77). These DNA methylation markers were selected based on a large meta-analysis in thirteen population-based cohorts^{(Reference Liu, Marioni and Hedman23)}; therefore, this set of DNA methylation markers may be less noisy than the gene transcripts. The analysis combining gene transcripts and DNA methylation markers did not substantially increase the AUC, which also suggests that DNA methylation markers may have better prediction capabilities. However, the additive approach that was used to combine selected gene transcripts and CpG may be biased because the potential interaction between different types of omics markers is not considered^{(Reference Singh, Shannon and Gautier40)}. Thus, novel analytical approaches to integrating multiple omics markers are needed to comprehensively identify alcohol-associated markers. In addition, compared with array-based transcriptomic data, RNA-seq has a better resolution and enables the identification of non-coding RNA. Future studies utilising RNA-seq data are needed to examine the alcohol-associated transcriptomic changes.
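For readers less familiar with the metric, an AUC can be read as the probability that a randomly chosen participant from one group (for example, moderate drinkers) receives a higher predicted score than a randomly chosen participant from the comparison group (for example, non-drinkers), with ties counted as one half. The sketch below computes this directly from two lists of scores. The function name `pairwise_auc` and the toy scores are illustrative assumptions, not the study's code or data.

```python
def pairwise_auc(case_scores, control_scores):
    """AUC as the share of case/control pairs ranked correctly (ties count 0.5)."""
    if not case_scores or not control_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Hypothetical predicted scores, for illustration only.
    cases = [0.9, 0.8, 0.4]
    controls = [0.5, 0.3, 0.2]
    # 8 of the 9 case/control pairs are ranked correctly.
    print(round(pairwise_auc(cases, controls), 3))  # 0.889
```

On this reading, an AUC of 0·73 means that about 73 % of such pairs are ordered correctly, against 50 % expected by chance. The double loop is quadratic in the number of participants, so dedicated software is preferable for real analyses.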

The advantages of the present study include the use of a well-established machine learning method and comprehensive data (alcohol consumption, transcriptomics and clinical risk factors) collected from the well-characterised, community-based FHS. However, in addition to the weaknesses described above, other limitations warrant discussion. First, all study participants were Europeans, and most were non-drinkers or moderate drinkers. This limits the generalisability of the present study to more diverse populations. Second, interpretation of the transcripts selected by machine learning approaches is challenging. We explored their cross-sectional association with CVD risk factors. However, transcriptomic profiles may change over time. Prospective association analyses are therefore needed to provide more robust data regarding the relationship between alcohol, gene expression and CVD risk factors. Third, different types of alcoholic beverages may elicit different responses in gene expression levels. Future studies with larger sample sizes are needed to examine specific transcriptomic characteristics associated with consumption of each type of alcoholic beverage. Fourth, questionnaires were used to collect self-reported alcohol consumption. Measurement errors may exist and affect transcript selection and prediction accuracy. Nonetheless, this also highlights the need for future studies to comprehensively investigate surrogate markers for alcohol consumption.

The association between alcohol consumption and cardiovascular health is complex, mainly because of the uncertainty about the potential impact of moderate alcohol drinking on cardiovascular health^{(3–Reference Stockwell, Zhao and Panwar5)}. The majority of our study participants were non-drinkers or moderate drinkers. Our previous study using conventional regression models did not find a clear protective effect of alcohol consumption on CVD risk factors through transcriptomic biomarkers. In the present study, we used a different analytical approach, yet the findings echo those from our previous study^{(Reference Ma, Huang and Yan9)}. It should be noted that the present analysis examined only one commonly used machine learning algorithm. Other machine learning and deep learning algorithms^{(Reference Wekesa and Kimwele41)}, together with profound bioinformatics knowledge, may facilitate the identification of true causal transcriptomic markers and improve the discrimination capacities of alcohol-associated transcriptomic biomarkers.

In conclusion, we applied a supervised machine learning approach, the RF-based Boruta method, and identified additional alcohol-associated gene transcripts compared with analysis using conventional linear regression models. These additional transcripts expand the candidate list for future validation studies; thus, our findings support the notion that machine learning approaches can contribute useful information to unravelling the complex relationship between alcohol consumption and CVD risk. The present study also highlights that future studies in large and diverse samples are needed to comprehensively investigate the impact of alcohol consumption on transcriptomic changes and subsequent disease burden.

Acknowledgements

The Framingham Heart Study was supported by NIH contracts N01-HC-25195, HHSN268201500001I and 75N92019D00031. Funding for SABRe gene expression was provided by Division of Intramural Research, NHLBI, and Center for Population Studies, NHLBI.

J. M. and C. Liu are supported by NIH grant R01AA028263.

The authors’ contributions were as follows – J. M. and C. Liu designed research and had primary responsibility for final content; C. Lyu conducted the analyses; J. M., C. Lyu and C. Liu interpreted the result; R. J. conducted quality control and residual calculation for gene expression data; C. Lyu and J. M. wrote the manuscript; R. J., T. H., D. L. and C. Liu critically reviewed the manuscript; and all authors read and approved the final manuscript.

The authors declare no conflicts of interest.

The views and opinions expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute, the National Institutes of Health, or the US Department of Health and Human Services.

The datasets analysed in the present study are available at the dbGaP repository phs000007.v32.p13 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000007.v30.p11).

Supplementary material

For supplementary material/s referred to in this article, please visit https://doi.org/10.1017/S0007114524000795

Footnotes

† These authors contributed equally to this work.

References

Emanuele, NV, Swade, TF & Emanuele, MA (1998) Consequences of alcohol use in diabetics. Alcohol Health Res World 22, 211–219.

Chait, A, Mancini, M, February, AW, et al. (1972) Clinical and metabolic study of alcoholic hyperlipidaemia. Lancet 2, 62–64.

Collaborators GBDA (2018) Alcohol use and burden for 195 countries and territories, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet 392, 1015–1035.

Chikritzhs, TN, Naimi, TS, Stockwell, TR, et al. (2015) Mendelian randomisation meta-analysis sheds doubt on protective associations between ‘moderate’ alcohol consumption and coronary heart disease. Evid Based Med 20, 38.

Stockwell, T, Zhao, J, Panwar, S, et al. (2016) Do ‘Moderate’ drinkers have reduced mortality risk? A systematic review and meta-analysis of alcohol consumption and all-cause mortality. J Stud Alcohol Drugs 77, 185–198.

Huan, T, Esko, T, Peters, MJ, et al. (2015) A meta-analysis of gene expression signatures of blood pressure and hypertension. PLoS Genet 11, e1005035.

Yao, C, Chen, BH, Joehanes, R, et al. (2015) Integromic analysis of genetic variation and gene expression identifies networks for cardiovascular disease phenotypes. Circulation 131, 536–549.

Benton, MC, Lea, RA, Macartney-Coxson, D, et al. (2013) Mapping eQTLs in the Norfolk Island genetic isolate identifies candidate genes for CVD risk traits. Am J Hum Genet 93, 1087–1099.

Ma, J, Huang, A, Yan, K, et al. (2023) Blood transcriptomic biomarkers of alcohol consumption and cardiovascular disease risk factors: the Framingham Heart Study. Hum Mol Genet 32, 649–658.

Luo, J, Wu, M, Gopukumar, D, et al. (2016) Big Data application in biomedical research and health care: a literature review. Biomed Inform Insights 8, 1–10.

Breiman, L (2001) Random forests. Machine Learning 45, 5–32.

Hu, J & Szymczak, S (2023) A review on longitudinal data analysis with random forest. Brief Bioinform 24, bbad002.

Degenhardt, F, Seifert, S & Szymczak, S (2019) Evaluation of variable selection methods for random forests and omics data sets. Brief Bioinform 20, 492–503.

Cammarota, C & Pinto, A (2021) Variable selection and importance in presence of high collinearity: an application to the prediction of lean body mass from multi-frequency bioelectrical impedance. J Appl Stat 48, 1644–1658.

Swan, AL, Mobasheri, A, Allaway, D, et al. (2013) Application of machine learning to proteomics data: classification and biomarker identification in postgenomics biology. OMICS 17, 595–610.

Kursa, M, Jankowski, A & Rudnicki, W (2010) Boruta – a system for feature selection. Fundam Inform 101, 271–285.

Acharjee, A, Larkman, J, Xu, Y, et al. (2020) A random forest based biomarker discovery and power analysis framework for diagnostics research. BMC Med Genomics 13, 178.

Liu, C, Ackerman, HH & Carulli, JP (2011) A genome-wide screen of gene–gene interactions for rheumatoid arthritis susceptibility. Hum Genet 129, 473–485.

Steyerberg, EW, van der Ploeg, T & Van Calster, B (2014) Risk prediction with machine learning and regression methods. Biom J 56, 601–606.

Polewko-Klim, A, Lesinski, W, Golinska, AK, et al. (2020) Sensitivity analysis based on the random forest machine learning algorithm identifies candidate genes for regulation of innate and adaptive immune response of chicken. Poult Sci 99, 6341–6354.

Feinleib, M, Kannel, WB, Garrison, RJ, et al. (1975) The Framingham Offspring Study. Design and preliminary data. Prev Med 4, 518–525.

Splansky, GL, Corey, D, Yang, Q, et al. (2007) The third generation cohort of the National Heart, Lung, and Blood Institute’s Framingham Heart Study: design, recruitment, and initial examination. Am J Epidemiol 165, 1328–1335.

Liu, C, Marioni, RE, Hedman, AK, et al. (2018) A DNA methylation biomarker of alcohol consumption. Mol Psychiatry 23, 422–433.

Joehanes, R, Ying, S, Huan, T, et al. (2013) Gene expression signatures of coronary heart disease. Arterioscler Thromb Vasc Biol 33, 1418–1426.

Sun, X, Ho, JE, Gao, H, et al. (2021) Associations of alcohol consumption with cardiovascular disease-related proteomic biomarkers: the Framingham Heart Study. J Nutr 151, 2574–2582.

Czuriga-Kovacs, KR, Czuriga, D, Kardos, L, et al. (2019) Reply to letter: reversibility of hypertension-induced subclinical vascular changes: do the new ACC/AHA 2017 blood pressure guidelines and heart rate changes make a difference? J Clin Hypertens (Greenwich) 21, 1243–1244.

Kursa, M & Rudnicki, W (2010) Feature selection with the Boruta package. J Stat Software 36, 13.

Martens, M, Ammar, A, Riutta, A, et al. (2021) WikiPathways: connecting communities. Nucleic Acids Res 49, D613–D21.

Mootha, VK, Lindgren, CM, Eriksson, KF, et al. (2003) PGC-1-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34, 267–273.

Subramanian, A, Tamayo, P, Mootha, VK, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545–15550.

Thomas, PD, Ebert, D, Muruganujan, A, et al. (2022) PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci 31, 8–22.

Liaw, A & Wiener, M (2002) Classification and regression by randomForest. R News 2, 18–22.

Robin, X, Turck, N, Hainard, A, et al. (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf 12, 77.

Kursa, MB (2014) Robustness of Random Forest-based gene selection methods. BMC Bioinf 15, 8.

Shen, J, Qi, L, Zou, Z, et al. (2020) Identification of a novel gene signature for the prediction of recurrence in HCC patients by machine learning of genome-wide databases. Sci Rep 10, 4435.

Lin, MS, Jo, SY, Luebeck, J, et al. (2023) Transcriptional immune suppression and upregulation of double stranded DNA damage and repair repertoires in ecDNA-containing tumors. bioRxiv.

Long, NP, Park, S, Anh, NH, et al. (2019) High-throughput omics and statistical learning integration for the discovery and validation of novel diagnostic signatures in colorectal cancer. Int J Mol Sci 20, 296.

Dessie, EY, Gautam, Y, Ding, L, et al. (2023) Development and validation of asthma risk prediction models using co-expression gene modules and machine learning methods. Sci Rep 13, 11279.

Yengo, L, Vedantam, S, Marouli, E, et al. (2022) A saturated map of common genetic variants associated with human height. Nature 610, 704–712.

Singh, A, Shannon, CP, Gautier, B, et al. (2019) DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics 35, 3055–3062.

Wekesa, JS & Kimwele, M (2023) A review of multi-omics data integration through deep learning approaches for disease diagnosis, prognosis, and treatment. Front Genet 14, 1199087.

Fig. 1. Study flow chart. FDR, false discovery rate; FHS, Framingham Heart Study; MSigDB, Molecular Signatures Database.

Table 1. Participant characteristics

Table 2. Boruta algorithm-selected genes

Table 3. Cross-sectional analysis of Boruta method-selected genes with CVD risk factors
