Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models
Abstract

In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge and will identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders, presenting challenges to feature selection and to risk prediction with machine learning models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and the identification of high-risk patients is an important clinical research aim that would allow early intervention and a reduction of disability. This also provides an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic feature selection methods.

In this study, we developed feature selection and PsA risk prediction models and applied them to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases, using 2-digit HLA alleles imputed with the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounding features and illustrate that confounding features affect feature selection. The mitigated dataset was used to train seven supervised machine learning methods: 80% of the data was randomly selected for training with stratified nested cross-validation, and the remaining 20% was randomly held out for internal validation. The risk prediction models were then further validated in a UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. Performance was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model was selected based on three criteria: the smallest feature subset, the maximal average AUC over the nested cross-validation, and good generalisability to the UK Biobank dataset.

In the original dataset, across over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. When the dataset was mitigated, the single most important genetic feature based on rank was identified as HLA_B_*27 by all seven feature selection methods, consistent with previous analyses of these data using regression-based methods. However, the predictive accuracy of this single feature after mitigation was moderate (AUC = 0.54 in internal cross-validation, AUC = 0.53 on the internal hold-out set, AUC = 0.55 on the external dataset). Sequentially adding further HLA features based on rank improved the performance of the Random Forest classification model, with 20 2-digit features selected by Interaction Capping (ICAP) achieving AUC = 0.61 in internal cross-validation, AUC = 0.57 on the internal hold-out set and AUC = 0.58 on the external dataset.
The stratification method for mitigation of confounding features and information theoretic filter feature selection can be applied to high-dimensional datasets with potential confounders.
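A minimal sketch, in Python with scikit-learn, of the filter-then-classify pipeline described above, using synthetic placeholder data in place of the HLA allele dosage matrix. scikit-learn's univariate mutual information filter stands in for the ICAP criterion and the other information theoretic filters used in the study, which are not available in scikit-learn, and for brevity the ranking is computed once on the full data rather than inside the nested cross-validation that the study uses to avoid selection leakage.

```python
# Sketch only: mutual information filter ranking + sequential feature addition,
# evaluated with a stratified cross-validated AUC for a Random Forest classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the HLA dosage matrix
# (rows: patients, columns: 2-digit HLA alleles; y: PsA vs PsC label).
X, y = make_classification(n_samples=500, n_features=60, n_informative=8,
                           random_state=0)

# Filter step: rank features by mutual information with the case label.
mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]

# Sequentially add top-ranked features and record the mean AUC for each subset size.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for k in (1, 5, 10, 20):
    subset = ranking[:k]
    auc = cross_val_score(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X[:, subset], y, cv=cv, scoring="roc_auc",
    ).mean()
    print(f"top {k:>2} features: mean AUC = {auc:.3f}")
```

In the study itself, the best subset size is chosen by the three criteria listed above (smallest subset, maximal average nested cross-validation AUC, and generalisability to the external UK Biobank data) rather than by the single cross-validated AUC shown here.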
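For reference, the net benefit reported by the decision curve analysis is the standard quantity evaluated at a chosen threshold probability \(p_t\), with \(N\) patients and \(TP\) and \(FP\) the true and false positives obtained when classifying at that threshold:

\[
\mathrm{NB}(p_t) = \frac{TP}{N} - \frac{FP}{N}\cdot\frac{p_t}{1-p_t}.
\]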