Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results

Chansik An; Yae Won Park; Sung Soo Ahn; Kyunghwa Han; Hwiyoung Kim; Seung-Koo Lee

doi:10.1371/journal.pone.0256152

Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results

PLoS ONE ◽

10.1371/journal.pone.0256152 ◽

2021 ◽

Vol 16 (8) ◽

pp. e0256152

Author(s):

Chansik An ◽

Yae Won Park ◽

Sung Soo Ahn ◽

Kyunghwa Han ◽

Hwiyoung Kim ◽

...

Keyword(s):

Machine Learning ◽

Test Performance ◽

Small Sample Size ◽

Area Under The Curve ◽

Small Sample ◽

Simple Task ◽

Test Set ◽

Validation Methods ◽

Small Sample Sizes ◽

Set Splitting

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) “Simple” task, glioblastomas [n = 109] vs. brain metastasis [n = 58] and (2) “difficult” task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained and evaluated using various validation methods in the training set, and tested in the test set, using the area under the curve (AUC) as an evaluation metric. The AUCs in training and testing varied among different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In a training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and test, or generalization gap, was large, none of the validation methods helped sufficiently reduce the generalization gap. 
Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies especially with small sample sizes.
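
The split-to-split variability described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: no LASSO model is fitted, the cohort is simulated, and a single hypothetical feature is used directly as the score, so any gap between the two AUCs comes from sampling alone. The function names, the `shift` parameter, and the 7:3 split ratio used here are assumptions for illustration; only the class sizes (163 vs. 95) are taken from the abstract.

```python
import random
import statistics


def auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate_cohort(n_pos, n_neg, shift, rng):
    """Hypothetical one-feature cohort: positives are drawn from a distribution shifted by `shift`."""
    scores = [rng.gauss(shift, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    labels = [1] * n_pos + [0] * n_neg
    return scores, labels


def split_aucs(scores, labels, n_splits=1000, test_frac=0.3, seed=0):
    """Repeat a stratified random split; return one (training AUC, test AUC) pair per split."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    pairs = []
    for _ in range(n_splits):
        test = set()
        for idx in (pos_idx, neg_idx):
            k = max(1, round(test_frac * len(idx)))
            test.update(rng.sample(idx, k))
        train = [i for i in range(len(labels)) if i not in test]
        test = sorted(test)
        pairs.append((
            auc([scores[i] for i in train], [labels[i] for i in train]),
            auc([scores[i] for i in test], [labels[i] for i in test]),
        ))
    return pairs


# Compare a full-size cohort with one half its size (both simulated).
rng = random.Random(42)
for name, n_pos, n_neg in [("full", 95, 163), ("undersampled", 95 // 2, 163 // 2)]:
    scores, labels = simulate_cohort(n_pos, n_neg, shift=1.0, rng=rng)
    pairs = split_aucs(scores, labels, n_splits=200)
    test_aucs = [t for _, t in pairs]
    gaps = [abs(tr - t) for tr, t in pairs]
    print(f"{name}: test AUC {statistics.mean(test_aucs):.3f} "
          f"+/- {statistics.stdev(test_aucs):.3f}, mean |gap| {statistics.mean(gaps):.3f}")
```

Reporting the spread across many splits, rather than the AUC from one split, is the practice this sketch is meant to motivate.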


Radiomics machine learning study with small sample size: single random training-test set split may result in unreliable results

10.21203/rs.3.rs-105766/v2 ◽

2020 ◽

Author(s):

Chansik An ◽

Yae Won Park ◽

Sung Soo Ahn ◽

Kyunghwa Han ◽

Hwiyoung Kim ◽

...

Keyword(s):

Machine Learning ◽

Sample Size ◽

Small Sample Size ◽

Area Under The Curve ◽

Small Sample ◽

Simple Task ◽

Test Set ◽

Magnetic Resonance Imaging (MRI) ◽

Selection Operator ◽

Set Splitting

Abstract Objective: This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model under different conditions, using real-world brain tumor radiomics data. Materials and Methods: We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) “simple” task, glioblastomas [n=109] vs. brain metastasis [n=58] and (2) “difficult” task, low- [n=163] vs. high-grade [n=95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training and test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained by five-fold cross-validation (CV) or nested CV with or without repetitions in the training set and tested with the test set, using the area under the curve (AUC) as an evaluation metric. Results: The AUCs in CV and testing varied widely based on data composition, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between CV and testing was 0.029 (±0.022) for the simple task without undersampling and 0.108 (±0.079) for the difficult task with undersampling. In a training-test set pair, the AUC was high in CV but much lower in testing (0.840 and 0.650, respectively); in another dataset pair with the same task, however, the AUC was low in CV but much higher in testing (0.702 and 0.836, respectively). None of the CV methods helped overcome this issue. Conclusions: Machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially when the sample size is small.


Radiomics machine learning study with a small sample size: single random training-test set split may result in unreliable results

10.21203/rs.3.rs-105766/v1 ◽

2020 ◽

Author(s):

Chansik An ◽

Yae Won Park ◽

Sung Soo Ahn ◽

Kyunghwa Han ◽

Hwiyoung Kim ◽

...

Keyword(s):

Machine Learning ◽

Standard Deviation ◽

Sample Size ◽

Model Performance ◽

Small Sample ◽

Operating Characteristics ◽

Simple Task ◽

Test Set ◽

Relative Standard ◽

Test Sets

Abstract Objective: To determine how the estimated performance of a machine learning model varies according to how a dataset is split into training and test sets using brain tumor radiomics data, under different conditions. Materials and Methods: Two binary tasks with different levels of difficulty (‘simple’ task, glioblastoma [GBM, n=109] vs. brain metastasis [n=58]; ‘difficult’ task, low- [n=163] vs. high-grade [n=95] meningiomas) were performed using radiomics features from magnetic resonance imaging (MRI). For each trial of the 1,000 different training-test set splits with a ratio of 7:3, a least absolute shrinkage and selection operator (LASSO) model was trained by 5-fold cross-validation (CV) in the training set and tested in the test set. Model stability and performance were evaluated according to the number of input features (from 1 to 50), the sample size (full vs. undersampled), and the level of difficulty. In addition to 5-fold CV without a repetition, three other CV methods were compared: 5-fold CV with 100 repetitions, nested CV, and nested CV with 100 repetitions. Results: The highest mean cross-validated area under the receiver operating characteristics curve (AUC) and higher stability (lower AUC differences between training and testing) were achieved with 6 and 13 features in the GBM and meningioma tasks, respectively. For the simple task, simple task with undersampling, difficult task, and difficult task with undersampling, average mean AUCs were 0.947, 0.923, 0.795, and 0.764, and average AUC differences between training and testing were 0.029, 0.054, 0.053, and 0.108, respectively. Among the four CV methods, the most conservative (i.e., lowest AUC and highest relative standard deviation [RSD]) was nested CV with 100 repetitions. Conclusions: A single random split of a dataset into training and test sets may lead to an unreliable report of model performance in radiomics machine learning studies, and reporting the mean and standard deviation of model performance metrics by performing nested and/or repeated CV on the entire dataset is suggested.
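
The reporting practice suggested in the conclusions (mean, standard deviation, and relative standard deviation over repeated CV) can be sketched as follows. This is a minimal illustration rather than the authors' code: `repeated_kfold` and `summarize` are hypothetical helper names, the folds are not stratified, and the RSD is taken as the sample standard deviation divided by the mean, expressed as a percentage.

```python
import random
import statistics


def repeated_kfold(n_samples, k=5, repeats=100, seed=0):
    """Yield (train_indices, held_out_indices) for k-fold CV repeated `repeats` times.

    Each repetition reshuffles the sample order, so fold membership changes.
    Folds are NOT stratified by class in this sketch.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            yield train, held_out


def summarize(values):
    """Mean, sample standard deviation, and relative standard deviation (percent)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {"mean": mean, "sd": sd, "rsd_percent": 100.0 * sd / mean}


# 5-fold CV with 100 repetitions produces 500 train/held-out pairs.
n_pairs = sum(1 for _ in repeated_kfold(167, k=5, repeats=100))
print(n_pairs)

# Summarizing a few made-up AUC values (placeholders, not results from the study).
print(summarize([0.80, 0.90, 1.00]))
```

In practice the metric would be computed on each held-out fold and the full list of per-fold values passed to `summarize`.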


G-computation and machine learning for estimating the causal effects of binary exposure statuses on binary outcomes

Scientific Reports ◽

10.1038/s41598-021-81110-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Florent Le Borgne ◽

Arthur Chatton ◽

Maxime Léger ◽

Rémi Lenain ◽

Yohann Foucher

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Statistical Power ◽

Small Sample ◽

Causal Effects ◽

Small Samples ◽

Support Vector ◽

Sample Sizes ◽

Super Learner ◽

Small Sample Sizes

Abstract In clinical research, there is a growing interest in the use of propensity score-based methods to estimate causal effects. G-computation is an alternative because of its high statistical power. Machine learning is also increasingly used because of its possible robustness to model misspecification. In this paper, we aimed to propose an approach that combines machine learning and G-computation when both the outcome and the exposure status are binary and is able to deal with small samples. We evaluated the performances of several methods, including penalized logistic regressions, a neural network, a support vector machine, boosted classification and regression trees, and a super learner through simulations. We proposed six different scenarios characterised by various sample sizes, numbers of covariates and relationships between covariates, exposure statuses, and outcomes. We have also illustrated the application of these methods, in which they were used to estimate the efficacy of barbiturates prescribed during the first 24 h of an episode of intracranial hypertension. In the context of GC, for estimating the individual outcome probabilities in two counterfactual worlds, we reported that the super learner tended to outperform the other approaches in terms of both bias and variance, especially for small sample sizes. The support vector machine performed well, but its mean bias was slightly higher than that of the super learner. In the investigated scenarios, G-computation associated with the super learner was a performant method for drawing causal inferences, even from small sample sizes.
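
The idea of predicting each individual's outcome probability in two counterfactual worlds and averaging can be sketched with a toy example. This is an assumption-laden illustration, not the method evaluated in the paper: the outcome model here is a saturated cell-mean estimate over one binary covariate instead of a super learner, and the data, the function name `g_computation`, and the record layout are invented.

```python
from collections import defaultdict


def g_computation(records):
    """G-computation for a binary exposure and binary outcome.

    records: list of (covariate, exposure, outcome) tuples, all coded 0/1.
    Returns (risk if everyone exposed, risk if no one exposed, risk difference).
    """
    # Outcome model Q(a, l) = P(Y = 1 | A = a, L = l), estimated by cell means.
    cells = defaultdict(lambda: [0, 0])  # (a, l) -> [n, n_events]
    for l, a, y in records:
        cells[(a, l)][0] += 1
        cells[(a, l)][1] += y

    def q(a, l):
        n, events = cells.get((a, l), (0, 0))
        if n == 0:
            raise ValueError(f"no observations with exposure={a}, covariate={l}")
        return events / n

    # Predict for every individual under each exposure level, then average.
    covariates = [l for l, _, _ in records]
    risk_exposed = sum(q(1, l) for l in covariates) / len(covariates)
    risk_unexposed = sum(q(0, l) for l in covariates) / len(covariates)
    return risk_exposed, risk_unexposed, risk_exposed - risk_unexposed


# Invented, confounded data: exposure is more common when the covariate is 1.
records = (
    [(0, 1, 1)] * 2 + [(0, 1, 0)] * 8       # L=0, exposed: 2 events in 10
    + [(0, 0, 1)] * 3 + [(0, 0, 0)] * 27    # L=0, unexposed: 3 events in 30
    + [(1, 1, 1)] * 24 + [(1, 1, 0)] * 16   # L=1, exposed: 24 events in 40
    + [(1, 0, 1)] * 10 + [(1, 0, 0)] * 10   # L=1, unexposed: 10 events in 20
)
r1, r0, rd = g_computation(records)
print(f"risk if exposed {r1:.2f}, risk if unexposed {r0:.2f}, difference {rd:.2f}")
```

In this toy dataset the crude risk difference is 26/50 − 13/50 = 0.26, whereas standardizing over the covariate gives 0.44 − 0.34 = 0.10, which shows how the averaging step removes the confounding by the covariate.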


A Machine Learning Approach to Reveal the NeuroPhenotypes of Autisms

International Journal of Neural Systems ◽

10.1142/s0129065718500582 ◽

2019 ◽

Vol 29 (07) ◽

pp. 1850058 ◽

Cited By ~ 8

Author(s):

Juan M. Górriz ◽

Javier Ramírez ◽

F. Segovia ◽

Francisco J. Martínez ◽

Meng-Chuan Lai ◽

...

Keyword(s):

Machine Learning ◽

Brain Structure ◽

Feature Space ◽

Classification Problem ◽

Small Sample ◽

Biological Sex ◽

Machine Learning Approach ◽

Learning Machine ◽

Small Sample Sizes ◽

Low Dimensional

Although much research has been undertaken, the spatial patterns, developmental course, and sexual dimorphism of brain structure associated with autism remain enigmatic. One of the difficulties in investigating differences between the sexes in autism is the small sample sizes of available imaging datasets with mixed sex. Thus, the majority of the investigations have involved male samples, with females somewhat overlooked. This paper deploys machine learning on partial least squares feature extraction to reveal differences in regional brain structure between individuals with autism and typically developing participants. A four-class classification problem (sex and condition) is specified, with theoretical restrictions based on the evaluation of a novel upper bound in the resubstitution estimate. These conditions were imposed on the classifier complexity and feature space dimension to assure generalizable results from the training set to test samples. Accuracies above [Formula: see text] on gray and white matter tissues estimated from voxel-based morphometry (VBM) features are obtained in a sample of equal-sized high-functioning male and female adults with and without autism ([Formula: see text], [Formula: see text]/group). The proposed learning machine revealed how autism is modulated by biological sex using a low-dimensional feature space extracted from VBM. In addition, a spatial overlap analysis on reference maps partially corroborated predictions of the “extreme male brain” theory of autism, in sexually dimorphic areas.


Machine Learning Model Validation for Early Stage Studies with Small Sample Sizes

10.1109/embc46164.2021.9629697 ◽

2021 ◽

Author(s):

Robyn Larracy ◽

Angkoon Phinyomark ◽

Erik Scheme

Keyword(s):

Machine Learning ◽

Model Validation ◽

Early Stage ◽

Learning Model ◽

Small Sample ◽

Sample Sizes ◽

Machine Learning Model ◽

Small Sample Sizes


MLW-gcForest: A Multi-Weighted gcForest Model for Cancer Subtype Classification by Methylation Data

Applied Sciences ◽

10.3390/app9173589 ◽

2019 ◽

Vol 9 (17) ◽

pp. 3589 ◽

Cited By ~ 2

Author(s):

Yunyun Dong ◽

Wenkai Yang ◽

Jiawen Wang ◽

Juanjuan Zhao ◽

Yan Qiang

Keyword(s):

Machine Learning ◽

Small Sample Size ◽

Small Sample ◽

The Cancer Genome Atlas ◽

High Dimensionality ◽

Methylation Data ◽

Learning Methods ◽

Cancer Subtypes ◽

Machine Learning Methods

Effective cancer treatment requires a clear subtype. Due to the small sample size, high dimensionality, and class imbalances of cancer gene data, classifying cancer subtypes by traditional machine learning methods remains challenging. The gcForest algorithm is a combination of machine learning methods and a deep neural network and has been indicated to achieve better classification of small samples of data. However, the gcForest algorithm still faces many challenges when this method is applied to the classification of cancer subtypes. In this paper, we propose an improved gcForest algorithm (MLW-gcForest) to study the applicability of this method to the small sample sizes, high dimensionality, and class imbalances of genetic data. The main contributions of this algorithm are as follows: (1) Different weights are assigned to different random forests according to the classification ability of the forests. (2) We propose a sorting optimization algorithm that assigns different weights to the feature vectors generated under different sliding windows. The MLW-gcForest model is trained on the methylation data of five data sets from the cancer genome atlas (TCGA). The experimental results show that the MLW-gcForest algorithm achieves high accuracy and area under curve (AUC) values for the classification of cancer subtypes compared with those of traditional machine learning methods and state of the art methods. The results also show that methylation data can be effectively used to diagnose cancer.


Behavioral and cortisol responses to feeding frequency in pregnant sows under isocaloric intake

Journal of Animal Science ◽

10.1093/jas/skaa226 ◽

2020 ◽

Vol 98 (8) ◽

Author(s):

Hayford Manu ◽

Suhyup Lee ◽

Mike C Keyes ◽

Jim Cairns ◽

Samuel K Baidoo

Keyword(s):

Small Sample Size ◽

Feeding Activity ◽

Area Under The Curve ◽

Total Activity ◽

Small Sample ◽

Experimental Unit ◽

Feeding Frequency ◽

Daily Feeding ◽

Feeding Activities ◽

Cortisol Responses

Abstract The study focused on behavioral and cortisol responses to feeding frequency in pregnant sows under isocaloric intake. Twenty-four sows [(Landrace × Yorkshire); BW 216.70 ± 3.98 kg; parity 3.04 ± 0.53] were balanced for parity and randomly assigned to 1 of 3 feeding frequency regimes. Sows were fed a corn–soybean meal-based diet 1× [0730 (Control), T1], 2× [half ration at 0730 and 1530 hours, T2], or 3× [one-third portion at 0730, 1130, and 1530 hours, T3] from days 30 to 60 of gestation. Sows received 7055 kcal ME/d during gestation from 2.21 kg of diet formulated to contain SID Lys/ME of 1.71 g/Mcal. Saliva samples were collected every 2 hr from 0630 to 1830 hours on day 52 and assayed for cortisol using an ELISA procedure. Behavior data were collected for 7 d from day 53 of gestation by affixing a remote insights ear tag to each sow. Each sow had 120,960 data points categorized into “Active,” “Feed,” or “Dormant.” Because of housing constraints, all sows were housed in individual stalls in the same room, which is a potential limitation of the study. The data were analyzed using the PROC MIXED and GLIMMIX procedures of SAS 9.4 for cortisol and behavior count data, respectively. Sow was the experimental unit. The area under the curve (AUC) is a quantitative evaluation of response as the threshold varies over all possible values. The T2 sows had reduced 12-hr cortisol AUC compared with control sows (P = 0.024) and T3 sows (P = 0.004), respectively. The T2 sows had lower 3 hr (P = 0.039) and 5 hr (P = 0.015) postfeeding cortisol AUC compared with control sows. Feed anticipatory activity (FAA), 24-hr total activity, and feeding activities (eating and/or sham chewing) were reduced for T2 sows relative to the control and T3 sows (P < 0.01). Consequently, T2 sows had lower 24-hr total activity (P < 0.001) and feeding activities (P < 0.001) AUC compared with both the control and T3 sows, respectively. The T3 sows had greater FAA (P < 0.001) and 24-hr total activity AUC (P = 0.010) compared with control sows. Although our data are inconclusive because of the small sample size, twice-daily feeding appears to be the threshold that reduces sows’ total activity AUC, feeding activity AUC, and activation of the hypothalamic–pituitary–adrenal axis, reduces hunger, and shows potential to improve sow welfare relative to once- and thrice-daily feeding regimes under isocaloric intake per kilogram live metabolic weight.


Population Pharmacokinetic-Pharmacodynamic Analysis of Anidulafungin in Adult Patients with Fungal Infections

Antimicrobial Agents and Chemotherapy ◽

10.1128/aac.01473-12 ◽

2012 ◽

Vol 57 (1) ◽

pp. 466-474 ◽

Cited By ~ 22

Author(s):

Ping Liu

Keyword(s):

Fungal Infections ◽

Small Sample Size ◽

Safety Data ◽

Area Under The Curve ◽

Positive Association ◽

Disease Status ◽

Small Sample ◽

Adult Patients ◽

Population Pharmacokinetic ◽

Concentration Data

ABSTRACT To evaluate the exposure-response relationships for efficacy and safety of intravenous anidulafungin in adult patients with fungal infections, a population pharmacokinetic-pharmacodynamic (PK-PD) analysis was performed with data from 262 patients in four phase 2/3 studies. The plasma concentration data were fitted with a previously developed population PK model. Anidulafungin exposures in patients with weight extremities (e.g., 40 kg and 150 kg) were simulated based on the final PK model. Since the patient population, disease status, and efficacy endpoints varied in these studies, the exposure-efficacy relationship was investigated separately for each study using logistic regression as appropriate. Safety data from three studies (n = 235) were pooled for analysis, and one study was excluded due to concomitant use of amphotericin B as a study treatment and different disease populations. The analysis showed that the same dosing regimen of anidulafungin can be administered to all patients regardless of body weight. Nonetheless, caution should be taken for patients with extremely high weight (e.g., >150 kg). There was a trend of positive association between anidulafungin exposure and efficacy in patients with esophageal candidiasis or invasive candidiasis, including candidemia (ICC); however, adequate characterization of the effect of anidulafungin exposure on response could not be established due to the relatively small sample size. No threshold value for exposure could be established, since patients with low exposure also achieved successful outcomes (e.g., area under the curve < 40 mg · h/liter in ICC patients). There was no association between anidulafungin exposure and the treatment-related adverse events or all-causality hepatic laboratory abnormalities.


An investigation on the factors affecting machine learning classifications in gamma-ray astronomy

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa166 ◽

2020 ◽

Vol 492 (4) ◽

pp. 5377-5390 ◽

Cited By ~ 3

Author(s):

Shengda Luo ◽

Alex P Leung ◽

C Y Hui ◽

K L Li

Keyword(s):

Machine Learning ◽

Gamma Ray ◽

Small Sample Size ◽

Classification Performance ◽

Small Sample ◽

Classification Model ◽

Machine Learning Techniques ◽

Large Area ◽

Actual Performance ◽

Statistical Fluctuations

ABSTRACT We have investigated a number of factors that can have significant impacts on the classification performance of gamma-ray sources detected by Fermi Large Area Telescope (LAT) with machine learning techniques. We show that a framework of automatic feature selection can construct a simple model with a small set of features that yields better performance over previous results. Secondly, because of the small sample size of the training/test sets of certain classes in gamma-ray, nested re-sampling and cross-validations are suggested for quantifying the statistical fluctuations of the quoted accuracy. We have also constructed a test set by cross-matching the identified active galactic nuclei (AGNs) and the pulsars (PSRs) in the Fermi-LAT 8-yr point source catalogue (4FGL) with those unidentified sources in the previous 3rd Fermi-LAT Source Catalog (3FGL). Using this cross-matched set, we show that some features used for building classification model with the identified source can suffer from the problem of covariate shift, which can be a result of various observational effects. This can possibly hamper the actual performance when one applies such model in classifying unidentified sources. Using our framework, both AGN/PSR and young pulsar (YNG)/millisecond pulsar (MSP) classifiers are automatically updated with the new features and the enlarged training samples in 4FGL catalogue incorporated. Using a two-layer model with these updated classifiers, we have selected 20 promising MSP candidates with confidence scores > 98 per cent from the unidentified sources in 4FGL catalogue that can provide inputs for a multiwavelength identification campaign.


Applying AdaBoost to Improve Diagnostic Accuracy

Methodology ◽

10.1027/1614-2241/a000166 ◽

2019 ◽

Vol 15 (2) ◽

pp. 77-87 ◽

Cited By ~ 1

Author(s):

Zhehan Jiang ◽

Kevin Walker ◽

Dexin Shi

Keyword(s):

Machine Learning ◽

Diagnostic Accuracy ◽

Latent Variables ◽

Latent Variable ◽

Area Under The Curve ◽

Diagnostic Information ◽

Small Sample ◽

Simulation Studies ◽

Cognitive Diagnostic Modeling ◽

Diagnostic Modeling

Abstract. Cognitive diagnostic modeling has been adopted to support various diagnostic measuring processes. Specifically, this approach allows practitioners and/or researchers to investigate an individual’s status with regard to certain latent variables of interest. However, the diagnostic information provided by traditional estimation approaches often suffers from low accuracy, especially under small sample conditions. This paper adopts an AdaBoost technique, popular in the field of machine learning, to estimate latent variables. Further, the proposed approach involves the construction of a simple iterative algorithm that is based upon the AdaBoost technique – such that the area under the curve (AUC) is minimized. The algorithmic details are elaborated via pseudo codes with line-to-line verbal explanations. Simulation studies were conducted such that the improvement of latent variable estimates via the proposed approach can be examined.
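
For readers unfamiliar with the AdaBoost technique mentioned in this abstract, a minimal generic sketch follows. It is plain binary AdaBoost with one-feature decision stumps, not the latent-variable estimation procedure the paper proposes; the function names, the stump form, and the ten-point toy dataset are assumptions chosen for illustration.

```python
import math


def train_adaboost(xs, ys, n_rounds=3):
    """Fit AdaBoost with threshold stumps on one feature.

    xs: list of numbers; ys: labels in {-1, +1}.
    Returns a list of (alpha, threshold, polarity); a stump predicts
    `polarity` when x >= threshold and `-polarity` otherwise.
    """
    n = len(xs)
    weights = [1.0 / n] * n
    thresholds = sorted(set(xs))
    model = []
    for _ in range(n_rounds):
        # Pick the stump with the smallest weighted error.
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                preds = [polarity if x >= t else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity, preds)
        err, t, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) for perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, polarity))
        # Increase the weight of misclassified points, then renormalize.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if x >= t else -pol) for alpha, t, pol in model)
    return 1 if score >= 0 else -1


# Toy data that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = train_adaboost(xs, ys, n_rounds=3)
errors = sum(predict(model, x) != y for x, y in zip(xs, ys))
print(f"training errors after 3 rounds: {errors}")
```

On this toy set a single stump misclassifies three points, while the weighted vote of three stumps classifies all ten training points correctly, which is the reweighting effect that boosting relies on.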
