Granular Classification for Imbalanced Datasets: A Minkowski Distance-Based Method

Chen Fu; Jianhua Yang

doi:10.3390/a14020054

Algorithms ◽

10.3390/a14020054 ◽

2021 ◽

Vol 14 (2) ◽

pp. 54

Author(s):

Chen Fu ◽

Jianhua Yang

Keyword(s):

Imbalanced Data ◽

Main Idea ◽

Fuzzy Rule ◽

Classification Performance ◽

Distance Measures ◽

Minkowski Distance ◽

Imbalanced Datasets ◽

Minority Class ◽

Information Granules ◽

Practical Applications

The problem of classification for imbalanced datasets is frequently encountered in practical applications. The data to be classified in this problem are skewed, i.e., the samples of one class (the minority class) are far fewer than those of the other classes (the majority classes). When dealing with imbalanced datasets, most classifiers share a common limitation: they often achieve better classification performance on the majority classes than on the minority class. To alleviate this limitation, this study proposes a fuzzy rule-based modeling approach using information granules. Information granules, as entities derived and abstracted from data, can be used to describe and capture the characteristics (distribution and structure) of data from both majority and minority classes. Since the geometric characteristics of information granules depend on the distance measures used in the granulation process, the main idea of this study is to construct information granules on each class of imbalanced data using Minkowski distance measures and then to establish the classification models using “If-Then” rules. Experimental results on synthetic and publicly available datasets show that the proposed Minkowski distance-based method can produce information granules with a range of geometric shapes and construct granular models with satisfactory classification performance for imbalanced datasets.
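To illustrate why the choice of distance measure changes the geometry of a granule, here is a minimal Python sketch of the Minkowski distance of order p. The helper `in_granule`, which treats a granule as a ball of fixed radius around a center, is a hypothetical simplification for illustration and is not the granulation procedure of the paper.

```python
import math


def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p == math.inf:
        # Limit case p -> infinity: Chebyshev distance.
        return max(abs(a - b) for a, b in zip(x, y))
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def in_granule(point, center, radius, p):
    """Hypothetical membership test: is `point` inside the p-ball around `center`?"""
    return minkowski_distance(point, center, p) <= radius


# The same point can fall inside or outside the ball depending on p:
# p = 1 gives a diamond, p = 2 a circle, p = inf a square (in 2-D).
point, center = (0.8, 0.8), (0.0, 0.0)
for p in (1, 2, math.inf):
    print(p, in_granule(point, center, radius=1.0, p=p))
```

For the vectors (0, 0) and (3, 4), the function returns 7 for p = 1, 5 for p = 2, and 4 for p = infinity, which shows how larger p values weight the largest coordinate difference more heavily.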

A two-stage clustering-based cold-start method for active learning

Intelligent Data Analysis ◽

10.3233/ida-205393 ◽

2021 ◽

Vol 25 (5) ◽

pp. 1169-1185

Author(s):

Deniu He ◽

Hong Yu ◽

Guoyin Wang ◽

Jie Li

Keyword(s):

Active Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Cold Start ◽

Classification Performance ◽

The Novel ◽

Two Stage ◽

Minority Class ◽

Novel Method ◽

Multiple Clusters

The problem of initializing active learning is considered in this paper. In particular, this paper studies the problem in an imbalanced data scenario, which is called class-imbalance active learning cold-start. The proposed method is a two-stage clustering-based active learning cold-start (ALCS). In the first stage, to separate the instances of the minority class from those of the majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, so that the data are grouped into multiple clusters. Then, in the second stage, the initial training instances are selected from each cluster based on an adaptive mechanism for determining candidate representative instances and a clusters-cyclic instance query mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on active learning.

Evolutionary Undersampling for Classification with Imbalanced Datasets: Proposals and Taxonomy

Evolutionary Computation ◽

10.1162/evco.2009.17.3.275 ◽

2009 ◽

Vol 17 (3) ◽

pp. 275-306 ◽

Cited By ~ 194

Author(s):

Salvador García ◽

Francisco Herrera

Keyword(s):

Fitness Function ◽

Imbalanced Data ◽

Selection Procedure ◽

Prototype Selection ◽

Imbalanced Datasets ◽

Classification Rate ◽

Minority Class ◽

Good Trade ◽

And Performance ◽

Nonparametric Statistical Procedures

Learning with imbalanced data is one of the recent challenges in machine learning. Various solutions have been proposed to address this problem, such as modifying methods or applying a preprocessing stage. Within preprocessing focused on balancing data, two tendencies exist: reducing the set of examples (undersampling) or replicating minority class examples (oversampling). Undersampling with imbalanced datasets can be considered a prototype selection procedure whose purpose is to balance datasets to achieve a high classification rate while avoiding the bias toward majority class examples. Evolutionary algorithms have been used for classical prototype selection with good results, where the fitness function is associated with the classification and reduction rates. In this paper, we propose a set of methods called evolutionary undersampling that take into consideration the nature of the problem and use different fitness functions to obtain a good trade-off between balance of the class distribution and performance. The study includes a taxonomy of the approaches and an overall comparison among our models and state-of-the-art undersampling methods. The results have been contrasted using nonparametric statistical procedures and show that evolutionary undersampling outperforms the nonevolutionary models when the degree of imbalance is increased.

Imbalanced Data Sets Classification Based on SVM for Sand-Dust Storm Warning

Discrete Dynamics in Nature and Society ◽

10.1155/2015/562724 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8

Author(s):

Yonghua Xie ◽

Yurong Liu ◽

Qingqiu Fu

Keyword(s):

Dust Storm ◽

Adaptive Sampling ◽

Imbalanced Data ◽

Real Data ◽

Classification Performance ◽

Selection Strategy ◽

Data Sets ◽

Minority Class ◽

Redundant Data ◽

Sand Dust

For SVM classification of imbalanced sand-dust storm data sets, this paper proposes a hybrid self-adaptive sampling method named the SRU-AIBSMOTE algorithm. This method adaptively adjusts its neighbor selection strategy based on the internal distribution of the sample sets. It produces virtual minority class instances through randomized interpolation in the spherical space formed by minority class instances and their neighbors. Random undersampling is also applied to the majority class instances to remove redundant data from the sample sets. Comparative experimental results on real data sets from the Yanchi and Tongxin districts in Ningxia, China, show that the SRU-AIBSMOTE method obtains better classification performance than some traditional classification methods.
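The idea of generating virtual minority instances by randomized interpolation can be sketched as follows. This is a simplified, classic SMOTE-style interpolation along the line segment between a minority sample and one of its nearest minority neighbors; it does not implement the spherical-space interpolation or the adaptive neighbor selection of SRU-AIBSMOTE, and the function names and parameters are assumptions made for illustration.

```python
import random


def nearest_neighbors(sample, others, k):
    """Return the k samples in `others` closest to `sample` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return sorted(others, key=lambda other: sq_dist(sample, other))[:k]


def synthesize_minority(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        gap = rng.random()  # random position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(synthesize_minority(minority, n_new=3, k=2))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.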

Imbalanced data classification based on hybrid resampling and twin support vector machine

Computer Science and Information Systems ◽

10.2298/csis161221017l ◽

2017 ◽

Vol 14 (3) ◽

pp. 579-595 ◽

Cited By ~ 2

Author(s):

Lu Cao ◽

Hong Shen

Keyword(s):

Support Vector Machine ◽

Real Life ◽

Imbalanced Data ◽

Data Classification ◽

Training Data ◽

Twin Support Vector Machine ◽

Support Vector ◽

Imbalanced Datasets ◽

Minority Class ◽

Imbalanced Data Classification

Imbalanced datasets exist widely in real life. The identification of the minority class in imbalanced datasets tends to be the focus of classification. As an enhanced variant of the support vector machine (SVM), the twin support vector machine (TWSVM) provides an effective technique for data classification. TWSVM relies on a relative balance in the training sample dataset and its distribution to improve the classification accuracy over the whole dataset; however, it is not effective in dealing with imbalanced data classification problems. In this paper, we propose to combine a resampling technique, which uses oversampling and undersampling to balance the training data, with TWSVM to deal with imbalanced data classification. Experimental results show that our proposed approach outperforms other state-of-the-art methods.
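A minimal sketch of hybrid resampling, assuming the simplest possible variant: random undersampling of the majority class and random oversampling (duplication) of the minority class toward a common target size. The abstract does not specify which oversampling and undersampling schemes are used, so the function name, the `target` parameter, and the random selection are illustrative assumptions.

```python
import random


def hybrid_resample(majority, minority, target, seed=0):
    """Bring both classes toward `target` samples each.

    Majority is undersampled without replacement; minority is oversampled
    by duplicating randomly chosen samples. A class already at or beyond
    the target in the "wrong" direction is left unchanged.
    """
    rng = random.Random(seed)
    if len(majority) > target:
        kept_majority = rng.sample(majority, target)
    else:
        kept_majority = list(majority)
    n_extra = max(0, target - len(minority))
    grown_minority = list(minority) + [rng.choice(minority) for _ in range(n_extra)]
    return kept_majority, grown_minority


majority = list(range(100))        # 100 majority samples
minority = list(range(100, 110))   # 10 minority samples
kept, grown = hybrid_resample(majority, minority, target=40)
print(len(kept), len(grown))
```

Choosing a target between the two class sizes limits both the information lost by undersampling and the duplication introduced by oversampling; the balanced sets would then be passed to the classifier for training.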

Imbalanced sentiment classification based on sequence generative adversarial nets

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201370 ◽

2020 ◽

Vol 39 (5) ◽

pp. 7909-7919

Author(s):

Chuantao Wang ◽

Xuexin Yang ◽

Linkai Ding

Keyword(s):

Deep Learning ◽

Online Reviews ◽

Classification Performance ◽

Sentiment Classification ◽

Classification Task ◽

Algorithm Optimization ◽

Minority Class ◽

Sample Distribution ◽

Practical Applications ◽

Deep Model

The purpose of sentiment classification is to automatically judge sentiment tendency. In sentiment classification of text data (such as online reviews), traditional deep learning models focus on algorithm optimization but ignore the imbalanced distribution of the number of samples across classes, which degrades classification performance in practical applications. In this paper, the experiment is divided into two stages. In the first stage, minority class samples are used to train sequence generative adversarial nets, so that the nets can learn the features of the minority class samples in depth. In the second stage, the trained generator of the sequence generative adversarial nets is used to generate synthetic minority class samples, which are mixed with the original samples to balance the sample distribution. The mixed samples are then input into the deep sentiment classification model to complete model training. Experimental results show that the model achieves excellent classification performance compared with a variety of deep learning models based on classic imbalanced learning methods in the sentiment classification task of hotel reviews.

BBW: a batch balance wrapper for training deep neural networks on extremely imbalanced datasets with few minority samples

Applied Intelligence ◽

10.1007/s10489-021-02623-9 ◽

2021 ◽

Author(s):

Jingzhao Hu ◽

Hao Zhang ◽

Yang Liu ◽

Richard Sutcliffe ◽

Jun Feng

Keyword(s):

Neural Networks ◽

Learning Process ◽

Deep Neural Networks ◽

Imbalanced Data ◽

Parameter Tuning ◽

Classification Performance ◽

Imbalanced Datasets ◽

Sample Distribution ◽

Network Layers ◽

Additional Processing

In recent years, Deep Neural Networks (DNNs) have achieved excellent performance on many tasks, but it is very difficult to train good models from imbalanced datasets. Creating balanced batches either by majority data down-sampling or by minority data up-sampling can solve the problem in certain cases. However, it may lead to learning process instability and overfitting. In this paper, we propose the Batch Balance Wrapper (BBW), a novel framework which can adapt a general DNN to be well trained from extremely imbalanced datasets with few minority samples. In BBW, two extra network layers are added to the start of a DNN. The layers prevent overfitting of minority samples and improve the expressiveness of the sample distribution of minority samples. Furthermore, Batch Balance (BB), a class-based sampling algorithm, is proposed to make sure the samples in each batch are always balanced during the learning process. We test BBW on three well-known extremely imbalanced datasets with few minority samples. The maximum imbalance ratio reaches 1167:1 with only 16 positive samples. Compared with existing approaches, BBW achieves better classification performance. In addition, BBW-wrapped DNNs are 16.39 times faster, relative to unwrapped DNNs. Moreover, BBW does not require data preprocessing or additional hyper-parameter tuning, operations that may require additional processing time. The experiments prove that BBW can be applied to common applications of extremely imbalanced data with few minority samples, such as the classification of EEG signals, medical images and so on.
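A class-based sampler that keeps every batch balanced can be sketched as below. This is a generic illustration of class-balanced batch construction, not the Batch Balance (BB) algorithm itself, whose details are not given in the abstract; the function name and the `per_class` parameter are assumptions.

```python
import random
from collections import defaultdict


def balanced_batches(labels, per_class, n_batches, seed=0):
    """Yield batches of sample indices with `per_class` indices from each class.

    Classes with fewer than `per_class` samples are drawn with replacement,
    so rare samples are repeated within a batch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for _ in range(n_batches):
        batch = []
        for indices in by_class.values():
            if len(indices) >= per_class:
                batch.extend(rng.sample(indices, per_class))
            else:
                batch.extend(rng.choices(indices, k=per_class))
        rng.shuffle(batch)
        yield batch


labels = [0] * 1000 + [1] * 4   # extreme imbalance: 250:1
for batch in balanced_batches(labels, per_class=8, n_batches=2):
    print(sum(labels[i] for i in batch), "positives out of", len(batch))
```

Repeating the few minority samples in every batch is exactly what can cause the overfitting mentioned above, which is the motivation the abstract gives for adding extra layers in front of the network.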

A Learning Framework for Medical Image-Based Intelligent Diagnosis from Imbalanced Datasets

10.3233/shti210801 ◽

2021 ◽

Author(s):

Tetiana Biloborodova ◽

Inna Skarga-Bandurova ◽

Mark Koverha ◽

Illia Skarha-Bandurov ◽

Yelyzaveta Yevsieieva

Keyword(s):

Image Classification ◽

Predictive Models ◽

Medical Image ◽

Imbalanced Data ◽

Classification Performance ◽

Data Reuse ◽

Imbalanced Datasets ◽

Learning Framework ◽

Class Distribution ◽

Medical Image Classification

Medical image classification and diagnosis based on machine learning have made significant achievements and gradually penetrated the healthcare industry. However, medical data characteristics such as relatively small datasets for rare diseases or imbalance in class distribution for rare conditions significantly restrain their adoption and reuse. Imbalanced datasets lead to difficulties in learning and obtaining accurate predictive models. This paper follows the FAIR paradigm and proposes a technique for the alignment of class distribution, which improves image classification performance on imbalanced data and ensures data reuse. Experiments on the acne disease dataset indicate that the proposed framework outperforms the baselines and achieves up to 5% improvement in image classification.

Imbalanced Data Fault Diagnosis Based on an Evolutionary Online Sequential Extreme Learning Machine

Symmetry ◽

10.3390/sym12081204 ◽

2020 ◽

Vol 12 (8) ◽

pp. 1204 ◽

Cited By ~ 3

Author(s):

Wei Hao ◽

Feng Liu

Keyword(s):

Fault Diagnosis ◽

Extreme Learning Machine ◽

High Speed ◽

Imbalanced Data ◽

Classification Performance ◽

Complex Data ◽

Minority Class ◽

Diagnosis Model ◽

Learning Machine ◽

Hidden Layer

To quickly and effectively identify an axle box bearing fault of high-speed electric multiple units (EMUs), an evolutionary online sequential extreme learning machine (OS-ELM) fault diagnosis method for imbalanced data was proposed. In this scheme, the resampling scale is first determined according to the resampling empirical formulation, the K-means synthetic minority oversampling technique (SMOTE) method is then used for oversampling the minority class samples, a method based on Euclidean distance is applied for undersampling the majority class samples, and the complex data features are extracted from the reconstructed dataset. Second, the reconstructed dataset is input into the diagnosis model. Finally, the artificial bee colony (ABC) algorithm is used to globally optimize the combination of input weights, hidden layer bias, and the number of hidden layer nodes for an OS-ELM, and the diagnosis model is allowed to evolve. The proposed method was tested on the axle box bearing monitoring data of high-speed EMUs, on which the position of the axle box bearings was symmetrical. Numerical testing proved that the method has the characteristics of faster detection and higher classification performance regarding the minority class data compared to other standard and classical algorithms.

A Novel Ensemble Framework Based on K-Means and Resampling for Imbalanced Data

Applied Sciences ◽

10.3390/app10051684 ◽

2020 ◽

Vol 10 (5) ◽

pp. 1684

Author(s):

Huajuan Duan ◽

Yongqing Wei ◽

Peiyu Liu ◽

Hongxia Yin

Keyword(s):

Evaluation Criteria ◽

Imbalanced Data ◽

Data Preprocessing ◽

Ensemble Classifier ◽

Data Sets ◽

Imbalanced Datasets ◽

Minority Class ◽

K Value ◽

Imbalanced Classification ◽

The Past

Imbalanced classification is one of the most important problems in machine learning and data mining, and it arises in many real datasets. In the past, many basic classifiers such as SVM and KNN have been applied to imbalanced datasets, in which one class has many more samples than another, but the classification results are not ideal. Some data preprocessing methods have been proposed to reduce the imbalance ratio of datasets and are combined with basic classifiers to obtain better performance. To improve overall classification accuracy, we propose a novel classifier ensemble framework based on K-means and a resampling technique (EKR). First, we divide the majority-class samples into several sub-clusters using K-means, with the value of k determined by the Average Silhouette Coefficient. Next, we adjust the number of samples in each sub-cluster to match that of the minority class through resampling. Each adjusted sub-cluster is then combined with the minority class to form a balanced subset, a base classifier is trained on each balanced subset separately, and finally the base classifiers are integrated into a strong ensemble classifier. Extensive experimental results on 16 imbalanced datasets demonstrate the effectiveness and feasibility of the proposed algorithm in terms of multiple evaluation criteria, and EKR achieves better performance than several classical imbalanced classification algorithms that use different data preprocessing methods.
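The construction of balanced subsets described above can be sketched as follows. The sketch assumes the K-means step has already been run and that its cluster assignments are available as `cluster_ids`; the random resampling used to resize each sub-cluster is also an assumption, since the abstract does not name the resampling technique. Training the base classifiers and combining them are left out.

```python
import random
from collections import defaultdict


def balanced_subsets(majority, cluster_ids, minority, seed=0):
    """Build one balanced (majority_part, minority_part) pair per sub-cluster.

    `cluster_ids[i]` is the (precomputed) cluster of `majority[i]`. Each
    sub-cluster is resized to len(minority): larger clusters are randomly
    undersampled, smaller ones are padded with random duplicates.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, cluster_id in zip(majority, cluster_ids):
        clusters[cluster_id].append(sample)
    target = len(minority)
    subsets = []
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        if len(members) >= target:
            resized = rng.sample(members, target)
        else:
            resized = members + rng.choices(members, k=target - len(members))
        subsets.append((resized, list(minority)))
    return subsets


majority = list(range(20))
cluster_ids = [0] * 15 + [1] * 5          # hypothetical K-means output
minority = list(range(100, 108))
for majority_part, minority_part in balanced_subsets(majority, cluster_ids, minority):
    print(len(majority_part), len(minority_part))
```

Each returned pair is a balanced training set for one base classifier, so the number of base classifiers equals the number of majority sub-clusters.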

Raking and Relabeling for Imbalanced Data

10.36227/techrxiv.17712122.v1 ◽

2022 ◽

Author(s):

Seunghwan Park ◽

Hae-Wwan Lee ◽

Jongho Im

Keyword(s):

High Dimensional Data ◽

Imbalanced Data ◽

Sampling Strategy ◽

Classification Performance ◽

Mixed Data ◽

Categorical Variables ◽

High Dimensional ◽

Data Generation ◽

Minority Class ◽

Imbalanced Data Classification

We consider the binary classification of imbalanced data. A dataset is imbalanced if the class proportions are heavily skewed. Imbalanced data classification is often challenging, especially for high-dimensional data, because unequal classes degrade classifier performance. Undersampling the majority class and oversampling the minority class are popular methods for constructing balanced samples, which helps improve classification performance. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce performance comparable to that of existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice because it does not require density estimation for synthetic data generation in oversampling and is not affected by mixed-type variables. In addition, the proposed sampling strategy is robust to the choice of classifier, in the sense that classification performance is not sensitive to which classifier is used.
