Concepts Seeds Gathering and Dataset Updating Algorithm for Handling Concept Drift

2015 ◽  
Vol 7 (2) ◽  
pp. 29-57 ◽  
Author(s):  
Nabil M. Hewahi ◽  
Ibrahim M. Elbouhissi

In data mining, a change in data distribution over time is known as concept drift. In this research, the authors introduce a new approach called the Concepts Seeds Gathering and Dataset Updating algorithm (CSG-DU), which gives traditional classification models the ability to adapt to concept drift as time passes. CSG-DU is concerned with discovering new concepts in a data stream and aims to increase classification accuracy, with any classification model, when the underlying concepts change. The proposed approach has been tested using synthetic and real datasets. The experiments show that classification accuracy increased from low values to high and acceptable ones after the approach was applied. Finally, CSG-DU was compared with the Set Formation for Delayed Labeling algorithm (SFDL), an approach that handles sudden and gradual concept drift. CSG-DU outperforms SFDL in terms of classification accuracy.
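As a rough illustration of why drift matters for a fixed classifier, the sketch below tracks accuracy over a sliding window and flags a possible drift when it falls well below an initial baseline. This is not CSG-DU, whose internals the abstract does not describe; the class name `WindowedAccuracyMonitor` and the `window_size` and `tolerance` parameters are hypothetical choices made for the example.

```python
from collections import deque


class WindowedAccuracyMonitor:
    """Flags possible drift when recent accuracy falls well below a baseline.

    Illustrative only: window_size and tolerance are hypothetical knobs,
    not parameters of CSG-DU.
    """

    def __init__(self, window_size=50, tolerance=0.15):
        self.window = deque(maxlen=window_size)
        self.tolerance = tolerance
        self.reference_accuracy = None

    def update(self, was_correct):
        """Record one prediction outcome; return True if drift is suspected."""
        self.window.append(1 if was_correct else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        accuracy = sum(self.window) / len(self.window)
        if self.reference_accuracy is None:
            # The first full window sets the baseline accuracy.
            self.reference_accuracy = accuracy
            return False
        return accuracy < self.reference_accuracy - self.tolerance
```

Feeding the monitor one boolean per labelled prediction is enough to use it. A flagged drift would then trigger whatever adaptation step the chosen method defines, such as updating the training dataset.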

Author(s):  
Leena Deshpande ◽  
M. Narsing Rao

In internetworked systems, a huge amount of data is scattered, generated, and processed over the network. Data mining techniques are used to discover unknown patterns in the underlying data. A traditional classification model classifies data based on past labelled data. However, in many current applications, data is increasing in size with fluctuating patterns, and because of this new features may arrive in the data. This occurs in many applications, such as sensor networks, banking and telecommunication systems, the financial domain, and electricity usage and prices driven by demand and supply. Such a change in data distribution reduces classification accuracy: some patterns may be discovered as frequent while other patterns tend to disappear and are wrongly classified. Traditional classification techniques may not be suitable for mining such data, because the distribution generating the items can change over time, so data from the past may become irrelevant or even false for the current prediction. To handle such varying patterns of data, a concept drift mining approach is used to improve the accuracy of classification techniques. In this paper, we propose an ensemble approach for improving classifier accuracy. The ensemble classifier is applied to three different data sets. We investigated different features for the different chunks of data, which are then given to the ensemble classifier. We observed that the proposed approach improves classifier accuracy for different chunks of data.


Author(s):  
Jiaoyan Chen ◽  
Freddy Lecue ◽  
Jeff Z. Pan ◽  
Huajun Chen

Data stream learning has been widely studied for extracting knowledge structures from continuous and rapid data records. In the Semantic Web, data is interpreted in ontologies and its ordered sequence is represented as an ontology stream. Our work exploits the semantics of such streams to tackle the problem of concept drift, i.e., unexpected changes in data distribution that cause most models to become less accurate as time passes. To this end, we revisited (i) semantic inference in the context of supervised stream learning, and (ii) models with semantic embeddings. The experiments show accurate prediction with data from Dublin and Beijing.


2019 ◽  
Vol 24 (13) ◽  
pp. 9835-9855 ◽  
Author(s):  
Ricardo de Almeida ◽  
Yee Mey Goh ◽  
Radmehr Monfared ◽  
Maria Teresinha Arns Steiner ◽  
Andrew West

Most information sources in the current technological world generate data sequentially and rapidly, in the form of data streams. The evolving nature of processes may often cause changes in data distribution, also known as concept drift, which is difficult to detect and causes loss of accuracy in supervised learning algorithms. As a consequence, online machine learning algorithms that are able to update actively according to possible changes in the data distribution are required. Although many strategies have been developed to tackle this problem, most of them are designed for classification problems. Therefore, in the domain of regression problems, there is a need for accurate algorithms with dynamic updating mechanisms that can operate in a computational time compatible with today’s demanding market. In this article, the authors propose a new bagging ensemble approach based on neural networks with random weights for online data stream regression. The proposed method improves prediction accuracy and reduces the required computational time compared to a recent algorithm for online data stream regression from the literature. The experiments are carried out using four synthetic datasets to evaluate the algorithm’s response to concept drift, along with four benchmark datasets from different industries. The results indicate improved prediction accuracy, effectiveness in handling concept drift, and much faster updating times compared to the existing approach. Additionally, the use of design of experiments as an effective tool for hyperparameter tuning is demonstrated.


2020 ◽  
Author(s):  
Saptarshi Bej ◽  
Narek Davtyan ◽  
Markus Wolfien ◽  
Mariam Nassar ◽  
Olaf Wolkenhauer

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm on 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms, and compared the performance of LoRAS with SMOTE and with several SMOTE extensions that share with LoRAS the concept of using convex combinations of minority class data points for oversampling. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the SMOTE extensions we tested improve the F1-Score with respect to SMOTE on average, they compromise on the Balanced accuracy of the classification model. LoRAS, on the contrary, improves both the F1-Score and the Balanced accuracy and thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate of the mean of the underlying local data distribution of the minority class data space.
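The shared idea of oversampling through convex combinations of minority class points can be sketched as follows. This is a deliberately simplified illustration rather than SMOTE or LoRAS as published: partners are drawn uniformly at random instead of from nearest neighbours, there is no shadowsampling step, and the function names `convex_combination` and `oversample_minority` are hypothetical.

```python
import random


def convex_combination(point_a, point_b, weight):
    """Return weight * point_a + (1 - weight) * point_b, coordinate-wise."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(point_a, point_b)]


def oversample_minority(minority_points, n_new, rng=None):
    """Create n_new synthetic points on segments between random minority pairs.

    Simplification (assumption of this sketch): partners are drawn uniformly
    at random rather than from the k nearest neighbours, and no noise-based
    shadowsampling is performed.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    synthetic = []
    for _ in range(n_new):
        point_a, point_b = rng.sample(minority_points, 2)
        synthetic.append(convex_combination(point_a, point_b, rng.random()))
    return synthetic
```

Because the weight lies in [0, 1), each synthetic point falls on the line segment between two existing minority points, which is what makes the combination convex.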


2020 ◽  
Vol 11 ◽  
Author(s):  
Guofeng Yang ◽  
Yong He ◽  
Yong Yang ◽  
Beibei Xu

Fine-grained image classification is a challenging task because discriminative features are difficult to identify: it is not easy to find the subtle features that fully represent the object. In the fine-grained classification of crop disease, visual disturbances such as light, fog, overlap, and jitter are frequently encountered. To explore the influence of the features of crop leaf images on the classification results, a classification model should focus on the more discriminative regions of the image while improving its classification accuracy in complex scenes. This paper proposes a novel attention mechanism that effectively utilizes the informative regions of an image, and describes the use of transfer learning to quickly construct several fine-grained image classification models of crop disease based on this attention mechanism. This study uses a dataset of 58,200 crop leaf images, covering 14 different crops and 37 different categories of healthy/diseased crops. Among them, different diseases of the same crop have strong similarities. The NASNetLarge fine-grained classification model based on the proposed attention mechanism achieves the best classification performance, with an F1 score of up to 93.05%. The results show that the proposed attention mechanism effectively improves the fine-grained classification of crop disease images.


2013 ◽  
Vol 12 (06) ◽  
pp. 1287-1308 ◽  
Author(s):  
JOÃO BÁRTOLO GOMES ◽  
MOHAMED MEDHAT GABER ◽  
PEDRO A. C. SOUSA ◽  
ERNESTINA MENASALVAS

In ubiquitous data stream mining, different devices often aim to learn concepts that are similar to some extent. In many applications, such as spam filtering or news recommendation, the concept underlying the data stream (e.g., interesting mail/news) is likely to change over time. Therefore, the resultant model must be continuously adapted to such changes. This paper presents a novel Collaborative Data Stream Mining (Coll-Stream) approach that explores the similarities in the knowledge available from other devices to improve local classification accuracy. Coll-Stream integrates the community knowledge using an ensemble method in which the classifiers are selected and weighted based on their local accuracy for different partitions of the feature space. We evaluate Coll-Stream's classification accuracy under concept drift, noise, and varying partition granularity and concept similarity in relation to the local underlying concept. The experimental results show that the resultant Coll-Stream model achieves stability and accuracy in a variety of situations using both synthetic and real-world datasets.


2020 ◽  
Vol 11 (1) ◽  
pp. 15-26
Author(s):  
Jay Gandhi ◽  
Vaibhav Gandhi

Data stream mining has become an active research topic, and interest in it as a data discovery method is growing. Several applications rely on stream data processing, such as device networks and electronic networks. Our approach, AhtNODE (Adaptive Hoeffding Tree based NOvel class DEtection), detects novel classes in the presence of concept drift in streaming data. It addresses three challenges of streaming data: infinite length, concept drift, and concept evolution. The approach automatically detects a novel class whenever it arrives in the data stream. It is a multi-class approach that distinguishes a novel class from existing classes. The authors apply the Adaptive Hoeffding Tree (AHT) as the classification model, which is also used to handle concept drift. Previous approaches used an ensemble model to handle concept drift. In AHT, classification is done in a single pass. The experimental results demonstrate the effectiveness of AhtNODE compared to an existing ensemble classifier in terms of classification accuracy, speed, and memory use.
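Hoeffding trees in general can learn in a single pass because they use the Hoeffding bound to decide when enough examples have been seen at a leaf to commit to a split. The sketch below shows that standard bound and the usual split test. It does not reproduce AhtNODE's adaptive or novel-class logic, which the abstract does not detail, and the function names and the `should_split` signature are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n_samples):
    """Epsilon such that, with probability at least 1 - delta, the true mean
    lies within epsilon of the mean observed over n_samples observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n_samples))


def should_split(best_gain, second_best_gain, value_range, delta, n_samples):
    """Split once the gap between the two best attributes exceeds the bound."""
    gap = best_gain - second_best_gain
    return gap > hoeffding_bound(value_range, delta, n_samples)
```

The bound shrinks as `n_samples` grows, so a leaf that cannot yet separate its two best attributes simply waits for more stream instances instead of revisiting old ones.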


Molecules ◽  
2019 ◽  
Vol 24 (8) ◽  
pp. 1550 ◽  
Author(s):  
Liang Xu ◽  
Wen Sun ◽  
Cui Wu ◽  
Yucui Ma ◽  
Zhimao Chao

Near-infrared (NIR) spectroscopy with chemometric techniques was applied in this work to discriminate the geographical origins of crude drugs (i.e., dried ripe fruits of Trichosanthes kirilowii) and prepared slices of Trichosanthis Fructus. The crude drug samples (120 batches) from four growing regions (i.e., Shandong, Shanxi, Hebei, and Henan Provinces) were collected and dried, and the prepared slice samples (30 batches) were purchased from different drug stores. The raw NIR spectra were acquired and preprocessed with multiplicative scatter correction (MSC). Principal component analysis (PCA) was used to extract relevant information from the spectral data and revealed visible cluster trends. Four different classification models, namely K-nearest neighbor (KNN), soft independent modeling of class analogy (SIMCA), partial least squares-discriminant analysis (PLS-DA), and support vector machine-discriminant analysis (SVM-DA), were constructed and their performances were compared. The corresponding classification model parameters were optimized by cross-validation (CV). Among the four classification models, the SVM-DA model was superior to the others, with a classification accuracy of up to 100% for both the calibration set and the prediction set. The optimal SVM-DA model was achieved when C = 100, γ = 0.00316, and the number of principal components (PCs) = 6. The PLS-DA model had a classification accuracy of 95% for the calibration set and 98% for the prediction set, and the KNN model had a classification accuracy of 92% for the calibration set and 94% for the prediction set. The non-linear classification method was superior to the linear ones. Overall, the results demonstrated that crude drugs from different geographical origins, as well as crude drugs and prepared slices of Trichosanthis Fructus, could be distinguished rapidly, nondestructively, and reliably by NIR spectroscopy coupled with the SVM-DA model.


2021 ◽  
Author(s):  
Priya S ◽  
Annie Uthra

As data mining applications grow in popularity, large volumes of data streams are generated over time. The main problem with data streams is that they exhibit a high degree of class imbalance and that the distribution of the data changes over time. In this paper, a Timely Drift Detection and Minority Resampling Technique (TDDMRT) based on K-nearest neighbors and Jaccard similarity is proposed to handle class imbalance by finding the current ratio of class labels. The Enhanced Early Drift Detection Method (EEDDM) is proposed for detecting concept drift, and the Minority Resampling Method (KNN-JS) determines whether the current data stream should be regarded as imbalanced and resamples the minority instances in the drifting data stream. The K-nearest neighbors technique is used to resample the minority classes, and the Jaccard similarity measure is applied to the resampled data to generate synthetic data similar to the original data, which is then handled by ensemble classifiers. The proposed ensemble-based classification model outperforms existing oversampling and undersampling techniques, with an accuracy of 98.52%.
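Two quantities named in this abstract have simple standard definitions: the Jaccard similarity of two sets, and the current ratio of class labels. The sketch below computes both. How TDDMRT encodes instances as sets and which imbalance threshold it applies are not stated in the abstract, so the set-based inputs and the function names here are assumptions made for illustration.

```python
from collections import Counter


def jaccard_similarity(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen for this sketch: two empty sets match
    return len(set_a & set_b) / len(union)


def minority_ratio(labels):
    """Fraction of the current window that belongs to the rarest class."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)
```

A resampling step could, for instance, keep a synthetic instance only when its Jaccard similarity to an original minority instance is high enough. The cut-off would be a tuning choice that the abstract does not specify.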


Author(s):  
Amirmahyar Abdolsamadi ◽  
Pingfeng Wang

Health diagnosis interprets data streams acquired by smart sensors and makes inferences about the health condition of an engineering system, thereby supporting critical operational decisions. A data stream is a continuous flow of data that poses several challenges for data mining. This paper addresses concept drift and concept evolution as two major challenges in the classification of streaming data. Concept drift occurs as a result of changes in data distribution. Concept evolution happens when new classes appear in the stream. These changes may degrade classification results over time. This paper presents an adaptive fusion learning approach to build a robust classification model. The proposed approach consists of three steps: (i) a fusion formulation using weighted majority voting, (ii) active learning to label instances selectively instead of querying for all true labels, and (iii) a distance-based approach to monitoring the movement of the data distribution. A diagnosis case study has been used to demonstrate the developed fusion diagnosis methodology.
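Step (i), fusion by weighted majority voting, can be sketched in a few lines: each classifier's predicted label receives that classifier's weight, and the label with the largest total wins. How the paper derives and adapts the weights is not given in the abstract, so the weights below are plain inputs and the function name is hypothetical.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight.

    predictions: one predicted label per classifier.
    weights: one non-negative weight per classifier (assumed to be given).
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

For example, with predictions `["healthy", "faulty", "faulty"]` and weights `[0.9, 0.3, 0.4]`, the single strong classifier outweighs the two weaker ones and the fused label is `"healthy"`.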

