Countering the concept-drift problems in big data by an incrementally optimized stream mining model

Abstract Background: Internet of Things (IoT), earth observation and big scientific experiments are sources of extensive amounts of sensor big data today. We are faced with large amounts of data with low measurement costs. A standard approach in such cases is a stream mining approach, implying that we look at a particular measurement only once during the real-time processing. This requires the methods to be completely autonomous. In the past, very little attention was given to the most time-consuming part of the data mining process, i.e. data pre-processing. Objectives: In this paper we propose an algorithm for data cleaning, which can be applied to real-world streaming big data. Methods/Approach: We use the short-term prediction method based on the Kalman filter to detect admissible intervals for future measurements. The model can be adapted to the concept drift and is useful for detecting random additive outliers in a sensor data stream. Results: For datasets with low noise, our method has proven to perform better than the method currently commonly used in batch processing scenarios. Our results on higher noise datasets are comparable. Conclusions: We have demonstrated a successful application of the proposed method in real-world scenarios including the groundwater level, server load and smart-grid data

Download Full-text

Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining

2020 International Conference on Data Mining Workshops (ICDMW) ◽

10.1109/icdmw51313.2020.00122 ◽

2020 ◽

Author(s):

Tengyue Li ◽

Simon Fong ◽

Yaoyang Wu ◽

Antonio J. Tallon-Ballesteros

Keyword(s):

Time Series ◽

Big Data ◽

Data Stream ◽

Data Stream Mining ◽

Stream Mining

Download Full-text

Measuring the Effectiveness of Adaptive Random Forest for Handling Concept Drift in Big Data Streams

Entropy ◽

10.3390/e23070859 ◽

2021 ◽

Vol 23 (7) ◽

pp. 859

Author(s):

Abdulaziz O. AlQabbany ◽

Aqil M. Azmi

Keyword(s):

Big Data ◽

Random Forest ◽

Real Time ◽

Data Streams ◽

Learning Algorithm ◽

Concept Drift ◽

The United States ◽

Careful Consideration ◽

Data Sets ◽

Stream Data

We are living in the age of big data, a majority of which is stream data. The real-time processing of this data requires careful consideration from different perspectives. Concept drift is a change in the data’s underlying distribution, a significant issue, especially when learning from data streams. It requires learners to be adaptive to dynamic changes. Random forest is an ensemble approach that is widely used in classical non-streaming settings of machine learning applications. At the same time, the Adaptive Random Forest (ARF) is a stream learning algorithm that showed promising results in terms of its accuracy and ability to deal with various types of drift. The incoming instances’ continuity allows for their binomial distribution to be approximated to a Poisson(1) distribution. In this study, we propose a mechanism to increase such streaming algorithms’ efficiency by focusing on resampling. Our measure, resampling effectiveness (ρ), fuses the two most essential aspects in online learning; accuracy and execution time. We use six different synthetic data sets, each having a different type of drift, to empirically select the parameter λ of the Poisson distribution that yields the best value for ρ. By comparing the standard ARF with its tuned variations, we show that ARF performance can be enhanced by tackling this important aspect. Finally, we present three case studies from different contexts to test our proposed enhancement method and demonstrate its effectiveness in processing large data sets: (a) Amazon customer reviews (written in English), (b) hotel reviews (in Arabic), and (c) real-time aspect-based sentiment analysis of COVID-19-related tweets in the United States during April 2020. Results indicate that our proposed method of enhancement exhibited considerable improvement in most of the situations.

Download Full-text

Knowledge Discovery From Evolving Data Streams

Advances in Business Information Systems and Analytics - Machine Learning Techniques for Improved Business Analytics ◽

10.4018/978-1-5225-3534-8.ch002 ◽

2019 ◽

pp. 19-39

Author(s):

Prasanna Lakshmi Kompalli

Keyword(s):

Real Time ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Data Stream Mining ◽

Time Data ◽

Stream Mining ◽

New Challenges ◽

Mining Data Streams ◽

Different Sources

Data coming from different sources is referred to as data streams. Data stream mining is an online learning technique where each data point must be processed as the data arrives and discarded as the processing is completed. Progress of technologies has resulted in the monitoring these data streams in real time. Data streams has created many new challenges to the researchers in real time. The main features of this type of data are they are fast flowing, large amounts of data which are continuous and growing in nature, and characteristics of data might change in course of time which is termed as concept drift. This chapter addresses the problems in mining data streams with concept drift. Due to which, isolating the correct literature would be a grueling task for researchers and practitioners. This chapter tries to provide a solution as it would be an amalgamation of all techniques used for data stream mining with concept drift.

Download Full-text

Data Stream Mining Using Ensemble Classifier

Collaborative Filtering Using Data Mining and Analysis - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-0489-4.ch013 ◽

2017 ◽

pp. 236-249

Author(s):

Snehlata Sewakdas Dongre ◽

Latesh G. Malik

Keyword(s):

Collaborative Filtering ◽

Data Stream ◽

Concept Drift ◽

Ensemble Classifier ◽

Ensemble Classification ◽

Data Stream Mining ◽

Main Concern ◽

Stream Mining ◽

Stream Classification ◽

Data Stream Classification

A data stream is giant amount of data which is generated uncontrollably at a rapid rate from many applications like call detail records, log records, sensors applications etc. Data stream mining has grasped the attention of so many researchers. A rising problem in Data Streams is the handling of concept drift. To be a good algorithm it should adapt the changes and handle the concept drift properly. Ensemble classification method is the group of classifiers which works in collaborative manner. Overall this chapter will cover all the aspects of the data stream classification. The mission of this chapter is to discuss various techniques which use collaborative filtering for the data stream mining. The main concern of this chapter is to make reader familiar with the data stream domain and data stream mining. Instead of single classifier the group of classifiers is used to enhance the accuracy of classification. The collaborative filtering will play important role here how the different classifiers work collaborative within the ensemble to achieve a goal.

Download Full-text

Data-driven decision support under concept drift in streamed big data

Complex & Intelligent Systems ◽

10.1007/s40747-019-00124-4 ◽

2019 ◽

Vol 6 (1) ◽

pp. 157-163 ◽

Cited By ~ 2

Author(s):

Jie Lu ◽

Anjin Liu ◽

Yiliao Song ◽

Guangquan Zhang

Keyword(s):

Decision Making ◽

Big Data ◽

Real Time ◽

Concept Drift ◽

High Volume ◽

Streaming Data ◽

Data Driven ◽

Research Directions ◽

Decision Outcomes ◽

Past Data

Abstract Data-driven decision-making ($$\mathrm {D^3}$$D3M) is often confronted by the problem of uncertainty or unknown dynamics in streaming data. To provide real-time accurate decision solutions, the systems have to promptly address changes in data distribution in streaming data—a phenomenon known as concept drift. Past data patterns may not be relevant to new data when a data stream experiences significant drift, thus to continue using models based on past data will lead to poor prediction and poor decision outcomes. This position paper discusses the basic framework and prevailing techniques in streaming type big data and concept drift for $$\mathrm {D^3}$$D3M. The study first establishes a technical framework for real-time $$\mathrm {D^3}$$D3M under concept drift and details the characteristics of high-volume streaming data. The main methodologies and approaches for detecting concept drift and supporting $$\mathrm {D^3}$$D3M are highlighted and presented. Lastly, further research directions, related methods and procedures for using streaming data to support decision-making in concept drift environments are identified. We hope the observations in this paper could support researchers and professionals to better understand the fundamentals and research directions of $$\mathrm {D^3}$$D3M in streamed big data environments.

Download Full-text

Improvised methods for tackling big data stream mining challenges: case study of human activity recognition

The Journal of Supercomputing ◽

10.1007/s11227-016-1639-5 ◽

2016 ◽

Vol 72 (10) ◽

pp. 3927-3959 ◽

Cited By ~ 6

Author(s):

Simon Fong ◽

Kexing Liu ◽

Kyungeun Cho ◽

Raymond Wong ◽

Sabah Mohammed ◽

...

Keyword(s):

Big Data ◽

Activity Recognition ◽

Human Activity ◽

Data Stream ◽

Human Activity Recognition ◽

Data Stream Mining ◽

Stream Mining

Download Full-text

Context-adaptive big data stream mining

2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton) ◽

10.1109/allerton.2014.7028494 ◽

2014 ◽

Cited By ~ 8

Author(s):

Cem Tekin ◽

Luca Canzian ◽

Mihaela van der Schaar

Keyword(s):

Big Data ◽

Data Stream ◽

Data Stream Mining ◽

Stream Mining

Download Full-text

Big Data Stream Mining

Machine Learning for Data Streams ◽

10.7551/mitpress/10654.003.0006 ◽

2018 ◽

Keyword(s):

Big Data ◽

Data Stream ◽

Data Stream Mining ◽

Stream Mining

Download Full-text

Addressing Complexities of Machine Learning in Big Data: Principles, Trends and Challenges from Systematical Perspectives

10.20944/preprints201710.0076.v1 ◽

2017 ◽

Cited By ~ 2

Author(s):

Qi Wang ◽

Xia Zhao ◽

Jincai Huang ◽

Yanghe Feng ◽

Zhong Liu ◽

...

Keyword(s):

Machine Learning ◽

Big Data ◽

Concept Drift ◽

Model Development ◽

Machine Learning Techniques ◽

Comprehensive Overview ◽

Robust Learning ◽

Human Interactions ◽

Learning Techniques ◽

History Of

The concept of ‘big data’ has been widely discussed, and its value has been illuminated throughout a variety of domains. To quickly mine potential values and alleviate the ever-increasing volume of information, machine learning is playing an increasingly important role and faces more challenges than ever. Because few studies exist regarding how to modify machine learning techniques to accommodate big data environments, we provide a comprehensive overview of the history of the evolution of big data, the foundations of machine learning, and the bottlenecks and trends of machine learning in the big data era. More specifically, based on learning principals, we discuss regularization to enhance generalization. The challenges of quality in big data are reduced to the curse of dimensionality, class imbalances, concept drift and label noise, and the underlying reasons and mainstream methodologies to address these challenges are introduced. Learning model development has been driven by domain specifics, dataset complexities, and the presence or absence of human involvement. In this paper, we propose a robust learning paradigm by aggregating the aforementioned factors. Over the next few decades, we believe that these perspectives will lead to novel ideas and encourage more studies aimed at incorporating knowledge and establishing data-driven learning systems that involve both data quality considerations and human interactions.

Download Full-text