Autonomous Sensor Data Cleaning in Stream Mining Setting

2018 ◽  
Vol 9 (2) ◽  
pp. 69-79 ◽  
Author(s):  
Klemen Kenda ◽  
Dunja Mladenić

Abstract Background: The Internet of Things (IoT), earth observation, and big scientific experiments are sources of extensive amounts of sensor data today. We are faced with large amounts of data obtained at low measurement cost. A standard approach in such cases is stream mining, implying that we look at a particular measurement only once during real-time processing. This requires the methods to be completely autonomous. In the past, very little attention was given to the most time-consuming part of the data mining process, i.e. data pre-processing. Objectives: In this paper we propose a data-cleaning algorithm that can be applied to real-world streaming big data. Methods/Approach: We use a short-term prediction method based on the Kalman filter to derive admissible intervals for future measurements. The model can adapt to concept drift and is useful for detecting random additive outliers in a sensor data stream. Results: For datasets with low noise, our method performs better than the method commonly used in batch processing scenarios; on noisier datasets our results are comparable. Conclusions: We demonstrate a successful application of the proposed method in real-world scenarios including groundwater level, server load, and smart-grid data.
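The Kalman-filter approach described in the abstract can be illustrated with a minimal one-dimensional sketch: predict the next value, accept measurements that fall inside an admissible interval around the prediction, and flag the rest as additive outliers. The class name, noise parameters, and the ±k·σ interval rule below are illustrative assumptions, not the authors' implementation.

```python
import math

class KalmanOutlierFilter:
    """Minimal 1-D Kalman filter that flags measurements falling
    outside an admissible prediction interval (illustrative sketch,
    not the paper's exact method)."""

    def __init__(self, process_var=1e-3, measurement_var=1.0, k=3.0):
        self.q = process_var      # process noise variance (assumed)
        self.r = measurement_var  # measurement noise variance (assumed)
        self.k = k                # interval half-width in std. deviations
        self.x = None             # current state estimate
        self.p = 1.0              # estimate variance

    def update(self, z):
        """Return (is_outlier, estimate) for a new measurement z."""
        if self.x is None:        # initialise on the first sample
            self.x = z
            return False, self.x
        # Predict step: state unchanged, uncertainty grows.
        p_pred = self.p + self.q
        # Admissible interval around the predicted measurement.
        bound = self.k * math.sqrt(p_pred + self.r)
        is_outlier = abs(z - self.x) > bound
        if not is_outlier:
            # Update step with the Kalman gain.
            gain = p_pred / (p_pred + self.r)
            self.x = self.x + gain * (z - self.x)
            self.p = (1.0 - gain) * p_pred
        else:
            # Skip the update: keep the prediction, let uncertainty grow.
            self.p = p_pred
        return is_outlier, self.x
```

Because a flagged measurement never updates the state, a single spike cannot corrupt the model, while the growing variance lets the filter re-adapt if the level genuinely shifts (concept drift).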

Author(s):  
Sylva Girtelschmid ◽  
Matthias Steinbauer ◽  
Vikash Kumar ◽  
Anna Fensel ◽  
Gabriele Kotsis

Purpose – The purpose of this article is to propose and evaluate a novel system architecture for Smart City applications that combines ontology reasoning with a distributed stream-processing framework on the cloud. In the Smart City domain, methodologies of semantic modelling and automated inference are often applied; however, semantic models frequently face performance problems when applied at large scale. Design/methodology/approach – The problem domain is addressed by combining methods from Big Data processing with semantic models. The architecture is designed so that traditional semantic models and rule engines can still be used for the Smart City model, while sensor data arriving from such Smart Cities are pre-processed by a Big Data streaming platform to lower the workload passed to the rule engine. Findings – By creating a real-world implementation of the proposed architecture and running simulations of Smart Cities of different sizes on top of it, the authors found that combining Big Data streaming platforms with semantic reasoning is a valid approach to the problem. Research limitations/implications – In this article, real-world sensor data from only two buildings were extrapolated for the simulations. Real-world scenarios will have a more complex set of sensor input values, which needs to be addressed in future work. Originality/value – The simulations show that merely using a streaming platform as a buffer for sensor input values already increases sensor data throughput, and that by applying intelligent filtering in the streaming platform, the actual number of rule executions can be kept to a minimum.
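The pre-filtering idea, buffering sensor readings in a streaming platform and forwarding only significant changes to the expensive rule engine, can be sketched roughly as follows. The `ChangeFilter` class, the `min_delta` threshold, and the callable rule engine are hypothetical names for illustration, not the authors' system.

```python
class ChangeFilter:
    """Toy sketch of intelligent filtering in front of a semantic rule
    engine: forward a sensor reading only when it differs enough from
    the last forwarded value for that sensor. Names and thresholds are
    illustrative assumptions, not from the article."""

    def __init__(self, rule_engine, min_delta=0.5):
        self.rule_engine = rule_engine  # callable invoked per forwarded event
        self.min_delta = min_delta      # minimum change worth a rule execution
        self.last = {}                  # sensor_id -> last forwarded value

    def on_reading(self, sensor_id, value):
        prev = self.last.get(sensor_id)
        if prev is None or abs(value - prev) >= self.min_delta:
            self.last[sensor_id] = value
            self.rule_engine(sensor_id, value)
            return True   # rule engine invoked
        return False      # filtered out, no rule execution

# Counting invocations shows how filtering caps rule executions:
calls = []
flt = ChangeFilter(lambda s, v: calls.append((s, v)), min_delta=0.5)
for v in [20.0, 20.1, 20.2, 21.0, 21.1]:
    flt.on_reading("temp-1", v)
```

Here five readings trigger only two rule executions, which mirrors the article's finding that filtering in the streaming layer limits the number of rule executions.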


2019 ◽  
Vol 8 (S3) ◽  
pp. 45-49
Author(s):  
V. Bhagyasree ◽  
K. Rohitha ◽  
K. Kusuma ◽  
S. Kokila

The Internet of Things anticipates the connection of physical devices to the Internet and access to their wireless sensor data, which makes it possible to monitor and control the physical world. Big Data convergence has many aspects and opens new opportunities for business ventures to enter new markets or enhance their operations in current ones. Given existing techniques and technologies, it is probably safe to say that the best solution is to use big data tools to provide analytical solutions for the Internet of Things. Based on current technology deployment and adoption trends, it is envisioned that the Internet of Things is the technology of the future; today's real-world devices can already provide valuable analytics, and people use many IoT devices in everyday life. In spite of all the advertisements that companies offer in connection with the Internet of Things, you, as a responsible consumer, have the right to be skeptical about IoT advertisements. This paper examines the reality of the Internet of Things and its prospects for the future.


Entity resolution refers to the process of identifying records that describe the same real-world object across multiple data sets. It is an important step in data cleaning and data integration applications. When data is large, the task of entity resolution becomes complex and time-consuming. An end-to-end entity resolution pipeline involves stages such as blocking (efficiently identifying candidate duplicates), detailed comparison (refining the blocking output), and clustering (identifying the sets of records that may refer to the same entity). In this paper, an approach for feedback-based optimization of the complete entity resolution pipeline is proposed, in which supervised meta-blocking is used for the blocking stage. The proposed technique optimizes each phase of entity resolution and exploits the benefits of supervised meta-blocking to improve entity resolution performance on big data.
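The blocking → detailed comparison → clustering pipeline described above can be sketched minimally in Python. The helper names (`block_by_key`, `resolve`), the single-key blocking, and the union-find clustering are illustrative choices; the paper's supervised meta-blocking and feedback loop are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Blocking: group records sharing a cheap key so that only
    within-block pairs need a detailed comparison (illustrative
    stand-in for the paper's supervised meta-blocking)."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    return blocks

def resolve(records, key, similar):
    """End-to-end sketch: blocking -> detailed pairwise comparison ->
    clustering as connected components via union-find."""
    parent = {rid: rid for rid in records}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ids in block_by_key(records, key).values():
        for a, b in combinations(ids, 2):       # detailed comparison
            if similar(records[a], records[b]):
                parent[find(a)] = find(b)       # merge entity clusters

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return list(clusters.values())
```

For example, blocking three name records on their first letter and comparing with a crude string similarity groups "Jon Smith" and "John Smith" into one entity cluster while never comparing them against "Ana Diaz", which is the efficiency win blocking provides.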


Author(s):  
Anuradha Rajkumar ◽  
Bruce Wallace ◽  
Laura Ault ◽  
Julien Lariviere-Chartier ◽  
Frank Knoefel ◽  
...  

2021 ◽  
pp. 100489
Author(s):  
Paul La Plante ◽  
P.K.G. Williams ◽  
M. Kolopanis ◽  
J.S. Dillon ◽  
A.P. Beardsley ◽  
...  

Entropy ◽  
2021 ◽  
Vol 23 (3) ◽  
pp. 380
Author(s):  
Emanuele Cavenaghi ◽  
Gabriele Sottocornola ◽  
Fabio Stella ◽  
Markus Zanker

The Multi-Armed Bandit (MAB) problem has been extensively studied in order to address real-world challenges related to sequential decision making. In this setting, an agent selects the best action to be performed at time-step t, based on the past rewards received from the environment. This formulation implicitly assumes that the expected payoff of each action remains stationary over time. Nevertheless, in many real-world applications this assumption does not hold, and the agent has to face a non-stationary environment, that is, one with a changing reward distribution. Thus, we present a new MAB algorithm, named f-Discounted-Sliding-Window Thompson Sampling (f-dsw TS), for non-stationary environments, that is, when the data stream is affected by concept drift. The f-dsw TS algorithm is based on Thompson Sampling (TS) and exploits a discount factor on the reward history and an arm-related sliding window to counteract concept drift in non-stationary environments. We investigate how to combine these two sources of information, namely the discount factor and the sliding window, by means of an aggregation function f(.). In particular, we propose a pessimistic (f=min), an optimistic (f=max), as well as an averaged (f=mean) version of the f-dsw TS algorithm. A rich set of numerical experiments is performed to evaluate f-dsw TS against both stationary and non-stationary state-of-the-art TS baselines. We exploit synthetic environments (both randomly generated and controlled) to test the MAB algorithms under different types of drift, that is, sudden/abrupt, incremental, gradual, and increasing/decreasing drift. Furthermore, we adapt four real-world active learning tasks to our framework: a prediction task on crimes in the city of Baltimore, a classification task on insect species, a recommendation task on local web news, and a time-series analysis of microbial organisms in the tropical air ecosystem.
The f-dsw TS approach emerges as the best-performing MAB algorithm. At least one version of f-dsw TS performs better than the baselines in synthetic environments, proving the robustness of f-dsw TS under different concept drift types. Moreover, the pessimistic version (f=min) proves the most effective in all real-world tasks.
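A simplified Bernoulli-reward sketch of the f-dsw TS idea follows: one Beta posterior is built from the discounted full history, a second from a per-arm sliding window, and the two sampled scores are aggregated with f (min, max, or mean). Class and parameter names, defaults, and the exact discounting scheme are illustrative assumptions, not the authors' precise formulation.

```python
import random

class FDswTS:
    """Illustrative Bernoulli sketch of f-dsw Thompson Sampling:
    aggregate a discounted-history posterior sample with a
    sliding-window posterior sample via f."""

    def __init__(self, n_arms, gamma=0.99, window=50, f=min):
        self.gamma = gamma            # discount factor on reward history
        self.window = window          # sliding-window length per arm
        self.f = f                    # aggregation function (min/max/mean)
        self.alpha = [1.0] * n_arms   # discounted successes + Beta(1,1) prior
        self.beta = [1.0] * n_arms    # discounted failures + prior
        self.recent = [[] for _ in range(n_arms)]  # per-arm reward window

    def select(self):
        """Sample both posteriors per arm, aggregate, play the argmax."""
        scores = []
        for a in range(len(self.alpha)):
            s_hist = random.betavariate(self.alpha[a], self.beta[a])
            wins = sum(self.recent[a])
            n = len(self.recent[a])
            s_win = random.betavariate(1.0 + wins, 1.0 + n - wins)
            scores.append(self.f(s_hist, s_win))
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Decay all arms' pseudo-counts toward the prior, then
        # credit the played arm with the observed 0/1 reward.
        for a in range(len(self.alpha)):
            self.alpha[a] = 1.0 + self.gamma * (self.alpha[a] - 1.0)
            self.beta[a] = 1.0 + self.gamma * (self.beta[a] - 1.0)
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward
        self.recent[arm].append(reward)
        if len(self.recent[arm]) > self.window:
            self.recent[arm].pop(0)
```

The discount keeps old evidence from dominating after a drift, while the window reacts quickly to recent rewards; choosing f = min corresponds to the pessimistic variant that the abstract reports as the strongest on the real-world tasks.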

