Unsupervised Feature Selection for Outlier Detection on Streaming Data to Enhance Network Security

Michael Heigl; Enrico Weigelt; Dalibor Fiala; Martin Schramm

doi:10.3390/app112412073

Applied Sciences ◽

10.3390/app112412073 ◽

2021 ◽

Vol 11 (24) ◽

pp. 12073

Author(s):

Michael Heigl ◽

Enrico Weigelt ◽

Dalibor Fiala ◽

Martin Schramm

Keyword(s):

Feature Selection ◽

Outlier Detection ◽

Data Streams ◽

State Of The Art ◽

Streaming Data ◽

Detection Methods ◽

Unsupervised Feature Selection ◽

Detection Algorithms ◽

Efficient Detection ◽

Selection For

Over the past couple of years, machine learning methods, especially outlier detection methods, have become established in the cybersecurity field for detecting network-based anomalies rooted in novel attack patterns. However, the ubiquity of massive, continuously generated data streams poses an enormous challenge to efficient detection schemes and demands fast, memory-constrained online algorithms that are capable of dealing with concept drifts. Feature selection plays an important role in improving outlier detection by identifying noisy data that contain irrelevant or redundant features. State-of-the-art work focuses either on unsupervised feature selection for data streams or on (offline) outlier detection. Substantial requirements for combining both fields are derived and compared with existing approaches. The comprehensive review reveals a research gap in unsupervised feature selection for the improvement of outlier detection methods in data streams. Thus, a novel algorithm for Unsupervised Feature Selection for Streaming Outlier Detection, denoted as UFSSOD, is proposed, which is able to perform unsupervised feature selection for the purpose of outlier detection on streaming data. Furthermore, it is able to determine the number of top-performing features by clustering their score values. A generic concept that shows two application scenarios of UFSSOD in conjunction with off-the-shelf online outlier detection algorithms has been derived. Extensive experiments have shown that a promising feature selection mechanism for streaming data is not applicable in the field of outlier detection. Moreover, UFSSOD, as an online-capable algorithm, yields comparable results to a state-of-the-art offline method trimmed for outlier detection.
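
The abstract says that the number of top-performing features is determined by clustering their score values, but it does not name the clustering method. As a minimal sketch, assuming a plain one-dimensional 2-means split (an illustrative choice, not necessarily the authors' method) and a hypothetical function name, the idea could look like this:

```python
def top_features_by_score_clustering(scores, iterations=50):
    """Split 1-D feature scores into a 'low' and a 'high' cluster (2-means)
    and return the indices of the features that fall in the 'high' cluster.

    Assumption: higher score means more relevant feature.
    """
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if low == high:
        # All scores are identical, so there is nothing to separate.
        return list(range(len(scores)))

    def in_high(s):
        # A score belongs to the cluster whose centre is closer.
        return abs(s - high) <= abs(s - low)

    for _ in range(iterations):
        high_group = [s for s in scores if in_high(s)]
        low_group = [s for s in scores if not in_high(s)]
        new_high = sum(high_group) / len(high_group)
        new_low = sum(low_group) / len(low_group)
        if new_high == high and new_low == low:
            break  # centres stopped moving
        high, low = new_high, new_low

    return [i for i, s in enumerate(scores) if in_high(s)]


# Made-up scores for six features; three of them clearly stand out.
example_scores = [0.91, 0.12, 0.88, 0.05, 0.10, 0.93]
selected = top_features_by_score_clustering(example_scores)
print(selected)  # [0, 2, 5]
```

The appeal of such a split is that no fixed "top k" has to be chosen in advance; the cut-off follows from the gap in the scores themselves.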

TADILOF: Time Aware Density-Based Incremental Local Outlier Detection in Data Streams

Sensors ◽

10.3390/s20205829 ◽

2020 ◽

Vol 20 (20) ◽

pp. 5829 ◽

Cited By ~ 1

Author(s):

Jen-Wei Huang ◽

Meng-Xun Zhong ◽

Bijay Prasad Jaysawal

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Data Stream ◽

State Of The Art ◽

Streaming Data ◽

Current State ◽

Data Points ◽

Local Outlier ◽

Time Aware ◽

Over Time

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent advances in outlier detection based on the density-based local outlier factor (LOF) algorithms do not consider variations in data that change over time. For example, a new cluster of data points may appear in the data stream over time. Therefore, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means for estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. The results of experiments demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.

Unsupervised Feature Selection for Outlier Detection by Modelling Hierarchical Value-Feature Couplings

2016 IEEE 16th International Conference on Data Mining (ICDM) ◽

10.1109/icdm.2016.0052 ◽

2016 ◽

Cited By ~ 10

Author(s):

Guansong Pang ◽

Longbing Cao ◽

Ling Chen ◽

Huan Liu

Keyword(s):

Feature Selection ◽

Outlier Detection ◽

Unsupervised Feature Selection ◽

Selection For

Anomaly Pattern Detection in Streaming Data Based on the Transformation to Multiple Binary-Valued Data Streams

Journal of Artificial Intelligence and Soft Computing Research ◽

10.2478/jaiscr-2022-0002 ◽

2021 ◽

Vol 12 (1) ◽

pp. 19-27

Author(s):

Taegong Kim ◽

Cheong Hee Park

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Data Stream ◽

Detection Method ◽

Binary Classification ◽

Streaming Data ◽

Pattern Detection ◽

Detection Methods ◽

Anomaly Pattern ◽

Isolation Forest

Anomaly pattern detection in a data stream aims to detect a time point where outliers begin to occur abnormally. Recently, a method for anomaly pattern detection has been proposed based on binary classification for outliers and statistical tests in the data stream of binary labels (normal or outlier). It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, since the anomaly pattern detection method is based on binary classification for outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used directly. In this paper, we propose an anomaly pattern detection method in a data stream using the transformation to multiple binary-valued data streams from real-valued outlier scores. By using three outlier detection methods, Isolation Forest (IF), Autoencoder-based outlier detection, and Local Outlier Factor (LOF), the proposed anomaly pattern detection method is tested using artificial and real data sets. The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
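
The abstract does not spell out how real-valued outlier scores are turned into multiple binary-valued streams. One plausible reading, shown here only as a hedged sketch, is to apply several thresholds to the same score stream so that each threshold yields its own stream of 0/1 labels. The function name and the threshold values below are assumptions for illustration, not taken from the paper.

```python
def to_binary_streams(scores, thresholds):
    """Map one stream of real-valued outlier scores to one binary stream per
    threshold: 1 marks a score above the threshold (outlier), 0 otherwise."""
    return {t: [1 if s > t else 0 for s in scores] for t in thresholds}


# Made-up outlier scores in [0, 1] and two hypothetical thresholds.
scores = [0.10, 0.40, 0.80, 0.95]
streams = to_binary_streams(scores, thresholds=[0.5, 0.9])
print(streams[0.5])  # [0, 0, 1, 1]
print(streams[0.9])  # [0, 0, 0, 1]
```

Each binary stream could then be fed to a statistical test on the outlier rate, which is the step the abstract attributes to the earlier binary-label method.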

A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams

Big Data and Cognitive Computing ◽

10.3390/bdcc5010001 ◽

2020 ◽

Vol 5 (1) ◽

pp. 1

Author(s):

Omar Alghushairy ◽

Raed Alsini ◽

Terence Soule ◽

Xiaogang Ma

Keyword(s):

Big Data ◽

Outlier Detection ◽

Data Streams ◽

Detection Methods ◽

Normal Range ◽

Local Outlier Factor ◽

Detection Algorithms ◽

Network Intrusion ◽

Entire Dataset ◽

Local Outlier

Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments, with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and limitations of those algorithms and proposes several promising directions for developing improved local outlier detection methods for data streams.
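
To make the global/local distinction concrete, the following is a small sketch of the standard static LOF computation (k-distance, reachability distance, local reachability density). It is a simplified batch version for illustration only: it takes exactly k neighbours and ignores distance ties, assumes the points are distinct, and is not one of the streaming variants the review surveys. The function name is a placeholder.

```python
import math


def local_outlier_factor(points, k):
    """Return the LOF score of every point; values near 1 indicate inliers,
    values well above 1 indicate local outliers."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    neighbors, k_distance = [], []
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p.
        ranked = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        nearest = ranked[:k]
        neighbors.append([j for _, j in nearest])
        k_distance.append(nearest[-1][0])  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(p, o) = max(k_distance(o), d(p, o)).
    lrd = []
    for i, p in enumerate(points):
        reach = [max(k_distance[j], math.dist(p, points[j]))
                 for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i])
            for i in range(n)]


# Four points on a unit square plus one point far away from them.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
lof = local_outlier_factor(data, k=2)
print(max(range(len(lof)), key=lof.__getitem__))  # 4, the isolated point
```

The sketch recomputes every neighbourhood from scratch, which is exactly the cost that makes the static algorithm unsuitable for data streams.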

Genetic-based Summarization for Local Outlier Detection in Data Stream

International Journal of Intelligent Systems and Applications ◽

10.5815/ijisa.2021.01.05 ◽

2021 ◽

Vol 13 (1) ◽

pp. 58-68

Author(s):

Mohamed Sakr ◽

Walid Atwa ◽

Arabi Keshk

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Approximate Solutions ◽

Streaming Data ◽

Detection Algorithms ◽

Processing Power ◽

Static Data ◽

Large Memory ◽

Two Phases ◽

Local Outlier

Outlier detection is one of the important tasks in data mining. Detecting outliers over streaming data has become an important task in many applications, such as network analysis, fraud detection, and environment monitoring. One of the well-known outlier detection algorithms is the Local Outlier Factor (LOF). However, the original LOF has drawbacks that prevent its use with data streams: (1) it needs a lot of processing power (CPU) and large memory to detect the outliers; (2) it deals with static data, which means that after any change in the data LOF recalculates the outliers from the beginning on the whole data. These drawbacks pose big challenges for existing outlier detection algorithms in terms of their accuracy when they are implemented in a streaming environment. In this paper, we propose a new algorithm called GSILOF that focuses on detecting outliers in data streams using genetics. GSILOF solves the problem of the large memory needed, as it has a fixed memory bound. GSILOF has two phases. First, the summarization phase tries to summarize the past data that has arrived. Second, the detection phase detects the outliers in the newly arriving data. The summarization phase uses a genetic algorithm to try to find the subset of points that can represent the whole original set. Our experiments were carried out on real datasets and confirm the effectiveness of the proposed approach and the high quality of approximate solutions on a set of real-world streaming data.

Unsupervised feature selection for outlier detection in categorical data using mutual information

2012 12th International Conference on Hybrid Intelligent Systems (HIS) ◽

10.1109/his.2012.6421343 ◽

2012 ◽

Cited By ~ 2

Author(s):

N N R Ranga Suri ◽

M Narasimha Murty ◽

G Athithan

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Outlier Detection ◽

Categorical Data ◽

Unsupervised Feature Selection ◽

Selection For

On the Improvement of the Isolation Forest Algorithm for Outlier Detection with Streaming Data

Electronics ◽

10.3390/electronics10131534 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1534

Author(s):

Michael Heigl ◽

Kumar Ashutosh Anand ◽

Andreas Urmann ◽

Dalibor Fiala ◽

Martin Schramm ◽

...

Keyword(s):

Outlier Detection ◽

Real World ◽

High Speed ◽

State Of The Art ◽

High Volume ◽

Streaming Data ◽

Steady Increase ◽

Efficient Detection ◽

Real World Datasets ◽

Isolation Forest

In recent years, detecting anomalies in real-world computer networks has become an increasingly challenging task due to the steady increase of high-volume, high-speed and high-dimensional streaming data, for which ground truth information is not available. Efficient detection schemes applied on networked embedded devices need to be fast and memory-constrained, and must be capable of dealing with concept drifts when they occur. Different approaches for unsupervised online outlier detection (OD) have been designed to deal with these circumstances in order to reliably detect malicious activity. In this paper, we introduce a novel framework called PCB-iForest, which, in its generalized form, is able to incorporate any ensemble-based online OD method to function on streaming data. Carefully engineered requirements are compared to the most popular state-of-the-art online methods with an in-depth focus on variants based on the widely accepted isolation forest algorithm, thereby highlighting the lack of a flexible and efficient solution, which PCB-iForest provides. Therefore, we integrate two variants into PCB-iForest: an isolation forest improvement called extended isolation forest and a classic isolation forest variant equipped with the functionality to score features according to their contributions to a sample's anomalousness. Extensive experiments were performed on 23 different multi-disciplinary and security-related real-world datasets in order to comprehensively evaluate the performance of our implementation compared with off-the-shelf methods. The discussion of results, including AUC, F1 score and averaged execution time metric, shows that PCB-iForest clearly outperformed the state-of-the-art competitors in 61% of cases and even achieved more promising results in terms of the tradeoff between classification and computational costs.

Multivariate Anomaly Detection for Earth Observations: A Comparison of Algorithms and Feature Extraction Techniques

10.5194/esd-2016-51 ◽

2016 ◽

Cited By ~ 1

Author(s):

Milan Flach ◽

Fabian Gans ◽

Alexander Brenning ◽

Joachim Denzler ◽

Markus Reichstein ◽

...

Keyword(s):

Feature Extraction ◽

Anomaly Detection ◽

Data Streams ◽

Multivariate Data ◽

Detection Methods ◽

Earth System ◽

Earth System Science ◽

System Science ◽

Detection Algorithms ◽

Earth Observations

Today, many processes at the Earth's surface are constantly monitored by multiple data streams. These observations have become central to advancing our understanding of e.g. vegetation dynamics in response to climate or land use change. Another set of important applications is monitoring effects of climatic extreme events, other disturbances such as fires, or abrupt land transitions. One important methodological question is how to reliably detect anomalies in an automated and generic way within multivariate data streams, which typically vary seasonally and are interconnected across variables. Although many algorithms have been proposed for detecting anomalies in multivariate data, only a few have been investigated in the context of Earth system science applications. In this study, we systematically combine and compare feature extraction and anomaly detection algorithms for detecting anomalous events. Our aim is to identify suitable workflows for automatically detecting anomalous patterns in multivariate Earth system data streams. We rely on artificial data that mimic typical properties and anomalies in multivariate spatiotemporal Earth observations. This artificial experiment is needed as there is no 'gold standard' for the identification of anomalies in real Earth observations. Our results show that a well-chosen feature extraction step (e.g. subtracting seasonal cycles, or dimensionality reduction) is more important than the choice of a particular anomaly detection algorithm. Nevertheless, we identify 3 detection algorithms (k-nearest neighbours mean distance, kernel density estimation, a recurrence approach) and their combinations (ensembles) that outperform other multivariate approaches as well as univariate extreme event detection methods. Our results therefore provide an effective workflow to automatically detect anomalies in Earth system science data.
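
The feature extraction step the abstract highlights, subtracting seasonal cycles, can be illustrated with a short sketch. It assumes a univariate series with a known, fixed period and estimates the cycle as the mean value at each phase; the function name and this particular estimator are illustrative assumptions, not the study's exact procedure.

```python
def subtract_seasonal_cycle(series, period):
    """Remove the mean seasonal cycle: for each position within the period,
    subtract the average of all values observed at that position."""
    if period < 1:
        raise ValueError("period must be a positive integer")
    sums = [0.0] * period
    counts = [0] * period
    for i, value in enumerate(series):
        sums[i % period] += value
        counts[i % period] += 1
    # Phases with no observations (series shorter than period) get mean 0.
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [value - means[i % period] for i, value in enumerate(series)]


# Made-up series with period 3; the last value breaks the repeating pattern.
residuals = subtract_seasonal_cycle([1, 2, 3, 1, 2, 3, 1, 2, 9], period=3)
print(residuals)  # [0.0, 0.0, -2.0, 0.0, 0.0, -2.0, 0.0, 0.0, 4.0]
```

After this step the unusual value stands out as the largest residual, which is what a downstream detector such as a k-nearest-neighbours mean distance would then pick up.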

Unsupervised Feature Selection for Multi-cluster Data via Smooth Distributed Score

Communications in Computer and Information Science - Emerging Intelligent Computing Technology and Applications ◽

10.1007/978-3-642-31837-5_11 ◽

2012 ◽

pp. 74-79 ◽

Cited By ~ 1

Author(s):

Furui Liu ◽

Xiyan Liu

Keyword(s):

Feature Selection ◽

Unsupervised Feature Selection ◽

Cluster Data ◽

Selection For

Outlier Detection Methods for Uncovering of Critical Events in Historical Phasor Measurement Records

E3S Web of Conferences ◽

10.1051/e3sconf/20186408006 ◽

2018 ◽

Vol 64 ◽

pp. 08006 ◽

Cited By ~ 1

Author(s):

Kummerow André ◽

Nicolai Steffen ◽

Bretschneider Peter

Keyword(s):

Power Systems ◽

Outlier Detection ◽

Training Data ◽

Detection Methods ◽

Data Sets ◽

Critical Events ◽

Failure Patterns ◽

Detection Algorithms ◽

Reduction Techniques ◽

Dimension Reduction Techniques

The scope of this survey is the uncovering of potential critical events from mixed PMU data sets. An unsupervised procedure is introduced with the use of different outlier detection methods. For that, different techniques for signal analysis are used to generate features in the time and frequency domains, together with linear and non-linear dimension reduction techniques. That approach enables the exploration of critical grid dynamics in power systems without prior knowledge about existing failure patterns. Furthermore, new failure patterns can be extracted for the creation of training data sets used for online detection algorithms.
