scholarly journals Statistical challenges of high-dimensional data

Author(s):  
Iain M. Johnstone ◽  
D. Michael Titterington

Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.

2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has acquired a realistic response from data mining scientists as a graph of its reputation has increased smoothly in various practical domains like product marketing, fraud detection, medical diagnosis, fault detection and so many other fields. High dimensional data subjected to outlier detection poses exceptional challenges for data mining experts and it is because of natural problems of the curse of dimensionality and resemblance of distant and adjoining points. Traditional algorithms and techniques were experimented on full feature space regarding outlier detection. Customary methodologies concentrate largely on low dimensional data and hence show ineffectiveness while discovering anomalies in a data set comprised of a high number of dimensions. It becomes a very difficult and tiresome job to dig out anomalies present in high dimensional data set when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of its intrinsic feature i.e., the distance between observations approaches to zero as the number of dimensions extends towards infinity. This research work proposes a novel technique that explores deviation among all data points and embeds its findings inside well established density-based techniques. This is a state of art technique as it gives a new breadth of research towards resolving inherent problems of high dimensional data where outliers reside within clusters having different densities. A high dimensional dataset from UCI Machine Learning Repository is chosen to test the proposed technique and then its results are compared with that of density-based techniques to evaluate its efficiency.


2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in the data processing of machine learning and data mining, which makes the processing of high-dimensional data more efficient. Dimensionality reduction can extract the low-dimensional feature representation of high-dimensional data, and an effective dimensionality reduction method can not only extract most of the useful information of the original data, but also realize the function of removing useless noise. The dimensionality reduction methods can be applied to all types of data, especially image data. Although the supervised learning method has achieved good results in the application of dimensionality reduction, its performance depends on the number of labeled training samples. With the growing of information from internet, marking the data requires more resources and is more difficult. Therefore, using unsupervised learning to learn the feature of data has extremely important research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied in the text data, so that the high-dimensional feature to the low-dimensional feature becomes efficient and the low-dimensional feature can retain mainly information as much as possible. Low-dimensional feature obtained by different dimensionality reduction methods are used to compare with the dimensionality reduction results of variational auto-encoder (VAE), and the method can be significantly improved over other comparison methods.


Author(s):  
Bharat Gupta ◽  
Durga Toshniwal

In high dimensional data large no of outliers are embedded in low dimensional subspaces known as projected outliers, but most of existing outlier detection techniques are unable to find these projected outliers, because these methods perform detection of abnormal patterns in full data space. So, outlier detection in high dimensional data becomes an important research problem. In this paper we are proposing an approach for outlier detection of high dimensional data. Here we are modifying the existing SPOT approach by adding three new concepts namely Adaption of Sparse Sub-Space Template (SST), Different combination of PCS parameters and set of non outlying cells for testing data set.


2021 ◽  
Author(s):  
Klaus Eckelt ◽  
Andreas Hinterreiter ◽  
Patrick Adelberger ◽  
Conny Walchshofer ◽  
Vaishali Dhanoa ◽  
...  

In this work, we propose an interactive visual approach for the exploration of structural relationships in embeddings of high-dimensional data. These structural relationships, such as item sequences, associations of items with groups, and hierarchies between groups of items, are defining properties of many real-world datasets. Nevertheless, most existing methods for the visual exploration of embeddings treat these structures as second-class citizens or do not take them into account at all. In our proposed analysis workflow, users explore enriched scatterplots of the embedding, in which relationships between items and/or groups are visually highlighted. The original high-dimensional data for single items, groups of items, or differences between connected items and groups is accessible through additional summary visualizations. We carefully tailored these summary and difference visualizations to the various data types and semantic contexts. During their exploratory analysis, users can externalize their insights by setting up additional groups and relationships between items and/or groups, thereby creating graphs that represent visual data stories. We demonstrate the utility and potential impact of our approach by means of two use cases and multiple examples from various domains.


2013 ◽  
Vol 6 (3) ◽  
pp. 441-448 ◽  
Author(s):  
Sajid Nagi ◽  
Dhruba Kumar Bhattacharyya ◽  
Jugal K. Kalita

When clustering high dimensional data, traditional clustering methods are found to be lacking since they consider all of the dimensions of the dataset in discovering clusters whereas only some of the dimensions are relevant. This may give rise to subspaces within the dataset where clusters may be found. Using feature selection, we can remove irrelevant and redundant dimensions by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple and maybe overlapping subspaces of high dimensional data, allowing better clustering of the data points, is known as Subspace Clustering. There are two major approaches to subspace clustering based on search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start from finding low dimensional dense regions, and then use them to form clusters. Based on a survey on subspace clustering, we identify the challenges and issues involved with clustering gene expression data.


Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Clustering in high-dimensional spaces is a recurrent problem in many domains. It affects time complexity, space complexity, scalability and accuracy of clustering methods. Highdimensional non-linear datausually live in different low dimensional subspaces hidden in the original space. As high‐dimensional objects appear almost alike, new approaches for clustering are required. This research has focused on developing Mathematical models, techniques and clustering algorithms specifically for high‐dimensional data. The innocent growth in the fields of communication and technology, there is tremendous growth in high dimensional data spaces. As the variant of dimensions on high dimensional non-linear data increases, many clustering techniques begin to suffer from the curse of dimensionality, de-grading the quality of the results. In high dimensional non-linear data, the data becomes very sparse and distance measures become increasingly meaningless. The principal challenge for clustering high dimensional data is to overcome the “curse of dimensionality”. This research work concentrates on devising an enhanced algorithm for clustering high dimensional non-linear data.


Sign in / Sign up

Export Citation Format

Share Document