A SURVEY ON THE CURES FOR THE CURSE OF DIMENSIONALITY IN BIG DATA

2017 ◽ Vol 10 (13) ◽ pp. 355
Author(s): Reshma Remesh, Pattabiraman V.

Dimensionality reduction techniques are used to reduce the complexity of analysing high-dimensional data sets. The raw input data set may have many dimensions, and analysis may consume time and lead to wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can therefore reduce the dimensions of the input data towards accurate prediction at lower cost. In this paper, the different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis, and artificial neural networks, have been studied.
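As a rough illustration only (the paper provides no code), the sketch below applies several of the surveyed techniques with scikit-learn on the toy digits dataset; the dataset and component counts are arbitrary assumptions, and the artificial neural network variant (e.g. an autoencoder) is omitted.

```python
# Illustrative sketch (not from the paper): several surveyed reduction techniques.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)  # 64-dimensional images of handwritten digits

reducers = {
    "PCA": PCA(n_components=10),
    "SVD": TruncatedSVD(n_components=10),               # truncated SVD of the data matrix
    "LDA": LinearDiscriminantAnalysis(n_components=9),  # supervised; at most n_classes - 1
    "Kernel PCA": KernelPCA(n_components=10, kernel="rbf"),
}

for name, reducer in reducers.items():
    # LDA needs the class labels; the unsupervised methods ignore them.
    Z = reducer.fit_transform(X, y) if name == "LDA" else reducer.fit_transform(X)
    print(f"{name}: {X.shape[1]} -> {Z.shape[1]} dimensions")
```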

2019 ◽ Vol 8 (S3) ◽ pp. 66-71
Author(s): T. Sudha, P. Nagendra Kumar

Data mining is one of the major areas of research, and clustering is one of its main functionalities. High dimensionality is one of the main issues in clustering, and dimensionality reduction can be used as a solution to this problem. The present work makes a comparative study of dimensionality reduction techniques, namely t-distributed stochastic neighbour embedding (t-SNE) and probabilistic principal component analysis (PPCA), in the context of clustering. High-dimensional data have been reduced to low-dimensional data using these techniques, and cluster analysis has been performed on the high-dimensional data as well as on the low-dimensional data sets obtained through t-SNE and PPCA, with a varying number of clusters. Mean squared error, time, and space have been considered as parameters for comparison. The results obtained show that the time taken to convert the high-dimensional data into low-dimensional data using PPCA is higher than the time taken using t-SNE, while the storage space required by the data set reduced through PPCA is less than that required by the data set reduced through t-SNE.
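A minimal sketch of this kind of comparison is given below, not the authors' code: scikit-learn has no dedicated PPCA estimator, so ordinary PCA stands in for it, the adjusted Rand index replaces the paper's mean-squared-error criterion, and the digits dataset and cluster count are assumptions.

```python
# Sketch: compare two reducers on runtime, memory of the embedding, and clustering quality.
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)

results = {}
for name, reducer in {"PCA (PPCA stand-in)": PCA(n_components=2),
                      "t-SNE": TSNE(n_components=2, random_state=0)}.items():
    start = time.perf_counter()
    Z = reducer.fit_transform(X)                 # reduce to 2 dimensions
    elapsed = time.perf_counter() - start
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)
    results[name] = (elapsed, Z.nbytes, adjusted_rand_score(y, labels))

for name, (t, size, ari) in results.items():
    print(f"{name}: {t:.2f}s, embedding size {size} bytes, ARI={ari:.3f}")
```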


2021 ◽ Vol 22 (1)
Author(s): Van Hoan Do, Stefan Canzar

Emerging single-cell technologies profile multiple types of molecules within individual cells. A fundamental step in the analysis of the produced high-dimensional data is their visualization using dimensionality reduction techniques such as t-SNE and UMAP. We introduce j-SNE and j-UMAP as their natural generalizations to the joint visualization of multimodal omics data. Our approach automatically learns the relative contribution of each modality to a concise representation of cellular identity that promotes discriminative features but suppresses noise. On eight datasets, j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.
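The snippet below is not j-SNE or j-UMAP; it is only a crude baseline for joint visualization of two modalities under made-up data: z-score each modality, concatenate with fixed equal weights, and run an ordinary t-SNE. The paper's methods instead learn the per-modality weights automatically.

```python
# Naive joint embedding of two modalities (baseline, not the authors' method).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
rna = rng.normal(size=(500, 2000))      # hypothetical RNA expression matrix (cells x genes)
protein = rng.normal(size=(500, 30))    # hypothetical protein panel (cells x markers)

joint = np.hstack([
    0.5 * StandardScaler().fit_transform(rna),       # fixed, equal modality weights
    0.5 * StandardScaler().fit_transform(protein),
])
embedding = TSNE(n_components=2, random_state=0).fit_transform(joint)
print(embedding.shape)  # (500, 2)
```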


Author(s): Andrew J. Connolly, Jacob T. VanderPlas, Alexander Gray, ...

With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the complexity as well as the size of the data set. This chapter deals with how we can learn which measurements, properties, or combinations thereof carry the most information within a data set. It describes techniques related to concepts discussed earlier in connection with Gaussian distributions, density estimation, and information content. The chapter begins with an exploration of the problems posed by high-dimensional data. It then describes the data sets used in this chapter and introduces perhaps the most important and widely used dimensionality reduction technique, principal component analysis (PCA). The remainder of the chapter discusses several alternative techniques which address some of the weaknesses of PCA.
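For readers unfamiliar with PCA, a minimal sketch of the core idea introduced in such a chapter follows: centre the data, take the singular value decomposition, and inspect how much variance each component explains. The data here are synthetic, not the astronomical datasets used in the text.

```python
# PCA via SVD on synthetic correlated data (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 50))  # correlated features

Xc = X - X.mean(axis=0)                  # centre each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of variance per component

k = np.searchsorted(np.cumsum(explained), 0.95) + 1
print(f"{k} components explain 95% of the variance")
Z = Xc @ Vt[:k].T                        # project onto the first k components
```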


2020 ◽ Vol 12 (22) ◽ pp. 3703
Author(s): Adrian Doicu, Dmitry S. Efremenko, Thomas Trautmann

A spectral acceleration approach for the spherical harmonics discrete ordinate method (SHDOM) is designed. This approach combines the correlated k-distribution method with dimensionality reduction techniques applied to the optical parameters of an atmospheric system. The dimensionality reduction techniques used in this study are the linear embedding methods: principal component analysis (PCA), locality pursuit embedding, locality preserving projection, and locally embedded analysis. Through a numerical analysis, it is shown that, relative to the correlated k-distribution method, PCA in conjunction with a second-order-of-scattering approximation yields an acceleration factor of 12. This implies that SHDOM equipped with this acceleration approach is efficient enough to perform spectral integration of radiance fields in inhomogeneous multi-dimensional media.
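The following is only a schematic of the underlying idea, not the SHDOM implementation: compress spectrally resolved optical parameters with PCA so that the expensive radiative transfer computation only needs to be run for a few principal components. The layer count, component count, and log-space transform are assumptions for illustration.

```python
# PCA compression of hypothetical spectral optical parameters.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_wavelengths, n_layers = 3000, 60
# Hypothetical optical-depth spectra for each atmospheric layer.
optical_depth = np.exp(rng.normal(size=(n_wavelengths, n_layers)))

pca = PCA(n_components=4)
scores = pca.fit_transform(np.log(optical_depth))   # work in log space
reconstruction = np.exp(pca.inverse_transform(scores))

rel_err = np.abs(reconstruction - optical_depth) / optical_depth
print(f"mean relative reconstruction error: {rel_err.mean():.3e}")
```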


Author(s): Shofiqul Islam, Sonia Anand, Jemila Hamid, Lehana Thabane, Joseph Beyene

Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods for data integration and death classification. We also compare these methods using a real data set with gene and miRNA expression of lung cancer patients. The first few kernel principal components show poor performance compared to the linear principal components on this occasion. Reducing dimensions using linear PCA and a logistic regression model for classification seems to be adequate for this purpose. Integrating information from multiple data sets using either of these two approaches leads to improved classification accuracy for the outcome.
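A sketch of the kind of comparison described above is shown next, on synthetic data rather than the copula-simulated or lung-cancer expression data used in the paper; the sample size, feature counts, and kernel choice are assumptions.

```python
# Linear PCA vs kernel PCA feeding a logistic regression classifier (illustrative).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=500, n_informative=20,
                           random_state=0)

for name, reducer in {"linear PCA": PCA(n_components=10),
                      "kernel PCA": KernelPCA(n_components=10, kernel="rbf")}.items():
    clf = make_pipeline(StandardScaler(), reducer, LogisticRegression(max_iter=1000))
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name} + logistic regression: CV accuracy = {acc:.3f}")
```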


2021 ◽ Vol 54 (4) ◽ pp. 1-34
Author(s): Felipe L. Gewers, Gustavo R. Ferreira, Henrique F. De Arruda, Filipi N. Silva, Cesar H. Comin, ...

Principal component analysis (PCA) is often applied for analyzing data in the most diverse areas. This work reports, in an accessible and integrated manner, several theoretical and practical aspects of PCA. The basic principles underlying PCA, data standardization, possible visualizations of the PCA results, and outlier detection are subsequently addressed. Next, the potential of using PCA for dimensionality reduction is illustrated on several real-world datasets. Finally, we summarize PCA-related approaches and other dimensionality reduction techniques. All in all, the objective of this work is to assist researchers from the most diverse areas in using and interpreting PCA.
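A minimal example of the workflow such a survey walks through is given below: standardize the variables, fit PCA, and plot the first two principal components. The Iris data are used purely for illustration and matplotlib is assumed to be available.

```python
# Standardize, project with PCA, and visualize the first two components.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(Z[:, 0], Z[:, 1], c=y, s=15)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Standardized data projected onto the first two principal components")
plt.show()
```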


2019 ◽ pp. 85-98
Author(s): Ana del Águila, Dmitry S. Efremenko, Thomas Trautmann

Hyper-spectral sensors take measurements in narrow contiguous bands across the electromagnetic spectrum. Usually, the goal is to detect a certain object or a component of the medium with unique spectral signatures. In particular, hyper-spectral measurements are used in atmospheric remote sensing to detect trace gases. To improve the efficiency of hyper-spectral processing algorithms, data reduction methods are applied. This paper outlines dimensionality reduction techniques in the context of hyper-spectral remote sensing of the atmosphere. Dimensionality reduction excludes redundant information from the data and is currently an integral part of high-performance radiative transfer models. In this survey, it is shown how principal component analysis can be applied to spectral radiance modelling and the retrieval of atmospheric constituents, thereby speeding up data processing by orders of magnitude. The discussed techniques are generic and can be readily applied to solving atmospheric as well as material science problems.
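As an illustration of the data-reduction step described above, the sketch below projects simulated hyper-spectral radiances onto a handful of principal components and checks how little information is lost. The spectra, channel counts, and component count are synthetic placeholders, not the instruments or models discussed in the paper.

```python
# Compress synthetic hyper-spectral radiances with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n_spectra, n_channels = 2000, 4000          # e.g. many scenes on a fine spectral grid
basis = rng.normal(size=(8, n_channels))    # a few underlying spectral patterns
radiances = rng.normal(size=(n_spectra, 8)) @ basis \
            + 0.01 * rng.normal(size=(n_spectra, n_channels))  # small noise

pca = PCA(n_components=8).fit(radiances)
print(f"variance captured: {pca.explained_variance_ratio_.sum():.4f}")
print(f"compression: {n_channels} channels -> {pca.n_components_} scores per spectrum")
```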


2009 ◽ Vol 6 (2) ◽ pp. 217-227
Author(s): Aswani Kumar

Domains such as text and images contain large amounts of redundancy and ambiguity among their attributes, which results in considerable noise effects (i.e., the data are high dimensional). Retrieving data from high-dimensional datasets is a big challenge. Dimensionality reduction techniques have been a successful avenue for automatically extracting the latent concepts by removing the noise and reducing the complexity of processing high-dimensional data. In this paper we conduct a systematic study comparing unsupervised dimensionality reduction techniques for the text retrieval task. We analyze these techniques with respect to complexity, approximation error, and retrieval quality, with experiments on four test document collections.
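One of the unsupervised techniques typically compared in this setting is latent semantic analysis, i.e. truncated SVD on a TF-IDF matrix; a minimal sketch follows. The tiny corpus, the query, and the two-dimensional concept space are made up for illustration and are not the paper's test collections.

```python
# Latent semantic analysis for toy document retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "principal component analysis reduces dimensionality",
    "latent semantic analysis helps text retrieval",
    "the cat sat on the mat",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                  # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)  # latent concept space
X_lsa = svd.fit_transform(X)

query = vectorizer.transform(["dimensionality reduction for retrieval"])
scores = cosine_similarity(svd.transform(query), X_lsa)[0]
print(scores.argsort()[::-1])                       # documents ranked by similarity
```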


Author(s): Damien François

In many applications, like function approximation, pattern recognition, time series prediction, and data mining, one has to build a model relating some features describing the data to some response value. Often, the features that are relevant for building the model are not known in advance. Feature selection methods allow removing irrelevant and/or redundant features to keep only the subset of features that is most useful for building a prediction model. The resulting model is simpler and easier to interpret, reducing the risks of overfitting, non-convergence, etc. By contrast with other dimensionality reduction techniques such as principal component analysis or more recent nonlinear projection techniques (Lee & Verleysen 2007), which build a new, smaller set of features, the features selected by feature selection methods preserve their initial meaning, potentially bringing extra information about the process being modeled (Guyon 2006). Recently, the advent of high-dimensional data has raised new challenges for feature selection methods, both from the algorithmic and the conceptual point of view (Liu & Motoda 2007). The problem of feature selection is exponential in nature, and many approximate algorithms are cubic with respect to the initial number of features, which may be intractable when the dimensionality of the data is large. Furthermore, high-dimensional data are often highly redundant, and two distinct subsets of features may have very similar predictive power, which can make it difficult to identify the best subset.
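A toy contrast between the two families is sketched below: feature selection keeps a subset of the original, interpretable features, whereas PCA builds new features that mix all of them. The synthetic dataset and the univariate selection criterion are assumptions for illustration only.

```python
# Feature selection (keeps original columns) vs PCA (builds new combined features).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
kept = selector.get_support(indices=True)
print("indices of selected original features:", kept)   # still interpretable

Z = PCA(n_components=10).fit_transform(X)
print("PCA output shape:", Z.shape)  # new features: linear mixtures of all 100 inputs
```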

