Multi-Instance Dimensionality Reduction via Sparsity and Orthogonality

2018 ◽  
Vol 30 (12) ◽  
pp. 3281-3308
Author(s):  
Hong Zhu ◽  
Li-Zhi Liao ◽  
Michael K. Ng

We study a multi-instance (MI) learning dimensionality-reduction algorithm through sparsity and orthogonality, which is especially useful for high-dimensional MI data sets. We develop a novel algorithm to handle both sparsity and orthogonality constraints, which existing methods do not handle well simultaneously. Our main idea is to formulate an optimization problem in which the sparse term appears in the objective function and the orthogonality requirement is imposed as a constraint. The resulting optimization problem can be solved by using approximate augmented Lagrangian iterations as the outer loop and inertial proximal alternating linearized minimization (iPALM) iterations as the inner loop. The main advantage of this method is that sparsity and orthogonality can be satisfied simultaneously. We show the global convergence of the proposed iterative algorithm and demonstrate that it can achieve the high sparsity and orthogonality requirements that are very important for dimensionality reduction. Experimental results on both synthetic and real data sets show that the proposed algorithm obtains learning performance comparable to that of other tested MI learning algorithms.
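As a hedged sketch of the kind of problem described (notation assumed here, not taken from the paper): with P the projection matrix, f a data-fidelity term for the MI embedding, λ the sparsity weight, and Λ, ρ the multiplier and penalty parameter, the formulation and its augmented Lagrangian can be written as

```latex
\min_{P \in \mathbb{R}^{d \times k}} \; f(P) + \lambda \|P\|_{1}
\quad \text{s.t.} \quad P^{\top} P = I_{k},
\qquad
\mathcal{L}_{\rho}(P, \Lambda) \;=\; f(P) + \lambda \|P\|_{1}
+ \langle \Lambda,\, P^{\top} P - I_{k} \rangle
+ \tfrac{\rho}{2}\, \| P^{\top} P - I_{k} \|_{F}^{2}.
```

In this generic scheme, the outer loop approximately minimizes the augmented Lagrangian and updates Λ and ρ, while an inner proximal method such as iPALM handles the nonsmooth ℓ1 term.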

Author(s):  
Haoxuan Yang ◽  
Kai Liu ◽  
Hua Wang ◽  
Feiping Nie

Laplacian Embedding (LE) is a powerful method to reveal the intrinsic geometry of high-dimensional data by using graphs. Imposing orthogonality and nonnegativity constraints on the LE objective has proved effective in avoiding degenerate and negative solutions, but the two constraints are challenging to satisfy simultaneously because they are nonlinear and nonconvex. In addition, recent studies have shown that using the p-th order of the L2-norm distances in LE can find the best solution for clustering and promote the robustness of the embedding model against outliers, although this makes the optimization objective nonsmooth and, in general, difficult to solve efficiently. In this work, we study LE that uses the p-th order of the L2-norm distances and satisfies both orthogonality and nonnegativity constraints. We introduce a novel smoothed iterative reweighted method to tackle this challenging optimization problem and rigorously analyze its convergence. We demonstrate the effectiveness and potential of our proposed method through extensive empirical studies on both synthetic and real data sets.
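As a rough sketch (notation assumed here, and the exact normalization in the paper may differ), the constrained objective described can be written with graph weights w_ij and embedding rows y_i as

```latex
\min_{Y \in \mathbb{R}^{n \times k}} \; \sum_{i,j} w_{ij} \, \| y_{i} - y_{j} \|_{2}^{\,p}
\quad \text{s.t.} \quad Y^{\top} Y = I_{k}, \;\; Y \ge 0,
```

where p = 2 recovers the classical LE objective, and the combination of orthogonality and nonnegativity pushes Y toward a cluster-indicator-like structure (at most one nonzero entry per row).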


2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract A fundamental question in data analysis, machine learning, and signal processing is how to compare data points. The choice of distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g., the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates into clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example in which the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. We also applied our method to real gene-expression data for lung adenocarcinoma (lung cancer). Using the proposed metric, we found a partition of the subjects into risk groups with good separation between their Kaplan–Meier survival plots.
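For reference, the sketch below computes the standard Mahalanobis distance in Python. The paper's contribution lies in how the covariance is estimated from clusters of correlated coordinates; here a plain empirical covariance on toy data stands in as a placeholder, and all names and data are illustrative assumptions.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Standard Mahalanobis distance between samples x and y, given an
    (estimated) covariance matrix `cov`. The pseudo-inverse is used so
    that rank-deficient covariance estimates do not break the computation."""
    d = x - y
    cov_inv = np.linalg.pinv(cov)
    return float(np.sqrt(d @ cov_inv @ d))

# Toy usage (hypothetical data):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 samples, 10 coordinates
cov = np.cov(X, rowvar=False)         # empirical covariance estimate
print(mahalanobis(X[0], X[1], cov))
```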


2019 ◽  
Vol 9 (18) ◽  
pp. 3801
Author(s):  
Hyuk-Yoon Kwon

In this paper, we propose a method to construct a lightweight key-value store based on Windows native features. The main idea is to provide a thin wrapper for the key-value store on top of a built-in Windows storage facility, the Windows registry. First, we define a mapping of the components of the key-value store onto the components of the Windows registry. Second, we present a hash-based multi-level registry index that distributes the key-value data in a balanced way and allows efficient access to them. Third, we implement the basic operations of the key-value store (i.e., Get, Put, and Delete) by manipulating the Windows registry through the Windows native APIs. We call the proposed key-value store WR-Store. Finally, we propose an efficient ETL (Extract-Transform-Load) method to migrate data stored in WR-Store into any other environment that supports existing key-value stores. Because the performance of the Windows registry has not been studied much, we perform an empirical study to understand the characteristics of WR-Store and then tune its performance to find the best parameter setting. Through extensive experiments using synthetic and real data sets, we show that the performance of WR-Store is comparable to or even better than that of state-of-the-art systems (i.e., RocksDB, BerkeleyDB, and LevelDB). In particular, we show the scalability of WR-Store: it becomes much more efficient than the other key-value stores as the data set size increases. In addition, we show that the performance of WR-Store is maintained even under intensive registry workloads in which 1000 processes that actively access the registry run concurrently.
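The general idea of wrapping Get/Put/Delete around the registry with a hash-based bucket index can be illustrated with Python's standard winreg module (Windows only). This is a minimal sketch under assumed names (the WRStoreDemo base key and a 16-bucket index are inventions for illustration); WR-Store itself uses the native Windows APIs and its actual key layout is not described in the abstract.

```python
import hashlib
import winreg  # Windows-only standard library module

ROOT = winreg.HKEY_CURRENT_USER
BASE = r"Software\WRStoreDemo"   # hypothetical base key, not WR-Store's real layout
NUM_BUCKETS = 16                 # illustrative hash fan-out

def _bucket(key: str) -> str:
    # Hash-based index: spread keys across a fixed number of sub-keys (buckets),
    # using a stable hash so the same key always maps to the same bucket.
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
    return BASE + "\\" + f"{h:02d}"

def put(key: str, value: str) -> None:
    # Store the pair as a named registry value under its bucket sub-key.
    with winreg.CreateKey(ROOT, _bucket(key)) as h:
        winreg.SetValueEx(h, key, 0, winreg.REG_SZ, value)

def get(key: str) -> str:
    with winreg.OpenKey(ROOT, _bucket(key)) as h:
        value, _type = winreg.QueryValueEx(h, key)
        return value

def delete(key: str) -> None:
    with winreg.OpenKey(ROOT, _bucket(key), 0, winreg.KEY_SET_VALUE) as h:
        winreg.DeleteValue(h, key)

# Example usage (Windows only):
# put("user:42", "alice"); print(get("user:42")); delete("user:42")
```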


2013 ◽  
Vol 336-338 ◽  
pp. 2242-2247
Author(s):  
Guang Hui Yan ◽  
Yong Chen ◽  
Hong Yun Zhao ◽  
Ya Jin Ren ◽  
Zhi Cheng Ma

Cluster evolution tracking and dimensionality reduction have been studied intensively but separately for time-decayed, high-dimensional stream data over the past decades. However, cluster evolution and dimensionality reduction commonly interact in time-decayed stream data. Therefore, dimensionality reduction should interact with cluster operations throughout the endless life cycle of stream data. In this paper, we first investigate the interaction between dimensionality reduction and cluster evolution in high-dimensional, time-decayed stream data. Then, we integrate an on-line sequential forward fractal dimensionality-reduction technique with a self-adaptive, multi-fractal-based technique for cluster evolution tracking. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness and efficiency of our approach.
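Fractal dimensionality-reduction techniques typically rely on estimates of a data set's intrinsic (fractal) dimension, such as the correlation dimension. The sketch below shows a standard batch correlation-dimension estimate in Python; the function name, data, and radius grid are illustrative assumptions, and this is not the authors' on-line sequential forward algorithm.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    """Estimate the correlation (fractal) dimension of X as the slope of
    log C(r) versus log r, where C(r) is the fraction of point pairs closer
    than r. The radii should be chosen so that every radius captures at
    least some pairs (otherwise log(0) appears)."""
    d = pdist(X)                                   # all pairwise Euclidean distances
    logC = np.log([np.mean(d < r) for r in radii]) # log correlation sums
    slope, _ = np.polyfit(np.log(radii), logC, 1)  # slope = dimension estimate
    return slope

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                      # toy 3-D data set
print(correlation_dimension(X, radii=np.logspace(-0.5, 0.5, 10)))
```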


2017 ◽  
Vol 10 (13) ◽  
pp. 355
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analyzing high-dimensional data sets. The raw input data set may have many dimensions, and the analysis may waste time and yield wrong predictions if unnecessary attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data and obtain accurate predictions at lower cost. In this paper, different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis, and artificial neural networks, are studied.
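As a minimal illustration of some of the surveyed techniques, the sketch below applies PCA, kernel PCA, and LDA from scikit-learn to hypothetical data (SVD underlies scikit-learn's PCA internally; the neural-network approach is omitted). The data shapes and parameters are assumptions for demonstration only.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))       # hypothetical high-dimensional data
y = rng.integers(0, 3, size=300)     # class labels (needed only for supervised LDA)

X_pca = PCA(n_components=5).fit_transform(X)                        # linear PCA
X_kpca = KernelPCA(n_components=5, kernel="rbf").fit_transform(X)   # kernel PCA
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # LDA (≤ classes-1 dims)

print(X_pca.shape, X_kpca.shape, X_lda.shape)   # (300, 5) (300, 5) (300, 2)
```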


2013 ◽  
Vol 444-445 ◽  
pp. 604-609
Author(s):  
Guang Hui Fu ◽  
Pan Wang

LASSO is a very useful variable selection method for high-dimensional data, but it possesses neither the oracle property [Fan and Li, 2001] nor the group effect [Zou and Hastie, 2005]. In this paper, we first review four improved LASSO-type methods that satisfy the oracle property and/or the group effect, and then propose two new ones, called WFEN and WFAEN. Performance on both simulated and real data sets shows that WFEN and WFAEN are competitive with other LASSO-type methods.
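For context, the sketch below fits the baseline LASSO and elastic net (the latter adds an L2 penalty and thereby exhibits a grouping effect) with scikit-learn on simulated high-dimensional data. WFEN and WFAEN are not available in standard libraries, so this only illustrates the methods they build on; data sizes and penalty values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 200                            # high-dimensional setting: p > n
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                             # only 5 truly relevant variables
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X, y)  # L1 + L2 penalty

print("LASSO selected:", np.flatnonzero(lasso.coef_).size)
print("Elastic net selected:", np.flatnonzero(enet.coef_).size)
```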


2011 ◽  
Vol 58-60 ◽  
pp. 547-550
Author(s):  
Di Wu ◽  
Zhao Zheng

In the real world, high-dimensional data are everywhere, but the natural structure behind them is often characterized by only a few parameters. With the rapid development of computer vision, more and more problems involve dimensionality reduction, which has driven the rapid development of dimensionality reduction algorithms. These include linear methods such as LPP [1] and NPE [2], and nonlinear methods such as LLE [3] and an improved kernel version of NPE. One particularly simple but effective assumption in face recognition is that samples from the same class lie on a linear subspace, so many nonlinear methods only perform well on artificial data sets. This paper focuses on NPE and the recently proposed SPP [4] and combines these methods; the experiments show that the new method outperforms some classic unsupervised methods.
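NPE and SPP are not shipped with common Python libraries, but the closely related LLE (of which NPE is the linear approximation) is. The sketch below embeds the classic artificial swiss-roll data set with scikit-learn; the sample size and neighbor count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # classic artificial manifold
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_2d = lle.fit_transform(X)                              # nonlinear 2-D embedding
print(X_2d.shape)   # (1000, 2)
```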


2015 ◽  
Vol 2015 ◽  
pp. 1-10
Author(s):  
Jan Kalina ◽  
Anna Schlenker

The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we perform the computations also for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.
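For reference, the sketch below implements the generic greedy MRMR selection principle with plain Pearson correlations as the relevance and redundancy measures. The proposed MRRMRR replaces these with a regularized coefficient of multiple correlation and a robust correlation based on least weighted squares, which are not reproduced here; all names and data are illustrative assumptions.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy MRMR-style selection: at each step pick the variable with the
    largest (relevance - mean redundancy), with both measured by absolute
    Pearson correlation. Returns the indices of the k selected variables."""
    n, p = X.shape
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        scores = []
        for j in range(p):
            if j in selected:
                scores.append(-np.inf)
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            scores.append(relevance[j] - redundancy)
        selected.append(int(np.argmax(scores)))
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # two informative variables
print(mrmr_select(X, y, k=5))
```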


2021 ◽  
Author(s):  
Kehinde Olobatuyi

Abstract As with many machine learning models, both the accuracy and the speed of cluster weighted models (CWMs) can be hampered by high-dimensional data, which has motivated previous work on parsimonious techniques to reduce the effect of the "curse of dimensionality" on mixture models. In this work, we review the background of CWMs. We further show that parsimonious techniques alone are not sufficient for mixture models to thrive on large, high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of the location parameters using the default values in the "FlexCWM" R package. We introduce the dimensionality reduction technique t-distributed stochastic neighbor embedding (TSNE) to enhance parsimonious CWMs in high-dimensional space. CWMs were originally designed for regression; for classification purposes, all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation-maximization algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.
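As a minimal sketch of the preprocessing step described (the CWM fitting itself is done in the FlexCWM R package and is not reproduced here; data shapes and parameters below are assumptions), t-SNE can be applied with scikit-learn as follows.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))    # hypothetical high-dimensional inputs

# Embed into 2 dimensions; the low-dimensional coordinates would then serve
# as covariates for the downstream (parsimonious) CWM.
X_low = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_low.shape)   # (500, 2)
```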

