Exploring high-dimensional data space: Identifying optimal process conditions in photovoltaics

Author(s):  
Changwon Suh ◽  
David Biagioni ◽  
Stephen Glynn ◽  
John Scharf ◽  
Miguel A. Contreras ◽  
...  


2003 ◽ 
Vol 2 (4) ◽  
pp. 232-246 ◽  
Author(s):  
Diansheng Guo

Unknown (and unexpected) multivariate patterns lurking in high-dimensional datasets are often very hard to find. This paper describes a human-centered exploration environment that incorporates a coordinated suite of computational and visualization methods for exploring high-dimensional data and uncovering patterns in multivariate spaces. Specifically, it includes: (1) an interactive feature selection method for identifying potentially interesting, multidimensional subspaces within a high-dimensional data space; (2) an interactive, hierarchical clustering method for searching for multivariate clusters of arbitrary shape; and (3) a suite of coordinated visualization and computational components, centered on the above two methods, that facilitates human-led exploration. The implemented system is used to analyze a cancer dataset and is shown to be efficient and effective for discovering unknown and unexpected multivariate patterns in high-dimensional data.
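To make the two computational steps concrete, here is a minimal sketch (not Guo's interactive system, which is visual and human-driven): candidate 2-D subspaces are ranked by a simple "interestingness" score, and the highest-scoring subspace is then hierarchically clustered. The rank-correlation score, single linkage, and all parameters are illustrative assumptions, not details from the paper.

```python
# Sketch only: subspace scoring + hierarchical clustering (assumed details).
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.random((200, 8))
X[:, 3] = 2 * X[:, 1] + rng.normal(0, 0.05, 200)  # plant a dependent pair

# (1) feature selection: rank all 2-D subspaces by |Spearman correlation|
scores = {(i, j): abs(spearmanr(X[:, i], X[:, j])[0])
          for i, j in combinations(range(X.shape[1]), 2)}
best = max(scores, key=scores.get)
print("most interesting 2-D subspace:", best)

# (2) hierarchical clustering in the selected subspace; single linkage
# can follow clusters of arbitrary shape
Z = linkage(X[:, list(best)], method="single")
labels = fcluster(Z, t=4, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```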


Author(s):  
Iain M. Johnstone ◽  
D. Michael Titterington

Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model; we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed, along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data, and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.
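One standard instance of the sparse linear model the overview describes is lasso regression; the sketch below (a textbook example, not a method taken from the Theme Issue itself) recovers a sparse parameter vector with far more variables than observations. All dimensions and settings are illustrative.

```python
# "Large p, small n" sparse recovery: n = 50 observations, p = 500
# variables, only 5 nonzero coefficients in the true parameter vector.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 50, 500, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 3.0                          # sparse truth
y = X @ beta + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1).fit(X, y)      # l1 penalty induces sparsity
support = np.flatnonzero(model.coef_)
print("estimated support:", support)    # ideally close to {0, ..., 4}
```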


2020 ◽  
Vol 9 (1) ◽  
pp. 45-63
Author(s):  
Sumana B.V. ◽  
Punithavalli M.

Researchers working on real-world classification data have identified that the combination of class overlap, class imbalance, and high-dimensional data is a crucial problem and an important factor in degrading classifier performance; hence, it has received significant attention in recent years. Misclassification often occurs in the overlapped region, as there is no clear distinction between the class boundaries, and the presence of high-dimensional data in imbalanced proportions poses an additional challenge. Only a few studies have attempted to address all of these issues simultaneously. A model is therefore proposed which first divides the data space into overlapped and non-overlapped regions using the K-means algorithm, then lets the classifier learn from the two regions separately, and finally combines the results. The experiment is conducted on the Heart dataset from the KEEL repository, and the results show that the proposed model improves classifier performance as measured by accuracy, kappa, precision, recall, F-measure, FNR, FPR, and time.
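The sketch below illustrates one plausible reading of this pipeline (my interpretation of the abstract, not the authors' exact procedure): K-means partitions the space, clusters that mix both classes are treated as the "overlapped" region, and a separate classifier is trained on each region. The mixing thresholds, cluster count, and choice of base classifier are all assumptions.

```python
# Sketch: split data space into overlapped / non-overlapped regions,
# train one classifier per region, then combine the predictions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_tr)
# a cluster is "overlapped" if it holds a non-trivial mix of both classes
mixed = [c for c in range(4) if 0.1 < y_tr[km.labels_ == c].mean() < 0.9]
tr_over = np.isin(km.labels_, mixed)
te_over = np.isin(km.predict(X_te), mixed)

if tr_over.any() and (~tr_over).any():
    clf_o = RandomForestClassifier(random_state=0).fit(X_tr[tr_over], y_tr[tr_over])
    clf_c = RandomForestClassifier(random_state=0).fit(X_tr[~tr_over], y_tr[~tr_over])
    pred = np.where(te_over, clf_o.predict(X_te), clf_c.predict(X_te))
else:  # degenerate split: fall back to a single classifier
    pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
```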


2017 ◽  
Author(s):  
Avimanyou Kumar Vatsa

Recently emerging approaches to high-throughput phenotyping have become important tools in unraveling the biological basis of agronomically and medically important phenotypes. These experiments produce very large sets of either low- or high-dimensional data. Finding clusters in the entire space of high-dimensional data (HDD) is a challenging task, because the relative distances between any two objects converge to zero with increasing dimensionality. Additionally, real data may not be mathematically well behaved. Finally, many clusters are expected on biological grounds to be "natural" -- that is, to have irregular, overlapping boundaries in different subsets of the dimensions. More precisely, the natural clusters of the data may differ in shape, size, density, and dimensionality, and they might not be disjoint. In principle, such data could be clustered by dimension-reduction methods. However, these methods convert many dimensions into a smaller set of dimensions, which makes the clustering results difficult to interpret and may also lead to a significant loss of information. Another possible approach is to find subspaces (subsets of dimensions) within the entire data space of the HDD. However, existing subspace methods do not discover natural clusters. Therefore, in this dissertation I propose a novel data-preprocessing method, demonstrating that a group of phenotypes is interdependent, and a novel density-based subspace clustering algorithm for high-dimensional data, called Dynamic Locally Density Adaptive Scalable Subspace Clustering (DynaDASC). This algorithm is locally density adaptive, scalable, dynamic, and nonmetric in nature, and it discovers natural clusters.
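The distance-concentration effect the abstract opens with is easy to verify numerically. The small demonstration below (mine, not from the dissertation) shows that as the dimension d grows, the relative gap between a point's nearest and farthest neighbors shrinks, which is precisely what makes full-space distance-based clustering unreliable.

```python
# Illustration: relative contrast between nearest and farthest neighbor
# distances collapses as dimensionality increases.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000, 10000):
    X = rng.random((500, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>5}: relative contrast = {contrast:.3f}")
```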


Author(s):  
Tsvetan Asamov ◽  
Adi Ben-Israel

In general, the clustering problem is NP-hard, and global optimality cannot be established for non-trivial instances. For high-dimensional data, distance-based methods for clustering or classification face an additional difficulty: the unreliability of distances in very high-dimensional spaces. We propose a probabilistic, distance-based, iterative method for clustering data in very high-dimensional space, using the ℓ₁ metric, which is less sensitive to high dimensionality than the Euclidean distance. For K clusters in ℝⁿ, the problem decomposes into K problems coupled by probabilities, and an iteration reduces to finding Kn weighted medians of points on a line. The complexity of the algorithm is linear in the dimension of the data space, and its performance was observed to improve significantly as the dimension increases.
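The following is a minimal sketch in the spirit of the method (the soft-assignment rule and temperature beta are my assumptions, not the authors' exact formulas): probabilistic K-medians under the ℓ₁ metric, where each iteration solves K·n one-dimensional weighted-median problems, one per cluster and per coordinate. Note that the update loop touches each coordinate once per cluster, matching the linear-in-dimension complexity the abstract claims.

```python
# Sketch: probabilistic l1 K-medians via coordinate-wise weighted medians.
import numpy as np

def weighted_median(x, w):
    """Weighted median of points x on a line, with nonnegative weights w."""
    order = np.argsort(x)
    cw = np.cumsum(w[order])
    return x[order][np.searchsorted(cw, 0.5 * cw[-1])]

def prob_l1_kmedians(X, K, iters=30, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(iters):
        # l1 distances to each center and soft (probabilistic) memberships
        D = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)  # (m, K)
        P = np.exp(-beta * (D - D.min(axis=1, keepdims=True)))
        P /= P.sum(axis=1, keepdims=True)
        # update: one weighted median per cluster and per coordinate
        for k in range(K):
            for j in range(X.shape[1]):
                centers[k, j] = weighted_median(X[:, j], P[:, k])
    return centers, P

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (100, 50)) for c in (-2.0, 0.0, 2.0)])
centers, P = prob_l1_kmedians(X, K=3)
print("cluster sizes:", np.bincount(P.argmax(axis=1), minlength=3))
```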

