Dimensionality reduction and data integration for scRNA-seq data based on integrative hierarchical Poisson factorisation

2021 ◽  
Author(s):  
Thomas Wong ◽  
Mauricio Barahona

Single-cell RNA sequencing (scRNA-seq) data sets consist of high-dimensional, sparse and noisy feature vectors, and pose a challenge for classic methods for dimensionality reduction. We show that application of Hierarchical Poisson Factorisation (HPF) to scRNA-seq data produces robust factors, and outperforms other popular methods. To account for batch variability in composite data sets, we introduce Integrative Hierarchical Poisson Factorisation (IHPF), an extension of HPF that makes use of a noise ratio hyper-parameter to tune the variability attributed to technical (batches) vs. biological (cell phenotypes) sources. We exemplify the advantageous application of IHPF under data integration scenarios with varying alignments of technical noise and cell diversity, and show that IHPF produces latent factors with a dual block structure in both cell and gene spaces for enhanced biological interpretability.
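
As a rough illustration of the factorisation step only (not the authors' HPF/IHPF implementation, and without the hierarchical priors or the noise-ratio hyper-parameter that separates batch from biological variability), a Poisson-style factorisation of a cells x genes count matrix can be sketched via KL-divergence NMF:

```python
# Minimal stand-in sketch: Poisson-style factorisation of a cells x genes
# count matrix using KL-divergence NMF (the MAP analogue of simple Poisson
# factorisation). HPF's hierarchical priors and IHPF's noise-ratio
# hyper-parameter are NOT reproduced here.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(300, 1000))     # toy cells x genes counts

model = NMF(n_components=10, beta_loss="kullback-leibler",
            solver="mu", max_iter=300, init="nndsvda")
cell_scores = model.fit_transform(counts)            # cells x factors
gene_loadings = model.components_                     # factors x genes

# cell_scores serves as a low-dimensional representation of the cells,
# e.g. for clustering or visualisation of cell phenotypes.
```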

2018 ◽  
Vol 30 (12) ◽  
pp. 3281-3308
Author(s):  
Hong Zhu ◽  
Li-Zhi Liao ◽  
Michael K. Ng

We study a multi-instance (MI) learning dimensionality-reduction algorithm based on sparsity and orthogonality, which is especially useful for high-dimensional MI data sets. We develop a novel algorithm that handles both sparsity and orthogonality constraints, which existing methods do not handle well simultaneously. Our main idea is to formulate an optimization problem in which the sparsity term appears in the objective function and the orthogonality term is imposed as a constraint. The resulting optimization problem can be solved by using approximate augmented Lagrangian iterations as the outer loop and inertial proximal alternating linearized minimization (iPALM) iterations as the inner loop. The main advantage of this method is that both sparsity and orthogonality can be satisfied simultaneously. We show the global convergence of the proposed iterative algorithm. We also demonstrate that the proposed algorithm can meet demanding sparsity and orthogonality requirements, which are very important for dimensionality reduction. Experimental results on both synthetic and real data sets show that the proposed algorithm can obtain learning performance comparable to that of other tested MI learning algorithms.
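
The outer/inner loop structure can be illustrated with a simplified stand-in (a generic quadratic objective with an l1 term and an orthogonality constraint; this is not the authors' iPALM solver or their MI-specific objective):

```python
# Simplified sketch of the loop structure: an approximate augmented-Lagrangian
# outer loop enforcing X^T X = I, with an inner loop of proximal-gradient steps
# in which soft-thresholding handles the l1 sparsity term.
import numpy as np

def soft_threshold(X, t):
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def sparse_orthogonal_factor(A, k, lam=0.1, rho=1.0,
                             outer_iters=15, inner_iters=50):
    """Minimise -trace(X^T A X) + lam*||X||_1 subject to X^T X = I (X is d x k)."""
    d = A.shape[0]
    X = np.linalg.qr(np.random.default_rng(0).normal(size=(d, k)))[0]
    Lam = np.zeros((k, k))                          # multipliers for X^T X - I
    for _ in range(outer_iters):
        # step size from a rough Lipschitz estimate of the smooth part
        step = 1.0 / (2 * np.linalg.norm(A, 2) +
                      2 * np.linalg.norm(Lam, 2) + 6 * rho + 1.0)
        for _ in range(inner_iters):                # inner proximal-gradient loop
            C = X.T @ X - np.eye(k)
            grad = -2 * A @ X + X @ (Lam + Lam.T) + 2 * rho * X @ C
            X = soft_threshold(X - step * grad, step * lam)
        Lam += rho * (X.T @ X - np.eye(k))          # approximate multiplier update
        rho *= 1.5
    return X

A = np.cov(np.random.default_rng(1).normal(size=(200, 30)), rowvar=False)
X = sparse_orthogonal_factor(A, k=5)
print("orthogonality residual:", np.linalg.norm(X.T @ X - np.eye(5)))
print("fraction of zero entries:", np.mean(X == 0.0))
```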


Author(s):  
M. Sulaiman Khan ◽  
Maybin Muyeba ◽  
Frans Coenen ◽  
David Reid ◽  
Hissam Tawfik

In this paper, a composite fuzzy association rule mining mechanism (CFARM), directed at identifying patterns in datasets comprised of composite attributes, is described. Composite attributes are defined as attributes that can simultaneously take two or more values that subscribe to a common schema. The objective is to generate fuzzy association rules using "properties" associated with these composite attributes. The exemplar application is the analysis of the nutrients contained in items found in grocery data sets. The paper commences with a review of the background and related work, and a formal definition of the CFARM concepts. The CFARM algorithm is then fully described and evaluated using both real and synthetic data sets.
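
The basic idea can be sketched as follows (an illustrative stand-in, not the CFARM algorithm itself; the item table, membership breakpoints and linguistic labels are made-up examples):

```python
# Sketch: composite attributes (grocery items) map to nutrient "properties";
# each transaction's nutrient totals are fuzzified into linguistic labels, and
# fuzzy support accumulates as a sum of membership degrees.
nutrients = {            # grams of protein / fat per item (hypothetical values)
    "milk":   {"protein": 3.4, "fat": 1.0},
    "cheese": {"protein": 25.0, "fat": 33.0},
    "bread":  {"protein": 9.0, "fat": 3.2},
}

def membership_high(x, lo=10.0, hi=30.0):
    """Trapezoidal membership of the 'High' label for a nutrient total."""
    return max(0.0, min(1.0, (x - lo) / (hi - lo)))

transactions = [["milk", "cheese"], ["bread", "milk"], ["cheese", "bread"]]

# fuzzy support of the pattern (protein = High AND fat = High)
support = 0.0
for basket in transactions:
    protein = sum(nutrients[i]["protein"] for i in basket)
    fat = sum(nutrients[i]["fat"] for i in basket)
    support += min(membership_high(protein), membership_high(fat))
support /= len(transactions)
print(f"fuzzy support of (protein=High, fat=High): {support:.2f}")
```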


Author(s):  
Diego Milone ◽  
Georgina Stegmayer ◽  
Matías Gerard ◽  
Laura Kamenetzky ◽  
Mariana López ◽  
...  

The volume of information derived from post-genomic technologies is rapidly increasing. Due to the amount of data involved, novel computational methods are needed for analysis of, and knowledge discovery in, the massive data sets produced by these new technologies. Furthermore, data integration is also gaining attention for merging signals from different sources in order to discover unknown relations. This chapter presents a pipeline for biological data integration and discovery of a priori unknown relationships between gene expressions and metabolite accumulations. In this pipeline, two standard clustering methods are compared against a novel neural network approach. The neural model provides a simple visualization interface for identification of coordinated pattern variations, independently of the number of clusters produced. Several quality measures have been defined for the evaluation of the clustering results obtained on a case study involving transcriptomic and metabolomic profiles from tomato fruits. Moreover, a method is proposed for the evaluation of the biological significance of the clusters found. The neural model shows high performance on most of the quality measures, with internal coherence in all the identified clusters and better visualization capabilities.
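
The integration-and-clustering step can be sketched generically as below (k-means stands in for the standard clustering baseline; the chapter's neural model and its biological-significance evaluation are not reproduced, and all array shapes are illustrative):

```python
# Sketch: transcript and metabolite profiles measured over the same samples are
# standardised, concatenated into one feature matrix, clustered, and scored
# with an internal-coherence measure.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
genes = rng.normal(size=(60, 200))        # 60 samples x 200 transcripts (toy data)
metabolites = rng.normal(size=(60, 40))   # 60 samples x 40 metabolites (toy data)

X = np.hstack([StandardScaler().fit_transform(genes),
               StandardScaler().fit_transform(metabolites)])

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print("silhouette (internal coherence):", silhouette_score(X, labels))
```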


2015 ◽  
Vol 4 (2) ◽  
pp. 336
Author(s):  
Alaa Najim

Using the idea of dimensionality reduction to visualize graph data sets can preserve the properties of the original space and reveal the underlying information shared among data points. Continuity Trustworthy Graph Embedding (CTGE) is a new method introduced in this paper to improve the faithfulness of graph visualization. We apply CTGE to graph data to find a new, understandable representation that is easier to analyze and study. Several experiments on real graph data sets are carried out to test the effectiveness and efficiency of the proposed method; they show that CTGE generates a highly faithful graph representation compared with other methods.
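
CTGE itself is not reproduced here, but the faithfulness measures that give the method its name can be sketched as follows (a classical MDS layout of graph shortest-path distances stands in for the embedding under test):

```python
# Sketch: score a graph layout with the trustworthiness and continuity measures.
import numpy as np
import networkx as nx
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

def rank_matrix(D):
    """rank_matrix[i, j] = neighbour rank of j as seen from i (0 = i itself)."""
    order = np.argsort(D, axis=1)
    ranks = np.empty_like(order)
    ranks[np.arange(D.shape[0])[:, None], order] = np.arange(D.shape[0])[None, :]
    return ranks

def trust_and_continuity(D_graph, D_layout, k=5):
    n = D_graph.shape[0]
    r_g, r_l = rank_matrix(D_graph), rank_matrix(D_layout)
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    # trustworthiness penalises pairs close in the layout but far in the graph
    t = 1 - norm * sum(r_g[i, j] - k for i in range(n) for j in range(n)
                       if 0 < r_l[i, j] <= k and r_g[i, j] > k)
    # continuity penalises pairs close in the graph but far in the layout
    c = 1 - norm * sum(r_l[i, j] - k for i in range(n) for j in range(n)
                       if 0 < r_g[i, j] <= k and r_l[i, j] > k)
    return t, c

G = nx.karate_club_graph()
D_graph = np.array(nx.floyd_warshall_numpy(G), dtype=float)   # shortest-path distances
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D_graph)
D_layout = squareform(pdist(coords))

t, c = trust_and_continuity(D_graph, D_layout, k=5)
print(f"trustworthiness={t:.3f}, continuity={c:.3f}")
```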


2018 ◽  
Vol 14 (4) ◽  
pp. 20-37 ◽  
Author(s):  
Yinglei Song ◽  
Yongzhong Li ◽  
Junfeng Qu

This article develops a new approach for supervised dimensionality reduction. This approach considers both global and local structures of a labelled data set and maximizes a new objective that includes the effects from both of them. The objective can be approximately optimized by solving an eigenvalue problem. The approach is evaluated based on a few benchmark data sets and image databases. Its performance is also compared with a few other existing approaches for dimensionality reduction. Testing results show that, on average, this new approach can achieve more accurate results for dimensionality reduction than existing approaches.
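
The article's exact objective is not reproduced here, but the general recipe of combining a global term with a local term and reducing the maximisation to an eigenvalue problem can be sketched as follows (the weight alpha, the scatter definitions and the toy data are illustrative):

```python
# Sketch: supervised dimensionality reduction that maximises a combination of
# between-class (global) scatter and local neighbourhood structure, solved as
# a symmetric eigenvalue problem.
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

X, y = load_iris(return_X_y=True)
X = X - X.mean(axis=0)
n, d = X.shape

# Global structure: between-class scatter
S_between = np.zeros((d, d))
for c in np.unique(y):
    mu_c = X[y == c].mean(axis=0)
    S_between += (y == c).sum() * np.outer(mu_c, mu_c)

# Local structure: scatter of each point around its k nearest neighbours
W_knn = kneighbors_graph(X, n_neighbors=10, mode="connectivity").toarray()
W_knn = np.maximum(W_knn, W_knn.T)                  # symmetrise
L = np.diag(W_knn.sum(axis=1)) - W_knn              # graph Laplacian
S_local = X.T @ L @ X

# Maximise trace(P^T (S_between - alpha * S_local) P): take top eigenvectors
alpha = 0.5
evals, evecs = eigh(S_between - alpha * S_local)
P = evecs[:, ::-1][:, :2]                            # projection to 2 dimensions
X_low = X @ P
print(X_low.shape)
```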


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-12 ◽  
Author(s):  
Jincai Chang ◽  
Qiuling Pan ◽  
Zhihao Shen ◽  
Hao Qin

In a refrigeration unit, the amount of refrigerant has a substantial influence on the entire refrigeration system. To predict the amount of refrigerant in refrigerators with the best performance, this study used refrigerator data collected in real time via the Internet of Things. The data were screened to retain only the effective parameters related to the compressor and refrigeration properties (based on their practical significance and the research background) and were cleaned by applying longitudinal and transverse dimensionality reduction. Then, on the basis of an idealized model for refrigerator data, a model was established relating the refrigerant amount (the dependent variable) to temperature variation, refrigerator compartment temperature, freezer temperature, and other relevant parameters (independent variables). A neural-network-based refrigeration model was then established for predicting the amount of refrigerant and was used to predict five unknown refrigerant amounts from the data sets. BP neural network and RBF neural network models were used to compare the prediction results and analyze the loss functions. From the results, it was concluded that the unknown amount of refrigerant was most likely 32.5 g. Studying the prediction of the amount of refrigerant remaining in a refrigerator is of great practical significance for refrigerator production and maintenance.
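
The study's IoT measurements are not available, so the modelling step can only be sketched on synthetic placeholders (a back-propagation-style network via sklearn's MLPRegressor; an RBF network would be compared in the same way):

```python
# Sketch: regress the refrigerant amount on temperature-related features with a
# BP-style neural network and report the loss on held-out data. Feature names
# and data below are synthetic placeholders, not the study's measurements.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 400
# placeholder features: temperature variation, compartment temperature,
# freezer temperature, compressor power
X = rng.normal(size=(n, 4))
refrigerant_g = 30 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, refrigerant_g, random_state=0)
bp_net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                      random_state=0).fit(X_tr, y_tr)
pred = bp_net.predict(X_te)
print("test MSE (loss):", mean_squared_error(y_te, pred))
```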


1989 ◽  
Vol 20 (2) ◽  
pp. 31 ◽  
Author(s):  
G.A. Spencer ◽  
D.F. Pridmore ◽  
D.J. Isles

Image processing in exploration has rapidly evolved into the field of data integration, whereby independent data sets which coincide in space are displayed concurrently. Interrelationships between data sets which may be crucial to exploration can thus be identified much more effectively than with conventional hard copy overlays. The use of perceptual colour space: hue, saturation and luminosity (HSL) provides an effective means for integrating raster data sets, as illustrated with the multi-spectral scanner and airborne geophysical data from the Kambalda area in Western Australia. The integration process must also cater for data in vector format, which is more appropriate for geological, topographic and cultural information, but to date, image processing systems have poorly captured and managed such data. As a consequence, the merging of vector data management software such as GIS (geographic information system) with existing advanced image enhancement packages is an area of active development in the exploration industry.
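
The HSL fusion idea can be sketched as follows (the two input grids are synthetic placeholders for co-registered data sets; the channel assignments are one common choice, not the paper's specific products):

```python
# Sketch: one co-registered raster (e.g. airborne magnetics) drives hue, another
# (e.g. a multi-spectral band) drives luminosity, and the composite is converted
# to RGB for display.
import numpy as np
import colorsys

rng = np.random.default_rng(0)
magnetics = rng.normal(size=(64, 64))          # placeholder geophysical grid
landsat_band = rng.uniform(size=(64, 64))      # placeholder reflectance band

def normalise(a):
    return (a - a.min()) / (a.max() - a.min())

hue = 0.7 * normalise(magnetics)               # restrict hue range
lum = 0.2 + 0.6 * normalise(landsat_band)      # keep luminosity mid-range
sat = 0.9                                      # fixed saturation

rgb = np.array([[colorsys.hls_to_rgb(h, l, sat)
                 for h, l in zip(hrow, lrow)]
                for hrow, lrow in zip(hue, lum)])
print(rgb.shape)                               # (64, 64, 3) composite image
```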


2006 ◽  
Vol 71 (3) ◽  
pp. 567-578 ◽  
Author(s):  
Keith Kintigh

This forum reports the results of a National Science Foundation-funded workshop that focused on the integration and preservation of digital databases and other structured data derived from archaeological contexts. The workshop concluded that for archaeology to achieve its potential to advance long-term, scientific understandings of human history, there is a pressing need for an archaeological information infrastructure that will allow us to archive, access, integrate, and mine disparate data sets. This report provides an assessment of the informatics needs of archaeology, articulates an ambitious vision for a distributed disciplinary information infrastructure (cyberinfrastructure), discusses the challenges posed by its development, and outlines initial steps toward its realization. Finally, it argues that such a cyberinfrastructure has enormous potential to contribute to anthropology and science more generally. Concept-oriented archaeological data integration will enable the use of existing data to answer compelling new questions and permit syntheses of archaeological data that rely not on other investigators' conclusions but on analyses of meaningfully integrated new and legacy data sets.


2005 ◽  
Vol 03 (05) ◽  
pp. 1021-1038
Author(s):  
AO YUAN ◽  
GUANJIE CHEN ◽  
CHARLES ROTIMI ◽  
GEORGE E. BONNEY

The existence of haplotype blocks transmitted from parents to offspring has been suggested recently. This has created an interest in the inference of the block structure and length. The motivation is that well-characterized haplotype blocks will make it relatively easy to quickly map all the genes carrying human diseases. To study the inference of haplotype blocks systematically, we propose a statistical framework. In this framework, the optimal haplotype block partitioning is formulated as a problem of statistical model selection; missing data can be handled in a standard statistical way; population strata can be accommodated; block structure inference/hypothesis testing can be performed; and prior knowledge, if present, can be incorporated to perform Bayesian inference. The algorithm is linear in the number of loci, in contrast to many such algorithms, which are NP-hard. We illustrate the applications of our method to both simulated and real data sets.
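
The model-selection view of block partitioning can be sketched with a dynamic programme (the per-block score below is a placeholder, not the authors' likelihood; bounding the block length keeps the search roughly linear in the number of loci):

```python
# Sketch: choose block boundaries that minimise a BIC-style score over a
# haplotype matrix, with a bounded block length.
import numpy as np

def block_score(H, start, end):
    """Placeholder score for loci [start, end): distinct haplotypes plus a penalty."""
    sub = [tuple(row) for row in H[:, start:end]]
    n_distinct = len(set(sub))
    return n_distinct * (end - start) + np.log(H.shape[0]) * n_distinct

def partition_blocks(H, max_len=10):
    n_loci = H.shape[1]
    best = np.full(n_loci + 1, np.inf)
    back = np.zeros(n_loci + 1, dtype=int)
    best[0] = 0.0
    for j in range(1, n_loci + 1):
        for i in range(max(0, j - max_len), j):      # bounded block length
            s = best[i] + block_score(H, i, j)
            if s < best[j]:
                best[j], back[j] = s, i
    blocks, j = [], n_loci
    while j > 0:                                      # trace back the boundaries
        blocks.append((back[j], j))
        j = back[j]
    return blocks[::-1]

rng = np.random.default_rng(0)
H = rng.integers(0, 2, size=(50, 40))                 # 50 haplotypes x 40 loci
print(partition_blocks(H))
```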

