baredSC: Bayesian approach to retrieve expression distribution of single-cell data

Lucille Lopez-Delisle; Jean-Baptiste Delisle

doi:10.1186/s12859-021-04507-8

BMC Bioinformatics ◽

10.1186/s12859-021-04507-8 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Lucille Lopez-Delisle ◽

Jean-Baptiste Delisle

Keyword(s):

Single Cell ◽

Bayesian Approach ◽

Genetic Interaction ◽

Gaussian Mixture ◽

Two Dimensions ◽

Biological Data ◽

Specific Gene ◽

Trimodal Distribution ◽

Embryonic Limb ◽

Cell Data

Abstract. Background: The number of studies using single-cell RNA sequencing (scRNA-seq) is constantly growing. This powerful technique provides a sampling of the whole transcriptome of a cell. However, sparsity of the data can be a major hurdle when studying the distribution of the expression of a specific gene or the correlation between the expressions of two genes. Results: We show that the main technical noise associated with these scRNA-seq experiments is due to the sampling, i.e., Poisson noise. We present a new tool named baredSC, for Bayesian Approach to Retrieve Expression Distribution of Single-Cell data, which infers the intrinsic expression distribution in scRNA-seq data using a Gaussian mixture model. baredSC can be used to obtain the distribution in one dimension for individual genes and in two dimensions for pairs of genes, in particular to estimate the correlation in the two genes' expressions. We apply baredSC to simulated scRNA-seq data and show that the algorithm is able to uncover the expression distribution used to simulate the data, even in multi-modal cases with very sparse data. We also apply baredSC to two real biological data sets. First, we use it to measure the anti-correlation between Hoxd13 and Hoxa11, two genes with known genetic interaction in embryonic limb. Then, we study the expression of Pitx1 in embryonic hindlimb, for which a trimodal distribution has been identified through flow cytometry. While other methods to analyze scRNA-seq are too sensitive to sampling noise, baredSC reveals this trimodal distribution. Conclusion: baredSC is a powerful tool which aims at retrieving the expression distribution of a few genes of interest from scRNA-seq data.
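
The abstract's central claim, that Poisson sampling on top of a mixture-shaped expression distribution produces very sparse counts, can be illustrated with a small simulation. This is a minimal sketch and not baredSC's own code: the function names, the mixture parameters, the expression units, and the `capture_rate` value are all assumptions chosen for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_mixture(components, rng):
    """Draw a latent expression level from a Gaussian mixture, clipped at zero."""
    weights = [w for w, _, _ in components]
    _, mean, sd = rng.choices(components, weights=weights, k=1)[0]
    return max(0.0, rng.gauss(mean, sd))


def simulate_counts(n_cells, components, capture_rate, rng):
    """Return (latent levels, observed counts) for n_cells simulated cells."""
    latent = [sample_mixture(components, rng) for _ in range(n_cells)]
    # Each observed count is a Poisson draw around the captured latent level.
    observed = [sample_poisson(capture_rate * x, rng) for x in latent]
    return latent, observed


rng = random.Random(0)
# Hypothetical bimodal gene: 60% of cells near 2 units, 40% near 8 units.
components = [(0.6, 2.0, 0.5), (0.4, 8.0, 0.5)]
latent, observed = simulate_counts(2000, components, capture_rate=0.1, rng=rng)
zero_fraction = sum(1 for c in observed if c == 0) / len(observed)
print(f"fraction of cells with zero counts: {zero_fraction:.2f}")
```

With a low capture rate, most cells report zero counts even though every simulated cell expresses the gene, so the two latent modes are hard to see in the raw counts. Recovering them is the kind of inference problem the abstract describes.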

baredSC: Bayesian Approach to Retrieve Expression Distribution of Single-Cell

10.1101/2021.05.26.445740 ◽

2021 ◽

Author(s):

Lucille Lopez-Delisle ◽

Jean-Baptiste Delisle

Keyword(s):

Single Cell ◽

Bayesian Approach ◽

Genetic Interaction ◽

Gaussian Mixture ◽

Two Dimensions ◽

Biological Data ◽

Specific Gene ◽

Trimodal Distribution ◽

Embryonic Limb ◽

Sparse Samples

The number of studies using single-cell RNA sequencing (scRNA-seq) is constantly growing. This powerful technique provides a sampling of the whole transcriptome of a cell. However, the commonly used droplet-based method often produces very sparse samples. Sparsity can be a major hurdle when studying the distribution of the expression of a specific gene or the correlation between the expressions of two genes. We show that the main technical noise associated with these scRNA-seq experiments is due to the sampling (i.e. Poisson noise). We developed a new tool named baredSC, for Bayesian Approach to Retrieve Expression Distribution of Single-Cell, which infers the intrinsic expression distribution in noisy single-cell data using a Gaussian mixture model (GMM). baredSC can be used to obtain the distribution in one dimension for individual genes and in two dimensions for pairs of genes, in particular to estimate the correlation in the two genes' expressions. We apply baredSC to simulated scRNA-seq data and show that the algorithm is able to uncover the expression distribution used to simulate the data, even in multi-modal cases with very sparse data. We also apply baredSC to two real biological data sets. First, we use it to measure the anti-correlation between Hoxd13 and Hoxa11, two genes with known genetic interaction in embryonic limb. Then, we study the expression of Pitx1 in embryonic hindlimb, for which a trimodal distribution has been identified through flow cytometry. While other methods to analyze scRNA-seq are too sensitive to sampling noise, baredSC reveals this trimodal distribution.

Insights into therapeutic targets and biomarkers using integrated multi-‘omics’ approaches for dilated and ischemic cardiomyopathies

Integrative Biology ◽

10.1093/intbio/zyab007 ◽

2021 ◽

Author(s):

Austė Kanapeckaitė ◽

Neringa Burokienė

Keyword(s):

Machine Learning ◽

Single Cell ◽

Learning Algorithm ◽

Expression Profiles ◽

Therapeutic Targets ◽

Development Stage ◽

Biological Data ◽

Specific Gene ◽

Tissue Remodelling ◽

Pharmacological Management

Abstract. At present, heart failure (HF) treatment only targets the symptoms based on the left ventricle dysfunction severity; however, the lack of systemic 'omics' studies and available biological data to uncover the heterogeneous underlying mechanisms signifies the need to shift the analytical paradigm towards network-centric and data mining approaches. This study, for the first time, aimed to investigate how bulk and single cell RNA-sequencing as well as the proteomics analysis of the human heart tissue can be integrated to uncover HF-specific networks and potential therapeutic targets or biomarkers. We also aimed to address the issue of dealing with a limited number of samples and to show how appropriate statistical models, enrichment with other datasets as well as machine learning-guided analysis can aid in such cases. Furthermore, we elucidated specific gene expression profiles using transcriptomic and mined data from public databases. This was achieved using a two-step machine learning algorithm to predict the likelihood of the therapeutic target or biomarker tractability based on a novel scoring system, which has also been introduced in this study. The described methodology could be very useful for target or biomarker selection and evaluation during the pre-clinical therapeutics development stage as well as for disease progression monitoring. In addition, the present study sheds new light on the complex aetiology of HF, differentiating between subtle changes in dilated cardiomyopathies (DCs) and ischemic cardiomyopathies (ICs) on the single cell, proteome and whole transcriptome level, and demonstrating that HF might depend on the involvement of not only the cardiomyocytes but also other cell populations. The identified tissue remodelling and inflammatory processes can be beneficial when selecting targeted pharmacological management for DCs or ICs, respectively.

Modeling latent flows on single-cell data using the Hodge decomposition

10.1101/592089 ◽

2019 ◽

Author(s):

Kazumitsu Maehara ◽

Yasuyuki Ohkawa

Keyword(s):

Diffusion Process ◽

Single Cell ◽

Trajectory Analysis ◽

Single Cells ◽

Hodge Decomposition ◽

Biological Data ◽

Graph Representation ◽

Specific Cell ◽

Sparse Graph ◽

Cell Data

Abstract. Single-cell analysis is a powerful technique used to identify a specific cell population of interest during differentiation, aging, or oncogenesis. Individual cells occupy a particular transient state in the cell cycle, circadian rhythm, or during cell death. An appealing concept of pseudo-time trajectory analysis of single-cell RNA sequencing data was proposed in the software Monocle, and several methods of trajectory analysis have since been published. These aim to infer the ordering of cells and enable the tracing of gene expression profile trajectories in cell differentiation and reprogramming. However, the methods are restricted in terms of time structure because of the pre-specified structure of trajectories (linear, branched, tree or cyclic), which contrasts with the mixed state of single cells. Here, we propose a technique to extract underlying flows in single-cell data based on the Hodge decomposition (HD). HD is a theorem of vector fields on a manifold which guarantees that any given flow can be decomposed into three types of orthogonal components: gradient-flow (acyclic), curl-, and harmonic-flow (cyclic). HD is generalized on a simplicial complex (graph) and the discretized HD has only a weak assumption that the graph is directed. Therefore, in principle, HD can extract flows from any mixture of tree and cyclic time flows of observed cells. The decomposed flows provide intuitive interpretations about complex flow because of their linearity and orthogonality. Thus, each extracted flow can be focused on separately with no need to consider crosstalk. We developed ddhodge software, which aims to model the underlying flow structure that implies unobserved time or causal relations in the hodge-podge collection of data points.
We demonstrated that the mathematical framework of HD is suitable to reconstruct a sparse graph representation of the diffusion process as a candidate model of differentiation while preserving the divergence of the original fully-connected graph. The preserved divergence can be used as an indicator of the source and sink cells in the observed population. A sparse graph representation of the diffusion process transforms data analysis of the non-linear structure embedded in the high-dimensional space of single-cell data into inspection of the visible flow using graph algorithms. Hence, ddhodge is a suitable toolkit to visualize, inspect, and subsequently interpret large data sets including, but not limited to, high-throughput measurements of biological data. The beta version of the ddhodge R package is available at: https://github.com/kazumits/ddhodge
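
The gradient/cyclic split described in the abstract can be sketched on a toy graph. The code below is an illustrative assumption, not the ddhodge implementation: it finds node potentials by least squares (solving the graph Laplacian system with Gauss-Seidel sweeps, with one node pinned to zero), takes their differences as the gradient flow, and treats the remainder as the cyclic part (curl plus harmonic, not separated here). The function name `hodge_gradient_part` and the example graph are hypothetical.

```python
def hodge_gradient_part(n_nodes, edges, flows, sweeps=500):
    """Split edge flows into a gradient part and a cyclic remainder.

    edges: list of (u, v); flows[i] > 0 means flow from u to v on edge i.
    Solves L * phi = net_inflow by Gauss-Seidel with phi[0] pinned to 0.
    Assumes the graph is connected (every node has at least one edge).
    """
    net_inflow = [0.0] * n_nodes
    neighbours = [[] for _ in range(n_nodes)]
    for (u, v), f in zip(edges, flows):
        net_inflow[v] += f
        net_inflow[u] -= f
        neighbours[u].append(v)
        neighbours[v].append(u)

    phi = [0.0] * n_nodes
    for _ in range(sweeps):
        for i in range(1, n_nodes):
            total = net_inflow[i] + sum(phi[j] for j in neighbours[i])
            phi[i] = total / len(neighbours[i])

    gradient = [phi[v] - phi[u] for u, v in edges]
    cyclic = [f - g for f, g in zip(flows, gradient)]
    return phi, gradient, cyclic, net_inflow


# Toy graph: a triangle 0-1-2 carrying a circulating flow, plus a tail 2-3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
flows = [2.0, 2.0, -1.0, 1.0]
phi, gradient, cyclic, net_inflow = hodge_gradient_part(4, edges, flows)
print("potentials:", [round(p, 6) for p in phi])
print("gradient part:", [round(g, 6) for g in gradient])
print("cyclic part:", [round(c, 6) for c in cyclic])
```

In this sketch a node with negative net inflow acts as a source and one with positive net inflow as a sink, which is the role the abstract assigns to the preserved divergence.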

Novel insights into potential therapeutic targets and biomarkers using integrated multi-omics approaches for dilated and ischemic cardiomyopathies

10.1101/2020.12.15.422946 ◽

2020 ◽

Author(s):

Auste Kanapeckaite ◽

Neringa Burokiene

Keyword(s):

Heart Failure ◽

Single Cell ◽

Expression Profiles ◽

Target Selection ◽

Therapeutic Targets ◽

Gene Expression Profiles ◽

Development Stage ◽

Biological Data ◽

Machine Learning Algorithms ◽

Specific Gene

At present, heart failure treatment targets symptoms based on the left ventricle dysfunction severity; however, the lack of systemic studies and available biological data to uncover heterogeneous underlying mechanisms at the genomic, transcriptional and expressed protein level signifies the need to shift the analytical paradigm toward network-centric and data mining approaches. This study, for the first time, aimed to investigate how bulk and single cell RNA-sequencing as well as the proteomics analysis of the human heart tissue can be integrated to uncover heart failure specific networks and potential therapeutic targets or biomarkers. Furthermore, it was demonstrated that transcriptomics data in combination with mined data from public databases can be used to elucidate specific gene expression profiles. This was achieved using machine learning algorithms to predict the likelihood of the therapeutic target or biomarker tractability based on a novel scoring system also introduced in this study. The described methodology could be very useful for target selection and evaluation during the pre-clinical therapeutics development stage. Finally, the present study sheds new light on the complex etiology of heart failure, differentiating between subtle changes in dilated and ischemic cardiomyopathy on the single cell, proteome and whole transcriptome level.

Implication of specific retinal cell-type involvement and gene expression changes in AMD progression using integrative analysis of single-cell and bulk RNA-seq profiling

Scientific Reports ◽

10.1038/s41598-021-95122-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Yafei Lyu ◽

Randy Zauhar ◽

Nicholas Dana ◽

Christianne E. Strang ◽

Jian Hu ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Age Related Macular Degeneration ◽

Specific Gene ◽

Cell Type ◽

Adult Human ◽

Single Cell Rna Sequencing ◽

Cell Type Specific ◽

Cell Data

Abstract. Age-related macular degeneration (AMD) is a blinding eye disease with no unifying theme for its etiology. We used single-cell RNA sequencing to analyze the transcriptomes of ~93,000 cells from the macula and peripheral retina from two adult human donors and bulk RNA sequencing from fifteen adult human donors with and without AMD. Analysis of our single-cell data identified 267 cell-type-specific genes. Comparison of macula and peripheral retinal regions found no cell-type differences but did identify 50 differentially expressed genes (DEGs) with about 1/3 expressed in cones. Integration of our single-cell data with bulk RNA sequencing data from normal and AMD donors showed compositional changes more pronounced in macula in rods, microglia, endothelium, Müller glia, and astrocytes in the transition from normal to advanced AMD. KEGG pathway analysis of our normal vs. advanced AMD eyes identified enrichment in complement and coagulation pathways, antigen presentation, tissue remodeling, and signaling pathways including PI3K-Akt, NOD-like, Toll-like, and Rap1. These results showcase the use of single-cell RNA sequencing to infer cell-type compositional and cell-type-specific gene expression changes in intact bulk tissue and provide a foundation for investigating molecular mechanisms of retinal disease that lead to new therapeutic targets.

Poincaré Maps for Analyzing Complex Hierarchies in Single-Cell Data

10.1101/689547 ◽

2019 ◽

Cited By ~ 2

Author(s):

Anna Klimovskaia ◽

David Lopez-Paz ◽

Léon Bottou ◽

Maximilian Nickel

Keyword(s):

Data Analysis ◽

Single Cell ◽

Hyperbolic Geometry ◽

Continuous Extension ◽

Two Dimensions ◽

Biological Processes ◽

Poincaré Maps ◽

Cell Trajectories ◽

Cell Data

Abstract. The need to understand cell developmental processes spawned a plethora of computational methods for discovering hierarchies from scRNAseq data. However, existing techniques are based on Euclidean geometry, a suboptimal choice for modeling complex cell trajectories with multiple branches. To overcome this fundamental representation issue we propose Poincaré maps, a method that harnesses the power of hyperbolic geometry in the realm of single-cell data analysis. Often understood as a continuous extension of trees, hyperbolic geometry enables the embedding of complex hierarchical data in only two dimensions while preserving the pairwise distances between points in the hierarchy. This enables direct exploratory analysis and the use of our embeddings in a wide variety of downstream data analysis tasks, such as visualization, clustering, lineage detection and pseudo-time inference. When compared to existing methods, which are unable to address all these important tasks using a single embedding, Poincaré maps produce state-of-the-art two-dimensional representations of cell trajectories on multiple scRNAseq datasets. More specifically, we demonstrate that Poincaré maps make it straightforward to formulate new hypotheses about biological processes unknown to prior methods.

Significance statement: The discovery of hierarchies in biological processes is central to developmental biology. We propose Poincaré maps, a new method based on hyperbolic geometry to discover continuous hierarchies from pairwise similarities. We demonstrate the efficacy of our method on multiple single-cell datasets on tasks such as visualization, clustering, lineage identification, and pseudo-time inference.
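
To make the hyperbolic-geometry idea concrete, the following sketch computes the standard distance in the Poincaré disk model, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). It shows only the textbook metric; the function name and the example points are assumptions, and nothing here reproduces the authors' embedding procedure.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit disk."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Two pairs of points with the same Euclidean separation (0.2):
near_centre = poincare_distance((-0.1, 0.0), (0.1, 0.0))
near_edge = poincare_distance((0.75, 0.0), (0.95, 0.0))
print(f"near the centre: {near_centre:.3f}")
print(f"near the boundary: {near_edge:.3f}")
```

Equal Euclidean steps cover more hyperbolic distance near the boundary of the disk. This is why tree-like hierarchies, whose number of leaves grows quickly with depth, can fit in two dimensions.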

Natian and Ryabhatta—graphical user interfaces to create, analyze and visualize single-cell transcriptomic datasets

10.1101/2021.06.17.448424 ◽

2021 ◽

Author(s):

Sathiyanarayanan Manivannan ◽

Vidu Garg

Keyword(s):

Quality Control ◽

Single Cell ◽

User Interfaces ◽

Dimensional Reduction ◽

Life Sciences ◽

Principal Component ◽

Specific Gene ◽

Gene Count ◽

The Individual ◽

Cell Data

Single-cell transcriptomic analyses permit a high-resolution investigation of biological processes at the individual cell level. Single-cell transcriptomics technologies such as Drop-seq, Smart-seq, MARS-seq, sci-RNA-seq, and CEL-seq produce large volumes of data in the form of sequence reads. In general, the alignment of the reads to genomes and the enumeration of reads mapping to a specific gene results in a gene-count matrix. These gene-count matrix data require robust quality control and statistical analytical pipelines before data mining and interpretation. Among these post-alignment pipelines, the 'Seurat' package in 'R' is the most popular analytical pipeline for the analysis of single-cell data. This package provides quality control, normalization, principal component analysis, dimensional reduction, clustering, and marker identification among other functions needed to process and mine the single-cell transcriptomic data. While the Seurat package is continuously updated and includes a variety of functionalities, the user is still required to be proficient in the 'R' programming language and its data structures to be able to execute the Seurat functions. Hence, there is a demand for a graphical user interface (GUI) that takes in relevant input information and processes the single-cell data using the Seurat pipeline. A GUI will also greatly improve access to single-cell data for life sciences researchers who are not trained in the command-line operation of the 'R' platform. To meet this demand, we present R Shiny apps 'Natian' and 'Ryabhatta' to assist in the generation and analysis of Seurat files from a variety of different sources. The apps and example data can be downloaded from https://singlecelltranscriptomics.org.
Natian allows users to create Seurat files from the output of multiple pipelines, integrate existing Seurat files, add metadata information, perform dimensional reduction analysis or upload dimensional reduction data, resume partially processed Seurat files and find cluster markers. Ryabhatta allows users to visualize gene expression using a variety of plotting options, analyze cluster markers, rename clusters, select cells from a graph or based on expression levels of markers, perform differential expression, count the number of cells in each condition, and perform pseudotime analysis using Monocle. We found that the use of these apps substantially improved analysis and processing time and removed needless troubleshooting due to incompatible commands, typographical errors in scripts, and cluttering of the R environment with variables. We hope the use of these apps improves the use of single-cell data for life sciences research while also providing a tool to learn the functionalities of Seurat and R functions available for single-cell data analysis.

Single-Cell Sequencing Reveals Lineage-Specific Dynamic Genetic Regulation of Gene Expression During Human Cardiomyocyte Differentiation

10.1101/2021.06.03.446970 ◽

2021 ◽

Author(s):

Reem Elorbany ◽

Joshua M Popp ◽

Katherine Rhodes ◽

Benjamin J Strober ◽

Kenneth Barr ◽

...

Keyword(s):

Single Cell ◽

Cell Lines ◽

Specific Gene ◽

Specific Cell ◽

Cardiomyocyte Differentiation ◽

Cell Type ◽

Dynamic Effects ◽

Regulatory Changes ◽

Gene Regulatory ◽

Cell Data

Dynamic and temporally specific gene regulatory changes may underlie unexplained genetic associations with complex disease. During a dynamic process such as cellular differentiation, the overall cell type composition of a tissue (or an in vitro culture) and the gene regulatory profile of each cell can both experience significant changes over time. To identify these dynamic effects in high resolution, we collected single-cell RNA-sequencing data over a differentiation time course from induced pluripotent stem cells to cardiomyocytes, sampled at 7 unique time points in 19 human cell lines. We employed a flexible approach to map dynamic eQTLs whose effects vary significantly over the course of bifurcating differentiation trajectories, including many whose effects are specific to one of these two lineages. Our study design allowed us to distinguish true dynamic eQTLs affecting a specific cell lineage from expression changes driven by potentially non-genetic differences between cell lines such as cell composition. Additionally, we used the cell type profiles learned from single-cell data to deconvolve and re-analyze data from matched bulk RNA-seq samples. Using this approach, we were able to identify a large number of novel dynamic eQTLs in single cell data while also attributing dynamic effects in bulk to a particular lineage. Overall, we found that using single cell data to uncover dynamic eQTLs can provide new insight into the gene regulatory changes that occur among heterogeneous cell types during cardiomyocyte differentiation.

Bayesian estimation of cell type-specific gene expression with prior derived from single-cell data

Genome Research ◽

10.1101/gr.268722.120 ◽

2021 ◽

pp. gr.268722.120

Author(s):

Jiebiao Wang ◽

Kathryn Roeder ◽

Bernie Devlin

Keyword(s):

Gene Expression ◽

Single Cell ◽

Bayesian Estimation ◽

Specific Gene ◽

Cell Type ◽

Specific Gene Expression ◽

Cell Type Specific ◽

Cell Data

A Unified Statistical Framework for Single Cell and Bulk Sequencing Data

10.1101/206532 ◽

2017 ◽

Cited By ~ 1

Author(s):

Lingxue Zhu ◽

Jing Lei ◽

Bernie Devlin ◽

Kathryn Roeder

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Accurate Estimation ◽

Specific Gene ◽

Rna Seq ◽

Cell Type ◽

Cell Type Specific ◽

Different Cell Types ◽

Cell Data

Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the “dropout” events. A “dropout” happens when the RNA for a gene fails to be amplified prior to sequencing, producing a “false” zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows the strength from both data sources and carefully models the dropouts in single cell data, leading to a more accurate estimation of cell type specific gene expression profile. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data, as well as in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.
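
The notion of a "dropout" as a false zero can be illustrated with a toy masking step. This sketch is a deliberate simplification and not the URSM model: it assumes a single constant dropout probability for every entry, whereas the abstract describes dropouts being modeled inside a hierarchical model. The function name, the count values, and the probability are hypothetical.

```python
import random


def apply_dropout(true_counts, dropout_prob, rng):
    """Replace each count by 0 with probability dropout_prob (a false zero)."""
    return [0 if rng.random() < dropout_prob else c for c in true_counts]


rng = random.Random(1)
# Hypothetical true expression counts for one gene across 2000 cells.
true_counts = [5, 0, 12, 3, 0, 7, 9, 1] * 250
observed = apply_dropout(true_counts, dropout_prob=0.3, rng=rng)

true_zeros = sum(1 for c in true_counts if c == 0)
observed_zeros = sum(1 for c in observed if c == 0)
false_zeros = observed_zeros - true_zeros
print(f"true zeros: {true_zeros}, observed zeros: {observed_zeros}")
print(f"false zeros introduced by dropout: {false_zeros}")
```

From the observed counts alone, a false zero looks exactly like a gene that is truly not expressed. Telling the two apart therefore calls for a model with extra information, such as the bulk data the abstract mentions.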
