Bias, robustness and scalability in differential expression analysis of single-cell RNA-seq data

2017 ◽  
Author(s):  
Charlotte Soneson ◽  
Mark D. Robinson

Abstract

Background: As single-cell RNA-seq (scRNA-seq) is becoming increasingly common, the amount of publicly available data grows rapidly, generating a useful resource for computational method development and extension of published results. Although processed data matrices are typically made available in public repositories, the procedure to obtain these varies widely between data sets, which may complicate reuse and cross-data set comparison. Moreover, while many statistical methods for performing differential expression analysis of scRNA-seq data are becoming available, their relative merits and the performance compared to methods developed for bulk RNA-seq data are not sufficiently well understood.

Results: We present conquer, a collection of consistently processed, analysis-ready public single-cell RNA-seq data sets. Each data set has count and transcripts per million (TPM) estimates for genes and transcripts, as well as quality control and exploratory analysis reports. We use a subset of the data sets available in conquer to perform an extensive evaluation of the performance and characteristics of statistical methods for differential gene expression analysis, evaluating a total of 30 statistical approaches on both experimental and simulated scRNA-seq data.

Conclusions: Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.

2018 ◽  
Vol 34 (19) ◽  
pp. 3340-3348 ◽  
Author(s):  
Zhijin Wu ◽  
Yi Zhang ◽  
Michael L Stitzel ◽  
Hao Wu

2019 ◽  
Author(s):  
Mahmoud M Ibrahim ◽  
Rafael Kramann

Abstract

Marker genes identified in single cell experiments are expected to be highly specific to a certain cell type and highly expressed in that cell type. Detecting a gene by differential expression analysis does not necessarily satisfy those two conditions and is typically computationally expensive for large cell numbers. Here we present genesorteR, an R package that ranks features in single cell data in a manner consistent with the expected definition of marker genes in experimental biology research. We benchmark genesorteR using various data sets and show that it is distinctly more accurate in large single cell data sets compared to other methods. genesorteR is orders of magnitude faster than current implementations of differential expression analysis methods, can operate on data containing millions of cells and is applicable to both single cell RNA-Seq and single cell ATAC-Seq data. genesorteR is available at https://github.com/mahmoudibrahim/genesorteR.


2019 ◽  
Vol 16 (2) ◽  
pp. 163-166 ◽  
Author(s):  
Vasilis Ntranos ◽  
Lynn Yi ◽  
Páll Melsted ◽  
Lior Pachter

2018 ◽  
Author(s):  
Jesse M. Zhang ◽  
Govinda M. Kamath ◽  
David N. Tse

Summary

Single-cell computational pipelines involve two critical steps: organizing cells (clustering) and identifying the markers driving this organization (differential expression analysis). State-of-the-art pipelines perform differential analysis after clustering on the same dataset. We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries. We introduce a valid post-clustering differential analysis framework which corrects for this problem. We provide software at https://github.com/jessemzhang/tn_test.
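The selection effect described in this summary can be illustrated with a small simulation. The sketch below is not the authors' framework and does not use the tn_test package; it assumes a single gene with no true subpopulations, uses a median split as a simple stand-in for clustering, and compares the resulting Welch t statistic with the one obtained from labels assigned independently of expression. All names and settings are illustrative assumptions.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


random.seed(0)

# One gene measured in 200 cells drawn from a single population:
# there is no true differential expression to find.
expression = [random.gauss(0.0, 1.0) for _ in range(200)]

# Stand-in for clustering (an assumption of this sketch): split the cells
# at the median of the very values that will be tested afterwards.
cutoff = statistics.median(expression)
low = [x for x in expression if x < cutoff]
high = [x for x in expression if x >= cutoff]
t_clustered = welch_t(high, low)

# Control: group labels assigned independently of expression.
shuffled = expression[:]
random.shuffle(shuffled)
t_random = welch_t(shuffled[:100], shuffled[100:])

print(f"t after data-driven split: {t_clustered:.1f}")
print(f"t after random split:      {t_random:.1f}")
```

The data-driven split yields a very large test statistic even though all cells come from one distribution, which is the kind of artificially small p-value a post-clustering correction needs to address.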


2017 ◽  
Author(s):  
Koen Van den Berge ◽  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson ◽  
Lieven Clement

Abstract

Dropout in single cell RNA-seq (scRNA-seq) applications causes many transcripts to go undetected. It induces excess zero counts, which leads to power issues in differential expression (DE) analysis and has triggered the development of bespoke scRNA-seq DE tools that cope with zero-inflation. Recent evaluations, however, have shown that dedicated scRNA-seq tools provide no advantage compared to traditional bulk RNA-seq tools. We introduce zingeR, a zero-inflated negative binomial model that identifies excess zero counts and generates observation weights to unlock bulk RNA-seq pipelines for zero-inflation, boosting performance in scRNA-seq differential expression analysis.
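As a rough illustration of how observation weights can be derived from a zero-inflated negative binomial (ZINB) mixture, the sketch below computes the posterior probability that a count was generated by the negative binomial component rather than by the excess-zero component. It assumes the mean `mu`, dispersion `phi` (variance `mu + phi * mu**2`) and mixing probability `pi` are already estimated; these names are hypothetical, and this is not the zingeR implementation.

```python
def nb_zero_prob(mu, phi):
    """P(Y = 0) under a negative binomial with mean mu and dispersion phi."""
    return (1.0 + phi * mu) ** (-1.0 / phi)


def zinb_weight(y, mu, phi, pi):
    """Posterior probability that count y comes from the NB component.

    pi is the assumed probability of an excess zero. Positive counts can
    only come from the NB component, so they receive weight 1.
    """
    if y > 0:
        return 1.0
    p0 = nb_zero_prob(mu, phi)
    return (1.0 - pi) * p0 / (pi + (1.0 - pi) * p0)


# A zero in a highly expressed gene is probably an excess zero (low weight);
# a zero in a lowly expressed gene is plausible under the NB (higher weight).
w_high_mean = zinb_weight(0, mu=10.0, phi=0.5, pi=0.3)
w_low_mean = zinb_weight(0, mu=0.1, phi=0.5, pi=0.3)
print(round(w_high_mean, 3), round(w_low_mean, 3))
```

Weights of this kind could then be passed as observation weights to a bulk RNA-seq model so that likely excess zeros contribute less to the fit.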


2021 ◽  
Author(s):  
Marine Gauthier ◽  
Denis Agniel ◽  
Rodolphe Thiébaut ◽  
Boris P. Hejblum

State-of-the-art methods for single-cell RNA-seq (scRNA-seq) Differential Expression Analysis (DEA) often rely on strong distributional assumptions that are difficult to verify in practice. Furthermore, while the increasing complexity of clinical and biological single-cell studies calls for greater tool versatility, the majority of existing methods only tackle the comparison between two conditions. We propose a novel, distribution-free, and flexible approach to DEA for single-cell RNA-seq data. This new method, called ccdf, tests the association of each gene expression with one or many variables of interest (that can be either continuous or discrete), while potentially adjusting for additional covariates. To test such complex hypotheses, ccdf uses a conditional independence test relying on the conditional cumulative distribution function, estimated through multiple regressions. We provide the asymptotic distribution of the ccdf test statistic as well as a permutation test (when the number of observed cells is not sufficiently large). ccdf substantially expands the possibilities for scRNA-seq DEA studies: it obtains good statistical performance in various simulation scenarios considering complex experimental designs (i.e. beyond the two-condition comparison), while retaining competitive performance with state-of-the-art methods in a two-condition benchmark.


2021 ◽  
Author(s):  
Mengqi Zhang ◽  
Si Liu ◽  
Zhen Miao ◽  
Fang Han ◽  
Raphael Gottardo ◽  
...  

Bulk RNA-seq data quantify the expression of a gene in an individual by one number (e.g., fragment count). In contrast, single cell RNA-seq (scRNA-seq) data provide much richer information: the distribution of gene expression across many cells. To assess differential expression across individuals using scRNA-seq data, a straightforward solution is to create "pseudo" bulk RNA-seq data by adding up the fragment counts of a gene across cells for each individual, and then apply methods designed for differential expression using bulk RNA-seq data. This pseudo-bulk solution reduces the distribution of gene expression across cells to a single number and thus loses a good amount of information. We propose to assess differential expression using the gene expression distribution measured by cell level data. We find denoising cell level data can substantially improve the power of this approach. We apply our method, named IDEAS (Individual level Differential Expression Analysis for scRNA-seq), to study the gene expression difference between autism subjects and controls. We find neurogranin-expressing neurons harbor a high proportion of differentially expressed genes, and ERBB signals in microglia are associated with autism.
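The pseudo-bulk aggregation described in this abstract can be sketched in a few lines. The example below is a minimal illustration on made-up toy data, not the IDEAS method; the function name, gene names and cell identifiers are hypothetical. It sums fragment counts per gene across the cells of each individual and shows how two different cell-level distributions collapse to the same pseudo-bulk value.

```python
from collections import defaultdict


def pseudo_bulk(counts, cell_to_individual):
    """Sum per-cell fragment counts of each gene within each individual.

    counts: dict mapping cell id -> dict of gene -> fragment count
    cell_to_individual: dict mapping cell id -> individual id
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in counts.items():
        individual = cell_to_individual[cell]
        for gene, n in gene_counts.items():
            totals[individual][gene] += n
    return {ind: dict(genes) for ind, genes in totals.items()}


# Toy data (hypothetical): GENE_A is expressed in only one of the two cells
# of ind1, but evenly in both cells of ind2.
counts = {
    "cell1": {"GENE_A": 0, "GENE_B": 5},
    "cell2": {"GENE_A": 10, "GENE_B": 5},
    "cell3": {"GENE_A": 5, "GENE_B": 5},
    "cell4": {"GENE_A": 5, "GENE_B": 5},
}
cell_to_individual = {
    "cell1": "ind1",
    "cell2": "ind1",
    "cell3": "ind2",
    "cell4": "ind2",
}

bulk = pseudo_bulk(counts, cell_to_individual)
print(bulk)
```

Both individuals end up with a GENE_A total of 10 even though the cell-level distributions differ (0 and 10 versus 5 and 5), which is the information loss the abstract refers to.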


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Matthew Chung ◽  
Vincent M. Bruno ◽  
David A. Rasko ◽  
Christina A. Cuomo ◽  
José F. Muñoz ◽  
...  

Abstract

Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.


2017 ◽  
Vol 45 (19) ◽  
pp. 10978-10988 ◽  
Author(s):  
Cheng Jia ◽  
Yu Hu ◽  
Derek Kelly ◽  
Junhyong Kim ◽  
Mingyao Li ◽  
...  
