dearseq: a variance component score test for RNA-Seq differential analysis that effectively controls the false discovery rate

2019 ◽  
Author(s):  
Marine Gauthier ◽  
Denis Agniel ◽  
Rodolphe Thiébaut ◽  
Boris P. Hejblum

Abstract RNA-seq studies are growing in size and popularity. We provide evidence that the most commonly used methods for differential expression analysis (DEA) may yield too many false positive results in some situations. We present dearseq, a new method for DEA which controls the FDR without making any assumption about the true distribution of RNA-seq data. We show that dearseq controls the FDR while maintaining strong statistical power compared to the most popular methods. We demonstrate this behavior with mathematical proofs, simulations, and a real data set from a study of tuberculosis, where our method produces fewer apparent false positives.

2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Marine Gauthier ◽  
Denis Agniel ◽  
Rodolphe Thiébaut ◽  
Boris P Hejblum

Abstract RNA-seq studies are growing in size and popularity. We provide evidence that the most commonly used methods for differential expression analysis (DEA) may yield too many false positive results in some situations. We present dearseq, a new method for DEA that controls the false discovery rate (FDR) without making any assumption about the true distribution of RNA-seq data. We show that dearseq controls the FDR while maintaining strong statistical power compared to the most popular methods. We demonstrate this behavior with mathematical proofs, simulations and a real data set from a study of tuberculosis, where our method produces fewer apparent false positives.


2018 ◽  
Vol 28 (8) ◽  
pp. 2418-2438
Author(s):  
Xi Shen ◽  
Chang-Xing Ma ◽  
Kam C Yuen ◽  
Guo-Liang Tian

Bilateral correlated data are often encountered in medical research such as ophthalmologic (or otolaryngologic) studies, in which each unit contributes information from paired organs to the data analysis, and the measurements from such paired organs are generally highly correlated. Various statistical methods have been developed to handle intra-class correlation in the analysis of bilateral correlated data. In practice, it is very important to adjust for the effect of confounders on statistical inference, since ignoring either the intra-class correlation or the confounding effect may lead to biased results. In this article, we propose three approaches for testing a common risk difference for stratified bilateral correlated data under the assumption of equal correlation. Five confidence intervals for the common difference of two proportions are derived. The performance of the proposed tests and confidence interval estimators is evaluated by Monte Carlo simulations. The simulation results show that the score test statistic outperforms the other statistics in the sense that it has robust type I error rates with high power. The score confidence interval induced from the score test statistic performs satisfactorily in terms of coverage probabilities with reasonable interval widths. A real data set from an otolaryngologic study is used to illustrate the proposed methodologies.


2020 ◽  
Author(s):  
Takayuki Osabe ◽  
Kentaro Shimizu ◽  
Koji Kadota

Abstract Background RNA-seq is a tool for measuring gene expression and is commonly used to identify differentially expressed genes (DEGs). Gene clustering is used to classify DEGs with similar expression patterns for the subsequent analyses of data from experiments such as time-courses or multi-group comparisons. However, gene clustering has rarely been used for analyzing simple two-group data or differential expression (DE). In this study, we report a model-based clustering algorithm, MBCluster.Seq, that can be implemented using an R package for DE analysis. Results The input data originally used by MBCluster.Seq is DEGs, and the proposed method (called MBCdeg) uses all genes for the analysis. The method uses posterior probabilities of genes assigned to a cluster displaying a non-DEG pattern for overall gene ranking. We compared the performance of MBCdeg with conventional R packages such as edgeR, DESeq2, and TCC that are specialized for DE analysis using simulated and real data. Our results showed that MBCdeg outperformed other methods when the proportion of DEGs (PDEG) was less than 50%. However, the DEG identification using MBCdeg was less consistent than with conventional methods. We compared the effects of different normalization algorithms using MBCdeg, and performed an analysis using MBCdeg in combination with a robust normalization algorithm (called DEGES) that was not implemented in MBCluster.Seq. The new analysis method showed greater stability than using the original MBCdeg with the default normalization algorithm. Conclusions MBCdeg with DEGES normalization can be used in the identification of DEGs when the PDEG is relatively low. As the method is based on gene clustering, the DE result includes information on which expression pattern the gene belongs to. The new method may be useful for the analysis of time-course and multi-group data, where the classification of expression patterns is often required.


2017 ◽  
Author(s):  
Charlotte Soneson ◽  
Mark D. Robinson

Abstract Background As single-cell RNA-seq (scRNA-seq) is becoming increasingly common, the amount of publicly available data grows rapidly, generating a useful resource for computational method development and extension of published results. Although processed data matrices are typically made available in public repositories, the procedure to obtain these varies widely between data sets, which may complicate reuse and cross-data set comparison. Moreover, while many statistical methods for performing differential expression analysis of scRNA-seq data are becoming available, their relative merits and the performance compared to methods developed for bulk RNA-seq data are not sufficiently well understood. Results We present conquer, a collection of consistently processed, analysis-ready public single-cell RNA-seq data sets. Each data set has count and transcripts per million (TPM) estimates for genes and transcripts, as well as quality control and exploratory analysis reports. We use a subset of the data sets available in conquer to perform an extensive evaluation of the performance and characteristics of statistical methods for differential gene expression analysis, evaluating a total of 30 statistical approaches on both experimental and simulated scRNA-seq data. Conclusions Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Jinyang Zhang ◽  
Shuai Chen ◽  
Jingwen Yang ◽  
Fangqing Zhao

Abstract Detection and quantification of circular RNAs (circRNAs) face several significant challenges, including high false discovery rate, uneven rRNA depletion and RNase R treatment efficiency, and underestimation of back-spliced junction reads. Here, we propose a novel algorithm, CIRIquant, for accurate circRNA quantification and differential expression analysis. By constructing a pseudo-circular reference for re-alignment of RNA-seq reads and employing sophisticated statistical models to correct RNase R treatment biases, CIRIquant can provide more accurate expression values for circRNAs with significantly reduced false discovery rate. We further develop a one-stop differential expression analysis pipeline implementing two independent measures, which helps unveil the regulation of competitive splicing between circRNAs and their linear counterparts. We apply CIRIquant to RNA-seq datasets of hepatocellular carcinoma, and characterize two important groups of linear-circular switching and circular transcript usage switching events, which demonstrate the promising ability to explore extensive transcriptomic changes in liver tumorigenesis.


2020 ◽  
Vol 36 (10) ◽  
pp. 3115-3123 ◽  
Author(s):  
Teng Fei ◽  
Tianwei Yu

Abstract Motivation Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information Supplementary data are available at Bioinformatics online.


2009 ◽  
Vol 2009 ◽  
pp. 1-10
Author(s):  
Martina Bremer ◽  
R. W. Doerge

We present a statistical method to rank observed genes in gene expression time series experiments according to their degree of regulation in a biological process. The ranking may be used to focus on specific genes or to select meaningful subsets of genes from which gene regulatory networks can be built. Our approach is based on a state space model that incorporates hidden regulators of gene expression. Kalman (K) smoothing and maximum (M) likelihood estimation techniques are used to derive optimal estimates of the model parameters upon which a proposed regulation criterion is based. The statistical power of the proposed algorithm is investigated, and a real data set is analyzed for the purpose of identifying regulated genes in time dependent gene expression data. This statistical approach supports the concept that meaningful biological conclusions can be drawn from gene expression time series experiments by focusing on strong regulation rather than large expression values.


2018 ◽  
Author(s):  
Giulio Spinozzi ◽  
Valentina Tini ◽  
Laura Mincarelli ◽  
Brunangelo Falini ◽  
Maria Paola Martelli

There are many methods available for each phase of RNA-Seq analysis, and each of them uses different algorithms. It is therefore useful to identify a pipeline that combines the best tools in terms of time and results. For this purpose, we compared five different pipelines, obtained by combining the most used tools in RNA-Seq analysis. Using RNA-Seq data on samples of different Acute Myeloid Leukemia (AML) cell lines, we compared the five pipelines from alignment to differential expression analysis (DEA). For each one we evaluated peak RAM usage and running time, and then compared the differentially expressed genes identified by each pipeline. The pipeline with the shortest running time, the lowest RAM consumption, and the most reliable results was the one that uses HISAT2 for alignment, featureCounts for quantification, and edgeR for differential analysis. Finally, we developed an automated pipeline that uses the cited pipeline by default but also allows the user to choose between different tools. In addition, the pipeline performs a final meta-analysis that includes a Gene Ontology and Pathway analysis. The results can be viewed in an interactive Shiny App and exported in a report (pdf, word or html formats).


2018 ◽  
Author(s):  
Aanchal Mongia ◽  
Debarka Sengupta ◽  
Angshul Majumdar

Abstract Single-cell RNA-seq has fueled discovery and innovation in medicine over the past few years and is useful for studying cellular responses at individual cell resolution. However, due to the paucity of starting RNA, the data acquired are highly sparse. To address this, we propose a deep matrix factorization based method, deepMc, to impute missing values in gene-expression data. For the deep architecture of our approach, we draw our motivation from the great success of deep learning in solving various machine learning problems. In this work, we support our method with positive results on several evaluation metrics such as clustering of cell populations, differential expression analysis and cell type separability.


2019 ◽  
Author(s):  
David Gerard

Abstract With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in unsubstantiated claims of a method’s performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.

