Comparative evaluation of full-length isoform quantification from RNA-Seq

Dimitra Sarantopoulou; Thomas G. Brooks; Soumyashant Nayak; Antonijo Mrčela; Nicholas F. Lahens; Gregory R. Grant

doi:10.1186/s12859-021-04198-1

Comparative evaluation of full-length isoform quantification from RNA-Seq

BMC Bioinformatics, 2021, Vol 22 (1). doi:10.1186/s12859-021-04198-1

Author(s): Dimitra Sarantopoulou, Thomas G. Brooks, Soumyashant Nayak, Antonijo Mrčela, Nicholas F. Lahens, ...

Keyword(s): Structural Parameters, Differential Expression Analysis, Real Data, Simple Approach, Full Length, RNA-Seq, RNA Transcripts, Quantification Accuracy, Isoform Quantification, Better Than

Abstract

Background: Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses and has been an area of active development since the introduction of RNA-Seq. The fundamental difficulty stems from the fact that RNA transcripts are long, while RNA-Seq reads are short.

Results: Here we use simulated benchmarking data that reflect many properties of real data, including polymorphisms, intron signal and non-uniform coverage, allowing for systematic comparative analyses of isoform quantification accuracy and its impact on differential expression analysis. Genome, transcriptome and pseudo alignment-based methods are included, and a simple approach is included as a baseline control.

Conclusions: Salmon, kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform dramatically better than the simple approach. We determine that the structural parameters with the greatest impact on quantification accuracy are length and sequence compression complexity, rather than the number of isoforms. The effect of incomplete annotation on performance is also investigated. Overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification and isoform-level differential expression (DE) analysis should still be employed selectively.

Comparative evaluation of full-length isoform quantification from RNA-Seq

2019. doi:10.1101/698605. Cited By ~ 1

Author(s): Dimitra Sarantopoulou, Soumyashant Nayak, Thomas G. Brooks, Nicholas F. Lahens, Gregory R. Grant

Keyword(s): Structural Parameters, Differential Expression Analysis, Full Length, RNA-Seq, RNA Transcripts, Fundamental Difficulty, Isoform Quantification, Realistic Data, Naive Approach, Better Than

Abstract

Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses and an area of active development. The fundamental difficulty stems from the fact that RNA transcripts are long, while RNA-Seq reads are typically short. We have generated realistic benchmarking data and performed a comprehensive comparative analysis of isoform quantification methods, including an evaluation of their impact on differential expression analysis. Genome, transcriptome and pseudo alignment-based methods are included, and a naive approach is included to establish a baseline. Kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform considerably better than the naive approach. We determine the effect of structural parameters, such as number of exons or number of isoforms, on accuracy. Overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification should be employed selectively.

Differential Expression Analysis Using A Model-Based Gene Clustering Algorithm for RNA-Seq Data

2020. doi:10.21203/rs.3.rs-86123/v1

Author(s): Takayuki Osabe, Kentaro Shimizu, Koji Kadota

Keyword(s): Differential Expression, Time Course, Clustering Algorithm, Expression Patterns, Differential Expression Analysis, Real Data, Gene Clustering, RNA-Seq, Model Based, Group Data

Abstract

Background: RNA-seq is a tool for measuring gene expression and is commonly used to identify differentially expressed genes (DEGs). Gene clustering is used to classify DEGs with similar expression patterns for the subsequent analyses of data from experiments such as time-courses or multi-group comparisons. However, gene clustering has rarely been used for analyzing simple two-group data or differential expression (DE). In this study, we report a model-based clustering algorithm, MBCluster.Seq, that can be implemented using an R package for DE analysis.

Results: MBCluster.Seq originally takes DEGs as its input data, whereas the proposed method (called MBCdeg) uses all genes for the analysis. The method ranks genes overall by their posterior probability of being assigned to a cluster that displays a non-DEG pattern. Using simulated and real data, we compared the performance of MBCdeg with conventional R packages specialized for DE analysis, such as edgeR, DESeq2, and TCC. Our results showed that MBCdeg outperformed other methods when the proportion of DEGs (PDEG) was less than 50%. However, DEG identification using MBCdeg was less consistent than with conventional methods. We compared the effects of different normalization algorithms using MBCdeg, and performed an analysis using MBCdeg in combination with a robust normalization algorithm (called DEGES) that was not implemented in MBCluster.Seq. The new analysis method showed greater stability than the original MBCdeg with the default normalization algorithm.

Conclusions: MBCdeg with DEGES normalization can be used in the identification of DEGs when the PDEG is relatively low. As the method is based on gene clustering, the DE result includes information on which expression pattern each gene belongs to. The new method may be useful for the analysis of time-course and multi-group data, where the classification of expression patterns is often required.

scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Bioinformatics, 2020, Vol 36 (10), pp. 3115-3123. doi:10.1093/bioinformatics/btaa097. Cited By ~ 3

Author(s): Teng Fei, Tianwei Yu

Keyword(s): Single Cell, Differential Expression Analysis, Distance Matrix, Real Data, R Package, Batch Effect, Supplementary Information, RNA-Seq, Sequencing Data, Gene Differential Expression

Abstract

Motivation: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data.

Results: We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods.

Availability and implementation: The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts.

Supplementary information: Supplementary data are available at Bioinformatics online.

Data-based RNA-seq Simulations by Binomial Thinning

2019. doi:10.1101/758524. Cited By ~ 1

Author(s): David Gerard

Keyword(s): Theoretical Model, Single Cell, Differential Expression Analysis, Simulated Data, Real Data, Theoretical Models, Simulation Method, R Package, RNA-Seq, Ideal Model

Abstract

With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations of theoretical models, this can result in unsubstantiated claims of a method’s performance. Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
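The idea of adding signal to real counts by binomial thinning can be illustrated with a minimal sketch. It assumes a genes-by-samples table of real counts and a two-group design, and it lowers counts in one group so that each affected gene ends up with a chosen log2 fold change. The function names `thin_count` and `thin_two_groups`, their parameters, and the toy counts are hypothetical illustrations; this is not the seqgendiff API.

```python
import random

def thin_count(count, keep_prob, rng):
    """Binomial thinning: keep each of `count` reads independently with probability keep_prob."""
    return sum(1 for _ in range(count) if rng.random() < keep_prob)

def thin_two_groups(counts, group, log2_fc, rng):
    """Add a two-group signal to real counts without ever increasing a count.

    counts  : list of per-gene lists of non-negative integer counts (genes x samples)
    group   : list of 0/1 labels, one per sample
    log2_fc : desired log2 fold change of group 1 over group 0, one per gene
    """
    thinned = []
    for gene_counts, effect in zip(counts, log2_fc):
        keep = 2.0 ** (-abs(effect))
        # Thinning can only lower counts, so thin the group that should end up lower.
        low_group = 0 if effect > 0 else 1
        row = [
            thin_count(c, keep, rng) if (effect != 0 and g == low_group) else c
            for c, g in zip(gene_counts, group)
        ]
        thinned.append(row)
    return thinned

rng = random.Random(0)
# Toy counts for illustration only; they are not taken from the paper.
counts = [[120, 95, 110, 105, 130, 99],
          [40, 52, 47, 45, 39, 50],
          [300, 280, 310, 295, 305, 290]]
group = [0, 0, 0, 1, 1, 1]
log2_fc = [1.0, -1.0, 0.0]  # gene 0 up in group 1, gene 1 down in group 1, gene 2 null
signal_added = thin_two_groups(counts, group, log2_fc, rng)
```

The design relies on a standard property: if a count is Poisson with mean λ, keeping each read with probability p gives a Poisson count with mean pλ, so the remaining features of the real data are left in place while the fold change is imposed.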

A semi-parametric Bayesian approach, iSBA, for differential expression analysis of RNA-seq data

2019. doi:10.1101/558270

Author(s): Ran Bi, Peng Liu

Keyword(s): Data Analysis, Differential Expression, Expression Analysis, Bayesian Approach, Dirichlet Process, Differential Expression Analysis, Real Data, RNA-Seq, Study Gene Expression, Bayesian Mixture

Abstract

RNA sequencing (RNA-seq) technologies have been widely applied to study gene expression in recent years. Identifying differentially expressed (DE) genes across treatments is one of the major steps in RNA-seq data analysis. Most differential expression analysis methods rely on parametric assumptions, and it is not guaranteed that these assumptions are appropriate for real data analysis. In this paper, we develop a semi-parametric Bayesian approach for differential expression analysis. More specifically, we model the RNA-seq count data with a Poisson-Gamma mixture model, and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior model for the distribution of fold changes between the two treatment means. We develop a Markov chain Monte Carlo (MCMC) posterior simulation using the Metropolis-Hastings algorithm to generate posterior samples for differential expression analysis while controlling the false discovery rate. Simulation results demonstrate that our proposed method outperforms other popular methods used for detecting DE genes.
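For readers unfamiliar with the Poisson-Gamma mixture, the standard identity below shows why it yields overdispersed (negative binomial) counts. The symbols $y$ (count), $\lambda$ (latent rate), $\mu$ (mean) and $r$ (shape) are illustrative choices made here; the paper's own parameterization may differ.

```latex
\begin{aligned}
y \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}\!\left(\text{shape } r,\ \text{scale } \mu / r\right), \\
P(y) &= \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!}\,
        \frac{\lambda^{r-1} e^{-\lambda r/\mu}}{\Gamma(r)\,(\mu/r)^{r}}\, d\lambda
      = \frac{\Gamma(y+r)}{y!\,\Gamma(r)}
        \left(\frac{r}{r+\mu}\right)^{r} \left(\frac{\mu}{r+\mu}\right)^{y}, \\
\mathbb{E}[y] &= \mu, \qquad \mathrm{Var}(y) = \mu + \frac{\mu^{2}}{r}.
\end{aligned}
```

The variance exceeds the Poisson variance $\mu$ by the term $\mu^{2}/r$, which vanishes as $r \to \infty$.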

MINTIE: identifying novel structural and splice variants in transcriptomes using RNA-seq data

2020. doi:10.1101/2020.06.03.131532

Author(s): Marek Cmero, Breon Schmidt, Ian J. Majewski, Paul G. Ekert, Alicia Oshlack, ...

Keyword(s): De Novo, Splice Variants, Lymphoblastic Leukemia, Differential Expression Analysis, Real Data, Tumour Suppressor Gene, RNA-Seq, Structural Variants, Transcriptional Variants, Tandem Duplications

Abstract

Genomic rearrangements can modify gene function by altering transcript sequences, and have been shown to be drivers in both cancer and rare diseases. Although there are now many methods to detect structural variants from Whole Genome Sequencing (WGS), RNA sequencing (RNA-seq) remains under-utilised as a technology for the detection of gene-altering structural variants. Calling fusion genes from RNA-seq data is well established, but other transcriptional variants such as fusions with novel sequence, tandem duplications, large insertions and deletions, and novel splicing are difficult to detect using existing approaches.

To identify all types of variants in transcriptomes, we developed MINTIE, an integrated pipeline for RNA-seq data. We take a reference-free approach, which combines de novo assembly of transcripts with differential expression analysis, to identify up-regulated novel variants in a case sample.

We validated MINTIE on simulated and real data sets and compared it with eight other approaches for finding novel transcriptional variants. We found MINTIE was able to detect all defined variant classes at high rates (>70%) while no other method was able to achieve this.

We applied MINTIE to RNA-seq data from a cohort of acute lymphoblastic leukemia (ALL) patient samples and identified several novel clinically relevant variants, including an unpartnered recurrent fusion involving the tumour suppressor gene RB1, and variants in ALL-associated genes: tandem duplications in IKZF1 and PAX5, and novel splicing in ETV6. We further demonstrate the utility of MINTIE to identify rare disease variants using RNA-seq, including the discovery of an inter-chromosomal translocation in the DMD gene in a patient with muscular dystrophy. We posit that MINTIE will be able to identify new disease variants across a range of cancers and other disease types.

Alternating EM algorithm for a bilinear model in isoform quantification from RNA-seq data

Bioinformatics, 2019. doi:10.1093/bioinformatics/btz640

Author(s): Wenjiang Deng, Tian Mou, Krishna R Kalari, Nifang Niu, Liewei Wang, ...

Keyword(s): Differential Expression Analysis, GC Content, Real Data, Joint Estimation, Supplementary Information, Design Matrix, RNA-Seq, Bilinear Model, Simplifying Assumptions, Correction Step

Abstract

Motivation: Estimation of isoform-level gene expression from RNA-seq data depends on simplifying assumptions, such as uniform read distribution, that are easily violated in real data. Such violations typically lead to biased estimates. Most existing methods provide one or more bias correction steps, which are based on biological considerations (such as GC content) and applied to each sample separately. The main problem is that not all biases are known.

Results: We have developed a novel method called XAEM based on a more flexible and robust statistical model. Existing methods are essentially based on a linear model Xβ, where the design matrix X is known and is computed based on the simplifying assumptions. In contrast, XAEM considers Xβ as a bilinear model with both X and β unknown. Joint estimation of X and β is made possible by a simultaneous analysis of multi-sample RNA-seq data. Compared to existing methods, XAEM automatically performs empirical correction of potentially unknown biases. We use an alternating expectation-maximization (AEM) algorithm, alternating between estimation of X and β. For speed, XAEM uses quasi-mapping for read alignment. Overall, XAEM performs favorably compared to recent advanced methods. For simulated datasets, XAEM obtains higher accuracy for multiple-isoform genes. In a differential-expression analysis of a real single-cell RNA-seq dataset, XAEM achieves substantially better rediscovery rates in independent validation sets.

Availability and implementation: The method and pipeline are implemented as a tool and freely available for use at http://fafner.meb.ki.se/biostatwiki/xaem/.

Supplementary information: Supplementary data are available at Bioinformatics online.
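To make the linear model Xβ concrete, the following is a minimal sketch of the fixed-X setting that the abstract attributes to existing methods. It is not XAEM, which additionally estimates X across samples. It assumes read counts per read class follow a Poisson distribution with mean (Xβ) and applies the standard EM update for that model. The function name `em_isoform_abundance`, the toy design matrix and the toy counts are hypothetical.

```python
def em_isoform_abundance(X, n, iters=200):
    """EM for read counts n[c] ~ Poisson(sum_i X[c][i] * beta[i]) with X known.

    X : list of rows; X[c][i] is the probability that a read from isoform i
        falls in read class c (a hypothetical, known design matrix)
    n : observed read count per class
    Returns beta, the estimated expected number of reads per isoform.
    """
    n_classes, n_iso = len(X), len(X[0])
    col_sum = [sum(X[c][i] for c in range(n_classes)) for i in range(n_iso)]
    beta = [sum(n) / n_iso] * n_iso  # uniform starting point
    for _ in range(iters):
        # Expected count per class under the current estimate.
        fitted = [sum(X[c][i] * beta[i] for i in range(n_iso)) for c in range(n_classes)]
        # Reassign observed reads to isoforms in proportion to their fitted share.
        beta = [
            beta[i] / col_sum[i]
            * sum(n[c] * X[c][i] / fitted[c] for c in range(n_classes) if fitted[c] > 0)
            for i in range(n_iso)
        ]
    return beta

# Toy gene with two isoforms: class 0 is unique to isoform A, class 1 is shared,
# class 2 is unique to isoform B. Counts are constructed as X applied to (100, 40).
X = [[0.5, 0.0],
     [0.5, 0.5],
     [0.0, 0.5]]
n = [50, 70, 20]
beta_hat = em_isoform_abundance(X, n)
```

When X is misspecified, for example because read coverage is not uniform, the same update converges to biased values of β, which is the problem that motivates treating X as unknown.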

dearseq: a variance component score test for RNA-seq differential analysis that effectively controls the false discovery rate

NAR Genomics and Bioinformatics, 2020, Vol 2 (4). doi:10.1093/nargab/lqaa093

Author(s): Marine Gauthier, Denis Agniel, Rodolphe Thiébaut, Boris P Hejblum

Keyword(s): False Discovery Rate, Statistical Power, Differential Expression Analysis, Score Test, Real Data, Differential Analysis, RNA-Seq, Data Set, Mathematical Proofs, False Discovery

Abstract

RNA-seq studies are growing in size and popularity. We provide evidence that the most commonly used methods for differential expression analysis (DEA) may yield too many false positive results in some situations. We present dearseq, a new method for DEA that controls the false discovery rate (FDR) without making any assumption about the true distribution of RNA-seq data. We show that dearseq controls the FDR while maintaining strong statistical power compared to the most popular methods. We demonstrate this behavior with mathematical proofs, simulations and a real data set from a study of tuberculosis, where our method produces fewer apparent false positives.

Differential expression analysis using a model-based gene clustering algorithm for RNA-seq data

BMC Bioinformatics, 2021, Vol 22 (1). doi:10.1186/s12859-021-04438-4

Author(s): Takayuki Osabe, Kentaro Shimizu, Koji Kadota

Keyword(s): Differential Expression, Time Course, Clustering Algorithm, Expression Patterns, Differential Expression Analysis, Real Data, Gene Clustering, RNA-Seq, Model Based, Group Data

Abstract

Background: RNA-seq is a tool for measuring gene expression and is commonly used to identify differentially expressed genes (DEGs). Gene clustering is used to classify DEGs with similar expression patterns for the subsequent analyses of data from experiments such as time-courses or multi-group comparisons. However, gene clustering has rarely been used for analyzing simple two-group data or differential expression (DE). In this study, we report that a model-based clustering algorithm implemented in an R package, MBCluster.Seq, can also be used for DE analysis.

Results: MBCluster.Seq originally takes DEGs as its input data, whereas the proposed method (called MBCdeg) uses all genes for the analysis. The method ranks genes overall by their posterior probability of being assigned to a cluster that displays a non-DEG pattern. Using simulated and real data, we compared the performance of MBCdeg with conventional R packages specialized for DE analysis, such as edgeR, DESeq2, and TCC. Our results showed that MBCdeg outperformed other methods when the proportion of DEGs (PDEG) was less than 50%. However, DEG identification using MBCdeg was less consistent than with conventional methods. We compared the effects of different normalization algorithms using MBCdeg, and performed an analysis using MBCdeg in combination with a robust normalization algorithm (called DEGES) that was not implemented in MBCluster.Seq. The new analysis method showed greater stability than the original MBCdeg with the default normalization algorithm.

Conclusions: MBCdeg with DEGES normalization can be used in the identification of DEGs when the PDEG is relatively low. As the method is based on gene clustering, the DE result includes information on which expression pattern each gene belongs to. The new method may be useful for the analysis of time-course and multi-group data, where the classification of expression patterns is often required.

dearseq: a variance component score test for RNA-Seq differential analysis that effectively controls the false discovery rate

2019. doi:10.1101/635714. Cited By ~ 1

Author(s): Marine Gauthier, Denis Agniel, Rodolphe Thiébaut, Boris P. Hejblum

Keyword(s): Statistical Power, Differential Expression Analysis, Score Test, Real Data, Differential Analysis, RNA-Seq, Data Set, Mathematical Proofs, False Discovery, Positive Results

Abstract

RNA-seq studies are growing in size and popularity. We provide evidence that the most commonly used methods for differential expression analysis (DEA) may yield too many false positive results in some situations. We present dearseq, a new method for DEA which controls the false discovery rate (FDR) without making any assumption about the true distribution of RNA-seq data. We show that dearseq controls the FDR while maintaining strong statistical power compared to the most popular methods. We demonstrate this behavior with mathematical proofs, simulations, and a real data set from a study of tuberculosis, where our method produces fewer apparent false positives.
