EBSeq: improving mixing computations for multi-group differential expression analysis

Background Currently, quantitative RNA-Seq methods are pushed to work with increasingly small starting amounts of RNA that require PCR amplification to generate libraries. However, it is unclear how much noise or bias amplification introduces and how this affects the precision and accuracy of RNA quantification. To assess the effects of amplification, reads that originated from the same RNA molecule (PCR duplicates) need to be identified. Computationally, read duplicates are defined via their mapping position, which does not distinguish PCR duplicates from natural duplicates, which are bound to occur for highly transcribed RNAs. Hence, it is unclear how to treat duplicate reads and how important it is to reduce PCR amplification experimentally. Here, we generate and analyse RNA-Seq datasets that were prepared with three different protocols (Smart-Seq, TruSeq and UMI-seq). We find that a large fraction of computationally identified read duplicates can be explained by sampling and fragmentation bias. Consequently, the computational removal of duplicates does not improve accuracy, power or false discovery rates, but can actually worsen them. Even when duplicates are experimentally identified by unique molecular identifiers (UMIs), power and false discovery rate are only mildly improved. However, we do find that power does improve with fewer PCR amplification cycles across datasets and that early barcoding of samples and hence PCR amplification in one reaction can restore this loss of power. Conclusions Computational removal of read duplicates is not recommended for differential expression analysis. However, the pooling of samples as made possible by the early barcoding of the UMI-protocol leads to an appreciable increase in the power to detect differentially expressed genes.
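The distinction the abstract draws between position-based and UMI-based duplicate calls can be illustrated with a minimal Python sketch. The read layout `(chrom, pos, strand, umi)`, the function name `count_duplicates` and the toy reads are all assumptions made for illustration; they are not taken from the paper's pipeline.

```python
from collections import Counter

def count_duplicates(reads, use_umi=False):
    """Count reads flagged as duplicates.

    reads: iterable of (chrom, pos, strand, umi) tuples (hypothetical layout).
    Without UMIs, reads sharing (chrom, pos, strand) are called duplicates;
    with UMIs, they must also carry the same UMI.
    """
    groups = Counter(
        (chrom, pos, strand, umi) if use_umi else (chrom, pos, strand)
        for chrom, pos, strand, umi in reads
    )
    # Every read beyond the first in each group counts as a duplicate.
    return sum(n - 1 for n in groups.values())

# Toy reads (invented for illustration only).
reads = [
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "AACG"),  # same position and UMI: PCR duplicate
    ("chr1", 100, "+", "TTGC"),  # same position, different UMI: natural duplicate
    ("chr1", 250, "-", "GGTA"),
]

print(count_duplicates(reads))                # position only
print(count_duplicates(reads, use_umi=True))  # position plus UMI
```

In this toy input the position-only rule flags two reads while the UMI-aware rule flags one, which is the kind of over-calling for highly expressed genes that the abstract describes.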


Error estimates for the analysis of differential expression from RNA-seq count data

10.7287/peerj.preprints.400 ◽

2014 ◽

Author(s):

Conrad Burden ◽

Sumaira Qureshi ◽

Susan R Wilson

Keyword(s):

Differential Expression ◽

Count Data ◽

Statistical Models ◽

Full Range ◽

Synthetic Data ◽

Biological Data ◽

P Value ◽

Sequencing Data ◽

False Discovery Rates ◽

Poisson Data

A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression. We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is an add-on package to edgeR and DESeq which we introduce called Polyfit. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure. We find that the best performing package, in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, albeit with a very slow run time, is the QLSpline implementation of QuasiSeq. This finding holds provided the number of biological replicates in each condition is at least 4. The next best performing packages are edgeR and DESeq2. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of DESeq (for approximately 6 or more replicates per condition), making its performance comparable with that of edgeR and DESeq2 in our tests with synthetic data.
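For orientation, the Storey-style q-value calculation that Polyfit is said to adapt can be sketched as follows. This is a simplified fixed-λ version, not the Polyfit method itself and not the spline-smoothed π0 estimate of the full Storey-Tibshirani procedure; the function name `storey_qvalues`, the default `lam=0.5` and the example p-values are assumptions for illustration.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Estimate pi0 and q-values with a fixed tuning parameter lam.

    pi0 is the estimated proportion of true nulls, taken from the density
    of p-values above lam, where nulls are assumed roughly uniform.
    """
    m = len(pvalues)
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))

    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])

    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return pi0, qvalues

# Invented p-values, for illustration only.
pvals = [0.001, 0.01, 0.02, 0.6, 0.9, 0.3, 0.04, 0.8]
pi0, qvals = storey_qvalues(pvals)
print(pi0)
print(qvals)
```

The estimate of π0 relies on the null p-values being uniform above λ, which is exactly the assumption the abstract reports as frequently violated for overdispersed count models.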


Signatures and Prognostic Values of Related Immune Targets in Tongue Cancer

10.21203/rs.3.rs-997544/v1 ◽

2021 ◽

Author(s):

Xi Yu ◽

Xiaofei Lv

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Tongue Cancer ◽

Differential Expression Analysis ◽

Marker Genes ◽

Data Sets ◽

Expression Data ◽

Oral Cancers ◽

Limma Package ◽

Cancer Bioinformatics

Abstract Tongue cancer, one of the most malignant oral cancers, is highly invasive and has a high risk of recurrence. At present, tongue cancer is often not obvious before the advanced stage, so the opportunity for early diagnosis is easily missed. It is important to find markers that can predict the occurrence and progression of tongue cancer. Bioinformatics analysis plays an important role in the acquisition of marker genes. GEO and TCGA are very important public databases. In addition to expression data, the TCGA database also contains corresponding clinical data. In this study, we selected three GEO datasets (GSE13601, GSE34105 and GSE34106) that met the standard. These data sets were combined using the SVA package to prepare the data for differential expression analysis, and the LIMMA package was then used with thresholds of p<0.05 and |log2(FC)| ≥ 1.5. We obtained 170 DEGs (104 up-regulated, 66 down-regulated). In addition, the DEseq package was used for differential expression analysis of the samples in the TCGA database using the same criteria. This yielded 1589 DEGs (644 up-regulated, 945 down-regulated). By merging these two sets of DEGs, 5 common up-regulated DEGs (CCL20, SCG5, SPP1, KRT75 and FOLR3) and 15 common down-regulated DEGs were obtained. Further functional analysis of the DEGs showed that CCL20, SCG5 and SPP1 are closely related to prognosis and may be therapeutic targets of TSCC.
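The thresholding and merging steps described in this abstract can be sketched in Python as below. The thresholds (p<0.05, |log2(FC)| ≥ 1.5) come from the abstract; the function name `select_degs`, the dictionary layout and the placeholder genes with their values are hypothetical and stand in for the real limma and DEseq result tables.

```python
def select_degs(results, p_cut=0.05, lfc_cut=1.5):
    """Split genes passing both thresholds into up- and down-regulated sets.

    results: dict mapping gene name -> (log2 fold change, p-value)
    (a hypothetical layout, not the output format of any R package).
    """
    up, down = set(), set()
    for gene, (log2fc, p) in results.items():
        if p < p_cut and abs(log2fc) >= lfc_cut:
            (up if log2fc > 0 else down).add(gene)
    return up, down

# Placeholder results for two sources (values invented for illustration).
geo = {"geneA": (2.1, 0.001), "geneB": (-1.8, 0.01),
       "geneC": (1.6, 0.2), "geneD": (1.2, 0.001)}
tcga = {"geneA": (1.9, 0.003), "geneB": (-2.5, 0.0001),
        "geneC": (2.0, 0.01), "geneE": (-1.7, 0.02)}

geo_up, geo_down = select_degs(geo)
tcga_up, tcga_down = select_degs(tcga)

# Keep only genes that change in the same direction in both sources.
common_up = geo_up & tcga_up
common_down = geo_down & tcga_down
print(sorted(common_up), sorted(common_down))
```

Intersecting the up- and down-regulated sets separately keeps only genes whose direction of change agrees across both sources, which is how the abstract reports its common DEGs.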


Bias, robustness and scalability in differential expression analysis of single-cell RNA-seq data

10.1101/143289 ◽

2017 ◽

Cited By ~ 16

Author(s):

Charlotte Soneson ◽

Mark D. Robinson

Keyword(s):

Single Cell ◽

Differential Expression ◽

Statistical Methods ◽

Expression Analysis ◽

Method Development ◽

Differential Expression Analysis ◽

Data Sets ◽

Rna Seq ◽

Data Set ◽

Extensive Evaluation

Abstract Background As single-cell RNA-seq (scRNA-seq) is becoming increasingly common, the amount of publicly available data grows rapidly, generating a useful resource for computational method development and extension of published results. Although processed data matrices are typically made available in public repositories, the procedure to obtain these varies widely between data sets, which may complicate reuse and cross-data set comparison. Moreover, while many statistical methods for performing differential expression analysis of scRNA-seq data are becoming available, their relative merits and the performance compared to methods developed for bulk RNA-seq data are not sufficiently well understood. Results We present conquer, a collection of consistently processed, analysis-ready public single-cell RNA-seq data sets. Each data set has count and transcripts per million (TPM) estimates for genes and transcripts, as well as quality control and exploratory analysis reports. We use a subset of the data sets available in conquer to perform an extensive evaluation of the performance and characteristics of statistical methods for differential gene expression analysis, evaluating a total of 30 statistical approaches on both experimental and simulated scRNA-seq data. Conclusions Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.


Accurate quantification of circular RNAs identifies extensive circular isoform switching events

Nature Communications ◽

10.1038/s41467-019-13840-9 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 16

Author(s):

Jinyang Zhang ◽

Shuai Chen ◽

Jingwen Yang ◽

Fangqing Zhao

Keyword(s):

False Discovery Rate ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Circular Rnas ◽

Rna Seq ◽

Rnase R ◽

False Discovery ◽

Rrna Depletion ◽

Accurate Expression

Abstract Detection and quantification of circular RNAs (circRNAs) face several significant challenges, including high false discovery rate, uneven rRNA depletion and RNase R treatment efficiency, and underestimation of back-spliced junction reads. Here, we propose a novel algorithm, CIRIquant, for accurate circRNA quantification and differential expression analysis. By constructing pseudo-circular reference for re-alignment of RNA-seq reads and employing sophisticated statistical models to correct RNase R treatment biases, CIRIquant can provide more accurate expression values for circRNAs with significantly reduced false discovery rate. We further develop a one-stop differential expression analysis pipeline implementing two independent measures, which helps unveil the regulation of competitive splicing between circRNAs and their linear counterparts. We apply CIRIquant to RNA-seq datasets of hepatocellular carcinoma, and characterize two important groups of linear-circular switching and circular transcript usage switching events, which demonstrate the promising ability to explore extensive transcriptomic changes in liver tumorigenesis.


A Novel Scalable Signature Based Subspace Clustering Approach for Big Data

International Journal of Information Technology and Web Engineering ◽

10.4018/ijitwe.2019040103 ◽

2019 ◽

Vol 14 (2) ◽

pp. 41-51 ◽

Cited By ~ 1

Author(s):

T. Gayathri ◽

D. Lalitha Bhaskari

Keyword(s):

Big Data ◽

Data Management ◽

Clustering Algorithms ◽

Synthetic Data ◽

Subspace Clustering ◽

Distance Measures ◽

Data Sets ◽

Management Tools ◽

Clustering Approach ◽

Different Dimensions

“Big data”, as the name suggests, is a collection of large and complicated data sets which are usually hard to process with on-hand data management tools or other conventional processing applications. A scalable signature based subspace clustering approach is presented in this article that would avoid identification of redundant clusters. Various distance measures are utilized to perform experiments that validate the performance of the proposed algorithm. Also, for the same purpose of validation, the synthetic data sets that are chosen have different dimensions, and their size will be distributed when opened with Weka. The F1 quality measure and the runtime of these synthetic data sets are computed. The performance of the proposed algorithm is compared with other existing clustering algorithms such as CLIQUE, INSCY and SUNCLU.


Evaluating Statistical Methods Using Plasmode Data Sets in the Age of Massive Public Databases: An Illustration Using False Discovery Rates

PLoS Genetics ◽

10.1371/journal.pgen.1000098 ◽

2008 ◽

Vol 4 (6) ◽

pp. e1000098 ◽

Cited By ~ 23

Author(s):

Gary L. Gadbury ◽

Qinfang Xiang ◽

Lin Yang ◽

Stephen Barnes ◽

Grier P. Page ◽

...

Keyword(s):

Statistical Methods ◽

Data Sets ◽

False Discovery Rates ◽

False Discovery ◽

Discovery Rates


genesorteR: Feature Ranking in Clustered Single Cell Data

10.1101/676379 ◽

2019 ◽

Cited By ~ 5

Author(s):

Mahmoud M Ibrahim ◽

Rafael Kramann

Keyword(s):

Single Cell ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Large Cell ◽

R Package ◽

Marker Genes ◽

Data Sets ◽

Cell Type ◽

Cell Data

Abstract Marker genes identified in single cell experiments are expected to be highly specific to a certain cell type and highly expressed in that cell type. Detecting a gene by differential expression analysis does not necessarily satisfy those two conditions and is typically computationally expensive for large cell numbers. Here we present genesorteR, an R package that ranks features in single cell data in a manner consistent with the expected definition of marker genes in experimental biology research. We benchmark genesorteR using various data sets and show that it is distinctly more accurate in large single cell data sets compared to other methods. genesorteR is orders of magnitude faster than current implementations of differential expression analysis methods, can operate on data containing millions of cells and is applicable to both single cell RNA-Seq and single cell ATAC-Seq data. genesorteR is available at https://github.com/mahmoudibrahim/genesorteR.


Nonparametric expression analysis using inferential replicate counts

10.1101/561084 ◽

2019 ◽

Author(s):

Anqi Zhu ◽

Avi Srivastava ◽

Joseph G. Ibrahim ◽

Rob Patro ◽

Michael I. Love

Keyword(s):

False Discovery Rate ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Transcript Level ◽

Parametric Model ◽

Statistical Testing ◽

Rna Seq ◽

Nonparametric Models ◽

False Discovery

Abstract A primary challenge in the analysis of RNA-seq data is to identify differentially expressed genes or transcripts while controlling for technical biases present in the observations. Ideally, a statistical testing procedure should incorporate information about the inherent uncertainty of the abundance estimates, whether at the gene or transcript level, that arises from quantification of abundance. Most popular methods for RNA-seq differential expression analysis fit a parametric model to the counts or scaled counts for each gene or transcript, and a subset of methods can incorporate information about the uncertainty of the counts. Previous work has shown that nonparametric models for RNA-seq differential expression may in some cases have better control of the false discovery rate, and adapt well to new data types without requiring reformulation of a parametric model. Existing nonparametric models do not take into account the inferential uncertainty of the observations, leading to an inflated false discovery rate, in particular at the transcript level. Here we propose a nonparametric model for differential expression analysis using inferential replicate counts, extending the existing SAMseq method to account for inferential uncertainty, batch effects, and sample pairing. We compare our method, “SAMseq With Inferential Samples Helps”, or Swish, with popular differential expression analysis methods. Swish has improved control of the false discovery rate, in particular for transcripts with high inferential uncertainty. We apply Swish to a single-cell RNA-seq dataset, assessing sensitivity to recover DE genes between sub-populations of cells, and compare its performance to the Wilcoxon rank sum test.


Error estimates for the analysis of differential expression from RNA-seq count data

10.7287/peerj.preprints.400v3 ◽

2014 ◽

Author(s):

Conrad Burden ◽

Sumaira Qureshi ◽

Susan R Wilson

Keyword(s):

Differential Expression ◽

Count Data ◽

Statistical Models ◽

Full Range ◽

Synthetic Data ◽

Biological Data ◽

P Value ◽

Sequencing Data ◽

False Discovery Rates ◽

Poisson Data

A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression. We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is an add-on package to edgeR and DESeq which we introduce called Polyfit. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure. We find that the best performing package, in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, albeit with a very slow run time, is the QLSpline implementation of QuasiSeq. This finding holds provided the number of biological replicates in each condition is at least 4. The next best performing packages are edgeR and DESeq2. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of DESeq (for approximately 6 or more replicates per condition), making its performance comparable with that of edgeR and DESeq2 in our tests with synthetic data.
