scholarly journals Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment

2021 ◽  
Author(s):  
Anish M.S. Shrestha ◽  
Joyce Emlyn B. Guiao ◽  
Kyle Christian R. Santiago

AbstractRNA-seq is being increasingly adopted for gene expression studies in a panoply of non-model organisms, with applications spanning the fields of agriculture, aquaculture, ecology, and environment. Conventional differential expression analysis for organisms without reference sequences requires performing computationally expensive and error-prone de-novo transcriptome assembly, followed by homology search against a high-confidence protein database for functional annotation. We propose a shortcut, where we obtain counts for differential expression analysis by directly aligning RNA-seq reads to the protein database. Through experiments on simulated and real data, we show drastic reductions in run-time and memory usage, with no loss in accuracy. A Snakemake implementation of our workflow is available at:https://bitbucket.org/project_samar/samar

Author(s):  
Adam Voshall ◽  
Sairam Behera ◽  
Xiangjun Li ◽  
Xiao-Hong Yu ◽  
Kushagra Kapil ◽  
...  

AbstractSystems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high quality transcriptomes is still not a trivial problem. This is especially the case for non-model organisms where adequate reference genomes are often not available. Different methods produce different transcriptome models and there is no easy way to determine which are more accurate. Furthermore, having alternative splicing events could exacerbate such difficult assembly problems. While benchmarking transcriptome assemblies is critical, this is also not trivial due to the general lack of true reference transcriptomes. In this study, we provide a pipeline to generate a set of the benchmark transcriptome and corresponding RNAseq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches including genome-guided, de novo, and ensemble methods. The results showed that the assembly performance deteriorates significantly when the reference is not available from the same genome (for genome-guided methods) or when alternative transcripts (isoforms) exist. We demonstrated the value of consensus between de novo assemblers in transcriptome assembly. Leveraging the overlapping predictions between the four de novo assemblers, we further present ConSemble, a consensus-based de novo ensemble transcriptome assembly pipeline. Without using a reference genome, ConSemble achieved an accuracy up to twice as high as any de novo assemblers we compared. It matched or exceeded the best performing genome-guided assemblers even when the transcriptomes included isoforms. The RNAseq simulation pipeline, the benchmark transcriptome datasets, and the ConSemble pipeline are all freely available from: http://bioinfolab.unl.edu/emlab/consemble/.Author summaryObtaining the accurate representation of the gene expression is critical in many analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction. The state of the art high-throughput RNA-sequencing (RNAseq) technologies can be used to sequence the set of all transcripts in a cell, the transcriptome. Although many computational tools are available for transcriptome assembly from RNAseq data, assembling high-quality transcriptomes is difficult especially for non-model organisms. Different methods often produce different transcriptome models and there is no easy way to determine which are more accurate. In this study, we present an approach to evaluate transcriptome assembly performance using simulated benchmarking read sets. The results showed that the assembly performance of genome-guided assembly methods deteriorates significantly when the adequate reference genome is not available. The assembly performance of all methods is affected when alternative transcripts (isoforms) exist. We further demonstrated the value of consensus among assemblers in improving transcriptome assembly. Leveraging the overlapping predictions between the four de novo assemblers, we present ConSemble. Without using a reference genome, ConSemble achieved a much higher accuracy than any de novo assemblers we compared. It matched or exceeded the best performing genome-guided assemblers even when the transcriptomes included isoforms.


2017 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Katrijn De Paepe ◽  
Celine Everaert ◽  
Pieter Mestdagh ◽  
Olivier Thas ◽  
...  

ABSTRACTBackgroundProtein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels, and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data along with their normalization methods is comprehensively evaluated, with a particular focus on lncRNAs and low abundant mRNAs.ResultsThirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures are used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others.ConclusionOverall, the linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in a realistic clinical settings such as in cancer research. About half of the methods showed severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 654
Author(s):  
Margaux Haering ◽  
Bianca H Habermann

RNA sequencing (RNA-seq) is a widely adopted affordable method for large scale gene expression profiling. However, user-friendly and versatile tools for wet-lab biologists to analyse RNA-seq data beyond standard analyses such as differential expression, are rare. Especially, the analysis of time-series data is difficult for wet-lab biologists lacking advanced computational training. Furthermore, most meta-analysis tools are tailored for model organisms and not easily adaptable to other species. With RNfuzzyApp, we provide a user-friendly, web-based R shiny app for differential expression analysis, as well as time-series analysis of RNA-seq data. RNfuzzyApp offers several methods for normalization and differential expression analysis of RNA-seq data, providing easy-to-use toolboxes, interactive plots and downloadable results. For time-series analysis, RNfuzzyApp presents the first web-based, fully automated pipeline for soft clustering with the Mfuzz R package, including methods to aid in cluster number selection, cluster overlap analysis, Mfuzz loop computations, as well as cluster enrichments. RNfuzzyApp is an intuitive, easy to use and interactive R shiny app for RNA-seq differential expression and time-series analysis, offering a rich selection of interactive plots, providing a quick overview of raw data and generating rapid analysis results. Furthermore, its orthology assignment, enrichment analysis, as well as ID conversion functions are accessible to non-model organisms.


2019 ◽  
Vol 21 (1) ◽  
pp. 293 ◽  
Author(s):  
Giulio Ferrero ◽  
Nicola Licheri ◽  
Lucia Coscujuela Tarrero ◽  
Carlo De Intinis ◽  
Valentina Miano ◽  
...  

Recent improvements in cost-effectiveness of high-throughput technologies has allowed RNA sequencing of total transcriptomes suitable for evaluating the expression and regulation of circRNAs, a relatively novel class of transcript isoforms with suggested roles in transcriptional and post-transcriptional gene expression regulation, as well as their possible use as biomarkers, due to their deregulation in various human diseases. A limited number of integrated workflows exists for prediction, characterization, and differential expression analysis of circRNAs, none of them complying with computational reproducibility requirements. We developed Docker4Circ for the complete analysis of circRNAs from RNA-Seq data. Docker4Circ runs a comprehensive analysis of circRNAs in human and model organisms, including: circRNAs prediction; classification and annotation using six public databases; back-splice sequence reconstruction; internal alternative splicing of circularizing exons; alignment-free circRNAs quantification from RNA-Seq reads; and differential expression analysis. Docker4Circ makes circRNAs analysis easier and more accessible thanks to: (i) its R interface; (ii) encapsulation of computational tasks into docker images; (iii) user-friendly Java GUI Interface availability; and (iv) no need of advanced bash scripting skills for correct use. Furthermore, Docker4Circ ensures a reproducible analysis since all its tasks are embedded into a docker image following the guidelines provided by Reproducible Bioinformatics Project.


2021 ◽  
Author(s):  
Aedan G. K. Roberts ◽  
Daniel R. Catchpoole ◽  
Paul J. Kennedy

AbstractBackgroundDifferential expression analysis of RNA-seq data has advanced rapidly since the introduction of the technology, and methods such as edgeR and DESeq2 have become standard parts of analysis pipelines. However, there is a growing body of research showing that differences in variability of gene expression or overall differences in the distribution of expression values – differential distribution – are also important both in normal biology and in diseases including cancer. Genes whose expression differs in distribution without a difference in mean expression level are ignored by differential expression methods.ResultsWe have developed a Bayesian hierarchical model which improves on existing methods for identifying differential dispersion in RNA-seq data, and provides an overall test for differential distribution. We have applied these methods to investigate differential dispersion and distribution in cancer using RNA-seq datasets from The Cancer Genome Atlas. Our results show that differential dispersion and distribution are able to identify cancer-related genes. Further, we find that differential dispersion identifies cancer-related genes that are missed by differential expression analysis, and that differential expression and differential dispersion identify functionally distinct sets of genes.ConclusionThis work highlights the importance of considering changes beyond differences in mean in the analysis of gene expression data, and suggests that analysis of expression variability may provide insights into genetic aspects of cancer that would not be revealed by differential expression analysis alone. For identification of cancer-related genes, differential distribution analysis allows the identification of genes whose expression is disrupted in terms of either mean or variability.


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 654
Author(s):  
Margaux Haering ◽  
Bianca H Habermann

RNA sequencing (RNA-seq) is a widely adopted affordable method for large scale gene expression profiling. However, user-friendly and versatile tools for wet-lab biologists to analyse RNA-seq data beyond standard analyses such as differential expression, are rare. Especially, the analysis of time-series data is difficult for wet-lab biologists lacking advanced computational training. Furthermore, most meta-analysis tools are tailored for model organisms and not easily adaptable to other species. With RNfuzzyApp, we provide a user-friendly, web-based R shiny app for differential expression analysis, as well as time-series analysis of RNA-seq data. RNfuzzyApp offers several methods for normalization and differential expression analysis of RNA-seq data, providing easy-to-use toolboxes, interactive plots and downloadable results. For time-series analysis, RNfuzzyApp presents the first web-based, fully automated pipeline for soft clustering with the Mfuzz R package, including methods to aid in cluster number selection, cluster overlap analysis, Mfuzz loop computations, as well as cluster enrichments. RNfuzzyApp is an intuitive, easy to use and interactive R shiny app for RNA-seq differential expression and time-series analysis, offering a rich selection of interactive plots, providing a quick overview of raw data and generating rapid analysis results. Furthermore, its assignment of orthologs, enrichment analysis, as well as ID conversion functions are accessible to non-model organisms.


2021 ◽  
Author(s):  
Mengqi Zhang ◽  
Si Liu ◽  
Zhen Miao ◽  
Fang Han ◽  
Raphael Gottardo ◽  
...  

Bulk RNA-seq data quantify the expression of a gene in an individual by one number (e.g., fragment count). In contrast, single cell RNA-seq (scRNA-seq) data provide much richer information: the distribution of gene expression across many cells. To assess differential expression across individuals using scRNA-seq data, a straightforward solution is to create ''pseudo'' bulk RNA-seq data by adding up the fragment counts of a gene across cells for each individual, and then apply methods designed for differential expression using bulk RNA-seq data. This pseudo-bulk solution reduces the distribution of gene expression across cells to a single number and thus loses a good amount of information. We propose to assess differential expression using the gene expression distribution measured by cell level data. We find denoising cell level data can substantially improve the power of this approach. We apply our method, named IDEAS (Individual level Differential Expression Analysis for scRNA-seq), to study the gene expression difference between autism subjects and controls. We find neurogranin-expressing neurons harbor a high proportion of differentially expressed genes, and ERBB signals in microglia are associated with autism.


Sign in / Sign up

Export Citation Format

Share Document