Partitioning RNAs by length improves transcriptome reconstruction from short-read RNA-seq data

Author(s):  
Francisca Rojas Ringeling ◽  
Shounak Chakraborty ◽  
Caroline Vissers ◽  
Derek Reiman ◽  
Akshay M. Patel ◽  
...  
iScience ◽  
2021 ◽  
pp. 102361
Author(s):  
Eliah G. Overbey ◽  
Amanda M. Saravia-Butler ◽  
Zhe Zhang ◽  
Komal S. Rathi ◽  
Homer Fogle ◽  
...  
Keyword(s):  
RNA-seq ◽  

2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Xueyi Dong ◽  
Luyi Tian ◽  
Quentin Gouil ◽  
Hasaru Kariyawasam ◽  
Shian Su ◽  
...  

Abstract: Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error rate and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 was analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.


2014 ◽  
Vol 30 (12) ◽  
pp. i274-i282 ◽  
Author(s):  
Pavankumar Videm ◽  
Dominic Rose ◽  
Fabrizio Costa ◽  
Rolf Backofen

PLoS ONE ◽  
2015 ◽  
Vol 10 (4) ◽  
pp. e0123730 ◽  
Author(s):  
Fenggang Li ◽  
Lixin Wang ◽  
Qingjing Lan ◽  
Hui Yang ◽  
Yang Li ◽  
...  

2018 ◽  
Author(s):  
Elena Bushmanova ◽  
Dmitry Antipov ◽  
Alla Lapidus ◽  
Andrey D. Prjibelski

Abstract. Summary: The ability to generate large RNA-seq datasets has led to the development of various reference-based and de novo transcriptome assemblers, each with its own strengths and limitations. While reference-based tools are widely used in transcriptomic studies, their application is limited to model organisms with finished and annotated genomes. De novo transcriptome reconstruction from short reads remains an open and challenging problem, which is complicated by varying expression levels across genes, alternative splicing and paralogous genes. In this paper we describe a novel transcriptome assembler called rnaSPAdes, which is developed on top of the SPAdes genome assembler and explores surprising computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-Seq datasets, and briefly highlight strong and weak points of different assemblers. Availability and implementation: rnaSPAdes is implemented in C++ and Python and is freely available at cab.spbu.ru/software/rnaspades/.


2020 ◽  
Author(s):  
Eliah G. Overbey ◽  
Amanda M. Saravia-Butler ◽  
Zhe Zhang ◽  
Komal S. Rathi ◽  
Homer Fogle ◽  
...  

Summary: With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility and reusability of pipeline data, to provide a template for data processing of future spaceflight-relevant datasets, and to encourage cross-analysis of data from other databases with the data available in GeneLab.


Author(s):  
Marine Guilcher ◽  
Arnaud Liehrmann ◽  
Chloé Seyman ◽  
Thomas Blein ◽  
Guillem Rigaill ◽  
...  

Plastid gene expression involves many post-transcriptional maturation steps, resulting in a complex transcriptome composed of multiple isoforms. Although short-read RNA-seq has considerably improved our understanding of the molecular mechanisms controlling these processes, it is unable to sequence full-length transcripts. This information is, however, crucial for understanding the interplay between the various steps of plastid gene expression. Here, the study of the Arabidopsis leaf plastid transcriptome using Nanopore sequencing showed that many splicing and editing events were not independent but co-occurring. For a given transcript, maturation events also appeared to be chronologically ordered, with splicing happening after most sites are edited.


2015 ◽  
Author(s):  
Brad Solomon ◽  
Carleton Kingsford

Enormous databases of short-read RNA-seq experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts. The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
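
The abstract above names the Sequence Bloom Tree but does not spell out its mechanics, so the following is only a toy sketch of the general idea, not the authors' C++ implementation. It assumes a binary tree in which each leaf holds a Bloom filter of one experiment's k-mers and each internal node holds the union of its children, and that a query descends only into nodes containing at least a fraction `theta` of the query's k-mers. The class and function names (`Bloom`, `Node`, `build`, `query`), the filter size, the hash scheme, and the `theta` threshold are all illustrative assumptions.

```python
import hashlib


class Bloom:
    """Toy Bloom filter stored as an integer bitmask (sizes are assumptions)."""

    def __init__(self, size=4096, hashes=2, bits=0):
        self.size, self.hashes, self.bits = size, hashes, bits

    def _positions(self, kmer):
        # Derive `hashes` bit positions from seeded SHA-256 digests.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits |= 1 << pos

    def __contains__(self, kmer):
        return all((self.bits >> pos) & 1 for pos in self._positions(kmer))

    def union(self, other):
        return Bloom(self.size, self.hashes, self.bits | other.bits)


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class Node:
    def __init__(self, bloom, name=None, left=None, right=None):
        self.bloom, self.name, self.left, self.right = bloom, name, left, right


def build(experiments, k):
    """Build a tree from a non-empty dict of experiment name -> list of reads."""
    nodes = []
    for name, reads in experiments.items():
        bloom = Bloom()
        for read in reads:
            for kmer in kmers(read, k):
                bloom.add(kmer)
        nodes.append(Node(bloom, name=name))
    # Pair neighbouring nodes level by level until one root remains.
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes) - 1, 2):
            a, b = nodes[i], nodes[i + 1]
            paired.append(Node(a.bloom.union(b.bloom), left=a, right=b))
        if len(nodes) % 2:
            paired.append(nodes[-1])  # carry the unpaired node upward
        nodes = paired
    return nodes[0]


def query(node, query_kmers, theta):
    """Return names of leaves whose filter holds >= theta of the query k-mers."""
    hits = sum(kmer in node.bloom for kmer in query_kmers)
    if hits < theta * len(query_kmers):
        return []  # prune this whole subtree
    if node.name is not None:
        return [node.name]
    return (query(node.left, query_kmers, theta)
            + query(node.right, query_kmers, theta))


root = build({"a": ["ACGTACGTAC"], "b": ["TTTTGGGGCC"],
              "c": ["ACGTACGTAC", "GGGGCCCC"]}, k=4)
found = query(root, kmers("ACGTACGT", 4), theta=1.0)
```

Because Bloom filters have no false negatives, experiments `a` and `c` are guaranteed to appear in `found`; a false positive for `b` is unlikely at this filter size but cannot be ruled out in principle.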

