MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

Andrea Hita; Gilles Brocart; Ana Fernandez; Marc Rehmsmeier; Anna Alemany; Sol Schvartzman

doi:10.1186/s12859-021-04544-3

BMC Bioinformatics ◽

10.1186/s12859-021-04544-3 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Andrea Hita ◽

Gilles Brocart ◽

Ana Fernandez ◽

Marc Rehmsmeier ◽

Anna Alemany ◽

...

Keyword(s):

Rna Sequencing ◽

Genomic Region ◽

Simultaneous Estimation ◽

Rna Seq ◽

Protein Coding ◽

Total Rna ◽

Simultaneous Study ◽

Downstream Analysis ◽

And Function ◽

Genomic Locations

Abstract. Background: Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. Yet, computational pipelines have traditionally focused on particular biotypes, making assumptions that are not fulfilled by total-RNA-seq datasets. Transcripts from distinct RNA biotypes vary in length, biogenesis, and function, can overlap in a genomic region, and may be present in the genome with a high copy number. Consequently, reads from total-RNA-seq libraries may produce ambiguous genomic alignments, which demand flexible quantification approaches. Results: Here we present Multi-Graph count (MGcount), a total-RNA-seq quantification tool combining two strategies for handling ambiguous alignments. First, MGcount assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position. Next, using a graph-based approach, MGcount aggregates RNA products with similar sequences to which reads systematically multi-map. MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, at both bulk and single-cell resolution, together with the graphs that model repeated transcript structures for different biotypes. The software can be used as a Python module or as a single-file executable program. Conclusions: MGcount is a flexible total-RNA-seq quantification tool that successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features. Its approach is suitable for the simultaneous estimation of protein-coding, long non-coding and small non-coding transcript concentration, in both precursor and processed forms. Both source code and compiled software are available at https://github.com/hitaandrea/MGcount.
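The two strategies named in the abstract above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the input layout, and the feature names are hypothetical and are not MGcount's API. The abstract does not say how the graph is partitioned, so the sketch groups features by plain connected components (union-find) as a stand-in for whatever graph method the tool applies.

```python
from collections import defaultdict


def assign_hierarchically(overlaps, priority=("small", "long")):
    """Keep only the features of the highest-priority RNA class a read overlaps.

    overlaps: list of (feature_name, rna_class) pairs for one read.
    The small-before-long order is an assumption based on the abstract.
    """
    for rna_class in priority:
        hits = [name for name, cls in overlaps if cls == rna_class]
        if hits:
            return hits
    return []


def aggregate_multimappers(read_hits):
    """Count reads per group of features linked by shared multi-mapping reads.

    read_hits: one list of feature names per read.
    Groups are connected components, a simplification (see lead-in).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Link every feature hit by the same read.
    for hits in read_hits:
        for name in hits:
            find(name)
        for name in hits[1:]:
            union(hits[0], name)

    # One count per read, credited to the group of its features.
    counts = defaultdict(int)
    for hits in read_hits:
        if hits:
            counts[find(hits[0])] += 1

    members = defaultdict(list)
    for name in parent:
        members[find(name)].append(name)
    return {"|".join(sorted(members[root])): n for root, n in counts.items()}


# Illustrative reads; feature names are placeholders.
reads = [
    [("MIR21", "small"), ("VMP1", "long")],        # overlap: small RNA wins
    [("VMP1", "long")],
    [("SNORD3A", "small"), ("SNORD3B", "small")],  # multi-mapping read
    [("SNORD3B", "small"), ("SNORD3C", "small")],  # multi-mapping read
    [],                                            # no overlapping feature
]
hits_per_read = [assign_hierarchically(r) for r in reads]
group_counts = aggregate_multimappers(hits_per_read)
```

In this toy input the three snoRNA copies end up in one group because multi-mapping reads connect them, so the group receives a single aggregated count instead of splitting or discarding the ambiguous reads.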

Emerging Roles of Estrogen-Regulated Enhancer and Long Non-Coding RNAs

International Journal of Molecular Sciences ◽

10.3390/ijms21103711 ◽

2020 ◽

Vol 21 (10) ◽

pp. 3711

Author(s):

Melina J. Sedano ◽

Alana L. Harrison ◽

Mina Zilaie ◽

Chandrima Das ◽

Ramesh Choudhari ◽

...

Keyword(s):

Rna Sequencing ◽

Expression Patterns ◽

Biological Significance ◽

Rna Seq ◽

Biological Functions ◽

Protein Coding ◽

Rna Molecules ◽

Non Coding Rna ◽

Genome Wide ◽

Non Coding Rnas

Genome-wide RNA sequencing has shown that only a small fraction of the human genome is transcribed into protein-coding mRNAs. While once thought to be “junk” DNA, recent findings indicate that the rest of the genome encodes many types of non-coding RNA molecules with a myriad of functions still being determined. Among the non-coding RNAs, long non-coding RNAs (lncRNA) and enhancer RNAs (eRNA) are found to be most copious. While their exact biological functions and mechanisms of action are currently unknown, technologies such as next-generation RNA sequencing (RNA-seq) and global nuclear run-on sequencing (GRO-seq) have begun deciphering their expression patterns and biological significance. In addition to their identification, it has been shown that the expression of long non-coding RNAs and enhancer RNAs can vary due to spatial, temporal, developmental, or hormonal variations. In this review, we explore newly reported information on estrogen-regulated eRNAs and lncRNAs and their associated biological functions to help outline their markedly prominent roles in estrogen-dependent signaling.

RNAseq by Total RNA Library Identifies Additional RNAs Compared to Poly(A) RNA Library

BioMed Research International ◽

10.1155/2015/862130 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9 ◽

Cited By ~ 13

Author(s):

Yan Guo ◽

Shilin Zhao ◽

Quanhu Sheng ◽

Mingsheng Guo ◽

Brian Lehmann ◽

...

Keyword(s):

Rna Sequencing ◽

Small Rnas ◽

Breast Cancer Cell Lines ◽

Rna Expression ◽

Library Preparation ◽

Sequencing Data ◽

Protein Coding ◽

Total Rna ◽

Capture Method ◽

Highly Correlated

The most popular RNA library used for RNA sequencing is the poly(A) captured RNA library. This library captures RNA based on the presence of poly(A) tails at the 3′ end. Another type of RNA library for RNA sequencing is the total RNA library, which differs from the poly(A) library in capture method and price. The total RNA library costs more, and its capture of RNA does not depend on the presence of poly(A) tails. In practice, only ribosomal RNAs and small RNAs are washed out in the total RNA library preparation. To evaluate the ability of both RNA libraries to detect RNA, we designed a study using RNA sequencing data from the same two breast cancer cell lines prepared with both libraries. We found that the RNA expression values captured by both RNA libraries were highly correlated. However, the number of RNAs captured was significantly higher for the total RNA library. Furthermore, we identified several subsets of protein-coding RNAs that were not captured efficiently by the poly(A) library. One of the most noticeable is the histone-encoding genes, which lack the poly(A) tail.

Comparative analysis of RNA enrichment methods for preparation of Cryptococcus neoformans RNA sequencing libraries

10.1101/2021.03.01.433483 ◽

2021 ◽

Author(s):

Calla L. Telzrow ◽

Paul J. Zwack ◽

Shannon Esher Righi ◽

Fred S. Dietrich ◽

Cliburn Chan ◽

...

Keyword(s):

Cryptococcus Neoformans ◽

Rna Sequencing ◽

Rnase H ◽

Rrna Genes ◽

Rna Seq ◽

Protein Coding ◽

Expression Levels ◽

Non Coding Rna ◽

Long Non Coding Rna ◽

Enrichment Methods

Abstract: Ribosomal RNA (rRNA) is the major RNA constituent of cells, therefore most RNA sequencing (RNA-Seq) experiments involve removal of rRNA. This process, called RNA enrichment, is done primarily to reduce cost: without rRNA removal, deeper sequencing would need to be performed to balance the sequencing reads wasted on rRNA. The ideal RNA enrichment method would remove all rRNA without affecting other RNA in the sample. We have tested the performance of three RNA enrichment methods on RNA isolated from Cryptococcus neoformans, a fungal pathogen of humans. We show that the RNase H depletion method unambiguously outperforms the commonly used Poly(A) isolation method: the RNase H method more efficiently depletes rRNA while more accurately recapitulating the expression levels of other RNA observed in an unenriched “gold standard”. The RNase H depletion method is also superior to the Ribo-Zero depletion method as measured by rRNA depletion efficiency and recapitulation of protein-coding gene expression levels, while the Ribo-Zero depletion method performs moderately better in preserving non-coding RNA (ncRNA). Finally, we have leveraged this dataset to identify novel long non-coding RNA (lncRNA) genes and to accurately map the C. neoformans mitochondrial rRNA genes.

Article summary: We compare the efficacy of three different RNA enrichment methods for RNA-Seq in Cryptococcus neoformans: RNase H depletion, Ribo-Zero depletion, and Poly(A) isolation. We show that the RNase H depletion method, which is evaluated in C. neoformans samples for the first time here, is highly efficient and specific in removing rRNA. Additionally, using data generated through these analyses, we identify novel long non-coding RNA genes in C. neoformans. We conclude that RNase H depletion is an effective and reliable method for preparation of C. neoformans RNA-Seq libraries.

Targeted enrichment outperforms other enrichment techniques and enables more multi-species RNA-Seq analyses

10.1101/258640 ◽

2018 ◽

Author(s):

Matthew Chung ◽

Laura Teigen ◽

Hong Liu ◽

Silvia Libro ◽

Amol Shetty ◽

...

Keyword(s):

Systematic Bias ◽

Rna Seq ◽

Protein Coding ◽

Total Rna ◽

Ratio Difference ◽

Fold Enrichment ◽

Agilent Sureselect ◽

Positive Linear Correlation ◽

Enrichment Techniques ◽

Targeted Enrichment

Abstract: Enrichment methodologies enable analysis of minor members in multi-species transcriptomic analyses. We compared standard enrichment of bacterial and eukaryotic mRNA to targeted enrichment with Agilent SureSelect (AgSS) capture for Brugia malayi, Aspergillus fumigatus, and the Wolbachia endosymbiont of B. malayi (wBm). Without introducing significant systematic bias, the AgSS quantitatively enriched samples, resulting in more reads mapping to the target organism. The AgSS-enriched libraries consistently had a positive linear correlation with their unenriched counterparts (r² = 0.559-0.867). Up to a 2,242-fold enrichment of RNA from the target organism was obtained following a power law (r² = 0.90), with the greatest fold enrichment achieved in samples with the largest ratio difference between the major and minor members. While using a single total library for prokaryote and eukaryote in a single sample could be beneficial for samples where RNA is limiting, we observed a decrease in reads mapping to protein-coding genes and an increase of multi-mapping reads to rRNAs in AgSS enrichments from eukaryotic total RNA libraries as opposed to eukaryotic poly(A)-enriched libraries. Our results support a recommendation of using Agilent SureSelect targeted enrichment on poly(A)-enriched libraries for eukaryotic captures and total RNA libraries for prokaryotic captures to increase the robustness of multi-species transcriptomic studies.

Identification, annotation and visualisation of extreme changes in splicing from RNA-seq experiments with SwitchSeq

10.1101/005967 ◽

2014 ◽

Cited By ~ 6

Author(s):

Mar Gonzàlez-Porta ◽

Alvis Brazma

Keyword(s):

Enrichment Analysis ◽

R Package ◽

Third Party ◽

Pathway Enrichment Analysis ◽

Rna Seq ◽

Differential Splicing ◽

Protein Coding ◽

Downstream Analysis ◽

Intuitive Manner ◽

Abundant Transcript

In recent years, RNA sequencing has become the method of choice for the study of transcriptome composition. When working with this type of data, several tools exist to quantify differences in splicing across conditions and to assess the significance of those changes. However, the number of genes predicted to undergo differential splicing is often high, and further interpretation of the results becomes a challenging task. Here we present SwitchSeq, a novel set of tools designed to help users interpret differential splicing events that affect protein-coding genes. More specifically, we provide a framework to identify switch events, i.e., cases where, for a given gene, the identity of the most abundant transcript changes across conditions. The identified events are then annotated by incorporating information from several public databases and third-party tools, and are further visualised in an intuitive manner with the independent R package tviz. All the results are displayed in a self-contained HTML document, and are also stored in txt and json format to facilitate integration with any further downstream analysis tools. Such an analysis approach can be used as a complement to Gene Ontology and pathway enrichment analysis, and can also serve as an aid in the validation of predicted changes in mRNA and protein abundance. The latest version of SwitchSeq, including installation instructions and use cases, can be found at https://github.com/mgonzalezporta/SwitchSeq. Additionally, the plot capabilities are provided as an independent R package at https://github.com/mgonzalezporta/tviz.
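The definition of a switch event given in the abstract above (the most abundant transcript of a gene changes across conditions) can be sketched in a few lines of Python. This is only an illustration of the definition, not SwitchSeq's code: the function name, the nested-dictionary input, and the gene and transcript labels are all hypothetical.

```python
def find_switch_events(expression):
    """Return genes whose most abundant transcript differs between conditions.

    expression: {gene: {condition: {transcript: abundance}}} (assumed layout).
    Ties are resolved by dictionary order, a simplification of this sketch.
    """
    switches = {}
    for gene, by_condition in expression.items():
        # Dominant transcript per condition, skipping empty conditions.
        dominant = {
            condition: max(abundances, key=abundances.get)
            for condition, abundances in by_condition.items()
            if abundances
        }
        if len(set(dominant.values())) > 1:
            switches[gene] = dominant
    return switches


# Placeholder abundances for two hypothetical genes.
expression = {
    "geneA": {
        "control": {"tx1": 10.0, "tx2": 3.0},
        "treated": {"tx1": 2.0, "tx2": 8.0},
    },
    "geneB": {
        "control": {"tx3": 5.0, "tx4": 1.0},
        "treated": {"tx3": 9.0, "tx4": 2.0},
    },
}
switch_events = find_switch_events(expression)
```

Here only geneA is reported, because its dominant transcript changes from tx1 to tx2, while geneB changes in overall level but keeps the same dominant transcript.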

MAJIQ-SPEL: Web-Tool to interrogate classical and complex splicing variations from RNA-Seq data

10.1101/136077 ◽

2017 ◽

Author(s):

Christopher J. Green ◽

Matthew R. Gazzara ◽

Yoseph Barash

Keyword(s):

Alternative Splicing ◽

Rna Sequencing ◽

Experimental Validation ◽

Ucsc Genome Browser ◽

Rna Seq ◽

Web Tool ◽

Rt Pcr ◽

Design Algorithm ◽

Gene Isoforms ◽

Downstream Analysis

Abstract: Analysis of RNA sequencing (RNA-Seq) data has highlighted the fact that most genes undergo alternative splicing (AS) and that these patterns are tightly regulated. Many of these events are complex, resulting in numerous possible isoforms that quickly become difficult to visualize, interpret, and experimentally validate. To address these challenges, we developed MAJIQ-SPEL, a web tool that takes as input local splicing variations (LSVs) quantified from RNA-Seq data and provides users with visualization and quantification of the gene isoforms associated with them. Importantly, MAJIQ-SPEL is able to handle both classical (binary) and complex (non-binary) splicing variations. Using a matching primer design algorithm, it also suggests possible primers for experimental validation by RT-PCR and displays them, along with the matching protein domains affected by the LSV, on the UCSC Genome Browser for further downstream analysis. Availability: Program and code will be available at http://majiq.biociphers.org/majiq-spel

Comparative analysis of RNA enrichment methods for preparation of Cryptococcus neoformans RNA sequencing libraries

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab301 ◽

2021 ◽

Author(s):

Calla L Telzrow ◽

Paul J Zwack ◽

Shannon Esher Righi ◽

Fred S Dietrich ◽

Cliburn Chan ◽

...

Keyword(s):

Cryptococcus Neoformans ◽

Rna Sequencing ◽

Rnase H ◽

Rrna Genes ◽

Rna Seq ◽

Protein Coding ◽

Mitochondrial Rrna ◽

Non Coding Rna ◽

Rrna Depletion ◽

Enrichment Methods

Abstract RNA sequencing (RNA-Seq) experiments focused on gene expression involve removal of ribosomal RNA (rRNA) because it is the major RNA constituent of cells. This process, called RNA enrichment, is done primarily to reduce cost: without rRNA removal, deeper sequencing must be performed to compensate for the sequencing reads wasted on rRNA. The ideal RNA enrichment method removes all rRNA without affecting other RNA in the sample. We tested the performance of three RNA enrichment methods on RNA isolated from Cryptococcus neoformans, a fungal pathogen of humans. We find that the RNase H depletion method is more efficient in depleting rRNA and more specific in recapitulating non-rRNA levels present in unenriched controls than the commonly-used Poly(A) isolation method. The RNase H depletion method is also more effective than the Ribo-Zero depletion method as measured by rRNA depletion efficiency and recapitulation of protein-coding RNA levels present in unenriched controls, while the Ribo-Zero depletion method more closely recapitulates annotated non-coding RNA (ncRNA) levels. Finally, we leverage these data to accurately map the C. neoformans mitochondrial rRNA genes, and also demonstrate that RNA-Seq data generated with the RNase H and Ribo-Zero depletion methods can be used to explore novel C. neoformans long non-coding RNA genes.

Performance assessment of total RNA sequencing of human biofluids and extracellular vesicles

10.1101/701524 ◽

2019 ◽

Author(s):

Celine Everaert ◽

Hetty Helsmoortel ◽

Anneleen Decock ◽

Eva Hulstaert ◽

Ruben Van Paemel ◽

...

Keyword(s):

Rna Sequencing ◽

Extracellular Vesicles ◽

Platelet Rich Plasma ◽

Rna Seq ◽

Total Rna ◽

Rna Molecules ◽

Rna Profiling ◽

Wide Range ◽

Read Distribution ◽

Free Plasma

Abstract: RNA profiling has emerged as a powerful tool to investigate the biomarker potential of human biofluids. However, despite enormous interest in extracellular nucleic acids, RNA sequencing methods to quantify the total RNA content outside cells are rare. Here, we evaluate the performance of the SMARTer Stranded Total RNA-Seq method in human platelet-rich plasma, platelet-free plasma, urine, conditioned medium, and extracellular vesicles (EVs) from these biofluids. We found the method to be accurate, precise, compatible with low-input volumes and able to quantify a few thousand genes. We picked up distinct classes of RNA molecules, including mRNA, lncRNA, circRNA, miscRNA and pseudogenes. Notably, the read distribution and gene content drastically differ among biofluids. In conclusion, we are the first to show that the SMARTer method can be used for unbiased unraveling of the complete transcriptome of a wide range of biofluids and their extracellular vesicles.

The Complex Transcriptional Landscape of the Human Platelet

Blood ◽

10.1182/blood.v120.21.390.390 ◽

2012 ◽

Vol 120 (21) ◽

pp. 390-390

Author(s):

Paul F. Bray ◽

Steven E. McKenzie ◽

Leonard C. Edelstein ◽

Srikanth Nagalla ◽

Kathleen Delgrosso ◽

...

Keyword(s):

Gene Expression ◽

Molecular Mechanisms ◽

Conflicts Of Interest ◽

Normal Population ◽

Rna Stability ◽

Protein Translation ◽

Rna Seq ◽

Antisense Transcripts ◽

Protein Coding ◽

Total Rna

Abstract 390. A conspicuous lesson that has emerged from the 1000 Genomes Project is that genetic variation in the population is greater than previously appreciated. Transcriptomics is rapidly assuming a prominent role in the understanding of basic molecular mechanisms accounting for variation within the normal population and disease states. Besides protein-coding RNAs, the importance of non-coding RNAs (ncRNAs), primarily as regulators of gene expression, is well recognized but largely unexplored. The platelet transcriptome reflects megakaryocyte RNA content at the time of proplatelet release, subsequent splicing events, selective packaging and platelet RNA stability. An accurate understanding of the platelet transcriptome has both biological (improved understanding of platelet protein translation and the mechanisms of megakaryocyte/platelet gene expression) and clinical (novel biomarkers of disease) relevance. We carried out transcriptome sequencing of total RNA isolated from leukocyte-depleted platelet preparations from four healthy adults using an AB/LT SOLiD™ system. For each individual, we constructed 3 libraries: a) long (≥ 40 nucleotides) total RNA, b) long RNA depleted of rRNA, and c) short (< 40 nucleotides) RNA. ∼1 billion reads from the 12 datasets were mapped on each chromosome and strand of the human genome. About one-third mapped uniquely, similar to other unbiased methods like SAGE. Normalizing for transcript length and scale using β-actin expression level provided the ability to appropriately scale expression within a read-set and to compare expression levels across read-sets. Of the known protein-coding loci, ∼9,500 were present in human platelets. Plotting the number of protein-coding genes as a function of the level of normalized expression underscored different gene estimates between total and rRNA-depleted RNA preparations, and substantial inter-individual variation in the less abundant genes.
RT-PCR validated the RNA-seq estimates of transcript levels exhibiting a range of >3 orders of magnitude of normalized read counts (r=0.7757; p=0.0001). A strong correlation was measured between mRNAs identified by RNA-seq and 3 published microarray datasets for well-expressed mRNAs, although RNA-seq identified many more transcripts of lower abundance. Unexpectedly, ribosomal RNA depletion significantly and adversely affected estimates of the relative abundance of transcripts including members of the RNA interference pathway DGCR8, DROSHA, XPO5, DICER1, EIF2C1-4, which exhibited large differences (up to 32-fold) between the total and rRNA-depleted preparations. A rigorous and highly stringent approach identified bona fide intronic regions that gave rise to 6,992 and 1,236 currently uncharacterized long and short RNA transcripts, respectively. We discovered numerous previously unreported antisense transcripts: 1) to known protein-coding regions of the genome, 2) 10 miRNA precursors where each locus generated 1–2 distinct antisense transcripts, presumably mature and “star” miRNAs, and 3) long and short RNAs antisense to several known repeat families. We did not observe enrichment of long-intergenic ncRNAs. We considered various possible explanations for the ∼60% of sequence reads that could not be mapped on the genome. Much more lenient parameter settings accounted for only ∼6.5% of sequenced reads. An even smaller fraction of reads was observed when considering all possible combinations of exon-exon junctions in the genome (12,382,819 junctions) and the highly polymorphic HLA region of chr 6, indicating these did not contribute in any substantive manner to the platelet transcriptome. Lastly, RNA-seq was highly reproducible (>97 for 1 subject studied on 4 occasions). In summary, our work reveals a richness and diversity of platelet RNA molecules, suggesting a context where platelet biology transcends protein- and mRNA-centric descriptions.
We will provide a publicly available web tool of these data embedded in a local mirror of the UCSC genome browser, facilitating the elucidation of previously unappreciated molecular species and molecular interactions. This will eventually permit an improved understanding of the molecular mechanisms that regulate platelet physiology and that contribute to disorders of thrombosis, hemostasis and inflammation. Disclosures: No relevant conflicts of interest to declare.
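The normalisation mentioned in the abstract above (correcting for transcript length and scaling to the β-actin expression level) could be sketched as follows. The abstract does not give the formula, so this is one plausible reading offered as an assumption: read density (reads per nucleotide) of each gene divided by the read density of a reference gene. The function name, the `ACTB` and `GENE_X` labels, and the numbers are placeholders, not data from the study.

```python
def normalise_to_reference(read_counts, lengths, reference="ACTB"):
    """Length-normalise read counts and scale them to a reference gene.

    read_counts: {gene: mapped reads}; lengths: {gene: transcript length in nt}.
    Assumed formula: (reads / length) / (reference reads / reference length).
    """
    reference_density = read_counts[reference] / lengths[reference]
    if reference_density == 0:
        raise ValueError("reference gene has no reads; cannot scale")
    return {
        gene: (read_counts[gene] / lengths[gene]) / reference_density
        for gene in read_counts
    }


# Placeholder values chosen so the arithmetic is easy to follow.
read_counts = {"ACTB": 2000, "GENE_X": 500}
lengths = {"ACTB": 2000, "GENE_X": 1000}
normalised = normalise_to_reference(read_counts, lengths)
```

Under this reading the reference gene always scores 1.0, so values are comparable both within a read-set and across read-sets that share the same reference.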

Characterization of the Short RNA Transcriptome of the Anucleate Platelet

Blood ◽

10.1182/blood.v124.21.4990.4990 ◽

2014 ◽

Vol 124 (21) ◽

pp. 4990-4990

Author(s):

Eric R. Londin ◽

Phillipe Loher ◽

Leonard C. Edelstein ◽

Kathy Delgrosso ◽

Paolo M. Fortina ◽

...

Keyword(s):

Molecular Mechanisms ◽

Conflicts Of Interest ◽

Critical Role ◽

Differentially Expressed ◽

Rna Seq ◽

Ensembl Database ◽

Protein Coding ◽

Repeat Elements ◽

Short Rnas ◽

Genomic Locations

Abstract. The anucleate platelets play a critical role in the formation of thrombi and prevention of bleeding. In recent years, next-generation RNA sequencing (RNA-seq) has proven very useful in shedding light on the specifics of the platelet transcriptome. For example, RNA-seq of the long RNAs in platelets has revealed many non-coding RNAs (ncRNAs) as well as a diverse set of protein-coding genes whose mRNAs are highly correlated amongst individuals but only weakly linked to the currently available platelet proteome. By comparison, the short RNA transcriptome has not been as thoroughly characterized. As a matter of fact, these studies have so far focused on the hundreds of microRNAs (miRNAs) that are present in platelets, leaving large swaths of the short RNA-ome uncharacterized. To gauge the complexity of the platelet short RNA-ome we performed short RNA-seq of leukocyte-depleted platelets from 10 healthy males (5 white and 5 black). The sequencing was done on the SOLiD 5500 XL platform and generated over 1.5 billion sequenced reads. To comprehensively characterize the complete short RNA-ome we only considered sequence reads that mapped on the genome without any mismatches but allowed a read to map to as many as 10,000 locations within the genome. This approach gave us the ability to simultaneously examine both the uniquely-present and the repeat-derived expressed elements of the genome. Using this approach, we were able to map ~50% of the sequenced reads. We found that for ~55% of the mapped reads their sequences are present at multiple genomic locations whereas the remaining ~45% originated from unique locations. Of the RNAs with unique genomic origins: ~50% correspond to miRNAs (with miR-223-3p being the most abundant miRNA across all 10 individuals), ~20% originate from various classes of repeat elements, and the remaining ~30% correspond to regions of the genome that were not annotated as of Release 75 of the ENSEMBL database.
By comparison, of the RNAs with ambiguous genomic origins: ~20% belong to miRNAs (with miR-103a-3p, a miRNA present in two locations in the genome, being the most abundant miRNA across all 10 individuals) and ~60% correspond to various classes of repeat elements (with members of the HY4 scRNA ncRNAs accounting for nearly a third of all sequence reads). These findings make it evident that the platelet transcriptome has a considerable richness in short RNAs that arise from repetitive elements. To further characterize those RNAs that map to regions of the genome that are not currently annotated, we considered the possibility that they may be novel miRNAs. Using the miRDeep2 algorithm, we sought novel miRNAs among the uncharacterized transcripts and identified 47 of them; the sequences for 18 of these 47 appear at multiple genomic locations in analogy to miR-103/107, miR-19a/19b, etc. Lastly, as our ten samples represented two races, we hypothesized that a subset of the identified sequences would be differentially expressed between the two groups. Using DESeq2, we identified over 157 sequences to be differentially expressed. The most highly differentially expressed sequences corresponded to a miRNA and a repeat element. In summary, our RNA-seq analyses have revealed a very diverse spectrum of platelet short RNAs that transcends the miRNA category. Indeed, we find that short transcripts that have their source in genomic loci that have not been previously discussed or analyzed in the platelet context represent a very significant portion of all short RNAs in platelets. This in turn highlights an unanticipated richness, and presumably commensurate complexity, for the platelet transcriptome. While the role of these novel non-protein coding short RNAs is currently unknown it is expected that at least some of them may be of functional significance. 
Consequently, they could contribute to processes beyond thrombosis and hemostasis and may permit a better understanding of the molecular mechanisms that regulate platelet physiology. Disclosures No relevant conflicts of interest to declare.
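The read bookkeeping described in the abstract above (reads kept if they map to at most 10,000 genomic locations, then split into uniquely mapped and multi-mapped) can be illustrated with a small Python sketch. The function name and input layout are assumptions for illustration, and the example numbers are invented; the sketch does not reproduce the study's pipeline or its exact-match alignment step.

```python
def summarise_mapping(hit_counts, max_locations=10_000):
    """Summarise unique versus multi-mapped reads.

    hit_counts: number of exact-match genomic locations per sequenced read
    (0 means unmapped). Reads above max_locations are treated as unmapped,
    mirroring the cap described in the abstract.
    """
    total = len(hit_counts)
    mapped = [n for n in hit_counts if 1 <= n <= max_locations]
    unique = sum(1 for n in mapped if n == 1)
    multi = len(mapped) - unique
    return {
        "mapped_fraction": len(mapped) / total if total else 0.0,
        "unique_fraction_of_mapped": unique / len(mapped) if mapped else 0.0,
        "multi_fraction_of_mapped": multi / len(mapped) if mapped else 0.0,
    }


# Invented example: eight reads with their number of genomic hits.
hit_counts = [0, 1, 1, 3, 20_000, 2, 0, 0]
summary = summarise_mapping(hit_counts)
```

Keeping multi-mapped reads in a separate tally, instead of discarding them, is what allows repeat-derived short RNAs to be examined alongside uniquely located ones.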
