Revealing missing isoforms encoded in the human genome by integrating genomic, transcriptomic and proteomic data

2014 ◽  
Author(s):  
Zhiqiang Hu ◽  
Hamish S. Scott ◽  
Guangrong Qin ◽  
Guangyong Zheng ◽  
Xixia Chu ◽  
...  

Biological and biomedical research relies on comprehensive understanding of protein-coding transcripts. However, the total number of human proteins is still unknown due to the prevalence of alternative splicing and is much larger than the number of human genes. In this paper, we detected 31,566 novel transcripts with coding potential by filtering our ab initio predictions with 50 RNA-seq datasets from diverse tissues/cell lines. PCR followed by MiSeq sequencing showed that at least 84.1% of these predicted novel splice sites could be validated. In contrast to known transcripts, the expression of these novel transcripts was highly tissue-specific. Based on these novel transcripts, at least 36 novel proteins were detected from shotgun proteomics data of 41 breast samples. We also showed that L1 retrotransposons have a more significant impact on the origin of new transcripts/genes than previously thought. Furthermore, we found that alternative splicing is extraordinarily widespread for genes involved in specific biological functions like protein binding, nucleoside binding, neuron projection, membrane organization and cell adhesion. In the end, the total number of human transcripts with protein-coding potential was estimated to be at least 204,950.

Reproduction ◽  
2020 ◽  
Vol 159 (1) ◽  
pp. 15-26 ◽  
Author(s):  
Jaya Gamble ◽  
Joel Chick ◽  
Kelly Seltzer ◽  
Joel H Graber ◽  
Steven Gygi ◽  
...  

The testis transcriptome is exceptionally complex. Despite its complexity, previous testis transcriptome analyses relied on a reductive method for transcript identification, thus underestimating transcriptome complexity. We describe here a more complete testis transcriptome generated by combining Tuxedo, a reductive method, and spliced-RUM, a combinatorial transcript-building approach. Forty-two percent of the expanded testis transcriptome is composed of unannotated RNAs with novel isoforms of known genes and novel genes constituting 78 and 9.8% of the newly discovered transcripts, respectively. Across tissues, novel transcripts were predominantly expressed in the testis with the exception of novel isoforms which were also highly expressed in the adult ovary. Within the testis, novel isoform expression was distributed equally across all cell types while novel genes were predominantly expressed in meiotic and post-meiotic germ cells. The majority of novel isoforms retained their protein-coding potential while most novel genes had low protein-coding potential. However, a subset of novel genes had protein-coding potentials equivalent to known protein-coding genes. Shotgun mass spectrometry of round spermatid total protein identified unique peptides from four novel genes along with seven annotated non-coding RNAs. These analyses demonstrate the testis expresses a wide range of novel transcripts that give rise to novel proteins.


2019 ◽  
Author(s):  
Thomas F. Martinez ◽  
Qian Chu ◽  
Cynthia Donaldson ◽  
Dan Tan ◽  
Maxim N. Shokhirev ◽  
...  

Protein-coding small open reading frames (smORFs) are emerging as an important class of genes; however, the coding capacity of smORFs in the human genome is unclear. By integrating de novo transcriptome assembly and Ribo-Seq, we confidently annotate thousands of novel translated smORFs in three human cell lines. We find that smORF translation prediction is noisier than for annotated coding sequences, underscoring the importance of analyzing multiple experiments and footprinting conditions. These smORFs are located within non-coding and antisense transcripts, the UTRs of mRNAs, and unannotated transcripts. Analysis of RNA levels and translation efficiency during cellular stress identifies regulated smORFs, providing an approach to select smORFs for further investigation. Sequence conservation and signatures of positive selection indicate that encoded microproteins are likely functional. Additionally, proteomics data from enriched human leukocyte antigen complexes validate the translation of hundreds of smORFs and position them as a source of novel antigens. Thus, smORFs represent a significant number of important, yet unexplored human genes.


2020 ◽  
Author(s):  
D.C.L. Handler ◽  
P.A. Haynes

Assessment of replicate quality is an important process for any shotgun proteomics experiment. One fundamental question in proteomics data analysis is whether any specific replicates in a set of analyses are biasing the downstream comparative quantitation. In this paper, we present an experimental method to address such a concern. PeptideMind uses a series of clustering machine learning algorithms to assess outliers when comparing proteomics data from two states with six replicates each. The program is a JVM-native application written in the Kotlin language with Python sub-process calls to scikit-learn. By permuting the six data replicates provided into four hundred non-redundant pairwise triplet comparisons, PeptideMind determines if any one replicate is biasing the downstream quantitation of the states. In addition, PeptideMind generates useful visual representations of the spread of the significance measures, allowing researchers a rapid, effective way to monitor the quality of those identified proteins found to be differentially expressed between sample states.
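
The figure of four hundred comparisons is consistent with simple counting: six replicates yield 20 distinct triplets per state, and pairing every triplet from one state with every triplet from the other gives 20 × 20 = 400. The sketch below only illustrates that arithmetic; it is not PeptideMind's code, and the replicate labels and variable names are hypothetical.

```python
from itertools import combinations, product

# Hypothetical replicate labels for the two sample states (six replicates each).
state_a = [f"A{i}" for i in range(1, 7)]
state_b = [f"B{i}" for i in range(1, 7)]

# Every unordered choice of three replicates out of six: C(6, 3) = 20 per state.
triplets_a = list(combinations(state_a, 3))
triplets_b = list(combinations(state_b, 3))

# Pair each state-A triplet with each state-B triplet: 20 * 20 = 400 comparisons.
comparisons = list(product(triplets_a, triplets_b))

# Comparisons that involve one particular replicate (here the hypothetical "A1").
with_a1 = [pair for pair in comparisons if "A1" in pair[0]]

print(len(triplets_a), len(comparisons), len(with_a1))
```

Under this counting, each replicate takes part in exactly half of the comparisons, so results with and without a given replicate can be contrasted on equal footing. How PeptideMind actually scores those two halves is not described in the abstract.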


Author(s):  
Jesse G. Meyer

Shotgun proteomics techniques infer the presence and quantity of proteins using peptide proxies, which are produced by cleavage of all isolated protein by a protease. Most protein quantitation strategies assume that multiple peptides derived from a protein will behave quantitatively similarly across treatment groups, but this assumption may be false for biological or technical reasons. Here, I describe a strategy called peptide correlation analysis (PeCorA) that detects quantitative disagreements between peptides mapped to the same protein. Simple linear models are used to assess whether the slope of a peptide’s change across treatment groups differs from the slope of all other peptides assigned to the same protein. Reanalysis of proteomic data from primary mouse microglia with PeCorA revealed that about 15% of proteins contain one discordant peptide. Inspection of the discordant peptides shows utility of PeCorA for direct and indirect detection of regulated PTMs, and also for discovery of poorly quantified peptides that should be excluded. PeCorA can be applied to an arbitrary list of quantified peptides, and is freely available as a script written in R.
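
The slope comparison described above can be illustrated with a heavily simplified sketch. It is not the PeCorA script (which is written in R) and it performs no significance test. It assumes treatment groups are coded numerically (0 for control, 1 for treated), fits an ordinary least-squares slope for one peptide and for all other peptides of the same protein pooled together, and reports the difference. The function names and the toy intensities are hypothetical.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_disagreement(peptides, target):
    """Slope of the target peptide minus the slope of all other peptides pooled.

    peptides maps a peptide name to a list of (group_code, log_intensity) pairs.
    """
    target_x, target_y = zip(*peptides[target])
    other_x, other_y = [], []
    for name, points in peptides.items():
        if name == target:
            continue
        for x, y in points:
            other_x.append(x)
            other_y.append(y)
    return ols_slope(target_x, target_y) - ols_slope(other_x, other_y)


# Toy log-intensities for three peptides of one hypothetical protein.
protein = {
    "PEPTIDE_A": [(0, 10.0), (0, 10.0), (1, 11.0), (1, 11.0)],
    "PEPTIDE_B": [(0, 12.0), (0, 12.0), (1, 13.0), (1, 13.0)],
    "PEPTIDE_C": [(0, 9.0), (0, 9.0), (1, 7.0), (1, 7.0)],  # moves the opposite way
}

diff_c = slope_disagreement(protein, "PEPTIDE_C")
diff_a = slope_disagreement(protein, "PEPTIDE_A")
print(diff_c, diff_a)
```

In this toy data PEPTIDE_C falls while the other two rise, so its slope difference is the largest in magnitude. A real analysis would also need a statistical test on that difference and a correction for multiple comparisons, neither of which is attempted here.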


2011 ◽  
Vol 21 (5) ◽  
pp. 756-767 ◽  
Author(s):  
M. Brosch ◽  
G. I. Saunders ◽  
A. Frankish ◽  
M. O. Collins ◽  
L. Yu ◽  
...  

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Svetlana Kalmykova ◽  
Marina Kalinina ◽  
Stepan Denisov ◽  
Alexey Mironov ◽  
Dmitry Skvortsov ◽  
...  

The ability of nucleic acids to form double-stranded structures is essential for all living systems on Earth. Current knowledge on functional RNA structures is focused on locally-occurring base pairs. However, crosslinking and proximity ligation experiments demonstrated that long-range RNA structures are highly abundant. Here, we present the most complete to-date catalog of conserved complementary regions (PCCRs) in human protein-coding genes. PCCRs tend to occur within introns, suppress intervening exons, and obstruct cryptic and inactive splice sites. Double-stranded structure of PCCRs is supported by decreased icSHAPE nucleotide accessibility, high abundance of RNA editing sites, and frequent occurrence of forked eCLIP peaks. Introns with PCCRs show a distinct splicing pattern in response to RNAPII slowdown suggesting that splicing is widely affected by co-transcriptional RNA folding. The enrichment of 3’-ends within PCCRs raises the intriguing hypothesis that coupling between RNA folding and splicing could mediate co-transcriptional suppression of premature pre-mRNA cleavage and polyadenylation.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
David S. M. Lee ◽  
Joseph Park ◽  
Andrew Kromer ◽  
Aris Baras ◽  
Daniel J. Rader ◽  
...  

Ribosome profiling has uncovered pervasive translation in non-canonical open reading frames; however, the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5’UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.


1987 ◽  
Vol 7 (12) ◽  
pp. 4266-4272 ◽  
Author(s):  
L W Stanton ◽  
J M Bishop

NMYC is a gene whose amplification and overexpression have been implicated in the generation of certain human malignancies. Little is known of how the expression of NMYC is normally controlled. We have therefore characterized transcription from the gene and the structure and stability of the resulting mRNAs. Transcription from NMYC is exceptionally complex: it initiates at numerous sites that may be grouped under the control of two promoters, and the multiplicity of initiation sites combines with alternative splicing to engender two forms of mRNA. The mRNAs have different 5' leader sequences (alternative first exons of the gene) but identical bodies (the second and third exons of the gene). Both forms of mRNA are unstable, with half-lives of ca. 15 min. Both encode the previously identified 65,000 and 67,000-dalton products of NMYC. However, the alternative first exons contain distinctive open reading frames that may diversify the coding potential of NMYC. The complexities in transcription of NMYC expand the means by which expression of the gene might be controlled.


2007 ◽  
Vol 283 (3) ◽  
pp. 1229-1233 ◽  
Author(s):  
Claudia Ben-Dov ◽  
Britta Hartmann ◽  
Josefin Lundgren ◽  
Juan Valcárcel

Alternative splicing of mRNA precursors allows the synthesis of multiple mRNAs from a single primary transcript, significantly expanding the information content and regulatory possibilities of higher eukaryotic genomes. High-throughput enabling technologies, particularly large-scale sequencing and splicing-sensitive microarrays, are providing unprecedented opportunities to address key questions in this field. The picture emerging from these pioneering studies is that alternative splicing affects most human genes and a significant fraction of the genes in other multicellular organisms, with the potential to greatly influence the evolution of complex genomes. A combinatorial code of regulatory signals and factors can deploy physiologically coherent programs of alternative splicing that are distinct from those regulated at other steps of gene expression. Pre-mRNA splicing and its regulation play important roles in human pathologies, and genome-wide analyses in this area are paving the way for improved diagnostic tools and for the identification of novel and more specific pharmaceutical targets.


2021 ◽  
Vol 72 (1) ◽  
Author(s):  
Andrzej T. Wierzbicki ◽  
Todd Blevins ◽  
Szymon Swiezewski

Plants have an extraordinary diversity of transcription machineries, including five nuclear DNA-dependent RNA polymerases. Four of these enzymes are dedicated to the production of long noncoding RNAs (lncRNAs), which are ribonucleic acids with functions independent of their protein-coding potential. lncRNAs display a broad range of lengths and structures, but they are distinct from the small RNA guides of RNA interference (RNAi) pathways. lncRNAs frequently serve as structural, catalytic, or regulatory molecules for gene expression. They can affect all elements of genes, including promoters, untranslated regions, exons, introns, and terminators, controlling gene expression at various levels, including modifying chromatin accessibility, transcription, splicing, and translation. Certain lncRNAs protect genome integrity, while others respond to environmental cues like temperature, drought, nutrients, and pathogens. In this review, we explain the challenge of defining lncRNAs, introduce the machineries responsible for their production, and organize this knowledge by viewing the functions of lncRNAs throughout the structure of a typical plant gene. Expected final online publication date for the Annual Review of Plant Biology, Volume 72 is May 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

