Neanderthals had our de novo genes.

2014 ◽  
Author(s):  
John Stewart Taylor

In 2009 Knowles and McLysaght reported the discovery of three human genes derived from non-coding DNA. They provided evidence that these genes, CLLU1, C22orf45, and DNAH10OS, were transcribed and translated; they identified orthologous non-coding DNA in chimpanzee (Pan troglodytes) and macaque (Macaca mulatta); and for each gene they located the critical “enabler” mutations that extended the open reading frames (ORFs), allowing the production of a protein. These genes had no BLASTp hits in any other genome and were considered to be novel human genes, possibly responsible for human-specific traits. Since the discovery of these genes, new high-quality Denisovan and Neanderthal genomes have been reported. I used these resources in an effort to determine whether or not CLLU1, C22orf45, and DNAH10OS were truly human-specific.
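The idea of an “enabler” mutation that extends an ORF can be illustrated with a small sketch. The sequences below are invented toy examples, not the real CLLU1, C22orf45, or DNAH10OS loci, and the helper names `orf_length_codons` and `substitute` are hypothetical. The sketch only shows how a single substitution that removes a premature in-frame stop codon lengthens the reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_length_codons(seq: str, start: int = 0) -> int:
    """Count codons from an ATG at `start` up to the first in-frame stop.

    Returns 0 if there is no ATG at `start`. The stop codon itself is
    not counted. If no stop is found, counting ends at the last full codon.
    """
    seq = seq.upper()
    if seq[start:start + 3] != "ATG":
        return 0
    count = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


def substitute(seq: str, pos: int, base: str) -> str:
    """Return a copy of `seq` with the base at 0-based `pos` replaced."""
    return seq[:pos] + base + seq[pos + 1:]


# Toy "ancestral" sequence (assumption): ATG GCT TGA GGA TCC AAA TAA
ancestral = "ATGGCTTGAGGATCCAAATAA"
# Toy "enabler" substitution: TGA (stop) becomes CGA (Arg) at position 6.
derived = substitute(ancestral, 6, "C")

print(orf_length_codons(ancestral), orf_length_codons(derived))
```

In this toy case the reading frame runs through the former stop and ends only at the terminal TAA. Establishing that a real ORF is human-specific would also require aligning the orthologous region in the other genomes, which the sketch does not attempt.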

2017 ◽  
Author(s):  
Jonathan Schmitz ◽  
Kristian Ullrich ◽  
Erich Bornberg-Bauer

Abstract A recent surge of studies suggested that many novel genes arise de novo from previously non-coding DNA and not by duplication. However, since most studies concentrated on longer evolutionary time scales and rarely considered protein structural properties, it remains unclear how these properties are shaped by evolution, depend on genetic mechanisms and influence gene survival. Here we compare open reading frames (ORFs) from high-coverage transcriptomes from mouse and another four mammals covering 160 million years of evolution. We find that novel ORFs pervasively emerge from intergenic and intronic regions but are rapidly lost again, while relatively fewer arise from duplications but are retained over much longer times. Surprisingly, disorder and other protein properties of young ORFs do not change with gene age. Only length and nucleotide composition change, probably to avoid aggregation. Thus, de novo genes resemble frozen accidents of randomly emerged ORFs which survived initial purging, likely because they are functional.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
David S. M. Lee ◽  
Joseph Park ◽  
Andrew Kromer ◽  
Aris Baras ◽  
Daniel J. Rader ◽  
...  

Abstract Ribosome profiling has uncovered pervasive translation in non-canonical open reading frames; however, the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5′ UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation-disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.


2019 ◽  
Author(s):  
Thomas F. Martinez ◽  
Qian Chu ◽  
Cynthia Donaldson ◽  
Dan Tan ◽  
Maxim N. Shokhirev ◽  
...  

Protein-coding small open reading frames (smORFs) are emerging as an important class of genes; however, the coding capacity of smORFs in the human genome is unclear. By integrating de novo transcriptome assembly and Ribo-Seq, we confidently annotate thousands of novel translated smORFs in three human cell lines. We find that smORF translation prediction is noisier than for annotated coding sequences, underscoring the importance of analyzing multiple experiments and footprinting conditions. These smORFs are located within non-coding and antisense transcripts, the UTRs of mRNAs, and unannotated transcripts. Analysis of RNA levels and translation efficiency during cellular stress identifies regulated smORFs, providing an approach to select smORFs for further investigation. Sequence conservation and signatures of positive selection indicate that encoded microproteins are likely functional. Additionally, proteomics data from enriched human leukocyte antigen complexes validate the translation of hundreds of smORFs and position them as a source of novel antigens. Thus, smORFs represent a significant number of important, yet unexplored human genes.


2018 ◽  
Author(s):  
Lisa K. Johnson ◽  
Harriet Alexander ◽  
C. Titus Brown

Abstract Background: De novo transcriptome assemblies are required prior to analyzing RNAseq data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or “pipelines”, on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short read data collected by the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Research (NCGR).
Results: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies were novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics, with assemblies from the Dinoflagellata and Ciliophora phyla showing a higher percentage of open reading frames and number of contigs than transcriptomes from other phyla.
Conclusions: Given current bioinformatics approaches, there is no single ‘best’ reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally-intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.
Key Points: Re-assembly with new tools can yield new results. Automated and programmable pipelines can be used to process arbitrarily many samples. Analyzing many samples using a common pipeline identifies taxon-specific trends.


2019 ◽  
Vol 109 (2) ◽  
pp. 222-224 ◽  
Author(s):  
Margarita Gomila ◽  
Eduardo Moralejo ◽  
Antonio Busquets ◽  
Guillem Segui ◽  
Diego Olmo ◽  
...  

Xylella fastidiosa is a plant-pathogenic bacterium that causes serious diseases in many crops of economic importance and is a quarantine organism in the European Union. This study reports a de novo-assembled draft genome sequence of the first isolates causing Pierce’s disease in Europe: X. fastidiosa subsp. fastidiosa strains XYL1732/17 and XYL2055/17. Both strains were isolated from grapevines (Vitis vinifera) showing Pierce’s disease symptoms at two different locations in Mallorca, Spain. The XYL1732/17 genome is 2,444,109 bp long, with a G+C content of 51.5%; it contains 2,359 open reading frames and 48 tRNA genes. The XYL2055/17 genome is 2,456,780 bp long, with a G+C content of 51.5%; it contains 2,384 open reading frames and 48 tRNA genes.
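The summary statistics quoted above (genome length, G+C content, ORF count) are simple to compute once an assembly is in hand. The following is a minimal sketch, not the authors' pipeline: the function names `gc_content` and `orfs_per_kb` are hypothetical, and the example sequence is a toy string rather than X. fastidiosa data.

```python
def gc_content(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / acgt


def orfs_per_kb(n_orfs: int, genome_bp: int) -> float:
    """Number of annotated ORFs per kilobase of genome."""
    return n_orfs / (genome_bp / 1000.0)


# Toy sequence (assumption): 4 of 8 bases are G or C.
print(gc_content("GGCCAATT"))

# ORF density from the figures reported for strain XYL1732/17.
print(round(orfs_per_kb(2359, 2444109), 3))
```

Ambiguous bases such as N are excluded from the denominator here, which is one common convention. Other tools may count them differently.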


2020 ◽  
Vol 12 (11) ◽  
pp. 2183-2195
Author(s):  
Daniel Dowling ◽  
Jonathan F Schmitz ◽  
Erich Bornberg-Bauer

Abstract In addition to known genes, much of the human genome is transcribed into RNA. Chance formation of novel open reading frames (ORFs) can lead to the translation of myriad new proteins. Some of these ORFs may yield advantageous adaptive de novo proteins. However, widespread translation of noncoding DNA can also produce hazardous protein molecules, which can misfold and/or form toxic aggregates. The dynamics of how de novo proteins emerge from potentially toxic raw materials and what influences their long-term survival are unknown. Here, using transcriptomic data from human and five other primates, we generate a set of transcribed human ORFs at six conservation levels to investigate which properties influence the early emergence and long-term retention of these expressed ORFs. As these taxa diverged from each other relatively recently, we present a fine scale view of the evolution of novel sequences over recent evolutionary time. We find that novel human-restricted ORFs are preferentially located on GC-rich gene-dense chromosomes, suggesting their retention is linked to pre-existing genes. Sequence properties such as intrinsic structural disorder and aggregation propensity—which have been proposed to play a role in survival of de novo genes—remain unchanged over time. Even very young sequences code for proteins with low aggregation propensities, suggesting that genomic regions with many novel transcribed ORFs are concomitantly less likely to produce ORFs which code for harmful toxic proteins. Our data indicate that the survival of these novel ORFs is largely stochastic rather than shaped by selection.


2006 ◽  
Vol 188 (17) ◽  
pp. 6261-6268 ◽  
Author(s):  
Jonathon P. Audia ◽  
Herbert H. Winkler

ABSTRACT The obligate intracytoplasmic pathogen Rickettsia prowazekii relies on the transport of many essential compounds from the cytoplasm of the eukaryotic host cell in lieu of de novo synthesis, an evolutionary outcome undoubtedly linked to obligatory growth in this metabolite-replete niche. The paradigm for the study of rickettsial transport systems is the ATP/ADP translocase Tlc1, which exchanges bacterial ADP for host cell ATP as a source of energy, rather than as a source of adenylate. Interestingly, the R. prowazekii genome encodes four open reading frames that are highly homologous to the well-characterized ATP/ADP translocase Tlc1. Therefore, by annotation, the R. prowazekii genome encodes a total of five ATP/ADP translocases: Tlc1, Tlc2, Tlc3, Tlc4, and Tlc5. We have confirmed by quantitative reverse transcriptase PCR that mRNAs corresponding to all five tlc homologues are expressed in R. prowazekii growing in L-929 cells and have shown their heterologous protein expression in Escherichia coli, suggesting that none of the tlc genes are pseudogenes in the process of evolutionary meltdown. However, we demonstrate by heterologous expression in E. coli that only Tlc1 functions as an ATP/ADP transporter. A survey of nucleotides and nucleosides has determined that Tlc4 transports CTP, UTP, and GDP. Intriguingly, although GTP was not transported by Tlc4, it was an inhibitor of CTP and UTP uptake and demonstrated a Ki similar to that of GDP. In addition, we demonstrate that Tlc5 transports GTP and GDP. We postulate that Tlc4 and Tlc5 serve the primary function of maintaining intracellular pools of nucleotides for rickettsial nucleic acid biosynthesis and do not provide the cell with nucleoside triphosphates as an energy source, as is the case for Tlc1. Although heterologous expression of Tlc2 and Tlc3 was observed in E. coli, we were unable to identify substrates for these proteins.


2015 ◽  
Author(s):  
Lorenzo Calviello ◽  
Neelanjan Mukherjee ◽  
Emanuel Wyler ◽  
Henrik Zauber ◽  
Antje Hirsekorn ◽  
...  

RNA sequencing protocols allow for quantifying gene expression regulation at each individual step, from transcription to protein synthesis. Ribosome Profiling (Ribo-seq) maps the positions of translating ribosomes over the entire transcriptome. Despite its great potential, a rigorous statistical approach to identify translated regions by means of the characteristic three-nucleotide periodicity of Ribo-seq data is not yet available. To fill this gap, we developed RiboTaper, which quantifies the significance of periodic Ribo-seq reads via spectral analysis methods. We applied RiboTaper on newly generated, deep Ribo-seq data in HEK293 cells, to derive an extensive map of translation that covers Open Reading Frame (ORF) annotations for more than 11,000 protein-coding genes. We also find distinct ribosomal signatures for several hundred detected upstream ORFs and ORFs in annotated non-coding genes (ncORFs). Mass spectrometry data confirms that RiboTaper achieves excellent coverage of the cellular proteome and validates dozens of novel peptide products. Collectively, RiboTaper (available at https://ohlerlab.mdc-berlin.de/software/) is a powerful method for comprehensive de novo identification of actively used ORFs in the human genome.
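To make the notion of three-nucleotide periodicity concrete, here is a rough sketch that measures how much of a read-count profile's variance sits at a period of exactly three positions. It uses a plain discrete Fourier component, which is an assumption chosen for brevity. It is not RiboTaper's statistical method and gives no significance estimate. The function name `frame_periodicity` and the count profiles are invented for illustration.

```python
import cmath


def frame_periodicity(counts):
    """Fraction (0 to 1) of mean-removed signal power at a period of 3.

    `counts` holds one read count per nucleotide position. The profile is
    trimmed to a whole number of codons so that period 3 falls exactly on
    a Fourier frequency.
    """
    n = len(counts) - len(counts) % 3
    if n == 0:
        return 0.0
    window = counts[:n]
    mean = sum(window) / n
    deviations = [c - mean for c in window]
    total = sum(d * d for d in deviations)
    if total == 0:
        return 0.0  # flat profile: no periodic signal at all
    component = sum(
        d * cmath.exp(-2j * cmath.pi * k / 3) for k, d in enumerate(deviations)
    )
    # Factor 2 accounts for the mirrored (conjugate) frequency of a real signal.
    return 2 * abs(component) ** 2 / (n * total)


translated_like = [10, 1, 1] * 20   # toy profile with reads piled in one frame
alternating = [5, 0] * 30           # toy profile with period 2, not 3

print(frame_periodicity(translated_like), frame_periodicity(alternating))
```

A score near 1 means nearly all variation repeats every three nucleotides, as expected for footprints from a translated frame. A score near 0 means little or none does. Real data would also need a test against noise, which is the part the abstract attributes to RiboTaper's spectral analysis.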


2021 ◽  
Author(s):  
Nikolaos Vakirlis ◽  
Kate M. Duggan ◽  
Aoife McLysaght

We now have a growing understanding that functional short proteins can be translated out of small Open Reading Frames (sORFs). Such “microproteins” can perform crucial biological tasks and can have considerable phenotypic consequences. However, their size makes them less amenable to genomic analysis, and their evolutionary origins and conservation are poorly understood. Given their short length it is plausible that some of these functional microproteins have recently originated entirely de novo from non-coding sequence. Here we test the possibility that de novo gene birth can produce microproteins that are functional “out-of-the-box”. We reconstructed the evolutionary origins of human microproteins previously found to have measurable, statistically significant fitness effects. By tracing the appearance of each ORF and its transcriptional activation, we were able to show that, indeed, novel small proteins with significant phenotypic effects have emerged de novo throughout animal evolution, including many after the human-chimpanzee split. We show that traditional methods for assessing the coding potential of such sequences often fall short, due to the high variability present in the alignments and the absence of telltale evolutionary signatures that are not yet measurable. Thus we provide evidence that the functional potential intrinsic to sORFs can be rapidly and frequently realised through de novo gene birth.


2021 ◽  
Vol 22 (11) ◽  
pp. 5476
Author(s):  
Bing Wang ◽  
Zhiwei Wang ◽  
Ni Pan ◽  
Jiangmei Huang ◽  
Cuihong Wan

Small open reading frames (sORFs) have translational potential to produce peptides that play essential roles in various biological processes. Nevertheless, many sORF-encoded peptides (SEPs) are still on the prediction level. Here, we construct a strategy to analyze SEPs by combining top-down and de novo sequencing to improve SEP identification and sequence coverage. With de novo sequencing, we identified 1682 peptides mapping to 2544 human sORFs, which were all first characterized in this work. Two-thirds of these new sORFs have reading frame shifts and use a non-ATG start codon. The top-down approach identified 241 human SEPs, with high sequence coverage. The average length of the peptides from the bottom-up database search was 19 amino acids (AA); from de novo sequencing, it was 9 AA; and from the top-down approach, it was 25 AA. The longer peptide positively boosts the sequence coverage, more efficiently distinguishing SEPs from the known gene coding sequence. Top-down has the advantage of identifying peptides with sequential K/R or high K/R content, which is unfavorable in the bottom-up approach. Our method can explore new coding sORFs and obtain highly accurate sequences of their SEPs, which can also benefit future function research.
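One way to see why sequential K/R residues are unfavorable in a bottom-up workflow is to simulate the digestion step. The sketch below assumes a tryptic digest, which is common in bottom-up proteomics but is not named in the abstract. It applies the usual simplified rule that trypsin cleaves after K or R unless the next residue is P. The function name `tryptic_digest` and the peptide sequence are hypothetical.

```python
def tryptic_digest(protein: str) -> list:
    """Split a protein after each K or R, except when followed by P."""
    peptides = []
    current = []
    for i, residue in enumerate(protein):
        current.append(residue)
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append("".join(current))
            current = []
    if current:
        peptides.append("".join(current))  # C-terminal peptide with no K/R end
    return peptides


# Toy microprotein (assumption) with a run of consecutive K/R residues.
toy_sep = "MAGKRKKRLSTPEK"
fragments = tryptic_digest(toy_sep)
print(fragments)
print([len(p) for p in fragments])
```

The K/R run is chopped into single-residue fragments that carry no sequence information, so only the flanking peptides could be matched to the sORF. Measuring the intact molecule, as in the top-down approach, avoids this loss.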

