Balrog: A universal protein model for prokaryotic gene prediction

2020 ◽  
Author(s):  
Markus J. Sommer ◽  
Steven L. Salzberg

Abstract Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.

Author summary: Annotating the protein-coding genes in a newly sequenced prokaryotic genome is a critical part of describing their biological function. Relative to eukaryotic genomes, prokaryotic genomes are small and structurally simple, with 90% of their DNA typically devoted to protein-coding genes. Current computational gene finding tools are therefore able to achieve close to 99% sensitivity to known genes using species-specific gene models.

Though highly sensitive at finding known genes, all current prokaryotic gene finders also predict large numbers of additional genes, which are labelled as “hypothetical protein” in GenBank and other annotation databases. Many hypothetical gene predictions likely represent true protein-coding sequence, but it is not known how many of them represent false positives. Additionally, all current gene finding tools must be trained specifically for each genome as a preliminary step in order to achieve high sensitivity. This requirement limits their ability to detect genes in fragmented sequences commonly seen in metagenomic samples.

We took a data-driven approach to prokaryotic gene finding, relying on the large and diverse collection of already-sequenced genomes. By training a single, universal model of bacterial genes on protein sequences from many different species, we were able to match the sensitivity of current gene finders while reducing the overall number of gene predictions. Our model does not need to be refit on any new genome. Balrog (Bacterial Annotation by Learned Representation of Genes) represents a fundamentally different yet effective method for prokaryotic gene finding.

2021 ◽  
Vol 17 (2) ◽  
pp. e1008727
Author(s):  
Markus J. Sommer ◽  
Steven L. Salzberg

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.


2019 ◽  
Author(s):  
Wei Fang ◽  
Yi Wen ◽  
Xiangyun Wei

Abstract Tissue-specific or cell type-specific transcription of protein-coding genes is controlled by both trans-regulatory elements (TREs) and cis-regulatory elements (CREs). However, it is challenging to identify TREs and CREs, which are unknown for most genes. Here, we describe a protocol for identifying two types of transcription-activating CREs (core promoters and enhancers) of zebrafish photoreceptor type-specific genes. This protocol is composed of three phases: bioinformatic prediction, experimental validation, and characterization of the CREs. To better illustrate the principles and logic of this protocol, we exemplify it with the discovery of the core promoter and enhancer of the mpp5b apical polarity gene (also known as ponli), whose red, green, and blue (RGB) cone-specific transcription requires its enhancer, a member of the rainbow enhancer family. While exemplified with an RGB cone-specific gene, this protocol is general and can be used to identify the core promoters and enhancers of other protein-coding genes.


2019 ◽  
Author(s):  
Deepank R Korandla ◽  
Jacob M Wozniak ◽  
Anaamika Campeau ◽  
David J Gonzalez ◽  
Erik S Wright

Abstract Motivation: A core task of genomics is to identify the boundaries of protein coding genes, which may cover over 90% of a prokaryote's genome. Several programs are available for gene finding, yet it is currently unclear how well these programs perform and whether any offers superior accuracy. This is in part because there is no universal benchmark for gene finding and, therefore, most developers select their own benchmarking strategy. Results: Here, we introduce AssessORF, a new approach for benchmarking prokaryotic gene predictions based on evidence from proteomics data and the evolutionary conservation of start and stop codons. We applied AssessORF to compare gene predictions offered by GenBank, GeneMarkS-2, Glimmer and Prodigal on genomes spanning the prokaryotic tree of life. Gene predictions were 88–95% in agreement with the available evidence, with Glimmer performing the worst but no clear winner. All programs were biased towards selecting start codons that were upstream of the actual start. Given these findings, there remains considerable room for improvement, especially in the detection of correct start sites. Availability and implementation: AssessORF is available as an R package via the Bioconductor package repository. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Vol 4 (11) ◽  
pp. 170758 ◽  
Author(s):  
Peter G. Foster ◽  
Tatiane Marques Porangaba de Oliveira ◽  
Eduardo S. Bergo ◽  
Jan E. Conn ◽  
Denise Cristina Sant’Ana ◽  
...  

Malaria is a vector-borne disease that is a great burden on the poorest and most marginalized communities of the tropical and subtropical world. Approximately 41 species of Anopheline mosquitoes can effectively spread species of Plasmodium parasites that cause human malaria. Proposing a natural classification for the subfamily Anophelinae has been a continuous effort, addressed using both morphology and DNA sequence data. The monophyly of the genus Anopheles, and phylogenetic placement of the genus Bironella, subgenera Kerteszia, Lophopodomyia and Stethomyia within the subfamily Anophelinae, remain in question. To understand the classification of Anophelinae, we inferred the phylogeny of all three genera (Anopheles, Bironella, Chagasia) and major subgenera by analysing the amino acid sequences of the 13 protein coding genes of 150 newly sequenced mitochondrial genomes of Anophelinae and 18 newly sequenced Culex species as outgroup taxa, supplemented with 23 mitogenomes from GenBank. Our analyses generally place genus Bironella within the genus Anopheles, which implies that the latter as it is currently defined is not monophyletic. With some inconsistencies, Bironella was placed within the major clade that includes Anopheles, Cellia, Kerteszia, Lophopodomyia, Nyssorhynchus and Stethomyia, which were found to be monophyletic groups within Anophelinae. Our findings provided robust evidence for elevating the monophyletic groupings Kerteszia, Lophopodomyia, Nyssorhynchus and Stethomyia to genus level; genus Anopheles to include subgenera Anopheles, Baimaia, Cellia and Christya; Anopheles parvus to be placed into a new genus; Nyssorhynchus to be elevated to genus level; the genus Nyssorhynchus to include subgenera Myzorhynchella and Nyssorhynchus; Anopheles atacamensis and Anopheles pictipennis to be transferred from subgenus Nyssorhynchus to subgenus Myzorhynchella; and subgenus Nyssorhynchus to encompass the remaining species of Argyritarsis and Albimanus Sections.


Parasitology ◽  
2006 ◽  
Vol 134 (5) ◽  
pp. 749-759 ◽  
Author(s):  
J.-K. PARK ◽  
K.-H. KIM ◽  
S. KANG ◽  
H. K. JEON ◽  
J.-H. KIM ◽  
...  

SUMMARY The complete nucleotide sequence of the mitochondrial genome was determined for the fish tapeworm Diphyllobothrium latum. This genome is 13 608 bp in length and encodes 12 protein-coding genes (but lacks the atp8), 22 transfer RNA (tRNA) and 2 ribosomal RNA (rRNA) genes, corresponding to the gene complement found thus far in other flatworm mitochondrial (mt) DNAs. The gene arrangement of this pseudophyllidean cestode is the same as the 6 cyclophyllidean cestodes characterized to date, with only minor variation in structure among these other genomes; the relative position of trnS2 and trnL1 is switched in Hymenolepis diminuta. Phylogenetic analyses of the concatenated amino acid sequences for 12 protein-coding genes of all complete cestode mtDNAs confirmed taxonomic and previous phylogenetic assessments, with D. latum being a sister taxon to the cyclophyllideans. High nodal support and phylogenetic congruence between different methods suggest that mt genomes may be of utility in resolving ordinal relationships within the cestodes. All species of Diphyllobothrium infect fish-eating vertebrates, and D. latum commonly infects humans through the ingestion of raw, poorly cooked or pickled fish. The complete mitochondrial genome provides a wealth of genetic markers which could be useful for identifying different life-cycle stages and for investigating their population genetics, ecology and epidemiology.


Genomics ◽  
2015 ◽  
Vol 106 (6) ◽  
pp. 367-372 ◽  
Author(s):  
Sandip Paul ◽  
Archana Bhardwaj ◽  
Sumit K. Bag ◽  
Evgeni V. Sokurenko ◽  
Sujay Chattopadhyay

2020 ◽  
Author(s):  
Yi-Tian Fu ◽  
Yu Nie ◽  
De-Yong Duan ◽  
Guo-Hua Liu

Abstract Background: The family Hoplopleuridae contains at least 183 species of blood-sucking lice, which widely parasitize both mice and rats. Fragmented mitochondrial (mt) genomes have been reported in two rat lice (Hoplopleura kitti and H. akanezumi) from this family, but some minichromosomes were unidentified in their mt genomes. Methods: We sequenced the mt genome of the rat louse Hoplopleura sp. with an Illumina platform and compared its mt genome organization with H. kitti and H. akanezumi. Results: Fragmented mt genome of the rat louse Hoplopleura sp. contains 37 genes which are on 12 circular mt minichromosomes. Each mt minichromosome is 1.8–2.7 kb long and contains 1–5 genes and one large non-coding region. The gene content and arrangement of mt minichromosomes of Hoplopleura sp. (n = 3) and H. kitti (n = 3) are different from those in H. akanezumi (n = 3). Phylogenetic analyses based on the deduced amino acid sequences of the eight protein-coding genes showed that the Hoplopleura sp. was more closely related to H. akanezumi than to H. kitti, and then they formed a monophyletic group. Conclusions: Comparison among the three rat lice revealed variation in the composition of mt minichromosomes within the genus Hoplopleura. Hoplopleura sp. is the first species from the family Hoplopleuridae for which a complete fragmented mt genome has been sequenced. The new data provide useful genetic markers for studying the population genetics, molecular systematics and phylogenetics of blood-sucking lice.


ZooKeys ◽  
2020 ◽  
Vol 945 ◽  
pp. 1-16
Author(s):  
Yuan-An Wu ◽  
Jin-Wei Gao ◽  
Xiao-Fei Cheng ◽  
Min Xie ◽  
Xi-Ping Yuan ◽  
...  

Azygia hwangtsiyui (Trematoda, Azygiidae), a neglected parasite of predatory fishes, is little-known in terms of its molecular epidemiology, population ecology and phylogenetic study. In the present study, the complete mitochondrial genome of A. hwangtsiyui was sequenced and characterized: it is a 13,973 bp circular DNA molecule and encodes 36 genes (12 protein-coding genes, 22 transfer RNA genes, two ribosomal RNA genes) as well as two non-coding regions. The A+T content of the A. hwangtsiyui mitogenome is 59.6% and displays a remarkable bias in nucleotide composition with a negative AT skew (–0.437) and a positive GC skew (0.408). Phylogenetic analysis based on concatenated amino acid sequences of twelve protein-coding genes reveals that A. hwangtsiyui is placed in a separate clade, suggesting that it has no close relationship with any other trematode family. This is the first characterization of the A. hwangtsiyui mitogenome, and the first reported mitogenome of the family Azygiidae. These novel datasets of the A. hwangtsiyui mt genome represent a meaningful resource for the development of mitochondrial markers for the identification, diagnostics, taxonomy, homology and phylogenetic relationships of trematodes.
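The AT and GC skew values quoted above are conventionally computed on one strand as (A − T)/(A + T) and (G − C)/(G + C). The abstract does not state which formula was used, so the following is only a minimal sketch under that conventional definition. The function name `composition_skews` and the toy sequence are hypothetical and are not the authors' data or pipeline.

```python
from collections import Counter


def composition_skews(seq: str) -> dict:
    """Return A+T content, AT skew and GC skew for a single-strand DNA sequence.

    Assumes the conventional definitions:
        AT skew = (A - T) / (A + T)
        GC skew = (G - C) / (G + C)
    Characters other than A, T, G, C (e.g. N) are ignored.
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    if a + t == 0 or g + c == 0:
        raise ValueError("sequence needs at least one A/T and one G/C base")
    return {
        "at_content": (a + t) / (a + t + g + c),
        "at_skew": (a - t) / (a + t),
        "gc_skew": (g - c) / (g + c),
    }


# Toy sequence (hypothetical): 2 A, 6 T, 3 G, 1 C
toy = "A" * 2 + "T" * 6 + "G" * 3 + "C" * 1
skews = composition_skews(toy)
print(skews)
```

A negative AT skew means T outnumbers A on the analysed strand, and a positive GC skew means G outnumbers C, which is the pattern the abstract reports.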


Cancers ◽  
2019 ◽  
Vol 11 (10) ◽  
pp. 1524 ◽  
Author(s):  
Rodiola Begolli ◽  
Nikos Sideris ◽  
Antonis Giakountis

During the last decade, high-throughput sequencing efforts in the fields of transcriptomics and epigenomics have shed light on the noncoding part of the transcriptome and its potential role in human disease. Regulatory noncoding RNAs are broadly divided into short and long noncoding transcripts. The latter, also known as lncRNAs, are defined as transcripts longer than 200 nucleotides with low or no protein-coding potential. LncRNAs form a diverse group of transcripts that regulate vital cellular functions through interactions with proteins, chromatin, and even RNA itself. Notably, an important regulatory aspect of these RNA species is their association with the epigenetic machinery and the recruitment of its regulatory apparatus to specific loci, resulting in DNA methylation and/or post-translational modifications of histones. Such epigenetic modifications play a pivotal role in maintaining the active or inactive transcriptional state of chromatin and are crucial regulators of normal cellular development and tissue-specific gene expression. Evidently, aberrant expression of lncRNAs that interact with epigenetic modifiers can cause severe epigenetic disruption and is thus closely associated with altered gene function, cellular dysregulation, and malignant transformation. Here, we survey the latest breakthroughs concerning the role of lncRNAs interacting with the epigenetic machinery in various forms of cancer.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Qingzhen Wei ◽  
Jinglei Wang ◽  
Wuhong Wang ◽  
Tianhua Hu ◽  
Haijiao Hu ◽  
...  

Abstract Eggplant (Solanum melongena L.) is an economically important vegetable crop in the Solanaceae family, with extensive diversity among landraces and close relatives. Here, we report a high-quality reference genome for the eggplant inbred line HQ-1315 (S. melongena-HQ) using a combination of Illumina, Nanopore and 10X genomics sequencing technologies and Hi-C technology for genome assembly. The assembled genome has a total size of ~1.17 Gb and 12 chromosomes, with a contig N50 of 5.26 Mb, consisting of 36,582 protein-coding genes. Repetitive sequences comprise 70.09% (811.14 Mb) of the eggplant genome, most of which are long terminal repeat (LTR) retrotransposons (65.80%), followed by long interspersed nuclear elements (LINEs, 1.54%) and DNA transposons (0.85%). The S. melongena-HQ eggplant genome carries a total of 563 accession-specific gene families containing 1009 genes. In total, 73 expanded gene families (892 genes) and 34 contraction gene families (114 genes) were functionally annotated. Comparative analysis of different eggplant genomes identified three types of variations, including single-nucleotide polymorphisms (SNPs), insertions/deletions (indels) and structural variants (SVs). Asymmetric SV accumulation was found in potential regulatory regions of protein-coding genes among the different eggplant genomes. Furthermore, we performed QTL-seq for eggplant fruit length using the S. melongena-HQ reference genome and detected a QTL interval of 71.29–78.26 Mb on chromosome E03. The gene Smechr0301963, which belongs to the SUN gene family, is predicted to be a key candidate gene for eggplant fruit length regulation. Moreover, we anchored a total of 210 linkage markers associated with 71 traits to the eggplant chromosomes and finally obtained 26 QTL hotspots. The eggplant HQ-1315 genome assembly can be accessed at http://eggplant-hq.cn. 
In conclusion, the eggplant genome presented herein provides a global view of genomic divergence at the whole-genome level and powerful tools for the identification of candidate genes for important traits in eggplant.
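The contig N50 cited above is usually defined as the length of the shortest contig in the smallest set of longest contigs that together span at least half of the total assembly length. The sketch below assumes that common definition. The function name `n50` and the toy contig lengths are hypothetical and do not come from the eggplant assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the summed
    assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("at least one contig length is required")


# Toy contig lengths (hypothetical, arbitrary units); they sum to 20
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

With these toy values the two longest contigs (6 and 5) already cover more than half of the total, so the N50 is the shorter of the two.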

