A regulatory-sequence classifier with a neural network for genomic information processing

2018 ◽  
Author(s):  
Koh Onimaru ◽  
Osamu Nishimura ◽  
Shigehiro Kuraku

Genotype-phenotype mapping is one of the fundamental challenges in biology. The difficulties stem in part from the large amount of sequence information and the puzzling genomic code, particularly of non-protein-coding regions such as gene regulatory sequences. Recently, however, deep learning-based methods have been shown to be able to decipher the gene regulatory code of genomes. Still, prediction accuracy needs improvement. Here, we report the design of convolution layers that efficiently process genomic sequence information, and we describe DeepGMAP, software that we developed to train and compare different deep learning-based models (https://github.com/koonimaru/DeepGMAP). First, we demonstrate that our convolution layers, termed forward- and reverse-sequence scan (FRSS) layers, enhance the power to predict gene regulatory sequences. Second, we assess previous studies and identify problems associated with data structures that caused overfitting. Finally, we introduce several visualization methods that provide insights into the syntax of gene regulatory sequences.
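The idea of scanning a sequence in both orientations can be illustrated with a minimal sketch. This is not the DeepGMAP implementation: the abstract does not say how the forward and reverse scans are combined, so the elementwise maximum used here is an assumption, and the names `motif_kernel`, `scan`, and `forward_reverse_scan` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, List

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

Kernel = List[Dict[str, float]]


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def motif_kernel(motif: str) -> Kernel:
    """Toy kernel: weight 1.0 for the motif base at each position, else 0.0."""
    return [{b: 1.0 if b == m else 0.0 for b in BASES} for m in motif]


def scan(seq: str, kernel: Kernel) -> List[float]:
    """Slide the kernel over seq and score every full-width window."""
    width = len(kernel)
    return [
        sum(kernel[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def forward_reverse_scan(seq: str, kernel: Kernel) -> List[float]:
    """Score each window on both strands with the same kernel.

    Scanning the reverse complement and flipping the result aligns index i
    of both score lists with the same window of the original sequence.
    Combining the strands by elementwise max is an assumption of this sketch.
    """
    forward = scan(seq, kernel)
    reverse = scan(reverse_complement(seq), kernel)[::-1]
    return [max(f, r) for f, r in zip(forward, reverse)]


kernel = motif_kernel("GAT")
scores = forward_reverse_scan("ATCAA", kernel)
```

In this toy case the first window, `ATC`, scores 0 on the forward strand but 3 on the reverse strand, because its reverse complement is `GAT`. Sharing one kernel across both orientations means a motif does not have to be learned twice.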

2020 ◽  
Vol 16 (11) ◽  
pp. e1008334
Author(s):  
Ling Chen ◽  
John A. Capra

Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. 
Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depends on the nature of the prediction task.
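A synthetic sequence of the kind described above can be sketched as a random background with motif instances planted in it. The function `plant_motifs`, the sequence length, and the two motif strings are illustrative placeholders and are not taken from the study; the actual simulation design, motif sources, and spacing rules are not given in this abstract.

```python
import random
from typing import List


def plant_motifs(motifs: List[str], length: int, rng: random.Random) -> str:
    """Build a random DNA background with each motif planted once.

    Motifs are placed in random order and never overlap each other.
    """
    total = sum(len(m) for m in motifs)
    if total > length:
        raise ValueError("motifs do not fit in the requested length")

    # Distribute the leftover background bases over the gaps around motifs.
    gaps = [0] * (len(motifs) + 1)
    for _ in range(length - total):
        gaps[rng.randrange(len(gaps))] += 1

    order = list(motifs)
    rng.shuffle(order)

    parts = []
    for gap, motif in zip(gaps, order + [""]):
        parts.append("".join(rng.choice("ACGT") for _ in range(gap)))
        parts.append(motif)
    return "".join(parts)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Heterotypic cluster: two different placeholder motifs in one sequence.
heterotypic = plant_motifs(["TGACTCA", "CACGTG"], 60, rng)

# Homotypic cluster: three copies of the same placeholder motif.
homotypic = plant_motifs(["TGACTCA"] * 3, 60, rng)
```

Labelled sets of such sequences, with and without a given motif arrangement, are one simple way to test whether a trained model has picked up the arrangement rather than the individual motifs.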


2009 ◽  
Vol 58 (3) ◽  
pp. 447-460 ◽  
Author(s):  
Darrell Jay Grimes ◽  
Crystal N. Johnson ◽  
Kevin S. Dillon ◽  
Adrienne R. Flowers ◽  
Nicholas F. Noriea ◽  
...  

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Jan Zrimec ◽  
Christoph S. Börlin ◽  
Filip Buric ◽  
Azam Sheikh Muhammad ◽  
Rhongzen Chen ◽  
...  

Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning on over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to Human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation of transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and specific combination of regulatory elements that define gene expression levels.


2015 ◽  
Author(s):  
Anil Raj ◽  
Sidney H. Wang ◽  
Heejung Shim ◽  
Arbel Harpak ◽  
Yang I. Li ◽  
...  

Accurate annotation of protein coding regions is essential for understanding how genetic information is translated into biological functions. Here we describe riboHMM, a new method that uses ribosome footprint data along with gene expression and sequence information to accurately infer translated sequences. We applied our method to human lymphoblastoid cell lines and identified 7,273 previously unannotated coding sequences, including 2,442 translated upstream open reading frames. We observed an enrichment of harringtonine-treated ribosome footprints at the inferred initiation sites, validating many of the novel coding sequences. The novel sequences exhibit significant signatures of selective constraint in the reading frames of the inferred proteins, suggesting that many of these are functional. Nearly 40% of bicistronic transcripts showed significant negative correlation in the levels of translation of their two coding sequences, suggesting a key regulatory role for these novel translated sequences. Our work significantly expands the set of known coding regions in humans.


2015 ◽  
Author(s):  
Sam Penglase ◽  
Kristin Hamre ◽  
Ståle Ellingsen

Selenoprotein P (SEPP1) distributes selenium (Se) throughout the body via the circulatory system. The Se content of SEPP1 varies from 7 to 18 Se atoms depending on the species, but the reason for this variation remains unclear. Herein we provide evidence that vertebrate SEPP1 Sec content correlates positively with Se requirements (R² = 0.88). As the Se content of full-length SEPP1 is genetically determined, this presents a unique case where a nutrient requirement can be predicted based on genomic sequence information.


2020 ◽  
Author(s):  
Chao Wei ◽  
Junying Zhang ◽  
Xiguo Yuan ◽  
Zongzhen He ◽  
Guojun Liu

Protein coding region prediction is a very important but overlooked subtask of problems such as predicting complete gene structure and coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed them into a classifier for final prediction. However, the encoding scheme directly influences the classifier's ability to capture coding features, and how to choose a proper encoding scheme remains uncertain. Recently, we proposed a method for predicting protein coding regions in transcript sequences based on a bidirectional recurrent neural network with non-overlapping kmers, and achieved considerable improvement over existing methods, but there is still much room to improve performance. In fact, kmer features that count the occurrence frequency of trinucleotides reflect only the local sequence order among immediately adjacent nucleotides and lose almost all global sequence order information. In view of this, we present a deep learning framework with hybrid encoding for protein coding region prediction in biological sequences, which effectively exploits global sequence order information, non-overlapping kmer features and statistical dependencies among coding labels. Evaluated on genomic and transcript sequences, our proposed method significantly outperforms existing state-of-the-art methods.
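As a rough illustration of what a non-overlapping kmer encoding looks like, the sketch below splits a sequence into consecutive trinucleotides and maps each one to an integer index. It is not the authors' encoder: the helper names, the `frame` parameter, the lexicographic index order, and the choice to drop an incomplete trailing kmer are all assumptions made for this example.

```python
from itertools import product
from typing import List

# Integer index for each of the 64 trinucleotides, in lexicographic order
# over the alphabet A, C, G, T (an arbitrary but reproducible choice).
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}


def nonoverlapping_kmers(seq: str, k: int = 3, frame: int = 0) -> List[str]:
    """Split seq into consecutive, non-overlapping kmers starting at frame.

    An incomplete kmer at the end of the sequence is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(frame, len(seq) - k + 1, k)]


def encode(seq: str, frame: int = 0) -> List[int]:
    """Encode seq as a list of trinucleotide indices for the given frame."""
    return [KMER_INDEX[kmer] for kmer in nonoverlapping_kmers(seq, 3, frame)]


tokens = nonoverlapping_kmers("ATGGCCTAA")
indices = encode("ATGGCCTAA")
shifted = nonoverlapping_kmers("ATGGCCTAA", frame=1)
```

Shifting `frame` by one base changes every token, which is why a per-token sequence of this kind preserves order information that a plain trinucleotide frequency count discards.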


Author(s):  
Rose G. Mage ◽  
Claire Rogel-Gaillard

This chapter on immunogenetics in the rabbit focused on some genes with genetic and genomic sequence information, including those encoding: soluble circulating immunoglobulin molecules (Igs) and their surface-bound forms on B lymphocytes (BCRs); T-cell receptors on T lymphocyte surfaces (TCRs); the rabbit Leukocyte Antigen (RLA) complex (proteins on cells that function to present antigen fragments to TCRs); and some cytokine genes that encode key regulators of T- and B-cell responses.


2013 ◽  
Vol 22 (8) ◽  
pp. 964-968 ◽  
Author(s):  
Leila Jamal ◽  
Julie C Sapp ◽  
Katie Lewis ◽  
Tatiane Yanes ◽  
Flavia M Facio ◽  
...  

Science ◽  
2020 ◽  
Vol 367 (6473) ◽  
pp. 96-100 ◽  
Author(s):  
Candace S. Britton ◽  
Trevor R. Sorrells ◽  
Alexander D. Johnson

Changes in both the coding sequence of transcriptional regulators and in the cis-regulatory sequences recognized by these regulators have been implicated in the evolution of transcriptional circuits. However, little is known about how they evolved in concert. We describe an evolutionary pathway in fungi where a new transcriptional circuit (a-specific gene repression by the homeodomain protein Matα2) evolved by coding changes in this ancient regulator, followed millions of years later by cis-regulatory sequence changes in the genes of its future regulon. By analyzing a group of species that has acquired the coding changes but not the cis-regulatory sites, we show that the coding changes became necessary for the regulator’s deeply conserved function, thereby poising the regulator to jump-start formation of the new circuit.

