A regulatory-sequence classifier with a neural network for genomic information processing

Genotype-phenotype mapping is one of the fundamental challenges in biology. The difficulties stem in part from the large amount of sequence information and the puzzling genomic code, particularly of non-protein-coding regions such as gene regulatory sequences. Recently, however, deep learning-based methods have been shown to decipher the gene regulatory code of genomes. Still, prediction accuracy needs improvement. Here, we report the design of convolution layers that efficiently process genomic sequence information, and we describe DeepGMAP, software developed to train and compare different deep learning-based models (https://github.com/koonimaru/DeepGMAP). First, we demonstrate that our convolution layers, termed forward- and reverse-sequence scan (FRSS) layers, enhance the power to predict gene regulatory sequences. Second, we assess previous studies and identify problems associated with data structures that caused overfitting. Finally, we introduce several visualization methods that provide insights into the syntax of gene regulatory sequences.
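The idea of scanning a sequence in both its forward and reverse-complement orientation with the same filter can be sketched as follows. This is only an illustration and not the DeepGMAP implementation: the function names, the use of a one-hot motif as the kernel, and the choice to combine the two strands with an element-wise maximum are all assumptions made for the example.

```python
# Illustrative sketch only; not taken from DeepGMAP.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def scan(encoded, kernel):
    """Slide a (width x 4) kernel along a one-hot sequence (no padding)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for i in range(width):
            for j in range(4):
                score += encoded[start + i][j] * kernel[i][j]
        scores.append(score)
    return scores


def forward_reverse_scan(seq, kernel):
    """Scan both strands with one kernel and keep the stronger response.

    The reverse-strand scores are flipped so that index p always refers
    to the same window of the forward sequence. Combining with max() is
    an assumption of this sketch.
    """
    forward = scan(one_hot(seq), kernel)
    reverse = scan(one_hot(reverse_complement(seq)), kernel)[::-1]
    return [max(f, r) for f, r in zip(forward, reverse)]


# Hypothetical example: the motif GATA occurs only on the reverse strand.
kernel = one_hot("GATA")
sequence = "CCTATCCC"
forward_only = scan(one_hot(sequence), kernel)
both_strands = forward_reverse_scan(sequence, kernel)
```

With a forward-only scan the kernel never matches all four positions of the motif in this sequence, whereas the two-strand scan finds the full match at the window covering `TATC`, without needing a second filter for the reverse-complement pattern.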

Learning and interpreting the gene regulatory grammar in a deep learning framework

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008334 ◽

2020 ◽

Vol 16 (11) ◽

pp. e1008334

Author(s):

Ling Chen ◽

John A. Capra

Keyword(s):

Neural Networks ◽

Large Fraction ◽

Regulatory Elements ◽

Regulatory Sequence ◽

Regulatory Sequences ◽

Enhancer Activity ◽

Gradient Based ◽

Gene Regulatory ◽

Synthetic Datasets ◽

Complex Features

Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. 
Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depend on the nature of the prediction task.
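As a rough illustration of how a synthetic sequence with a homotypic or heterotypic TFBS cluster might be generated, the sketch below plants motif strings at non-overlapping positions in a random background. It is a simplification and not the authors' pipeline: the abstract describes using real TF binding motifs, whereas `MOTIF_A` and `MOTIF_B` here are hypothetical fixed placeholder strings, and the function name and uniform background are assumptions.

```python
import random

# Hypothetical placeholder motifs; real binding motifs would normally be
# sampled from position weight matrices rather than fixed strings.
MOTIF_A = "TGACTCA"
MOTIF_B = "GATAAG"


def plant_motifs(length, motifs, rng):
    """Return (sequence, starts) with each motif planted without overlap.

    Sorted random offsets into the free space are shifted by the total
    length of the motifs already placed, which guarantees that planted
    sites keep their order and never overlap.
    """
    total = sum(len(m) for m in motifs)
    if total > length:
        raise ValueError("motifs do not fit in the requested length")
    seq = [rng.choice("ACGT") for _ in range(length)]  # uniform background
    free = length - total
    offsets = sorted(rng.randint(0, free) for _ in motifs)
    starts = []
    used = 0
    for offset, motif in zip(offsets, motifs):
        start = offset + used
        seq[start:start + len(motif)] = list(motif)
        starts.append(start)
        used += len(motif)
    return "".join(seq), starts


rng = random.Random(0)  # fixed seed so the example is reproducible
# Homotypic cluster: repeated copies of one motif.
homotypic, homotypic_starts = plant_motifs(60, [MOTIF_A] * 3, rng)
# Heterotypic cluster: different motifs in the same sequence.
heterotypic, heterotypic_starts = plant_motifs(60, [MOTIF_A, MOTIF_B], rng)
```

Background positions can contain a motif by chance, so the returned start positions, not a string search, are the reliable record of where the simulated grammar was planted.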

What Genomic Sequence Information Has Revealed About Vibrio Ecology in the Ocean—A Review

Microbial Ecology ◽

10.1007/s00248-009-9578-9 ◽

2009 ◽

Vol 58 (3) ◽

pp. 447-460 ◽

Cited By ~ 41

Author(s):

Darrell Jay Grimes ◽

Crystal N. Johnson ◽

Kevin S. Dillon ◽

Adrienne R. Flowers ◽

Nicholas F. Noriea ◽

...

Keyword(s):

Genomic Sequence ◽

Sequence Information ◽

Genomic Sequence Information

Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure

Nature Communications ◽

10.1038/s41467-020-19921-4 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 1

Author(s):

Jan Zrimec ◽

Christoph S. Börlin ◽

Filip Buric ◽

Azam Sheikh Muhammad ◽

Rhongzen Chen ◽

...

Keyword(s):

Gene Expression ◽

Deep Learning ◽

Regulatory Elements ◽

Mrna Abundance ◽

Model Organisms ◽

Mrna Levels ◽

Regulatory Structure ◽

Expression Levels ◽

Coding Regions ◽

Gene Regulatory

Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning to over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation of transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and the specific combination of regulatory elements that define gene expression levels.

Thousands of novel translated open reading frames in humans inferred by ribosome footprint profiling

10.1101/031617 ◽

2015 ◽

Author(s):

Anil Raj ◽

Sidney H. Wang ◽

Heejung Shim ◽

Arbel Harpak ◽

Yang I. Li ◽

...

Keyword(s):

Significant Negative Correlation ◽

Selective Constraint ◽

Open Reading Frames ◽

Sequence Information ◽

The Novel ◽

Protein Coding ◽

Coding Sequences ◽

Coding Regions ◽

Human Lymphoblastoid Cell ◽

Reading Frames

Accurate annotation of protein coding regions is essential for understanding how genetic information is translated into biological functions. Here we describe riboHMM, a new method that uses ribosome footprint data along with gene expression and sequence information to accurately infer translated sequences. We applied our method to human lymphoblastoid cell lines and identified 7,273 previously unannotated coding sequences, including 2,442 translated upstream open reading frames. We observed an enrichment of harringtonine-treated ribosome footprints at the inferred initiation sites, validating many of the novel coding sequences. The novel sequences exhibit significant signatures of selective constraint in the reading frames of the inferred proteins, suggesting that many of these are functional. Nearly 40% of bicistronic transcripts showed significant negative correlation in the levels of translation of their two coding sequences, suggesting a key regulatory role for these novel translated sequences. Our work significantly expands the set of known coding regions in humans.

The selenium content of SEPP1 versus selenium requirements in vertebrates

10.7287/peerj.preprints.784v1 ◽

2015 ◽

Author(s):

Sam Penglase ◽

Kristin Hamre ◽

Ståle Ellingsen

Keyword(s):

Genomic Sequence ◽

Circulatory System ◽

The Body ◽

Selenoprotein P ◽

Sequence Information ◽

Selenium Content ◽

Unique Case ◽

Nutrient Requirement ◽

Genetically Determined ◽

Genomic Sequence Information

Selenoprotein P (SEPP1) distributes selenium (Se) throughout the body via the circulatory system. The Se content of SEPP1 varies from 7 to 18 Se atoms depending on the species, but the reason for this variation remains unclear. Herein we provide evidence that vertebrate SEPP1 Sec content correlates positively with Se requirements (R² = 0.88). As the Se content of full-length SEPP1 is genetically determined, this presents a unique case where a nutrient requirement can be predicted based on genomic sequence information.

A deep learning framework with hybrid encoding for protein coding regions prediction in biological sequences

10.1101/2020.11.07.372524 ◽

2020 ◽

Author(s):

Chao Wei ◽

Junying Zhang ◽

Xiguo Yuan ◽

Zongzhen He ◽

Guojun Liu

Keyword(s):

Deep Learning ◽

Noncoding Rna ◽

Order Information ◽

Biological Sequences ◽

Biological Sequence ◽

Coding Region ◽

Protein Coding ◽

Learning Framework ◽

Coding Regions ◽

Local Sequence

Protein coding region prediction is an important but overlooked subtask of problems such as the prediction of complete gene structure and of coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed these values into a classifier for the final prediction. However, the encoding scheme directly influences the classifier's ability to capture coding features, and how to choose a proper encoding scheme remains unclear. Recently, we proposed a method for predicting protein coding regions in transcript sequences based on a bidirectional recurrent neural network with non-overlapping k-mers, which achieved considerable improvement over existing methods, but there is still much room for improvement. In fact, k-mer features that count the occurrence frequency of trinucleotides reflect only the local sequence order information between the most contiguous nucleotides and lose almost all of the global sequence order information. In view of this, we present a deep learning framework with hybrid encoding for protein coding region prediction in biological sequences, which effectively exploits global sequence order information, non-overlapping k-mer features, and statistical dependencies among coding labels. Evaluated on genomic and transcript sequences, our proposed method significantly outperforms existing state-of-the-art methods.
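The contrast between non-overlapping k-mer tokens and k-mer frequency counts can be made concrete with a short sketch. It is not the authors' code: the function names, the A/C/G/T index order, and the decision to drop an incomplete trailing k-mer are assumptions chosen for the example.

```python
from collections import Counter

BASES = "ACGT"  # assumed ordering for the integer encoding


def kmer_tokens(seq, k=3):
    """Split a sequence into non-overlapping k-mers, dropping any remainder."""
    usable = len(seq) - len(seq) % k
    return [seq[i:i + k] for i in range(0, usable, k)]


def kmer_index(kmer):
    """Map a k-mer to an integer in [0, 4**k) by reading it in base 4."""
    index = 0
    for base in kmer:
        index = index * 4 + BASES.index(base)
    return index


def encode(seq, k=3):
    """Ordered integer encoding of the non-overlapping k-mers of a sequence."""
    return [kmer_index(token) for token in kmer_tokens(seq, k)]


# Two sequences built from the same trinucleotides in a different order.
first = "ATGGCC"
second = "GCCATG"
same_counts = Counter(kmer_tokens(first)) == Counter(kmer_tokens(second))
same_order = encode(first) == encode(second)
```

Here `same_counts` is true while `same_order` is false: the frequency counts cannot tell the two sequences apart, which is the loss of global order information the abstract refers to, whereas the ordered token list preserves it.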

Immunogenetics in the rabbit.

The genetics and genomics of the rabbit ◽

10.1079/9781780643342.0005 ◽

2021 ◽

pp. 66-83

Author(s):

Rose G. Mage ◽

Claire Rogel-Gaillard

Keyword(s):

T Cell ◽

B Cell ◽

B Lymphocytes ◽

T Lymphocyte ◽

Genomic Sequence ◽

Sequence Information ◽

Cell Receptors ◽

Leukocyte Antigen ◽

Cell Responses ◽

Genomic Sequence Information

This chapter on immunogenetics in the rabbit focuses on genes for which genetic and genomic sequence information is available, including those encoding: soluble circulating immunoglobulin molecules (Igs) and their surface-bound forms on B lymphocytes (BCRs); T-cell receptors (TCRs) on T lymphocyte surfaces; the rabbit Leukocyte Antigen (RLA) complex (proteins on cells that function to present antigen fragments to TCRs); and some cytokine genes that encode key regulators of T- and B-cell responses.

Predicting Long Non-coding RNAs Based on Genomic Sequence Information

Computational Molecular Biology ◽

10.5376/cmb.2013.03.0004 ◽

2013 ◽

Author(s):

Jie Lv ◽

Hongbo Liu ◽

Hui Liu ◽

Qiong Wu ◽

Yan Zhang

Keyword(s):

Genomic Sequence ◽

Sequence Information ◽

Non Coding Rnas ◽

Genomic Sequence Information

Research participants’ attitudes towards the confidentiality of genomic sequence information

European Journal of Human Genetics ◽

10.1038/ejhg.2013.276 ◽

2013 ◽

Vol 22 (8) ◽

pp. 964-968 ◽

Cited By ~ 22

Author(s):

Leila Jamal ◽

Julie C Sapp ◽

Katie Lewis ◽

Tatiane Yanes ◽

Flavia M Facio ◽

...

Keyword(s):

Genomic Sequence ◽

Sequence Information ◽

Research Participants ◽

Genomic Sequence Information

Protein-coding changes preceded cis-regulatory gains in a newly evolved transcription circuit

Science ◽

10.1126/science.aax5217 ◽

2020 ◽

Vol 367 (6473) ◽

pp. 96-100 ◽

Cited By ~ 1

Author(s):

Candace S. Britton ◽

Trevor R. Sorrells ◽

Alexander D. Johnson

Keyword(s):

Transcriptional Regulators ◽

Gene Repression ◽

Specific Gene ◽

Regulatory Sequence ◽

Regulatory Sequences ◽

Homeodomain Protein ◽

Protein Coding ◽

Evolutionary Pathway ◽

Coding Sequence ◽

Regulatory Sites

Changes in both the coding sequence of transcriptional regulators and in the cis-regulatory sequences recognized by these regulators have been implicated in the evolution of transcriptional circuits. However, little is known about how they evolved in concert. We describe an evolutionary pathway in fungi where a new transcriptional circuit (a-specific gene repression by the homeodomain protein Matα2) evolved by coding changes in this ancient regulator, followed millions of years later by cis-regulatory sequence changes in the genes of its future regulon. By analyzing a group of species that has acquired the coding changes but not the cis-regulatory sites, we show that the coding changes became necessary for the regulator’s deeply conserved function, thereby poising the regulator to jump-start formation of the new circuit.
