GEMPROT: visualization of the impact on the protein of the genetic variants found on each haplotype

Bioinformatics ◽

10.1093/bioinformatics/bty993 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2492-2494

Author(s):

Tania Cuppens ◽

Thomas E Ludwig ◽

Pascal Trouvé ◽

Emmanuelle Genin

Keyword(s):

Genetic Variants ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Supplementary Information ◽

Analysis Tool ◽

Functional Protein ◽

Key Players ◽

On Line ◽

The Impact

Abstract. Summary: When analyzing sequence data, genetic variants are usually considered one by one, without regard to whether they are found in the same individual. However, variant combinations might be key players in some diseases, as variants that are neutral on their own can become deleterious when they occur together. GEMPROT is a new analysis tool that takes a phased VCF file and visualizes the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences and the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence. Availability and implementation: GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/. Supplementary information: Supplementary data are available at Bioinformatics online.
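
To illustrate why phasing matters for this kind of analysis, the following minimal sketch splits phased genotype fields (such as `0|1`) into the variants carried by each of an individual's two haplotypes. It is not GEMPROT's code: the function name `haplotype_alleles`, the record layout, and the toy positions and alleles are all assumptions made for this example.

```python
def haplotype_alleles(records):
    """Split phased VCF-style records into two per-haplotype variant lists.

    Each record is (pos, ref, alts, gt), where gt is a phased genotype
    string such as "0|1". Allele index 0 is the reference allele and
    index k >= 1 is the k-th alternate allele.
    """
    hap1, hap2 = [], []
    for pos, ref, alts, gt in records:
        if "|" not in gt:
            # Unphased genotypes ("0/1") cannot be assigned to a haplotype.
            raise ValueError(f"genotype at {pos} is not phased: {gt}")
        for hap, idx in zip((hap1, hap2), gt.split("|")):
            if idx not in ("0", "."):  # skip reference and missing alleles
                hap.append((pos, ref, alts[int(idx) - 1]))
    return hap1, hap2


# Toy data (hypothetical positions and alleles).
records = [
    (101, "A", ["G"], "0|1"),
    (150, "C", ["T"], "1|0"),
    (230, "G", ["A", "C"], "2|1"),
]
hap1, hap2 = haplotype_alleles(records)
```

Once the variants are separated per haplotype, each list can be applied to the reference coding sequence independently, which is what makes it possible to see combinations of variants on the same protein copy.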

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666180130100838 ◽

2018 ◽

Vol 21 (2) ◽

pp. 100-110 ◽

Cited By ~ 3

Author(s):

Chun Li ◽

Jialing Zhao ◽

Changzhong Wang ◽

Yuhua Yao

Keyword(s):

Dna Binding ◽

Protein Sequence ◽

Protein Identification ◽

Binding Proteins ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

Dna Binding Proteins ◽

Support Vector ◽

Letter Sequence

Aim and Objective: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding of protein sequences and extracting the hidden information. Methods: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. Results: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. Conclusion: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
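
For readers unfamiliar with the reported metrics, the sketch below computes accuracy (ACC), the Matthews correlation coefficient (MCC), and the F1 measure from a binary confusion matrix using their standard definitions. It assumes that F1M in the abstract denotes the usual F1 measure; the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (ACC, MCC, F1) for a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    # MCC is defined as 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return acc, mcc, f1


# Hypothetical counts for a balanced test set of 100 proteins.
acc, mcc, f1 = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Note that ACC and F1 are proportions, which is why the abstract reports improvements in them as percentages, whereas MCC ranges from -1 to 1 and is reported as an absolute difference.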

Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1821905116 ◽

2019 ◽

Vol 116 (28) ◽

pp. 13996-14001 ◽

Cited By ~ 18

Author(s):

Jae Yong Ryu ◽

Hyun Uk Kim ◽

Sang Yup Lee

Keyword(s):

Deep Learning ◽

High Throughput ◽

Protein Sequence ◽

Enzyme Commission ◽

Sequence Data ◽

Protein Sequences ◽

Third Party ◽

Industrial Biotechnology ◽

High Quality ◽

Prediction Tools

High-quality and high-throughput prediction of enzyme commission (EC) numbers is essential for accurate understanding of enzyme functions, which have many implications in pathologies and industrial biotechnology. Several EC number prediction tools are currently available, but their prediction performance needs to be further improved to precisely and efficiently process an ever-increasing volume of protein sequence data. Here, we report DeepEC, a deep learning-based computational framework that predicts EC numbers for protein sequences with high precision and in a high-throughput manner. DeepEC takes a protein sequence as input and predicts EC numbers as output. DeepEC uses 3 convolutional neural networks (CNNs) as a major engine for the prediction of EC numbers, and also implements homology analysis for EC numbers that cannot be classified by the CNNs. Comparative analyses against 5 representative EC number prediction tools show that DeepEC allows the most precise prediction of EC numbers, and is the fastest and the lightest in terms of the disk space required. Furthermore, DeepEC is the most sensitive in detecting the effects of mutated domains/binding site residues of protein sequences. DeepEC can be used as an independent tool, and also as a third-party software component in combination with other computational platforms that examine metabolic reactions.

FEGS: a novel feature extraction model for protein sequences and its applications

BMC Bioinformatics ◽

10.1186/s12859-021-04223-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Zengchao Mu ◽

Ting Yu ◽

Xiaoping Liu ◽

Hongyu Zheng ◽

Leyi Wei ◽

...

Keyword(s):

Feature Extraction ◽

Protein Sequence ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

Statistical Features ◽

Research Areas ◽

Protein Functions ◽

Protein Sequence Data ◽

Extraction Model

Abstract. Background: Feature extraction of protein sequences is widely used in various research areas related to protein analysis, such as protein similarity analysis and prediction of protein functions or interactions. Results: In this study, we introduce FEGS (Feature Extraction based on Graphical and Statistical features), a novel feature extraction model of protein sequences, by developing a new technique for graphical representation of protein sequences based on the physicochemical properties of amino acids and effectively employing the statistical features of protein sequences. By fusing the graphical and statistical features, FEGS transforms a protein sequence into a 578-dimensional numerical vector. When FEGS is applied to phylogenetic analysis on five protein sequence data sets, its performance is notably better than all of the other compared methods. Conclusion: The FEGS method is carefully designed and practically powerful for extracting features of protein sequences. The current version of FEGS is developed to be user-friendly and is expected to play a crucial role in the related studies of protein sequence analyses.

Using sound to understand protein sequence data: new sonification algorithms for protein sequences and multiple sequence alignments

BMC Bioinformatics ◽

10.1186/s12859-021-04362-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Edward J. Martin ◽

Thomas R. Meagher ◽

Daniel Barker

Keyword(s):

Focus Group ◽

User Experience ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Future Directions ◽

Multiple Sequence Alignments ◽

Protein Sequence Data

Abstract. Background: The use of sound to represent sequence data (sonification) has great potential as an alternative and complement to visual representation, exploiting features of human psychoacoustic intuitions to convey nuance more effectively. We have created five parameter-mapping sonification algorithms that aim to improve knowledge discovery from protein sequences and small protein multiple sequence alignments. For two of these algorithms, we investigated their effectiveness at conveying information. To do this we focussed on subjective assessments of user experience. This entailed a focus group session and survey research by questionnaire of individuals engaged in bioinformatics research. Results: For single protein sequences, the success of our sonifications for conveying features was supported by both the survey and focus group findings. For protein multiple sequence alignments, there was limited evidence that the sonifications successfully conveyed information. Additional work is required to identify effective algorithms to render multiple sequence alignment sonification useful to researchers. Feedback from both our survey and focus groups suggests future directions for sonification of multiple alignments: animated visualisation indicating the column in the multiple alignment as the sonification progresses, user control of sequence navigation, and customisation of the sound parameters. Conclusions: Sonification approaches undertaken in this work have shown some success in conveying information from protein sequence data. Feedback points out future directions to build on the sonification approaches outlined in this paper. The effectiveness assessment process implemented in this work proved useful, giving detailed feedback and key approaches for improvement based on end-user input. The uptake of similar user experience focussed effectiveness assessments could also help with other areas of bioinformatics, for example in visualisation.

Generating functional protein variants with variational autoencoders

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008736 ◽

2021 ◽

Vol 17 (2) ◽

pp. e1008736

Author(s):

Alex Hawkins-Hooker ◽

Florence Depardieu ◽

Sebastien Baur ◽

Guillaume Couairon ◽

Arthur Chen ◽

...

Keyword(s):

Protein Design ◽

Protein Sequence ◽

Rational Design ◽

Sequence Data ◽

3D Structure ◽

Generative Models ◽

Functional Protein ◽

Long Distance ◽

Functional Variants ◽

Protein Sequence Data

The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.

An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema

Algorithms ◽

10.3390/a14020059 ◽

2021 ◽

Vol 14 (2) ◽

pp. 59

Author(s):

Roman Zoun ◽

Kay Schallert ◽

David Broneske ◽

Ivayla Trifonova ◽

Xiao Chen ◽

...

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Memory Consumption ◽

Mass Spectrometers ◽

Protein Sequence Data ◽

Relational Systems ◽

Spectrometer Data ◽

Database Engine ◽

High Storage

Mass spectrometers enable identifying proteins in biological samples, leading to biomarkers for biological process parameters and diseases. However, bioinformatic evaluation of the mass spectrometer data needs a standardized workflow and a system that stores the protein sequences. Due to their standardization and maturity, relational systems are a great fit for storing protein sequences. Hence, in this work, we present a schema for distributed column-based database management systems using a column-oriented index to store sequence data. In order to achieve a high storage performance, it was necessary to choose a well-performing strategy for transforming the protein sequence data from the FASTA format to the new schema. Therefore, we applied an in-memory map, an HDDmap, a database engine, and an extended radix tree and evaluated their performance. The results show that our proposed extended radix tree performs best regarding memory consumption and runtime. Hence, the radix tree is a suitable data structure for transforming protein sequences into the indexed schema.
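
The general idea of indexing FASTA sequences in a tree keyed by residues can be sketched as follows. This is a plain, uncompressed prefix tree, not the authors' extended radix tree and not their database schema; the names `parse_fasta` and `PrefixTree`, the `"$ids"` marker key, and the toy FASTA text are assumptions made for this example.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    entries, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            entries[header] = ""
        elif header is not None:
            entries[header] += line  # sequences may span several lines
    return entries


class PrefixTree:
    """Uncompressed prefix tree mapping whole sequences to their IDs."""

    def __init__(self):
        self.root = {}

    def insert(self, sequence, seq_id):
        node = self.root
        for residue in sequence:
            node = node.setdefault(residue, {})
        # "$ids" cannot clash with a residue, which is a single character.
        node.setdefault("$ids", []).append(seq_id)

    def lookup(self, sequence):
        node = self.root
        for residue in sequence:
            node = node.get(residue)
            if node is None:
                return []
        return node.get("$ids", [])


# Toy FASTA text (hypothetical identifiers and sequences).
fasta = ">p1\nMKV\nLA\n>p2\nMKT\n>p3\nMKVLA\n"
tree = PrefixTree()
for seq_id, seq in parse_fasta(fasta).items():
    tree.insert(seq, seq_id)
```

A radix tree differs from this sketch by merging chains of single-child nodes into one edge, which is where much of its memory saving comes from; identical sequences still collapse onto one path, as `p1` and `p3` do here.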

Improving Generalizability of Protein Sequence Models with Data Augmentations

10.1101/2021.02.18.431877 ◽

2021 ◽

Author(s):

Hongyu Shen ◽

Layne C. Price ◽

Taha Bahadori ◽

Franziska Seeger

Keyword(s):

Machine Learning ◽

Protein Sequence ◽

Data Augmentation ◽

Sequence Data ◽

Protein Sequences ◽

Representation Learning ◽

Amino Acid Replacement ◽

Fine Tuning ◽

Protein Sequence Data ◽

Tuning Methods

Abstract. While protein sequence data is an emerging application domain for machine learning methods, small modifications to protein sequences can result in difficult-to-predict changes to the protein’s function. Consequently, protein machine learning models typically do not use randomized data augmentation procedures analogous to those used in computer vision or natural language, e.g., cropping or synonym substitution. In this paper, we empirically explore a set of simple string manipulations, which we use to augment protein sequence data when fine-tuning semi-supervised protein models. We provide 276 different comparisons to the Tasks Assessing Protein Embeddings (TAPE) baseline models, with Transformer-based models and training datasets that vary from the baseline methods only in the data augmentations and representation learning procedure. For each TAPE validation task, we demonstrate improvements to the baseline scores when the learned protein representation is fixed between tasks. We also show that contrastive learning fine-tuning methods typically outperform masked-token prediction in these models, with increasing amounts of data augmentation generally improving performance for contrastive learning protein methods. We find the most consistent results across TAPE tasks when using domain-motivated transformations, such as amino acid replacement, as well as restricting the Transformer attention to randomly sampled sub-regions of the protein sequence. In rarer cases, we even find that information-destroying augmentations, such as randomly shuffling entire protein sequences, can improve downstream performance.
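
The kinds of string manipulation named in this abstract can be sketched in a few lines. The code below is an illustration only and is not the authors' implementation: the function names are hypothetical, the replacement here is uniformly random rather than domain-motivated, and sub-region sampling is shown as simple cropping, whereas the abstract describes restricting Transformer attention to a sampled sub-region.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def replace_residues(seq, rate, rng):
    """Replace each residue, with probability `rate`, by a different one."""
    out = []
    for aa in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in AMINO_ACIDS if a != aa]))
        else:
            out.append(aa)
    return "".join(out)


def shuffle_sequence(seq, rng):
    """Randomly permute the residues (an information-destroying change)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def random_subregion(seq, length, rng):
    """Return a random contiguous window of `length` residues."""
    if length >= len(seq):
        return seq
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


rng = random.Random(0)      # seeded so runs are repeatable
seq = "MKVLAAGIVGLLLAQ"     # toy sequence (hypothetical)
augmented = [
    replace_residues(seq, 0.2, rng),
    shuffle_sequence(seq, rng),
    random_subregion(seq, 5, rng),
]
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible, which matters when comparing fine-tuning runs that should differ only in the augmentation applied.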

Middle Pleistocene protein sequences from the rhinoceros genus Stephanorhinus and the phylogeny of extant and extinct Middle/Late Pleistocene Rhinocerotidae

PeerJ ◽

10.7717/peerj.3033 ◽

2017 ◽

Vol 5 ◽

pp. e3033 ◽

Cited By ~ 26

Author(s):

Frido Welker ◽

Geoff M. Smith ◽

Jarod M. Hutson ◽

Lutz Kindler ◽

Alejandro Garcia-Moreno ◽

...

Keyword(s):

Mass Spectrometry ◽

Phylogenetic Analysis ◽

Late Pleistocene ◽

Phylogenetic Relationships ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Middle Pleistocene ◽

Extant Species ◽

Protein Sequence Data

Background: Ancient protein sequences are increasingly used to elucidate the phylogenetic relationships between extinct and extant mammalian taxa. Here, we apply these recent developments to Middle Pleistocene bone specimens of the rhinoceros genus Stephanorhinus. No biomolecular sequence data is currently available for this genus, leaving phylogenetic hypotheses on its evolutionary relationships to extant and extinct rhinoceroses untested. Furthermore, recent phylogenies based on Rhinocerotidae (partial or complete) mitochondrial DNA sequences differ in the placement of the Sumatran rhinoceros (Dicerorhinus sumatrensis). Therefore, studies utilising ancient protein sequences from Middle Pleistocene contexts have the potential to provide further insights into the phylogenetic relationships between extant and extinct species, including Stephanorhinus and Dicerorhinus. Methods: ZooMS screening (zooarchaeology by mass spectrometry) was performed on several Late and Middle Pleistocene specimens from the genus Stephanorhinus, subsequently followed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) to obtain ancient protein sequences from a Middle Pleistocene Stephanorhinus specimen. We performed parallel analysis on a Late Pleistocene woolly rhinoceros specimen and extant species of rhinoceroses, resulting in the availability of protein sequence data for five extant species and two extinct genera. Phylogenetic analysis additionally included all extant Perissodactyla genera (Equus, Tapirus), and was conducted using Bayesian (MrBayes) and maximum-likelihood (RAxML) methods. Results: Various ancient proteins were identified in both the Middle and Late Pleistocene rhinoceros samples. Protein degradation and proteome complexity are consistent with an endogenous origin of the identified proteins. Phylogenetic analysis of informative proteins resolved the Perissodactyla phylogeny in agreement with previous studies in regards to the placement of the families Equidae, Tapiridae, and Rhinocerotidae. Stephanorhinus is shown to be most closely related to the genera Coelodonta and Dicerorhinus. The protein sequence data further places the Sumatran rhino in a clade together with the genus Rhinoceros, opposed to forming a clade with the black and white rhinoceros species. Discussion: The first biomolecular dataset available for Stephanorhinus places this genus together with the extinct genus Coelodonta and the extant genus Dicerorhinus. This is in agreement with morphological studies, although we are unable to resolve the order of divergence between these genera based on the protein sequences available. Our data supports the placement of the genus Dicerorhinus in a clade together with extant Rhinoceros species. Finally, the availability of protein sequence data for both extinct European rhinoceros genera allows future investigations into their geographic distribution and extinction chronologies.

DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa1030 ◽

2020 ◽

Author(s):

Daniel Munro ◽

Mona Singh

Keyword(s):

Amino Acid ◽

Protein Sequence ◽

Substitution Matrix ◽

Supplementary Information ◽

Missense Mutations ◽

Impact Prediction ◽

Variant Frequency ◽

Quantitative Impact ◽

The Impact ◽

Impact Predictions

Abstract. Motivation: Accurately predicting the quantitative impact of a substitution on a protein’s molecular function would be a great aid in understanding the effects of observed genetic variants across populations. While this remains a challenging task, new approaches can leverage data from the increasing numbers of comprehensive deep mutational scanning (DMS) studies that systematically mutate proteins and measure fitness. Results: We introduce DeMaSk, an intuitive and interpretable method based only upon DMS datasets and sequence homologs that predicts the impact of missense mutations within any protein. DeMaSk first infers a directional amino acid substitution matrix from DMS datasets and then fits a linear model that combines these substitution scores with measures of per-position evolutionary conservation and variant frequency across homologs. Despite its simplicity, DeMaSk has state-of-the-art performance in predicting the impact of amino acid substitutions, and can easily and rapidly be applied to any protein sequence. Availability and implementation: https://demask.princeton.edu generates fitness impact predictions and visualizations for any user-submitted protein sequence. Supplementary information: Supplementary data are available at Bioinformatics online.
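
One plausible way to write the linear model described in this abstract is given below. The notation is an assumption made for illustration and is not taken from the paper: $S(a \to b)$ stands for the directional substitution score inferred from DMS data, $C_i$ for a conservation measure at position $i$, $F_{i,b}$ for the frequency of residue $b$ at position $i$ among homologs, and the $\beta$ terms for fitted coefficients. Whether the published model includes an intercept or transforms any of these inputs is not stated here.

```latex
\[
  \hat{y}_{i,\,a \to b}
    = \beta_0
    + \beta_1 \, S(a \to b)
    + \beta_2 \, C_i
    + \beta_3 \, F_{i,b}
\]
```

Because the matrix is directional, $S(a \to b)$ need not equal $S(b \to a)$, so the predicted impact of replacing residue $a$ by $b$ can differ from that of the reverse substitution.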
