protein annotation Latest Research Papers

PASS: Protein Annotation Surveillance Site for Protein Annotation Using Homologous Clusters, NLP, and Sequence Similarity Networks

Frontiers in Bioinformatics ◽

10.3389/fbinf.2021.749008 ◽

2021 ◽

Vol 1 ◽

Author(s):

Jin Tao ◽

Kelly A. Brayton ◽

Shira L. Broschat

Keyword(s):

Protein Function ◽

Expert Knowledge ◽

Sequence Similarity ◽

Input Sequence ◽

Software Tool ◽

Sequence Information ◽

Protein Annotation ◽

Surveillance Site ◽

UniProtKB Database ◽

Learning Software

Advances in genome sequencing have accelerated the growth in the number of sequenced genomes, but at a cost to the quality of genome annotation. At the same time, computational analysis is widely used for protein annotation, but a dearth of experimental verification has contributed to inaccurate annotation as well as to the propagation of annotation errors. Thus, a tool to help life scientists with accurate protein annotation would be useful. In this work we describe a website we have developed, the Protein Annotation Surveillance Site (PASS), which provides such a tool. This website consists of three major components: a database of homologous clusters of more than eight million protein sequences deduced from the representative genomes of bacteria, archaea, eukarya, and viruses, together with sequence information; a machine-learning software tool that periodically queries the UniProtKB database to determine whether protein function has been experimentally verified; and a queryable webpage that returns the FASTA headers of sequences from the cluster best matching an input sequence. The user can choose from these sequences to create a sequence similarity network to assist in annotation or else use their expert knowledge to choose an annotation from the cluster sequences. Illustrations demonstrating use of this website are presented.
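
The network-building step described in the abstract above can be illustrated with a minimal sketch. The abstract does not say how PASS scores similarity or draws edges, so everything here is an assumption: the function names `percent_identity` and `build_ssn`, the 60% threshold, the toy sequences, and the naive position-by-position identity score, which stands in for a real alignment score.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Naive identity: matching positions over the longer length (no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def build_ssn(seqs: dict, threshold: float = 60.0) -> dict:
    """Return an adjacency map linking sequences whose identity meets the threshold."""
    graph = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if percent_identity(seqs[u], seqs[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Hypothetical cluster members selected by a user (toy sequences, not real data).
cluster = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIAKQK",
    "seqC": "MKTAYLAKQR",
    "seqD": "GGSPLWWNCE",
}
ssn = build_ssn(cluster)
for name, neighbours in ssn.items():
    print(name, sorted(neighbours))
```

In this toy input, the three closely related sequences form one connected group and the unrelated one stays isolated, which is the kind of grouping a curator would inspect when choosing an annotation.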

Glycoproteogenomics characterizes the CD44 splicing code driving bladder cancer invasion

10.1101/2021.09.04.458979 ◽

2021 ◽

Author(s):

Cristiana Gaiteiro ◽

Janine Soares ◽

Marta Relvas-Santos ◽

Andreia Peixoto ◽

Dylan Ferreira ◽

...

Keyword(s):

Bladder Cancer ◽

Alternative Splicing ◽

Molecular Characterization ◽

Poor Prognosis ◽

Protein Isoforms ◽

Precision Oncology ◽

Protein Annotation ◽

Cell Surface Glycoprotein ◽

Targeted Interventions ◽

Potential Biomarker

Abstract Bladder cancer (BC) management demands the introduction of novel molecular targets for precision medicine. Cell surface glycoprotein CD44 has been widely studied as a potential biomarker of BC aggressiveness and cancer stem cells. However, significant alternative splicing and multiple glycosylation generate a myriad of glycoproteoforms with potentially distinct functional roles. The lack of tools for precise molecular characterization has led to conflicting results, delaying clinical applications. Addressing these limitations, we have interrogated the transcriptome of a large BC patient cohort for splicing signatures. Remarkable CD44 heterogeneity was observed, as well as associations between the short CD44 standard splicing isoform (CD44s), invasion and poor prognosis. In parallel, immunoassays showed that targeting short O-glycoforms could hold the key to improving CD44 cancer specificity. This prompted the development of a glycoproteogenomics approach, building on the integration of transcriptomics-customized datasets and glycomics for protein annotation from nanoLC-ESI-MS/MS experiments. The concept was applied to invasive human BC cell lines, glycoengineered cells, and tumor tissues, enabling unequivocal CD44s identification. Finally, we confirmed the link between CD44s and invasion in vitro by siRNA knockdown, supporting findings from BC tissues. The key role played by short-chain O-glycans in CD44-mediated invasion was also demonstrated through glycoengineered cell models. Overall, CD44s emerged as a biomarker of poor prognosis and CD44-Tn/STn as promising molecular signatures for targeted interventions. This study materializes the concept of glycoproteogenomics and provides a key vision to address the cancer splicing code at the protein level, which may now be expanded to better understand the functional role of CD44 in health and disease.

Significance Statement The biological role of CD44, a cell membrane glycoprotein involved in most cancer hallmarks and widely explored in BC, is intimately linked to its protein isoforms. mRNA alternative splicing generates several closely related polypeptide sequences, which have so far been inferred from transcript analysis, in the absence of workflows for unequivocal protein annotation. Dense O-glycosylation is also key for protein function and may exponentiate the number of proteoforms, rendering CD44 molecular characterization a daunting enterprise. Here, we integrated multiple types of molecular information (RNA, proteins, glycans) for definitive CD44 characterization by mass spectrometry, materializing the concept of glycoproteogenomics. BC-specific glycoproteoforms linked to invasion have been identified, holding potential for precise cancer targeting. The approach may be transferable to other tumors, paving the way for precision oncology.

Metric Labeling and Semimetric Embedding for Protein Annotation Prediction

Journal of Computational Biology ◽

10.1089/cmb.2020.0425 ◽

2020 ◽

Author(s):

Emre Sefer ◽

Carl Kingsford

Keyword(s):

Protein Annotation

Automated Confirmation of Protein Annotation Using NLP and the UniProtKB Database

Applied Sciences ◽

10.3390/app11010024 ◽

2020 ◽

Vol 11 (1) ◽

pp. 24

Author(s):

Jin Tao ◽

Kelly Brayton ◽

Shira Broschat

Keyword(s):

Language Processing ◽

Fine Tuning ◽

Support Vector ◽

Protein Annotation ◽

Computing Power ◽

Journal Publication ◽

Novel Approach ◽

UniProtKB Database ◽

Public Repositories ◽

Annotation Errors

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
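
The F1 score quoted in the abstract above follows from the quoted precision and recall, since F1 is their harmonic mean. The short check below is only a sketch of that arithmetic; the helper name `f1_score` is an assumption and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, written as fractions.
precision, recall = 0.6519, 0.9125
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")
```

The result agrees with the reported 76.05%. It also shows the trade-off the authors describe: favouring recall keeps the harmonic mean well below the recall figure because precision is the lower of the two values.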

Mosaic evolution of the phosphopantothenate biosynthesis pathway in bacteria and archaea

Genome Biology and Evolution ◽

10.1093/gbe/evaa262 ◽

2020 ◽

Author(s):

Luc Thomès ◽

Alain Lescure

Keyword(s):

Metabolic Pathways ◽

Biosynthetic Pathway ◽

Coenzyme A ◽

Symbiotic Bacteria ◽

Dynamic Evolution ◽

Biosynthesis Pathway ◽

Protein Annotation ◽

Picrophilus Torridus ◽

Domains Of Life ◽

Mosaic Evolution

Abstract Phosphopantothenate is a precursor in the synthesis of Coenzyme A (CoA), a molecule essential to many metabolic pathways. Organisms of the archaeal phyla were shown to utilize a phosphopantothenate biosynthetic pathway different from the eukaryotic and bacterial one. In this study, we report that symbiotic bacteria from the group Candidatus Poribacteria possess enzymes of the archaeal pathway, namely pantoate kinase (PoK) and phosphopantothenate synthetase (PPS), mirroring what was demonstrated for Picrophilus torridus, an archaeon that partially utilizes the bacterial pathway. Our results support the ancient origin of the CoA pathway in the three domains of life, but also highlight its complex and dynamic evolution. Importantly, this study helps to improve protein annotation for this pathway in the Candidatus Poribacteria group and other related organisms.

Protein Annotation of Breast-cancer-related Proteins with Machine-learning Tools

Makara Journal of Science ◽

10.7454/mss.v24i1.12106 ◽

2020 ◽

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Learning Tools ◽

Protein Annotation ◽

Related Proteins

UPCLASS: a deep learning-based classifier for UniProtKB entry publications

Database ◽

10.1093/database/baaa026 ◽

2020 ◽

Vol 2020 ◽

Cited By ~ 1

Author(s):

Douglas Teodoro ◽

Julien Knafou ◽

Nona Naderi ◽

Emilie Pasche ◽

Julien Gobeill ◽

...

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Logistic Regression ◽

Deep Learning ◽

Specific Protein ◽

Support Vector ◽

Protein Annotation ◽

Feature Sets ◽

Percentage Points ◽

Main Challenge

Abstract In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
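
The abstract above reports both a micro and a macro F1-score. As a hedged illustration of why the two can differ, the sketch below computes both from per-category counts. The counts are invented for the example and are not results from the paper; the function names are likewise assumptions.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: dict) -> tuple:
    """counts maps category -> (tp, fp, fn); returns (micro F1, macro F1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)  # pool all decisions, then score once
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # average per-category scores
    return micro, macro


# Invented counts: one frequent, well-classified category and two rare, harder ones.
counts = {
    "function": (80, 10, 10),
    "interaction": (10, 5, 15),
    "expression": (5, 5, 10),
}
micro, macro = micro_macro_f1(counts)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

With these toy counts the micro score is higher than the macro score, because micro averaging is dominated by the frequent category while macro averaging weights every category equally.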

UPCLASS: a Deep Learning-based Classifier for UniProtKB Entry Publications

10.1101/842062 ◽

2019 ◽

Author(s):

Douglas Teodoro ◽

Julien Knafou ◽

Nona Naderi ◽

Emilie Pasche ◽

Julien Gobeill ◽

...

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Logistic Regression ◽

Deep Learning ◽

Specific Protein ◽

Support Vector ◽

Protein Annotation ◽

Feature Sets ◽

Percentage Points ◽

Main Challenge

Abstract In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing the computationally mapped bibliography in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins, and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved an F1-score of 0.72, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession.

mPies: a novel metaproteomics tool for the creation of relevant protein databases and automatized protein annotation

Biology Direct ◽

10.1186/s13062-019-0253-x ◽

2019 ◽

Vol 14 (1) ◽

Cited By ~ 1

Author(s):

Johannes Werner ◽

Augustin Géron ◽

Jules Kerssemakers ◽

Sabine Matallana-Surget

Keyword(s):

Input Data ◽

Rapid Development ◽

Protein Annotation ◽

Group Level ◽

Protein Databases ◽

Protein Inference ◽

Protein Group ◽

The Creation ◽

Public Repositories ◽

First Time

Abstract Metaproteomics makes it possible to decipher the structure and functionality of microbial communities. Despite its rapid development, crucial steps such as the creation of standardized protein search databases and reliable protein annotation remain challenging. To address these critical steps, we developed a new program named mPies (metaProteomics in environmental sciences). mPies allows the creation of protein databases derived from assembled or unassembled metagenomes, and/or public repositories based on taxon IDs, gene or protein names. For the first time, mPies facilitates the automation of reliable taxonomic and functional consensus annotations at the protein group level, minimizing the well-known protein inference issue, which is commonly encountered in metaproteomics. mPies’ workflow is highly customizable with regard to input data, workflow steps, and parameter adjustment. mPies is implemented in Python 3/Snakemake and freely available on GitHub: https://github.com/johanneswerner/mPies/. Reviewer: This article was reviewed by Dr. Wilson Wen Bin Goh.
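
The idea of a consensus annotation at the protein group level, mentioned in the abstract above, can be sketched as follows. The abstract does not describe the rule mPies applies, so the strict-majority vote used here is a stand-in chosen for illustration. The function name `consensus_annotation`, the `min_fraction` parameter, and the example groups are all hypothetical.

```python
from collections import Counter


def consensus_annotation(annotations, min_fraction=0.5):
    """Return the label shared by more than min_fraction of group members, else None."""
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count / len(annotations) > min_fraction else None


# Hypothetical protein groups, each listing the annotation of every member protein.
protein_groups = {
    "group1": ["ABC transporter", "ABC transporter", "permease"],
    "group2": ["kinase", "phosphatase"],
}
consensus = {group: consensus_annotation(labels) for group, labels in protein_groups.items()}
print(consensus)
```

Returning `None` when no label reaches a strict majority leaves ambiguous groups unannotated rather than guessing, which is one way to limit the effect of the protein inference problem on downstream results.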

mPies: a novel metaproteomics tool for the creation of relevant protein databases and automatized protein annotation

10.1101/690131 ◽

2019 ◽

Cited By ~ 2

Author(s):

Johannes Werner ◽

Augustin Geron ◽

Jules Kerssemakers ◽

Sabine Matallana-Surget

Keyword(s):

Input Data ◽

Rapid Development ◽

Protein Annotation ◽

Group Level ◽

Protein Databases ◽

Protein Inference ◽

Protein Group ◽

The Creation ◽

Public Repositories ◽

First Time

Abstract Metaproteomics makes it possible to decipher the structure and functionality of microbial communities. Despite its rapid development, crucial steps such as the creation of standardized protein search databases and reliable protein annotation remain challenging. To address these critical steps, we developed a new program named mPies (metaProteomics in environmental sciences). mPies allows the creation of protein databases derived from assembled or unassembled metagenomes, and/or public repositories based on taxon IDs, gene or protein names. For the first time, mPies facilitates the automation of reliable taxonomic and functional consensus annotations at the protein group level, minimizing the well-known protein inference issue, which is commonly encountered in metaproteomics. mPies’ workflow is highly customizable with regard to input data, workflow steps, and parameter adjustment. mPies is implemented in Python 3/Snakemake and freely available on GitHub: https://github.com/johanneswerner/mPies/.
