protein annotation
Recently Published Documents


TOTAL DOCUMENTS

78
(FIVE YEARS 11)

H-INDEX

15
(FIVE YEARS 1)

2021 ◽  
Vol 1 ◽  
Author(s):  
Jin Tao ◽  
Kelly A. Brayton ◽  
Shira L. Broschat

Advances in genome sequencing have accelerated growth in the number of sequenced genomes, but at a cost to the quality of genome annotation. At the same time, computational analysis is widely used for protein annotation, but a dearth of experimental verification has contributed to inaccurate annotation as well as to the propagation of annotation errors. Thus, a tool to help life scientists with accurate protein annotation would be useful. In this work we describe a website we have developed, the Protein Annotation Surveillance Site (PASS), which provides such a tool. This website consists of three major components: a database of homologous clusters of more than eight million protein sequences deduced from the representative genomes of bacteria, archaea, eukarya, and viruses, together with sequence information; a machine-learning software tool that periodically queries the UniProtKB database to determine whether protein function has been experimentally verified; and a queryable webpage that returns the FASTA headers of sequences from the cluster best matching an input sequence. The user can choose from these sequences to create a sequence similarity network to assist in annotation, or else use their expert knowledge to choose an annotation from the cluster sequences. Illustrations demonstrating use of this website are presented.


2021 ◽  
Author(s):  
Cristiana Gaiteiro ◽  
Janine Soares ◽  
Marta Relvas-Santos ◽  
Andreia Peixoto ◽  
Dylan Ferreira ◽  
...  

Abstract Bladder cancer (BC) management demands the introduction of novel molecular targets for precision medicine. Cell surface glycoprotein CD44 has been widely studied as a potential biomarker of BC aggressiveness and cancer stem cells. However, significant alternative splicing and multiple glycosylation generate a myriad of glycoproteoforms with potentially distinct functional roles. The lack of tools for precise molecular characterization has led to conflicting results, delaying clinical applications. Addressing these limitations, we have interrogated the transcriptome of a large BC patient cohort for splicing signatures. Remarkable CD44 heterogeneity was observed, as well as associations between the short CD44 standard splicing isoform (CD44s), invasion and poor prognosis. In parallel, immunoassays showed that targeting short O-glycoforms could hold the key to improving CD44 cancer specificity. This prompted the development of a glycoproteogenomics approach, building on the integration of transcriptomics-customized datasets and glycomics for protein annotation from nanoLC-ESI-MS/MS experiments. The concept was applied to invasive human BC cell lines, glycoengineered cells, and tumor tissues, enabling unequivocal CD44s identification. Finally, we confirmed the link between CD44s and invasion in vitro by siRNA knockdown, supporting findings from BC tissues. The key role played by short-chain O-glycans in CD44-mediated invasion was also demonstrated through glycoengineered cell models. Overall, CD44s emerged as a biomarker of poor prognosis and CD44-Tn/STn as promising molecular signatures for targeted interventions.
This study materializes the concept of glycoproteogenomics and provides a key vision to address the cancer splicing code at the protein level, which may now be expanded to better understand the functional role of CD44 in health and disease.

Significance Statement The biological role of CD44, a cell membrane glycoprotein involved in most cancer hallmarks and widely explored in BC, is intimately linked to its protein isoforms. mRNA alternative splicing generates several closely related polypeptide sequences, which have so far been inferred from transcript analysis, in the absence of workflows for unequivocal protein annotation. Dense O-glycosylation is also key for protein function and may exponentiate the number of proteoforms, rendering CD44 molecular characterization a daunting enterprise. Here, we integrated multiple types of molecular information (RNA, proteins, glycans) for definitive CD44 characterization by mass spectrometry, materializing the concept of glycoproteogenomics. BC-specific glycoproteoforms linked to invasion have been identified, holding potential for precise cancer targeting. The approach may be transferable to other tumors, paving the way for precision oncology.


2020 ◽  
Vol 11 (1) ◽  
pp. 24
Author(s):  
Jin Tao ◽  
Kelly Brayton ◽  
Shira Broschat

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
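The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below is not taken from the paper; the function and variable names are illustrative, and only the two percentages come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, converted from percentages to fractions.
reported_precision = 0.6519
reported_recall = 0.9125

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 76.05%"
```

The result agrees with the 76.05% stated in the abstract, which shows how a recall-first design trades precision for coverage while the F1 score sits between the two.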


Author(s):  
Luc Thomès ◽  
Alain Lescure

Abstract Phosphopantothenate is a precursor in the synthesis of coenzyme A (CoA), a molecule essential to many metabolic pathways. Organisms of the archaeal phyla were shown to use a phosphopantothenate biosynthetic pathway different from the eukaryotic and bacterial one. In this study, we report that symbiotic bacteria from the group Candidatus poribacteria possess enzymes of the archaeal pathway, namely pantoate kinase (PoK) and phosphopantothenate synthetase (PPS), mirroring what was demonstrated for Picrophilus torridus, an archaeon that partially uses the bacterial pathway. Our results support the ancient origin of the CoA pathway in the three domains of life, but also highlight its complex and dynamic evolution. Importantly, this study helps to improve protein annotation for this pathway in the Candidatus poribacteria group and other related organisms.


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Douglas Teodoro ◽  
Julien Knafou ◽  
Nona Naderi ◽  
Emilie Pasche ◽  
Julien Gobeill ◽  
...  

Abstract In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
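The gap between the micro and macro F1-scores above reflects how each average weights categories: micro-averaging pools counts across categories, while macro-averaging gives every category equal weight. The sketch below illustrates the difference; the per-category counts are invented for illustration and are not results from the paper.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one category
Counts = Tuple[int, int, int]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_category: Dict[str, Counts]) -> float:
    """Pool the counts over all categories, then compute a single F1."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    return f1_from_counts(tp, fp, fn)


def macro_f1(per_category: Dict[str, Counts]) -> float:
    """Compute F1 per category, then take the unweighted mean."""
    scores = [f1_from_counts(*c) for c in per_category.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical counts: one frequent, well-predicted category and two weaker ones.
toy_counts: Dict[str, Counts] = {
    "function": (80, 10, 10),
    "interaction": (5, 5, 15),
    "expression": (20, 10, 10),
}

micro = micro_f1(toy_counts)
macro = macro_f1(toy_counts)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

With these toy counts the micro score exceeds the macro score because the frequent category dominates the pooled counts, the same direction as the 0.72 versus 0.62 reported in the abstract.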


2019 ◽  
Author(s):  
Douglas Teodoro ◽  
Julien Knafou ◽  
Nona Naderi ◽  
Emilie Pasche ◽  
Julien Gobeill ◽  
...  

Abstract In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins, and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved an F1-score of 0.72, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession.


2019 ◽  
Vol 14 (1) ◽  
Author(s):  
Johannes Werner ◽  
Augustin Géron ◽  
Jules Kerssemakers ◽  
Sabine Matallana-Surget

Abstract Metaproteomics makes it possible to decipher the structure and functionality of microbial communities. Despite its rapid development, crucial steps such as the creation of standardized protein search databases and reliable protein annotation remain challenging. To overcome these critical steps, we developed a new program named mPies (metaProteomics in environmental sciences). mPies allows the creation of protein databases derived from assembled or unassembled metagenomes, and/or public repositories based on taxon IDs, gene or protein names. For the first time, mPies facilitates the automation of reliable taxonomic and functional consensus annotations at the protein group level, minimizing the well-known protein inference issue, which is commonly encountered in metaproteomics. mPies’ workflow is highly customizable with regard to input data, workflow steps, and parameter adjustment. mPies is implemented in Python 3/Snakemake and freely available on GitHub: https://github.com/johanneswerner/mPies/. Reviewer: This article was reviewed by Dr. Wilson Wen Bin Goh.
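The idea of a consensus annotation at the protein group level can be illustrated with a simple strict-majority vote over the annotations of the group's member proteins. This is only a sketch of the general idea and is not mPies' actual algorithm; the function name, the `min_fraction` threshold, and the example protein IDs and labels are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def consensus_annotation(
    group: List[str],
    annotations: Dict[str, str],
    min_fraction: float = 0.5,
) -> Optional[str]:
    """Return the label shared by more than `min_fraction` of the annotated
    proteins in `group`, or None when no label reaches that share."""
    # Proteins without an annotation are ignored rather than counted as a vote.
    labels = [annotations[p] for p in group if p in annotations]
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Strict inequality, so an even split does not produce a consensus.
    return label if count / len(labels) > min_fraction else None


# Hypothetical protein group and per-protein functional annotations.
example_annotations = {
    "P1": "ATP synthase subunit alpha",
    "P2": "ATP synthase subunit alpha",
    "P3": "hypothetical protein",
}

majority = consensus_annotation(["P1", "P2", "P3"], example_annotations)
split = consensus_annotation(["P1", "P3"], example_annotations)
print(majority, split)
```

Assigning one label per group rather than per protein sidesteps the question of which member protein a shared peptide came from, which is the protein inference issue the abstract mentions.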


2019 ◽  
Author(s):  
Johannes Werner ◽  
Augustin Geron ◽  
Jules Kerssemakers ◽  
Sabine Matallana-Surget

Abstract Metaproteomics makes it possible to decipher the structure and functionality of microbial communities. Despite its rapid development, crucial steps such as the creation of standardized protein search databases and reliable protein annotation remain challenging. To overcome these critical steps, we developed a new program named mPies (metaProteomics in environmental sciences). mPies allows the creation of protein databases derived from assembled or unassembled metagenomes, and/or public repositories based on taxon IDs, gene or protein names. For the first time, mPies facilitates the automation of reliable taxonomic and functional consensus annotations at the protein group level, minimizing the well-known protein inference issue, which is commonly encountered in metaproteomics. mPies’ workflow is highly customizable with regard to input data, workflow steps, and parameter adjustment. mPies is implemented in Python 3/Snakemake and freely available on GitHub: https://github.com/johanneswerner/mPies/.

