TargetMM: accurate missense mutation prediction by utilizing local and global sequence information with classifier ensemble

Author(s):  
Fang Ge ◽  
Jun Hu ◽  
Yi-Heng Zhu ◽  
Muhammad Arif ◽  
Dong-Jun Yu

Aim and Objective: Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM. Materials and Methods: In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. Results: By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones. Conclusion: Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source code and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.

2018 ◽  
Vol 15 (1) ◽  
pp. 45-54 ◽  
Author(s):  
Bishnupriya Panda ◽  
Babita Majhi ◽  
Abhimanyu Thakur

Background: Proteins are the utmost multi-purpose macromolecules, which play a crucial function in many aspects of biological processes. For a long time, sequence arrangement of amino acid has been utilized for the prediction of protein secondary structure. Besides, in major methods for the prediction of protein secondary structure class, the impact of Gaussian noise on sequence representation of amino acids has not been considered until now; which is one of the important constraints for the functionality of a protein. </P><P> Methods: In the present research, the prediction of protein secondary structure class was accomplished by integrated application of Stockwell transformation and Amino Acid Composition (AAC), on equivalent Electron-ion Interaction Potential (EIIP) representation of raw amino acid sequence. The introduced method was evaluated by using 4 benchmark datasets of low sequence homology, namely PDB25, 498, 277, and 204. Furthermore, random forest algorithm together with the out-of-bag error estimate and Support Vector Machine (SVM), using k-fold cross validation demonstrated high feature representation potential of our reported approach. Results: The overall prediction accuracy for PDB25, 498, 277, and 204 datasets with randomforest classifier was 92.5%, 94.79%, 92.45%, and 88.04% respectively, whereas with SVM, the results were 84.66%, 95.32%, 89.29%, and 84.37% respectively. An integrated-order-function-frequency-time (OFFT) model has been proposed for the prediction of protein secondary structure class. For the first time, we reported the effect of Gaussian noise on the prediction accuracy of protein secondary structure class and proposed a robust integrated- OFFT model, which is effectively noise resistant.


2020 ◽  
Vol 36 (11) ◽  
pp. 3350-3356 ◽  
Author(s):  
Md Mehedi Hasan ◽  
Nalini Schaduangrat ◽  
Shaherin Basith ◽  
Gwang Lee ◽  
Watshara Shoombuatong ◽  
...  

Abstract Motivation Therapeutic peptides failing at clinical trials could be attributed to their toxicity profiles like hemolytic activity, which hamper further progress of peptides as drug candidates. The accurate prediction of hemolytic peptides (HLPs) and its activity from the given peptides is one of the challenging tasks in immunoinformatics, which is essential for drug development and basic research. Although there are a few computational methods that have been proposed for this aspect, none of them are able to identify HLPs and their activities simultaneously. Results In this study, we proposed a two-layer prediction framework, called HLPpred-Fuse, that can accurately and automatically predict both hemolytic peptides (HLPs or non-HLPs) as well as HLPs activity (high and low). More specifically, feature representation learning scheme was utilized to generate 54 probabilistic features by integrating six different machine learning classifiers and nine different sequence-based encodings. Consequently, the 54 probabilistic features were fused to provide sufficiently converged sequence information which was used as an input to extremely randomized tree for the development of two final prediction models which independently identify HLP and its activity. Performance comparisons over empirical cross-validation analysis, independent test and case study against state-of-the-art methods demonstrate that HLPpred-Fuse consistently outperformed these methods in the identification of hemolytic activity. Availability and implementation For the convenience of experimental scientists, a web-based tool has been established at http://thegleelab.org/HLPpred-Fuse. Contact [email protected] or [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 20 (5) ◽  
pp. 565-578 ◽  
Author(s):  
Lidong Wang ◽  
Ruijun Zhang

Ubiquitination is an important post-translational modification (PTM) process for the regulation of protein functions, which is associated with cancer, cardiovascular and other diseases. Recent initiatives have focused on the detection of potential ubiquitination sites with the aid of physicochemical test approaches in conjunction with the application of computational methods. The identification of ubiquitination sites using laboratory tests is especially susceptible to the temporality and reversibility of the ubiquitination processes, and is also costly and time-consuming. It has been demonstrated that computational methods are effective in extracting potential rules or inferences from biological sequence collections. Up to the present, the computational strategy has been one of the critical research approaches that have been applied for the identification of ubiquitination sites, and currently, there are numerous state-of-the-art computational methods that have been developed from machine learning and statistical analysis to undertake such work. In the present study, the construction of benchmark datasets is summarized, together with feature representation methods, feature selection approaches and the classifiers involved in several previous publications. In an attempt to explore pertinent development trends for the identification of ubiquitination sites, an independent test dataset was constructed and the predicting results obtained from five prediction tools are reported here, together with some related discussions.


Animals ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 2050
Author(s):  
Beatriz Castro Dias Cuyabano ◽  
Gabriel Rovere ◽  
Dajeong Lim ◽  
Tae Hun Kim ◽  
Hak Kyo Lee ◽  
...  

It is widely known that the environment influences phenotypic expression and that its effects must be accounted for in genetic evaluation programs. The most used method to account for environmental effects is to add herd and contemporary group to the model. Although generally informative, the herd effect treats different farms as independent units. However, if two farms are located physically close to each other, they potentially share correlated environmental factors. We introduce a method to model herd effects that uses the physical distances between farms based on the Global Positioning System (GPS) coordinates as a proxy for the correlation matrix of these effects that aims to account for similarities and differences between farms due to environmental factors. A population of Hanwoo Korean cattle was used to evaluate the impact of modelling herd effects as correlated, in comparison to assuming the farms as completely independent units, on the variance components and genomic prediction. The main result was an increase in the reliabilities of the predicted genomic breeding values compared to reliabilities obtained with traditional models (across four traits evaluated, reliabilities of prediction presented increases that ranged from 0.05 ± 0.01 to 0.33 ± 0.03), suggesting that these models may overestimate heritabilities. Although little to no significant gain was obtained in phenotypic prediction, the increased reliability of the predicted genomic breeding values is of practical relevance for genetic evaluation programs.


Author(s):  
Jia-Bin Zhou ◽  
Yan-Qin Bai ◽  
Yan-Ru Guo ◽  
Hai-Xiang Lin

AbstractIn general, data contain noises which come from faulty instruments, flawed measurements or faulty communication. Learning with data in the context of classification or regression is inevitably affected by noises in the data. In order to remove or greatly reduce the impact of noises, we introduce the ideas of fuzzy membership functions and the Laplacian twin support vector machine (Lap-TSVM). A formulation of the linear intuitionistic fuzzy Laplacian twin support vector machine (IFLap-TSVM) is presented. Moreover, we extend the linear IFLap-TSVM to the nonlinear case by kernel function. The proposed IFLap-TSVM resolves the negative impact of noises and outliers by using fuzzy membership functions and is a more accurate reasonable classifier by using the geometric distribution information of labeled data and unlabeled data based on manifold regularization. Experiments with constructed artificial datasets, several UCI benchmark datasets and MNIST dataset show that the IFLap-TSVM has better classification accuracy than other state-of-the-art twin support vector machine (TSVM), intuitionistic fuzzy twin support vector machine (IFTSVM) and Lap-TSVM.


2020 ◽  
Vol 176 (2) ◽  
pp. 183-203
Author(s):  
Santosh Chapaneri ◽  
Deepak Jayaswal

Modeling the music mood has wide applications in music categorization, retrieval, and recommendation systems; however, it is challenging to computationally model the affective content of music due to its subjective nature. In this work, a structured regression framework is proposed to model the valence and arousal mood dimensions of music using a single regression model at a linear computational cost. To tackle the subjectivity phenomena, a confidence-interval based estimated consensus is computed by modeling the behavior of various annotators (e.g. biased, adversarial) and is shown to perform better than using the average annotation values. For a compact feature representation of music clips, variational Bayesian inference is used to learn the Gaussian mixture model representation of acoustic features and chord-related features are used to improve the valence estimation by probing the chord progressions between chroma frames. The dimensionality of features is further reduced using an adaptive version of kernel PCA. Using an efficient implementation of twin Gaussian process for structured regression, the proposed work achieves a significant improvement in R2 for arousal and valence dimensions relative to state-of-the-art techniques on two benchmark datasets for music mood estimation.


2021 ◽  
Vol 22 (6) ◽  
pp. 3079
Author(s):  
Xuechen Mu ◽  
Yueying Wang ◽  
Meiyu Duan ◽  
Shuai Liu ◽  
Fei Li ◽  
...  

Enhancers are short genomic regions exerting tissue-specific regulatory roles, usually for remote coding regions. Enhancers are observed in both prokaryotic and eukaryotic genomes, and their detections facilitate a better understanding of the transcriptional regulation mechanism. The accurate detection and transcriptional regulation strength evaluation of the enhancers remain a major bioinformatics challenge. Most of the current studies utilized the statistical features of short fixed-length nucleotide sequences. This study introduces the location information of each k-mer (SeqPose) into the encoding strategy of a DNA sequence and employs the attention mechanism in the two-layer bi-directional long-short term memory (BD-LSTM) model (spEnhancer) for the enhancer detection problem. The first layer of the delivered classifier discriminates between enhancers and non-enhancers, and the second layer evaluates the transcriptional regulation strength of the detected enhancer. The SeqPose-encoded features are selected by the Chi-squared test, and 45 positions are removed from further analysis. The existing studies may focus on selecting the statistical DNA sequence descriptors with large contributions to the prediction models. This study does not utilize these statistical DNA sequence descriptors. Then the word vector of the SeqPose-encoded features is obtained by using the word embedding layer. This study hypothesizes that different word vector features may contribute differently to the enhancer detection model, and assigns different weights to these word vectors through the attention mechanism in the BD-LSTM model. The previous study generously provided the training and independent test datasets, and the proposed spEnhancer is compared with the three existing state-of-the-art studies using the same experimental procedure. The leave-one-out validation data on the training dataset shows that the proposed spEnhancer achieves similar detection performances as the three existing studies. While spEnhancer achieves the best overall performance metric MCC for both of the two binary classification problems on the independent test dataset. The experimental data shows that the strategy of removing redundant positions (SeqPose) may help improve the DNA sequence-based prediction models. spEnhancer may serve well as a complementary model to the existing studies, especially for the novel query enhancers that are not included in the training dataset.


1981 ◽  
Vol 5 (2) ◽  
pp. 92-96 ◽  
Author(s):  
James L. Smith ◽  
Roy A. Mead

Abstract Two aerial photo volume prediction models, Avery's Composite Aerial Volume Table, and Mead's Quadratic Model, were compared using graphs and a small independent test set. The graphs indicated that Mead's model predicted higher merchantable volumes for pine stands in central Mississippi than did Avery's model. Both models tended to underpredict ground merchantable volume. However, only Avery's model underpredicted in a statistically significant manner. Even though the possibility of negative volume predictions exists when using Mead's Quadratic Model, it was deemed the superior model of the two investigated.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Neel Patel ◽  
William S. Bush

Abstract Background Transcriptional regulation is complex, requiring multiple cis (local) and trans acting mechanisms working in concert to drive gene expression, with disruption of these processes linked to multiple diseases. Previous computational attempts to understand the influence of regulatory mechanisms on gene expression have used prediction models containing input features derived from cis regulatory factors. However, local chromatin looping and trans-acting mechanisms are known to also influence transcriptional regulation, and their inclusion may improve model accuracy and interpretation. In this study, we create a general model of transcription factor influence on gene expression by incorporating both cis and trans gene regulatory features. Results We describe a computational framework to model gene expression for GM12878 and K562 cell lines. This framework weights the impact of transcription factor-based regulatory data using multi-omics gene regulatory networks to account for both cis and trans acting mechanisms, and measures of the local chromatin context. These prediction models perform significantly better compared to models containing cis-regulatory features alone. Models that additionally integrate long distance chromatin interactions (or chromatin looping) between distal transcription factor binding regions and gene promoters also show improved accuracy. As a demonstration of their utility, effect estimates from these models were used to weight cis-regulatory rare variants for sequence kernel association test analyses of gene expression. Conclusions Our models generate refined effect estimates for the influence of individual transcription factors on gene expression, allowing characterization of their roles across the genome. This work also provides a framework for integrating multiple data types into a single model of transcriptional regulation.


Sign in / Sign up

Export Citation Format

Share Document