scholarly journals A Multi-Label Supervised Topic Model Conditioned on Arbitrary Features for Gene Function Prediction

Genes ◽  
2019 ◽  
Vol 10 (1) ◽  
pp. 57 ◽  
Author(s):  
Lin Liu ◽  
Lin Tang ◽  
Xin Jin ◽  
Wei Zhou

With the continuous accumulation of biological data, more and more machine learning algorithms have been introduced into the field of gene function prediction, which has great significance in decoding the secret of life. Recently, a multi-label supervised topic model named labeled latent Dirichlet allocation (LLDA) has been applied to gene function prediction, and obtained more accurate and explainable predictions than conventional methods. Nonetheless, the LLDA model is only able to construct a bag of amino acid words as a classification feature, and does not support any other features, such as hydrophobicity, which has a profound impact on gene function. To achieve more accurate probabilistic modeling of gene function, we propose a multi-label supervised topic model conditioned on arbitrary features, named Dirichlet multinomial regression LLDA (DMR-LLDA), for introducing multiple types of features into the process of topic modeling. Based on DMR framework, DMR-LLDA applies an exponential a priori construction, previously with weighted features, on the hyper-parameters of gene-topic distribution, so as to reflect the effects of extra features on function probability distribution. In the five-fold cross validation experiment of a yeast datasets, DMR-LLDA outperforms the compared model significantly. All of these experiments demonstrate the effectiveness and potential value of DMR-LLDA for predicting gene function.

Author(s):  
Jeffrey N Law ◽  
Shiv D Kale ◽  
T M Murali

Abstract Motivation Nearly 40% of the genes in sequenced genomes have no experimentally or computationally derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally based functional annotations and systematically transfer them to newly sequenced organisms on a genome-wide scale. However, the large sizes of such networks pose a challenge for the scalability of current methods. Results We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under 3 min for 200 bacterial species. Availability and implementation An implementation of our framework and all data used in this research are available at https://github.com/Murali-group/multi-species-GOA-prediction. Supplementary information Supplementary data are available at Bioinformatics online.


2010 ◽  
Vol 26 (7) ◽  
pp. 912-918 ◽  
Author(s):  
Christoph Lippert ◽  
Zoubin Ghahramani ◽  
Karsten M. Borgwardt

2014 ◽  
Vol 18 (4) ◽  
pp. 717-738 ◽  
Author(s):  
Peerapon Vateekul ◽  
Miroslav Kubat ◽  
Kanoksri Sarinnapakorn

Sign in / Sign up

Export Citation Format

Share Document