New machine learning and physics-based scoring functions for drug discovery

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Isabella A. Guedes ◽  
André M. S. Barreto ◽  
Diogo Marinho ◽  
Eduardo Krempser ◽  
Mélaine A. Kuenemann ◽  
...  

Abstract: Scoring functions are essential for modern in silico drug discovery. However, accurate prediction of binding affinity by scoring functions remains a challenging task, and their performance is very heterogeneous across different target classes. Scoring functions based on precise physics-based descriptors that better represent the protein–ligand recognition process are strongly needed. We developed a set of new empirical scoring functions, named DockTScore, by explicitly accounting for physics-based terms combined with machine learning. Target-specific scoring functions were developed for two important drug targets, proteases and protein–protein interactions, representing an original class of molecules for drug discovery. Multiple linear regression (MLR), support vector machine and random forest algorithms were employed to derive general and target-specific scoring functions involving optimized MMFF94S force-field terms, solvation and lipophilic interaction terms, and an improved term accounting for the contribution of ligand torsional entropy to ligand binding. The DockTScore scoring functions proved competitive with the current best-evaluated scoring functions in terms of binding energy prediction and ranking on four DUD-E datasets, and will be useful for in silico drug design for diverse proteins as well as for specific targets such as proteases and protein–protein interactions. Currently, the MLR DockTScore is available at www.dockthor.lncc.br.
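To make the MLR form of such a scoring function concrete, here is a minimal sketch of a linear score computed as an intercept plus a weighted sum of per-pose terms. The term names, the weights, the intercept and the example values are all hypothetical placeholders chosen for illustration; they are not the published DockTScore terms or coefficients.

```python
from dataclasses import dataclass


@dataclass
class PoseTerms:
    """Hypothetical descriptor values for one protein-ligand pose."""
    vdw: float            # van der Waals term (assumed force-field term)
    electrostatic: float  # electrostatic term (assumed force-field term)
    solvation: float      # solvation term
    lipophilic: float     # lipophilic interaction term
    torsional: float      # ligand torsional entropy penalty


# Illustrative weights only; NOT the published DockTScore coefficients.
WEIGHTS = {
    "vdw": 0.1,
    "electrostatic": 0.05,
    "solvation": 0.2,
    "lipophilic": -0.3,
    "torsional": 0.25,
}
INTERCEPT = -1.0


def linear_score(terms, weights=WEIGHTS, intercept=INTERCEPT):
    """Return intercept + sum(weight_i * term_i), the general form of an MLR model."""
    return intercept + sum(w * getattr(terms, name) for name, w in weights.items())


# Made-up pose values, used only to show the arithmetic.
pose = PoseTerms(vdw=-30.0, electrostatic=-10.0, solvation=2.0,
                 lipophilic=5.0, torsional=4.0)
print(round(linear_score(pose), 3))
```

In a real workflow the weights would be fitted by regression against experimental binding affinities. The support vector machine and random forest variants mentioned in the abstract replace this weighted sum with a nonlinear model over the same kind of inputs.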

Author(s):  
Alexander Goncearenco ◽  
Minghui Li ◽  
Franco L. Simonetti ◽  
Benjamin A. Shoemaker ◽  
Anna R. Panchenko

2021 ◽  
Author(s):  
Shengya Cao ◽  
Nadia Martinez-Martin

Technological improvements in unbiased screening have accelerated drug target discovery. In particular, membrane-embedded and secreted proteins have gained attention because of their ability to orchestrate intercellular communication. Dysregulation of their extracellular protein–protein interactions (ePPIs) underlies the initiation and progression of many human diseases. Practically, ePPIs are also accessible for modulation by therapeutics since they operate outside of the plasma membrane. Therefore, it is unsurprising that while these proteins make up about 30% of human genes, they encompass the majority of drug targets approved by the FDA. Even so, most secreted and membrane proteins remain uncharacterized in terms of binding partners and cellular functions. To address this, a number of approaches have been developed to overcome challenges associated with membrane protein biology and ePPI discovery. This chapter will cover recent advances that use high-throughput methods to move towards the generation of a comprehensive network of ePPIs in humans for future targeted drug discovery.


Author(s):  
Piyali Chatterjee ◽  
Subhadip Basu ◽  
Mahantapas Kundu ◽  
Mita Nasipuri ◽  
Dariusz Plewczynski

Abstract: Protein-protein interactions (PPI) control most of the biological processes in a living cell. To fully understand protein functions, knowledge of protein-protein interactions is necessary. Prediction of PPI is challenging, especially when the three-dimensional structure of the interacting partners is not known. Recently, a novel prediction method was proposed that exploits physical interactions of constituent domains. We propose here a novel knowledge-based prediction method, namely PPI_SVM, which predicts interactions between two protein sequences by exploiting their domain information. We trained a two-class support vector machine on the benchmarking set of pairs of interacting proteins extracted from the Database of Interacting Proteins (DIP). Unlike most existing approaches, the method considers all possible combinations of constituent domains between two protein sequences. Moreover, it deals with both single-domain and multi-domain proteins; therefore, it can be applied to the whole proteome in high-throughput studies. Our machine learning classifier, following a brainstorming approach, achieves an accuracy of 86%, with a specificity of 95% and a sensitivity of 75%, which are better results than most previous methods that sacrifice recall in order to boost overall precision. Our method has on average better sensitivity combined with good selectivity on the benchmarking dataset. The PPI_SVM source code, train/test datasets and supplementary files are freely available in the public domain at: http://code.google.com/p/cmater-bioinfo/.
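The accuracy, specificity and sensitivity figures quoted above are all derived from a binary confusion matrix. The sketch below shows how these quantities relate to one another. The counts in the example are invented for illustration and are not the PPI_SVM benchmark data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: interacting pairs predicted as interacting / non-interacting.
    tn/fp: non-interacting pairs predicted as non-interacting / interacting.
    """
    if min(tp, fp, tn, fn) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("need at least one positive, one negative and one positive prediction")
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Made-up counts for a balanced test set of 100 pairs (illustration only).
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity (recall) and precision share the true-positive count but have different denominators, a classifier can raise one at the expense of the other, which is the trade-off the abstract alludes to.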


2020 ◽  
Author(s):  
Kushal Batra ◽  
Kimberley M. Zorn ◽  
Daniel H. Foil ◽  
Eni Minerali ◽  
Victor O. Gawriljuk ◽  
...  

The growing public and private datasets focused on small molecules screened against biological targets or whole organisms [1] provide a wealth of data relevant to drug discovery. Increasingly, these data are used to create machine learning models that can enable target-based design [2-4], predict on- or off-target effects, and create scoring functions [5,6]. This is matched by the availability of machine learning algorithms such as Support Vector Machines (SVM) and Deep Neural Networks (DNN), which are computationally expensive to run on very large datasets and thousands of molecular descriptors. Quantum computer (QC) algorithms have been proposed as an approach to accelerate quantum machine learning over classical computer (CC) algorithms, albeit with significant limitations. In the case of cheminformatics, one of the challenges to overcome is the need to compress large numbers of molecular descriptors for use on a QC. Here we show how to achieve compression with datasets ranging from hundreds of molecules (SARS-CoV-2) to hundreds of thousands (whole-cell screening datasets for plague and M. tuberculosis), using SVM and a data re-uploading classifier (a DNN-equivalent algorithm) on a QC, benchmarked against CC and hybrid approaches. This illustrates a quantum advantage for drug discovery to build upon in the future.


Author(s):  
Varsha D Badal ◽  
Petras J Kundrotas ◽  
Ilya A Vakser

Abstract
Motivation: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins.
Results: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models, with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full-text articles.
Availability: The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04.
Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Ben Geoffrey A S ◽  
Rafal Madaj ◽  
Akhil Sanker ◽  
Mario Sergio Valdés Tresanco ◽  
Host Antony Davidd ◽  
...  

This work presents a Python-based programmatic tool that automates the drug discovery workflow for coronavirus. First, the Python program automates data mining of the PubChem database to collect the data required by a machine learning based AutoQSAR algorithm, through which drug leads for coronavirus are generated. Data were acquired from PubChem through Python web scraping techniques. The workflow of the machine learning based AutoQSAR involves feature learning and descriptor selection, QSAR modelling, validation and prediction. The drug leads generated by the program are required to satisfy Lipinski's drug-likeness criteria, since compounds that satisfy these criteria are likely to be orally active drugs in humans. The drug leads are then fed as programmatic inputs to an in silico modelling package to model the interaction between the compounds and two coronavirus drug targets, identified by their PDB IDs: 6W9C and 1P9U. The results are stored in the user's working folder. The program also generates protein-ligand interaction profiles and stores the visualized images in the user's working folder. The tool thus eases drug identification for coronavirus through a fully automated QSAR and automated in silico modelling of the drug leads generated by the AutoQSAR algorithm.

The program is hosted, maintained and supported at the following GitHub repository: https://github.com/bengeof/Programmatic-tool-to-automate-the-drug-discovery-workflow-for-coronavirus
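As an illustration of the Lipinski drug-likeness filter mentioned in this abstract, the sketch below counts violations of the commonly cited rule-of-five thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). It is not taken from the tool's repository: the function names, the `max_violations` default and the example descriptor values are assumptions, and the descriptors are supplied by hand rather than computed from a structure.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many rule-of-five thresholds a compound exceeds."""
    checks = [
        mol_weight > 500,   # molecular weight in daltons
        logp > 5,           # octanol-water partition coefficient
        h_donors > 5,       # hydrogen-bond donors
        h_acceptors > 10,   # hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_lipinski(mol_weight, logp, h_donors, h_acceptors, max_violations=1):
    """Return True if the compound has at most `max_violations` violations.

    Allowing one violation is a common convention; set max_violations=0
    for a strict filter. How the original tool applies the rule is not
    stated in the abstract, so this default is an assumption.
    """
    return lipinski_violations(mol_weight, logp, h_donors, h_acceptors) <= max_violations


# Approximate, hand-entered descriptor values for a small aspirin-like molecule
# and an invented large, lipophilic compound (illustration only).
small = dict(mol_weight=180.16, logp=1.2, h_donors=1, h_acceptors=4)
large = dict(mol_weight=800.0, logp=6.5, h_donors=6, h_acceptors=12)

print(passes_lipinski(**small))
print(passes_lipinski(**large))
```

In practice the four descriptors would be computed from each candidate structure by a cheminformatics toolkit before the filter is applied.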


2021 ◽  
Author(s):  
Amro Safadi ◽  
Simon C Lovell ◽  
Andrew James Doig

The identification of genes that may be linked to cancer is of great importance for the discovery of new drug targets. However, the rate at which cancer genes are being found experimentally is slow, owing to the complexity of the identification and confirmation process, which leaves only a narrow range of therapeutic targets to investigate and develop. One solution to this problem is to use predictive analysis techniques that can accurately identify cancer gene candidates in a timely fashion. Furthermore, identifying characteristics that are linked to cancer genes is crucial to furthering our understanding of this disease. These characteristics can be employed in recognising therapeutic drug targets. Here, we investigated whether certain properties of a gene can indicate the likelihood that it is involved in the initiation or progression of cancer. We found that essentiality scores tend to be higher for cancer genes than for all protein-coding human genes. A machine-learning model was developed, and we found that essentiality-related properties and properties arising from protein-protein interaction networks or evolution are particularly effective in predicting cancer-associated genes. We were also able to identify potential drug targets that have not previously been linked with cancer but have the characteristics of cancer-related genes.

