New machine learning and physics-based scoring functions for drug discovery

Technological improvements in unbiased screening have accelerated drug target discovery. In particular, membrane-embedded and secreted proteins have gained attention because of their ability to orchestrate intercellular communication. Dysregulation of their extracellular protein–protein interactions (ePPIs) underlies the initiation and progression of many human diseases. In practical terms, ePPIs are also accessible to modulation by therapeutics because they operate outside the plasma membrane. It is therefore unsurprising that, although the genes encoding these proteins make up about 30% of human genes, the proteins account for the majority of targets of FDA-approved drugs. Even so, most secreted and membrane proteins remain uncharacterized in terms of binding partners and cellular functions. To address this, a number of approaches have been developed to overcome the challenges associated with membrane protein biology and ePPI discovery. This chapter covers recent advances that use high-throughput methods to move towards a comprehensive network of human ePPIs for future targeted drug discovery.

PPI_SVM: Prediction of protein-protein interactions using machine learning, domain-domain affinities and frequency tables

Cellular & Molecular Biology Letters ◽

10.2478/s11658-011-0008-x ◽

2011 ◽

Vol 16 (2) ◽

Cited By ~ 41

Author(s):

Piyali Chatterjee ◽

Subhadip Basu ◽

Mahantapas Kundu ◽

Mita Nasipuri ◽

Dariusz Plewczynski

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Three Dimensional ◽

Prediction Method ◽

Protein Sequences ◽

Dimensional Structure ◽

Support Vector ◽

Interacting Proteins ◽

Protein Protein Interactions ◽

Protein Functions

Abstract: Protein-protein interactions (PPI) control most of the biological processes in a living cell. In order to fully understand protein functions, a knowledge of protein-protein interactions is necessary. Prediction of PPI is challenging, especially when the three-dimensional structure of interacting partners is not known. Recently, a novel prediction method was proposed that exploits physical interactions of constituent domains. We propose here a novel knowledge-based prediction method, namely PPI_SVM, which predicts interactions between two protein sequences by exploiting their domain information. We trained a two-class support vector machine on the benchmarking set of pairs of interacting proteins extracted from the Database of Interacting Proteins (DIP). Unlike most existing approaches, the method considers all possible combinations of constituent domains between two protein sequences. Moreover, it deals with both single-domain and multi-domain proteins; therefore it can be applied to the whole proteome in high-throughput studies. Our machine learning classifier, following a brainstorming approach, achieves an accuracy of 86%, with a specificity of 95% and a sensitivity of 75%, which are better results than most previous methods that sacrifice recall in order to boost overall precision. Our method has, on average, better sensitivity combined with good selectivity on the benchmarking dataset. The PPI_SVM source code, train/test datasets and supplementary files are freely available in the public domain at: http://code.google.com/p/cmater-bioinfo/.
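The two ideas in this abstract that lend themselves to a short illustration are the enumeration of all domain-domain combinations between two proteins and the reported evaluation measures. The following is a minimal sketch only, not the PPI_SVM implementation: the function names, the Pfam-style domain identifiers and the confusion-matrix counts are all hypothetical and are not taken from the paper.

```python
from itertools import product


def domain_pairs(domains_a, domains_b):
    """Return every unordered domain-domain combination between two proteins.

    Works for single-domain and multi-domain proteins alike.
    """
    # Sort each pair so that (x, y) and (y, x) count as the same combination.
    unique = {tuple(sorted(pair)) for pair in product(domains_a, domains_b)}
    return sorted(unique)


def classification_metrics(tp, fp, tn, fn):
    """Compute sensitivity (recall), specificity and accuracy from counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical domain annotations for a multi-domain and a single-domain protein.
protein_a = ["PF00069", "PF00017"]
protein_b = ["PF00018"]
pairs = domain_pairs(protein_a, protein_b)

# Hypothetical counts chosen for illustration; they are not the paper's data.
metrics = classification_metrics(tp=75, fp=5, tn=95, fn=25)
```

Each combination returned by `domain_pairs` could then be mapped to a feature (for example a domain-domain affinity value) before training a classifier; how the authors encode these features is not stated in the abstract.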

Quantum Machine Learning for Drug Discovery

10.26434/chemrxiv.12781232.v1 ◽

2020 ◽

Author(s):

Kushal Batra ◽

Kimberley M. Zorn ◽

Daniel H. Foil ◽

Eni Minerali ◽

Victor O. Gawriljuk ◽

...

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Molecular Descriptors ◽

Quantum Computer ◽

Machine Learning Algorithms ◽

Support Vector ◽

Scoring Functions ◽

Biological Targets ◽

Hybrid Approaches ◽

Quantum Machine Learning

The growing public and private datasets focused on small molecules screened against biological targets or whole organisms [1] provide a wealth of data relevant to drug discovery. Increasingly, these data are used to create machine learning models that enable target-based design [2-4], predict on- or off-target effects, and create scoring functions [5,6]. This is matched by the availability of machine learning algorithms such as Support Vector Machines (SVM) and Deep Neural Networks (DNN), which are computationally expensive to run on very large datasets with thousands of molecular descriptors. Quantum computer (QC) algorithms have been proposed as a way to accelerate machine learning relative to classical computer (CC) algorithms, though with significant limitations. In the case of cheminformatics, one of the challenges to overcome is the need to compress large numbers of molecular descriptors for use on a QC. Here we show how to achieve compression with datasets ranging from hundreds of molecules (SARS-CoV-2) to hundreds of thousands (whole cell screening datasets for plague and M. tuberculosis) with SVM and a data re-uploading classifier (a DNN-equivalent algorithm) on a QC, benchmarked against CC and hybrid approaches. This illustrates a quantum advantage for drug discovery to build upon in future.

Abstract A45: In silico drug discovery targeting Hippo pathway and YAP-TEAD protein-protein interactions for small-molecule anticancer agent

10.1158/1557-3125.hippo19-a45 ◽

2020 ◽

Author(s):

Kim Jongwan ◽

Hocheol Lim ◽

K.T. No

Keyword(s):

Drug Discovery ◽

Small Molecule ◽

Protein Interactions ◽

In Silico ◽

Anticancer Agent ◽

Hippo Pathway ◽

Protein Protein Interactions

Text mining for modeling of protein complexes enhanced by machine learning

Bioinformatics ◽

10.1093/bioinformatics/btaa823 ◽

2020 ◽

Author(s):

Varsha D Badal ◽

Petras J Kundrotas ◽

Ilya A Vakser

Keyword(s):

Machine Learning ◽

Text Mining ◽

Protein Interactions ◽

Full Text ◽

Protein Complexes ◽

Protein Docking ◽

Supplementary Information ◽

Support Vector ◽

Learning Approaches ◽

Protein Protein Interactions

Abstract

Motivation: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins.

Results: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models, with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and the model is applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity of data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full-text articles.

Availability: The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04.

Supplementary information: Supplementary data are available at Bioinformatics online.
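To make the idea of "spotting" residues in text concrete, the sketch below finds mentions such as "Arg123" with a regular expression over the 20 standard three-letter amino acid codes. This is only an assumed, simplified illustration of the spotting step; the pattern, the function name `spot_residues` and the example sentence are hypothetical, and neither the authors' text-mining pipeline nor their DRNN/SVM filtering is reproduced here.

```python
import re

# Standard three-letter codes for the 20 canonical amino acids.
THREE_LETTER_CODES = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]

# Matches e.g. "Arg123", "Tyr-45" or "Gly 7" (code, optional separator, number).
RESIDUE_PATTERN = re.compile(
    r"\b(" + "|".join(THREE_LETTER_CODES) + r")[- ]?(\d+)\b"
)


def spot_residues(text):
    """Return (amino_acid, position) tuples for residue mentions in text."""
    return [(m.group(1), int(m.group(2))) for m in RESIDUE_PATTERN.finditer(text)]


# Hypothetical sentence in the style of a PubMed abstract.
sentence = "Mutation of Arg123 and Tyr-45 abolished binding, whereas Gly 7 was dispensable."
spotted = spot_residues(sentence)
```

In this example "Gly 7" is spotted even though the sentence says it is dispensable, which is the kind of irrelevant hit that a downstream interface/non-interface classifier would need to filter out.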

Network analysis and in silico prediction of protein–protein interactions with applications in drug discovery

Current Opinion in Structural Biology ◽

10.1016/j.sbi.2017.02.005 ◽

2017 ◽

Vol 44 ◽

pp. 134-142 ◽

Cited By ~ 29

Author(s):

Yoichi Murakami ◽

Lokesh P Tripathi ◽

Philip Prathipati ◽

Kenji Mizuguchi

Keyword(s):

Drug Discovery ◽

Network Analysis ◽

Protein Interactions ◽

In Silico ◽

In Silico Prediction ◽

Protein Protein Interactions

A programmatic tool for automatic ease in coronavirus drug discovery through programmatically automated data mining, QSAR and In Silico modelling

10.26434/chemrxiv.12423638.v2 ◽

2020 ◽

Author(s):

Ben Geoffrey A S ◽

Rafal Madaj ◽

Akhil Sanker ◽

Mario Sergio Valdés Tresanco ◽

Host Antony Davidd ◽

...

Keyword(s):

Machine Learning ◽

Data Mining ◽

Drug Discovery ◽

In Silico ◽

Drug Targets ◽

Feature Learning ◽

Ligand Interaction ◽

Descriptor Selection ◽

In Silico Modelling ◽

Drug Leads

This work presents a Python-based programmatic tool that automates the drug discovery workflow for coronavirus. First, the program automates data mining of the PubChem database to collect the data required to run a machine learning based AutoQSAR algorithm, which generates drug leads for coronavirus. Data acquisition from PubChem was carried out through Python web scraping techniques. The AutoQSAR workflow involves feature learning and descriptor selection, QSAR modelling, validation and prediction. The drug leads generated by the program are required to satisfy Lipinski's drug-likeness criteria, since compounds that satisfy these criteria are likely to be orally active drugs in humans. The drug leads are then fed as programmatic inputs to an in silico modelling package to model the interaction between the lead compounds and two coronavirus drug targets, identified by their PDB IDs 6W9C and 1P9U. The results are stored in the user's working folder. The program also generates protein-ligand interaction profiles and stores the visualized images in the user's working folder. Thus, the tool offers automated drug identification for coronavirus through a fully automated QSAR and automated in silico modelling of the drug leads generated by the AutoQSAR algorithm. The program is hosted, maintained and supported at the GitHub repository: https://github.com/bengeof/Programmatic-tool-to-automate-the-drug-discovery-workflow-for-coronavirus
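The Lipinski filter mentioned in this abstract can be sketched as a simple check on four precomputed descriptors, using the commonly cited rule-of-five thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). This is an illustrative sketch and not code from the authors' repository: the `Descriptors` class, the function names, the strict `max_violations=0` default and the descriptor values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors for one compound (hypothetical container)."""
    molecular_weight: float  # in daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """List which of the four rule-of-five criteria a compound violates."""
    violations = []
    if d.molecular_weight > 500:
        violations.append("molecular_weight")
    if d.log_p > 5:
        violations.append("log_p")
    if d.h_bond_donors > 5:
        violations.append("h_bond_donors")
    if d.h_bond_acceptors > 10:
        violations.append("h_bond_acceptors")
    return violations


def passes_lipinski(d, max_violations=0):
    """Assumption: a lead must satisfy all criteria unless max_violations is raised."""
    return len(lipinski_violations(d)) <= max_violations


# Hypothetical descriptor values, not taken from PubChem or from the paper.
small_lead = Descriptors(350.4, 2.1, 2, 5)
bulky_lead = Descriptors(720.9, 6.3, 6, 12)
results = [passes_lipinski(small_lead), passes_lipinski(bulky_lead)]
```

Some formulations of the rule tolerate one violation; passing `max_violations=1` reproduces that looser variant.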

Essentiality, Protein-Protein Interactions and Evolutionary Properties are Key Predictors for Identifying Cancer Genes Using Machine Learning

10.1101/2021.09.01.458494 ◽

2021 ◽

Author(s):

Amro Safadi ◽

Simon C Lovell ◽

Andrew James Doig

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Drug Targets ◽

Therapeutic Drug ◽

Cancer Genes ◽

Protein Protein Interactions ◽

Protein Coding ◽

Protein Protein Interaction ◽

Machine Learning Model ◽

Human Genes

The identification of genes that may be linked to cancer is of great importance for the discovery of new drug targets. The rate at which cancer genes are being found experimentally is slow, however, due to the complexity of the identification and confirmation process, which leaves a narrow range of therapeutic targets to investigate and develop. One solution to this problem is to use predictive analysis techniques that can accurately identify cancer gene candidates in a timely fashion. Furthermore, identifying the characteristics that are linked to cancer genes is crucial to furthering our understanding of this disease. These characteristics can be employed in recognising therapeutic drug targets. Here, we investigated whether certain properties of a gene can indicate the likelihood that it is involved in the initiation or progression of cancer. We found that essentiality scores tend to be higher for cancer genes than for protein-coding human genes overall. A machine-learning model was developed, and we found that essentiality-related properties and properties arising from protein-protein interaction networks or evolution are particularly effective in predicting cancer-associated genes. We were also able to identify potential drug targets that have not previously been linked with cancer but have the characteristics of cancer-related genes.
