Papyrus - A large scale curated dataset aimed at bioactivity predictions

Author(s):  
Olivier J. M. Béquignon ◽  
Brandon J. Bongers ◽  
Willem Jespers ◽  
Ad P. IJzerman ◽  
Bob van de Water ◽  
...  

With the recent rapid growth of publicly available ligand-protein bioactivity data, there is a trove of viable data that can be used to train machine learning algorithms. However, not all data is equal in terms of size and quality, and a significant portion of researchers' time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own. As an answer to that, we have constructed the Papyrus dataset (DOI: 10.4121/16896406), comprising around 60 million datapoints. This dataset combines multiple large publicly available datasets, such as ChEMBL and ExCAPE-DB, with several smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how the data can be filtered in a variety of ways, and also perform baseline quantitative structure-activity relationship (QSAR) analyses and proteochemometrics modeling. Our ambition is that this pruned data collection constitutes a benchmark set that can be used for constructing predictive models, while also providing a solid baseline for related research.
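As a rough illustration of the kind of filtering and baseline modelling described above, the sketch below loads a hypothetical tab-separated activity table, keeps high-quality measurements for a single target, and fits a fingerprint-based QSAR model. The file path, the column names (SMILES, accession, Quality, pchembl_value_Mean) and the target accession are illustrative assumptions, not the documented Papyrus schema.

```python
# Minimal sketch: filter an aggregated bioactivity table and fit a baseline
# QSAR model. File name and column names are illustrative assumptions only.
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load the activity table (path is a placeholder).
df = pd.read_csv("papyrus_activities.tsv", sep="\t", low_memory=False)

# Keep only high-quality measurements for a single protein target (accession
# chosen arbitrarily for illustration).
subset = df[(df["Quality"] == "High") & (df["accession"] == "P41595")]

def morgan_fingerprint(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """Radius-2 Morgan fingerprint as a dense bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(fp)

X = np.stack([morgan_fingerprint(s) for s in subset["SMILES"]])
y = subset["pchembl_value_Mean"].to_numpy()

# Baseline QSAR: random forest regression, 5-fold cross-validated R^2.
scores = cross_val_score(
    RandomForestRegressor(n_estimators=500, n_jobs=-1), X, y, cv=5, scoring="r2"
)
print(f"Mean cross-validated R^2: {scores.mean():.2f}")
```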

Author(s):  
Benjamin Stone ◽  
Erik Sapper

Biofilms are congregations of bacteria on a surface, and they become obstacles to the functioning of any device or machinery that involves anything biological. Biofilms develop through a biochemical system known as ‘Quorum Sensing’, the chemical signaling that directs either biofilm formation or inhibition. Computational models that relate chemical and structural features of compounds to their performance properties have been used to aid in the discovery of active small molecules for many decades. These quantitative structure-activity relationship (QSAR) models are also important for predicting the activity of molecules that can have a range of effectiveness in biological systems. This study uses QSAR methodologies combined with different machine learning algorithms to predict and assess the performance of several different compounds acting in Quorum Sensing. Through computational probing of the quorum sensing molecular interaction, new design rules can be elucidated for countering biofilms.
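The sketch below is a minimal, hypothetical version of such a QSAR workflow: a handful of toy molecules with made-up active/inactive labels, a few simple RDKit descriptors, and two off-the-shelf machine learning algorithms compared by cross-validation. The molecules, labels and descriptor choices are placeholders for illustration and are not the study's actual data or models.

```python
# Hypothetical QSAR sketch: toy molecules with made-up quorum-sensing labels,
# simple RDKit descriptors, and two ML algorithms compared by cross-validation.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy SMILES with placeholder active (1) / inactive (0) labels.
molecules = [
    ("c1ccccc1O", 1), ("O=C1CCCN1", 1), ("CC(=O)Nc1ccc(O)cc1", 1), ("Oc1ccc(Cl)cc1", 1),
    ("CCO", 0), ("CCCCCC", 0), ("CC(=O)OC", 0), ("c1ccncc1", 0),
]

def descriptor_vector(smiles: str) -> list[float]:
    """A few interpretable 2D descriptors for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol)]

X = np.array([descriptor_vector(s) for s, _ in molecules])
y = np.array([label for _, label in molecules])

# Compare two algorithms; cv is kept tiny because the toy set is tiny.
for model in (GradientBoostingClassifier(random_state=0), SVC(kernel="rbf")):
    scores = cross_val_score(model, X, y, cv=2)
    print(f"{type(model).__name__}: mean accuracy {scores.mean():.2f}")
```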


Cancers ◽  
2021 ◽  
Vol 13 (11) ◽  
pp. 2606
Author(s):  
Evi J. van Kempen ◽  
Max Post ◽  
Manoj Mannil ◽  
Benno Kusters ◽  
Mark ter Laan ◽  
...  

Treatment planning and prognosis in glioma treatment are based on the classification into low- and high-grade oligodendroglioma or astrocytoma, which is mainly based on molecular characteristics (IDH1/2- and 1p/19q codeletion status). It would be of great value if this classification could be made reliably before surgery, without biopsy. Machine learning algorithms (MLAs) could play a role in achieving this by enabling glioma characterization on magnetic resonance imaging (MRI) data without invasive tissue sampling. The aim of this study is to provide a performance evaluation and meta-analysis of various MLAs for glioma characterization. Systematic literature search and meta-analysis were performed on the aggregated data, after which subgroup analyses for several target conditions were conducted. This study is registered with PROSPERO, CRD42020191033. We identified 724 studies; 60 and 17 studies were eligible to be included in the systematic review and meta-analysis, respectively. Meta-analysis showed excellent accuracy for all subgroups, with the classification of 1p/19q codeletion status scoring significantly poorer than other subgroups (AUC: 0.748, p = 0.132). There was considerable heterogeneity among some of the included studies. Although promising results were found with regard to the ability of MLA-tools to be used for the non-invasive classification of gliomas, large-scale, prospective trials with external validation are warranted in the future.


2021 ◽  
Vol 28 (1) ◽  
pp. e100251
Author(s):  
Ian Scott ◽  
Stacey Carter ◽  
Enrico Coiera

Machine learning algorithms are being used to screen and diagnose disease, prognosticate and predict therapeutic responses. Hundreds of new algorithms are being developed, but whether they improve clinical decision making and patient outcomes remains uncertain. If clinicians are to use algorithms, they need to be reassured that key issues relating to their validity, utility, feasibility, safety and ethical use have been addressed. We propose a checklist of 10 questions that clinicians can ask of those advocating for the use of a particular algorithm, but which do not expect clinicians, as non-experts, to demonstrate mastery over what can be highly complex statistical and computational concepts. The questions are: (1) What is the purpose and context of the algorithm? (2) How good were the data used to train the algorithm? (3) Were there sufficient data to train the algorithm? (4) How well does the algorithm perform? (5) Is the algorithm transferable to new clinical settings? (6) Are the outputs of the algorithm clinically intelligible? (7) How will this algorithm fit into and complement current workflows? (8) Has use of the algorithm been shown to improve patient care and outcomes? (9) Could the algorithm cause patient harm? and (10) Does use of the algorithm raise ethical, legal or social concerns? We provide examples where an algorithm may raise concerns and apply the checklist to a recent review of diagnostic imaging applications. This checklist aims to assist clinicians in assessing algorithm readiness for routine care and to identify situations where further refinement and evaluation are required prior to large-scale use.


2020 ◽  
Vol 8 (Suppl 3) ◽  
pp. A62-A62
Author(s):  
Dattatreya Mellacheruvu ◽  
Rachel Pyke ◽  
Charles Abbott ◽  
Nick Phillips ◽  
Sejal Desai ◽  
...  

Background: Accurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended upon our work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.

Methods: In-house immunopeptidomic data was generated using stably transfected HLA-null K562 cell lines that express a single HLA allele of interest, followed by immunoprecipitation using the W6/32 antibody and LC-MS/MS. Public immunopeptidomics data was downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at a 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples utilizing the ImmunoID NeXT Platform.

Results: We have generated large-scale and high-quality immunopeptidomics data by using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary ‘binding’ algorithm models MHC-peptide binding using peptide and binding-pocket features, while our primary ‘presentation’ model uses additional features to model antigen processing and presentation. Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data was integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.

Conclusions: Improving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance compared to a state-of-the-art public algorithm and furthers this objective.
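For orientation only, the toy sketch below trains a simple classifier to separate observed 9-mer peptides from shuffled decoys using a one-hot encoding. This is not SHERPA: the proprietary features (peptide and binding-pocket representations, antigen-processing signals) and the composite architecture described above are not reproduced here, and the example peptides are well-known public epitopes used purely as stand-ins.

```python
# Illustrative toy only: one-hot peptide classifier separating observed 9-mers
# from shuffled decoys. NOT SHERPA; features and architecture differ entirely.
import numpy as np
from sklearn.linear_model import LogisticRegression

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(peptide: str, length: int = 9) -> np.ndarray:
    """Flattened one-hot encoding of a fixed-length peptide."""
    mat = np.zeros((length, len(AMINO_ACIDS)))
    for i, aa in enumerate(peptide[:length]):
        mat[i, AMINO_ACIDS.index(aa)] = 1.0
    return mat.ravel()

# Well-known 9-mer epitopes as stand-ins for eluted-ligand hits (label 1),
# with per-peptide shuffles acting as decoys (label 0).
hits = ["GILGFVFTL", "NLVPMVATV", "GLCTLVAML", "SLYNTVATL"]
rng = np.random.default_rng(0)
decoys = ["".join(rng.permutation(list(p))) for p in hits]

X = np.stack([one_hot(p) for p in hits + decoys])
y = np.array([1] * len(hits) + [0] * len(decoys))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1])[:, 1])  # presentation-like score for the first hit
```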


Author(s):  
Minna Silver ◽  
Fulvio Rinaudo ◽  
Emanuele Morezzi ◽  
Francesca Quenda ◽  
Maria Laura Moretti

CIPA is contributing its technical knowledge to saving the heritage of Syria by constructing an open-access database based on the data that CIPA members collected during various projects in Syria over the years before the civil war broke out in 2011. In this way we wish to support the protection and preservation of the environment, sites, monuments, and artefacts, and the memory of a cultural region that has been crucial to the human past and the emergence of civilizations. Apart from the countless human atrocities and loss of life, damage, destruction and looting of cultural heritage have taken place on a large scale. CIPA's initiative is one of various international projects that have been set up since the conflict started. The Directorate-General of Antiquities and Museums (DGAM) of Syria, as well as UNESCO with its various sub-organizations, have been central in facing the challenges during the war. Digital data capture, storage, use and dissemination are at the heart of CIPA's strategies for recording and documenting cultural heritage, including in Syria. It goes without saying that for conservation and restoration work, high-quality data providing metric information is of utmost importance.


2021 ◽  
Author(s):  
Jack Woollam ◽  
Jannes Münchmeyer ◽  
Carlo Giunchi ◽  
Dario Jozinovic ◽  
Tobias Diehl ◽  
...  

Machine learning methods have seen widespread adoption within the seismological community in recent years due to their ability to effectively process large amounts of data, while equalling or surpassing the performance of human analysts or classic algorithms. In the wider machine learning world, for example in imaging applications, the open availability of extensive high-quality datasets for training, validation, and the benchmarking of competing algorithms is seen as a vital ingredient in the rapid progress observed throughout the last decade. Within seismology, vast catalogues of labelled data are readily available, but collecting the waveform data for millions of records and assessing the quality of training examples is a time-consuming, tedious process. The natural variability in source processes and seismic wave propagation also presents a critical problem during training: the performance of models trained on different regions, distance ranges and magnitude ranges is not easily comparable. The inability to easily compare and contrast state-of-the-art machine learning-based detection techniques on varying seismic data sets is currently a barrier to further progress within this emerging field. We present SeisBench, an extensible open-source framework for training, benchmarking, and applying machine learning algorithms. SeisBench provides access to various benchmark data sets and models from the literature, along with pre-trained model weights, through a unified API. Built to be extensible and modular, SeisBench allows for the simple addition of new models and data sets, which can be easily interchanged with existing pre-trained models and benchmark data. Standardising access to data and metadata of varying quality simplifies comparison workflows, enabling the development of more robust machine learning algorithms. We initially focus on phase detection, identification and picking, but the framework is designed to be extended for other purposes, for example direct estimation of event parameters. Users will be able to contribute their own benchmarks and (trained) models. In the future, it will thus be much easier to compare both the performance of new algorithms against published machine learning models/architectures and to check the performance of established algorithms against new data sets. We hope that the ease of validation and inter-model comparison enabled by SeisBench will serve as a catalyst for the development of the next generation of machine learning techniques within the seismological community. The SeisBench source code will be published with an open license and explicitly encourages community involvement.
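A brief sketch of the unified-API workflow described above, using names from the publicly released seisbench package (seisbench.data and seisbench.models); the specific dataset class and pre-trained weight identifier shown are assumptions and may differ between versions.

```python
# Sketch of the unified API: benchmark data and pre-trained models through one
# interface. Identifiers (ETHZ dataset, PhaseNet weights named "ethz") are
# assumptions based on the released package and may differ.
import seisbench.data as sbd
import seisbench.models as sbm

# A benchmark data set with standardised metadata and labelled waveforms.
data = sbd.ETHZ()                    # downloaded and cached on first use
print(data.metadata.head())          # pandas table of trace metadata and labels
waveform = data.get_waveforms(0)     # numpy array for the first trace

# A published picking architecture with pre-trained weights, via the same API.
model = sbm.PhaseNet.from_pretrained("ethz")

# model.annotate(stream) returns characteristic functions for an obspy Stream of
# continuous data, and model.classify(stream) returns discrete phase picks.
```

Swapping in a different benchmark data set or model class changes only the corresponding line, which is the kind of comparison workflow the framework aims to simplify.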


Author(s):  
Manjunath Thimmasandra Narayanapppa ◽  
T. P. Puneeth Kumar ◽  
Ravindra S. Hegadi

Recent technological advancements have led to the generation of huge volumes of data from distinctive domains (scientific sensors, health care, user-generated data, financial companies, the internet and supply chain systems) over the past decade. To capture the meaning of this emerging trend, the term big data was coined. In addition to its huge volume, big data also exhibits several unique characteristics compared with traditional data. For instance, big data is generally unstructured and requires more real-time analysis. This development calls for new system platforms for data acquisition, storage, transmission and large-scale data processing. In recent years, the analytics industry's interest has been expanding towards big data analytics to uncover the potential concealed in big data, such as hidden patterns or unknown correlations. The main goal of this chapter is to explore the importance of machine learning algorithms and the computational environment, including hardware and software, that is required to perform analytics on big data.

