Unassisted Noise-Reduction of Chemical Reactions Data Sets

Author(s):  
Alessandra Toniato ◽  
Philippe Schwaller ◽  
Antonio Cardinale ◽  
Joppe Geluykens ◽  
Teodoro Laino

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (>90% for Natural Language Processing-based ones). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. While human curation is prohibitively expensive, unaided approaches to remove chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the Pistachio collection of chemical reactions and to an open data set, both extracted from USPTO (United States Patent and Trademark Office) patents. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets. For the retrosynthetic models, the round-trip accuracy metric grows by 13 percentage points and the cumulative Jensen-Shannon divergence decreases by 30% compared to its original value. Coverage remains high at 97%, and the class diversity is not affected by the cleaning. The proposed strategy is the first unassisted, rule-free technique to address automatic noise reduction in chemical data sets.
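
For context, the round-trip accuracy metric mentioned above checks whether a forward-prediction model maps the predicted precursors back to the original product. A minimal sketch, assuming hypothetical predict_precursors and predict_product model wrappers that consume and emit canonical SMILES strings:

```python
def round_trip_accuracy(products, predict_precursors, predict_product):
    """Fraction of products recovered after a backward -> forward round trip.

    predict_precursors and predict_product are hypothetical wrappers around
    the retrosynthesis and forward-prediction models; both are assumed to
    take and return canonical SMILES strings.
    """
    recovered = 0
    for product in products:
        precursors = predict_precursors(product)    # single-best retro suggestion
        if predict_product(precursors) == product:  # exact canonical-SMILES match
            recovered += 1
    return recovered / len(products)
```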

2020 ◽  
Vol 1 (1) ◽  
pp. 396-413 ◽  
Author(s):  
Kuansan Wang ◽  
Zhihong Shen ◽  
Chiyuan Huang ◽  
Chieh-Han Wu ◽  
Yuxiao Dong ◽  
...  

An ongoing project explores the extent to which artificial intelligence (AI), specifically in the areas of natural language processing and semantic reasoning, can be exploited to facilitate the study of science by deploying software agents equipped with natural language understanding capabilities to read scholarly publications on the web. The knowledge extracted by these AI agents is organized into a heterogeneous graph, called Microsoft Academic Graph (MAG), where the nodes and the edges represent the entities engaging in scholarly communications and the relationships among them, respectively. The frequently updated data set and a few software tools central to the underlying AI components are distributed under an open data license for research and commercial applications. This paper describes the design, schema, and technical and business motivations behind MAG and elaborates on how MAG can be used in analytics, search, and recommendation scenarios. The paper also discusses how AI plays an important role in avoiding various biases and human-induced errors found in other data sets, and how the technologies can be further improved in the future.
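
For illustration only, the heterogeneous-graph structure described above can be sketched as typed nodes and edges; the types and attributes below are assumptions for the sketch, not MAG's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    node_type: str                 # e.g. "paper", "author", "venue" (illustrative)
    attrs: dict = field(default_factory=dict)

@dataclass
class Edge:
    source: int
    target: int
    edge_type: str                 # e.g. "authored_by", "cites" (illustrative)

# A tiny heterogeneous graph in the spirit of MAG's entity/relationship model.
nodes = [Node(1, "paper", {"title": "An Example Study"}),
         Node(2, "author", {"name": "J. Doe"})]
edges = [Edge(1, 2, "authored_by")]
```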


2020 ◽  
Author(s):  
Philippe Schwaller ◽  
Alain C. Vaucher ◽  
Teodoro Laino ◽  
Jean-Louis Reymond

Chemical reactions describe how precursor molecules react together and transform into products. The reaction yield describes the percentage of the precursors successfully transformed into products relative to the theoretical maximum. The prediction of reaction yields can help chemists navigate reaction space and accelerate the design of more effective routes. Here, we investigate the best-studied high-throughput experiment data set and show how data augmentation on chemical reactions can improve the accuracy of yield predictions, even when only small data sets are available. Previous work represented the precursors with molecular fingerprints or with physics-based or categorical descriptors. In this manuscript, we fine-tune natural language processing-inspired reaction transformer models on different augmented data sets to predict yields solely using a text-based representation of chemical reactions. When the random training sets contain 2.5% or more of the data, our models outperform previous models, including those using physics-based descriptors as inputs. Moreover, we demonstrate the use of test-time augmentation to generate uncertainty estimates, which correlate with the prediction errors.
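
As a sketch of the test-time augmentation idea: predict the yield for several randomized SMILES of the same reaction and use the spread of the predictions as an uncertainty proxy. The predict_yield wrapper and the 'precursors>>product' input format are assumptions for illustration; RDKit's doRandom option is used to randomize the SMILES:

```python
import statistics
from rdkit import Chem  # assumes RDKit is available

def randomized_smiles(smiles, n=8):
    """Generate n randomized (non-canonical) SMILES strings for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return [Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)]

def yield_with_uncertainty(reaction_smiles, predict_yield, n=8):
    """Test-time augmentation: mean prediction, with the spread over augmented
    inputs serving as an uncertainty proxy.

    predict_yield is a hypothetical wrapper around the trained reaction
    transformer; reaction_smiles is assumed to be 'precursors>>product'.
    """
    precursors, product = reaction_smiles.split(">>")
    variants = [f"{p}>>{product}" for p in randomized_smiles(precursors, n)]
    preds = [predict_yield(v) for v in variants]
    return statistics.mean(preds), statistics.stdev(preds)
```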


2019 ◽  
Vol 25 (1) ◽  
pp. 2-24 ◽  
Author(s):  
Hanna Lo ◽  
Alireza Ghasemi ◽  
Claver Diallo ◽  
John Newhook

Purpose
Condition-based maintenance (CBM) has become a central maintenance approach because it performs more efficient diagnoses and prognoses based on equipment health condition compared to time-based methods. CBM models greatly inform maintenance decisions. This research examines three CBM fault prognostics models: logical analysis of data (LAD), artificial neural networks (ANNs) and proportional hazards models (PHM). A methodology, which involves data pre-processing, formulating the models and analyzing model outputs, is developed to apply and compare these models. The methodology is applied to NASA's Turbofan Engine Degradation data set and the structural health monitoring (SHM) data set from a Nova Scotia bridge. Results are evaluated using three metrics: error, half-life error and a cost score. This paper concludes that the LAD and feedforward ANN models compare favorably to the PHM model. However, the feedback ANN does not compare favorably, and its predictions show much larger variance than the predictions from the other three methods. Based on these conclusions, the purpose of this paper is to provide recommendations on the appropriate situations in which to apply these three prognostics models.

Design/methodology/approach
LAD, ANN and PHM methods are adopted to perform prognostics and to calculate the mean residual life (MRL) of equipment using NASA's Turbofan Engine Degradation data set and the SHM data set from a Nova Scotia bridge. Statistical testing was used to evaluate the statistical differences between the approaches based on these metrics. By considering the differences in these metrics between the models, it was possible to draw conclusions about how the models perform in specific cases.

Findings
Results were evaluated using three metrics: error, half-life error and a cost score. It was concluded that the LAD and feedforward ANN models compare favorably to the PHM model. However, the feedback ANN does not compare favorably, and its predictions show much larger variance than the predictions from the other three methods. Overall, the models predict failure after it has already occurred (negative error) when the residual life is large, and vice versa.

Practical implications
A suitable CBM prognostics model for practical use can be chosen based on three main considerations: accuracy, run time and data type. When accuracy is a main concern, as in cases where the impacts of failure are large, LAD and the feedforward neural network are preferred. The preference changes when run time is considered: if data can be collected easily and the model is updated often, the ANNs and LAD are preferred. On the other hand, if condition monitoring (CM) data are not easily obtainable and existing data are not representative of the population's behavior, data type comes into play; in this case, PHM is preferred.

Originality/value
Previous research performed reviews of multiple independent studies on CBM techniques applied to different data sets. These reviews concluded that artificial intelligence models are typically harder to implement, because of difficulties in data procurement, but offer improved performance compared to more traditional model-based and statistical approaches. In this research, the authors further investigate and compare the performance and results of two major artificial intelligence models, namely ANNs and LAD, and one pioneering statistical model, PHM, over the same two real-life prognostics data sets. Such an in-depth comparison and review of major CBM techniques was missing from the current literature in the CBM field.
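
To make the evaluation metrics concrete, here is a minimal sketch of the signed error (following the sign convention in the findings, where a negative error means failure is predicted after it has already occurred) and of an asymmetric cost score. The exact half-life error and cost-score definitions used in the paper may differ, and the weights below are illustrative assumptions:

```python
def prediction_errors(predicted_mrl, actual_rl):
    """Signed errors with the abstract's convention: a negative error means
    failure is predicted after it has already occurred (predicted MRL too
    large); a positive error means failure is predicted too early."""
    return [a - p for p, a in zip(predicted_mrl, actual_rl)]

def cost_score(predicted_mrl, actual_rl, late_weight=10.0, early_weight=1.0):
    """Asymmetric cost score: late predictions (missed failures) are weighted
    more heavily than early ones. The weights are illustrative assumptions,
    not the paper's calibrated values."""
    total = 0.0
    for err in prediction_errors(predicted_mrl, actual_rl):
        total += late_weight * -err if err < 0 else early_weight * err
    return total / len(actual_rl)
```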


2020 ◽  
Author(s):  
Alessandra Toniato ◽  
Philippe Schwaller ◽  
Antonio Cardinale ◽  
Joppe Geluykens ◽  
Teodoro Laino

Existing deep learning models applied to reaction prediction in organic chemistry are able to reach extremely high levels of accuracy (>90% for NLP-based ones). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. While human curation is prohibitively expensive, unaided approaches to remove chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries (noise) from chemical reaction collections. Results show that models trained on cleaned and balanced data sets improve the quality of the predictions without a decrease in performance. For the retrosynthetic models, the round-trip accuracy is enhanced by 13% and the cumulative Jensen-Shannon metric is lowered to 70% of its original value, while maintaining high coverage (97%) and constant class diversity (1.6) at inference.
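
For reference, the cumulative Jensen-Shannon metric builds on the Jensen-Shannon divergence between predicted and reference class distributions. A minimal sketch of the divergence itself (how the cumulative variant aggregates per-class values is left out here as an assumption):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and, with log base 2, in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Example: ground-truth vs predicted class distributions.
truth = [0.5, 0.3, 0.2]
preds = [0.4, 0.4, 0.2]
print(js_divergence(truth, preds))
```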


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to use it to build prediction models (in the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method which requires only a single parameter to be specified, yet is shown to be as effective as SCM. A method to find suitable values for the parameter is also proposed. Due to having only a single parameter, using the proposed clustering method is shown to be orders of magnitude more efficient than using SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.
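
For orientation, here is a compact sketch of Chiu-style subtractive clustering, the SCM baseline the study compares against; note the several parameters (ra, rb_factor, eps) that must be specified, which is exactly the burden the proposed single-parameter method removes. This is an illustrative simplification, not the paper's code:

```python
import numpy as np

def subtractive_clustering(X, ra=0.5, rb_factor=1.5, eps=0.15):
    """Chiu-style subtractive clustering (illustrative sketch).

    X: (n, d) array of points, assumed scaled to comparable ranges.
    Returns cluster-center points selected from X.
    """
    rb = rb_factor * ra
    # Potential of each point: a density estimate over all points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    potential = np.exp(-4.0 * d2 / ra**2).sum(axis=1)
    first_peak = potential.max()
    centers = []
    while True:
        k = int(potential.argmax())
        if potential[k] < eps * first_peak:
            break  # remaining points are too weak to seed a new cluster
        centers.append(X[k])
        # Suppress potential near the newly selected center.
        potential -= potential[k] * np.exp(-4.0 * d2[k] / rb**2)
    return np.array(centers)
```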


2017 ◽  
Vol 44 (2) ◽  
pp. 203-229 ◽  
Author(s):  
Javier D Fernández ◽  
Miguel A Martínez-Prieto ◽  
Pablo de la Fuente Redondo ◽  
Claudio Gutiérrez

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
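
As a flavor of such structural metrics, simple quantities like subject out-degree and predicate reuse can be computed directly over a set of triples; the metrics below are illustrative stand-ins, not the exact metrics proposed in the paper:

```python
from collections import Counter

# Toy RDF data as (subject, predicate, object) triples.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:name", '"Bob"'),
]

# Subject out-degree: how many triples each subject appears in.
out_degree = Counter(s for s, _, _ in triples)

# Predicate reuse: a rough proxy for structural redundancy, since data sets
# dominated by few predicates tend to compress and index well.
predicate_counts = Counter(p for _, p, _ in triples)
distinct_per_triple = len(predicate_counts) / len(triples)

print(out_degree, predicate_counts, distinct_per_triple)
```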


Author(s):  
Liah Shonhe

The main focus of the study was to explore the practices of open data sharing in the agricultural sector, including establishing the research outputs concerning open data in agriculture. The study adopted a desktop research methodology based on a literature review and bibliographic data from the Web of Science (WoS) database. Bibliometric indicators discussed include yearly productivity, most prolific authors, and most productive countries. Study findings revealed that research activity in the field of agriculture and open access is very low. There were 36 OA articles, and only 6 publications had an open data badge. Most researchers do not yet embrace the need to openly publish their data sets despite the availability of numerous open data repositories. Unfortunately, most African countries are still lagging behind in the management of agricultural open data. The study therefore recommends that researchers publish their research data sets as OA. African countries need to put more effort into establishing open data repositories and implementing the necessary policies to facilitate OA.


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 879 ◽  
Author(s):  
Uwe Köckemann ◽  
Marjan Alirezaie ◽  
Jennifer Renoux ◽  
Nicolas Tsiftes ◽  
Mobyen Uddin Ahmed ◽  
...  

As research in smart homes and activity recognition increases, it is of ever-increasing importance to have benchmark systems and data upon which researchers can compare methods. While synthetic data can be useful for certain method developments, real data sets that are open and shared are equally important. This paper presents the E-care@home system, its installation in a real home setting, and a series of data sets that were collected using the E-care@home system. Our first contribution, the E-care@home system, is a collection of software modules for data collection, labeling, and various reasoning tasks such as activity recognition, person counting, and configuration planning. It supports a heterogeneous set of sensors that can be extended easily and connects collected sensor data to higher-level Artificial Intelligence (AI) reasoning modules. Our second contribution is a series of open data sets which can be used to recognize activities of daily living. In addition to these data sets, we describe the technical infrastructure that we have developed to collect the data and the physical environment. Each data set is annotated with ground-truth information, making it relevant for researchers interested in benchmarking different algorithms for activity recognition.
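
As a minimal illustration of what ground-truth-annotated sensor data of this kind can look like, here is a sketch of a labeled reading record and a fixed-width windowing step commonly used before extracting features for activity recognition; the field names are assumptions, not the E-care@home schema:

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    timestamp: float    # seconds since epoch
    sensor_id: str      # e.g. "kitchen_pir_1" (illustrative, not the real schema)
    value: float
    activity: str       # ground-truth label, e.g. "cooking"

def windows(readings, width=30.0):
    """Group readings into fixed-width time windows, a common preprocessing
    step before computing per-window features for activity recognition."""
    readings = sorted(readings, key=lambda r: r.timestamp)
    if not readings:
        return
    start = readings[0].timestamp
    bucket = []
    for r in readings:
        if r.timestamp - start >= width:
            yield bucket
            bucket, start = [], r.timestamp
        bucket.append(r)
    if bucket:
        yield bucket
```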

