Analytical challenges of untargeted GC-MS-based metabolomics and the critical issues in selecting the data processing strategy

F1000Research ◽  
2017 ◽  
Vol 6 ◽  
pp. 967 ◽  
Author(s):  
Ting-Li Han ◽  
Yang Yang ◽  
Hua Zhang ◽  
Kai P. Law

Background: A challenge of metabolomics is processing the enormous amount of information generated by sophisticated analytical techniques. The raw data of an untargeted metabolomic experiment are confounded by unwanted biological and technical variations that obscure the biological variations of interest. The art of data normalisation to offset these variations and/or eliminate experimental or biological biases has made significant progress recently. However, published comparative studies are often biased or have omissions. Methods: We investigated these issues with our own data set, using five representative methods drawn from internal standard-based, model-based, and pooled quality control-based approaches, and examined the performance of these methods against each other in an epidemiological study of gestational diabetes using plasma. Results: Our results demonstrated that the quality control-based approaches gave the highest data precision of all methods tested and would be the method of choice under controlled experimental conditions. For our epidemiological study, however, the model-based approaches classified the clinical groups more effectively than the quality control-based approaches because of their ability to minimise not only technical variations but also biological biases in the raw data. Conclusions: We suggest that metabolomic researchers should optimise and justify the method they have chosen for their experimental conditions in order to obtain an optimal biological outcome.
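To make the contrast between the approaches concrete, the sketch below implements two of the simplest representatives in Python: scaling each feature by its median in the pooled quality-control injections, and dividing by a spiked internal-standard feature. The data, the QC injection schedule, and the scaling rules are illustrative assumptions, not the study's actual pipeline.

```python
import numpy as np

# Toy feature matrix: rows = injections, columns = metabolite features.
rng = np.random.default_rng(0)
intensities = rng.lognormal(mean=10.0, sigma=1.0, size=(20, 5))

# Assumption: every fifth injection is a pooled quality-control (QC) sample.
is_qc = np.zeros(20, dtype=bool)
is_qc[::5] = True

def qc_median_normalise(x, qc_mask):
    """Pooled-QC-based scaling: divide every feature by its median intensity
    across the QC injections, damping feature-wise technical variation."""
    return x / np.median(x[qc_mask], axis=0)

def internal_standard_normalise(x, is_column):
    """Internal-standard scaling: divide each injection by the intensity of a
    designated spiked-in standard feature (here a column index)."""
    return x / x[:, [is_column]]

norm_qc = qc_median_normalise(intensities, is_qc)
norm_is = internal_standard_normalise(intensities, is_column=0)

# A common precision read-out: relative standard deviation of each feature
# across the QC injections (lower values indicate higher technical precision).
rsd_qc = norm_qc[is_qc].std(axis=0) / norm_qc[is_qc].mean(axis=0)
print("per-feature RSD across QC injections:", np.round(rsd_qc, 3))
```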

Author(s):  
Katherine Anderson Aur ◽  
Jessica Bobeck ◽  
Anthony Alberti ◽  
Phillip Kay

Abstract Supplementing an existing high-quality seismic monitoring network with openly available station data could improve coverage and decrease magnitudes of completeness; however, this can present challenges when varying levels of data quality exist. Without discerning the quality of openly available data, using it poses significant data management, analysis, and interpretation issues. Incorporating additional stations without properly identifying and mitigating data quality problems can degrade overall monitoring capability. If openly available stations are to be used routinely, a robust, automated data quality assessment for a wide range of quality control (QC) issues is essential. To meet this need, we developed Pycheron, a Python-based library for QC of seismic waveform data. Pycheron was initially based on the Incorporated Research Institutions for Seismology’s Modular Utility for STAtistical kNowledge Gathering but has been expanded to include more functionality. Pycheron can be implemented at the beginning of a data processing pipeline or can process stand-alone data sets. Its objectives are to (1) identify specific QC issues; (2) automatically assess data quality and instrumentation health; (3) serve as a basic service that all data processing builds on by alerting downstream processing algorithms to any quality degradation; and (4) improve our ability to process orders of magnitude more data through performance optimizations. This article provides an overview of Pycheron, its features, basic workflow, and an example application using a synthetic QC data set.
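The sketch below is not Pycheron's API; it only illustrates, under assumed thresholds and using ObsPy (assumed installed) for waveform handling, the kind of automated checks (gap detection, dead-channel flagging) that such a QC library performs.

```python
import numpy as np
from obspy import read  # ObsPy is an assumption here, not a Pycheron dependency claim

def basic_waveform_qc(stream, dead_channel_std=1e-9):
    """Toy QC pass: report gaps/overlaps and flat (dead) channels.
    Thresholds and issue labels are illustrative, not Pycheron's."""
    issues = []
    # get_gaps() lists gaps and overlaps between traces of the same channel.
    for gap in stream.get_gaps():
        issues.append(("gap/overlap", gap[0], gap[1], gap[2], gap[3]))
    for trace in stream:
        # A near-zero standard deviation suggests a flat-lined (dead) channel.
        if np.std(trace.data) < dead_channel_std:
            issues.append(("dead channel", trace.id))
    return issues

# ObsPy ships a small example stream; a real pipeline would read MiniSEED files.
example = read()
for issue in basic_waveform_qc(example):
    print(issue)
```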


2010 ◽  
Vol 16 (1) ◽  
pp. 1-14 ◽  
Author(s):  
Tong Ying Shun ◽  
John S. Lazo ◽  
Elizabeth R. Sharlow ◽  
Paul A. Johnston

High-throughput screening (HTS) has achieved a dominant role in drug discovery over the past 2 decades. The goal of HTS is to identify active compounds (hits) by screening large numbers of diverse chemical compounds against selected targets and/or cellular phenotypes. The HTS process consists of multiple automated steps involving compound handling, liquid transfers, and assay signal capture, all of which unavoidably contribute to systematic variation in the screening data. The challenge is to distinguish biologically active compounds from assay variability. Traditional plate controls-based and non-controls-based statistical methods have been widely used for HTS data processing and active identification by both the pharmaceutical and academic sectors. More recently, improved robust statistical methods have been introduced, reducing the impact of systematic row/column effects in HTS data. To apply such robust methods effectively and properly, we need to understand their necessity and functionality. Data from 6 HTS case histories are presented to illustrate that robust statistical methods may sometimes be misleading and can result in more, rather than fewer, false positives or false negatives. In practice, no single method is the best hit detection method for every HTS data set. However, to aid the selection of the most appropriate HTS data-processing and active identification methods, the authors developed a 3-step statistical decision methodology. Step 1 is to determine the most appropriate HTS data-processing method and establish criteria for quality control review and active identification from 3-day assay signal window and DMSO validation tests. Step 2 is to perform a multilevel statistical and graphical review of the screening data to exclude data that fall outside the quality control criteria. Step 3 is to apply the established active criterion to the quality-assured data to identify the active compounds.
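As a simplified illustration of the two families of methods discussed, the Python sketch below normalises a toy plate with DMSO and positive-control wells (controls-based percent inhibition) and then applies a robust hit criterion (median plus three MADs). The plate layout, signal values, and threshold are assumptions, not the authors' decision methodology.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy plate signal: DMSO (negative) and positive-control wells plus test wells,
# a few of which are genuinely active (the layout is purely illustrative).
neg = rng.normal(100.0, 5.0, 32)       # DMSO control wells
pos = rng.normal(10.0, 3.0, 32)        # positive-control wells
test = rng.normal(100.0, 6.0, 320)
test[:5] = rng.normal(30.0, 5.0, 5)    # spiked-in actives

def percent_inhibition(signal, neg_mean, pos_mean):
    """Plate-controls-based normalisation to a 0-100% inhibition scale."""
    return 100.0 * (neg_mean - signal) / (neg_mean - pos_mean)

inhib = percent_inhibition(test, neg.mean(), pos.mean())

# Robust, non-controls-based hit criterion: median + 3 * MAD over test wells.
mad = 1.4826 * np.median(np.abs(inhib - np.median(inhib)))
hits = np.where(inhib > np.median(inhib) + 3 * mad)[0]
print(f"{len(hits)} wells flagged as active")
```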


2018 ◽  
Author(s):  
Jelena Telenius ◽  
Jim R. Hughes ◽  

ABSTRACT With the decreasing cost of next-generation sequencing (NGS), we are observing a rapid rise in the volume of ‘big data’ in the academic research, healthcare and drug discovery sectors. The present bottleneck for extracting value from these ‘big data’ sets is data processing and analysis. Despite this, there is still a lack of reliable, automated and easy-to-use tools that allow experimentalists to assess the quality of sequenced libraries and explore the data first hand, without requiring substantial time from computational core analysts in the early stages of analysis. NGseqBasic is an easy-to-use, single-command analysis tool for chromatin accessibility (ATAC, DNaseI) and ChIP sequencing data, which also supports newer techniques such as low-cell-number sequencing and Cut-and-Run. It takes in fastq, fastq.gz or bam files; carries out all quality control, trimming and mapping steps along with quality control and data processing statistics; and combines all of this into a single-click loadable UCSC data hub, with an integral statistics html page providing detailed reports from the analysis tools and quality control metrics. The tool is easy to set up, and no installation is needed. A wide variety of parameters is provided to fine-tune the analysis, with optional settings to generate DNase footprint or high-resolution ChIP-seq tracks. A tester script is provided to help with the setup, along with a test data set and downloadable example user cases. NGseqBasic has been used in the routine analysis of next-generation sequencing (NGS) data in high-impact publications 1,2. The code is actively developed and accompanied by Git version control and a GitHub code repository. Here we demonstrate NGseqBasic analysis and features using DNaseI-seq data from GSM689849 and CTCF ChIP-seq data from GSM2579421, as well as a Cut-and-Run CTCF data set, GSM2433142, and provide the one-click loadable UCSC data hubs generated by the tool, allowing ready exploration of the run results and quality control files generated by the tool. Availability: Download, setup and help instructions are available on the NGseqBasic web site http://userweb.molbiol.ox.ac.uk/public/telenius/NGseqBasicManual/external/ Bioconda users can load the tool as library “ngseqbasic”. The source code with Git version control is available at https://github.com/Hughes-Genome-Group/NGseqBasic/ Contact: [email protected]


2014 ◽  
Author(s):  
Marc Bernau ◽  
Heidrun Weber ◽  
Michael Schmitt ◽  
Sven Bartha ◽  
Roland Gessner

In the research field of atmospheric chemistry, a central question for acquired data sets concerns validation. Have the data been validated to be useful for science? Has the data set been compared with other data sets? If deviations occur, which cause can be identified? Ultimately, two causes are possible when the same scene is observed: either the acquired raw data set is erroneous (a hardware problem) or the data processing introduces erroneous information (a software problem). To make sure that the software works as expected, software validation plays a key role in the overall data set validation campaigns. This paper deals with operational software validation, which is an important component of the entire scientific validation chain. [...]
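A minimal sketch of one operational software-validation check of the kind described: regressing the current processor output against a previously validated reference product within an agreed tolerance. The processing function, data and tolerance below are hypothetical placeholders, not the system described in the paper.

```python
import numpy as np

def process_level1_to_level2(raw):
    """Placeholder for the operational processor under validation
    (here simply a hypothetical calibration: scale and offset)."""
    return 0.98 * raw + 0.1

def validate_against_reference(raw, reference, rel_tol=1e-3):
    """Software-validation check: the current processor output must reproduce
    the previously validated reference product within a relative tolerance."""
    produced = process_level1_to_level2(raw)
    rel_diff = np.abs(produced - reference) / np.maximum(np.abs(reference), 1e-12)
    return bool(np.all(rel_diff <= rel_tol)), float(rel_diff.max())

# Illustrative data standing in for an archived raw scene and its validated product.
raw_scene = np.linspace(1.0, 10.0, 100)
reference_product = 0.98 * raw_scene + 0.1

ok, worst = validate_against_reference(raw_scene, reference_product)
print("validation passed:", ok, "max relative deviation:", worst)
```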


2020 ◽  
Vol 17 (5) ◽  
pp. 382-388
Author(s):  
Aparna Wadhwa ◽  
Faraat Ali ◽  
Sana Parveen ◽  
Robin Kumar ◽  
Gyanendra N. Singh

Objective: The main aim of the present work was to synthesize chloramphenicol impurity A (CLRM-IMP-A) in the purest form and to characterize it using a panel of sophisticated analytical techniques (LC-MS, DSC, TGA, NMR, FTIR, HPLC, and CHNS), so as to provide the reference standard specified in most international compendia, including IP, BP, USP, and EP. The present synthetic procedure has not been disclosed anywhere in the prior art. Methods: A simple, inexpensive, and new synthesis method is described for the preparation of CLRM-IMP-A. The impurity was synthesized and characterized by FTIR, DSC, TGA, NMR (1H and 13C), LC-MS, CHNS, and HPLC. Results: Because CLRM-IMP-A present in drugs and dosage forms can considerably alter the therapeutic effect and adverse reactions of a drug, a precise method for the estimation of impurities is mandatory to safeguard public health. Under these circumstances, the presence of CLRM-IMP-A in chloramphenicol (CLRM) requires strict quality control to satisfy the specified regulatory limit. The synthetic impurity obtained was in pure form, suitable for providing a certified reference standard or working standard of defined potency to stakeholders. Conclusion: The present research describes a novel technique for the synthesis of a pharmacopoeial impurity, which can help in checking/controlling the quality of CLRM on international markets.


1982 ◽  
Vol 47 (7) ◽  
pp. 1973-1978 ◽  
Author(s):  
Jiří Karhan ◽  
Zbyněk Ksandr ◽  
Jiřina Vlková ◽  
Věra Špatná

The determination of alcohols by 19F NMR spectroscopy making use of their reaction with hexafluoroacetone giving rise to hemiacetals was studied on butanols. The calibration curve method and the internal standard method were used and the results were mutually compared. The effects of some experimental conditions, viz. the sample preparation procedure, concentration, spectrometer setting, and electronic integration, were investigated; the conditions, particularly the concentrations, proved to have a statistically significant effect on the results of determination. For the internal standard method, the standard deviation was 0.061 in the concentration region 0.032–0.74 mol l⁻¹. The method was applied to a determination of alcohols in the distillation residue from an oxo synthesis.
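The internal-standard calculation behind such a determination can be written compactly: the analyte concentration follows from the ratio of integrated 19F signal areas weighted by the number of contributing fluorine nuclei. The numbers in the sketch below are invented for illustration and do not reproduce the reported results.

```python
# Internal-standard quantification from integrated 19F NMR signals:
#   c_analyte = c_IS * (A_analyte / A_IS) * (n_IS / n_analyte)
# where A are integrated areas and n the number of equivalent 19F nuclei
# behind each signal. All numerical values below are invented.

def internal_standard_concentration(area_analyte, area_is, conc_is,
                                     n_f_analyte, n_f_is):
    """Concentration of the analyte from the area ratio against the standard."""
    return conc_is * (area_analyte / area_is) * (n_f_is / n_f_analyte)

# Example: a butanol-hexafluoroacetone hemiacetal signal (6 F) measured against
# a hypothetical internal standard contributing 3 F, spiked at 0.20 mol/l.
c = internal_standard_concentration(area_analyte=1.85, area_is=1.00,
                                    conc_is=0.20, n_f_analyte=6, n_f_is=3)
print(f"alcohol concentration ≈ {c:.3f} mol/l")
```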


2021 ◽  
Author(s):  
Junjie Shi ◽  
Jiang Bian ◽  
Jakob Richter ◽  
Kuan-Hsun Chen ◽  
Jörg Rahnenführer ◽  
...  

Abstract The predictive performance of a machine learning model depends strongly on the corresponding hyper-parameter setting, so hyper-parameter tuning is often indispensable. Normally, such tuning requires the dedicated machine learning model to be trained and evaluated on centralized data to obtain a performance estimate. However, in a distributed machine learning scenario, it is not always possible to collect all the data from all nodes due to privacy concerns or storage limitations. Moreover, if data have to be transferred through low-bandwidth connections, the time available for tuning is reduced. Model-Based Optimization (MBO) is one state-of-the-art method for tuning hyper-parameters, but its application to distributed machine learning models or federated learning has received little research attention. This work proposes a framework, MODES, that allows MBO to be deployed on resource-constrained distributed embedded systems. Each node trains an individual model based on its local data, and the goal is to optimize the combined prediction accuracy. The presented framework offers two optimization modes: (1) MODES-B considers the whole ensemble as a single black box and optimizes the hyper-parameters of each individual model jointly, and (2) MODES-I considers all models as clones of the same black box, which allows the optimization to be efficiently parallelized in a distributed setting. We evaluate MODES by conducting experiments on the optimization of the hyper-parameters of a random forest and a multi-layer perceptron. The experimental results demonstrate that, with improvements in mean accuracy (MODES-B), run-time efficiency (MODES-I), and statistical stability for both modes, MODES outperforms the baseline, i.e., tuning with MBO on each node individually using its local sub-data set.
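For readers unfamiliar with MBO, the sketch below shows a generic model-based optimization loop over a single hyper-parameter, using a Gaussian-process surrogate and an expected-improvement acquisition; it illustrates the underlying technique only and is not the MODES framework or its evaluation setup.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    """Stand-in for the (expensive) accuracy estimate of a model trained with
    hyper-parameter x; in a distributed setting this would come from a node."""
    return np.sin(3.0 * x) + 0.1 * x

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 5.0, 200).reshape(-1, 1)
X = rng.uniform(0.0, 5.0, 4).reshape(-1, 1)               # initial design points
y = np.array([objective(v) for v in X[:, 0]])

for _ in range(10):                                        # sequential MBO iterations
    surrogate = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sigma = surrogate.predict(candidates, return_std=True)
    improvement = mu - y.max()
    # Expected-improvement acquisition (guarding against zero predictive variance).
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(sigma > 0, improvement / sigma, 0.0)
        ei = np.where(sigma > 0,
                      improvement * norm.cdf(z) + sigma * norm.pdf(z), 0.0)
    x_next = candidates[int(np.argmax(ei)), 0]
    X = np.vstack([X, [[x_next]]])
    y = np.append(y, objective(x_next))

print("best hyper-parameter:", X[int(np.argmax(y)), 0], "score:", y.max())
```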


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Sven Lißner ◽  
Stefan Huber

Abstract Background GPS-based cycling data are increasingly available for traffic planning these days. However, the recorded data often contain more information than bicycle trips alone: tracks recorded while using modes of transport other than the bicycle, or long periods at the workplace while tracking is still running, are only some examples. Thus, collected bicycle GPS data need to be processed adequately before they can be used for transportation planning. Results The article presents a multi-level approach to bicycle-specific data processing. The data processing model contains several processing steps (data filtering, smoothing, trip segmentation, transport mode recognition, driving mode detection) to finally obtain a correct data set that contains only bicycle trips. The validation reveals a sound accuracy of the model at its current state (82–88%).
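A minimal sketch of two of the listed steps, smoothing and transport mode recognition, on a toy speed trace; the column names, the rolling-median filter, and the 40 km/h speed heuristic are assumptions for illustration, not the published model.

```python
import numpy as np
import pandas as pd

# Toy GPS track: timestamps (s) and speeds (km/h) with one noisy spike and a
# sustained fast segment that a bicycle is unlikely to produce.
track = pd.DataFrame({
    "t": np.arange(0, 300, 10),
    "speed_kmh": [18, 19, 75, 20, 21, 22, 20, 19, 18, 17,
                  55, 58, 60, 57, 59, 61, 58, 60, 59, 57,
                  16, 15, 17, 18, 16, 15, 14, 16, 17, 18],
})

# Step "smoothing": a rolling median removes single-point GPS speed spikes.
track["speed_smooth"] = (track["speed_kmh"]
                         .rolling(3, center=True, min_periods=1)
                         .median())

# Step "transport mode recognition": a crude speed heuristic; sustained speeds
# above ~40 km/h are treated as non-bicycle segments (e.g. car or train).
track["is_bicycle"] = track["speed_smooth"] < 40

bike_share = track["is_bicycle"].mean()
print(f"{bike_share:.0%} of points classified as bicycle travel")
```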


2015 ◽  
Vol 14 (12) ◽  
pp. 5088-5098 ◽  
Author(s):  
Bas C. Jansen ◽  
Karli R. Reiding ◽  
Albert Bondt ◽  
Agnes L. Hipgrave Ederveen ◽  
Magnus Palmblad ◽  
...  
