Correcting for experiment-specific variability in expression compendia can remove underlying signals

GigaScience ◽  
2020 ◽  
Vol 9 (11) ◽  
Author(s):  
Alexandra J Lee ◽  
YoSon Park ◽  
Georgia Doing ◽  
Deborah A Hogan ◽  
Casey S Greene

Abstract

Motivation: In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined.

Objective: We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments.

Method: We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability.

Results: The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal.

Conclusion: When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.
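To make the setup concrete, the following is a minimal Python/NumPy sketch of the kind of experiment the abstract describes: a compendium with a known two-group signal, additive experiment-specific offsets standing in for batch effects, and a simple per-experiment centering correction. All data are synthetic, and the additive toy model is an assumption of this sketch, not the authors' neural-network simulator.

```python
# Toy model: simulate a compendium with an underlying two-group signal,
# add per-experiment offsets ("batch effects"), and correct by centering
# each experiment. Not the paper's generative neural-network simulator.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_per_group, n_experiments = 200, 10, 5

# Baseline compendium: two biological groups differing in the first 20 genes.
signal = np.zeros(n_genes)
signal[:20] = 2.0
group_a = rng.normal(0, 1, (n_per_group * n_experiments, n_genes))
group_b = rng.normal(0, 1, (n_per_group * n_experiments, n_genes)) + signal
X = np.vstack([group_a, group_b])                      # samples x genes
labels = np.array([0] * len(group_a) + [1] * len(group_b))

# Assign samples to experiments and add an experiment-specific shift.
experiments = rng.integers(0, n_experiments, size=len(X))
offsets = rng.normal(0, 3, (n_experiments, n_genes))   # undesired variability
X_noisy = X + offsets[experiments]

# Simple correction: center each experiment at zero per gene
# (a crude stand-in for methods such as limma's removeBatchEffect).
X_corrected = X_noisy.copy()
for e in range(n_experiments):
    mask = experiments == e
    X_corrected[mask] -= X_corrected[mask].mean(axis=0)

# How strongly does the group difference stand out against per-gene spread?
for name, data in [("noisy", X_noisy), ("corrected", X_corrected)]:
    d = data[labels == 1, :20].mean(axis=0) - data[labels == 0, :20].mean(axis=0)
    spread = data[:, :20].std(axis=0)
    print(f"{name}: mean signal-to-spread ratio = {(d / spread).mean():.2f}")
```

In this small-scale setting the offsets inflate per-gene spread and mask the group signal, and centering recovers it; the paper's central finding is that with many sources of variability the uncorrected signal reemerges and correction can instead reduce power.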

2020 ◽  
Author(s):  
Alexandra J. Lee ◽  
YoSon Park ◽  
Georgia Doing ◽  
Deborah A. Hogan ◽  
Casey S. Greene

Abstract

Motivation: In the last two decades, scientists working in different labs have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, sometimes referred to as batch effects, may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to interpret and extract true underlying biological patterns. As more multi-experiment, integrative analysis methods are developed and available data collections increase in size, it is crucial to determine how technical variability affects our ability to detect desired patterns when many experiments are combined.

Objective: We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments.

Method: We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability.

Results: We found that the signal from a baseline compendium was obscured when the number of added sources of variability was small. Perhaps as expected, applying statistical correction methods rescued the underlying signal in these cases. Surprisingly, as the number of sources of variability increased, detecting the original signal became increasingly easier even without correction. In fact, applying statistical correction methods reduced our power to detect the underlying signal.

Conclusion: When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces one's ability to extract underlying patterns.


2020 ◽  
Author(s):  
Lungwani Muungo

The purpose of this review is to evaluate progress in molecular epidemiology over the past 24 years in cancer etiology and prevention to draw lessons for future research incorporating the new generation of biomarkers. Molecular epidemiology was introduced in the study of cancer in the early 1980s, with the expectation that it would help overcome some major limitations of epidemiology and facilitate cancer prevention. The expectation was that biomarkers would improve exposure assessment, document early changes preceding disease, and identify subgroups in the population with greater susceptibility to cancer, thereby increasing the ability of epidemiologic studies to identify causes and elucidate mechanisms in carcinogenesis. The first generation of biomarkers has indeed contributed to our understanding of risk and susceptibility related largely to genotoxic carcinogens. Consequently, interventions and policy changes have been mounted to reduce risk from several important environmental carcinogens. Several new and promising biomarkers are now becoming available for epidemiologic studies, thanks to the development of high-throughput technologies and theoretical advances in biology. These include toxicogenomics, alterations in gene methylation and gene expression, proteomics, and metabonomics, which allow large-scale studies, including discovery-oriented as well as hypothesis-testing investigations. However, most of these newer biomarkers have not been adequately validated, and their role in the causal paradigm is not clear. There is a need for their systematic validation using principles and criteria established over the past several decades in molecular cancer epidemiology.


2018 ◽  
Author(s):  
Koen Van Den Berge ◽  
Katharina Hembach ◽  
Charlotte Soneson ◽  
Simone Tiberi ◽  
Lieven Clement ◽  
...  

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, ranging from basic science studies to agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
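As a concrete illustration of the testing workflow this review surveys, here is a minimal Python sketch: library-size normalization of a count matrix to log-CPM, a per-gene Welch t-test, and Benjamini-Hochberg adjustment. All counts are simulated; production tools such as edgeR or DESeq2 instead fit negative-binomial models with moderated dispersion estimates, so treat this only as a didactic baseline.

```python
# Minimal differential-expression sketch on a simulated count matrix:
# log-CPM normalization, per-gene Welch t-test, Benjamini-Hochberg FDR.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_samples = 500, 12
group = np.array([0] * 6 + [1] * 6)

# Simulated counts with 25 genes upregulated in group 1 (hypothetical data).
counts = rng.poisson(50, (n_genes, n_samples)).astype(float)
counts[:25, group == 1] *= 3

# Counts-per-million on a log scale, with a pseudocount to handle zeros.
lib_size = counts.sum(axis=0)
log_cpm = np.log2(counts / lib_size * 1e6 + 1)

# Per-gene two-sample Welch test between the groups.
t, p = stats.ttest_ind(log_cpm[:, group == 0], log_cpm[:, group == 1],
                       axis=1, equal_var=False)

# Benjamini-Hochberg adjustment for multiple testing.
order = np.argsort(p)
ranked = p[order] * n_genes / np.arange(1, n_genes + 1)
ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
fdr = np.empty_like(p)
fdr[order] = np.minimum(ranked, 1.0)

print("genes called at FDR < 0.05:", int((fdr < 0.05).sum()))
```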




2019 ◽  
Vol 2 (1) ◽  
pp. 139-173 ◽  
Author(s):  
Koen Van den Berge ◽  
Katharina M. Hembach ◽  
Charlotte Soneson ◽  
Simone Tiberi ◽  
Lieven Clement ◽  
...  

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, ranging from basic science studies to agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
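On the quantification step specifically, a short worked example may help: converting raw read counts to TPM (transcripts per million), which normalizes first for gene length and then for sequencing depth. The counts and gene lengths below are made-up illustrative values.

```python
# Within-sample quantification sketch: read counts to TPM.
import numpy as np

counts = np.array([500.0, 1200.0, 300.0])    # reads mapped per gene
lengths_kb = np.array([2.0, 4.0, 1.0])       # gene lengths in kilobases

rate = counts / lengths_kb                   # length-normalized rates (RPK)
tpm = rate / rate.sum() * 1e6                # scale so TPMs sum to one million

print(tpm)                                   # ~[294117.6 352941.2 352941.2]
assert np.isclose(tpm.sum(), 1e6)
```

Dividing by length before scaling is what distinguishes TPM from CPM: a long gene accumulates more reads at the same expression level, and the length term removes that bias within a sample.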


2016 ◽  
Author(s):  
Robert B. Bentham ◽  
Kevin Bryson ◽  
Gyorgy Szabadkai

Abstract: The potential to understand fundamental biological processes from gene expression data has grown in parallel with the recent explosion in the size of data collections. However, to exploit this potential, novel analytical methods capable of handling massive data matrices are required. We found current methods limited in the size of correlated gene sets they could discover within biologically heterogeneous data collections, hampering the identification of fundamental cellular processes controlled by many genes, such as energy metabolism, organelle biogenesis, and stress responses. Here we describe a novel biclustering algorithm, Massively Correlated Biclustering (MCbiclust), that selects samples and genes from large datasets with maximal correlated gene expression, allowing the regulation of complex pathways to be examined. The method has been evaluated using synthetic data and applied to large bacterial and cancer cell datasets. We show that the large biclusters discovered, so far elusive to existing techniques, are biologically relevant; MCbiclust therefore has great potential in the analysis of transcriptomics data for identifying large-scale unknown effects hidden within the data. The identified massive biclusters can be used to develop improved transcriptomics-based diagnostic tools for diseases caused by altered gene expression, or for further network analysis to understand genotype-phenotype correlations.
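For intuition about what a correlation-based bicluster is, here is a toy Python sketch that plants a co-regulated gene set in a subset of samples and greedily searches for the sample subset maximizing the average absolute pairwise gene correlation. This is purely a didactic stand-in under simplified assumptions; it is not the MCbiclust algorithm itself.

```python
# Toy greedy search for a correlated bicluster: score a fixed gene set
# by its mean absolute pairwise correlation over a candidate sample set,
# and swap samples whenever a swap raises the score.
import numpy as np

rng = np.random.default_rng(2)

def corr_score(X, samples, genes):
    """Mean absolute pairwise correlation of `genes` across `samples`."""
    C = np.corrcoef(X[np.ix_(genes, samples)])
    iu = np.triu_indices_from(C, k=1)
    return np.abs(C[iu]).mean()

# Synthetic data: 100 genes x 60 samples, with genes 0-19 sharing a
# hidden factor in samples 0-29 (a planted, hypothetical bicluster).
X = rng.normal(0, 1, (100, 60))
factor = rng.normal(0, 1, 30)
X[:20, :30] += 2 * factor

genes = np.arange(20)
samples = list(rng.choice(60, size=10, replace=False))
for _ in range(200):                          # greedy sample swaps
    out = rng.choice(samples)
    cand = rng.integers(0, 60)
    if cand in samples:
        continue
    trial = [s for s in samples if s != out] + [cand]
    if corr_score(X, trial, genes) > corr_score(X, samples, genes):
        samples = trial

print(sorted(int(s) for s in samples))        # enriched for samples 0-29
```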




2003 ◽  
Vol 57 (3) ◽  
pp. 617-642 ◽  
Author(s):  
Gary King ◽  
Will Lowe

Despite widespread recognition that aggregated summary statistics on international conflict and cooperation miss most of the complex interactions among nations, the vast majority of scholars continue to employ annual, quarterly, or (occasionally) monthly observations. Daily events data, coded from some of the huge volume of news stories produced by journalists, have not been used much for the past two decades. We offer some reason to change this practice, which we feel should lead to considerably increased use of these data. We address advances in event categorization schemes and software programs that automatically produce data by “reading” news stories without human coders. We design a method that makes it feasible, for the first time, to evaluate these programs when they are applied in areas with the particular characteristics of international conflict and cooperation data, namely event categories with highly unequal prevalences, and where rare events (such as highly conflictual actions) are of special interest. We use this rare events design to evaluate one existing program, and find it to be as good as trained human coders, but obviously far less expensive to use. For large-scale data collections, the program dominates human coding. Our new evaluative method should be of use in international relations, as well as more generally in the field of computational linguistics, for evaluating other automated information extraction tools. We believe that the data created by programs similar to the one we evaluated should see dramatically increased use in international relations research. To facilitate this process, we are releasing with this article data on 3.7 million international events, covering the entire world for the past decade.
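To see why the rare-events setting demands a special evaluation design, consider the following small Python sketch: with highly unequal category prevalences, a degenerate coder that never predicts the rare category still achieves high overall accuracy, while per-category recall exposes the failure. The categories and prevalences here are illustrative, not the article's data.

```python
# Why accuracy misleads under unequal prevalences: a coder that always
# outputs the majority category looks excellent on accuracy alone.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
truth = rng.choice(["cooperation", "conflict"], size=n, p=[0.98, 0.02])

# Degenerate "coder" that never predicts the rare conflict category.
pred = np.full(n, "cooperation")

accuracy = (pred == truth).mean()
rare = truth == "conflict"
recall_rare = (pred[rare] == "conflict").mean()

print(f"accuracy: {accuracy:.3f}")                   # ~0.98, looks excellent
print(f"recall on rare events: {recall_rare:.3f}")   # 0.000, useless
```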


2017 ◽  
Author(s):  
Wen He ◽  
Shanrong Zhao ◽  
Chi Zhang ◽  
Michael S. Vincent ◽  
Baohong Zhang

Summary/Abstract: Sequencing of transcribed RNA molecules (RNA-seq) has been used widely for studying cell transcriptomes in bulk or at the single-cell level (1, 2, 3) and is becoming the de facto technology for investigating gene expression changes across biological conditions, time courses, and drug treatments. Furthermore, RNA-seq data have helped identify fusion genes that are related to certain cancers (4). Differential gene expression before and after drug treatments provides insights into the mechanism of action, pharmacodynamics of the drugs, and safety concerns (5). Because each RNA-seq run generates tens to hundreds of millions of short reads, with sizes ranging from 50 bp to 200 bp, a tool that turns these short reads into an integrated and digestible analysis report is in high demand. QuickRNASeq (6) is an application for large-scale RNA-seq data analysis and real-time interactive visualization of complex datasets. The application automates the use of several of the best open-source tools to efficiently generate a user-friendly, easy-to-share, and ready-to-publish report. Figure 1 illustrates some of the interactive plots produced by QuickRNASeq. The visualization features of the application have been further improved since its first publication in early 2016. The original QuickRNASeq publication (6) provided details of the background, software selection, and implementation. Here, we outline the steps required to implement QuickRNASeq in the user's own environment and demonstrate some basic yet powerful utilities of the advanced interactive visualization modules in the report.


2019 ◽  
Vol 35 (18) ◽  
pp. 3357-3364 ◽  
Author(s):  
Holger Weishaupt ◽  
Patrik Johansson ◽  
Anders Sundström ◽  
Zelmina Lubovac-Pilav ◽  
Björn Olsson ◽  
...  

Abstract

Motivation: Medulloblastoma (MB) is a brain cancer predominantly arising in children. Roughly 70% of patients are cured today, but survivors often suffer from severe sequelae. MB has been extensively studied by molecular profiling, but often in small and scattered cohorts. To improve cure rates and reduce treatment side effects, accurate integration of such data to increase analytical power will be important, if not essential.

Results: We have integrated 23 transcription datasets, spanning 1350 MB and 291 normal brain samples. To remove batch effects, we combined the Removal of Unwanted Variation (RUV) method with a novel pipeline for determining empirical negative control genes and a panel of metrics to evaluate normalization performance. The documented approach enabled the removal of a majority of batch effects, producing a large-scale, integrative dataset of MB and cerebellar expression data. The proposed strategy will be broadly applicable for accurate integration of data and incorporation of normal reference samples for studies of various diseases. We hope that the integrated dataset will improve current research in the field of MB by allowing more large-scale gene expression analyses.

Availability and implementation: The RUV-normalized expression data are available through the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) and can be accessed via the GSE series number GSE124814.

Supplementary information: Supplementary data are available at Bioinformatics online.
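For readers unfamiliar with RUV-style correction, the following is a simplified NumPy sketch of the core idea: estimate unwanted factors from the expression of negative control genes (genes assumed unaffected by the biology of interest; chosen arbitrarily here, whereas the paper derives them empirically) and regress those factors out of all genes. It is a stand-in for, not a reproduction of, the RUV implementation the authors used.

```python
# RUV-style correction sketch: unwanted factors are estimated by SVD of
# negative control genes, then regressed out of the full matrix.
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_genes, n_controls, k = 60, 1000, 100, 2

# Simulated log-expression driven by two hidden batch factors (hypothetical).
W_true = rng.normal(0, 1, (n_samples, k))        # unwanted factors
alpha = rng.normal(0, 1, (k, n_genes))           # gene loadings of the factors
Y = rng.normal(0, 0.5, (n_samples, n_genes)) + W_true @ alpha

# Negative controls: here simply the last 100 genes; in practice these
# would be empirically determined (e.g., least differentially expressed).
controls = np.arange(n_genes - n_controls, n_genes)

# Estimate k unwanted factors from the centered control-gene expression.
Yc = Y[:, controls] - Y[:, controls].mean(axis=0)
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
W_hat = U[:, :k] * s[:k]

# Regress the estimated factors out of every gene.
beta, *_ = np.linalg.lstsq(W_hat, Y - Y.mean(axis=0), rcond=None)
Y_corrected = Y - W_hat @ beta

print("total variance before:", round(float(Y.var()), 2),
      "| after correction:", round(float(Y_corrected.var()), 2))
```

The quality of this correction hinges entirely on the control genes: if they carry biological signal, that signal is removed too, which is why the paper pairs RUV with a pipeline for choosing empirical controls and metrics for evaluating the normalization.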

