Correcting for experiment-specific variability in expression compendia can remove underlying signals

Alexandra J Lee; YoSon Park; Georgia Doing; Deborah A Hogan; Casey S Greene

doi:10.1093/gigascience/giaa117

10.1101/2020.05.03.066597 ◽

2020 ◽

Author(s):

Alexandra J. Lee ◽

YoSon Park ◽

Georgia Doing ◽

Deborah A. Hogan ◽

Casey S. Greene

Keyword(s):

Neural Network ◽

Gene Expression ◽

Large Scale ◽

Original Signal ◽

Batch Effects ◽

Technical Variability ◽

Statistical Correction ◽

Before And After ◽

Data Collections ◽

Biological Patterns

Abstract Motivation: In the last two decades, scientists working in different labs have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, sometimes referred to as batch effects, may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to interpret and extract true underlying biological patterns. As more multi-experiment, integrative analysis methods are developed and available data collections increase in size, it is crucial to determine how technical variability affects our ability to detect desired patterns when many experiments are combined. Objective: We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia composed of data aggregated across multiple experiments. Method: We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. Results: We found that the signal from a baseline compendium was obscured when the number of added sources of variability was small. Perhaps as expected, applying statistical correction methods rescued the underlying signal in these cases. As the number of sources of variability increased, surprisingly, we observed that detecting the original signal became increasingly easy even without correction. In fact, applying statistical correction methods reduced our power to detect the underlying signal. Conclusion: When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces one’s ability to extract underlying patterns.
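The before-and-after comparison described in the abstract can be illustrated with a toy sketch. This is not the authors' generative neural network or their similarity measure: it assumes Gaussian noise as a stand-in for a simulated compendium, models each source of variability as a per-experiment offset added to every gene, and uses per-experiment mean-centering as a deliberately naive correction. All function names and parameters are hypothetical.

```python
import random
from statistics import mean


def simulate_compendium(n_samples, n_genes, seed=0):
    """Toy baseline: samples x genes matrix of standard-normal values (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]


def add_experiment_shifts(data, n_experiments, shift_sd=2.0, seed=1):
    """Assign samples round-robin to experiments and add one gene-wise offset per experiment."""
    rng = random.Random(seed)
    n_genes = len(data[0])
    shifts = [[rng.gauss(0.0, shift_sd) for _ in range(n_genes)]
              for _ in range(n_experiments)]
    labels = [i % n_experiments for i in range(len(data))]
    shifted = [[x + shifts[lab][g] for g, x in enumerate(row)]
               for row, lab in zip(data, labels)]
    return shifted, labels


def center_by_experiment(data, labels):
    """Naive correction (assumption): subtract each experiment's per-gene mean."""
    n_genes = len(data[0])
    groups = {}
    for row, lab in zip(data, labels):
        groups.setdefault(lab, []).append(row)
    means = {lab: [mean(r[g] for r in rows) for g in range(n_genes)]
             for lab, rows in groups.items()}
    return [[x - means[lab][g] for g, x in enumerate(row)]
            for row, lab in zip(data, labels)]


def mean_abs_diff(a, b):
    """Average element-wise absolute difference between two equally shaped matrices."""
    return mean(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    baseline = simulate_compendium(n_samples=60, n_genes=10)
    for k in (2, 6, 20):
        shifted, labels = add_experiment_shifts(baseline, n_experiments=k)
        corrected = center_by_experiment(shifted, labels)
        print(k, mean_abs_diff(baseline, shifted), mean_abs_diff(baseline, corrected))
```

Note that centering subtracts each experiment's own baseline mean along with the added offset, so the corrected matrix generally does not equal the baseline. How closely a toy like this tracks the paper's findings depends entirely on the stand-in choices above.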

Molecular Epidemiology and Biomarkers in Etiologic Cancer Research: The New in Light of the Old

10.31234/osf.io/tcdfb ◽

2020 ◽

Author(s):

Lungwani Muungo

Keyword(s):

Gene Expression ◽

Cancer Research ◽

Molecular Epidemiology ◽

Exposure Assessment ◽

Large Scale ◽

First Generation ◽

Cancer Epidemiology ◽

Policy Changes ◽

The Past ◽

New Generation

The purpose of this review is to evaluate progress in molecular epidemiology over the past 24 years in cancer etiology and prevention to draw lessons for future research incorporating the new generation of biomarkers. Molecular epidemiology was introduced in the study of cancer in the early 1980s, with the expectation that it would help overcome some major limitations of epidemiology and facilitate cancer prevention. The expectation was that biomarkers would improve exposure assessment, document early changes preceding disease, and identify subgroups in the population with greater susceptibility to cancer, thereby increasing the ability of epidemiologic studies to identify causes and elucidate mechanisms in carcinogenesis. The first generation of biomarkers has indeed contributed to our understanding of risk and susceptibility related largely to genotoxic carcinogens. Consequently, interventions and policy changes have been mounted to reduce risk from several important environmental carcinogens. Several new and promising biomarkers are now becoming available for epidemiologic studies, thanks to the development of high-throughput technologies and theoretical advances in biology. These include toxicogenomics, alterations in gene methylation and gene expression, proteomics, and metabonomics, which allow large-scale studies, including discovery-oriented as well as hypothesis-testing investigations. However, most of these newer biomarkers have not been adequately validated, and their role in the causal paradigm is not clear. There is a need for their systematic validation using principles and criteria established over the past several decades in molecular cancer epidemiology.

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283v1 ◽

2018 ◽

Cited By ~ 1

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-072018-021255 ◽

2019 ◽

Vol 2 (1) ◽

pp. 139-173 ◽

Cited By ~ 23

Author(s):

Koen Van den Berge ◽

Katharina M. Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Data Sets ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

MCbiclust: a novel algorithm to discover large-scale functionally related gene sets from massive transcriptomics data collections

10.1101/075374 ◽

2016 ◽

Author(s):

Robert B. Bentham ◽

Kevin Bryson ◽

Gyorgy Szabadkai

Keyword(s):

Gene Expression ◽

Stress Responses ◽

Large Scale ◽

Synthetic Data ◽

Heterogeneous Data ◽

Organelle Biogenesis ◽

Biologically Relevant ◽

Gene Sets ◽

Transcriptomics Data ◽

Data Collections

Abstract: The potential to understand fundamental biological processes from gene expression data has grown in parallel with the recent explosion in the size of data collections. However, to exploit this potential, novel analytical methods are required, capable of handling massive data matrices. We found current methods limited in the size of correlated gene sets they could discover within biologically heterogeneous data collections, hampering the identification of multi-gene controlled fundamental cellular processes such as energy metabolism, organelle biogenesis and stress responses. Here we describe a novel biclustering algorithm called Massively Correlated Biclustering (MCbiclust) that selects samples and genes from large datasets with maximal correlated gene expression, allowing the regulation of complex pathways to be examined. The method has been evaluated using synthetic data and applied to large bacterial and cancer cell datasets. We show that the large biclusters discovered, so far elusive to identification by existing techniques, are biologically relevant, and thus MCbiclust has great potential use in the analysis of transcriptomics data to identify large-scale unknown effects hidden within the data. The identified massive biclusters can be used to develop improved transcriptomics-based diagnosis tools for diseases caused by altered gene expression, or used for further network analysis to understand genotype-phenotype correlations.

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283v2 ◽

2018 ◽

Cited By ~ 1

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design

International Organization ◽

10.1017/s0020818303573064 ◽

2003 ◽

Vol 57 (3) ◽

pp. 617-642 ◽

Cited By ~ 183

Author(s):

Gary King ◽

Will Lowe

Keyword(s):

International Relations ◽

Information Extraction ◽

Computational Linguistics ◽

Large Scale ◽

International Conflict ◽

Rare Events ◽

News Stories ◽

Conflict And Cooperation ◽

The Past ◽

Data Collections

Despite widespread recognition that aggregated summary statistics on international conflict and cooperation miss most of the complex interactions among nations, the vast majority of scholars continue to employ annual, quarterly, or (occasionally) monthly observations. Daily events data, coded from some of the huge volume of news stories produced by journalists, have not been used much for the past two decades. We offer some reason to change this practice, which we feel should lead to considerably increased use of these data. We address advances in event categorization schemes and software programs that automatically produce data by “reading” news stories without human coders. We design a method that makes it feasible, for the first time, to evaluate these programs when they are applied in areas with the particular characteristics of international conflict and cooperation data, namely event categories with highly unequal prevalences, and where rare events (such as highly conflictual actions) are of special interest. We use this rare events design to evaluate one existing program, and find it to be as good as trained human coders, but obviously far less expensive to use. For large-scale data collections, the program dominates human coding. Our new evaluative method should be of use in international relations, as well as more generally in the field of computational linguistics, for evaluating other automated information extraction tools. We believe that the data created by programs similar to the one we evaluated should see dramatically increased use in international relations research. To facilitate this process, we are releasing with this article data on 3.7 million international events, covering the entire world for the past decade.

QuickRNASeq: Guide for Pipeline Implementation and for Interactive Results Visualization

10.1101/125856 ◽

2017 ◽

Author(s):

Wen He ◽

Shanrong Zhao ◽

Chi Zhang ◽

Michael S. Vincent ◽

Baohong Zhang

Keyword(s):

Gene Expression ◽

Large Scale ◽

Time Course ◽

Interactive Visualization ◽

Complex Data ◽

Rna Seq ◽

Short Reads ◽

Rna Molecules ◽

Before And After ◽

Drug Treatments

Summary/Abstract: Sequencing of transcribed RNA molecules (RNA-seq) has been used widely for studying cell transcriptomes in bulk or at the single-cell level (1, 2, 3) and is becoming the de facto technology for investigating gene expression level changes in various biological conditions, over time courses, and under drug treatments. Furthermore, RNA-seq data have helped identify fusion genes that are related to certain cancers (4). Differential gene expression before and after drug treatments provides insights into the mechanism of action, the pharmacodynamics of the drugs, and safety concerns (5). Because each RNA-seq run generates tens to hundreds of millions of short reads ranging in size from 50 bp to 200 bp, a tool that deciphers these short reads into an integrated and digestible analysis report is in high demand. QuickRNASeq (6) is an application for large-scale RNA-seq data analysis and real-time interactive visualization of complex data sets. This application automates the use of several of the best open-source tools to efficiently generate user-friendly, easy-to-share, and ready-to-publish reports. Figure 1 illustrates some of the interactive plots produced by QuickRNASeq. The visualization features of the application have been further improved since its first publication in early 2016. The original QuickRNASeq publication (6) provided details of background, software selection, and implementation. Here, we outline the steps required to implement QuickRNASeq in a user’s own environment, and demonstrate some basic yet powerful utilities of the advanced interactive visualization modules in the report.

Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes

Bioinformatics ◽

10.1093/bioinformatics/btz066 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3357-3364 ◽

Cited By ~ 5

Author(s):

Holger Weishaupt ◽

Patrik Johansson ◽

Anders Sundström ◽

Zelmina Lubovac-Pilav ◽

Björn Olsson ◽

...

Keyword(s):

Gene Expression ◽

Large Scale ◽

Gene Expression Omnibus ◽

Negative Control ◽

Supplementary Information ◽

Normal Brain ◽

Expression Data ◽

Batch Effects ◽

Treatment Side Effects ◽

Cure Rates

Abstract Motivation Medulloblastoma (MB) is a brain cancer predominantly arising in children. Roughly 70% of patients are cured today, but survivors often suffer from severe sequelae. MB has been extensively studied by molecular profiling, but often in small and scattered cohorts. To improve cure rates and reduce treatment side effects, accurate integration of such data to increase analytical power will be important, if not essential. Results We have integrated 23 transcription datasets, spanning 1350 MB and 291 normal brain samples. To remove batch effects, we combined the Removal of Unwanted Variation (RUV) method with a novel pipeline for determining empirical negative control genes and a panel of metrics to evaluate normalization performance. The documented approach enabled the removal of a majority of batch effects, producing a large-scale, integrative dataset of MB and cerebellar expression data. The proposed strategy will be broadly applicable for accurate integration of data and incorporation of normal reference samples for studies of various diseases. We hope that the integrated dataset will improve current research in the field of MB by allowing more large-scale gene expression analyses. Availability and implementation The RUV-normalized expression data is available through the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) and can be accessed via the GSE series number GSE124814. Supplementary information Supplementary data are available at Bioinformatics online.
