scholarly journals Benchmarking atlas-level data integration in single-cell genomics

2021 ◽  
Author(s):  
Malte D. Luecken ◽  
M. Büttner ◽  
K. Chaichoompu ◽  
A. Danese ◽  
M. Interlandi ◽  
...  

AbstractSingle-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. We show that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development.

Author(s):  
MD Luecken ◽  
M Büttner ◽  
K Chaichoompu ◽  
A Danese ◽  
M Interlandi ◽  
...  

AbstractCell atlases often include samples that span locations, labs, and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration.Choosing a data integration method is a challenge due to the difficulty of defining integration success. Here, we benchmark 38 method and preprocessing combinations on 77 batches of gene expression, chromatin accessibility, and simulation data from 23 publications, altogether representing >1.2 million cells distributed in nine atlas-level integration tasks. Our integration tasks span several common sources of variation such as individuals, species, and experimental labs. We evaluate methods according to scalability, usability, and their ability to remove batch effects while retaining biological variation.Using 14 evaluation metrics, we find that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, BBKNN, Scanorama, and scVI perform well, particularly on complex integration tasks; Seurat v3 performs well on simpler tasks with distinct biological signals; and methods that prioritize batch removal perform best for ATAC-seq data integration. Our freely available reproducible python module can be used to identify optimal data integration methods for new data, benchmark new methods, and improve method development.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Shengquan Chen ◽  
Guanao Yan ◽  
Wenyu Zhang ◽  
Jinzhao Li ◽  
Rui Jiang ◽  
...  

AbstractThe recent advancements in single-cell technologies, including single-cell chromatin accessibility sequencing (scCAS), have enabled profiling the epigenetic landscapes for thousands of individual cells. However, the characteristics of scCAS data, including high dimensionality, high degree of sparsity and high technical variation, make the computational analysis challenging. Reference-guided approaches, which utilize the information in existing datasets, may facilitate the analysis of scCAS data. Here, we present RA3 (Reference-guided Approach for the Analysis of single-cell chromatin Accessibility data), which utilizes the information in massive existing bulk chromatin accessibility and annotated scCAS data. RA3 simultaneously models (1) the shared biological variation among scCAS data and the reference data, and (2) the unique biological variation in scCAS data that identifies distinct subpopulations. We show that RA3 achieves superior performance when used on several scCAS datasets, and on references constructed using various approaches. Altogether, these analyses demonstrate the wide applicability of RA3 in analyzing scCAS data.


2020 ◽  
Author(s):  
Bobby Ranjan ◽  
Wenjie Sun ◽  
Jinyu Park ◽  
Ronald Xie ◽  
Fatemeh Alipour ◽  
...  

Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single cell clustering pipelines. However, we found that the performance of existing feature selection methods was inconsistent across benchmark datasets, and occasionally even worse than without feature selection. Moreover, existing methods ignored information contained in gene-gene correlations. We there-fore developed DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUB-StepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. In a published scRNA-seq dataset from sorted monocytes, DUBStepR sensitively detected a rare and previously invisible population of contaminating basophils. DUBStepR is scalable to large datasets, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.


2020 ◽  
Author(s):  
Shengquan Chen ◽  
Guanao Yan ◽  
Wenyu Zhang ◽  
Jinzhao Li ◽  
Rui Jiang ◽  
...  

AbstractThe recent advancements in single-cell technologies, including single-cell chromatin accessibility sequencing (scCAS), have enabled profiling the epigenetic landscapes for thousands of individual cells. However, the characteristics of scCAS data, including high dimensionality, high degree of sparsity and high technical variation, make the computational analysis challenging. Reference-guided approach, which utilizes the information in existing datasets, may facilitate the analysis of scCAS data. We present RA3 (Reference-guided Approach for the Analysis of single-cell chromatin Acessibility data), which utilizes the information in massive existing bulk chromatin accessibility and annotated scCAS data. RA3 simultaneously models 1) the shared biological variation among scCAS data and the reference data, and 2) the unique biological variation in scCAS data that identifies distinct subpopulations. We show that RA3 achieves superior performance in many scCAS datasets. We also present several approaches to construct the reference data to demonstrate the wide applicability of RA3.


2021 ◽  
Author(s):  
Laura M. Richards ◽  
Mazdak Riverin ◽  
Suluxan Mohanraj ◽  
Shamini Ayyadhury ◽  
Danielle C. Croucher ◽  
...  

Tumours are routinely profiled with single-cell RNA sequencing (scRNA-seq) to characterize their diverse cellular ecosystems of malignant, immune, and stromal cell types. When combining data from multiple samples or studies, batch-specific technical variation can confound biological signals. However, scRNA-seq batch integration methods are often not designed for, or benchmarked, on datasets containing cancer cells. Here, we compare 5 data integration tools applied to 171,206 cells from 5 tumour scRNA-seq datasets. Based on our results, STACAS and fastMNN are the most suitable methods for integrating tumour datasets, demonstrating robust batch effect correction while preserving relevant biological variability in the malignant compartment. This comparison provides a framework for evaluating how well single-cell integration methods correct for technical variability while preserving biological heterogeneity of malignant and non-malignant cell populations.


2020 ◽  
Vol 35 (1) ◽  
pp. 2-13 ◽  
Author(s):  
Zhixiang Lin ◽  
Mahdi Zamanighomi ◽  
Timothy Daley ◽  
Shining Ma ◽  
Wing Hung Wong

2019 ◽  
Author(s):  
Yiliang Zhang ◽  
Kexuan Liang ◽  
Molei Liu ◽  
Yue Li ◽  
Hao Ge ◽  
...  

AbstractSingle-cell RNA sequencing technologies are widely used in recent years as a powerful tool allowing the observation of gene expression at the resolution of single cells. Two of the major challenges in scRNA-seq data analysis are dropout events and batch effects. The inflation of zero(dropout rate) varies substantially across single cells. Evidence has shown that technical noise, including batch effects, explains a notable proportion of this cell-to-cell variation. To capture biological variation, it is necessary to quantify and remove technical variation. Here, we introduce SCRIBE (Single-Cell Recovery Imputation with Batch Effects), a principled framework that imputes dropout events and corrects batch effects simultaneously. We demonstrate, through real examples, that SCRIBE outperforms existing scRNA-seq data analysis tools in recovering cell-specific gene expression patterns, removing batch effects and retaining biological variation across cells. Our software is freely available online at https://github.com/YiliangTracyZhang/SCRIBE.


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Wanwen Zeng ◽  
Xi Chen ◽  
Zhana Duren ◽  
Yong Wang ◽  
Rui Jiang ◽  
...  

Abstract Characterizing and interpreting heterogeneous mixtures at the cellular level is a critical problem in genomics. Single-cell assays offer an opportunity to resolve cellular level heterogeneity, e.g., scRNA-seq enables single-cell expression profiling, and scATAC-seq identifies active regulatory elements. Furthermore, while scHi-C can measure the chromatin contacts (i.e., loops) between active regulatory elements to target genes in single cells, bulk HiChIP can measure such contacts in a higher resolution. In this work, we introduce DC3 (De-Convolution and Coupled-Clustering) as a method for the joint analysis of various bulk and single-cell data such as HiChIP, RNA-seq and ATAC-seq from the same heterogeneous cell population. DC3 can simultaneously identify distinct subpopulations, assign single cells to the subpopulations (i.e., clustering) and de-convolve the bulk data into subpopulation-specific data. The subpopulation-specific profiles of gene expression, chromatin accessibility and enhancer-promoter contact obtained by DC3 provide a comprehensive characterization of the gene regulatory system in each subpopulation.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Zhiyuan Hu ◽  
Ahmed A. Ahmed ◽  
Christopher Yau

AbstractClustering of joint single-cell RNA-Seq (scRNA-Seq) data is often challenged by confounding factors, such as batch effects and biologically relevant variability. Existing batch effect removal methods typically require strong assumptions on the composition of cell populations being near identical across samples. Here, we present CIDER, a meta-clustering workflow based on inter-group similarity measures. We demonstrate that CIDER outperforms other scRNA-Seq clustering methods and integration approaches in both simulated and real datasets. Moreover, we show that CIDER can be used to assess the biological correctness of integration in real datasets, while it does not require the existence of prior cellular annotations.


2021 ◽  
Author(s):  
Boying Gong ◽  
Yun Zhou ◽  
Elizabeth Purdom

AbstractSingle-cell measurements of different cellular features or modalities from cells from the same system allow for a comprehensive understanding of a biological process. While the most common single-cell sequencing technologies require separate input cells for different modalities, there are a growing number of platforms that allow for measuring several modalities on a single cell. We present a novel method, Cobolt, for analyzing such multi-modality single-cell sequencing datasets. Cobolt jointly models the multiple modalities via a novel application of Multimodal Variational Autoencoder (MVAE) to a hierarchical generative model. We first demonstrate its performance on data from the multi-modality platform SNARE-seq, consisting of measurements of gene expression and chromatin accessibility on the same cells. We then illustrate the ability of Cobolt to integrate multi-modality platforms with single-modality platforms by jointly analyzing a SNARE-seq dataset, a single-cell gene expression dataset, and a single-cell chromatin accessibility dataset. We compared Cobolt with current options for analyzing such datasets and show that Cobolt provides robust and flexible results for integration of single-cell data on multiple modalities.


Sign in / Sign up

Export Citation Format

Share Document