BnpC: Bayesian non-parametric clustering of single-cell mutation profiles

Abstract. Motivation: The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intratumor heterogeneity (ITH) by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq datasets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods. Results: Here, we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. We benchmarked our method comprehensively against state-of-the-art methods on simulated data using various data sizes, and applied it to three cancer scDNA-seq datasets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime and scalability. Its inferred genotypes were the most accurate, especially on highly heterogeneous data, and it was the only method able to run and produce results on datasets with 5000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever growing scDNA-seq datasets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve ITH but also as a preprocessing step to reduce data size. Availability and implementation: BnpC is freely available under MIT license at https://github.com/cbg-ethz/BnpC. Supplementary information: Supplementary data are available at Bioinformatics online.

Bayesian non-parametric clustering of single-cell mutation profiles

10.1101/2020.01.15.907345 ◽

2020 ◽

Cited By ~ 1

Author(s):

Nico Borgsmüller ◽

Jose Bonet ◽

Francesco Marass ◽

Abel Gonzalez-Perez ◽

Nuria Lopez-Bigas ◽

...

Keyword(s):

Single Cell ◽

Dirichlet Process ◽

Tumor Heterogeneity ◽

Missing Values ◽

Parametric Method ◽

Simulated Data ◽

Error Rates ◽

Data Sets ◽

Dirichlet Process Mixture ◽

Non Parametric

Abstract. The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intra-tumor heterogeneity by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq data sets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods. Here we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. BnpC employs a Dirichlet process mixture model coupled with a Markov chain Monte Carlo sampling scheme, including a modified split-merge move and a novel posterior estimator to predict clones and genotypes. We benchmarked our method comprehensively against state-of-the-art methods on simulated data using various data sizes, and applied it to three cancer scDNA-seq data sets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime, and scalability. Its inferred genotypes were the most accurate, and it was the only method able to run and produce results on data sets with 10,000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever growing scDNA-seq data sets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve intra-tumor heterogeneity but also as a pre-processing step to reduce data size. BnpC is freely available under MIT license at https://github.com/cbg-ethz/BnpC.
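A Dirichlet process mixture model places a prior over partitions of the cells without fixing the number of clusters in advance. One common way to picture that prior is the Chinese restaurant process. The following is a minimal sketch of that prior only, not of BnpC's sampler: the function name `sample_crp_labels` and the concentration value `alpha=1.0` are illustrative assumptions, and the likelihood, split-merge move and posterior estimator mentioned in the abstract are not modeled.

```python
import random


def sample_crp_labels(n_cells, alpha, rng):
    """Draw one partition of n_cells from a Chinese restaurant process.

    alpha is the concentration parameter (hypothetical value chosen by the
    caller); larger values tend to produce more clusters.
    """
    labels = []         # labels[i] = cluster index assigned to cell i
    cluster_sizes = []  # cluster_sizes[k] = number of cells in cluster k
    for _ in range(n_cells):
        # An existing cluster k is chosen with weight cluster_sizes[k];
        # a brand-new cluster is opened with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        labels.append(k)
    return labels, cluster_sizes


rng = random.Random(0)
labels, cluster_sizes = sample_crp_labels(200, alpha=1.0, rng=rng)
print(len(cluster_sizes), "clusters with sizes", cluster_sizes)
```

The number of clusters is an outcome of the draw rather than an input, which is the property that makes this family of models attractive when the number of clones is unknown.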

scDoc: correcting drop-out events in single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa283 ◽

2020 ◽

Vol 36 (15) ◽

pp. 4233-4239

Author(s):

Di Ran ◽

Shanshan Zhang ◽

Nicholas Lytal ◽

Lingling An

Keyword(s):

Single Cell ◽

Simulated Data ◽

Drop Out ◽

Cellular Heterogeneity ◽

Supplementary Information ◽

Cell Subpopulation ◽

Rna Seq ◽

Imputation Methods ◽

Similarity Estimation ◽

Differential Expression Detection

Abstract. Motivation: Single-cell RNA-sequencing (scRNA-seq) has become an important tool to unravel cellular heterogeneity, discover new cell (sub)types, and understand cell development at single-cell resolution. However, one major challenge to scRNA-seq research is the presence of ‘drop-out’ events, which are usually due to extremely low mRNA input or the stochastic nature of gene expression. In this article, we present a novel single-cell RNA-seq drop-out correction (scDoc) method, imputing drop-out events by borrowing information for the same gene from highly similar cells. Results: scDoc is the first method that directly incorporates drop-out information into cell-to-cell similarity estimation, which is crucial in scRNA-seq drop-out imputation but has not been appropriately examined. We evaluated the performance of scDoc using both simulated data and real scRNA-seq studies. Results show that scDoc outperforms the existing imputation methods with respect to data visualization, cell subpopulation identification and differential expression detection in scRNA-seq data. Availability and implementation: R code is available at https://github.com/anlingUA/scDoc. Supplementary information: Supplementary data are available at Bioinformatics online.
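The general idea of borrowing information for the same gene from highly similar cells can be illustrated with a small sketch. This is not scDoc's algorithm: as a simplifying assumption it treats every zero as a drop-out and uses plain cosine similarity, whereas the abstract states that scDoc incorporates drop-out information into the similarity estimate. The names `cosine_similarity`, `impute_zeros` and the parameter `k` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two expression vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def impute_zeros(expr, k=2):
    """Replace zeros with a similarity-weighted mean over the k most similar cells.

    expr is a list of cells, each a list of per-gene counts. Assumption for
    illustration: every zero is treated as a drop-out event.
    """
    n_cells = len(expr)
    imputed = [row[:] for row in expr]
    for i in range(n_cells):
        # Rank all other cells by similarity to cell i and keep the top k.
        neighbors = sorted(
            ((cosine_similarity(expr[i], expr[j]), j)
             for j in range(n_cells) if j != i),
            reverse=True,
        )[:k]
        total = sum(sim for sim, _ in neighbors)
        if total <= 0:
            continue  # no informative neighbor; leave the cell unchanged
        for g, value in enumerate(expr[i]):
            if value == 0:
                imputed[i][g] = sum(sim * expr[j][g] for sim, j in neighbors) / total
    return imputed


expr = [[5, 0, 3], [4, 2, 3], [0, 7, 0]]
imputed = impute_zeros(expr, k=1)
print(imputed)
```

With `k=1`, each zero is simply replaced by the value observed for that gene in the single most similar cell; larger `k` averages over more neighbors.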

Small numbers are an opportunity, not a problem

Scientia Medica ◽

10.15448/1980-6108.2021.1.40128 ◽

2021 ◽

Vol 31 (1) ◽

pp. e40128

Author(s):

Jimmie Leppink

Keyword(s):

Time Window ◽

Parametric Method ◽

Simulated Data ◽

Education And Training ◽

Repeated Measurements ◽

Individual Participant ◽

Research In Education ◽

The Individual ◽

And Training ◽

Non Parametric

Aims: Outcomes of research in education and training are partly a function of the context in which that study takes place, the questions we ask, and what is feasible. Many questions are about learning, which involves repeated measurements in a particular time window, and the practical context is usually such that offering an intervention to some but not to all learners does not make sense or is unethical. For quality assurance and other purposes, education and training centers may have very locally oriented questions that they seek to answer, such as whether an intervention can be considered effective in their context of small numbers of learners. While the rationale behind the design and outcomes of this kind of study may be of interest to a much wider community, for example to study the transferability of findings to other contexts, people are often discouraged from reporting on the outcomes of such studies at conferences or in educational research journals. The aim of this paper is to counter that discouragement and instead encourage people to see small numbers as an opportunity instead of as a problem. Method: A worked example of a parametric and a non-parametric method for this type of situation, using simulated data in the zero-cost Open Source statistical program R version 4.0.5. Results: Contrary to the non-parametric method, the parametric method can provide estimates of intervention effectiveness for the individual participant and account for trends in different phases of a study. However, the non-parametric method provides a solution in several situations where the parametric method should not be used. Conclusion: Given the costs of research, the lessons to be learned from research, and statistical methods available, small numbers should be considered an opportunity, not a problem.

SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz752 ◽

2019 ◽

Cited By ~ 2

Author(s):

Giacomo Baruzzo ◽

Ilaria Patuzzi ◽

Barbara Di Camillo

Keyword(s):

Single Cell ◽

Count Data ◽

Simulated Data ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

Rna Seq ◽

Distribution Of Zeros ◽

New Methods ◽

Research Fields

Abstract. Motivation: Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity with real data. Results: In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation: SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information: Supplementary data are available at Bioinformatics online.
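To make the phrase "Gamma-Multivariate Hypergeometric model" concrete, the sketch below chains the two distributions in the simplest possible way: a gamma draw sets each gene's expression level in a cell, and a draw without replacement from the resulting pool of molecules (a multivariate hypergeometric sample) produces the observed counts. The parameterization is an assumption made for illustration and is not taken from SPARSim; `simulate_cell`, `intensity`, `variability` and `depth` are hypothetical names. The `counts=` argument of `random.sample` requires Python 3.9 or later.

```python
import random
from collections import Counter


def simulate_cell(intensity, variability, depth, rng):
    """Simulate counts for one cell.

    intensity[g]   : mean expression level of gene g (assumed parameter)
    variability[g] : squared coefficient of variation of gene g (assumed)
    depth          : number of molecules sequenced from the cell
    """
    # Gamma step: shape 1/phi and scale m*phi give mean m and squared CV phi.
    levels = [rng.gammavariate(1.0 / phi, m * phi)
              for m, phi in zip(intensity, variability)]
    urn = [round(level) for level in levels]  # molecules available per gene

    n_draw = min(depth, sum(urn))
    if n_draw == 0:
        return [0] * len(urn), urn

    # Multivariate hypergeometric step: draw molecules without replacement.
    drawn = rng.sample(range(len(urn)), k=n_draw, counts=urn)
    tally = Counter(drawn)
    return [tally.get(g, 0) for g in range(len(urn))], urn


rng = random.Random(1)
counts, urn = simulate_cell(
    intensity=[50.0, 200.0, 5.0, 0.5],
    variability=[0.1, 0.1, 0.5, 1.0],
    depth=60,
    rng=rng,
)
print("available:", urn, "observed:", counts)
```

Because sampling is without replacement from a finite pool, lowly expressed genes are frequently missed entirely, which is one simple way zeros can arise in simulated count data.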

A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference

10.1101/693960 ◽

2019 ◽

Author(s):

Mohammadamin Edrisi ◽

Hamim Zafar ◽

Luay Nakhleh

Keyword(s):

Single Cell ◽

Human Cancer ◽

Evolutionary Relationship ◽

Error Rates ◽

Intratumor Heterogeneity ◽

Combinatorial Approach ◽

Cancer Dataset ◽

Inference Problem ◽

Single Cell Sequencing ◽

Joint Inference

Abstract. Single-cell sequencing provides a powerful approach for elucidating intratumor heterogeneity by resolving cell-to-cell variability. However, it also poses additional challenges including elevated error rates, allelic dropout and non-uniform coverage. A recently introduced single-cell-specific mutation detection algorithm leverages the evolutionary relationship between cells for denoising the data. However, due to its probabilistic nature, this method does not scale well with the number of cells. Here, we develop a novel combinatorial approach for utilizing the genealogical relationship of cells in detecting mutations from noisy single-cell sequencing data. Our method, called scVILP, jointly detects mutations in individual cells and reconstructs a perfect phylogeny among these cells. We employ a novel Integer Linear Program algorithm for deterministically and efficiently solving the joint inference problem. We show that scVILP achieves similar or better accuracy and significantly better runtime than existing methods on simulated data. We also applied scVILP to an empirical human cancer dataset from a high grade serous ovarian cancer patient.
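The perfect phylogeny constraint mentioned in the abstract has a well-known pairwise characterization for binary mutation matrices with an unmutated root: no two mutations may show all three patterns (1,0), (0,1) and (1,1) across cells. The sketch below only checks that condition on an already-called matrix; it is not the Integer Linear Program used by scVILP, and the function name `admits_perfect_phylogeny` is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Check the pairwise conflict condition on a binary cells-by-mutations matrix.

    Assumes an all-zero (unmutated) root. Returns False if any pair of
    mutations exhibits the patterns (1,0), (0,1) and (1,1) in different cells.
    """
    n_mutations = len(matrix[0]) if matrix else 0
    conflict = {(1, 0), (0, 1), (1, 1)}
    for a, b in combinations(range(n_mutations), 2):
        patterns = {(row[a], row[b]) for row in matrix}
        if conflict <= patterns:
            return False
    return True


# Mutation 1 always occurs on top of mutation 0: the lineages are nested.
nested = [[1, 0], [1, 1], [0, 0]]
# Each mutation occurs alone and both occur together: a conflict.
conflicting = [[1, 0], [0, 1], [1, 1]]

print(admits_perfect_phylogeny(nested), admits_perfect_phylogeny(conflicting))
```

On noisy single-cell data this check typically fails, which is why methods in this area correct genotype calls and infer the tree jointly rather than testing the raw matrix.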

Scelestial: fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

10.1101/2021.05.24.445405 ◽

2021 ◽

Author(s):

Mohammad-Hadi Foroughmand-Araabi ◽

Sama Goliaei ◽

Alice Carolyn McHardy

Keyword(s):

Approximation Algorithm ◽

Single Cell ◽

Steiner Tree ◽

Missing Values ◽

Cell Lineage ◽

Error Rates ◽

Steiner Tree Problem ◽

Tree Reconstruction ◽

Tree Inference ◽

Lineage Tree

Single-cell genome sequencing provides a highly granular view of biological systems but is affected by high error rates, allelic amplification bias, and uneven genome coverage. This creates a need for data-specific computational methods, for purposes such as cell lineage tree inference. The objective of cell lineage tree reconstruction is to infer the evolutionary process that generated a set of observed cell genomes. Lineage trees may enable a better understanding of tumor formation and growth, as well as of organ development for healthy body cells. We describe a method, Scelestial, for lineage tree reconstruction from single-cell data, which is based on an approximation algorithm for the Steiner tree problem and is a generalization of the neighbor-joining method. We adapt the algorithm to efficiently select a limited subset of potential sequences as internal nodes, in the presence of missing values, and to minimize cost by lineage tree-based missing value imputation. In a comparison against seven state-of-the-art single-cell lineage tree reconstruction algorithms - BitPhylogeny, OncoNEM, SCITE, SiFit, SASC, SCIPhI, and SiCloneFit - on simulated and real single-cell tumor samples, Scelestial performed best at reconstructing trees in terms of accuracy and run time. Scelestial has been implemented in C++. It is also available as an R package named RScelestial.

SCIΦ: Single-cell mutation identification via phylogenetic inference

10.1101/290908 ◽

2018 ◽

Cited By ~ 1

Author(s):

Jochen Singer ◽

Jack Kuipers ◽

Katharina Jahn ◽

Niko Beerenwinkel

Keyword(s):

Single Cell ◽

Lymphoblastic Leukemia ◽

Evolutionary Relationship ◽

Simulated Data ◽

Error Rates ◽

Cancer Therapies ◽

Sequencing Data ◽

Allelic Dropout ◽

Single Cell Sequencing ◽

Real World Datasets

Abstract. Understanding the evolution of cancer is important for the development of appropriate cancer therapies. The task is challenging because tumors evolve as heterogeneous cell populations with an unknown number of genetically distinct subclones of varying frequencies. Conventional approaches based on bulk sequencing are limited in addressing this challenge as clones cannot be observed directly. Single-cell sequencing holds the promise of resolving the heterogeneity of tumors; however, it has its own challenges including elevated error rates, allelic dropout, and uneven coverage. Here, we develop a new approach to mutation detection in individual tumor cells by leveraging the evolutionary relationship among cells. Our method, called SCIΦ, jointly calls mutations in individual cells and estimates the tumor phylogeny among these cells. Employing a Markov Chain Monte Carlo scheme, we robustly account for the various sources of noise in single-cell sequencing data. Our approach enables us to reliably call mutations in each single cell even in experiments with high dropout rates and missing data. We show that SCIΦ outperforms existing methods on simulated data and apply it to different real-world datasets, namely a whole-exome breast cancer dataset as well as a panel acute lymphoblastic leukemia dataset. Availability: https://github.com/cbg-ethz/SCIPhI

Usefulness of the DETECT program for assessing the internal structure of dimensionality in simulated data and results of the Korean nursing licensing examination

Journal of Educational Evaluation for Health Professions ◽

10.3352/jeehp.2017.14.32 ◽

2017 ◽

Vol 14 ◽

pp. 32 ◽

Cited By ~ 1

Author(s):

Dong Gi Seo ◽

Younyoung Choi ◽

Sun Huh

Keyword(s):

Internal Structure ◽

Parametric Method ◽

Simulated Data ◽

Parametric Methods ◽

Simulation Data ◽

Content Areas ◽

Proper Number ◽

Examination Methods ◽

Licensing Examination ◽

Non Parametric

Purpose: The dimensionality of examinations provides empirical evidence of the internal test structure underlying the responses to a set of items. In turn, the internal structure is an important piece of evidence of the validity of an examination. Thus, the aim of this study was to investigate the performance of the DETECT program and to use it to examine the internal structure of the Korean nursing licensing examination. Methods: Non-parametric methods of dimensional testing, such as the DETECT program, have been proposed as ways of overcoming the limitations of traditional parametric methods. A non-parametric method (the DETECT program) was investigated using simulation data under several conditions and applied to the Korean nursing licensing examination. Results: The DETECT program performed well in terms of determining the number of underlying dimensions under several different conditions in the simulated data. Further, the DETECT program correctly revealed the internal structure of the Korean nursing licensing examination, meaning that it detected the proper number of dimensions and appropriately clustered the items within each dimension. Conclusion: The DETECT program performed well in detecting the number of dimensions and in assigning items for each dimension. This result implies that the DETECT method can be useful for examining the internal structure of assessments, such as licensing examinations, that possess relatively many domains and content areas.

cgCorrect: A method to correct for confounding cell-cell variation due to cell growth in single-cell transcriptomics

10.1101/057463 ◽

2016 ◽

Author(s):

Thomas Blasi ◽

Florian Buettner ◽

Michael K. Strasser ◽

Carsten Marr ◽

Fabian J. Theis

Keyword(s):

Gene Expression ◽

Steady State ◽

Cell Growth ◽

Single Cell ◽

Cell Size ◽

Computational Analysis ◽

Simulated Data ◽

Supplementary Information ◽

Mrna Transcript ◽

Transcriptomics Data

Abstract. Motivation: Accessing gene expression at the single cell level has unraveled often large heterogeneity among seemingly homogeneous cells, which remained obscured in traditional population based approaches. The computational analysis of single-cell transcriptomics data, however, still imposes unresolved challenges with respect to normalization, visualization and modeling the data. One such issue is differences in cell size, which introduce additional variability into the data, for which appropriate normalization techniques are needed. Otherwise, these differences in cell size may obscure genuine heterogeneities among cell populations and lead to overdispersed steady-state distributions of mRNA transcript numbers. Results: We present cgCorrect, a statistical framework to correct for differences in cell size that are due to cell growth in single-cell transcriptomics data. We derive the probability for the cell growth corrected mRNA transcript number given the measured, cell size dependent mRNA transcript number, based on the assumption that the average number of transcripts in a cell increases proportionally to the cell’s volume during the cell cycle. cgCorrect can be used both for data normalization and to analyze steady-state distributions used to infer the gene expression mechanism. We demonstrate its applicability on both simulated data and single-cell quantitative real-time PCR data from mouse blood stem and progenitor cells. We show that correcting for differences in cell size affects the interpretation of the data obtained by typically performed computational analysis. Availability: A Matlab implementation of cgCorrect is available at http://icb.helmholtz-muenchen.de/cgCorrect. Supplementary information: Supplementary information is available online. The simulated data set is available at http://icb.helmholtz-muenchen.de/cgCorrect

AdImpute: An Imputation Method for Single-Cell RNA-Seq Data Based on Semi-Supervised Autoencoders

Frontiers in Genetics ◽

10.3389/fgene.2021.739677 ◽

2021 ◽

Vol 12 ◽

Author(s):

Li Xu ◽

Yin Xu ◽

Tong Xue ◽

Xinyu Zhang ◽

Jin Li

Keyword(s):

Single Cell ◽

Missing Values ◽

Simulated Data ◽

Real Data ◽

Imputation Method ◽

Data Sets ◽

Silent Genes ◽

Downstream Analysis ◽

The Cost ◽

Simulated Data Sets

Motivation: The emergence of single-cell RNA sequencing (scRNA-seq) technology has paved the way for measuring RNA levels at single-cell resolution to study precise biological functions. However, the large number of missing values in scRNA-seq data affects downstream analysis. This paper presents AdImpute, an imputation method based on semi-supervised autoencoders. The method uses the results of another imputation method (DrImpute is used as an example) as imputation weights for the autoencoder, and applies a cost function with these imputation weights to learn the latent information in the data and achieve more accurate imputation. Results: As shown in clustering experiments with simulated and real data sets, AdImpute is more accurate than four other publicly available scRNA-seq imputation methods, and minimally modifies the biologically silent genes. Overall, AdImpute is an accurate and robust imputation method.
