Halvade somatic: Somatic variant calling with Apache Spark

Dries Decap; Louise de Schaetzen van Brienen; Maarten Larmuseau; Pascal Costanza; Charlotte Herzeel; Roel Wuyts; Kathleen Marchal; Jan Fostier

doi:10.1093/gigascience/giab094

GigaScience ◽

10.1093/gigascience/giab094 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Dries Decap ◽

Louise de Schaetzen van Brienen ◽

Maarten Larmuseau ◽

Pascal Costanza ◽

Charlotte Herzeel ◽

...

Keyword(s):

Best Practices ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Computing Time ◽

Variant Calling ◽

Apache Spark ◽

Normal Sample ◽

Whole Genome ◽

Sequencing Data ◽

Somatic Variant

Abstract

Background: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample.

Findings: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud.

Conclusions: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.
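
The speedup figures quoted in this abstract can be rechecked with a few lines of arithmetic. The sketch below is illustrative only and is not part of Halvade Somatic: the function names are hypothetical, and the only inputs are the three rounded runtimes given in the abstract. From those rounded values the 16-node speedup comes out at roughly 14.3 rather than the reported 14.4, a gap that presumably reflects unrounded runtimes in the original measurement.

```python
# Runtimes in hours, as quoted in the abstract (rounded values).
ORIGINAL_PIPELINE_H = 84.5      # original pipeline, single 36-core node
HALVADE_SINGLE_NODE_H = 19.5    # Halvade Somatic, single 36-core node
HALVADE_16_NODES_H = 1.36       # Halvade Somatic, 16 nodes


def speedup(baseline_hours: float, improved_hours: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_hours / improved_hours


def parallel_efficiency(observed_speedup: float, node_count: int) -> float:
    """Return the fraction of ideal linear scaling achieved on node_count nodes."""
    return observed_speedup / node_count


single_node = speedup(ORIGINAL_PIPELINE_H, HALVADE_SINGLE_NODE_H)
multi_node = speedup(HALVADE_SINGLE_NODE_H, HALVADE_16_NODES_H)
efficiency = parallel_efficiency(multi_node, 16)

print(f"single-node speedup over original pipeline: {single_node:.1f}x")
print(f"16-node speedup over single node: {multi_node:.1f}x")
print(f"parallel efficiency on 16 nodes: {efficiency:.0%}")
```

Dividing the multi-node speedup by the node count gives a rough measure of how close the scaling is to linear; it is a derived quantity and is not reported in the abstract.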

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Abstract

Background: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.

Results: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that:
1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude.
2) Variant calling performance decreases substantially in low-complexity regions of the genome.
3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region.
4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood.
5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.

Conclusion: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
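
To make the notion of a Mendelian error concrete, the following minimal sketch flags trio genotypes that cannot arise from Mendelian inheritance. It is not the estimation method of the paper, which the abstract does not describe in detail. It assumes autosomal, biallelic sites, genotypes encoded as the number of alternate alleles (0, 1, or 2), no missing calls, and no de novo mutations; all names are hypothetical.

```python
from itertools import product

# Alternate-allele counts a parent can transmit, keyed by the parent's genotype.
# Assumption: autosomal biallelic site, genotype = number of alternate alleles.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_error(mother: int, father: int, child: int) -> bool:
    """Return True if the child's genotype cannot be inherited from the parents."""
    possible_children = {
        from_mother + from_father
        for from_mother, from_father in product(
            TRANSMITTABLE[mother], TRANSMITTABLE[father]
        )
    }
    return child not in possible_children


def mendelian_error_rate(trios) -> float:
    """Return the fraction of (mother, father, child) genotype trios that are errors."""
    trios = list(trios)
    if not trios:
        return 0.0
    errors = sum(is_mendelian_error(m, f, c) for m, f, c in trios)
    return errors / len(trios)


# Hypothetical genotype calls at four sites for one family trio.
example_trios = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 1)]
print(mendelian_error_rate(example_trios))
```

Only some sequencing errors surface this way: a wrong call that still yields a Mendelian-consistent trio is invisible to this check, which is why the abstract speaks of observing "a fraction of sequencing errors".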

Comprehensive identification of somatic nucleotide variants in human brain tissue

Genome Biology ◽

10.1186/s13059-021-02285-3 ◽

2021 ◽

Vol 22 (1) ◽

Cited By ~ 1

Author(s):

Yifan Wang ◽

Taejeong Bae ◽

Jeremy Thorpe ◽

Maxwell A. Sherman ◽

...

Keyword(s):

Best Practices ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Somatic Mosaicism ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Single Individual ◽

Sequencing Data ◽

Specificity And Sensitivity ◽

Neuropsychiatric Diseases

Abstract

Background: Post-zygotic mutations incurred during DNA replication, DNA repair, and other cellular processes lead to somatic mosaicism. Somatic mosaicism is an established cause of various diseases, including cancers. However, detecting mosaic variants in DNA from non-cancerous somatic tissues poses significant challenges, particularly if the variants only are present in a small fraction of cells.

Results: Here, the Brain Somatic Mosaicism Network conducts a coordinated, multi-institutional study to examine the ability of existing methods to detect simulated somatic single-nucleotide variants (SNVs) in DNA mixing experiments, generate multiple replicates of whole-genome sequencing data from the dorsolateral prefrontal cortex, other brain regions, dura mater, and dural fibroblasts of a single neurotypical individual, devise strategies to discover somatic SNVs, and apply various approaches to validate somatic SNVs. These efforts lead to the identification of 43 bona fide somatic SNVs that range in variant allele fractions from ~0.005 to ~0.28. Guided by these results, we devise best practices for calling mosaic SNVs from 250× whole-genome sequencing data in the accessible portion of the human genome that achieve 90% specificity and sensitivity. Finally, we demonstrate that analysis of multiple bulk DNA samples from a single individual allows the reconstruction of early developmental cell lineage trees.

Conclusions: This study provides a unified set of best practices to detect somatic SNVs in non-cancerous tissues. The data and methods are freely available to the scientific community and should serve as a guide to assess the contributions of somatic SNVs to neuropsychiatric diseases.

Comprehensive identification of somatic nucleotide variants in human brain tissue

10.1101/2020.10.10.332213 ◽

2020 ◽

Author(s):

Yifan Wang ◽

Taejeong Bae ◽

Jeremy Thorpe ◽

Maxwell A. Sherman ◽

Attila G. Jones ◽

...

Keyword(s):

Best Practices ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Somatic Mosaicism ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Single Individual ◽

Sequencing Data ◽

Specificity And Sensitivity ◽

Neuropsychiatric Diseases

Abstract Post-zygotic mutations incurred during DNA replication, DNA repair, and other cellular processes lead to somatic mosaicism. Somatic mosaicism is an established cause of various diseases, including cancers. However, detecting mosaic variants in DNA from non-cancerous somatic tissues poses significant challenges, particularly if the variants only are present in a small fraction of cells. Here, the Brain Somatic Mosaicism Network conducted a coordinated, multi-institutional study to: (i) examine the ability of existing methods to detect simulated somatic single nucleotide variants (SNVs) in DNA mixing experiments; (ii) generate multiple replicates of whole genome sequencing data from the dorsolateral prefrontal cortex, other brain regions, dura mater, and dural fibroblasts of a single neurotypical individual; (iii) devise strategies to discover somatic SNVs; and (iv) apply various approaches to validate somatic SNVs. These efforts led to the identification of 43 bona fide somatic SNVs that ranged in variant allele fractions from ~0.005 to ~0.28. Guided by these results, we devised best practices for calling mosaic SNVs from 250× whole genome sequencing data in the accessible portion of the human genome that achieve 90% specificity and sensitivity. Finally, we demonstrated that analysis of multiple bulk DNA samples from a single individual allows the reconstruction of early developmental cell lineage trees. Thus, this study provides a unified set of best practices to detect somatic SNVs in non-cancerous tissues. The data and methods are freely available to the scientific community and should serve as a guide to assess the contributions of somatic SNVs to neuropsychiatric diseases.

Comparison of three variant callers for human whole genome sequencing

10.1101/461798 ◽

2018 ◽

Author(s):

Anna Supernat ◽

Oskar Valdimar Vidarsson ◽

Vidar M. Steen ◽

Tomasz Stokowy

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Single Gene ◽

Reference Sample ◽

Variant Calling ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Whole Exome ◽

Indel Calling

Abstract Testing of patients with genetics-related disorders is shifting from single gene assays to gene panel sequencing, whole-exome sequencing (WES) and whole-genome sequencing (WGS). Since WGS is unquestionably becoming a new foundation for molecular analyses, we decided to compare three currently used tools for variant calling of human whole genome sequencing data. We tested DeepVariant, a new TensorFlow machine learning-based variant caller, and compared this tool to GATK 4.0 and SpeedSeq, using 30×, 15× and 10× WGS data of the well-known NA12878 DNA reference sample. According to our comparison, the performance on SNV calling was nearly identical in 30× data, with all three variant callers reaching F-Scores (i.e. harmonic mean of recall and precision) equal to 0.98. In contrast, DeepVariant was more precise in indel calling than GATK and SpeedSeq, as demonstrated by F-Scores of 0.94, 0.90 and 0.84, respectively. We conclude that the DeepVariant tool has great potential and usefulness for analysis of WGS data in medical genetics.
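
The F-Score used in this comparison is the harmonic mean of precision and recall. A minimal sketch of that formula follows; the function name is hypothetical and the example precision and recall values are made up for illustration, since the abstract reports only the resulting F-Scores and not their components.

```python
def f_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        # Both components are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, not taken from the paper.
example_precision = 0.96
example_recall = 0.92
print(f"F-Score: {f_score(example_precision, example_recall):.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a caller cannot reach a high F-Score by excelling at precision alone or recall alone.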

Empirical evaluation of variant calling accuracy using ultra-deep whole-genome sequencing data

Scientific Reports ◽

10.1038/s41598-018-38346-0 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 12

Author(s):

Toshihiro Kishikawa ◽

Yukihide Momozawa ◽

Takeshi Ozeki ◽

Taisei Mushiroda ◽

Hidenori Inohara ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Empirical Evaluation ◽

Variant Calling ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Empirical design of a variant quality control pipeline for whole genome sequencing data using replicate discordance

Scientific Reports ◽

10.1038/s41598-019-52614-7 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Robert P. Adelson ◽

Alan E. Renton ◽

Wentian Li ◽

Nir Barzilai ◽

Gil Atzmon ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Read Depth ◽

Concordance Rate ◽

Whole Genome Sequencing Data ◽

Analysis Tool ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide

Abstract The success of next-generation sequencing depends on the accuracy of variant calls. Few objective protocols exist for QC following variant calling from whole genome sequencing (WGS) data. After applying QC filtering based on Genome Analysis Tool Kit (GATK) best practices, we used genotype discordance of eight samples that were sequenced twice each to evaluate the proportion of potentially inaccurate variant calls. We designed a QC pipeline involving hard filters to improve replicate genotype concordance, which indicates improved accuracy of genotype calls. Our pipeline analyzes the efficacy of each filtering step. We initially applied this strategy to well-characterized variants from the ClinVar database, and subsequently to the full WGS dataset. The genome-wide biallelic pipeline removed 82.11% of discordant and 14.89% of concordant genotypes, and improved the concordance rate from 98.53% to 99.69%. The variant-level read depth filter most improved the genome-wide biallelic concordance rate. We also adapted this pipeline for triallelic sites, given the increasing proportion of multiallelic sites as sample sizes increase. For triallelic sites containing only SNVs, the concordance rate improved from 97.68% to 99.80%. Our QC pipeline removes many potentially false positive calls that pass in GATK, and may inform future WGS studies prior to variant effect analysis.
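
The reported rise in concordance follows from the two removal percentages, as the short calculation below illustrates. It is a sketch under one assumption: that the removed fractions (82.11% of discordant and 14.89% of concordant genotypes) apply to the same genotype set that had the initial 98.53% concordance rate. The function name is hypothetical.

```python
def concordance_after_filtering(
    initial_concordance: float,
    discordant_removed: float,
    concordant_removed: float,
) -> float:
    """Return the concordance rate among genotypes that survive filtering.

    All arguments are fractions between 0 and 1.
    """
    concordant_left = initial_concordance * (1.0 - concordant_removed)
    discordant_left = (1.0 - initial_concordance) * (1.0 - discordant_removed)
    return concordant_left / (concordant_left + discordant_left)


# Figures quoted in the abstract for the genome-wide biallelic pipeline.
rate = concordance_after_filtering(
    initial_concordance=0.9853,
    discordant_removed=0.8211,
    concordant_removed=0.1489,
)
print(f"concordance after filtering: {rate:.2%}")
```

Under that assumption the result rounds to 99.69%, matching the improved concordance rate stated in the abstract.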

From whole genome sequencing data toward a simple genotyping tool: application to the animal pathogen Mycobacterium bovis

10.26226/morressier.56d5ba2ad462b80296c965c0 ◽

2016 ◽

Author(s):

Lorraine Michelet

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Mycobacterium Bovis ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Plasmids or no plasmids? A comparison between the agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Whole Genome Sequencing Refines Knowledge on the Population Structure of Mycobacterium bovis from a Multi-Host Tuberculosis System

Microorganisms ◽

10.3390/microorganisms9081585 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1585

Author(s):

Ana C. Reis ◽

Liliana C. M. Salvador ◽

Suelee Robbe-Austerman ◽

Rogério Tenreiro ◽

Ana Botelho ◽

...

Keyword(s):

Population Structure ◽

Whole Genome Sequencing ◽

Wild Boar ◽

Genome Sequencing ◽

Mycobacterium Bovis ◽

Red Deer ◽

Variable Number Tandem Repeat ◽

Variant Calling ◽

Whole Genome ◽

Network Analyses

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support to the key role of red deer and wild boar on disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis for a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.

High-precision and cost-efficient sequencing for real-time COVID-19 surveillance

Scientific Reports ◽

10.1038/s41598-021-93145-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sung Yong Park ◽

Gina Faraci ◽

Pamela M. Ward ◽

Jane F. Emerson ◽

Ha Youn Lee

Keyword(s):

Los Angeles ◽

Whole Genome Sequencing ◽

Real Time ◽

Genome Sequencing ◽

High Precision ◽

High Throughput Sequencing ◽

Whole Genome ◽

Sequencing Data ◽

Public Health Response ◽

Cost Efficient

Abstract COVID-19 global cases have climbed to more than 33 million, with over a million total deaths, as of September 2020. Real-time massive SARS-CoV-2 whole genome sequencing is key to tracking chains of transmission and estimating the origin of disease outbreaks. Yet no methods have simultaneously achieved high precision, simple workflow, and low cost. We developed a high-precision, cost-efficient SARS-CoV-2 whole genome sequencing platform for COVID-19 genomic surveillance, CorvGenSurv (Coronavirus Genomic Surveillance). CorvGenSurv directly amplified viral RNA from COVID-19 patients’ Nasopharyngeal/Oropharyngeal (NP/OP) swab specimens and sequenced the SARS-CoV-2 whole genome in three segments by long-read, high-throughput sequencing. Sequencing of the whole genome in three segments significantly reduced sequencing data waste, thereby preventing dropouts in genome coverage. We validated the precision of our pipeline by both control genomic RNA sequencing and Sanger sequencing. We produced near full-length whole genome sequences from individuals who were COVID-19 test positive during April to June 2020 in Los Angeles County, California, USA. These sequences were highly diverse in the G clade with nine novel amino acid mutations including NSP12-M755I and ORF8-V117F. With its readily adaptable design, CorvGenSurv grants wide access to genomic surveillance, permitting immediate public health response to sudden threats.
