LEVIATHAN: efficient discovery of large structural variants by leveraging long-range information from Linked-Reads data

Linked-Reads technologies, popularized by 10x Genomics, combine the high quality and low cost of short-read sequencing with long-range information by adding barcodes that tag reads originating from the same long DNA fragment. Thanks to their high quality and long-range information, such reads are particularly useful for applications such as genome scaffolding and structural variant calling. As a result, multiple structural variant calling methods have been developed within the last few years. However, these methods were mainly tested on human data and do not run well on non-human organisms, for which reference genomes are highly fragmented or sequencing data display high levels of heterozygosity. Moreover, even on human data, most tools still require large amounts of computing resources. We present LEVIATHAN, a new structural variant calling tool that aims to address these issues, and in particular to scale better and apply to a wide variety of organisms. Our method relies on a barcode index that makes it possible to quickly compare all possible pairs of regions in terms of the number of barcodes they share. Region pairs sharing a sufficient number of barcodes are then considered as potential structural variants, and complementary, classical short-read methods are applied to further refine the breakpoint coordinates. Our experiments on simulated data show that our method compares well to the state of the art, both in terms of recall and precision and in terms of resource consumption. Moreover, LEVIATHAN was successfully applied to a real dataset from a non-model organism, while all other tools either failed to run or required unreasonable amounts of resources. LEVIATHAN is implemented in C++, supported on Linux platforms, and available under the AGPL-3.0 License at https://github.com/morispi/LEVIATHAN.
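
The barcode-sharing idea described in this abstract can be illustrated with a small Python sketch. This is not LEVIATHAN's implementation (the tool itself is written in C++); the function names, the fixed window size, the `(chrom, pos, barcode)` input tuples, and the `min_shared` threshold are all assumptions made for illustration. The sketch also ignores that neighbouring windows share barcodes simply because they lie on the same fragment, which a real caller would have to account for.

```python
from collections import defaultdict
from itertools import combinations


def build_barcode_index(alignments, window=1000):
    """Map each (chrom, window_start) region to the set of barcodes seen in it.

    `alignments` is an iterable of hypothetical (chrom, pos, barcode) tuples.
    """
    index = defaultdict(set)
    for chrom, pos, barcode in alignments:
        region = (chrom, (pos // window) * window)
        index[region].add(barcode)
    return index


def candidate_pairs(index, min_shared=2):
    """Return region pairs that share at least `min_shared` barcodes."""
    # Invert the index (barcode -> regions) so that only regions that
    # actually co-occur under some barcode are ever compared.
    by_barcode = defaultdict(set)
    for region, barcodes in index.items():
        for barcode in barcodes:
            by_barcode[barcode].add(region)

    shared = defaultdict(int)
    for regions in by_barcode.values():
        for a, b in combinations(sorted(regions), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy input: barcodes A and B both link chr1:0 with chr1:50000.
alignments = [
    ("chr1", 100, "A"), ("chr1", 150, "B"), ("chr1", 120, "C"),
    ("chr1", 50500, "A"), ("chr1", 50700, "B"),
    ("chr2", 10, "C"),
]
pairs = candidate_pairs(build_barcode_index(alignments), min_shared=2)
print(pairs)
```

In this toy input only the pair of distant chr1 windows passes the threshold; the chr1/chr2 pair shares a single barcode and is dropped.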

GRIDSS, PURPLE, LINX: Unscrambling the tumor genome via integrated analysis of structural variation and copy number

10.1101/781013 ◽ 2019 ◽ Cited By ~ 8

Author(s): Daniel L. Cameron ◽ Jonathan Baber ◽ Charles Shale ◽ Anthony T. Papenfuss ◽ Jose Espejo Valle-Inclan ◽ ...

Keyword(s): Copy Number ◽ Variant Calling ◽ Genomic Rearrangements ◽ Whole Genome Sequencing Data ◽ Integrated Analysis ◽ Derivative Chromosome ◽ Structural Variants ◽ Sequencing Data ◽ Structural Variant ◽ Complex Events

Abstract We have developed a novel, integrated and comprehensive purity, ploidy, structural variant and copy number somatic analysis toolkit for whole genome sequencing data of paired tumor/normal samples. We show that the combination of using GRIDSS for somatic structural variant calling and PURPLE for somatic copy number alteration calling allows highly sensitive, precise and consistent copy number and structural variant determination, as well as providing novel insights for short structural variants and regions of complex local topology. LINX, an interpretation tool, leverages the integrated structural variant and copy number calling to cluster individual structural variants into higher order events and chains them together to predict local derivative chromosome structure. LINX classifies and extensively annotates genomic rearrangements including simple and reciprocal breaks, LINE, viral and pseudogene insertions, and complex events such as chromothripsis. LINX also comprehensively calls genic fusions including chained fusions. Finally, our toolkit provides novel visualisation methods providing insight into complex genomic rearrangements.

SimFFPE and FilterFFPE: improving structural variant calling in FFPE samples

GigaScience ◽ 10.1093/gigascience/giab065 ◽ 2021 ◽ Vol 10 (9) ◽ Cited By ~ 1

Author(s): Lanying Wei ◽ Martin Dugas ◽ Sarah Sandmann

Keyword(s): Next Generation Sequencing ◽ Variant Calling ◽ Real Data ◽ Next Generation Sequencing Data ◽ Next Generation ◽ Structural Variants ◽ Sequencing Data ◽ Structural Variant ◽ FFPE Samples ◽ Generation Sequencing

Abstract Background Artifact chimeric reads are enriched in next-generation sequencing data generated from formalin-fixed paraffin-embedded (FFPE) samples. Previous work indicated that these reads are characterized by erroneous split-read support that is interpreted as evidence of structural variants. Thus, a large number of false-positive structural variants are detected. To our knowledge, no tool is currently available to specifically call or filter structural variants in FFPE samples. To overcome this gap, we developed 2 R packages: SimFFPE and FilterFFPE. Results SimFFPE is a read simulator, specifically designed for next-generation sequencing data from FFPE samples. A mixture of characteristic artifact chimeric reads, as well as normal reads, is generated. FilterFFPE is a filtration algorithm, removing artifact chimeric reads from sequencing data while keeping real chimeric reads. To evaluate the performance of FilterFFPE, we performed structural variant calling with 3 common tools (Delly, Lumpy, and Manta) with and without prior filtration with FilterFFPE. After applying FilterFFPE, the mean positive predictive value improved from 0.27 to 0.48 in simulated samples and from 0.11 to 0.27 in real samples, while sensitivity remained basically unchanged or even slightly increased. Conclusions FilterFFPE improves the performance of SV calling in FFPE samples. It was validated by analysis of simulated and real data.
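
For readers less familiar with the two metrics reported above, the following Python sketch shows how positive predictive value and sensitivity are computed from call counts. The counts in the example are hypothetical and were chosen only to illustrate how removing false positives raises the positive predictive value while sensitivity stays the same; they are not taken from the study.

```python
def positive_predictive_value(true_positives, false_positives):
    """Fraction of called variants that are real: TP / (TP + FP)."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no variants were called")
    return true_positives / called


def sensitivity(true_positives, false_negatives):
    """Fraction of real variants that were called: TP / (TP + FN)."""
    real = true_positives + false_negatives
    if real == 0:
        raise ValueError("no real variants in the truth set")
    return true_positives / real


# Hypothetical counts: filtering removes false positives only, so the
# number of true positives (and hence sensitivity) is unchanged.
before = {"tp": 27, "fp": 73, "fn": 3}
after = {"tp": 27, "fp": 29, "fn": 3}

for label, c in (("before", before), ("after", after)):
    ppv = positive_predictive_value(c["tp"], c["fp"])
    sens = sensitivity(c["tp"], c["fn"])
    print(f"{label}: PPV={ppv:.2f} sensitivity={sens:.2f}")
```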

Aquila_stLFR: assembly based variant calling package for stLFR and hybrid assembly for linked-reads

10.1101/742239 ◽ 2019 ◽ Cited By ~ 1

Author(s): Xin Zhou ◽ Lu Zhang ◽ Xiaodong Fang ◽ Yichen Liu ◽ David L. Dill ◽ ...

Keyword(s): De Novo ◽ Low Cost ◽ Variant Calling ◽ Hybrid Assembly ◽ Structural Variants ◽ Sequencing Data ◽ Single Tube ◽ Large Numbers ◽ Key Characteristics ◽ Hybrid Assemblies

Abstract Human diploid genome assembly enables identifying maternal and paternal genetic variations. Algorithms based on 10x linked-read sequencing have been developed for de novo assembly, variant calling and haplotyping. Another linked-read technology, single tube long fragment read (stLFR), has recently provided a low-cost single tube solution that can enable long fragment data. However, no existing software is available for human diploid assembly and variant calls. We develop Aquila stLFR to adapt to the key characteristics of stLFR. Aquila stLFR assembles near perfect diploid assembled contigs, and the assembly-based variant calling shows that Aquila stLFR detects large numbers of structural variants which were not easily spanned by Illumina short-reads. Furthermore, the hybrid assembly mode Aquila hybrid allows a hybrid assembly based on both stLFR and 10x linked-reads libraries, demonstrating that these two technologies can always be complementary to each other for assembly to improve contiguity and the variants detection, regardless of assembly quality of the library itself from single sequencing technology. The overlapped structural variants (SVs) from two independent sequencing data of the same individual, and the SVs from hybrid assemblies provide us a high-confidence profile to study them. Availability Source code and documentation are available on https://github.com/maiziex/Aquila_stLFR.

SvABA: Genome-wide detection of structural variants and indels by local assembly

10.1101/105080 ◽ 2017 ◽ Cited By ~ 9

Author(s): Jeremiah Wala ◽ Pratiti Bandopadhayay ◽ Noah Greenwald ◽ Ryan O’Rourke ◽ Ted Sharpe ◽ ...

Keyword(s): Variant Calling ◽ Accurate Method ◽ Structural Variants ◽ Sequencing Data ◽ Cancer Driver ◽ Insertion And Deletion ◽ Genome Wide ◽ Cancer Genomes ◽ Local Assembly ◽ Genomic Regions

Abstract Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at-scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA’s performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs, and substantially improved detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (< 1,000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types, and found that templated-sequence insertions occur in ~4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized SVs.

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403v2 ◽ 2014

Author(s): Charles D Warden ◽ Aaron W Adamson ◽ Susan L Neuhausen ◽ Xiwei Wu

Keyword(s): Gene List ◽ Variant Calling ◽ Detailed Comparison ◽ Nucleotide Polymorphisms ◽ Sequencing Data ◽ High Quality ◽ Genome Analysis Toolkit ◽ High Concordance ◽ The Impact ◽ Processing Steps

The Genome Analysis Toolkit (GATK) is often considered to be the “gold standard” for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an adequate comparison against VarScan may not yet have been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and quality score base recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.

Structural variant analysis for linked-read sequencing data with gemtools

Bioinformatics ◽ 10.1093/bioinformatics/btz239 ◽ 2019 ◽ Vol 35 (21) ◽ pp. 4397-4399 ◽ Cited By ~ 2

Author(s): S U Greer ◽ H P Ji

Keyword(s): Supplementary Information ◽ Supplementary Data ◽ Structural Variants ◽ Sequencing Data ◽ Structural Variant ◽ Single DNA Molecules ◽ Long Reads ◽ Depth Analysis ◽ Basic Functions ◽ Variant Analysis

Abstract Summary Linked-read sequencing generates synthetic long reads which are useful for the detection and analysis of structural variants (SVs). The software associated with 10× Genomics linked-read sequencing, Long Ranger, generates the essential output files (BAM, VCF, SV BEDPE) necessary for downstream analyses. However, performing downstream analyses requires users to customize their own tools to handle the unique features of linked-read sequencing data. Here, we describe gemtools, a collection of tools for the downstream and in-depth analysis of SVs from linked-read data. Gemtools uses the barcoded aligned reads and the megabase-scale phase blocks to determine haplotypes of SV breakpoints and delineate complex breakpoint configurations at the resolution of single DNA molecules. The gemtools package is a suite of tools that provides the user with the flexibility to perform basic functions on their linked-read sequencing output in order to address even more questions. Availability and implementation The gemtools package is freely available for download at: https://github.com/sgreer77/gemtools. Supplementary information Supplementary data are available at Bioinformatics online.

SomatoSim: precision simulation of somatic single nucleotide variants

BMC Bioinformatics ◽ 10.1186/s12859-021-04024-8 ◽ 2021 ◽ Vol 22 (1)

Author(s): Marwan A. Hawari ◽ Celine S. Hong ◽ Leslie G. Biesecker

Keyword(s): High Throughput Sequencing ◽ Variant Calling ◽ Simulated Data ◽ Sequencing Data ◽ Single Nucleotide Variants ◽ Single Nucleotide ◽ Somatic Variant ◽ Simulation Tools ◽ Gold Standard Dataset ◽ High Level

Abstract Background Somatic single nucleotide variants have gained increased attention because of their role in cancer development and the widespread use of high-throughput sequencing techniques. The necessity to accurately identify these variants in sequencing data has led to a proliferation of somatic variant calling tools. Additionally, the use of simulated data to assess the performance of these tools has become common practice, as there is no gold standard dataset for benchmarking performance. However, many existing somatic variant simulation tools are limited because they rely on generating entirely synthetic reads derived from a reference genome or because they do not allow for the precise customizability that would enable a more focused understanding of single nucleotide variant calling performance. Results SomatoSim is a tool that lets users simulate somatic single nucleotide variants in sequence alignment map (SAM/BAM) files with full control of the specific variant positions, number of variants, variant allele fractions, depth of coverage, read quality, and base quality, among other parameters. SomatoSim accomplishes this through a three-stage process: variant selection, where candidate positions are selected for simulation, variant simulation, where reads are selected and mutated, and variant evaluation, where SomatoSim summarizes the simulation results. Conclusions SomatoSim is a user-friendly tool that offers a high level of customizability for simulating somatic single nucleotide variants. SomatoSim is available at https://github.com/BieseckerLab/SomatoSim.

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403v3 ◽ 2014

Author(s): Charles D Warden ◽ Aaron W Adamson ◽ Susan L Neuhausen ◽ Xiwei Wu

Keyword(s): Gene List ◽ Variant Calling ◽ Detailed Comparison ◽ Nucleotide Polymorphisms ◽ Sequencing Data ◽ High Quality ◽ Genome Analysis Toolkit ◽ Comprehensive Comparison ◽ The Impact ◽ Processing Steps

The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an equally comprehensive comparison for VarScan has not yet been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and quality score base recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.

Effect of lossy compression of quality scores on variant calling

10.1101/029843 ◽ 2015 ◽ Cited By ~ 1

Author(s): Idoia Ochoa ◽ Mikel Hernaez ◽ Rachel Goldfeder ◽ Tsachy Weissman ◽ Euan Ashley

Keyword(s): DNA Sequencing ◽ Consensus Sequence ◽ Variant Calling ◽ Simulated Data ◽ Genomic Data ◽ Original Data ◽ Lossy Compression ◽ Sequencing Data ◽ Indel Detection ◽ The Cost

Recent advancements in sequencing technology have led to a drastic reduction in the cost of genome sequencing. This development has generated an unprecedented amount of genomic data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs from next-generation DNA sequencing data use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. We analyze several lossy compressors introduced recently in the literature. Specifically, we investigate how the output of the variant caller when using the original data (uncompressed) differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets such as the GIAB (Genome In A Bottle) consensus sequence for NA12878 and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the uncompressed data. Further, in some cases lossy compression can lead to variant calling performance which is superior to that using the uncompressed file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors. 
The Supplementary Data can be found at http://web.stanford.edu/~iochoa/supplementEffectLossy.zip.
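
As a rough illustration of what lossy compression of quality scores means in practice, the Python sketch below applies uniform binning to a Phred+33 quality string. Uniform binning is used here only as a simple stand-in; the abstract does not say which compressors were evaluated, and the function name, bin width and midpoint representative are assumptions made for this example.

```python
def quantize_qualities(quality_string, bin_width=10, offset=33):
    """Replace each Phred score by the midpoint of its fixed-width bin.

    `quality_string` is a FASTQ quality line using Phred+33 encoding.
    Fewer distinct symbols remain afterwards, which is what lets a
    general-purpose compressor store the line in less space.
    """
    quantized = []
    for symbol in quality_string:
        score = ord(symbol) - offset
        binned = (score // bin_width) * bin_width + bin_width // 2
        quantized.append(chr(binned + offset))
    return "".join(quantized)


original = "IIHG5432#"
lossy = quantize_qualities(original)
print(original, "->", lossy)
print(len(set(original)), "distinct symbols ->", len(set(lossy)))
```

The evaluation described in the abstract then amounts to running the same variant caller on the original and on the quantized qualities and comparing both call sets against a truth set.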

2-kupl: mapping-free variant detection from DNA-seq data of matched samples

BMC Bioinformatics ◽ 10.1186/s12859-021-04185-6 ◽ 2021 ◽ Vol 22 (1)

Author(s): Yunfeng Wang ◽ Haoliang Xue ◽ Christine Pourcel ◽ Yang Du ◽ Daniel Gautheret

Keyword(s): DNA Sequences ◽ Reference Genome ◽ Point Mutations ◽ Variant Calling ◽ Low Complexity ◽ Structural Variants ◽ Sequencing Data ◽ Bacterial Strains ◽ Two Samples ◽ Variant Detection

Abstract Background The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. Results We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. Conclusions We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
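
The general idea behind a k-mer based, mapping-free comparison of two samples can be sketched in a few lines of Python. This is not the 2-kupl algorithm itself, only a minimal illustration under stated assumptions: the function names, the value of `k` and the `min_count` threshold are hypothetical, and reverse complements and sequencing errors are ignored for brevity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(case_reads, control_reads, k=5, min_count=2):
    """Return k-mers seen at least `min_count` times in the case sample
    and never in the control sample; no reference genome is involved."""
    case = count_kmers(case_reads, k)
    control = count_kmers(control_reads, k)
    return {kmer for kmer, n in case.items()
            if n >= min_count and kmer not in control}


# Toy data: the case reads carry a single A>T substitution at position 5.
control_reads = ["ACGTACGTAC", "ACGTACGTAC"]
case_reads = ["ACGTTCGTAC", "ACGTTCGTAC"]
print(sorted(sample_specific_kmers(case_reads, control_reads)))
```

Every k-mer that overlaps the substituted base shows up as case-specific, so a single point mutation produces a run of up to k overlapping k-mers that can be grouped back into one candidate variant.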
