Detailed comparison of two popular variant calling packages for exome and targeted exon studies

The Genome Analysis Toolkit (GATK) is often considered to be the “gold standard” for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an adequate comparison against VarScan may not yet have been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and base quality score recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.
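
The two headline figures in this abstract (concordance with GATK and the share of GATK SNPs recovered) can be illustrated with a small set-based sketch. The abstract does not give the exact definitions used in the study, so the formulas below are assumptions: concordance is taken as the fraction of one caller's variants that also appear in the other list, and recovery as the fraction of the reference list that is found. The names `Variant`, `concordance`, and `recovery` are hypothetical, and the variants are toy data.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """Minimal variant key (hypothetical): one allele at one position."""
    chrom: str
    pos: int
    ref: str
    alt: str


def concordance(calls: Set[Variant], reference_calls: Set[Variant]) -> float:
    """Assumed definition: fraction of `calls` also present in `reference_calls`."""
    if not calls:
        return 0.0
    return len(calls & reference_calls) / len(calls)


def recovery(calls: Set[Variant], reference_calls: Set[Variant]) -> float:
    """Assumed definition: fraction of `reference_calls` found in `calls`."""
    if not reference_calls:
        return 0.0
    return len(calls & reference_calls) / len(reference_calls)


# Toy data only; these are not variants from the study.
gatk_high_quality = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 400, "G", "A"),
    Variant("chr2", 900, "T", "C"),
}
conservative_calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 400, "G", "A"),
}

print(concordance(conservative_calls, gatk_high_quality))  # 1.0
print(recovery(conservative_calls, gatk_high_quality))     # 0.75
```

In this toy case every conservative call is confirmed by the reference list, but one reference variant is missed, which mirrors the trade-off between concordance and sensitivity described in the abstract.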

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403v1 ◽

2014 ◽

Author(s):

Charles D Warden ◽

Aaron W Adamson ◽

Susan L Neuhausen ◽

Xiwei Wu

Keyword(s):

Gene List ◽

Variant Calling ◽

Detailed Comparison ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

High Quality ◽

Genome Analysis Toolkit ◽

High Concordance ◽

The Impact ◽

Processing Steps

The Genome Analysis Toolkit (GATK) is often considered to be the “gold standard” for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but we felt that an adequate comparison against VarScan had not yet been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and base quality score recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403v3 ◽

2014 ◽

Author(s):

Charles D Warden ◽

Aaron W Adamson ◽

Susan L Neuhausen ◽

Xiwei Wu

Keyword(s):

Gene List ◽

Variant Calling ◽

Detailed Comparison ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

High Quality ◽

Genome Analysis Toolkit ◽

Comprehensive Comparison ◽

The Impact ◽

Processing Steps

The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an equally comprehensive comparison for VarScan has not yet been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and base quality score recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403 ◽

2014 ◽

Author(s):

Charles D Warden ◽

Aaron W Adamson ◽

Susan L Neuhausen ◽

Xiwei Wu

Keyword(s):

Gene List ◽

Variant Calling ◽

Detailed Comparison ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

High Quality ◽

Genome Analysis Toolkit ◽

Comprehensive Comparison ◽

The Impact ◽

Processing Steps

LEVIATHAN: efficient discovery of large structural variants by leveraging long-range information from Linked-Reads data

10.1101/2021.03.25.437002 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Fabrice Legeai ◽

Claire Lemaitre

Keyword(s):

Long Range ◽

Model Organism ◽

Variant Calling ◽

Simulated Data ◽

Structural Variants ◽

Sequencing Data ◽

High Quality ◽

Short Reads ◽

Structural Variant ◽

Human Data

Linked-Reads technologies, popularized by 10x Genomics, combine the high quality and low cost of short-read sequencing with long-range information by adding barcodes that tag reads originating from the same long DNA fragment. Thanks to their high quality and long-range information, such reads are particularly useful for various applications such as genome scaffolding and structural variant calling. As a result, multiple structural variant calling methods were developed within the last few years. However, these methods were mainly tested on human data, and do not run well on non-human organisms, for which reference genomes are highly fragmented, or sequencing data display high levels of heterozygosity. Moreover, even on human data, most tools still require large amounts of computing resources. We present LEVIATHAN, a new structural variant calling tool that aims to address these issues, and in particular to scale better and apply to a wide variety of organisms. Our method relies on a barcode index that allows quick comparison of all possible pairs of regions in terms of the number of barcodes they share. Region pairs sharing a sufficient number of barcodes are then considered as potential structural variants, and complementary, classical short-read methods are applied to further refine the breakpoint coordinates. Our experiments on simulated data underline that our method compares well to the state-of-the-art, both in terms of recall and precision, and also in terms of resource consumption. Moreover, LEVIATHAN was successfully applied to a real dataset from a non-model organism, while all other tools either failed to run or required unreasonable amounts of resources. LEVIATHAN is implemented in C++, supported on Linux platforms, and available under AGPL-3.0 License at https://github.com/morispi/LEVIATHAN.
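
The barcode-sharing idea described above can be sketched in a few lines. This is not the LEVIATHAN implementation (which is written in C++); it is a minimal illustration that assumes each region is already associated with the set of barcodes of the reads mapped to it. The function names, the region labels, and the `min_shared` threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_barcode_index(region_barcodes):
    """Invert {region: set of barcodes} into {barcode: set of regions}."""
    index = defaultdict(set)
    for region, barcodes in region_barcodes.items():
        for barcode in barcodes:
            index[barcode].add(region)
    return index


def candidate_region_pairs(region_barcodes, min_shared):
    """Return region pairs sharing at least `min_shared` barcodes.

    Only regions that co-occur under some barcode are ever compared,
    so pairs with no barcode in common cost nothing.
    """
    index = build_barcode_index(region_barcodes)
    shared = defaultdict(int)
    for regions in index.values():
        for pair in combinations(sorted(regions), 2):
            shared[pair] += 1
    return {pair: count for pair, count in shared.items() if count >= min_shared}


# Toy data only: three regions with made-up barcodes.
regions = {
    "chr1:0-1000": {"AAA", "CCC", "GGG"},
    "chr1:50000-51000": {"AAA", "CCC", "TTT"},
    "chr2:0-1000": {"TTT"},
}

print(candidate_region_pairs(regions, min_shared=2))
# {('chr1:0-1000', 'chr1:50000-51000'): 2}
```

In a real tool the surviving pairs would only be candidates; as the abstract notes, breakpoints are then refined with classical short-read methods.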

DNA from dried blood spots yields high quality sequences for exome analysis

10.1101/2020.05.19.105304 ◽

2020 ◽

Cited By ~ 1

Author(s):

Uma Sunderam ◽

Aashish N. Adhikari ◽

Kunal Kundu ◽

Jennifer M. Puck ◽

Robert Currier ◽

...

Keyword(s):

Dna Damage ◽

Reference Genome ◽

Variant Calling ◽

Dried Blood Spots ◽

Sequencing Data ◽

High Quality ◽

Potential Health ◽

Screening Programs ◽

Blood Spots ◽

Variant Discovery

Abstract. Background: DNA sequencing of archived dried blood spots (DBS) collected by newborn screening programs constitutes a potential health resource to study newborn disorders and understand genotype-phenotype relationships. However, it is essential to verify that sequencing reads from DBS-derived DNA are suitable for variant discovery. Results: We explored 16 metrics to comprehensively assess the quality of sequencing reads from 180 DBS and 35 whole blood (WB) samples. These metrics were used to assess a) mapping of reads to the reference genome, b) degree of DNA damage, and c) variant calling. Reads from both sets mapped with similar efficiencies, had similar overall DNA damage rates, measured by the mismatch rate with the reference genome, and produced variant call sets with similar transition-transversion ratios. While evaluating single nucleotide changes that may have arisen from DNA damage, we observed that the A>T and T>A changes were more frequent in DNA from DBS than from WB. However, this did not affect the accuracy of variant calling, with DBS samples yielding a comparable count of high quality SNVs and indels in samples with at least 50x coverage. Conclusions: Overall, DBS DNA provided exome sequencing data of sufficient quality for clinical interpretation.
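
One of the variant-calling metrics mentioned above, the transition-transversion (Ti/Tv) ratio, is simple to compute from a list of single nucleotide changes. The sketch below is an illustration only, not the authors' pipeline; it assumes each SNV is given as a `(ref, alt)` pair of uppercase bases with `ref != alt`, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snvs):
    """Ratio of transitions to transversions for (ref, alt) pairs.

    Assumes every pair is a valid SNV with ref != alt.
    """
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions: Ti/Tv ratio is undefined")
    return transitions / transversions


# Toy data only: three transitions (A>G, C>T, G>A), two transversions (A>T, T>A).
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "T"), ("T", "A")]
print(titv_ratio(toy_snvs))  # 1.5
```

The A>T and T>A changes singled out in the abstract are both transversions, so an excess of them would tend to lower this ratio.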

Bayesian network analysis of plasma microRNA sequencing data in patients with venous thrombosis

European Heart Journal Supplements ◽

10.1093/eurheartj/suaa008 ◽

2020 ◽

Vol 22 (Supplement_C) ◽

pp. C34-C45 ◽

Cited By ~ 3

Author(s):

Florian Thibord ◽

Gaëlle Munsch ◽

Claire Perret ◽

Pierre Suchon ◽

Maguelonne Roux ◽

...

Keyword(s):

Venous Thrombosis ◽

Association Studies ◽

Statistical Significance ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

New Associations ◽

Significance Threshold ◽

The Impact ◽

Plasma Mirna

Abstract: MicroRNAs (miRNAs) are small regulatory RNAs that participate in several biological processes and are known to be involved in various pathologies. Measurable in body fluids, miRNAs have been proposed to serve as efficient biomarkers for diseases and/or associated traits. Here, we performed a next-generation-sequencing based profiling of plasma miRNAs in 344 patients with venous thrombosis (VT) and assessed the association of plasma miRNA levels with several haemostatic traits and the risk of VT recurrence. Among the most significant findings, we detected an association between hsa-miR-199b-3p and haematocrit levels (P = 0.0016), these two markers having both been independently reported to associate with VT risk. We also observed suggestive evidence for association of hsa-miR-370-3p (P = 0.019), hsa-miR-27b-3p (P = 0.016) and hsa-miR-222-3p (P = 0.049) with VT recurrence, the observations for the latter two miRNAs confirming the recent findings of Wang et al. In addition, by conducting Genome-Wide Association Studies on miRNA levels and meta-analyzing our results with publicly available ones, we identified 21 new associations of single nucleotide polymorphisms with plasma miRNA levels at the statistical significance threshold of P < 5 × 10⁻⁸, some of these associations pertaining to thrombosis-associated mechanisms. In conclusion, this study provides novel data about the impact of miRNA variability in haemostasis and new arguments supporting the association of a few miRNAs with the risk of recurrence in patients with venous thrombosis.

Implications of the Novel Mutations in the SARS-CoV-2 Genome for Transmission, Disease Severity, and the Vaccine Development

Frontiers in Medicine ◽

10.3389/fmed.2021.636532 ◽

2021 ◽

Vol 8 ◽

Author(s):

Hikmet Akkiz

Keyword(s):

Amino Acid ◽

Disease Severity ◽

Vaccine Development ◽

Experimental Studies ◽

Spike Protein ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Novel Mutations ◽

Single Nucleotide ◽

The Impact

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative virus of the coronavirus disease 2019 (COVID-19), was identified in China in late December 2019. SARS-CoV-2 is an enveloped, positive-sense, single-stranded RNA betacoronavirus of the Coronaviridae family. Coronaviruses have a genetic proofreading mechanism that corrects copying mistakes, and thus SARS-CoV-2 genetic diversity is extremely low. Despite the lower mutation rate of the virus, researchers have detected a total of 12,706 mutations in the SARS-CoV-2 genome, the majority of which were single nucleotide polymorphisms. Sequencing data revealed that SARS-CoV-2 accumulates two single-nucleotide mutations per month in its genome. Recently, an aspartate (D) to glycine (G) substitution at amino-acid position 614 of the spike protein (D614G), caused by an adenine-to-guanine change at nucleotide position 23,403 of the original reference genotype, has been identified. The SARS-CoV-2 viruses that carry the spike protein D614G mutation have become the dominant variant around the world. The D614G mutation has been found to be associated with 3 other mutations in the spike protein. Clinical and pseudovirus experimental studies have demonstrated that the spike protein D614G mutation alters the virus phenotype. However, the impact of the mutation on the rate of transmission between people, disease severity, and vaccine and therapeutic development remains unclear. Three variants of SARS-CoV-2 have recently been identified: the B.1.1.7 (UK) variant, the B.1.351 (N501Y.V2, South African) variant and the B.1.1.28 (Brazilian) variant. Epidemiological data suggest that they have a higher transmissibility than the original variant. There are reports that some vaccines are less efficacious against the B.1.351 variant. This review article discusses the effects of novel mutations in the SARS-CoV-2 genome on transmission, clinical outcomes and vaccine development.

Localised community circulation of SARS-CoV-2 viruses with an increased accumulation of single nucleotide polymorphisms that adversely affect the sensitivity of real-time reverse transcription assays targeting Nucleocapsid protein

10.1101/2021.03.22.21254006 ◽

2021 ◽

Cited By ~ 1

Author(s):

Catherine Moore ◽

Louise Davies ◽

Rhiannydd Rees ◽

Laura Gifford ◽

Heather Lewis ◽

...

Keyword(s):

Nucleocapsid Protein ◽

Quality Management System ◽

Data Availability ◽

N Gene ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Gene Target ◽

Additional Mutation ◽

Target Sites ◽

The Impact

Summary: Currently the primary method for confirming acute SARS-CoV-2 infection is through the use of molecular assays that target highly conserved regions within the viral genome. Many, if not most, of the diagnostic targets currently in use were produced early in the pandemic, using genomes sequenced and shared in early 2020. As viral diversity increases, mutations may arise in diagnostic target sites that have an impact on the performance of diagnostic tests. Here, we report on a local outbreak of SARS-CoV-2 which had gained an additional mutation at position 28890 of the nucleocapsid protein, on a background of pre-existing mutations at positions 28881, 28882, 28883 in one of the main circulating viral lineages in Wales at that time. This additional mutation had a statistically significant impact on the Ct value reported for the N gene target designed by the Chinese CDC and used in a number of commercial diagnostic products. Further investigation identified that, in viral genomes sequenced from Wales over the summer of 2020, the N gene had a higher rate of mutations in diagnostic target sites than other targets, with 115 issues identified affecting over 10% of all cases sequenced between February and the end of August 2020. In comparison, an issue was identified for ORFab, the next most affected target, in less than 1.4% of cases over the same time period. This work emphasises the potential impact that mutations in diagnostic target sites can have on tracking local outbreaks, as well as demonstrating the value of genomics as a routine tool for identifying and explaining potential diagnostic primer issues as part of a laboratory quality management system. This work also indicates that with increasing genomic sequencing data availability, there is a need to re-evaluate the diagnostic targets that are in use for SARS-CoV-2 testing, to better target regions that are now demonstrated to be of lower variability.

Next-generation Sequence-analysis Toolkit (NeST): A standardized bioinformatics framework for analyzing Single Nucleotide Polymorphisms in next-generation sequencing data

10.1101/323535 ◽

2018 ◽

Author(s):

Shashidhar Ravishankar ◽

Sarah E. Schmedes ◽

Dhruviben S. Patel ◽

Mateusz Plucinski ◽

Venkatachalam Udhayakumar ◽

...

Keyword(s):

Next Generation Sequencing ◽

Single Nucleotide Polymorphisms ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Nucleotide Polymorphisms ◽

Next Generation ◽

Sequencing Data ◽

Single Nucleotide ◽

Bioinformatics Tools ◽

Generation Sequencing

Abstract: Rapid advancements in next-generation sequencing (NGS) technologies have led to the development of numerous bioinformatics tools and pipelines. As these tools vary in their output function and complexity and some are not well-standardized, it is difficult to choose a suitable pipeline to identify variants in NGS data. Here, we present NeST (NGS-analysis Toolkit), a modular consensus-based variant calling framework. NeST uses a combination of variant callers to overcome potential biases of an individual method used alone. NeST consists of four modules that integrate open-source bioinformatics tools, a custom Variant Call Format (VCF) parser and a summarization utility to generate high-quality consensus variant calls. NeST was validated using targeted-amplicon deep sequencing data from 245 Plasmodium falciparum isolates to identify single-nucleotide polymorphisms conferring drug resistance. The results were verified using Sanger sequencing data for the same dataset in a supporting publication [28]. NeST offers a user-friendly pipeline for variant calling with standardized outputs and minimal computational demands for easy deployment for use with various organisms and applications.
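
The consensus idea can be illustrated with a short sketch. The abstract does not state which consensus rule NeST applies, so the rule below (keep a variant if at least `min_callers` callers report it) is an assumption, and `consensus_calls`, the caller labels, and the variant tuples are hypothetical toy values rather than NeST's API or data.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to an iterable of hashable
    variant keys, here (chrom, pos, ref, alt) tuples.
    """
    support = Counter()
    for calls in calls_by_caller.values():
        # set() ensures a caller is counted once per variant.
        support.update(set(calls))
    return {variant for variant, n in support.items() if n >= min_callers}


# Toy data only: three hypothetical callers.
toy_calls = {
    "caller_a": [("chr7", 100, "A", "G"), ("chr7", 200, "C", "T")],
    "caller_b": [("chr7", 100, "A", "G")],
    "caller_c": [("chr7", 100, "A", "G"), ("chr7", 300, "G", "A")],
}

print(consensus_calls(toy_calls, min_callers=2))
# {('chr7', 100, 'A', 'G')}
```

Raising `min_callers` trades sensitivity for specificity: variants seen by a single caller are dropped, which is the kind of per-method bias a consensus approach aims to reduce.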

Low-frequency variant calling from high-quality mtDNA sequencing data v1 (protocols.io.nfkdbkw)

protocols.io ◽

10.17504/protocols.io.nfkdbkw ◽

2018 ◽

Cited By ~ 2

Author(s):

Marita A ◽

James Stewart

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Sequencing Data ◽

High Quality
