Handling of targeted amplicon sequencing data focusing on index hopping and demultiplexing using a nested metabarcoding approach in ecology

Yasemin Guenay-Greunke; David A. Bohan; Michael Traugott; Corinna Wallinger

doi:10.1038/s41598-021-98018-4

Scientific Reports ◽

10.1038/s41598-021-98018-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Yasemin Guenay-Greunke ◽

David A. Bohan ◽

Michael Traugott ◽

Corinna Wallinger

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Cost Effective ◽

Amplicon Sequencing ◽

Sequencing Depth ◽

Sequencing Error ◽

Sequencing Data ◽

Large Sample ◽

Sequencing Errors ◽

Plant Feeding

Abstract: High-throughput sequencing platforms are increasingly being used for targeted amplicon sequencing because they enable cost-effective sequencing of large sample sets. For meaningful interpretation of targeted amplicon sequencing data and comparison between studies, it is critical that bioinformatic analyses do not introduce artefacts and rely on detailed protocols to ensure that all methods are properly performed and documented. The analysis of large sample sets and the use of predefined indexes create challenges, such as adjusting the sequencing depth across samples and taking sequencing errors or index hopping into account. However, the potential biases these factors introduce to high-throughput amplicon sequencing data sets and how they may be overcome have rarely been addressed. Using the example of a nested metabarcoding analysis of 1920 carabid beetle regurgitates to assess plant feeding, we investigated: (i) the variation in sequencing depth of individually tagged samples and the effect of library preparation on the data output; (ii) the influence of sequencing errors within index regions and its consequences for demultiplexing; and (iii) the effect of index hopping. Our results demonstrate that despite library quantification, large variation in read counts and sequencing depth occurred among samples and that the sequencing error rate in bioinformatic software is essential for accurate adapter/primer trimming and demultiplexing. Moreover, setting an index hopping threshold to avoid incorrect assignment of samples is highly recommended.
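The two recommendations in this abstract (a tolerated error rate during demultiplexing and a read-count threshold against index hopping) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the sample names, the index sequences, the mismatch limit, and the fraction-of-total cutoff are all hypothetical choices made for illustration only.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length index sequences."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_index: str,
                  sample_indexes: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Assign a read to the single sample whose index lies within
    max_mismatches of the observed index; return None when there is
    no match or the match is ambiguous."""
    hits = [name for name, idx in sample_indexes.items()
            if hamming(read_index, idx) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


def apply_read_threshold(counts: Dict[str, int],
                         min_fraction: float) -> Dict[str, int]:
    """Zero out per-sample read counts that fall below min_fraction of the
    total reads (an assumed, simplified form of an index hopping threshold)."""
    cutoff = min_fraction * sum(counts.values())
    return {sample: (n if n >= cutoff else 0) for sample, n in counts.items()}


# Hypothetical indexes and counts, for illustration only.
samples = {"beetle_01": "ACGTACGT", "beetle_02": "TGCATGCA"}
exact = assign_sample("ACGTACGT", samples)
one_error = assign_sample("ACGTACGA", samples)
too_many = assign_sample("ACGTAAAA", samples)
filtered = apply_read_threshold({"beetle_01": 9990, "beetle_02": 10}, 0.01)
```

Allowing more mismatches rescues reads with sequencing errors in the index but raises the risk of ambiguous or wrong assignments, which is why the sketch refuses to assign a read that matches more than one sample.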

Synthetic Sequencing Standards: A Guide to Database Choice for Rumen Microbiota Amplicon Sequencing Analysis

Frontiers in Microbiology ◽

10.3389/fmicb.2020.606825 ◽

2020 ◽

Vol 11 ◽

Author(s):

Paul E. Smith ◽

Sinead M. Waters ◽

Ruth Gómez Expósito ◽

Hauke Smidt ◽

Ciara A. Carberry ◽

...

Keyword(s):

High Throughput Sequencing ◽

Cost Effective ◽

Amplicon Sequencing ◽

Gas Production ◽

Reference Database ◽

Specific Reference ◽

Sequencing Analysis ◽

Sequencing Data ◽

Rumen Microbiota ◽

Reference Databases

Our understanding of complex microbial communities, such as those residing in the rumen, has drastically advanced through the use of high throughput sequencing (HTS) technologies. Indeed, with the use of barcoded amplicon sequencing, it is now cost effective and computationally feasible to identify individual rumen microbial genera associated with ruminant livestock nutrition, genetics, performance and greenhouse gas production. However, across all disciplines of microbial ecology, there is currently little reporting of the use of internal controls for validating HTS results. Furthermore, there is little consensus on the most appropriate reference database for analyzing rumen microbiota amplicon sequencing data. Therefore, in this study, a synthetic rumen-specific sequencing standard was used to assess the effects of database choice on results obtained from rumen microbial amplicon sequencing. Four DADA2 reference training sets (RDP, SILVA, GTDB, and RefSeq + RDP) were compared to assess their ability to correctly classify sequences included in the rumen-specific sequencing standard. In addition, two thresholds of phylogenetic bootstrapping, 50 and 80, were applied to investigate the effect of increasing stringency. Sequence classification differences were apparent amongst the databases. For example, the classification of Clostridium differed between all databases, thus highlighting the need for a consistent approach to nomenclature amongst different reference databases. It is hoped that the effect of database on taxonomic classification observed in this study will encourage research groups across various microbial disciplines to develop and routinely use their own microbiome-specific reference standard to validate analysis pipelines and database choice.

Powerful Inference with the D-statistic on Low-Coverage Whole-Genome Data

10.1101/127852 ◽

2017 ◽

Cited By ~ 1

Author(s):

Samuele Soraggi ◽

Carsten Wiuf ◽

Anders Albrechtsen

Keyword(s):

Error Correction ◽

Genetic Relationship ◽

High Throughput Sequencing ◽

Sequencing Depth ◽

Human Populations ◽

Sequencing Data ◽

Sequencing Errors ◽

Genome Data ◽

High Throughput Sequencing Data ◽

External Population

Abstract: The detection of ancient gene flow between human populations is an important issue in population genetics. A common tool for detecting ancient admixture events is the D-statistic. The D-statistic is based on the hypothesis of a genetic relationship that involves four populations, whose correctness is assessed by evaluating specific coincidences of alleles between the groups. When working with high throughput sequencing data, calling genotypes accurately is not always possible; therefore, the D-statistic currently samples a single base from the reads of one individual per population. This implies ignoring much of the information in the data, an issue especially striking in the case of ancient genomes. We provide a significant improvement to overcome the problems of the D-statistic by considering all reads from multiple individuals in each population. We also apply type-specific error correction to combat the problems of sequencing errors and show a way to correct for introgression from an external population that is not part of the supposed genetic relationship, and how this leads to an estimate of the admixture rate. We prove that the D-statistic is approximated by a standard normal distribution. Furthermore, we show that our method outperforms the traditional D-statistic in detecting admixtures. The power gain is most pronounced for low/medium sequencing depth (1-10X), and performance is as good as with perfectly called genotypes at a sequencing depth of 2X. We show the reliability of error correction on scenarios with simulated errors and ancient data, and correct for introgression in known scenarios to estimate the admixture rates.
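For orientation, the traditional single-base form of the D-statistic mentioned in this abstract is commonly written as (nABBA - nBABA) / (nABBA + nBABA), where the counts are taken over sites with one sampled base per population. The Python sketch below illustrates only that classic form, not the authors' improved multi-read estimator; the function name and the toy sites are hypothetical.

```python
from typing import Iterable, Tuple


def d_statistic(sites: Iterable[Tuple[str, str, str, str]]) -> float:
    """Classic ABBA-BABA D-statistic.

    Each site is a tuple of one sampled base per population in the
    order (H1, H2, H3, outgroup). Sites that are neither ABBA nor
    BABA are uninformative and ignored.
    """
    abba = 0
    baba = 0
    for h1, h2, h3, out in sites:
        if h1 == h2:
            continue  # H1 and H2 agree, so the site cannot be ABBA or BABA
        if h1 == out and h2 == h3:
            abba += 1  # pattern A B B A
        elif h2 == out and h1 == h3:
            baba += 1  # pattern B A B A
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


# Toy data, for illustration only: two ABBA sites, one BABA site,
# and one uninformative site.
toy_sites = [
    ("A", "G", "G", "A"),
    ("A", "G", "G", "A"),
    ("G", "A", "G", "A"),
    ("A", "A", "A", "A"),
]
d_value = d_statistic(toy_sites)
```

A value near zero is consistent with the assumed four-population tree, while a marked excess of one pattern suggests gene flow; assessing significance requires a variance estimate, which this sketch omits.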

Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins

Frontiers in Microbiology ◽

10.3389/fmicb.2021.638561 ◽

2021 ◽

Vol 12 ◽

Author(s):

Harihara Subrahmaniam Muralidharan ◽

Nidhi Shah ◽

Jacquelyn S. Meisel ◽

Mihai Pop

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Mobile Elements ◽

Shotgun Sequencing ◽

Strain Level ◽

Level Variation ◽

Sequencing Data ◽

Sequencing Errors ◽

Complete Genomes

High-throughput sequencing has revolutionized the field of microbiology; however, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented, due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. To address the fragmented nature of metagenomic assemblies, scientists rely on a process called binning, which clusters together contigs inferred to originate from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here, we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs where contigs are nodes and edges are inferred based on paired-end reads. We developed a tool, Binnacle, that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. Binnacle also provides wrapper scripts to integrate with existing binning methods. The Binnacle pipeline can be found on GitHub (https://github.com/marbl/binnacle). We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins, and captures a broader set of the genes of the organisms being reconstructed.

Ehapp2: Estimate haplotype frequencies from pooled sequencing data with prior database information

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720016500177 ◽

2016 ◽

Vol 14 (04) ◽

pp. 1650017

Author(s):

Chang-Chang Cao ◽

Xiao Sun

Keyword(s):

Large Scale ◽

Linear Equations ◽

Cost Effective ◽

Relative Difference ◽

Sequencing Error ◽

Sequencing Data ◽

Sequencing Errors ◽

Pooled Sequencing ◽

Haplotype Frequencies ◽

The Cost

To reduce the cost of large-scale re-sequencing, multiple individuals are pooled together and sequenced, an approach called pooled sequencing. Pooled sequencing could provide a cost-effective alternative to sequencing individuals separately. To facilitate the application of pooled sequencing in haplotype-based disease association analysis, the critical procedure is to accurately estimate haplotype frequencies from pooled samples. Here we present Ehapp2 for estimating haplotype frequencies from pooled sequencing data by utilizing a database which provides prior information on known haplotypes. We first translate the problem of estimating the frequency of each haplotype into finding a sparse solution for a system of linear equations, where the NNREG algorithm is employed to achieve the solution. Simulation experiments reveal that Ehapp2 is robust to sequencing errors and able to estimate the frequencies of haplotypes with less than 3% average relative difference for pooled sequencing of a mixture of real Drosophila haplotypes with 50× total coverage, even when the sequencing error rate is as high as 0.05. Owing to the strategy that proportions for local haplotypes spanning multiple SNPs are accurately calculated first, Ehapp2 retains excellent estimation for recombinant haplotypes resulting from chromosomal crossover. Comparisons with present methods reveal that Ehapp2 is state-of-the-art for many sequencing study designs and more suitable for current massively parallel sequencing.

Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction

Briefings in Bioinformatics ◽

10.1093/bib/bbv029 ◽

2015 ◽

Vol 17 (1) ◽

pp. 154-179 ◽

Cited By ~ 139

Author(s):

David Laehnemann ◽

Arndt Borkhardt ◽

Alice Carolyn McHardy

Keyword(s):

High Throughput ◽

Deep Sequencing ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Sequencing Errors ◽

Deep Sequencing Data

Measuring genetic differentiation from Pool-seq data

10.1101/282400 ◽

2018 ◽

Cited By ~ 3

Author(s):

Valentin Hivert ◽

Raphaël Leblois ◽

Eric J. Petit ◽

Mathieu Gautier ◽

Renaud Vitalis

Keyword(s):

Genetic Differentiation ◽

High Throughput Sequencing ◽

Model Misspecification ◽

Cost Effective ◽

Sequencing Data ◽

Sequencing Errors ◽

Cost Effective Alternative ◽

Method Of Moments Estimator ◽

DNA Pool

Abstract: The recent advent of high throughput sequencing and genotyping technologies enables the comparison of patterns of polymorphisms at a very large number of markers. While the characterization of genetic structure from individual sequencing data remains expensive for many non-model species, it has been shown that sequencing pools of individual DNAs (Pool-seq) represents an attractive and cost-effective alternative. However, analyzing sequence read counts from a DNA pool instead of individual genotypes raises statistical challenges in deriving correct estimates of genetic differentiation. In this article, we provide a method-of-moments estimator of FST for Pool-seq data, based on an analysis-of-variance framework. We show, by means of simulations, that this new estimator is unbiased, and outperforms previously proposed estimators. We evaluate the robustness of our estimator to model misspecification, such as sequencing errors and uneven contributions of individual DNAs to the pools. Last, by reanalyzing published Pool-seq data of different ecotypes of the prickly sculpin Cottus asper, we show how the use of an unbiased FST estimator may question the interpretation of population structure inferred from previous analyses.

Linkage Disequilibrium Estimation in Low Coverage High-Throughput Sequencing Data

10.1101/235937 ◽

2017 ◽

Cited By ~ 1

Author(s):

Timothy P. Bilton ◽

John C. McEwan ◽

Shannon M. Clarke ◽

Rudiger Brauning ◽

Tracey C. van Stijn ◽

...

Keyword(s):

Linkage Disequilibrium ◽

High Throughput ◽

High Throughput Sequencing ◽

Cost Effective ◽

Likelihood Method ◽

Sequencing Data ◽

Diverse Range ◽

Pairwise Linkage Disequilibrium ◽

Large Populations ◽

Low Coverage

Abstract: High-throughput sequencing methods that multiplex a large number of individuals have provided a cost-effective approach for discovering genome-wide genetic variation in large populations. These sequencing methods are increasingly being utilized in population genetic studies across a diverse range of species. One side-effect of these methods, however, is that one or more alleles at a particular locus may not be sequenced, particularly when the sequencing depth is low, resulting in some heterozygous genotypes being called as homozygous. Under-called heterozygous genotypes have a profound effect on the estimation of linkage disequilibrium and, if not taken into account, lead to inaccurate estimates. We developed a new likelihood method, GUS-LD, to estimate pairwise linkage disequilibrium using low coverage sequencing data that accounts for under-called heterozygous genotypes. Our findings show that accurate estimates were obtained using GUS-LD on low coverage sequencing data, whereas underestimation of linkage disequilibrium results if no adjustment is made for under-called heterozygotes.
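The under-calling problem described in this abstract can be made concrete with a small calculation. The Python sketch below is not GUS-LD; it only assumes that each read at a heterozygous site samples one of the two alleles independently with probability 1/2 and that there are no sequencing errors. The function name is hypothetical.

```python
def prob_het_called_hom(depth: int) -> float:
    """Probability that a true heterozygote shows only one allele.

    With `depth` independent reads, all reads carry the same allele
    with probability 2 * (1/2)**depth = (1/2)**(depth - 1).
    Assumes equal allele sampling and no sequencing errors.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 0.5 ** (depth - 1)


# Under these assumptions a heterozygote is always under-called at depth 1,
# half the time at depth 2, and about 6% of the time at depth 5.
under_call_by_depth = {d: prob_het_called_hom(d) for d in (1, 2, 5)}
```

The rapid decay with depth is consistent with the abstract's point that the bias matters most for low coverage data.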

vcfView: An Extensible Data Visualization and Quality Assurance Platform for Integrated Somatic Variant Analysis

Cancer Informatics ◽

10.1177/1176935120972377 ◽

2020 ◽

Vol 19 ◽

pp. 117693512097237

Author(s):

Brian O’Sullivan ◽

Cathal Seoighe

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Somatic Mutations ◽

Driver Mutations ◽

Sequencing Data ◽

Data Set ◽

Therapeutic Implications ◽

Cancer Driver ◽

Sequencing Errors ◽

The Status

Motivation: Somatic mutations can have critical prognostic and therapeutic implications for cancer patients. Although targeted methods are often used to assay specific cancer driver mutations, high throughput sequencing is frequently applied to discover novel driver mutations and to determine the status of less-frequent driver mutations. The task of recovering somatic mutations from these data is nontrivial as somatic mutations must be distinguished from germline variants, sequencing errors, and other artefacts. Consequently, bioinformatics pipelines for recovery of somatic mutations from high throughput sequencing typically involve a large number of analytical choices in the form of quality filters. Results: We present vcfView, an interactive tool designed to support the evaluation of somatic mutation calls from cancer sequencing data. The tool takes as input a single variant call format (VCF) file and enables researchers to explore the impacts of analytical choices on the mutant allele frequency spectrum, on mutational signatures and on annotated somatic variants in genes of interest. It allows variants that have failed variant caller filters to be re-examined to improve sensitivity or guide the design of future experiments. It is extensible, allowing other algorithms to be incorporated easily. Availability: The Shiny application can be downloaded from GitHub (https://github.com/BrianOSullivanGit/vcfView). All data processing is performed within R to ensure platform independence. The app has been tested on RStudio, version 1.1.456, with base R 3.6.2 and Shiny 1.4.0. A vignette based on a publicly available data set is also available on GitHub.

Equivolumetric Protocol Generates Library Sizes Proportional to Total Microbial Load in 16S Amplicon Sequencing

Frontiers in Microbiology ◽

10.3389/fmicb.2021.638231 ◽

2021 ◽

Vol 12 ◽

Author(s):

Giuliano Netto Flores Cruz ◽

Ana Paula Christoff ◽

Luiz Felipe Valter de Oliveira

Keyword(s):

16S rRNA ◽

High Throughput ◽

High Throughput Sequencing ◽

Predictive Performance ◽

Amplicon Sequencing ◽

Microbial Load ◽

Cumulative Probability ◽

Sequencing Data ◽

Probability Models ◽

Order Of Magnitude

High-throughput sequencing of 16S rRNA amplicons has been extensively employed to perform microbiome characterization worldwide. As a culture-independent methodology, it has allowed high-level profiling of bacterial composition directly from samples. However, most studies are limited to information regarding relative bacterial abundances (sample proportions), ignoring scenarios in which sample microbe biomass can vary widely. Here, we use an equivolumetric protocol for 16S rRNA amplicon library preparation capable of generating Illumina sequencing data responsive to input DNA, recovering proportionality between observed read counts and absolute bacterial abundances within each sample. Under specified conditions, we show that the estimation of colony-forming units (CFU), the most common unit of bacterial abundance in classical microbiology, is challenged mostly by resolution and taxon-to-taxon variation. We propose Bayesian cumulative probability models to address such issues. Our results indicate that predictive errors vary consistently below one order of magnitude for total microbial load and abundance of observed bacteria. We also demonstrate our approach has the potential to generalize to previously unseen bacteria, but predictive performance is hampered by specific taxa of uncommon profile. Finally, it remains clear that high-throughput sequencing data are not inherently restricted to sample proportions only, and such technologies bear the potential to meet the working scales of traditional microbiology.

Improved Efficiency and Reliability of NGS Amplicon Sequencing Data Analysis for Genetic Diagnostic Procedures Using AGSA Software

BioMed Research International ◽

10.1155/2016/5623089 ◽

2016 ◽

Vol 2016 ◽

pp. 1-11 ◽

Cited By ~ 2

Author(s):

Axel Poulet ◽

Maud Privat ◽

Flora Ponelle ◽

Sandrine Viala ◽

Stephanie Decousus ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Low Cost ◽

Familial Risk ◽

454 Sequencing ◽

Diagnostic Procedures ◽

Amplicon Sequencing ◽

Sequencing Data ◽

Dideoxy Sequencing ◽

Ideal Situation

Screening for BRCA mutations in women with familial risk of breast or ovarian cancer is an ideal situation for high-throughput sequencing, providing large amounts of low cost data. However, 454 (Roche) and Ion Torrent (Thermo Fisher) technologies produce homopolymer-associated indel errors, complicating their use in routine diagnostics. We developed software, named AGSA, which helps to detect false positive mutations in homopolymeric sequences. Seventy-two familial breast cancer cases were analysed in parallel by amplicon 454 pyrosequencing and Sanger dideoxy sequencing for genetic variations of the BRCA genes. All 565 variants detected by dideoxy sequencing were also detected by pyrosequencing. Furthermore, pyrosequencing detected 42 variants that were missed with the Sanger technique. Six amplicons contained homopolymer tracts in the coding sequence that were systematically misread by the software supplied by Roche. Read data plotted as histograms by AGSA software aided the analysis considerably and allowed validation of the majority of homopolymers. As an optimisation, an additional 250 patients were analysed using microfluidic amplification of regions of interest (Access Array Fluidigm) of the BRCA genes, followed by 454 sequencing and AGSA analysis. AGSA complements a complete line of high-throughput diagnostic sequence analysis, reducing time and costs while increasing reliability, notably for homopolymer tracts.
