sequencing error
Recently Published Documents


TOTAL DOCUMENTS

160
(FIVE YEARS 76)

H-INDEX

16
(FIVE YEARS 3)

2022 ◽  
Author(s):  
Kumeren Nadaraj Govender ◽  
David W Eyre

Culture-independent metagenomic detection of microbial species has the potential to provide rapid and precise real-time diagnostic results. However, it is potentially limited by sequencing and classification errors. We use simulated and real-world data to benchmark rates of species misclassification using 100 reference genomes for each of ten common bloodstream pathogens and six frequent blood culture contaminants (n=1600). Simulating both with and without sequencing error for both the Illumina and Oxford Nanopore platforms, we evaluated commonly used classification tools including Kraken2, Bracken, and Centrifuge, utilising mini (8GB) and standard (30-50GB) databases. Bracken with the standard database performed best, the median percentage of reads across both sequencing platforms identified correctly to the species level was 98.46% (IQR 93.0:99.3) [range 57.1:100]. For Kraken2 with a mini database, a commonly used combination, median species-level identification was 79.3% (IQR 39.1:88.8) [range 11.2:100]. Classification performance varied by species, with E. coli being more challenging to classify correctly (59.4% to 96.4% reads with correct species, varying by tool used). By filtering out shorter Nanopore reads (<3500bp) we found performance similar or superior to Illumina sequencing, despite higher sequencing error rates. Misclassification was more common when the misclassified species had a higher average nucleotide identity to the true species. Our findings highlight taxonomic misclassification of sequencing data occurs and varies by sequencing and analysis workflow. This “bioinformatic contamination” should be accounted for in metagenomic pipelines to ensure accurate results that can support clinical decision making.


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Nadia M. Davidson ◽  
Ying Chen ◽  
Teresa Sadras ◽  
Georgina L. Ryland ◽  
Piers Blombery ◽  
...  

AbstractIn cancer, fusions are important diagnostic markers and targets for therapy. Long-read transcriptome sequencing allows the discovery of fusions with their full-length isoform structure. However, due to higher sequencing error rates, fusion finding algorithms designed for short reads do not work. Here we present JAFFAL, to identify fusions from long-read transcriptome sequencing. We validate JAFFAL using simulations, cell lines, and patient data from Nanopore and PacBio. We apply JAFFAL to single-cell data and find fusions spanning three genes demonstrating transcripts detected from complex rearrangements. JAFFAL is available at https://github.com/Oshlack/JAFFA/wiki.


Genes ◽  
2021 ◽  
Vol 12 (12) ◽  
pp. 1847
Author(s):  
Jie Xia ◽  
Lequn Wang ◽  
Guijun Zhang ◽  
ZuoChun Man ◽  
Luonan Chen

Rapid advances in single-cell genomics sequencing (SCGS) have allowed researchers to characterize tumor heterozygosity with unprecedented resolution and reveal the phylogenetic relationships between tumor cells or clones. However, high sequencing error rates of current SCGS data, i.e., false positives, false negatives, and missing bases, severely limit its application. Here, we present a deep learning framework, RDAClone, to recover genotype matrices from noisy data with an extended robust deep autoencoder, cluster cells into subclones by the Louvain-Jaccard method, and further infer evolutionary relationships between subclones by the minimum spanning tree. Studies on both simulated and real datasets demonstrate its robustness and superiority in data denoising, cell clustering, and evolutionary tree reconstruction, particularly for large datasets.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tao Jiang ◽  
Shiqi Liu ◽  
Shuqi Cao ◽  
Yadong Liu ◽  
Zhe Cui ◽  
...  

Abstract Background With the rapid development of long-read sequencing technologies, it is possible to reveal the full spectrum of genetic structural variation (SV). However, the expensive cost, finite read length and high sequencing error for long-read data greatly limit the widespread adoption of SV calling. Therefore, it is urgent to establish guidance concerning sequencing coverage, read length, and error rate to maintain high SV yields and to achieve the lowest cost simultaneously. Results In this study, we generated a full range of simulated error-prone long-read datasets containing various sequencing settings and comprehensively evaluated the performance of SV calling with state-of-the-art long-read SV detection methods. The benchmark results demonstrate that almost all SV callers perform better when the long-read data reach 20× coverage, 20 kbp average read length, and approximately 10–7.5% or below 1% error rates. Furthermore, high sequencing coverage is the most influential factor in promoting SV calling, while it also directly determines the expensive costs. Conclusions Based on the comprehensive evaluation results, we provide important guidelines for selecting long-read sequencing settings for efficient SV calling. We believe these recommended settings of long-read sequencing will have extraordinary guiding significance in cutting-edge genomic studies and clinical practices.


Animals ◽  
2021 ◽  
Vol 11 (11) ◽  
pp. 3186
Author(s):  
Eunkyung Choi ◽  
Sun Hee Kim ◽  
Seung Jae Lee ◽  
Euna Jo ◽  
Jinmu Kim ◽  
...  

Trematomus loennbergii Regan, 1913, is an evolutionarily important marine fish species distributed in the Antarctic Ocean. However, its genome has not been studied to date. In the present study, whole genome sequencing was performed using next-generation sequencing (NGS) technology to characterize its genome and develop genomic microsatellite markers. The 25-mer frequency distribution was estimated to be the best, and the genome size was predicted to be 815,042,992 bp. The heterozygosity, average rate of read duplication, and sequencing error rates were 0.536%, 0.724%, and 0.292%, respectively. These data were used to analyze microsatellite markers, and a total of 2,264,647 repeat motifs were identified. The most frequent repeat motif was di-nucleotide with 87.00% frequency, followed by tri-nucleotide (10.45%), tetra-nucleotide (1.94%), penta-nucleotide (0.34%), and hexa-nucleotide (0.27%). The AC repeat motif was the most abundant motif among di-nucleotides and among all repeat motifs. Among microsatellite markers, 181 markers were selected and PCR technology was used to validate several markers. A total of 15 markers produced only one band. In summary, these results provide a good basis for further studies, including evolutionary biology studies and population genetics of Antarctic fish species.


2021 ◽  
Vol 8 (Supplement_1) ◽  
pp. S497-S498
Author(s):  
Mohamad Sater ◽  
Remy Schwab ◽  
Ian Herriott ◽  
Tim Farrell ◽  
Miriam Huntley

Abstract Background Healthcare associated infections (HAIs) are a major contributor to patient morbidity and mortality worldwide. HAIs are increasingly important due to the rise of multidrug resistant pathogens which can lead to deadly nosocomial outbreaks. Current methods for investigating transmissions are slow, costly, or have poor detection resolution. A rapid, cost-effective and high-resolution method to identify transmission events is imperative to guide infection control. Whole genome sequencing of infecting pathogens paired with a single nucleotide polymorphism (SNP) analysis can provide high-resolution clonality determination, yet these methods typically have long turnaround times. Here we examined the utility of the Oxford Nanopore Technologies (ONT) platform, a rapid sequencing technology, for whole genome sequencing based transmission analysis. Methods We developed a SNP calling pipeline customized for ONT data, which exhibit higher sequencing error rates and can therefore be challenging for transmission analysis. The pipeline leverages the latest basecalling tools as well as a suite of custom variant calling and filtering algorithms to achieve highest accuracy in clonality calls compared to Illumina-based sequencing. We also capitalize on ONT long reads by assembling outbreak-specific genomes in order to overcome the need for an external reference genome. Results We examined 20 bacterial isolates from 5 HAI investigations previously performed at Day Zero Diagnostics as part of epiXact®, our commercialized Illumina-based HAI sequencing and analysis service. Using the ONT data and pipeline, we achieved greater than 90% SNP-calling sensitivity and precision, allowing 100% accuracy of clonality classification compared to Illumina-based results across common HAI species. We demonstrate the validity and increased resolution of our SNP analysis pipeline using assembled genomes from each outbreak. We also demonstrate that this ONT-based workflow can produce isolate to transmission determination (i.e. including WGS and analysis) in less than 24 hours. SNP calling performance ONT-based SNP calling sensitivity and precision compared to Illumina-based pipeline Conclusion We demonstrate the utility of ONT for HAI investigation, establishing the potential to transform healthcare epidemiology with same-day high-resolution transmission determination. Disclosures Mohamad Sater, PhD, Day Zero Diagnostics (Employee, Shareholder) Remy Schwab, MSc, Day Zero Diagnostics (Employee, Shareholder) Ian Herriott, BS, Day Zero Diagnostics (Employee, Shareholder) Tim Farrell, MS, Day Zero Diagnostics, Inc. (Employee, Shareholder) Miriam Huntley, PhD, Day Zero Diagnostics (Employee, Shareholder)


2021 ◽  
pp. 1085-1095
Author(s):  
Jonathan Bieler ◽  
Christian Pozzorini ◽  
Jessica Garcia ◽  
Alex C. Tuck ◽  
Morgane Macheret ◽  
...  

PURPOSE The ability of next-generation sequencing (NGS) assays to interrogate thousands of genomic loci has revolutionized genetic testing. However, translation to the clinic is impeded by false-negative results that pose a risk to patients. In response, regulatory bodies are calling for reliability measures to be reported alongside NGS results. Existing methods to estimate reliability do not account for sample- and position-specific variability, which can be significant. Here, we report an approach that computes reliability metrics for every genomic position and sample interrogated by an NGS assay. METHODS Our approach predicts the limit of detection (LOD), the lowest reliably detectable variant fraction, by taking technical factors into account. We initially explored how LOD is affected by input material amount, library conversion rate, sequencing coverage, and sequencing error rate. This revealed that LOD depends heavily on genomic context and sample properties. Using these insights, we developed a computational approach to predict LOD on the basis of a biophysical model of the NGS workflow. We focused on targeted assays for cell-free DNA, but, in principle, this approach applies to any NGS assay. RESULTS We validated our approach by showing that it accurately predicts LOD and distinguishes reliable from unreliable results when screening 580 lung cancer samples for actionable mutations. Compared with a standard variant calling workflow, our approach avoided most false negatives and improved interassay concordance from 94% to 99%. CONCLUSION Our approach, which we name LAVA (LOD-aware variant analysis), reports the LOD for every position and sample interrogated by an NGS assay. This enables reliable results to be identified and improves the transparency and safety of genetic tests.


PLoS ONE ◽  
2021 ◽  
Vol 16 (10) ◽  
pp. e0257521
Author(s):  
Clara Delahaye ◽  
Jacques Nicolas

Oxford Nanopore Technologies’ (ONT) long read sequencers offer access to longer DNA fragments than previous sequencer generations, at the cost of a higher error rate. While many papers have studied read correction methods, few have addressed the detailed characterization of observed errors, a task complicated by frequent changes in chemistry and software in ONT technology. The MinION sequencer is now more stable and this paper proposes an up-to-date view of its error landscape, using the most mature flowcell and basecaller. We studied Nanopore sequencing error biases on both bacterial and human DNA reads. We found that, although Nanopore sequencing is expected not to suffer from GC bias, it is a crucial parameter with respect to errors. In particular, low-GC reads have fewer errors than high-GC reads (about 6% and 8% respectively). The error profile for homopolymeric regions or regions with short repeats, the source of about half of all sequencing errors, also depends on the GC rate and mainly shows deletions, although there are some reads with long insertions. Another interesting finding is that the quality measure, although over-estimated, offers valuable information to predict the error rate as well as the abundance of reads. We supplemented this study with an analysis of a rapeseed RNA read set and shown a higher level of errors with a higher level of deletion in these data. Finally, we have implemented an open source pipeline for long-term monitoring of the error profile, which enables users to easily compute various analysis presented in this work, including for future developments of the sequencing device. Overall, we hope this work will provide a basis for the design of better error-correction methods.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yasemin Guenay-Greunke ◽  
David A. Bohan ◽  
Michael Traugott ◽  
Corinna Wallinger

AbstractHigh-throughput sequencing platforms are increasingly being used for targeted amplicon sequencing because they enable cost-effective sequencing of large sample sets. For meaningful interpretation of targeted amplicon sequencing data and comparison between studies, it is critical that bioinformatic analyses do not introduce artefacts and rely on detailed protocols to ensure that all methods are properly performed and documented. The analysis of large sample sets and the use of predefined indexes create challenges, such as adjusting the sequencing depth across samples and taking sequencing errors or index hopping into account. However, the potential biases these factors introduce to high-throughput amplicon sequencing data sets and how they may be overcome have rarely been addressed. On the example of a nested metabarcoding analysis of 1920 carabid beetle regurgitates to assess plant feeding, we investigated: (i) the variation in sequencing depth of individually tagged samples and the effect of library preparation on the data output; (ii) the influence of sequencing errors within index regions and its consequences for demultiplexing; and (iii) the effect of index hopping. Our results demonstrate that despite library quantification, large variation in read counts and sequencing depth occurred among samples and that the sequencing error rate in bioinformatic software is essential for accurate adapter/primer trimming and demultiplexing. Moreover, setting an index hopping threshold to avoid incorrect assignment of samples is highly recommended.


Cells ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 2577
Author(s):  
Imogen A. Wright ◽  
Kayla E. Delaney ◽  
Mary Grace K. Katusiime ◽  
Johannes C. Botha ◽  
Susan Engelbrecht ◽  
...  

HIV-1 proviral single-genome sequencing by limiting-dilution polymerase chain reaction (PCR) amplification is important for differentiating the sequence-intact from defective proviruses that persist during antiretroviral therapy (ART). Intact proviruses may rebound if ART is interrupted and are the barrier to an HIV cure. Oxford Nanopore Technologies (ONT) sequencing offers a promising, cost-effective approach to the sequencing of long amplicons such as near full-length HIV-1 proviruses, but the high diversity of HIV-1 and the ONT sequencing error render analysis of the generated data difficult. NanoHIV is a new tool that uses an iterative consensus generation approach to construct accurate, near full-length HIV-1 proviral single-genome sequences from ONT data. To validate the approach, single-genome sequences generated using NanoHIV consensus building were compared to Illumina® consensus building of the same nine single-genome near full-length amplicons and an average agreement of 99.4% was found between the two sequencing approaches.


Sign in / Sign up

Export Citation Format

Share Document