Benchmarking hybrid assembly approaches for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing

Zhao Chen; David L. Erickson; Jianghong Meng

doi:10.1186/s12864-020-07041-8

BMC Genomics ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Zhao Chen ◽

David L. Erickson ◽

Jianghong Meng

Keyword(s):

Virulence Genes ◽

Reference Genome ◽

Bacterial Pathogens ◽

Nanopore Sequencing ◽

Hybrid Assembly ◽

Pan Genome ◽

Long Reads ◽

Oxford Nanopore ◽

Hybrid Assemblies ◽

Reference Genomes

Abstract. Background: We benchmarked the hybrid assembly approaches of MaSuRCA, SPAdes, and Unicycler for bacterial pathogens using Illumina and Oxford Nanopore sequencing by determining genome completeness and accuracy, antimicrobial resistance (AMR), virulence potential, multilocus sequence typing (MLST), phylogeny, and pan genome. Ten bacterial species (10 strains) were tested with simulated reads of both mediocre and low quality, whereas 11 bacterial species (12 strains) were tested with real reads. Results: Unicycler performed best at achieving contiguous genomes, closely followed by MaSuRCA, while all SPAdes assemblies were incomplete. MaSuRCA was less tolerant of low-quality long reads than SPAdes and Unicycler. The hybrid assemblies of five antimicrobial-resistant strains with simulated reads provided AMR genotypes consistent with the reference genomes. The MaSuRCA assembly of Staphylococcus aureus with real reads contained msr(A) and tet(K), while the reference genome and the SPAdes and Unicycler assemblies harbored blaZ. The AMR genotypes of the reference genomes and hybrid assemblies were consistent for the other five antimicrobial-resistant strains with real reads. The numbers of virulence genes in all hybrid assemblies were similar to those of the reference genomes, irrespective of simulated or real reads. The one exception was that the reference genome and hybrid assemblies of Pseudomonas aeruginosa with mediocre-quality long reads carried 241 virulence genes, whereas 184 virulence genes were identified in the hybrid assemblies of low-quality long reads. The MaSuRCA assemblies of Escherichia coli O157:H7 and Salmonella Typhimurium with mediocre-quality long reads contained 126 and 118 virulence genes, respectively, while 110 and 107 virulence genes were detected in their MaSuRCA assemblies of low-quality long reads, respectively. All approaches performed well in our MLST and phylogenetic analyses. The pan genomes of the hybrid assemblies of S. Typhimurium with mediocre-quality long reads were similar to that of the reference genome, while SPAdes and Unicycler were more tolerant of low-quality long reads than MaSuRCA for the pan-genome analysis. All approaches functioned well in the pan-genome analysis of Campylobacter jejuni with real reads. Conclusions: Our research demonstrates the hybrid assembly pipeline of Unicycler as a superior approach for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing.

Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing

International Journal of Molecular Sciences ◽

10.3390/ijms21239161 ◽

2020 ◽

Vol 21 (23) ◽

pp. 9161

Author(s):

Zhao Chen ◽

David L. Erickson ◽

Jianghong Meng

Keyword(s):

Virulence Genes ◽

Bacterial Pathogens ◽

Error Rates ◽

Nanopore Sequencing ◽

Long Reads ◽

Oxford Nanopore ◽

Genomic Analyses ◽

Long Read ◽

Genome Analyses ◽

Assembly Algorithms

Oxford Nanopore sequencing can be used to achieve complete bacterial genomes. However, the error rates of Oxford Nanopore long reads are higher than those of Illumina short reads. Long-read assemblers using a variety of assembly algorithms have been developed to overcome this deficiency, but they have not been benchmarked for genomic analyses of bacterial pathogens using Oxford Nanopore long reads. In this study, the long-read assemblers Canu, Flye, Miniasm/Racon, Raven, Redbean, and Shasta were therefore benchmarked using Oxford Nanopore long reads of bacterial pathogens. Ten species were tested with mediocre- and low-quality simulated reads, and 10 species were tested with real reads. Raven was the most robust assembler, obtaining complete and accurate genomes. All Miniasm/Racon and Raven assemblies of mediocre-quality reads provided accurate antimicrobial resistance (AMR) profiles, while the Raven assembly of Klebsiella variicola with low-quality reads was the only assembly with an accurate AMR profile among all assemblers and species. All assemblers functioned well for predicting virulence genes using mediocre-quality and real reads, whereas only the Raven assemblies of low-quality reads had accurate numbers of virulence genes. Regarding multilocus sequence typing (MLST), Miniasm/Racon was the most effective assembler for mediocre-quality reads, while only the Raven assemblies of Escherichia coli O157:H7 and K. variicola with low-quality reads showed positive MLST results. Miniasm/Racon and Raven were the best performers for MLST using real reads. The Miniasm/Racon and Raven assemblies showed accurate phylogenetic inference. For the pan-genome analyses, Raven was the strongest assembler for simulated reads, whereas Miniasm/Racon and Raven performed the best for real reads. Overall, the most robust and accurate assembler was Raven, closely followed by Miniasm/Racon.

Optimised use of Oxford Nanopore Flowcells for Hybrid Assemblies

10.1101/2020.03.05.979278 ◽

2020 ◽

Cited By ~ 1

Author(s):

Samuel Lipworth ◽

Hayleah Pickford ◽

Nicholas Sanderson ◽

Kevin K Chau ◽

James Kavanagh ◽

...

Keyword(s):

Large Scale ◽

Total Output ◽

Hybrid Assembly ◽

Human Pathogens ◽

Sequencing Data ◽

Short Read ◽

Long Reads ◽

Oxford Nanopore ◽

The Cost ◽

Hybrid Assemblies

Abstract: Hybrid assemblies are highly valuable for studies of Enterobacteriaceae due to their ability to fully resolve the structure of mobile genetic elements, such as plasmids, which are involved in the carriage of clinically important genes (e.g. those involved in AMR/virulence). The widespread application of this technique is currently primarily limited by cost. Recent data has suggested that non-inferior, and even superior, hybrid assemblies can be produced using a fraction of the total output from a multiplexed nanopore (Oxford Nanopore Technologies [ONT]) flowcell run. In this study we sought to determine the optimal minimal running time for flowcells when acquiring reads for hybrid assembly. We then evaluated whether the ONT wash kit might allow users to exploit shorter running times by sequencing multiple libraries per flowcell. After 24 hours of sequencing, most chromosomes and plasmids had circularised and there was no benefit associated with longer running times. Quality was similar at 12 hours, suggesting shorter running times are likely to be acceptable for certain applications (e.g. plasmid genomics). The ONT wash kit was highly effective in removing DNA between libraries. Contamination between libraries did not appear to affect subsequent hybrid assemblies, even when the same barcodes were used successively on a single flowcell. Utilising shorter run-times in combination with between-library nuclease washes allows at least 36 Enterobacteriaceae isolates to be sequenced per flowcell, significantly reducing the per-isolate sequencing cost. Ultimately this will facilitate large-scale studies utilising hybrid assembly, advancing our understanding of the genomics of key human pathogens.

Data Summary: Raw sequencing data is available via NCBI under project accession number PRJNA604975. Sample accession numbers are provided in table S1. Assemblies are available via Figshare https://doi.org/10.6084/m9.figshare.11816532.v1.

Impact Statement: Most existing sequencing data has been acquired from short-read platforms (e.g. Illumina). For some species of bacteria, clinically important genes, such as those involved in antibiotic resistance and/or virulence, are carried on plasmids. Whilst Illumina sequencing is highly accurate, it is generally unable to resolve complete genomic structures due to repetitive regions. Hybrid assembly uses long reads to scaffold together short-read contigs, maximising the benefits of both technologies. A major limiting factor to using hybrid assemblies at scale is the cost of sequencing the same isolate with two different technologies. Here we show that high-quality hybrid assemblies can be created for most isolates using significantly shorter run-times than are currently standard. We demonstrate that a simple washing step allows several libraries to be run on the same flowcell, facilitating the ability to take advantage of shorter running times. Adding nuclease means that contamination between libraries is minimal and has no significant effect on the quality of subsequent hybrid assemblies. This approach reduces the cost of acquiring long reads by >30%, paving the way for large-scale studies utilising hybrid assemblies which will likely significantly enhance our understanding of the genomics of important human pathogens.

Complete Genome Sequences of 12 Quinolone-Resistant Escherichia coli Strains Containing qnrS1 Based on Hybrid Assemblies

Microbiology Resource Announcements ◽

10.1128/mra.01190-20 ◽

2021 ◽

Vol 10 (4) ◽

Author(s):

Håkon Kaspersen ◽

Thomas H. A. Haverkamp ◽

Hanna Karin Ilag ◽

Øivind Øines ◽

Camilla Sekse ◽

...

Keyword(s):

Escherichia Coli ◽

Complete Genome ◽

Flow Cell ◽

Hybrid Assembly ◽

Genome Sequences ◽

Content Type ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

Hybrid Assemblies

Abstract: In total, 12 quinolone-resistant Escherichia coli (QREC) strains containing qnrS1 were submitted to long-read sequencing using a FLO-MIN106 flow cell on a MinION device. The long reads were assembled with short reads (Illumina) and analyzed using the MOB-suite pipeline. Six of these QREC genome sequences were closed after hybrid assembly.

Evaluation of Germline Structural Variant Calling Methods for Nanopore Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2021.761791 ◽

2021 ◽

Vol 12 ◽

Author(s):

Davide Bolognini ◽

Alberto Magi

Keyword(s):

Variant Calling ◽

Research Report ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Factors Affecting ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Sequencing Studies ◽

Long Read

Structural variants (SVs) are genomic rearrangements that involve at least 50 nucleotides and are known to have a serious impact on human health. While prior short-read sequencing technologies have often proved inadequate for a comprehensive assessment of structural variation, more recent long reads from Oxford Nanopore Technologies have already been proven invaluable for the discovery of large SVs and hold the potential to facilitate the resolution of the full SV spectrum. With many long-read sequencing studies to follow, it is crucial to assess factors affecting current SV calling pipelines for nanopore sequencing data. In this brief research report, we evaluate and compare the performances of five long-read SV callers across four long-read aligners using both real and synthetic nanopore datasets. In particular, we focus on the effects of read alignment, sequencing coverage, and variant allele depth on the detection and genotyping of SVs of different types and size ranges and provide insights into precision and recall of SV callsets generated by integrating the various long-read aligners and SV callers. The computational pipeline we propose is publicly available at https://github.com/davidebolo1993/EViNCe and can be adjusted to further evaluate future nanopore sequencing datasets.

Kmer2SNP: reference-free SNP calling from raw reads based on matching

10.1101/2020.05.17.100305 ◽

2020 ◽

Author(s):

Yanbo Li ◽

Yu Lin

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Fundamental Problem ◽

Disease Diagnosis ◽

Hybrid Assembly ◽

Snp Calling ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Maximum Weight Matching ◽

Reference Genomes

Abstract: The development of DNA sequencing technologies provides the opportunity to call heterozygous SNPs for each individual. SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Reference-based SNP calling approaches generate highly accurate results, but they face serious limitations, especially because high-quality reference genomes are not available for many species. Although reference-free approaches have the potential to call SNPs without using the reference genome, they have not been widely applied to large and complex genomes because existing approaches suffer from low recall/precision or high runtime. We develop a reference-free algorithm, Kmer2SNP, to call SNPs directly from raw reads. Kmer2SNP first computes the k-mer frequency distribution from reads and identifies potential heterozygous k-mers, which appear in only one haplotype. Kmer2SNP then constructs a graph by choosing these heterozygous k-mers as vertices and connecting edges between pairs of heterozygous k-mers that might correspond to SNPs. Kmer2SNP further assigns a weight to each edge using overlapping information between heterozygous k-mers, computes a maximum weight matching, and finally outputs SNPs as edges between k-mer pairs in the matching. We benchmark Kmer2SNP against reference-free methods, including hybrid (assembly-based) and assembly-free methods, on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs using only k-mers from raw reads without assembly. The source code is freely available at https://github.com/yanboANU/Kmer2SNP.
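
The steps described in this abstract (count k-mers, keep likely heterozygous ones, connect candidate pairs, pick a matching) can be illustrated with a toy sketch. This is not the Kmer2SNP implementation: every function name, the count band (`low`, `high`), the rule that paired k-mers differ only at the middle base, the coverage-based edge weight, and the greedy matching are assumptions made for illustration. In particular, the abstract says edges are weighted using overlap information and that a maximum weight matching is computed; the greedy step below is only a simple stand-in for that.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only, for simplicity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def heterozygous_kmers(counts, low, high):
    """Keep k-mers whose count lies in the assumed heterozygous band [low, high]."""
    return {kmer for kmer, c in counts.items() if low <= c <= high}


def candidate_edges(kmers, counts):
    """Link k-mers that differ only at the middle base (assumed pairing rule)."""
    edges = []
    for a, b in combinations(sorted(kmers), 2):
        mid = len(a) // 2
        if a[:mid] == b[:mid] and a[mid + 1:] == b[mid + 1:] and a[mid] != b[mid]:
            # Hypothetical weight: pairs with similar coverage score higher.
            weight = 1.0 / (1 + abs(counts[a] - counts[b]))
            edges.append((weight, a, b))
    return edges


def greedy_matching(edges):
    """Greedy stand-in for maximum weight matching: heaviest edges first."""
    matched, used = [], set()
    for _weight, a, b in sorted(edges, key=lambda e: (-e[0], e[1], e[2])):
        if a not in used and b not in used:
            matched.append((a, b))
            used.update((a, b))
    return matched


def pair_snp_kmers(reads, k, low, high):
    """Run the toy pipeline and return matched k-mer pairs (candidate SNPs)."""
    counts = count_kmers(reads, k)
    het = heterozygous_kmers(counts, low, high)
    return greedy_matching(candidate_edges(het, counts))


if __name__ == "__main__":
    # Two toy haplotypes that differ at a single position (C vs T).
    hap1 = "GATTACAGGCT"
    hap2 = "GATTATAGGCT"
    reads = [hap1] * 5 + [hap2] * 5
    print(pair_snp_kmers(reads, k=5, low=3, high=7))
```

On this toy input the k-mers shared by both haplotypes are seen 10 times and fall outside the band, so only haplotype-specific k-mers remain, and the single pair centred on the variant base (`TACAG`, `TATAG`) is reported. A real tool would also need to handle reverse complements, sequencing errors, and repeats, none of which this sketch attempts.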

Inclusion of Oxford Nanopore long reads improves all microbial and phage metagenome-assembled genomes from a complex aquifer system

10.1101/2019.12.18.880807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Will A. Overholt ◽

Martin Hölzer ◽

Patricia Geesink ◽

Celia Diezel ◽

Manja Marz ◽

...

Keyword(s):

Cost Benefit Analysis ◽

Hybrid Approach ◽

Cost Benefit ◽

16S Rrna Genes ◽

Rrna Genes ◽

Aquifer System ◽

Sequencing Platform ◽

Long Reads ◽

Oxford Nanopore ◽

Hybrid Assemblies

Abstract: Assembling microbial and phage genomes from metagenomes is a powerful and appealing method to understand structure-function relationships in complex environments. In order to compare the recovery of genomes from microorganisms and their phages from groundwater, we generated shotgun metagenomes with Illumina sequencing accompanied by long reads derived from the Oxford Nanopore sequencing platform. Assembly and metagenome-assembled genome (MAG) metrics for both microbes and viruses were determined from Illumina-only assemblies and a hybrid assembly approach. Strikingly, the hybrid approach more than doubled the number of mid- to high-quality MAGs (> 50% completion, < 10% redundancy), generated nearly four-fold more phage genomes, and improved all associated genome metrics relative to the Illumina-only method. The hybrid assemblies yielded MAGs that were on average 7.8% more complete, with 133 fewer contigs and a 14 kbp greater N50. Furthermore, the longer contigs from the hybrid approach generated microbial MAGs that had a higher proportion of rRNA genes. We demonstrate this usefulness by linking microbial MAGs containing 16S rRNA genes with an extensive amplicon dataset. This work provides quantitative data to inform a cost-benefit analysis on the decision to supplement shotgun metagenomic projects with long reads towards the goal of recovering genomes from environmentally abundant groups.
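
For readers unfamiliar with the N50 metric cited in this abstract, a minimal sketch of the standard definition follows: N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are illustrative assumptions, not data from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    fragmented = [100, 80, 50, 30, 20]   # hypothetical short-read-only contigs
    joined = [180, 50, 30, 20]           # same bases after two contigs are merged
    print(n50(fragmented), n50(joined))
```

In the hypothetical example, merging the two longest contigs leaves the total length unchanged but raises the N50 from 80 to 180, which is why fewer, longer contigs from a hybrid assembly tend to show up as a higher N50.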

Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing

Communications Biology ◽

10.1038/s42003-021-02559-3 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Caroline Belser ◽

Franc-Christophe Baurens ◽

Benjamin Noel ◽

Guillaume Martin ◽

Corinne Cruaud ◽

...

Keyword(s):

Musa Acuminata ◽

Genetic Maps ◽

Nanopore Sequencing ◽

Genome Coverage ◽

Long Reads ◽

Oxford Nanopore ◽

A Genome ◽

Long Read ◽

Genome Assemblies ◽

First Time

Abstract: Long-read technologies hold the promise of producing more complete genome assemblies and of making them easier to obtain. Coupled with long-range technologies, they can reveal the architecture of complex regions, like centromeres or rDNA clusters. These technologies also make it possible to determine the complete organization of chromosomes, which previously remained difficult even when using genetic maps. However, generating a gapless, telomere-to-telomere assembly is still not trivial and requires a combination of several technologies and the choice of suitable software. Here, we report a chromosome-scale assembly of a banana genome (Musa acuminata) generated using Oxford Nanopore long reads. We generated a genome coverage of 177X from a single PromethION flowcell, with nearly 17X from reads longer than 75 kbp. Of the 11 chromosomes, 5 were entirely reconstructed in a single contig from telomere to telomere, revealing for the first time the content of complex regions like centromeres or clusters of paralogous genes.

nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning

10.1101/2021.08.29.458094 ◽

2021 ◽

Author(s):

Sabrina Krakau ◽

Daniel Straub ◽

Hadrien Gourlé ◽

Gisela Gabernet ◽

Sven Nahnsen

Keyword(s):

Microbial Communities ◽

Best Practice ◽

Metagenomic Data ◽

Hybrid Assembly ◽

Individual Genome ◽

Long Reads ◽

Metagenome Assembly ◽

Group Information ◽

Genome Level ◽

Reference Genomes

The analysis of shotgun metagenomic data provides valuable insights into microbial communities, while allowing resolution at the individual genome level. In the absence of complete reference genomes, this requires the reconstruction of metagenome-assembled genomes (MAGs) from sequencing reads. We present the nf-core/mag pipeline for metagenome assembly, binning and taxonomic classification. It can optionally combine short and long reads to increase assembly continuity and utilize sample-wise group information for co-assembly and genome binning. The pipeline is easy to install (all dependencies are provided within containers), portable and reproducible. It is written in Nextflow and developed as part of the nf-core initiative for best-practice pipeline development. All code is hosted on GitHub under the nf-core organization https://github.com/nf-core/mag and released under the MIT license.

De-novo Assembly of Limnospira fusiformis Using Ultra-Long Reads

Frontiers in Microbiology ◽

10.3389/fmicb.2021.657995 ◽

2021 ◽

Vol 12 ◽

Author(s):

McKenna Hicks ◽

Thuy-Khanh Tran-Dao ◽

Logan Mulroney ◽

David L. Bernick

Keyword(s):

Phylogenetic Analysis ◽

Type Strain ◽

Reference Genome ◽

De Novo ◽

Illumina Miseq ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Oxford Nanopore Technologies ◽

Rdna Analysis

The Limnospira genus is a recently established clade that is economically important due to its worldwide use in biotechnology and agriculture. This genus includes organisms that were reclassified from Arthrospira, which are commercially marketed as “Spirulina.” Limnospira are photoautotrophic organisms that are widely used for research in nutrition, medicine, bioremediation, and biomanufacturing. Despite its widespread use, there is no closed genome for the Limnospira genus, and no reference genome for the type strain, Limnospira fusiformis. In this work, the L. fusiformis genome was sequenced using Oxford Nanopore Technologies MinION and assembled using only ultra-long reads (>35 kb). This assembly was polished with Illumina MiSeq reads sourced from an axenic L. fusiformis culture; axenicity was verified via microscopy and rDNA analysis. Ultra-long read sequencing resulted in a 6.42 Mb closed genome assembled as a single contig with no plasmid. Phylogenetic analysis placed L. fusiformis in the Limnospira clade; some Arthrospira were also placed in this clade, suggesting a misclassification of these strains. This work provides a fully closed and accurate reference genome for the economically important type strain, L. fusiformis. We also present a rapid axenicity method to isolate L. fusiformis. These contributions enable future biotechnological development of L. fusiformis by way of genetic engineering.

Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing

10.1101/2021.04.16.440017 ◽

2021 ◽

Author(s):

Caroline Belser ◽

Franc-Christophe Baurens ◽

Benjamin Noel ◽

Guillaume Martin ◽

Corinne Cruaud ◽

...

Keyword(s):

Musa Acuminata ◽

Genetic Maps ◽

Nanopore Sequencing ◽

Genome Coverage ◽

Long Reads ◽

Oxford Nanopore ◽

A Genome ◽

Long Read ◽

Genome Assemblies ◽

First Time

Abstract: Long-read technologies hold the promise of producing more complete genome assemblies and of making them easier to obtain. Coupled with long-range technologies, they can reveal the architecture of complex regions, like centromeres or rDNA clusters. These technologies also make it possible to determine the complete organization of chromosomes, which previously remained difficult even when using genetic maps. However, generating a gapless, telomere-to-telomere assembly is still not trivial and requires a combination of several technologies and the choice of suitable software. Here, we report a chromosome-scale assembly of a banana genome (Musa acuminata) generated using Oxford Nanopore long reads. We generated a genome coverage of 177X from a single PromethION flowcell, with nearly 17X from reads longer than 75 kb. Of the 11 chromosomes, 5 were entirely reconstructed in a single contig from telomere to telomere, revealing for the first time the content of complex regions like centromeres or clusters of paralogous genes.