Efficient long single molecule sequencing for cost effective and accurate sequencing, haplotyping, and de novo assembly

Mapping Intimacies ◽

10.1101/324392 ◽

2018 ◽

Author(s):

Ou Wang ◽

Robert Chin ◽

Xiaofang Cheng ◽

Michelle Ka Wu ◽

Qing Mao ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

De Novo ◽

Low Cost ◽

Variant Calling ◽

Cost Effective ◽

High Quality ◽

Single Molecule Sequencing ◽

Single Tube ◽

Complex Structural

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs with an N50 of up to 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
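The barcode figures in the abstract above can be checked with a short calculation. The sketch below is illustrative only: the choice of three barcode positions with 1,536 segments each is an assumption (the abstract does not state it) that reproduces the roughly 3.6 billion figure, and the redundancy estimate assumes each of the 50 million barcodes is drawn uniformly and independently from that space. The function names are hypothetical.

```python
import math


def barcode_space(segments_per_position: int, positions: int) -> int:
    """Number of distinct barcodes when each position is filled
    independently from the same number of segments."""
    return segments_per_position ** positions


def shared_barcode_fraction(n_sampled: int, space_size: int) -> float:
    """Expected fraction of sampled barcodes whose sequence is also carried
    by at least one other sampled barcode (uniform, independent draws).

    P(no other draw matches) = (1 - 1/space_size) ** (n_sampled - 1);
    log1p/expm1 keep the result accurate when 1/space_size is tiny.
    """
    log_no_match = (n_sampled - 1) * math.log1p(-1.0 / space_size)
    return -math.expm1(log_no_match)


# Assumed design: 3 positions x 1,536 segments (not stated in the abstract).
space = barcode_space(1536, 3)
fraction = shared_barcode_fraction(50_000_000, space)
print(f"barcode space: {space:,}")
print(f"expected share of redundant barcodes: {fraction:.2%}")
```

Under these assumptions only a small percentage of barcodes in a sample would be expected to share a sequence with another barcode, which is one way to read "practically non-redundant".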


AsmMix: A pipeline for high quality diploid de novo assembly

10.1101/2021.01.15.426893 ◽

2021 ◽

Author(s):

Pei Wu ◽

Chao Liu ◽

Ou Wang ◽

Xia Zhao ◽

Fang Chen ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Variant Calling ◽

The Other ◽

Second Step ◽

Small Scale ◽

Mixing Process ◽

High Quality ◽

Single Molecule Sequencing ◽

Long Read

In this paper, we report a pipeline, AsmMix, which is capable of producing both contiguous and high-quality diploid genomes. The pipeline consists of two steps. In the first step, two sets of assemblies are generated: one is based on co-barcoded reads and is highly accurate and haplotype-resolved but contains many gaps; the other is based on single-molecule sequencing reads and is contiguous but error-prone. In the second step, those two sets of assemblies are compared and integrated into a haplotype-resolved assembly with fewer errors. We test our pipeline on a dataset from the human genome NA24385, call variants from the resulting assemblies, and compare them against the GIAB benchmark. We show that the AsmMix pipeline can produce highly contiguous, accurate, and haplotype-resolved assemblies. In particular, the assembly mixing process effectively reduces small-scale errors in the long-read assembly.


Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

GigaScience ◽

10.1093/gigascience/giz125 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 6

Author(s):

Hui-Su Kim ◽

Sungwon Jeon ◽

Changjae Kim ◽

Yeon Kyung Kim ◽

Yun Sung Cho ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

Sequencing Data ◽

Smrt Sequencing ◽

Human Genome Assembly

Background: Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than $40,000 US for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. Findings: We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and N50s of 17.9 Mb). When we applied Hi-C–derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. Conclusion: The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
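The contig and scaffold N50 values quoted above follow the standard definition: the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of that calculation is given below; the function name and the toy lengths are illustrative and are not taken from the KOREF data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for running >= total / 2
            return length


# Toy example with five contigs (arbitrary units).
print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, two assemblies with similar N50 values can still differ in contig count and total length, which is why the abstract reports all three.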


Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

10.1101/674804 ◽

2019 ◽

Cited By ~ 2

Author(s):

Hui-Su Kim ◽

Sungwon Jeon ◽

Changjae Kim ◽

Yeon Kyung Kim ◽

Yun Sung Cho ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

Sequencing Data ◽

Smrt Sequencing ◽

Human Genome Assembly

Background: Long DNA reads produced by single molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short read DNA fragments. For de novo assembly, PacBio and Oxford Nanopore Technologies (ONT) are favorite options. However, PacBio’s SMRT sequencing is expensive for a full human genome assembly and costs over 40,000 USD for 30x coverage as of 2019. ONT PromethION sequencing, on the other hand, is one-twelfth the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio’s SMRT sequencing in relation to the quality. Findings: We performed whole genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64x coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mbp and a total genome length of 2.8 Gbp. It was comparable to a KOREF assembly constructed using PacBio at 62x coverage (188 Gbp, 2,695 contigs and N50s of 17.9 Mbp). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64x coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mbp. Conclusion: The pore-based PromethION approach provides a good quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and is more cost-effective than PacBio at comparable quality measurements.


SMARTdenovo: a de novo assembler using long noisy reads

Gigabyte ◽

10.46471/gigabyte.15 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Hailin Liu ◽

Shigang Wu ◽

Alun Li ◽

Jue Ruan

Keyword(s):

Error Correction ◽

Single Molecule ◽

Genome Assembly ◽

De Novo ◽

Structural Variants ◽

High Quality ◽

Single Molecule Sequencing ◽

Long Read ◽

Reference Quality

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. It has also been widely used to study structural variants, phase haplotypes and more. Here, we introduce the assembler SMARTdenovo, a single-molecule sequencing (SMS) assembler that follows the overlap-layout-consensus (OLC) paradigm. SMARTdenovo (RRID: SCR_017622) was designed to be a rapid assembler, which, unlike contemporaneous SMS assemblers, does not require highly accurate raw reads for error correction. It has performed well in the evaluation of congeneric assemblers and has been successfully used for various assembly projects. It is compatible with Canu for assembling high-quality genomes, and several of the assembly strategies in this program have been incorporated into subsequent popular assemblers. The assembler has been in use since 2015; here we provide information on the development of SMARTdenovo and how to implement its algorithms into current projects.
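For readers unfamiliar with the overlap-layout-consensus idea mentioned above, the toy sketch below shows the overlap step and a greedy layout on exact, error-free strings. It is not SMARTdenovo's algorithm: real long-read assemblers find approximate overlaps between noisy reads and add a consensus step, both of which are omitted here. All function names and the example reads are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix
    of `b`; returns 0 if the best match is shorter than `min_len`."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest exact overlap
    until no pair overlaps by at least `min_len` bases."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # no remaining overlaps: leave the reads as separate contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])  # merge, keeping the overlap once
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

The all-pairs comparison is quadratic in the number of reads, which is one reason practical assemblers rely on indexing or seeding to find candidate overlaps.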


Improved de novo Genome Assembly: Linked-Read Sequencing Combined with Optical Mapping Produce a High Quality Mammalian Genome at Relatively Low Cost

10.1101/128348 ◽

2017 ◽

Cited By ~ 10

Author(s):

DW Mohr ◽

A Naguib ◽

NI Weisenfeld ◽

V Kumar ◽

P Shah ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

Conserved Synteny ◽

High Quality ◽

De Novo Genome Assembly ◽

Optical Maps ◽

Commercial Applications ◽

Hybrid Scaffolds

Current short-read methods have come to dominate genome sequencing because they are cost-effective, rapid, and accurate. However, short reads are most applicable when data can be aligned to a known reference. Two new methods for de novo assembly are linked-reads and restriction-site labeled optical maps. We combined commercial applications of these technologies for genome assembly of an endangered mammal, the Hawaiian monk seal. We show that the linked-reads produced with 10X Genomics Chromium chemistry and assembled with Supernova v1.1 software produced scaffolds with an N50 of 22.23 Mbp with the longest individual scaffold of 84.06 Mbp. When combined with Bionano Genomics optical maps using Bionano RefAligner, the scaffold N50 increased to 29.65 Mbp for a total of 170 hybrid scaffolds, the longest of which was 84.78 Mbp. These results were 161X and 215X, respectively, improved over DISCOVAR de novo assemblies. The quality of the scaffolds was assessed using conserved synteny analysis of both the DNA sequence and predicted seal proteins relative to the genomes of humans and other species. We found large blocks of conserved synteny suggesting that the hybrid scaffolds were high quality. An inversion in one scaffold complementary to human chromosome 6 was found and confirmed by optical maps. The complementarity of linked-reads and optical maps is likely to make the production of high quality genomes more routine and economical and, by doing so, significantly improve our understanding of comparative genome biology.
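The two fold-improvement figures above can be cross-checked against each other. The sketch below assumes that both 161X and 215X refer to scaffold N50 relative to the same DISCOVAR de novo baseline, which the abstract implies but does not state outright; the function name is hypothetical.

```python
def implied_baseline_mbp(improved_n50_mbp: float, fold: float) -> float:
    """Baseline N50 implied by an improved N50 and its fold improvement."""
    return improved_n50_mbp / fold


linked_reads = implied_baseline_mbp(22.23, 161)  # Supernova scaffolds
hybrid = implied_baseline_mbp(29.65, 215)        # with Bionano optical maps
print(f"{linked_reads:.3f} Mbp vs {hybrid:.3f} Mbp")
```

Both ratios point to a baseline N50 of roughly 0.14 Mbp, so the two reported fold changes are mutually consistent under that assumption.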


SMARTdenovo: A de novo Assembler Using Long Noisy Reads

10.20944/preprints202009.0207.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Hailin Liu ◽

Shigang Wu ◽

Alun Li ◽

Jue Ruan

Keyword(s):

Error Correction ◽

Single Molecule ◽

Genome Assembly ◽

De Novo ◽

Structural Variants ◽

High Quality ◽

De Novo Genome Assembly ◽

Single Molecule Sequencing ◽

Long Read ◽

Reference Quality

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. It has also been widely used to study structural variants, phase haplotypes and more. Here, we introduce the assembler SMARTdenovo, a single-molecule sequencing (SMS) assembler that follows the overlap-layout-consensus (OLC) paradigm. SMARTdenovo (RRID: SCR_017622) was designed to be a fast assembler that, unlike other contemporaneous SMS assemblers, did not require highly accurate raw reads for error correction. It has performed well in evaluations of congeneric assemblers and has been successful in a variety of assembly projects. It is compatible with Canu for assembling high-quality genomes, and several of the assembly strategies in this program have been incorporated into subsequent popular assemblers. The assembler has been in use since 2015, and here we provide information on the development of SMARTdenovo and how to implement its algorithms into current projects.


Aquila_stLFR: assembly based variant calling package for stLFR and hybrid assembly for linked-reads

10.1101/742239 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xin Zhou ◽

Lu Zhang ◽

Xiaodong Fang ◽

Yichen Liu ◽

David L. Dill ◽

...

Keyword(s):

De Novo ◽

Low Cost ◽

Variant Calling ◽

Hybrid Assembly ◽

Structural Variants ◽

Sequencing Data ◽

Single Tube ◽

Large Numbers ◽

Key Characteristics ◽

Hybrid Assemblies

Human diploid genome assembly enables the identification of maternal and paternal genetic variation. Algorithms based on 10x linked-read sequencing have been developed for de novo assembly, variant calling and haplotyping. Another linked-read technology, single tube long fragment read (stLFR), has recently provided a low-cost single tube solution that can enable long fragment data. However, no existing software is available for human diploid assembly and variant calling from stLFR data. We develop Aquila stLFR to adapt to the key characteristics of stLFR. Aquila stLFR produces near-perfect diploid contigs, and the assembly-based variant calling shows that Aquila stLFR detects large numbers of structural variants that are not easily spanned by Illumina short reads. Furthermore, the hybrid assembly mode, Aquila hybrid, allows a hybrid assembly based on both stLFR and 10x linked-read libraries, demonstrating that these two technologies can always complement each other to improve assembly contiguity and variant detection, regardless of the assembly quality of the library from either single sequencing technology. The overlapping structural variants (SVs) from two independent sequencing datasets of the same individual, and the SVs from hybrid assemblies, provide a high-confidence profile for studying them. Availability: Source code and documentation are available at https://github.com/maiziex/Aquila_stLFR.


LRez: C++ API and toolkit for analyzing and managing Linked-Reads data

Bioinformatics Advances ◽

10.1093/bioadv/vbab022 ◽

2021 ◽

Author(s):

Pierre Morisse ◽

Claire Lemaitre ◽

Fabrice Legeai

Keyword(s):

Genome Assembly ◽

Low Cost ◽

Variant Calling ◽

Supplementary Information ◽

Supplementary Data ◽

High Quality ◽

Dna Molecule ◽

Sequencing Technologies ◽

Wide Range ◽

Genomic Regions

Motivation: Linked-Reads technologies combine the high quality and low cost of short-read sequencing with long-range information, through the use of barcodes tagging reads that originate from a common long DNA molecule. This technology has been employed in a broad range of applications including genome assembly, phasing and scaffolding, as well as structural variant calling. However, to date, no tool or API dedicated to the manipulation of Linked-Reads data exists. Results: We introduce LRez, a C++ API and toolkit that allows easy management of Linked-Reads data. LRez includes various functionalities for computing numbers of common barcodes between genomic regions, extracting barcodes from BAM files, and indexing and querying BAM, FASTQ and gzipped FASTQ files to quickly fetch all reads or alignments containing a given barcode. LRez is compatible with a wide range of Linked-Reads sequencing technologies, and can thus be used in any tool or pipeline requiring barcode processing or indexing, in order to improve their performance. Availability and implementation: LRez is implemented in C++, supported on Unix-based platforms, and available under AGPL-3.0 License at https://github.com/morispi/LRez, and as a bioconda module. Supplementary information: Supplementary data are available at Bioinformatics Advances.
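The idea of counting common barcodes between genomic regions can be illustrated without any particular library. The sketch below is plain Python and does not use or reproduce the LRez API: the input format (tuples of chromosome, position and barcode), the region names, and both function names are assumptions made for illustration. In practice the barcode would first be read from each alignment record, for example from the BX tag used in 10x-style BAM files.

```python
from collections import defaultdict


def barcodes_by_region(alignments, regions):
    """Collect the set of barcodes seen in each named region.

    alignments: iterable of (chrom, pos, barcode) tuples.
    regions: dict mapping a region name to (chrom, start, end), with
             half-open coordinates [start, end).
    """
    found = defaultdict(set)
    for chrom, pos, barcode in alignments:
        for name, (r_chrom, start, end) in regions.items():
            if chrom == r_chrom and start <= pos < end:
                found[name].add(barcode)
    return dict(found)


def common_barcodes(found, region_a, region_b):
    """Barcodes observed in both regions; many shared barcodes suggest the
    regions were covered by the same long DNA molecules."""
    return found.get(region_a, set()) & found.get(region_b, set())


# Hypothetical alignments and regions, for illustration only.
alignments = [
    ("chr1", 1_000, "BC01"), ("chr1", 1_500, "BC02"),
    ("chr1", 60_000, "BC01"), ("chr1", 61_000, "BC03"),
    ("chr2", 5_000, "BC02"),
]
regions = {"left": ("chr1", 0, 10_000), "right": ("chr1", 50_000, 70_000)}
found = barcodes_by_region(alignments, regions)
print(common_barcodes(found, "left", "right"))
```

Scanning every region for every alignment is adequate for a handful of regions; a dedicated index over barcodes or positions becomes necessary at whole-genome scale.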


Ultra-low input single tube linked-read library method enables short-read NGS systems to generate highly accurate and economical long-range sequencing information for de novo genome assembly and haplotype phasing

10.1101/852947 ◽

2019 ◽

Cited By ~ 3

Author(s):

Zhoutao Chen ◽

Long Pham ◽

Tsai-Chin Wu ◽

Guoya Mo ◽

Yu Xia ◽

...

Keyword(s):

Long Range ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

De Novo Genome Assembly ◽

Short Read ◽

Single Tube ◽

Haplotype Phasing ◽

A Genome ◽

Long Read

Long-range sequencing information is required for haplotype phasing, de novo assembly and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information but at a high cost with low accuracy and high DNA input requirement. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-Seq™) technology, which enables a low-cost, high-accuracy and high-throughput short-read next generation sequencer to routinely generate over 100 Kb long-range sequencing information with as little as 0.1 ng input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution and compartmentation. The barcode linked reads are used to successfully assemble genomes ranging from microbes to human. These linked-reads also generate mega-base-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important to identify compound heterozygosity in recessive Mendelian diseases and discover genetic drivers and diagnostic biomarkers in cancers.


Hybrid de novo genome assembly of Chinese chestnut (Castanea mollissima)

GigaScience ◽

10.1093/gigascience/giz112 ◽

2019 ◽

Vol 8 (9) ◽

Cited By ~ 11

Author(s):

Yu Xing ◽

Yang Liu ◽

Qing Zhang ◽

Xinghua Nie ◽

Yamin Sun ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

Genetic Improvement ◽

De Novo ◽

Draft Genome ◽

Whole Genome Sequence ◽

Whole Genome ◽

High Quality ◽

Chinese Chestnut ◽

Castanea Mollissima

Background: The Chinese chestnut (Castanea mollissima) is widely cultivated in China for nut production. This plant also plays an important ecological role in afforestation and ecosystem services. To facilitate and expand the use of C. mollissima for breeding and its genetic improvement, we report here the whole-genome sequence of C. mollissima. Findings: We produced a high-quality assembly of the C. mollissima genome using Pacific Biosciences single-molecule sequencing. The final draft genome is ∼785.53 Mb long, with a contig N50 size of 944 kb, and we further annotated 36,479 protein-coding genes in the genome. Phylogenetic analysis showed that C. mollissima diverged from Quercus robur, a member of the Fagaceae family, ∼13.62 million years ago. Conclusions: The high-quality whole-genome assembly of C. mollissima will be a valuable resource for further genetic improvement and breeding for disease resistance and nut quality.
