Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

Mapping Intimacies ◽

10.1101/2021.07.02.450803 ◽

2021 ◽

Author(s):

Ann M Mc Cartney ◽

Kishwar Shafin ◽

Michael Alonge ◽

Andrey V Bzikadze ◽

Giulio Formenti ◽

...

Keyword(s):

Genome Assembly ◽

Tandem Repeats ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
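To put the reported QV of 73.9 in perspective, the short sketch below converts a quality value into a per-base error rate. It assumes QV is the standard Phred-scaled value (QV = -10 log10 of the error probability), which the abstract does not spell out; the function name is illustrative only. Under that assumption, QV 73.9 corresponds to roughly one error in 24.5 million bases.

```python
def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value.

    Assumption: QV = -10 * log10(error probability).
    """
    return 10 ** (-qv / 10)


rate = qv_to_error_rate(73.9)   # QV reported in the abstract
bases_per_error = 1 / rate      # expected number of bases between errors
print(f"QV 73.9: {rate:.2e} errors per base, "
      f"about one error per {bases_per_error / 1e6:.1f} Mb")
```

The same function can be applied to any other assembly QV to compare consensus accuracy on a linear scale.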


Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

10.21203/rs.3.rs-712747/v1 ◽

2021 ◽

Author(s):

Arang Rhie ◽

Ann Mc Cartney ◽

Kishwar Shafin ◽

Michael Alonge ◽

Andrey Bzikadze ◽

...

Keyword(s):

Genome Assembly ◽

Tandem Repeats ◽

Hydatidiform Mole ◽

Segmental Duplications ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Human Genome Assembly ◽

Long Read ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Abstract: Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first Telomere-to-Telomere (T2T) human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Though derived from highly accurate sequencing, evaluation revealed that the initial T2T draft assembly had evidence of small errors and structural misassemblies. To correct these errors, we designed a novel repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly QV to 73.9. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both PacBio HiFi and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.


Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacbio Sequel II system and ultralong reads of Oxford Nanopore

10.1101/2020.02.13.948489 ◽

2020 ◽

Author(s):

Dandan Lang ◽

Shilai Zhang ◽

Pingping Ren ◽

Fan Liang ◽

Zongyi Sun ◽

...

Keyword(s):

Gene Families ◽

Single Chromosome ◽

Small Indels ◽

Base Level ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Single Rice ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Abstract: The availability of reference genomes has revolutionized the study of biology. Multiple competing technologies have been developed to improve the quality and robustness of genome assemblies during the last decade. The two widely used long-read sequencing providers, PacBio (PB) and Oxford Nanopore Technologies (ONT), have recently updated their platforms: PB enables high-throughput HiFi reads with base-level accuracy of >99%, and ONT generates reads as long as 2 Mb. We applied the two up-to-date platforms to a single rice individual and then compared the two assemblies to investigate the advantages and limitations of each. The results showed that ONT ultralong reads delivered higher contiguity, producing a total of 18 contigs, of which 10 were each assembled into a single chromosome, compared with 394 contigs and three chromosome-level contigs for the PB assembly. The ONT ultralong reads also prevented assembly errors caused by long repetitive regions, for which we observed a total of 44 genes of false redundancies and 10 genes of false losses in the PB assembly, leading to over- or under-estimation of the gene families in those long repetitive regions. We also noted that the PB HiFi reads generated assemblies with considerably fewer errors at the level of single nucleotides and small InDels than the ONT assembly, which had an average of 1.06 errors per Kb of assembly and ultimately engendered 1,475 incorrect gene annotations via altered or truncated protein predictions.
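The figure of 1.06 errors per Kb can be expressed on the Phred-scaled QV scale commonly used to report assembly accuracy. The sketch below is a back-of-the-envelope conversion, not a calculation from the paper: it assumes the errors are spread uniformly, so that errors per Kb divided by 1,000 is a per-base error rate, and the function name is hypothetical.

```python
import math


def errors_per_kb_to_qv(errors_per_kb: float) -> float:
    """Phred-scaled QV for a given number of errors per 1,000 bases.

    Assumption: errors are uniform, so errors_per_kb / 1000 is the
    per-base error probability, and QV = -10 * log10(probability).
    """
    per_base_rate = errors_per_kb / 1000
    return -10 * math.log10(per_base_rate)


ont_qv = errors_per_kb_to_qv(1.06)  # ONT assembly figure from the abstract
print(f"1.06 errors per Kb is roughly QV {ont_qv:.1f}")
```

Under these assumptions the ONT assembly sits just below QV 30, the value that corresponds to exactly one error per Kb.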


SLR: a scaffolding algorithm based on long reads and contig classification

BMC Bioinformatics ◽

10.1186/s12859-019-3114-9 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 4

Author(s):

Junwei Luo ◽

Mengna Lyu ◽

Ranran Chen ◽

Xiaohong Zhang ◽

Huimin Luo ◽

...

Keyword(s):

Genome Assembly ◽

Pacific Biosciences ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Oxford Nanopore ◽

New Strategy ◽

Long Read ◽

Oxford Nanopore Technologies ◽

Unique Contigs

Abstract: Background: Scaffolding is an important step in genome assembly that orders and orients the contigs produced by assemblers. However, repetitive regions in contigs usually prevent scaffolding from producing accurate results. How to solve the problem of repetitive regions has received a great deal of attention. In the past few years, long reads sequenced by third-generation sequencing technologies (Pacific Biosciences and Oxford Nanopore) have been demonstrated to be useful for sequencing repetitive regions in genomes. Although some stand-alone scaffolding algorithms based on long reads have been presented, scaffolding still requires a new strategy to take full advantage of the characteristics of long reads. Results: Here, we present a new scaffolding algorithm based on long reads and contig classification (SLR). Using the alignment information of long reads and contigs, SLR classifies the contigs into unique contigs and ambiguous contigs to address the problem of repetitive regions. Next, SLR uses only unique contigs to produce draft scaffolds. Then, SLR inserts the ambiguous contigs into the draft scaffolds and produces the final scaffolds. We compare SLR to three popular scaffolding tools by using long-read datasets sequenced with Pacific Biosciences and Oxford Nanopore technologies. The experimental results show that SLR can produce better results in terms of accuracy and completeness. The open-source code of SLR is available at https://github.com/luojunwei/SLR. Conclusion: In this paper, we describe SLR, which is designed to scaffold contigs using long reads. We conclude that SLR can improve the completeness of genome assembly.


Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract: Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.


Improving the Chromosome-Level Genome Assembly of the Siamese Fighting Fish (Betta splendens) in a University Master’s Course

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401205 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2179-2183 ◽

Cited By ~ 1

Author(s):

Stefan Prost ◽

Malte Petersen ◽

Martin Grethlein ◽

Sarah Joy Hahn ◽

Nina Kuschik-Maczollek ◽

...

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Siamese Fighting Fish ◽

Betta Splendens ◽

High Quality ◽

Sequencing Platform ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Chromosome Level

Ever-decreasing costs along with advances in sequencing and library preparation technologies enable even small research groups to generate chromosome-level assemblies today. Here we report the generation of an improved chromosome-level assembly for the Siamese fighting fish (Betta splendens) that was carried out during a practical university master’s course. The Siamese fighting fish is a popular aquarium fish and an emerging model species for research on aggressive behavior. We updated the current genome assembly by generating a new long-read nanopore-based assembly with subsequent scaffolding to chromosome level using previously published Hi-C data. The use of ∼35× nanopore-based long-read data sequenced on a MinION platform (Oxford Nanopore Technologies) allowed us to generate a baseline assembly of only 1,276 contigs with a contig N50 of 2.1 Mbp and a total length of 441 Mbp. Scaffolding using the Hi-C data resulted in 109 scaffolds with a scaffold N50 of 20.7 Mbp. More than 99% of the assembly is contained in 21 scaffolds. The assembly showed the presence of 96.1% complete BUSCO genes from the Actinopterygii dataset, indicating a high quality of the assembly. We present an improved full chromosome-level assembly of the Siamese fighting fish generated during a university master’s course. The use of ∼35× long-read nanopore data drastically improved the baseline assembly in terms of continuity. We show that relatively inexpensive high-throughput sequencing technologies such as the long-read MinION sequencing platform can be used in educational settings, allowing the students to gain practical skills in modern genomics and generate high-quality results that benefit downstream research projects.
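The contig and scaffold N50 values quoted above follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below illustrates that definition on made-up lengths; the function name and the toy data are hypothetical and are not taken from the Betta splendens assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Walks the lengths from longest to shortest and returns the length at
    which the running total first reaches half of the assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy example with invented contig lengths (arbitrary units).
toy_contigs = [10, 8, 5, 4, 3]
print(n50(toy_contigs))
```

Because the statistic depends only on the length distribution, merging contigs into longer scaffolds raises N50 without adding any sequence, which is why contig N50 and scaffold N50 are reported separately.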


Picopore: A tool for reducing the storage size of Oxford Nanopore Technologies datasets without loss of functionality

F1000Research ◽

10.12688/f1000research.11022.3 ◽

2017 ◽

Vol 6 ◽

pp. 227 ◽

Cited By ~ 1

Author(s):

Scott Gigante

Keyword(s):

Data Storage ◽

Data Generation ◽

Biologically Relevant ◽

Sequencing Technologies ◽

Long Term Storage ◽

Oxford Nanopore ◽

Long Read ◽

Oxford Nanopore Technologies ◽

Term Storage

Oxford Nanopore Technologies' (ONT's) MinION and PromethION long-read sequencing technologies are emerging as genuine alternatives to established Next-Generation Sequencing technologies. A combination of the highly redundant file format and a rapid increase in data generation have created a significant problem both for immediate data storage on MinION-capable laptops, and for long-term storage on lab data servers. We developed Picopore, a software suite offering three methods of compression. Picopore's lossless and deep lossless methods provide a 25% and 44% average reduction in size, respectively, without removing any data from the files. Picopore's raw method provides an 88% average reduction in size, while retaining biologically relevant data for the end-user. All methods have the capacity to run in real-time in parallel to a sequencing run, reducing demand for both immediate and long-term storage space.


A highly contiguous genome assembly of Brassica nigra (BB) and revised nomenclature for the pseudochromosomes

10.1101/2020.06.29.175869 ◽

2020 ◽

Author(s):

Kumar Paritosh ◽

Akshay Kumar Pradhan ◽

Deepak Pental

Keyword(s):

Genome Assembly ◽

Optical Mapping ◽

Indian Subcontinent ◽

Brassica Nigra ◽

Oilseed Crop ◽

B Genome ◽

Oxford Nanopore ◽

Long Read ◽

Black Mustard ◽

Genome Assemblies

Abstract: Brassica nigra (BB), also called black mustard, is grown as a condiment crop in India. B. nigra represents the B genome of U’s triangle and is one of the progenitor species of B. juncea (AABB), an important oilseed crop of the Indian subcontinent. We report here a highly contiguous genome assembly of B. nigra variety Sangam. The genome assembly has been carried out using Oxford Nanopore long-read sequencing and optical mapping. The resulting chromosome-scale assembly is a significant improvement over the previous draft assemblies of B. nigra; five out of the eight pseudochromosomes were represented by one scaffold each. The assembled genome was annotated for the transposons, centromeric repeats, and genes. The B. nigra genome was compared with the recently available contiguous genome assemblies of B. rapa (AA), B. oleracea (CC), and B. juncea (AABB). Based on the maximum homology among the three diploid genomes of U’s triangle, we propose a new nomenclature for B. nigra pseudochromosomes, taking the B. rapa pseudochromosome nomenclature as the reference.


Metagenomic data for Halichondria panicea from Illumina and Nanopore sequencing and preliminary genome assemblies for the sponge and two microbial symbionts.

10.1101/2021.10.18.464794 ◽

2021 ◽

Author(s):

Brian W Strehlow ◽

Astrid Schuster ◽

Warren R Francis ◽

Donald E Canfield

Keyword(s):

Additional Data ◽

Illumina Miseq ◽

Metagenomic Data ◽

Single Individual ◽

Halichondria Panicea ◽

Oxford Nanopore ◽

Long Read ◽

Microbial Symbionts ◽

Genome Assemblies ◽

Oxford Nanopore Technologies

Objectives: These data were collected to generate a novel reference metagenome for the sponge Halichondria panicea and its microbiome for subsequent differential expression analyses. Data description: These data include raw sequences from four separate sequencing runs of the metagenome of a single individual of H. panicea - one Illumina MiSeq (2x300 bp, paired-end) run and three Oxford Nanopore Technologies (ONT) long-read sequencing runs, generating 53.8 and 7.42 Gbp respectively. Comparing assemblies of Illumina, ONT and an Illumina-ONT hybrid revealed the hybrid to be the best assembly, comprising 163 Mbp in 63,555 scaffolds (N50: 3,084). This assembly, however, was still highly fragmented and only contained 52% of core metazoan genes (with 77.9% partial genes), so it was also not complete. However, this sponge is an emerging model species for field and laboratory work, and there is considerable interest in genomic sequencing of this species. Although the resultant assemblies from the data presented here are suboptimal, this data note can inform future studies by providing an estimated genome size and coverage requirements for future sequencing, sharing additional data to potentially improve other suboptimal assemblies of this species, and outlining potential limitations and pitfalls of the combined Illumina and ONT approach to novel genome sequencing.


An improved genome assembly uncovers prolific tandem repeats in Atlantic cod

10.1101/060921 ◽

2016 ◽

Cited By ~ 5

Author(s):

Ole K. Tørresen ◽

Bastiaan Star ◽

Sissel Jentoft ◽

William B. Reinar ◽

Harald Grove ◽

...

Keyword(s):

Genome Assembly ◽

Gadus Morhua ◽

Tandem Repeats ◽

Atlantic Cod ◽

Genomic Variation ◽

Promoter Regions ◽

Sequencing Technologies ◽

Combining Data ◽

Genome Assemblies ◽

Multiple Assembly

Abstract: Background: The first Atlantic cod (Gadus morhua) genome assembly published in 2011 was one of the early genome assemblies exclusively based on high-throughput 454 pyrosequencing. Since then, rapid advances in sequencing technologies have led to a multitude of assemblies generated for complex genomes, although many of these are of a fragmented nature with a significant fraction of bases in gaps. The development of long-read sequencing and improved software now enable the generation of more contiguous genome assemblies. Results: By combining data from Illumina, 454 and the longer PacBio sequencing technologies, as well as integrating the results of multiple assembly programs, we have created a substantially improved version of the Atlantic cod genome assembly. The sequence contiguity of this assembly is increased fifty-fold and the proportion of gap-bases has been reduced fifteen-fold. Compared to other vertebrates, the assembly contains an unusually high density of tandem repeats (TRs). Indeed, retrospective analyses reveal that gaps in the first genome assembly were largely associated with these TRs. We show that 21 % of the TRs across the assembly, 19 % in the promoter regions and 12 % in the coding sequences are heterozygous in the sequenced individual. Conclusions: The inclusion of PacBio reads combined with the use of multiple assembly programs drastically improved the Atlantic cod genome assembly by successfully resolving long TRs. The high frequency of heterozygous TRs within or in the vicinity of genes in the genome indicates a considerable standing genomic variation in Atlantic cod populations, which is likely of evolutionary importance.


Construction of a chromosome-scale long-read reference genome assembly for potato

GigaScience ◽

10.1093/gigascience/giaa100 ◽

2020 ◽

Vol 9 (9) ◽

Cited By ~ 3

Author(s):

Gina M Pham ◽

John P Hamilton ◽

Joshua C Wood ◽

Joseph T Burke ◽

Hainan Zhao ◽

...

Keyword(s):

Genome Sequence ◽

Reference Genome ◽

Agronomic Traits ◽

Solanum Tuberosum L ◽

Fold Increase ◽

High Quality ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Oxford Nanopore Technologies

Abstract: Background: Worldwide, the cultivated potato, Solanum tuberosum L., is the No. 1 vegetable crop and a critical food security crop. The genome sequence of DM1–3 516 R44, a doubled monoploid clone of S. tuberosum Group Phureja, was published in 2011 using a whole-genome shotgun sequencing approach with short-read sequence data. Current advanced sequencing technologies now permit generation of near-complete, high-quality chromosome-scale genome assemblies at minimal cost. Findings: Here, we present an updated version of the DM1–3 516 R44 genome sequence (v6.1) using Oxford Nanopore Technologies long reads coupled with proximity-by-ligation scaffolding (Hi-C), yielding a chromosome-scale assembly. The new (v6.1) assembly represents 741.6 Mb of sequence (87.8%) of the estimated 844 Mb genome, of which 741.5 Mb is non-gapped with 731.2 Mb anchored to the 12 chromosomes. Use of Oxford Nanopore Technologies full-length complementary DNA sequencing enabled annotation of 32,917 high-confidence protein-coding genes encoding 44,851 gene models that had a significantly improved representation of conserved orthologs compared with the previous annotation. The new assembly has improved contiguity with a 595-fold increase in N50 contig size, 99% reduction in the number of contigs, a 44-fold increase in N50 scaffold size, and an LTR Assembly Index score of 13.56, placing it in the category of reference genome quality. The improved assembly also permitted annotation of the centromeres via alignment to sequencing reads derived from CENH3 nucleosomes. Conclusions: Access to advanced sequencing technologies and improved software permitted generation of a high-quality, long-read, chromosome-scale assembly and improved annotation dataset for the reference genotype of potato that will facilitate research aimed at improving agronomic traits and understanding genome evolution.
