Featherweight long read alignment using partitioned reference indexes

Mapping Intimacies ◽

10.1101/386847 ◽

2018 ◽

Author(s):

Hasindu Gamaarachchi ◽

Sri Parameswaran ◽

Martin A. Smith

Keyword(s):

Mobile Computing ◽

Human Genome ◽

Parameter Optimization ◽

Reference Genome ◽

State Of The Art ◽

Genomic Research ◽

Nanopore Sequencing ◽

Read Alignment ◽

Long Read ◽

Reference Genomes

Abstract The advent of nanopore sequencing has realised portable genomic research and applications. However, state of the art long read aligners and large reference genomes are not compatible with most mobile computing devices due to their high memory requirements. We show how memory requirements can be reduced through parameter optimization and reference genome partitioning, but highlight the associated limitations and caveats of these approaches. We then demonstrate how these issues can be overcome through an appropriate merging technique. We extend the Minimap2 aligner and demonstrate that long read alignment to the human genome can be performed on a system with 2GB RAM with negligible impact on accuracy.
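
The idea of aligning against reference partitions and then merging the results can be illustrated with a minimal sketch. It assumes each partition has already been aligned separately and has produced scored candidate hits per read. The `Hit` record and the `merge_partition_hits` function are hypothetical names, not part of Minimap2, and keeping only the single top-scoring hit per read is a simplification rather than the merging technique the abstract refers to.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    """One candidate alignment of a read (hypothetical record, not Minimap2's output format)."""
    read: str
    target: str
    pos: int
    score: int


def merge_partition_hits(partitions: Iterable[List[Hit]]) -> Dict[str, Hit]:
    """Keep, for every read, the highest-scoring hit seen across all partitions."""
    best: Dict[str, Hit] = {}
    for hits in partitions:
        for hit in hits:
            current = best.get(hit.read)
            # Ties keep the hit that was seen first.
            if current is None or hit.score > current.score:
                best[hit.read] = hit
    return best


# Toy per-partition results: read1 aligns better in the second partition.
part_a = [Hit("read1", "chr1", 100, 50), Hit("read2", "chr1", 900, 20)]
part_b = [Hit("read1", "chr7", 400, 80)]
merged = merge_partition_hits([part_a, part_b])
```

Only one partition's hits need to be held in memory at a time, which is the property that makes partitioning attractive on low-memory devices.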

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

10.1101/020719 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ivan Sovic ◽

Mile Sikic ◽

Andreas Wilm ◽

Shannon Nicole Fenlon ◽

Swaine Chen ◽

...

Keyword(s):

Human Genome ◽

Variant Calling ◽

Error Rates ◽

Nanopore Sequencing ◽

Structural Variants ◽

Specific Identification ◽

Long Reads ◽

Long Read ◽

Specific Error ◽

Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.

Kmer2SNP: reference-free SNP calling from raw reads based on matching

10.1101/2020.05.17.100305 ◽

2020 ◽

Author(s):

Yanbo Li ◽

Yu Lin

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Fundamental Problem ◽

Disease Diagnosis ◽

Hybrid Assembly ◽

Snp Calling ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Maximum Weight Matching ◽

Reference Genomes

Abstract The development of DNA sequencing technologies provides the opportunity to call heterozygous SNPs for each individual. SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Reference-based SNP calling approaches generate highly accurate results, but they face serious limitations especially when high-quality reference genomes are not available for many species. Although reference-free approaches have the potential to call SNPs without using the reference genome, they have not been widely applied on large and complex genomes because existing approaches suffer from low recall/precision or high runtime. We develop a reference-free algorithm Kmer2SNP to call SNP directly from raw reads. Kmer2SNP first computes the k-mer frequency distribution from reads and identifies potential heterozygous k-mers which only appear in one haplotype. Kmer2SNP then constructs a graph by choosing these heterozygous k-mers as vertices and connecting edges between pairs of heterozygous k-mers that might correspond to SNPs. Kmer2SNP further assigns a weight to each edge using overlapping information between heterozygous k-mers, computes a maximum weight matching and finally outputs SNPs as edges between k-mer pairs in the matching. We benchmark Kmer2SNP against reference-free methods including hybrid (assembly-based) and assembly-free methods on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs only using k-mers from raw reads without assembly. The source code is freely available at https://github.com/yanboANU/Kmer2SNP.
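
The first two steps described above (counting k-mers, then linking heterozygous k-mers that might correspond to a SNP) can be sketched as follows. This is a toy illustration under stated assumptions, not the Kmer2SNP implementation: the function names are hypothetical, "heterozygous" is approximated by a count window `[lo, hi]`, candidate pairs are k-mers that differ only at the middle base, reverse complements are ignored, and the edge weighting and maximum weight matching steps are left out.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_kmers(reads: List[str], k: int) -> Counter:
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def candidate_snp_pairs(counts: Counter, lo: int, hi: int) -> List[Tuple[str, str]]:
    """Pair k-mers with count in [lo, hi] that differ only at the middle base."""
    groups: Dict[Tuple[str, str], List[str]] = {}
    for kmer, n in counts.items():
        if lo <= n <= hi:
            mid = len(kmer) // 2
            # K-mers sharing both flanks can differ only at the middle base.
            groups.setdefault((kmer[:mid], kmer[mid + 1:]), []).append(kmer)
    pairs: List[Tuple[str, str]] = []
    for members in groups.values():
        members.sort()
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.append((members[i], members[j]))
    return sorted(pairs)


# Two toy haplotypes that differ at one base (C vs G at index 5).
reads = ["TTGACCATGCA", "TTGACGATGCA"]
counts = count_kmers(reads, k=5)
pairs = candidate_snp_pairs(counts, lo=1, hi=1)
```

In this toy input the k-mers shared by both haplotypes have count 2 and are filtered out, while the k-mers covering the variant have count 1 and survive the filter.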

Generation of small interfering RNA (siRNA) database from SARS-CoV-2 genome sequences

10.21203/rs.3.pex-1207/v1 ◽

2020 ◽

Author(s):

Inácio Gomes Medeiros ◽

André Salim Khayat ◽

Beatriz Stransky ◽

Sidney Emanuel Batista dos Santos ◽

Paulo Pimentel de Assumpção ◽

...

Keyword(s):

Human Genome ◽

Design Process ◽

Small Interfering Rna ◽

Reference Genome ◽

Second Phase ◽

Computational Power ◽

Genome Sequences ◽

Interfering Rna ◽

Reference Genomes

Abstract This protocol aims to describe the building of a database of SARS-CoV-2 targets for siRNA approaches. Starting from the virus reference genome, we will derive sequences 18 to 21 nt long and verify their similarity against the human genome and coding and non-coding transcriptome, as well as genomes from related viruses. We will also calculate a set of thermodynamic features for those sequences and will infer their efficiencies using three different predictors. The protocol has two main phases: in the first, we align sequences against reference genomes; in the second, we extract the features. The duration of the first phase varies with the computational power of the running machine and the number of reference genomes. The second phase takes about thirty minutes to execute, also depending on the number of cores of the running machine. The constructed database aims to speed up the design process by providing a broad set of possible SARS-CoV-2 sequence targets and siRNA sequences.
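
The derivation of 18 to 21 nt candidate sequences from a genome can be sketched as a sliding window. This is only an illustration of that one step: the function name `candidate_windows` is hypothetical, the input is a made-up toy string rather than the SARS-CoV-2 reference, and the similarity checks, thermodynamic features and efficiency predictors are not modelled.

```python
from typing import Iterator, Tuple


def candidate_windows(genome: str, min_len: int = 18,
                      max_len: int = 21) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for every window of min_len..max_len nt."""
    for length in range(min_len, max_len + 1):
        for start in range(len(genome) - length + 1):
            yield start, genome[start:start + length]


# 22 nt toy string, not a real viral sequence.
toy_genome = "ACGT" * 5 + "AC"
windows = list(candidate_windows(toy_genome))
```

For a genome of length n, each window length L contributes n - L + 1 candidates, so the number of sequences to align grows roughly linearly with genome size for each of the four lengths.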

An improved pig reference genome sequence to enable pig genetics and genomics research

GigaScience ◽

10.1093/gigascience/giaa051 ◽

2020 ◽

Vol 9 (6) ◽

Cited By ~ 12

Author(s):

Amanda Warr ◽

Nabeel Affara ◽

Bronwen Aken ◽

Hamid Beiki ◽

Derek M Bickhart ◽

...

Keyword(s):

Reference Genome ◽

Genomic Research ◽

Biomedical Model ◽

Model Species ◽

Domestic Pig ◽

Genomics Research ◽

Long Read ◽

Genetics And Genomics ◽

Genome Assemblies ◽

Chromosome Level

Abstract Background: The domestic pig (Sus scrofa) is important both as a food source and as a biomedical model given its similarity in size, anatomy, physiology, metabolism, pathology, and pharmacology to humans. The draft reference genome (Sscrofa10.2) of a purebred Duroc female pig established using older clone-based sequencing methods was incomplete, and unresolved redundancies, short-range order and orientation errors, and associated misassembled genes limited its utility. Results: We present 2 annotated highly contiguous chromosome-level genome assemblies created with more recent long-read technologies and a whole-genome shotgun strategy, 1 for the same Duroc female (Sscrofa11.1) and 1 for an outbred, composite-breed male (USMARCv1.0). Both assemblies are of substantially higher (>90-fold) continuity and accuracy than Sscrofa10.2. Conclusions: These highly contiguous assemblies plus annotation of a further 11 short-read assemblies provide an unprecedented view of the genetic make-up of this important agricultural and biomedical model species. We propose that the improved Duroc assembly (Sscrofa11.1) become the reference genome for genomic research in pigs.

AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

10.1101/2021.02.16.431517 ◽

2021 ◽

Author(s):

Jeremie S. Kim ◽

Can Firtina ◽

Meryem Banu Cavlak ◽

Damla Senol Cali ◽

Nastaran Hajinazar ◽

...

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Variant Calling ◽

Ground Truth ◽

Data Set ◽

C Elegans ◽

A Genome ◽

Downstream Analysis ◽

Similar Accuracy ◽

Reference Genomes

Abstract As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set. Code Availability: AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
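
The two-step remapping scheme described above (find regions shared between the references, then translate read positions that fall inside them) can be sketched as a coordinate lookup. This is a simplified illustration, not AirLift's implementation: the `Region` tuple and `lift_position` function are hypothetical, shared regions are assumed to be exact, sorted and non-overlapping, and a `None` result stands for "this read needs a full remap".

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (old_start, old_end_exclusive, new_start) for a region present in both references.
Region = Tuple[int, int, int]


def lift_position(regions: List[Region], old_pos: int, read_len: int) -> Optional[int]:
    """Translate a read's old start position, or return None if it must be fully remapped.

    `regions` must be sorted by old_start and must not overlap.
    """
    starts = [region[0] for region in regions]
    i = bisect_right(starts, old_pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = regions[i]
    # The whole read has to lie inside the shared region.
    if old_pos + read_len <= old_end:
        return new_start + (old_pos - old_start)
    return None


# Toy layout: 200 bases of the old reference have no counterpart in the new one.
regions = [(0, 1000, 0), (1200, 2000, 1150)]
inside = lift_position(regions, 1300, 100)
straddling = lift_position(regions, 980, 50)
missing = lift_position(regions, 1100, 50)
```

Reads that return `None` here are the ones that straddle or fall outside shared regions, which is the case the abstract identifies as the main drawback of existing remapping tools.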

An improved pig reference genome sequence to enable pig genetics and genomics research

10.1101/668921 ◽

2019 ◽

Cited By ~ 16

Author(s):

Amanda Warr ◽

Nabeel Affara ◽

Bronwen Aken ◽

H. Beiki ◽

Derek M. Bickhart ◽

...

Keyword(s):

Reference Genome ◽

Genomic Research ◽

Biomedical Model ◽

Model Species ◽

Domestic Pig ◽

Genomics Research ◽

Long Read ◽

Genetics And Genomics ◽

Genome Assemblies ◽

Chromosome Level

Abstract The domestic pig (Sus scrofa) is important both as a food source and as a biomedical model with high anatomical and immunological similarity to humans. The draft reference genome (Sscrofa10.2) of a purebred Duroc female pig established using older clone-based sequencing methods was incomplete and unresolved redundancies, short range order and orientation errors and associated misassembled genes limited its utility. We present two annotated highly contiguous chromosome-level genome assemblies created with more recent long read technologies and a whole genome shotgun strategy, one for the same Duroc female (Sscrofa11.1) and one for an outbred, composite breed male (USMARCv1.0). Both assemblies are of substantially higher (>90-fold) continuity and accuracy than Sscrofa10.2. These highly contiguous assemblies plus annotation of a further 11 short read assemblies provide an unprecedented view of the genetic make-up of this important agricultural and biomedical model species. We propose that the improved Duroc assembly (Sscrofa11.1) become the reference genome for genomic research in pigs.

PyPore: a python toolbox for nanopore sequencing data handling

Bioinformatics ◽

10.1093/bioinformatics/btz269 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4445-4447 ◽

Cited By ~ 1

Author(s):

Roberto Semeraro ◽

Alberto Magi

Keyword(s):

Open Source Software ◽

Reference Genome ◽

State Of The Art ◽

Supplementary Information ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Software Packages ◽

Technological Improvement ◽

Fastq Format ◽

Oxford Nanopore

Abstract Motivation: The recent technological improvement of Oxford Nanopore sequencing pushed the throughput of these devices to 10–20 Gb, allowing the generation of millions of reads. For these reasons, the availability of fast software packages for evaluating experimental quality by generating highly informative and interactive summary plots is of fundamental importance. Results: We developed PyPore, a three-module Python toolbox designed to handle raw FAST5 files from quality checking to alignment to a reference genome and to explore their features through the generation of browsable HTML files. The first module provides an interface to explore and evaluate the information contained in FAST5 files and summarize it into informative quality measures. The second module converts raw data to FASTQ format, while the third module makes it easy to use three state-of-the-art aligners and collects mapping statistics. Availability and implementation: PyPore is open-source software written in Python 2.7; the source code is freely available for all OS platforms on GitHub at https://github.com/rsemeraro/PyPore. Supplementary information: Supplementary data are available at Bioinformatics online.

THE JOURNEY SO FAR AND THE ROADMAP AHEAD

INDIAN DRUGS ◽

10.53879/id.58.01.p0005 ◽

2021 ◽

Vol 58 (01) ◽

pp. 5-6

Author(s):

Gopakumar G. Nair ◽

Keyword(s):

Human Genome ◽

Reference Genome ◽

Genome Project ◽

Genomic Research ◽

Turn Of The Millennium ◽

Great Progress ◽

Dear Reader ◽

Global Investors ◽

The Human Genome Project ◽

Pathogen Genome

Dear Reader, Current Covid times are introspection times, too. When the Human Genome Project was initiated in 1990, for determining the base pairs that make up DNA and for identifying and mapping all the genes of the human genome, the hue and cry made by the Indian NGOs kept India out of the project, at least officially. Approximately 20 research institutions globally, including some from China and Russia later, participated during the 13 years of the project, which concluded in 2003. The participating countries and institutions made major contributions and consequently became beneficiaries of great progress and major strides in genomic research. While China was already participating from 1990 and Russia joined in 2000, India realised the need and importance of moving into this field at the turn of the millennium. The 100K Pathogen Genome Project launched in 2012 in the USA and the 100,000 Genomes Project, also of late 2012, by the UK carried forward the genome project initiatives. The countries that took early initiatives benefited immensely through major breakthroughs. For good (or bad?), China outpaced India in genomic research and was rewarded immensely through funding from major global investors. What about India? Better late than never. The DBT in India initiated the Genome India Project in January 2020 with the aim of collecting a moderate 10,000 human genetic samples from across India to build a reference genome. Fortunately, the vociferous NGO lobbies have probably realised their folly in opposing the genome project participation by India in the 1990s and the Indian project of 2020 will hopefully progress.

The Context and State of the Art in European Biobanking

Protecting Genetic Privacy in Biobanking through Data Protection Law ◽

10.1093/oso/9780192896476.003.0003 ◽

2021 ◽

pp. 19-39

Author(s):

Dara Hallinan

Keyword(s):

Human Genome ◽

Human Genome Project ◽

State Of The Art ◽

Genome Project ◽

Genomic Research ◽

Social Significance ◽

The Human Genome Project

This chapter examines the context and state of the art in European biobanks and biobanking. Specifically, it seeks to provide an overview of the emergence, function, and practice of the current European biobanking landscape. It begins by looking at the emergence of biobanks and biobanking, exploring the Human Genome Project (HGP). The chapter then focuses on genomic research—the activity biobanks support—and considers its social significance and prospects. Against this background, it offers a definition for the concepts of biobank and biobanking. This definition is then used to map the range of types of biobanks, and biobanking activity, identifiable across Europe. The chapter concludes with a consideration of trends which will define European biobanking in future.

Benchmarking hybrid assembly approaches for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing

BMC Genomics ◽

10.1186/s12864-020-07041-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Zhao Chen ◽

David L. Erickson ◽

Jianghong Meng

Keyword(s):

Virulence Genes ◽

Reference Genome ◽

Bacterial Pathogens ◽

Nanopore Sequencing ◽

Hybrid Assembly ◽

Pan Genome ◽

Long Reads ◽

Oxford Nanopore ◽

Hybrid Assemblies ◽

Reference Genomes

Abstract Background: We benchmarked the hybrid assembly approaches of MaSuRCA, SPAdes, and Unicycler for bacterial pathogens using Illumina and Oxford Nanopore sequencing by determining genome completeness and accuracy, antimicrobial resistance (AMR), virulence potential, multilocus sequence typing (MLST), phylogeny, and pan genome. Ten bacterial species (10 strains) were tested for simulated reads of both mediocre- and low-quality, whereas 11 bacterial species (12 strains) were tested for real reads. Results: Unicycler performed the best for achieving contiguous genomes, closely followed by MaSuRCA, while all SPAdes assemblies were incomplete. MaSuRCA was less tolerant of low-quality long reads than SPAdes and Unicycler. The hybrid assemblies of five antimicrobial-resistant strains with simulated reads provided consistent AMR genotypes with the reference genomes. The MaSuRCA assembly of Staphylococcus aureus with real reads contained msr(A) and tet(K), while the reference genome and SPAdes and Unicycler assemblies harbored blaZ. The AMR genotypes of the reference genomes and hybrid assemblies were consistent for the other five antimicrobial-resistant strains with real reads. The numbers of virulence genes in all hybrid assemblies were similar to those of the reference genomes, irrespective of simulated or real reads. Only one exception existed that the reference genome and hybrid assemblies of Pseudomonas aeruginosa with mediocre-quality long reads carried 241 virulence genes, whereas 184 virulence genes were identified in the hybrid assemblies of low-quality long reads. The MaSuRCA assemblies of Escherichia coli O157:H7 and Salmonella Typhimurium with mediocre-quality long reads contained 126 and 118 virulence genes, respectively, while 110 and 107 virulence genes were detected in their MaSuRCA assemblies of low-quality long reads, respectively. All approaches performed well in our MLST and phylogenetic analyses. The pan genomes of the hybrid assemblies of S. Typhimurium with mediocre-quality long reads were similar to that of the reference genome, while SPAdes and Unicycler were more tolerant of low-quality long reads than MaSuRCA for the pan-genome analysis. All approaches functioned well in the pan-genome analysis of Campylobacter jejuni with real reads. Conclusions: Our research demonstrates the hybrid assembly pipeline of Unicycler as a superior approach for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing.
