Kmer2SNP: reference-free SNP calling from raw reads based on matching

Mapping Intimacies ◽

10.1101/2020.05.17.100305 ◽

2020 ◽

Author(s):

Yanbo Li ◽

Yu Lin

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Fundamental Problem ◽

Disease Diagnosis ◽

Hybrid Assembly ◽

Snp Calling ◽

Sequencing Technologies ◽

Order Of Magnitude ◽

Maximum Weight Matching ◽

Reference Genomes

Abstract The development of DNA sequencing technologies provides the opportunity to call heterozygous SNPs for each individual. SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Reference-based SNP calling approaches generate highly accurate results, but they face serious limitations, especially because high-quality reference genomes are not available for many species. Although reference-free approaches have the potential to call SNPs without using a reference genome, they have not been widely applied to large and complex genomes because existing approaches suffer from low recall/precision or high runtime. We develop a reference-free algorithm, Kmer2SNP, to call SNPs directly from raw reads. Kmer2SNP first computes the k-mer frequency distribution from reads and identifies potential heterozygous k-mers, which appear in only one haplotype. Kmer2SNP then constructs a graph by choosing these heterozygous k-mers as vertices and adding edges between pairs of heterozygous k-mers that might correspond to SNPs. Kmer2SNP further assigns a weight to each edge using overlapping information between heterozygous k-mers, computes a maximum weight matching, and finally outputs SNPs as edges between k-mer pairs in the matching. We benchmark Kmer2SNP against reference-free methods, including hybrid (assembly-based) and assembly-free methods, on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs using only k-mers from raw reads, without assembly. The source code is freely available at https://github.com/yanboANU/Kmer2SNP.
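
The pipeline described in the abstract (count k-mers, keep likely heterozygous ones, link candidate pairs, pick a matching) can be illustrated with a small Python sketch. This is not the Kmer2SNP implementation: the function names, the count window `low`/`high`, the "differ only at the middle base" pairing rule, and the edge weight (the smaller of the two counts) are all assumptions made for illustration. A greedy matching stands in for the exact maximum weight matching, and reverse complements are ignored.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def candidate_edges(het_counts, k):
    """Connect k-mers that are identical except for the middle base."""
    mid = k // 2
    groups = {}
    for kmer in het_counts:
        key = kmer[:mid] + "_" + kmer[mid + 1:]
        groups.setdefault(key, []).append(kmer)
    edges = []
    for members in groups.values():
        for u, v in combinations(sorted(members), 2):
            # Hypothetical weight: the smaller of the two k-mer counts.
            edges.append((u, v, min(het_counts[u], het_counts[v])))
    return edges


def greedy_matching(edges):
    """Greedy stand-in for maximum weight matching: heaviest edges first."""
    used, matching = set(), []
    for u, v, _ in sorted(edges, key=lambda e: (-e[2], e[0], e[1])):
        if u not in used and v not in used:
            matching.append((u, v))
            used.update((u, v))
    return matching


def call_snps(reads, k, low, high):
    """Keep k-mers whose count lies in [low, high], then match candidate pairs."""
    counts = count_kmers(reads, k)
    het = {kmer: c for kmer, c in counts.items() if low <= c <= high}
    return greedy_matching(candidate_edges(het, k))


# Toy data: two haplotypes differing at one base, each "sequenced" twice.
hap_a = "GATTACAGCTTGA"
hap_b = "GATTACGGCTTGA"
reads = [hap_a, hap_a, hap_b, hap_b]
snps = call_snps(reads, k=5, low=1, high=3)
```

In this toy input, k-mers shared by both haplotypes are seen four times and fall outside the assumed window, while k-mers covering the variant are seen twice. Only the pair with the variant at the middle position is linked, so `snps` holds the single pair `("ACAGC", "ACGGC")`.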

Challenges and Perspectives in the Epigenetics of Climate Change-Induced Forests Decline

Frontiers in Plant Science ◽

10.3389/fpls.2021.797958 ◽

2022 ◽

Vol 12 ◽

Author(s):

Isabel García-García ◽

Belén Méndez-Cea ◽

David Martín-Gálvez ◽

José Ignacio Seco ◽

Francisco Javier Gallego ◽

...

Keyword(s):

Climate Change ◽

Tree Species ◽

State Of The Art ◽

Forest Tree ◽

Epigenetic Modifications ◽

Sequencing Technologies ◽

Induced Stress ◽

Generation Times ◽

Reference Genomes ◽

Sessile Organisms

Forest tree species are highly vulnerable to the effects of climate change. As sessile organisms with long generation times, their adaptation to a local changing environment may rely on epigenetic modifications when allele frequencies are not able to shift fast enough. However, the current lack of knowledge in this field is remarkable, due to the many challenges that researchers face when studying this issue. Huge genome sizes, the absence of reference genomes and annotation, and the need to analyze huge amounts of data are among these difficulties, which limit the current ability to understand how climate change drives epigenetic modifications in tree species. In spite of this challenging framework, some insights into the relationship between climate change-induced stress and epigenomics are emerging. Advances in DNA sequencing technologies and an increasing number of studies dealing with this topic should boost our knowledge of tree adaptive capacity to changing environmental conditions. Here, we discuss challenges and perspectives in the epigenetics of climate change-induced forests decline, aiming to provide a general overview of the state of the art.

An Algorithm to Build a Multi-genome Reference

10.1101/2020.04.11.036871 ◽

2020 ◽

Cited By ~ 2

Author(s):

Leily Rabbani ◽

Jonas Müller ◽

Detlef Weigel

Keyword(s):

Reference Genome ◽

Single Species ◽

High Quality ◽

Sequencing Technologies ◽

Single Genome ◽

Mapping Sequence ◽

A Genome ◽

Shared Information ◽

Multiple Reference ◽

Reference Genomes

Abstract Motivation New DNA sequencing technologies have enabled the rapid analysis of many thousands of genomes from a single species. At the same time, the conventional approach of mapping sequencing reads against a single reference genome sequence is no longer adequate. However, even where multiple high-quality reference genomes are available, the problem remains how one would integrate results from pairwise analyses. Result To overcome the limits imposed by mapping sequence reads against a single reference genome, or serially mapping them against multiple reference genomes, we have developed the MGR method that allows simultaneous comparison against multiple high-quality reference genomes, in order to remove the bias that comes from using only a single-genome reference and to simplify downstream analyses. To this end, we present the MGR algorithm that creates a graph (MGR graph) as a multi-genome reference. To reduce the size and complexity of the multi-genome reference, highly similar orthologous and paralogous regions are collapsed while more substantial differences are retained. To evaluate the performance of our model, we have developed a genome compression tool, which can be used to estimate the amount of shared information between genomes. Availability https://github.com/LeilyR/[email protected]

AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

10.1101/2021.02.16.431517 ◽

2021 ◽

Author(s):

Jeremie S. Kim ◽

Can Firtina ◽

Meryem Banu Cavlak ◽

Damla Senol Cali ◽

Nastaran Hajinazar ◽

...

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Variant Calling ◽

Ground Truth ◽

Data Set ◽

C Elegans ◽

A Genome ◽

Downstream Analysis ◽

Similar Accuracy ◽

Reference Genomes

Abstract As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set. Code Availability AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
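
The generic two-step remapping that the abstract attributes to existing tools (find regions shared by both references, then shift the coordinates of reads inside them) can be sketched in Python as follows. This is not AirLift's algorithm: the region table format `(old_start, old_end, new_start)`, the function names, and the rule that a read must lie entirely inside one region are assumptions chosen to keep the example small. Reads that cannot be shifted are collected for full mapping.

```python
from bisect import bisect_right


def lift_position(pos, length, regions):
    """Translate an old-reference position to the new reference.

    regions: list of (old_start, old_end, new_start), sorted by old_start,
    half-open and non-overlapping (hypothetical format).
    Returns the new position, or None if the read is not fully inside a region.
    """
    starts = [r[0] for r in regions]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = regions[i]
    if pos + length <= old_end:
        return new_start + (pos - old_start)
    return None


def remap(reads, regions):
    """Split reads into those lifted by coordinate shift and those needing full mapping."""
    lifted, needs_full_mapping = {}, []
    for read_id, pos, length in reads:
        new_pos = lift_position(pos, length, regions)
        if new_pos is None:
            needs_full_mapping.append(read_id)
        else:
            lifted[read_id] = new_pos
    return lifted, needs_full_mapping


# Toy region table and reads given as (read_id, old_position, read_length).
regions = [(0, 100, 50), (200, 300, 400)]
reads = [("r1", 10, 20), ("r2", 95, 20), ("r3", 210, 50), ("r4", 150, 10)]
lifted, needs_full_mapping = remap(reads, regions)
```

Here `r1` and `r3` are shifted to positions 60 and 410, while `r2` (which runs past the end of its region) and `r4` (which falls between regions) are left for full mapping.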

Featherweight long read alignment using partitioned reference indexes

10.1101/386847 ◽

2018 ◽

Author(s):

Hasindu Gamaarachchi ◽

Sri Parameswaran ◽

Martin A. Smith

Keyword(s):

Mobile Computing ◽

Human Genome ◽

Parameter Optimization ◽

Reference Genome ◽

State Of The Art ◽

Genomic Research ◽

Nanopore Sequencing ◽

Read Alignment ◽

Long Read ◽

Reference Genomes

Abstract The advent of nanopore sequencing has realised portable genomic research and applications. However, state-of-the-art long-read aligners and large reference genomes are not compatible with most mobile computing devices due to their high memory requirements. We show how memory requirements can be reduced through parameter optimization and reference genome partitioning, but highlight the associated limitations and caveats of these approaches. We then demonstrate how these issues can be overcome through an appropriate merging technique. We extend the Minimap2 aligner and demonstrate that long-read alignment to the human genome can be performed on a system with 2 GB RAM with negligible impact on accuracy.
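
The idea of aligning against one reference partition at a time and then merging the per-partition results can be illustrated with a short Python sketch. It is a simplified assumption of what such a merge might look like, not the authors' Minimap2 extension: the hit tuple layout `(read_id, target, position, score)` and the "keep the highest-scoring hit per read" rule are hypothetical, and details such as mapping quality recalculation are left out.

```python
def merge_partition_hits(hits_per_partition):
    """Keep, for each read, the highest-scoring hit seen across all partitions.

    hits_per_partition: iterable of lists of (read_id, target, position, score).
    Ties keep the hit from the earlier partition.
    """
    best = {}
    for hits in hits_per_partition:
        for read_id, target, position, score in hits:
            if read_id not in best or score > best[read_id][2]:
                best[read_id] = (target, position, score)
    return best


# Toy input: the same reads aligned independently against two partitions.
partition_1 = [("read1", "chr1", 1200, 87), ("read2", "chr2", 50, 40)]
partition_2 = [("read1", "chr9", 300, 35), ("read2", "chr7", 9100, 92)]
merged = merge_partition_hits([partition_1, partition_2])
```

Only one partition's index has to be in memory at a time, and the merge step needs just the compact hit records, which is what makes the approach attractive on low-memory devices.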

Benchmarking hybrid assembly approaches for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing

BMC Genomics ◽

10.1186/s12864-020-07041-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Zhao Chen ◽

David L. Erickson ◽

Jianghong Meng

Keyword(s):

Virulence Genes ◽

Reference Genome ◽

Bacterial Pathogens ◽

Nanopore Sequencing ◽

Hybrid Assembly ◽

Pan Genome ◽

Long Reads ◽

Oxford Nanopore ◽

Hybrid Assemblies ◽

Reference Genomes

Abstract Background We benchmarked the hybrid assembly approaches of MaSuRCA, SPAdes, and Unicycler for bacterial pathogens using Illumina and Oxford Nanopore sequencing by determining genome completeness and accuracy, antimicrobial resistance (AMR), virulence potential, multilocus sequence typing (MLST), phylogeny, and pan genome. Ten bacterial species (10 strains) were tested for simulated reads of both mediocre- and low-quality, whereas 11 bacterial species (12 strains) were tested for real reads. Results Unicycler performed the best for achieving contiguous genomes, closely followed by MaSuRCA, while all SPAdes assemblies were incomplete. MaSuRCA was less tolerant of low-quality long reads than SPAdes and Unicycler. The hybrid assemblies of five antimicrobial-resistant strains with simulated reads provided consistent AMR genotypes with the reference genomes. The MaSuRCA assembly of Staphylococcus aureus with real reads contained msr(A) and tet(K), while the reference genome and SPAdes and Unicycler assemblies harbored blaZ. The AMR genotypes of the reference genomes and hybrid assemblies were consistent for the other five antimicrobial-resistant strains with real reads. The numbers of virulence genes in all hybrid assemblies were similar to those of the reference genomes, irrespective of simulated or real reads. The one exception was that the reference genome and hybrid assemblies of Pseudomonas aeruginosa with mediocre-quality long reads carried 241 virulence genes, whereas 184 virulence genes were identified in the hybrid assemblies of low-quality long reads. The MaSuRCA assemblies of Escherichia coli O157:H7 and Salmonella Typhimurium with mediocre-quality long reads contained 126 and 118 virulence genes, respectively, while 110 and 107 virulence genes were detected in their MaSuRCA assemblies of low-quality long reads, respectively. All approaches performed well in our MLST and phylogenetic analyses. The pan genomes of the hybrid assemblies of S. Typhimurium with mediocre-quality long reads were similar to that of the reference genome, while SPAdes and Unicycler were more tolerant of low-quality long reads than MaSuRCA for the pan-genome analysis. All approaches functioned well in the pan-genome analysis of Campylobacter jejuni with real reads. Conclusions Our research demonstrates the hybrid assembly pipeline of Unicycler as a superior approach for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing.

Large scale microbiome profiling in the cloud

Bioinformatics ◽

10.1093/bioinformatics/btz356 ◽

2019 ◽

Vol 35 (14) ◽

pp. i13-i22 ◽

Cited By ~ 1

Author(s):

Camilo Valdes ◽

Vitalii Stebliankin ◽

Giri Narasimhan

Keyword(s):

Large Scale ◽

Bacterial Population ◽

Reference Genome ◽

Supplementary Information ◽

Bacterial Genomes ◽

Reference Collection ◽

Order Of Magnitude ◽

Spark Framework ◽

Reference Genomes ◽

Microbiome Profiling

Abstract Motivation Bacterial metagenomics profiling for metagenomic whole sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources. Results We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark’s built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon’s Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s—an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments. Availability and implementation Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint. Supplementary information Supplementary data are available at Bioinformatics online.

Scaffolding Contigs Using Multiple Reference Genomes

Computational Biology and Chemistry ◽

10.5772/intechopen.93456 ◽

2020 ◽

Author(s):

Yi-Kung Shieh ◽

Shu-Cheng Liu ◽

Chin Lung Lu

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

State Of The Art ◽

Draft Genome ◽

Evolutionary Relationship ◽

The State ◽

Target Genome ◽

Multiple Reference ◽

Reference Genomes

Scaffolding is an important step of genome assembly; its function is to order and orient the contigs in the assembly of a draft genome into larger scaffolds. Several single reference-based scaffolders have been proposed. However, a single reference genome may not be sufficient on its own for a scaffolder to correctly scaffold a target draft genome, especially when the target genome and the reference genome have a distant evolutionary relationship or differ by some rearrangements. This motivates researchers to develop so-called multiple reference-based scaffolders that can utilize multiple reference genomes, which may provide different but complementary types of scaffolding information, to scaffold the target draft genome. In this chapter, we will review some of the state-of-the-art multiple reference-based scaffolders, such as Ragout, MeDuSa and Multi-CAR, and give a complete introduction to Multi-CSAR, an improved extension of Multi-CAR.

BrumiR: A toolkit for de novo discovery of microRNAs from sRNA-seq data

10.1101/2020.08.07.240689 ◽

2020 ◽

Author(s):

Carol Moraga ◽

Evelyn Sanchez ◽

Mariana Galvão Ferrarini ◽

Rodrigo A. Gutierrez ◽

Elena A. Vidal ◽

...

Keyword(s):

High Throughput Sequencing ◽

Reference Genome ◽

De Novo ◽

Regulation Of Gene Expression ◽

Additional Information ◽

Sequencing Technologies ◽

Mapping Tool ◽

Biological Insight ◽

Non Coding Rnas ◽

Reference Genomes

Abstract MicroRNAs (miRNAs) are small non-coding RNAs that are key players in the regulation of gene expression. In the last decade, with the increasing accessibility of high-throughput sequencing technologies, different methods have been developed to identify miRNAs, most of which rely on pre-existing reference genomes. However, when a reference genome is absent or is not of high quality, such identification becomes more difficult. In this context, we developed BrumiR, an algorithm that is able to discover miRNAs directly and exclusively from sRNA-seq data. We benchmarked BrumiR with datasets encompassing animal and plant species using real and simulated sRNA-seq experiments. The results demonstrate that BrumiR reaches the highest recall for miRNA discovery, while at the same time being much faster and more efficient than the state-of-the-art tools evaluated. This efficiency allows BrumiR to analyze a large number of sRNA-seq experiments from plant or animal species. Moreover, BrumiR detects additional information regarding other expressed sequences (sRNAs, isomiRs, etc.), thus maximizing the biological insight gained from sRNA-seq experiments. Finally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2ref) that performs an a posteriori exhaustive search to identify the precursor sequences. The code of BrumiR is freely available at https://github.com/camoragaq/BrumiR.

Sequence Alignment Through the Looking Glass

10.1101/256859 ◽

2018 ◽

Author(s):

Raja Appuswamy ◽

Jacques Fellay ◽

Nimisha Chaturvedi

Keyword(s):

Data Analysis ◽

Sequence Alignment ◽

Reference Genome ◽

State Of The Art ◽

Genomic Data ◽

Next Generation ◽

Sequencing Technologies ◽

Alignment Algorithms ◽

Genomic Data Analysis ◽

Looking Glass

Abstract Rapid advances in sequencing technologies are producing genomic data on an unprecedented scale. The first, and often one of the most time-consuming, steps of genomic data analysis is sequence alignment, where sequenced reads must be aligned to a reference genome. Several years of research on alignment algorithms have led to the development of several state-of-the-art sequence aligners that can map tens of thousands of reads per second. In this work, we answer the question “How do sequence aligners utilize modern processors?” We examine four state-of-the-art aligners running on an Intel processor and identify that all aligners leave the processor substantially underutilized. We perform an in-depth microarchitectural analysis to explore the interaction between aligner software and processor hardware. We identify bottlenecks that lead to processor underutilization and discuss the implications of our analysis on next-generation sequence aligner design.

Nebula: ultra-efficient mapping-free structural variant genotyper

Nucleic Acids Research ◽

10.1093/nar/gkab025 ◽

2021 ◽

Author(s):

Parsoa Khorsand ◽

Fereydoun Hormozdiari

Keyword(s):

Large Scale ◽

Structural Variants ◽

Sequencing Technologies ◽

Generic Framework ◽

Common Genetic Variants ◽

Order Of Magnitude ◽

Complex Events ◽

Comparable Accuracy ◽

Using Data ◽

Computational Resources

Abstract Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, utilizes the changes in the count of k-mers to predict the genotype of structural variants. We have shown that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also has comparable accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
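
The principle of predicting a genotype from changes in k-mer counts can be shown with a toy Python sketch. It is not Nebula's model: the function name, the use of k-mers assumed to be unique to the variant allele, and the fixed ratio thresholds (0.25 and 0.75) are illustrative assumptions. The sketch compares the mean count of variant-specific k-mers with the expected sequencing coverage.

```python
def genotype_from_kmers(alt_kmer_counts, coverage):
    """Guess a diploid genotype from counts of variant-specific k-mers.

    alt_kmer_counts: observed counts of k-mers assumed unique to the variant allele.
    coverage: expected count of a k-mer present on both haplotypes.
    Thresholds are hypothetical: ratio near 0 -> 0/0, near 0.5 -> 0/1, near 1 -> 1/1.
    """
    if not alt_kmer_counts or coverage <= 0:
        raise ValueError("need at least one k-mer count and a positive coverage")
    ratio = (sum(alt_kmer_counts) / len(alt_kmer_counts)) / coverage
    if ratio < 0.25:
        return "0/0"
    if ratio < 0.75:
        return "0/1"
    return "1/1"


# Toy calls at 30x coverage.
absent = genotype_from_kmers([0, 1, 0], coverage=30)
heterozygous = genotype_from_kmers([14, 16, 15], coverage=30)
homozygous = genotype_from_kmers([29, 31, 30], coverage=30)
```

Because the input is only a table of k-mer counts, no read has to be aligned to a reference, which is the property that mapping-free genotypers rely on for speed.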
