Acceleration of Nucleotide Semi-Global Alignment with Adaptive Banded Dynamic Programming

Abstract Motivation Pairwise alignment of nucleotide sequences has previously been carried out using the seed-and-extend strategy, where we enumerate seeds (shared patterns) between sequences and then extend the seeds by Smith-Waterman-like semi-global dynamic programming to obtain full pairwise alignments. With the advent of massively parallel short read sequencers, algorithms and data structures for efficiently finding seeds have been extensively explored. However, recent advances in single-molecule sequencing technologies have enabled us to obtain millions of reads, each of which is orders of magnitude longer than those output by the short-read sequencers, demanding a faster algorithm for the extension step that accounts for most of the computation time required for pairwise local alignment. Our goal is to design a faster extension algorithm suitable for single-molecule sequencers with high sequencing error rates (e.g., 10-15%) and with more frequent insertions and deletions than substitutions. Results We propose an adaptive banded dynamic programming algorithm for calculating pairwise semi-global alignment of nucleotide sequences that allows a relatively high insertion or deletion rate while keeping band width relatively low (e.g., 32 or 64 cells) regardless of sequence lengths. Our new algorithm eliminated mutual dependences between elements in a vector, allowing an efficient Single-Instruction-Multiple-Data parallelization. We experimentally demonstrate that our algorithm runs approximately 5× faster than the extension alignment algorithm in NCBI BLAST+ while retaining similar sensitivity (recall). We also show that our extension algorithm is more sensitive than the extension alignment routine in DALIGNER, while the computation time is comparable. Availability The implementation of the algorithm and the benchmarking scripts are available at https://github.com/ocxtal/[email protected]
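
To make the idea of a band concrete, the following is a minimal Python sketch of a fixed-band global alignment score. It is not the adaptive, SIMD-parallel, semi-global algorithm the abstract describes: the band here stays centred on the main diagonal, and the function name `banded_global_score` and the scoring values are assumptions chosen only for illustration.

```python
def banded_global_score(a, b, band=8, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells with |i - j| <= band.

    Illustrative fixed band with a linear gap penalty (hypothetical
    parameters); cells outside the band are treated as unreachable.
    """
    NEG = float("-inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return NEG  # the final cell (n, m) lies outside the band
    # Row 0: only the first `band` columns are inside the band.
    prev = [j * gap if j <= band else NEG for j in range(m + 1)]
    for i in range(1, n + 1):
        # A production version would store only the band, not the full row.
        cur = [NEG] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[m]
```

With a band at least as wide as the longer sequence the result equals the unbanded score; a narrow band trades that guarantee for roughly `len(a) * (2 * band + 1)` cell updates, which is the cost the adaptive variant aims to keep low on indel-rich reads.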

Consensify: A Method for Generating Pseudohaploid Genome Sequences from Palaeogenomic Datasets with Reduced Error Rates

Genes ◽

10.3390/genes11010050 ◽

2020 ◽

Vol 11 (1) ◽

pp. 50

Author(s):

Axel Barlow ◽

Stefanie Hartmann ◽

Javier Gonzalez ◽

Michael Hofreiter ◽

Johanna L. A. Paijmans

Keyword(s):

Clustering Analysis ◽

Error Rates ◽

Sequencing Error ◽

Genome Sequences ◽

High Quality ◽

Short Read ◽

Future Studies ◽

Sequencing Coverage ◽

Simplifying Assumptions ◽

Population Clustering

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, frequently by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage, but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences, which controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic and population clustering analysis, we find that Consensify is less affected by artefacts than methods based on single read sampling. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other frequently used methods. Although Consensify is developed with palaeogenomic data in mind, it is applicable for any low to medium coverage short read datasets. We predict that Consensify will be a useful tool for future studies of palaeogenomes.

REDUCING THE SEARCH SPACE AND TIME COMPLEXITY OF NEEDLEMAN-WUNSCH ALGORITHM (GLOBAL ALIGNMENT) AND SMITH-WATERMAN ALGORITHM (LOCAL ALIGNMENT) FOR DNA SEQUENCE ALIGNMENT

Jurnal Teknologi ◽

10.11113/jt.v77.6564 ◽

2015 ◽

Vol 77 (20) ◽

Cited By ~ 1

Author(s):

F. N. Muhamad ◽

R. B. Ahmad ◽

S. Mohd. Asi ◽

M. N. Murad

Keyword(s):

Dynamic Programming ◽

Dna Sequences ◽

Sequence Comparison ◽

Dynamic Programming Algorithm ◽

Search Space ◽

Programming Algorithm ◽

Local Alignment ◽

Global Alignment ◽

Main Research ◽

Dna Sequence Alignment

The fundamental procedure for analyzing sequence content is sequence comparison. Sequence comparison can be defined as the problem of finding which parts of the sequences are similar and which parts are different, namely comparing two sequences to identify similarities and differences between them. A typical approach to solving this problem is to find a good and reasonable alignment between the two sequences. The main research in this project is to align DNA sequences using the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment, both based on dynamic programming. Dynamic programming is guaranteed to find an optimal alignment by exploring all possible alignments and choosing the best through scoring and traceback. The algorithms proposed and evaluated aim to reduce the gaps in aligned sequences as well as the length of the sequences aligned without compromising the quality or correctness of results. To verify the accuracy and consistency of the results obtained with the Needleman-Wunsch and Smith-Waterman algorithms, the data are compared with Emboss (global) and Emboss (local) using test data of 600 strands.
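
As a point of reference for the two algorithms named above, here is a minimal Python sketch of the textbook Needleman-Wunsch and Smith-Waterman score recurrences. It computes scores only (no traceback), uses a linear gap penalty, and the scoring values are assumptions; it does not reproduce the search-space reduction the paper proposes.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment restart at any cell.
            h = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best
```

The only differences between the two are the boundary values and the clamp at zero: the global score is read from the last cell, while the local score is the maximum over all cells.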

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Briefings in Bioinformatics ◽

10.1093/bib/bbx161 ◽

2017 ◽

Vol 20 (4) ◽

pp. 1222-1237 ◽

Cited By ~ 10

Author(s):

Brian B Luczak ◽

Benjamin T James ◽

Hani Z Girgis

Keyword(s):

Query Sequence ◽

Sequence Length ◽

Local Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽

Earth Mover's Distance ◽

Alignment Free ◽

Length Difference ◽

Alignment Algorithms

Abstract Motivation Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. Results We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover’s distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover’s distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. Availability The source code of the benchmarking tool is available as Supplementary Materials.
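
A small Python sketch may help show what "statistics calculated on two k-mer histograms" means in practice. The Manhattan distance below is used purely as an illustrative statistic and is not claimed to be one of the 33 evaluated in the survey; the function names are assumptions.

```python
from collections import Counter


def kmer_histogram(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan(h1, h2):
    """Sum of absolute count differences over all words seen in either histogram."""
    return sum(abs(h1[w] - h2[w]) for w in h1.keys() | h2.keys())


def length_difference(s1, s2):
    """Absolute difference in sequence length, a statistic the abstract highlights."""
    return abs(len(s1) - len(s2))
```

Building both histograms takes time linear in the sequence lengths, which is why such statistics can act as a fast filter before a quadratic alignment is attempted.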

Consensify: a method for generating pseudohaploid genome sequences from palaeogenomic datasets with reduced error rates

10.1101/498915 ◽

2018 ◽

Cited By ~ 2

Author(s):

Axel Barlow ◽

Stefanie Hartmann ◽

Javier Gonzalez ◽

Michael Hofreiter ◽

Johanna L.A. Paijmans

Keyword(s):

Clustering Analysis ◽

Branch Length ◽

Genetic Distances ◽

Error Rates ◽

Sequencing Error ◽

Short Read ◽

Future Studies ◽

Sequencing Coverage ◽

Simplifying Assumptions ◽

Population Clustering

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, typically by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences which controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic analysis, we find that Consensify is less affected by branch length artefacts than methods based on standard pseudohaploidisation, and it performs similarly for population clustering analysis based on genetic distances. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other available methods. Although Consensify is developed with palaeogenomic data in mind, it is applicable for any low to medium coverage short read datasets. We predict that Consensify will be a useful tool for future studies of palaeogenomes.

Overlap detection on long, error-prone sequencing reads via smooth q-gram

Bioinformatics ◽

10.1093/bioinformatics/btaa252 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4838-4845

Author(s):

Yan Song ◽

Haixu Tang ◽

Haoyu Zhang ◽

Qin Zhang

Keyword(s):

Single Molecule ◽

De Novo ◽

Error Rates ◽

Supplementary Information ◽

Sequencing Error ◽

Fragment Assembly ◽

Detection Algorithms ◽

Third Generation Sequencing ◽

Oxford Nanopore ◽

Assembly Algorithms

Abstract Motivation Third generation sequencing techniques, such as the Single Molecule Real Time technique from PacBio and the MinION technique from Oxford Nanopore, can generate long, error-prone sequencing reads which pose new challenges for fragment assembly algorithms. In this paper, we study the overlap detection problem for error-prone reads, which is the first and most critical step in the de novo fragment assembly. We observe that all the state-of-the-art methods cannot achieve an ideal accuracy for overlap detection (in terms of relatively low precision and recall) due to the high sequencing error rates, especially when the overlap lengths between reads are relatively short (e.g. <2000 bases). This limitation appears inherent to these algorithms due to their usage of q-gram-based seeds under the seed-extension framework. Results We propose smooth q-gram, a variant of q-gram that captures q-gram pairs within small edit distances and design a novel algorithm for detecting overlapping reads using smooth q-gram-based seeds. We implemented the algorithm and tested it on both PacBio and Nanopore sequencing datasets. Our benchmarking results demonstrated that our algorithm outperforms the existing q-gram-based overlap detection algorithms, especially for reads with relatively short overlapping lengths. Availability and implementation The source code of our implementation in C++ is available at https://github.com/FIGOGO/smoothq. Supplementary information Supplementary data are available at Bioinformatics online.
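
For readers unfamiliar with q-gram seeding, the following Python sketch shows the conventional exact-match version that the abstract contrasts with: every q-gram of one read is indexed, and the other read is scanned for identical q-grams. It does not implement the smooth q-gram variant, and the function names are assumptions.

```python
from collections import defaultdict


def qgram_index(seq, q):
    """Map each q-gram of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - q + 1):
        index[seq[i:i + q]].append(i)
    return index


def shared_seeds(read_a, read_b, q):
    """Return (pos_in_a, pos_in_b) pairs where the two reads share an exact q-gram."""
    index = qgram_index(read_a, q)
    seeds = []
    for j in range(len(read_b) - q + 1):
        for i in index.get(read_b[j:j + q], ()):
            seeds.append((i, j))
    return seeds
```

Because a seed requires q consecutive identical bases, a single sequencing error removes up to q seeds around it, which is the weakness on short overlaps that motivates tolerating small edit distances between q-grams.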

LePrimAlign: local entropy-based alignment of PPI networks to predict conserved modules

BMC Genomics ◽

10.1186/s12864-019-6271-3 ◽

2019 ◽

Vol 20 (S9) ◽

Cited By ~ 1

Author(s):

Sawal Maskey ◽

Young-Rae Cho

Keyword(s):

Computational Cost ◽

Network Alignment ◽

System Level ◽

Local Network ◽

Local Alignment ◽

Global Alignment ◽

Alignment Algorithm ◽

Interaction Patterns ◽

Ppi Networks ◽

Alignment Algorithms

Abstract Background Cross-species analysis of protein-protein interaction (PPI) networks provides an effective means of detecting conserved interaction patterns. Identifying such conserved substructures between PPI networks of different species increases our understanding of the principles deriving evolution of cellular organizations and their functions in a system level. In recent years, network alignment techniques have been applied to genome-scale PPI networks to predict evolutionary conserved modules. Although a wide variety of network alignment algorithms have been introduced, developing a scalable local network alignment algorithm with high accuracy is still challenging. Results We present a novel pairwise local network alignment algorithm, called LePrimAlign, to predict conserved modules between PPI networks of three different species. The proposed algorithm exploits the results of a pairwise global alignment algorithm with many-to-many node mapping. It also applies the concept of graph entropy to detect initial cluster pairs from two networks. Finally, the initial clusters are expanded to increase the local alignment score that is formulated by a combination of intra-network and inter-network scores. The performance comparison with state-of-the-art approaches demonstrates that the proposed algorithm outperforms in terms of accuracy of identified protein complexes and quality of alignments. Conclusion The proposed method produces local network alignment of higher accuracy in predicting conserved modules even with large biological networks at a reduced computational cost.

High Performance Systolic Array Core Architecture Design for DNA Sequencer

MATEC Web of Conferences ◽

10.1051/matecconf/201815006009 ◽

2018 ◽

Vol 150 ◽

pp. 06009 ◽

Cited By ~ 1

Author(s):

Dayana Saiful Nurdin ◽

Mohd. Nazrin Md. Isa ◽

Rizalafande Che Ismail ◽

Muhammad Imran Ahmad

Keyword(s):

Systolic Array ◽

Dna Sequences ◽

High Performance ◽

Computation Time ◽

Local Alignment ◽

Architecture Design ◽

Alignment Algorithm ◽

The Core ◽

Field Programmable ◽

Dna Sequencer

This paper presents a high performance systolic array (SA) core architecture design for a Deoxyribonucleic Acid (DNA) sequencer. The core implements the affine gap penalty score Smith-Waterman (SW) algorithm. This time-consuming local alignment algorithm guarantees optimal alignment between DNA sequences, but it requires quadratic computation time when performed on standard desktop computers. The use of linear SA decreases the time complexity from quadratic to linear. In addition, with the exponential growth of DNA databases, the SA architecture is used to overcome the timing issue. In this work, the SW algorithm has been captured using Verilog Hardware Description Language (HDL) and simulated using Xilinx ISIM simulator. The proposed design has been implemented in a Xilinx Virtex-6 Field Programmable Gate Array (FPGA) and reduces the core area by 90%.
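
As a software reference for the recurrence such a core evaluates, here is a minimal Python sketch of Smith-Waterman with affine gap penalties (the Gotoh formulation). It says nothing about the paper's Verilog design; the scoring values are assumptions, and a gap of length L is charged `gap_open + (L - 1) * gap_extend`.

```python
def sw_affine_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Local alignment score with affine gaps (scores only, no traceback)."""
    NEG = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)    # best score ending at each cell of the previous row
    f_prev = [NEG] * (m + 1)  # best score ending in a vertical gap, previous row
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [NEG] * (m + 1)
        e = NEG  # best score ending in a horizontal gap within this row
        for j in range(1, m + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(0, h_prev[j - 1] + s, e, f_cur[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev, f_prev = h_cur, f_cur
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal are independent of each other; a linear array of processing elements can exploit this to finish in a number of steps proportional to the sum of the sequence lengths.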

F1000Research TMATCH: A New Algorithm for Protein Alignments using amino-acid hydrophobicities

10.1101/2019.12.16.878744 ◽

2019 ◽

Cited By ~ 1

Author(s):

David Cavanaugh ◽

Krishnan Chittur

Keyword(s):

Amino Acids ◽

Dynamic Programming ◽

Local Alignment ◽

Alignment Algorithm ◽

Hydrophobicity Scale ◽

Fundamental Properties ◽

Extra Information ◽

Alignment Algorithms ◽

Gap Opening ◽

Protein Alignments

Abstract The identification of proteins of similar structure using sequence alignment is an important problem in bioinformatics. We describe TMATCH, a basic dynamic programming alignment algorithm which can rapidly identify proteins of similar structure from a database. TMATCH was developed to utilize an optimal hydrophobicity metric for alignments traceable to fundamental properties of amino-acids. Standard alignment algorithms use affine gap penalties as contrasted with the TMATCH algorithm adaptation of local alignment score reinforcement of favorable diagonal paths (transitions) and punishment of unfavorable transitions paired with fixed gap opening penalties. The TMATCH algorithm is especially designed to take advantage of the extra information available within the hydrophobicity scale to detect homologies, as opposed to the probabilities derived from raw percent identities.

Long-read sequencing technology indicates genome-wide effects of non-B DNA on polymerization speed and error rate

10.1101/237461 ◽

2017 ◽

Author(s):

Wilfried M. Guiblet ◽

Marzia A. Cremona ◽

Monika Cechova ◽

Robert S. Harris ◽

Iva Kejnovska ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Tandem Repeats ◽

Neurological Diseases ◽

Error Rates ◽

Polymerization Kinetics ◽

Sequencing Error ◽

Dna Polymerization ◽

Sequencing Errors ◽

Genome Wide

Abstract DNA conformation may deviate from the classical B-form in ~13% of the human genome. Non-B DNA regulates many cellular processes; however, its effects on DNA polymerization speed and accuracy have not been investigated genome-wide. Such an inquiry is critical for understanding neurological diseases and cancer genome instability. Here we present the first simultaneous examination of DNA polymerization kinetics and errors in the human genome sequenced with Single-Molecule-Real-Time technology. We show that polymerization speed differs between non-B and B-DNA: it decelerates at G-quadruplexes and fluctuates periodically at disease-causing tandem repeats. Analyzing polymerization kinetics profiles, we predict and validate experimentally non-B DNA formation for a novel motif. We demonstrate that several non-B motifs affect sequencing errors (e.g., G-quadruplexes increase error rates) and that sequencing errors are positively associated with polymerase slowdown. Finally, we show that highly divergent G4 motifs have pronounced polymerization slowdown and high sequencing error rates, suggesting similar mechanisms for sequencing errors and germline mutations.

Time Efficient Segmented Technique for Dynamic Programming Based Algorithms with FPGA Implementation

Journal of Circuits System and Computers ◽

10.1142/s021812661950227x ◽

2019 ◽

Vol 28 (13) ◽

pp. 1950227

Author(s):

Talal Bonny ◽

Ridhwan Al Debsi ◽

Mohamed Basel Almourad

Keyword(s):

Dynamic Programming ◽

Sequence Alignment ◽

Input Parameter ◽

Dynamic Programming Algorithm ◽

Computation Time ◽

Longest Common Subsequence ◽

Programming Algorithm ◽

Optimization Approach ◽

Alignment Algorithm ◽

Optimal Sequence

Although dynamic programming (DP) is an optimization approach used to solve complex problems quickly, the time required is still not efficient and grows polynomially with the size of the input. In this contribution, we improve the computation time of dynamic programming based algorithms by proposing a novel technique called “SDP: Segmented Dynamic Programming”. SDP finds the best way of splitting the compared sequences into segments and then applies the dynamic programming algorithm to each segment individually. This reduces the computation time dramatically. SDP may be applied to any dynamic programming based algorithm to improve its computation time. As case studies, we apply the SDP technique to two different dynamic programming based algorithms: “Needleman–Wunsch (NW)”, the widely used program for optimal sequence alignment, and the LCS algorithm, which finds the “Longest Common Subsequence” between two input strings. The results show that applying the SDP technique in conjunction with the DP based algorithms improves the computation time by up to 80% in comparison to the sole DP algorithms, with small or negligible degradation in the results. This degradation is controllable and depends on the number of split segments, which is an input parameter. We also compare our results with the well-known heuristic FASTA sequence alignment algorithm, “GGSEARCH”, and show that our results are much closer to the optimal results than those of “GGSEARCH”. The results hold regardless of the sequences' lengths and their level of similarity. To show the functionality of our technique on the hardware and to verify the results, we implement it on the Xilinx Zynq-7000 FPGA.
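
The trade-off described above can be illustrated with a short Python sketch on the LCS case. The abstract does not say how SDP chooses its split points, so the even split used here (`split_even`) is an assumption made only for illustration, as are all function names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence via the standard DP."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def split_even(s, parts):
    """Split s into `parts` contiguous chunks of near-equal length (hypothetical rule)."""
    step, rem = divmod(len(s), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + step + (1 if p < rem else 0)
        chunks.append(s[start:end])
        start = end
    return chunks


def segmented_lcs_length(a, b, parts):
    """Run the DP on matching segment pairs and add up the results."""
    pairs = zip(split_even(a, parts), split_even(b, parts))
    return sum(lcs_length(x, y) for x, y in pairs)
```

With `parts` segments the total table area shrinks by about a factor of `parts`, but the summed result is only a lower bound on the true LCS length, since common subsequences that cross segment boundaries are missed. On `"AGGTAB"` and `"GXTXAYB"` the full DP gives 4, while two even segments give 3.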
