SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

10.1101/2020.11.24.395459 ◽

2020 ◽

Author(s):

Cory D. Dunn

Keyword(s):

Nucleic Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

DNA Barcodes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract: Phylogenetic analyses can take advantage of multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy values of alignment columns to identify outlier alignment sequences in a manner responsive to overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and COVID-19 sequences.
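
The per-column Shannon entropy mentioned in the abstract can be illustrated with a minimal Python sketch. It only computes column entropies; it does not reproduce how SequenceBouncer turns those values into outlier calls, which the abstract does not specify. The function names and the choice to ignore gap characters are assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column.

    Gap characters ('-') are ignored here; that is an assumption of this
    sketch, not a documented SequenceBouncer behaviour.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(sequences):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*sequences)]


# Toy alignment: columns 1 and 2 are invariant, column 4 is maximally variable.
alignment = ["ACGT", "ACGA", "ACTC", "AC-G"]
entropies = alignment_entropies(alignment)
```

Invariant columns score zero and a column with four equally frequent nucleotides scores two bits, so high-entropy columns mark the variable regions of the alignment.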

Benchmarking Statistical Multiple Sequence Alignment

10.1101/304659 ◽

2018 ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Ehsan Saleh ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structural Alignment ◽

Estimation Method ◽

Simulated Data ◽

Protein Sequences ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Simulated Data Sets

Abstract: The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations.

Keywords: multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology

MSAC: Compression of multiple sequence alignment files

10.1101/240341 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz ◽

Joanna Walczyszyn ◽

Agnieszka Debudaj-Grabysz

Keyword(s):

Sequence Alignment ◽

Compression Ratio ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Link Type ◽

Bioinformatics Databases ◽

Supplementary Material ◽

Burrows Wheeler Transform

Abstract

Motivation: Bioinformatics databases grow rapidly and have reached sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB are multiple sequence alignments of protein families. The largest database of such alignments, Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data has become a challenge.

Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, i.e., gzip. The experiments also yielded an analysis of the influence of protein family size on the compression ratio.

Availability: MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/[email protected]

Supplementary material: Supplementary data are available at the publisher Web site.

ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference

10.1101/2020.06.08.140384 ◽

2020 ◽

Cited By ~ 3

Author(s):

Jacob L. Steenwyk ◽

Thomas J. Buida ◽

Yuanning Li ◽

Xing-Xing Shen ◽

Antonis Rokas

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Inference ◽

Recent Analysis ◽

Sequence Alignments ◽

Multiple Sequence ◽

Time Saving ◽

Multiple Sequence Alignments

Abstract: Highly divergent sites in multiple sequence alignments, which stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Trimming methods aim to remove these sites before phylogenetic inference, but recent analysis suggests that doing so can worsen inference. We introduce ClipKIT, a trimming method that instead aims to retain phylogenetically-informative sites; phylogenetic inference using ClipKIT-trimmed alignments is accurate, robust, and time-saving.

MSABrowser: dynamic and fast visualization of sequence alignments, variations, and annotations

10.1101/2021.04.05.426321 ◽

2021 ◽

Author(s):

Furkan M. Torun ◽

Halil I. Bilgin ◽

Oktay I. Kaplan

Keyword(s):

Sequence Alignment ◽

Scientific Community ◽

Protein Sequences ◽

Genetic Variations ◽

Sequence Alignments ◽

Multiple Sequence ◽

Web Browser ◽

Post Translational Modifications ◽

Multiple Sequence Alignments ◽

Similarities And Differences

Sequence alignment is an excellent way to visualize the similarities and differences between DNA, RNA, or protein sequences, yet it is currently difficult to jointly view sequence alignment data with genetic variations, modifications such as post-translational modifications, and annotations (e.g., protein domains). Here, we develop the MSABrowser tool, which makes it easy to co-visualize genetic variations, modifications, and annotations at the respective positions of amino acids or nucleotides in pairwise or multiple sequence alignments. MSABrowser is developed entirely in JavaScript and works in any modern web browser on any platform, including Linux, Mac OS X, and Windows systems, without any installation. MSABrowser is also freely available for the benefit of the scientific community.

fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

10.1101/2021.12.20.473431 ◽

2021 ◽

Author(s):

Liang Hong ◽

Siqi Sun ◽

Liangzhen Zheng ◽

Qingxiong Tan ◽

Yu Li

Keyword(s):

Protein Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Structure And Function ◽

Sequence Alignments ◽

Protein Structure And Function ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

And Function

Evolutionarily related sequences provide information about protein structure and function. Multiple sequence alignment, which includes homolog searching in large databases followed by sequence alignment, is an effective way to extract this information and assist protein structure and function prediction, as demonstrated by AlphaFold. Despite the existing tools for multiple sequence alignment, searching for homologs across the entire UniProt is still time-consuming. Given the success of AlphaFold, large-scale multiple sequence alignment against massive databases will foreseeably become a trend in the field, so accelerating this step is highly desirable. Here, we propose a novel method, fastMSA, that improves the speed significantly. Our idea is orthogonal to all previous acceleration methods. Taking advantage of a protein language model based on BERT, we propose a novel dual encoder architecture that embeds protein sequences into a low-dimensional space and efficiently filters out unrelated sequences before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with downstream tasks, such as structure prediction using AlphaFold. Using multiple sequence alignments generated by our method, we observe little loss in protein structure prediction performance with much less running time. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.
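
The retrieve-then-align idea described above (embed sequences, discard unrelated ones, then run the expensive search on what remains) can be sketched in Python as follows. This is not fastMSA's model: `toy_embed` is a hypothetical stand-in that uses residue composition instead of a BERT-based dual encoder, and the names `prefilter` and `top_k` are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(seq):
    """Stand-in encoder (assumption): unit-length residue-composition vector.

    A real system would use a learned protein language model here.
    """
    counts = Counter(seq)
    vec = [counts.get(a, 0) for a in AMINO_ACIDS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def prefilter(query, database, top_k):
    """Return the names of the top_k database entries closest to the query.

    Similarity is the dot product of unit vectors (cosine similarity).
    """
    q = toy_embed(query)
    scored = []
    for name, seq in database.items():
        d = toy_embed(seq)
        scored.append((sum(a * b for a, b in zip(q, d)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


database = {
    "close_homolog": "MKVLAAGIVGL",
    "distant": "MKTLAGGIIGL",
    "unrelated": "WWWWPPPPHHH",
}
# Only the surviving candidates would be passed on to the alignment step.
candidates = prefilter("MKVLAAGIVGA", database, top_k=2)
```

Because the filter runs on fixed-length vectors, its cost per database entry is independent of the alignment algorithm that follows, which is where a speed-up of this kind would come from.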

Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmark

International Journal of Applied Mathematics and Computer Science ◽

10.2478/v10006-009-0054-y ◽

2009 ◽

Vol 19 (4) ◽

pp. 675-678 ◽

Cited By ~ 5

Author(s):

Jacek Błażewicz ◽

Piotr Formanowicz ◽

Paweł Wojciechowski

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Formal Definitions ◽

Accuracy Measures ◽

Total Column ◽

Better Than

BAliBASE is one of the most widely used benchmarks for multiple sequence alignment programs. The accuracy of alignment methods is measured by bali_score, an application provided together with the database. The standard accuracy measures are the Sum of Pairs (SP) and the Total Column (TC). We have found that, for non-core block columns, results calculated by bali_score are different from those obtained on the basis of the formal definitions of the measures. We do not claim that one of these measures is better than the other, but they are definitely different. Such a situation can be the source of confusion when alignments obtained using various methods are compared. Therefore, we propose a new nomenclature for the measures of the quality of multiple sequence alignments to distinguish which one was actually calculated. Moreover, we have found that the occurrence of a gap in some column in the first sequence of the reference alignment causes column discarding.
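
For orientation, the SP and TC measures can be sketched in Python under one common reading of their formal definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. This is not the bali_score program; the treatment of gap-containing columns and the absence of core-block handling are assumptions of this sketch.

```python
from itertools import combinations


def column_residues(alignment):
    """For each column, a tuple of per-sequence residue indices (None for gaps)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for i, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[i])
                counters[i] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Set of ((seq_i, res_i), (seq_j, res_j)) pairs sharing a column."""
    pairs = set()
    for col in columns:
        present = [(i, r) for i, r in enumerate(col) if r is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment also pairs."""
    ref_pairs = aligned_pairs(column_residues(reference))
    test_pairs = aligned_pairs(column_residues(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of reference columns that appear identically in the test."""
    ref_cols = column_residues(reference)
    test_cols = set(column_residues(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)


reference = ["ACGT", "A-GT"]
test = ["ACGT", "AG-T"]
sp = sp_score(test, reference)
tc = tc_score(test, reference)
```

In this toy case two of the three reference pairs and two of the four reference columns are recovered, which shows how a single misplaced gap affects the two measures differently.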

SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically

Protein and Peptide Letters ◽

10.2174/0929866526666190806143959 ◽

2020 ◽

Vol 27 (4) ◽

pp. 295-302 ◽

Cited By ~ 1

Author(s):

Qing Zhan ◽

Yilei Fu ◽

Qinghua Jiang ◽

Bo Liu ◽

Jiajie Peng ◽

...

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Middle Part ◽

The Other ◽

Initial Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Refinement Method ◽

Refinement Strategy

Background: Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms refine an alignment by dividing it horizontally into two groups and then realigning the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. Objective: In this article, our motivation is to develop a novel refinement method based on splitting-splicing vertically. Method: Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle, realign the middle part alone, and splice the realigned middle part with the other two initial pieces to obtain a refined alignment. In the realignment procedure of our method, the aligner focuses only on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. Results: We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that, given appropriate proportions to split the initial alignment, the average scores are comparable or slightly increased after using our method. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. Conclusion: The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.
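
The split, realign, and splice steps in the Method section can be sketched in Python as below. The sketch is not the SpliVert implementation: `pad_aligner` is a hypothetical placeholder that only pads sequences with gaps, standing in for a real MSA tool, and the column boundaries `start` and `end` are chosen by hand rather than by the proportions studied in the paper.

```python
def pad_aligner(seqs):
    """Placeholder 'aligner' (assumption): left-justify and pad with gaps.

    A real refinement would call an actual MSA tool here.
    """
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def split_realign_splice(alignment, start, end, realign=pad_aligner):
    """Vertically split an alignment into three parts and refine the middle.

    Columns [start:end] are stripped of gaps, realigned on their own, and
    spliced back between the untouched left and right parts.
    """
    left = [row[:start] for row in alignment]
    middle = [row[start:end].replace("-", "") for row in alignment]
    right = [row[end:] for row in alignment]
    new_middle = realign(middle)
    return [l + m + r for l, m, r in zip(left, new_middle, right)]


alignment = ["MK-A-LV", "MKG--LV", "MK-AGLV"]
refined = split_realign_splice(alignment, start=2, end=5)
```

Passing the aligner as a parameter keeps the splitting logic independent of the tool used for the middle part, which matches the abstract's description of applying the strategy to more than one MSA tool.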

MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments

BMC Bioinformatics ◽

10.1186/1471-2105-13-117 ◽

2012 ◽

Vol 13 (1) ◽

Cited By ~ 44

Author(s):

Peter W Collingridge ◽

Steven Kelly

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Dynamic Reconstruction ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference

PLoS Biology ◽

10.1371/journal.pbio.3001007 ◽

2020 ◽

Vol 18 (12) ◽

pp. e3001007

Author(s):

Jacob L. Steenwyk ◽

Thomas J. Buida ◽

Yuanning Li ◽

Xing-Xing Shen ◽

Antonis Rokas

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Inference ◽

Sequence Alignments ◽

Multiple Sequence ◽

Time Saving ◽

Multiple Sequence Alignments ◽

Alternative Alignment ◽

Robust Framework

Highly divergent sites in multiple sequence alignments (MSAs), which can stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Thus, several different trimming strategies have been developed for identifying and removing these sites prior to phylogenetic inference. However, a recent study reported that doing so can worsen inference, underscoring the need for alternative alignment trimming strategies. Here, we introduce ClipKIT, an alignment trimming software that, rather than identifying and removing putatively phylogenetically uninformative sites, instead aims to identify and retain parsimony-informative sites, which are known to be phylogenetically informative. To test the efficacy of ClipKIT, we examined the accuracy and support of phylogenies inferred from 14 different alignment trimming strategies, including those implemented in ClipKIT, across nearly 140,000 alignments from a broad sampling of evolutionary histories. Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time saving. Furthermore, ClipKIT consistently outperformed other trimming methods across diverse datasets, suggesting that strategies based on identifying and retaining parsimony-informative sites provide a robust framework for alignment trimming.
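
The notion of retaining parsimony-informative sites can be illustrated with a short Python sketch. It uses the standard textbook definition (a site is parsimony-informative when at least two character states each occur in at least two sequences) and is not ClipKIT's code; the function names and the decision to ignore only '-' and '?' are assumptions made for illustration.

```python
from collections import Counter


def is_parsimony_informative(column, ignore="-?"):
    """True if at least two character states each occur in >= 2 sequences.

    Gap and missing-data symbols are ignored (an assumption of this sketch).
    """
    counts = Counter(c for c in column.upper() if c not in ignore)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def keep_informative(alignment):
    """Return the alignment restricted to parsimony-informative columns,
    together with the 0-based indices of the columns that were kept."""
    columns = ["".join(col) for col in zip(*alignment)]
    kept = [i for i, col in enumerate(columns) if is_parsimony_informative(col)]
    trimmed = ["".join(row[i] for i in kept) for row in alignment]
    return trimmed, kept


alignment = ["ACGTA", "ACGTC", "ATGAA", "ATCAC"]
trimmed, kept = keep_informative(alignment)
```

Here the invariant first column and the third column, whose only variant is a singleton, are dropped, while the three columns that split the sequences into groups of two are retained.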

Tailor-made multiple sequence alignments using the PRALINE 2 alignment toolkit

Bioinformatics ◽

10.1093/bioinformatics/btz572 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5315-5317 ◽

Cited By ~ 1

Author(s):

Maurits J J Dijkstra ◽

Atze J van der Ploeg ◽

K Anton Feenstra ◽

Wan J Fokkink ◽

Sanne Abeln ◽

...

Keyword(s):

Secondary Structure ◽

Open Source ◽

Sequence Alignment ◽

Open Source Software ◽

Multiple Sequence Alignment ◽

Multiple Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

DNA Motifs ◽

Multiple Sequence Alignments

Abstract

Summary: PRALINE 2 is a toolkit for custom multiple sequence alignment workflows. It can be used to incorporate sequence annotations, such as secondary structure or (DNA) motifs, into the alignment scoring, as well as to customize many other aspects of a progressive multiple alignment workflow.

Availability and implementation: PRALINE 2 is implemented in Python and available as open source software on GitHub: https://github.com/ibivu/PRALINE/.
