LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes

PeerJ ◽

10.7717/peerj.10906 ◽

2021 ◽

Vol 9 ◽

pp. e10906

Author(s):

Long Tian ◽

Reza Mazloom ◽

Lenwood S. Heath ◽

Boris A. Vinatzer

Keyword(s):

Large Set ◽

Computational Pipeline ◽

Genome Sequences ◽

Alignment Free ◽

Prokaryotic Genomes ◽

Highly Correlated ◽

Query Genome ◽

Genomic Similarity ◽

Memory Efficient ◽

Similarity Matrices

Background Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods. Methods Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow takes advantage of the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome, and of the precision of the alignment-based pyani software to compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared the time needed and the generated similarity matrices with those of other tools. Results LINflow is up to 150 times faster than pyani, and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, its ANI values occasionally depart from those computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects in which new genome sequences need to be regularly added to an existing dataset.
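The incremental procedure described in this abstract (fast screen, one precise comparison, LIN assignment, inference from LINs) can be illustrated with a minimal Python sketch. This is not LINflow's code and does not call sourmash or pyani: a k-mer Jaccard score stands in for the alignment-free screen, a position-wise identity function (`toy_identity`) stands in for the alignment-based ANI, and the threshold list, function names, and the simplified LIN assignment rule are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

LIN = Tuple[int, ...]
# Hypothetical ANI thresholds (percent), one per LIN position, ascending.
THRESHOLDS: List[float] = [70.0, 80.0, 90.0, 95.0, 99.0]


def kmer_set(seq: str, k: int = 4) -> Set[str]:
    """All k-mers of a sequence (toy stand-in for a sketch/signature)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (the fast, approximate screen)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def toy_identity(a: str, b: str) -> float:
    """Position-wise percent identity; a stand-in for a precise ANI tool."""
    n = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n if n else 0.0


def assign_lin(ani: float, ref_lin: LIN, existing: List[LIN]) -> LIN:
    """Copy the reference LIN up to the last threshold the ANI meets,
    take the next unused number at the following position, then pad with 0."""
    shared = sum(1 for t in THRESHOLDS if ani >= t)
    if shared == len(THRESHOLDS):
        return ref_lin  # indistinguishable at this resolution
    prefix = ref_lin[:shared]
    used = [lin[shared] for lin in existing if lin[:shared] == prefix]
    next_number = max(used, default=-1) + 1
    return prefix + (next_number,) + (0,) * (len(THRESHOLDS) - shared - 1)


def add_genome(name: str, seq: str, db: Dict[str, Tuple[str, LIN]],
               precise_ani: Callable[[str, str], float] = toy_identity) -> LIN:
    """Add one genome: screen all stored genomes cheaply, then run the
    expensive comparison only against the single best hit."""
    if not db:
        lin: LIN = (0,) * len(THRESHOLDS)
    else:
        query = kmer_set(seq)
        best = max(db, key=lambda g: jaccard(query, kmer_set(db[g][0])))
        ani = precise_ani(seq, db[best][0])
        lin = assign_lin(ani, db[best][1], [entry[1] for entry in db.values()])
    db[name] = (seq, lin)
    return lin


def inferred_ani_floor(a: LIN, b: LIN) -> Optional[float]:
    """Lower bound on ANI implied by the shared LIN prefix of two genomes;
    None means the pair falls below the lowest threshold."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return THRESHOLDS[shared - 1] if shared else None
```

In this sketch, each added genome costs one precise comparison rather than one per stored genome, and every other pairwise value is read off the shared LIN prefix as a threshold bracket. That bracket is only as fine as the threshold list, which is one way to see why inferred values can occasionally depart from directly computed ones.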


Afann: bias adjustment for alignment-free sequence comparison based on sequencing data using neural network regression

Genome Biology ◽

10.1186/s13059-019-1872-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 4

Author(s):

Kujin Tang ◽

Jie Ren ◽

Fengzhu Sun

Keyword(s):

Neural Network ◽

Sequence Comparison ◽

Sequencing Data ◽

Genome Sequences ◽

Excellent Performance ◽

Bias Adjustment ◽

Link Type ◽

Alignment Free ◽

Memory Efficient

Abstract Alignment-free methods, more time and memory efficient than alignment-based methods, have been widely used for comparing genome sequences or raw sequencing samples without assembly. However, in this study, we show that alignment-free dissimilarity calculated based on sequencing samples can be overestimated compared with the dissimilarity calculated based on their genomes, and this bias can significantly decrease the performance of the alignment-free analysis. Here, we introduce a new alignment-free tool, Alignment-Free methods Adjusted by Neural Network (Afann), that successfully adjusts this bias and achieves excellent performance on various independent datasets. Afann is freely available at https://github.com/GeniusTang/Afann.


The dynamic neural code of the retina for natural scenes

10.1101/340943 ◽

2018 ◽

Cited By ~ 5

Author(s):

Niru Maheswaranathan ◽

Lane T. McIntosh ◽

Hidenori Tanaka ◽

Satchel Grant ◽

David B. Kastner ◽

...

Keyword(s):

Visual Processing ◽

Ganglion Cells ◽

Predictive Coding ◽

Neural Code ◽

Natural Scenes ◽

Large Set ◽

Fundamental Limits ◽

Fundamental Goal ◽

Latent Effects ◽

Highly Correlated

Abstract Understanding how the visual system encodes natural scenes is a fundamental goal of sensory neuroscience. We show here that a three-layer network model predicts the retinal response to natural scenes with an accuracy nearing the fundamental limits of predictability. The model’s internal structure is interpretable, in that model units are highly correlated with interneurons recorded separately and not used to fit the model. We further show the ethological relevance to natural visual processing of a diverse set of phenomena of complex motion encoding, adaptation and predictive coding. Our analysis uncovers a fast timescale of visual processing that is inaccessible directly from experimental data, showing unexpectedly that ganglion cells signal in distinct modes by rapidly (< 0.1 s) switching their selectivity for direction of motion, orientation, location and the sign of intensity. A new approach that decomposes ganglion cell responses into the contribution of interneurons reveals how the latent effects of parallel retinal circuits generate the response to any possible stimulus. These results reveal extremely flexible and rapid dynamics of the retinal code for natural visual stimuli, explaining the need for a large set of interneuron pathways to generate the dynamic neural code for natural scenes.


Cluster analysis of coronavirus sequences using computational sequence descriptors: With applications to SARS, MERS and SARS-CoV-2 (CoVID-19)

Current Computer-Aided Drug Design ◽

10.2174/1573409917666210202092646 ◽

2021 ◽

Vol 17 ◽

Author(s):

Marjan Vračko ◽

Subhash C. Basak ◽

Tathagata Dey ◽

Ashesh Nandy

Keyword(s):

Cluster Analysis ◽

Virus Type ◽

Genome Sequences ◽

Alignment Free ◽

The World

Background: Study of 573 genome sequences belonging to SARS, MERS and SARS-CoV-2 (CoVID-19) viruses. Objective: To compare the virus sequences, which originate from different places around the world. Methods: Alignment-free methods for representation of sequences and chemometrical methods for analysis of clusters. Results: The majority of genome sequences are clustered with respect to virus type, but some of them are outliers. Conclusion: We identify 71 sequences that tend to belong to more than one cluster.


Whole-proteome tree of life suggests a deep burst of organism diversity

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1915766117 ◽

2020 ◽

Vol 117 (7) ◽

pp. 3678-3686 ◽

Cited By ~ 5

Author(s):

JaeJin Choi ◽

Sung-Hou Kim

Keyword(s):

Information Theory ◽

Genome Sequence ◽

Tree Of Life ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequences ◽

Alignment Free ◽

Whole Transcriptome ◽

Evolutionary Progression ◽

Feature Frequency

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of whole genome, of whole transcriptome, and of whole proteome. Of the three, whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying information theory-based feature frequency profile method, an “alignment-free” method, gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences on grouping and kinship among the groups at deep evolutionary levels. The ToL reveals 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; 3) all of the founders of the groups have emerged in a “deep burst” at the very beginning period near the root of the ToL—an explosive birth of life’s diversity.


Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2018-0045 ◽

2019 ◽

Vol 18 (2) ◽

Author(s):

Hsin-Hsiung Huang ◽

Senthil Balaji Girimurugan

Keyword(s):

Discriminant Analysis ◽

Wavelet Packet ◽

Wavelet Packet Transform ◽

Discrete Wavelet ◽

Whole Genome ◽

Statistical Classification ◽

Genome Sequences ◽

Discrete Wavelet Packet Transform ◽

Alignment Free ◽

Free Representation

Abstract In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.


Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

GigaScience ◽

10.1093/gigascience/giaa048 ◽

2020 ◽

Vol 9 (5) ◽

Cited By ~ 1

Author(s):

Morteza Hosseini ◽

Diogo Pratas ◽

Burkhard Morgenstern ◽

Armando J Pinho

Keyword(s):

Dna Sequences ◽

Large Scale ◽

High Throughput Sequencing ◽

Genetic Disorders ◽

Chromosomal Evolution ◽

Genomic Rearrangements ◽

Efficient Tool ◽

Compression Technique ◽

Alignment Free ◽

Memory Efficient

Abstract Background The development of high-throughput sequencing technologies and, as a result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. The study of genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.


Comparison of N2O5 mixing ratios during NO3Comp 2007 in SAPHIR

Atmospheric Measurement Techniques ◽

10.5194/amt-5-2763-2012 ◽

2012 ◽

Vol 5 (11) ◽

pp. 2763-2777 ◽

Cited By ~ 16

Author(s):

H. Fuchs ◽

W. R. Simpson ◽

R. L. Apodaca ◽

T. Brauers ◽

R. C. Cohen ◽

...

Keyword(s):

Correlation Coefficients ◽

Laser Induced Fluorescence ◽

Transmission Efficiency ◽

Large Set ◽

Mixing Ratios ◽

Aerosol Exposure ◽

Accuracy Of Measurements ◽

Highly Correlated ◽

Cavity Ring Down

Abstract. N2O5 detection in the atmosphere has been accomplished using techniques which have been developed during the last decade. Most techniques use a heated inlet to thermally decompose N2O5 to NO3, which can be detected by either cavity-based absorption at 662 nm or laser-induced fluorescence. In summer 2007, a large set of instruments, which were capable of measuring NO3 mixing ratios, were simultaneously deployed in the atmosphere simulation chamber SAPHIR in Jülich, Germany. Some of these instruments measured N2O5 mixing ratios either simultaneously or alternatively. Experiments focused on the investigation of potential interferences from, e.g., water vapour or aerosol and on the investigation of the oxidation of biogenic volatile organic compounds by NO3. The comparison of N2O5 mixing ratios shows an excellent agreement between measurements of instruments applying different techniques (3 cavity ring-down (CRDS) instruments, 2 laser-induced fluorescence (LIF) instruments). Datasets are highly correlated as indicated by the square of the linear correlation coefficients, R², whose values were larger than 0.96 for the entire datasets. N2O5 mixing ratios agree well within the combined accuracy of measurements. Slopes of the linear regression range between 0.87 and 1.26 and intercepts are negligible. The most critical aspect of N2O5 measurements by cavity ring-down instruments is the determination of the inlet and filter transmission efficiency. Measurements here show that the N2O5 inlet transmission efficiency can decrease in the presence of high aerosol loads, and that frequent filter/inlet changing is necessary to quantitatively sample N2O5 in some environments. The analysis of data also demonstrates that a general correction for degrading filter transmission is not applicable for all conditions encountered during this campaign. Besides the effect of a gradual degradation of the inlet transmission efficiency by aerosol exposure, no other interference for N2O5 measurements is found.


Whole-Proteome Tree of Arthropods: An “alignment-free” phylogeny of proteome “books”

10.21203/rs.3.rs-46859/v1 ◽

2020 ◽

Author(s):

Sung-Hou Kim ◽

JaeJin Choi ◽

Byung-Ju Kim

Keyword(s):

Text Analysis ◽

Gene Tree ◽

Whole Genome ◽

Gene Trees ◽

View Point ◽

Analysis Method ◽

Genome Sequences ◽

Alignment Free ◽

Grouping Pattern ◽

Almost All

Abstract Background: An “organism tree” of a group of extant organisms can be considered as a conceptual tree to capture a simplified narrative of the evolutionary course among the organisms. Due to the difficulties of whole-genome sequencing for many organisms, the most common approach has been to construct a “gene tree” by selecting a group of genes common among the organisms, “align” each gene family and estimate evolutionary distances. Despite broad acceptance of the gene trees as the surrogates for the organism trees, there are important limitations and confounding issues with the approach. During the last decades, whole-genome sequences of many extant arthropods became available, providing an opportunity to construct a “whole-proteome tree” of the arthropods, the largest and most species-diverse group of all living animals. Results: An “alignment-free” whole-proteome tree of the arthropods shows that (a) the demographic grouping-pattern is similar to those in the gene trees, but there are notable differences in the branching orders of the groups and the sisterhood relationships between pairs of the groups; and (b) almost all the “founders” of the groups have emerged in an “explosive burst” near the root of the tree. Conclusion: Since the whole-proteome sequence of an organism can be considered as a “book” of amino-acid alphabets, a tree of the books can be constructed, without alignment of sequences, using a text analysis method of Information Theory, which allows comparing the information content of whole-proteomes. Such a tree provides another viewpoint to consider in telling the narrative of kinship among the arthropods.


Genome-Based Taxonomic Rearrangement of the Order Geobacterales Including the Description of Geomonas azotofigens sp. nov. and Geomonas diazotrophica sp. nov.

Frontiers in Microbiology ◽

10.3389/fmicb.2021.737531 ◽

2021 ◽

Vol 12 ◽

Author(s):

Zhenxing Xu ◽

Yoko Masuda ◽

Xueding Wang ◽

Natsumi Ushijima ◽

Yutaka Shiratori ◽

...

Keyword(s):

Taxonomic Status ◽

Terrestrial Ecosystems ◽

Single Copy ◽

Amino Acid Identity ◽

Taxonomic Structure ◽

Phylogenomic Analysis ◽

Genome Sequences ◽

Average Amino Acid Identity ◽

The Family ◽

Genomic Similarity

Geobacterales is a recently proposed order comprising members who originally belonged to the well-known family Geobacteraceae, which is a key group in terrestrial ecosystems involved in biogeochemical cycles and has been widely investigated in bioelectrochemistry and bioenergy fields. Previous studies have illustrated the taxonomic structure of most members in this group based on genomic phylogeny; however, several members are still in a pendent or chaotic taxonomic status owing to the lack of genome sequences. To address this issue, we performed this taxonomic reassignment using currently available genome sequences, along with the description of two novel paddy soil-isolated strains, designated Red51T and Red69T, which are phylogenetically located within this order. Phylogenomic analysis based on 120 ubiquitous single-copy proteins robustly separated the species Geobacter luticola from other known genera and placed the genus Oryzomonas (fam. Geobacteraceae) into the family ‘Pseudopelobacteraceae’; thus, a novel genus Geomobilimonas is proposed, and the family ‘Pseudopelobacteraceae’ was emended. Moreover, genomic comparisons with similarity indexes, including average amino acid identity (AAI), percentage of conserved protein (POCP), and average nucleotide identity (ANI), showed proper thresholds as genera boundaries in this order with values of 70%, 65%, and 74% for AAI, POCP, and ANI, respectively. Based on this, the three species Geobacter argillaceus, Geobacter pelophilus, and Geobacter chapellei should be three novel genera, for which the names Geomobilibacter, Geoanaerobacter, and Pelotalea are proposed, respectively. In addition, the two novel isolated strains phylogenetically belonged to the genus Geomonas, family Geobacteraceae, and shared genomic similarity values higher than those of genera boundaries, but lower than those of species boundaries with each other and their neighbors. Taken together with phenotypic and chemotaxonomic characteristics similar to other Geomonas species, these two strains, Red51T and Red69T, represent two novel species in the genus Geomonas, for which the names Geomonas azotofigens sp. nov. and Geomonas diazotrophica sp. nov. are proposed, respectively.
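As a small illustration of how the genus-boundary values quoted in this abstract (70% AAI, 65% POCP, 74% ANI) might be applied to a pair of genomes, the following Python sketch checks each index against its cutoff. The function names are hypothetical, and two choices are assumptions rather than anything stated in the abstract: a value exactly on a boundary counts as above it, and all three indices must agree.

```python
from typing import Dict

# Genus-boundary values quoted in the abstract, in percent.
GENUS_BOUNDARIES: Dict[str, float] = {"AAI": 70.0, "POCP": 65.0, "ANI": 74.0}


def indices_above_genus_boundary(values: Dict[str, float]) -> Dict[str, bool]:
    """Report, per index, whether a genome pair reaches the genus boundary.
    Treating equality as 'above' is an assumption of this sketch."""
    return {name: values[name] >= cutoff
            for name, cutoff in GENUS_BOUNDARIES.items()}


def same_genus(values: Dict[str, float]) -> bool:
    """Call a pair congeneric only if every index reaches its boundary.
    Requiring all three to agree is an assumption of this sketch."""
    return all(indices_above_genus_boundary(values).values())
```

For example, a pair with hypothetical values of 75% AAI, 70% POCP and 80% ANI would be reported as congeneric, while a pair whose AAI falls to 68% would not, regardless of the other two indices.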


Effects of pruning a temperate mangrove forest on the associated assemblages of macroinvertebrates

Marine and Freshwater Research ◽

10.1071/mf02083 ◽

2003 ◽

Vol 54 (5) ◽

pp. 683 ◽

Cited By ~ 5

Author(s):

William Gladstone ◽

Maria J. Schreider

Keyword(s):

Mangrove Forest ◽

Forest Canopy ◽

Sampling Period ◽

Mangrove Forests ◽

Short Term ◽

Macrobenthic Assemblages ◽

Highly Correlated ◽

Taxonomic Groups ◽

Macroinvertebrate Fauna ◽

Similarity Matrices

Mangrove forests around the world are being impacted by development in adjacent land and water areas. An after-control-impact study was undertaken to assess the effects of mangrove forest pruning on the associated benthic macroinvertebrate fauna. Pruning, undertaken 5 years before our sampling period, reduced the height of the forest canopy from 5 m to 1 m. Macrobenthic assemblages were sampled in September 2000 and January 2001 from two randomly selected sites within the pruned section of forest, and two sites in each of two control locations in the same forest. Assemblage composition in the pruned and undisturbed mangrove forests was not significantly different, nor were there significant differences in variability between the two areas. Similarity matrices for assemblages based on higher taxonomic groups and molluscs were highly correlated with similarity matrices for all taxa, indicating the utility of more rapid forms of assessment in this habitat. The results suggest that although short-term impacts may have occurred, no impact on macroinvertebrate assemblages was evident 5 years after the pruning.
