CoCoNet: an efficient deep learning tool for viral metagenome binning

Bioinformatics ◽

10.1093/bioinformatics/btab213 ◽

2021 ◽

Author(s):

Cédric G Arisdakessian ◽

Olivia D Nigro ◽

Grieg F Steward ◽

Guylaine Poisson ◽

Mahdi Belcaid

Keyword(s):

Deep Learning ◽

Viral Genome ◽

High Performance ◽

Source Code ◽

Supplementary Information ◽

Biological Processes ◽

Bacterial Genomes ◽

Large Dataset ◽

Sequence Composition ◽

Rigorous Framework

Abstract Motivation Metagenomic approaches hold the potential to characterize microbial communities and unravel the intricate link between the microbiome and biological processes. Assembly is one of the most critical steps in metagenomics experiments. It consists of transforming overlapping DNA sequencing reads into sufficiently accurate representations of the community’s genomes. This process is computationally difficult and commonly results in genomes fragmented across many contigs. Computational binning methods are used to mitigate fragmentation by partitioning contigs based on their sequence composition, abundance or chromosome organization into bins representing the community’s genomes. Existing binning methods have been principally tuned for bacterial genomes and do not perform favorably on viral metagenomes. Results We propose Composition and Coverage Network (CoCoNet), a new binning method for viral metagenomes that leverages the flexibility and the effectiveness of deep learning to model the co-occurrence of contigs belonging to the same viral genome and provide a rigorous framework for binning viral contigs. Our results show that CoCoNet substantially outperforms existing binning methods on viral datasets. Availability and implementation CoCoNet was implemented in Python and is available for download on PyPI (https://pypi.org/). The source code is hosted on GitHub at https://github.com/Puumanamana/CoCoNet and the documentation is available at https://coconet.readthedocs.io/en/latest/index.html. CoCoNet does not require extensive resources to run. For example, binning 100k contigs took about 4 h on 10 Intel CPU Cores (2.4 GHz), with a memory peak at 27 GB (see Supplementary Fig. S9). To process a large dataset, CoCoNet may need to be run on a high RAM capacity server. Such servers are typically available in high-performance or cloud computing settings. Supplementary information Supplementary data are available at Bioinformatics online.
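
As a rough illustration of the sequence-composition signal that binning methods rely on, the sketch below computes a normalized k-mer frequency vector for a single contig. It is not CoCoNet's implementation: the function name `kmer_frequencies` and the default k = 4 are assumptions chosen for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the relative frequency of every A/C/G/T k-mer in seq.

    Hypothetical helper for illustration. Windows containing any
    other character (for example N) are ignored.
    """
    seq = seq.upper()
    # Fixed, lexicographically ordered k-mer vocabulary (4**k entries).
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


profile = kmer_frequencies("ACGTACGTTTGACA", k=2)
print(len(profile))
```

Two contigs from the same genome would be expected to have similar vectors, which is why such profiles are a common input feature for binning.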

HOPS: high-performance library for (non-)uniform sampling of convex-constrained models

Bioinformatics ◽

10.1093/bioinformatics/btaa872 ◽

2020 ◽

Author(s):

Johann F Jadebeck ◽

Axel Theorell ◽

Samuel Leweke ◽

Katharina Nöh

Keyword(s):

High Performance ◽

State Of The Art ◽

Source Code ◽

Third Party ◽

Supplementary Information ◽

Scalable Algorithms ◽

Uniform Sampling ◽

Non Uniform Sampling ◽

Constrained Models ◽

Performance Gains

Abstract Summary The C++ library Highly Optimized Polytope Sampling (HOPS) provides implementations of efficient and scalable algorithms for sampling convex-constrained models that are equipped with arbitrary target functions. For uniform sampling, substantial performance gains were achieved compared to the state-of-the-art. The ease of integration and utility of non-uniform sampling is showcased in a Bayesian inference setting, demonstrating how HOPS interoperates with third-party software. Availability and implementation Source code is available at https://github.com/modsim/hops/, tested on Linux and MS Windows, includes unit tests, detailed documentation, example applications and a Dockerfile. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
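
To make the idea of uniform sampling from a convex-constrained model concrete, here is a minimal Python sketch of hit-and-run sampling on a polytope {x : A x <= b}. Hit-and-run is a standard textbook technique and is used here as an assumption; the abstract does not state which algorithms HOPS implements, and this is not HOPS code. The function name `hit_and_run` and the unit-square example are illustrative.

```python
import math
import random


def hit_and_run(A, b, x0, n_samples, seed=0):
    """Sample points from the bounded polytope {x : A x <= b}.

    Illustrative sketch only. x0 must lie strictly inside the polytope.
    """
    rng = random.Random(seed)
    x = list(x0)
    samples = []
    for _ in range(n_samples):
        # Draw a direction uniformly on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Find the segment {x + t d} that stays inside every constraint.
        t_min, t_max = -math.inf, math.inf
        for row, bi in zip(A, b):
            ad = sum(r * v for r, v in zip(row, d))
            slack = bi - sum(r * v for r, v in zip(row, x))
            if ad > 1e-12:
                t_max = min(t_max, slack / ad)
            elif ad < -1e-12:
                t_min = max(t_min, slack / ad)
        # Move to a uniformly chosen point on that segment.
        t = rng.uniform(t_min, t_max)
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(x)
    return samples


# Unit square 0 <= x, y <= 1 written as A x <= b.
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [1, 0, 1, 0]
points = hit_and_run(A, b, [0.5, 0.5], 1000)
```

Each step needs only one pass over the constraints, which is part of why this family of samplers scales to models with many constraints.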

SOAPMetaS: profiling large metagenome datasets efficiently on distributed clusters

Bioinformatics ◽

10.1093/bioinformatics/btaa697 ◽

2020 ◽

Author(s):

Shixu He ◽

Zhibo Huang ◽

Xiaohan Wang ◽

Lin Fang ◽

Shengkang Li ◽

...

Keyword(s):

Big Data ◽

Large Volume ◽

Machine Tools ◽

High Performance ◽

Marker Gene ◽

Source Code ◽

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Multiple Sample

Abstract Summary The rapid increase in data size in metagenome research has raised the demand for new tools to process large datasets efficiently. To accelerate metagenome profiling in big-data scenarios, we developed SOAPMetaS, a marker gene-based multiple-sample metagenome profiling tool built on Apache Spark. SOAPMetaS demonstrates high performance and scalability when processing large datasets. It can process 80 samples of FASTQ data, totaling 416 GiB, in around half an hour, and the accuracy of its species profiling results is similar to that of MetaPhlAn2. SOAPMetaS can handle a large volume of metagenome data more efficiently than commonly used single-machine tools. Availability and implementation Source code is implemented in Java and freely available at https://github.com/BGI-flexlab/SOAPMetaS. Supplementary information Supplementary data are available at Bioinformatics online.

LGFC-CNN: Prediction of lncRNA-Protein Interactions by Using Multiple Types of Features through Deep Learning

Genes ◽

10.3390/genes12111689 ◽

2021 ◽

Vol 12 (11) ◽

pp. 1689

Author(s):

Lan Huang ◽

Shaoqing Jiao ◽

Sen Yang ◽

Shuangquan Zhang ◽

Xiaopeng Zhu ◽

...

Keyword(s):

Fourier Transform ◽

Deep Learning ◽

Protein Interactions ◽

Noncoding Rna ◽

State Of The Art ◽

Experimental Methods ◽

Biological Processes ◽

Sequence Composition ◽

Novel Method ◽

Global And Local

Long noncoding RNA (lncRNA) plays a crucial role in many critical biological processes and participates in complex human diseases through interaction with proteins. Because identifying lncRNA–protein interactions through experimental methods is expensive and time-consuming, we propose a novel deep learning method, called LGFC-CNN, that combines raw sequence composition features and hand-designed features to predict lncRNA–protein interactions. Two sequence preprocessing methods and two CNN modules (GloCNN and LocCNN) are used to extract global and local features from the raw sequences. Meanwhile, we select hand-designed features by comparing the predictive effect of different combinations of lncRNA and protein features. Furthermore, we obtain the structure features and unify their dimensions through the Fourier transform. Finally, the four types of features are integrated to comprehensively predict the lncRNA–protein interactions. Compared with other state-of-the-art methods on three lncRNA–protein interaction datasets, LGFC-CNN achieves the best performance, with an accuracy of 94.14% on RPI21850, 92.94% on RPI7317 and 98.19% on RPI1847. The results show that LGFC-CNN can effectively predict lncRNA–protein interactions by combining raw sequence composition features and hand-designed features.

GlycanFormatConverter: a conversion tool for translating the complexities of glycans

Bioinformatics ◽

10.1093/bioinformatics/bty990 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2434-2440 ◽

Cited By ~ 7

Author(s):

Shinichiro Tsuchiya ◽

Issaku Yamada ◽

Kiyoko F Aoki-Kinoshita

Keyword(s):

Open Source ◽

Source Code ◽

Supplementary Information ◽

Biological Processes ◽

Supplementary Data ◽

Unique Representation ◽

Open Source Tool ◽

Living Organisms ◽

Conversion Tool ◽

Complex Glycan

Abstract Motivation Glycans are biomolecules that play an important role in the biological processes of living organisms. They form diverse, complicated structures such as branched and cyclic forms. Web3 Unique Representation of Carbohydrate Structures (WURCS) was proposed as a new linear notation for uniquely representing glycans during the GlyTouCan project. WURCS defines rules for complex glycan structures that other text formats did not support, and so it is possible to represent a wide variety of glycans. However, WURCS uses a complicated nomenclature, so it is not human-readable. Therefore, we aimed to support the interpretation of WURCS by converting WURCS to the most basic and widely used format, IUPAC. Results In this study, we developed GlycanFormatConverter and succeeded in converting WURCS to the three kinds of IUPAC formats (IUPAC-Extended, IUPAC-Condensed and IUPAC-Short). Furthermore, we have implemented functionality to import IUPAC-Extended, KEGG Chemical Function (KCF) and LinearCode formats and to export WURCS. We have thoroughly tested GlycanFormatConverter and were able to show that it can convert all the glycans registered in the GlyTouCan repository, with exceptions owing only to the limitations of the original format. The source code for this conversion tool has been released as an open source tool. Availability and implementation https://github.com/glycoinfo/GlycanFormatConverter.git Supplementary information Supplementary data are available at Bioinformatics online.

BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences

Bioinformatics ◽

10.1093/bioinformatics/btaa928 ◽

2020 ◽

Author(s):

Aziz Khan ◽

Rafael Riudavets Puig ◽

Paul Boddie ◽

Anthony Mathelier

Keyword(s):

Dna Sequences ◽

Source Code ◽

Web Server ◽

Enrichment Analysis ◽

Nucleotide Composition ◽

Supplementary Information ◽

Command Line ◽

Sequence Composition ◽

Command Line Tool ◽

Gc Bias

Abstract Motivation Accurate motif enrichment analyses depend on the choice of background DNA sequences used, which should ideally match the sequence composition of the foreground sequences. It is important to avoid false positive enrichment due to sequence biases in the genome, such as GC-bias. Therefore, relying on an appropriate set of background sequences is crucial for enrichment analysis. Results We developed BiasAway, a command line tool with a dedicated easy-to-use web server, to generate synthetic sequences matching any k-mer nucleotide composition or to select genomic DNA sequences matching the mononucleotide composition of the foreground sequences through four different models. For genomic sequences, we provide precomputed partitions of genomes from nine species with five different bin sizes to generate appropriate genomic background sequences. Availability and implementation BiasAway source code is freely available from Bitbucket (https://bitbucket.org/CBGR/biasaway) and can be easily installed using bioconda or pip. The web server is available at https://biasaway.uio.no and detailed documentation is available at https://biasaway.readthedocs.io. Supplementary information Supplementary data are available at Bioinformatics online.
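
The simplest case of a composition-matched synthetic background, k = 1, can be sketched by shuffling the letters of each foreground sequence, which preserves its mononucleotide counts (and therefore its GC content) exactly. This is an illustrative sketch, not BiasAway's code or command-line interface; the function names `shuffle_background` and `gc_content` are hypothetical.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GCgc" for base in seq) / len(seq)


def shuffle_background(seq, seed=None):
    """Return a random permutation of seq.

    Hypothetical helper: the result has exactly the same
    mononucleotide composition as the input sequence.
    """
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


foreground = "GGCGCCATGCGCGGTA"
background = shuffle_background(foreground, seed=42)
print(gc_content(foreground) == gc_content(background))
```

Matching higher-order k-mer composition requires a more careful procedure than a plain shuffle, since shuffling single letters destroys dinucleotide and longer patterns.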

A learning-based framework for miRNA-disease association identification using neural networks

10.1101/276048 ◽

2018 ◽

Cited By ~ 9

Author(s):

Jiajie Peng ◽

Weiwei Hui ◽

Qianqian Li ◽

Bolin Chen ◽

Qinghua Jiang ◽

...

Keyword(s):

State Of The Art ◽

Essential Feature ◽

Source Code ◽

Disease Association ◽

Feature Representation ◽

Supplementary Information ◽

Feature Combination ◽

Biological Processes ◽

Non Coding Rna ◽

Supplementary Material

Abstract Motivation A microRNA (miRNA) is a type of non-coding RNA that plays important roles in many biological processes. Many studies have shown that miRNAs are implicated in human diseases, indicating that miRNAs might be potential biomarkers for various types of diseases. Therefore, it is important to reveal the relationships between miRNAs and diseases/phenotypes. Results We propose a novel learning-based framework, MDA-CNN, for miRNA-disease association identification. The model first captures richer interaction features between diseases and miRNAs based on a three-layer network with an additional gene layer. Then, it employs an auto-encoder to automatically identify the essential feature combination for each miRNA-disease pair. Finally, taking the reduced feature representation as input, it uses a convolutional neural network to predict the final label. The evaluation results show that the proposed framework outperforms some state-of-the-art approaches by a large margin on both miRNA-disease association prediction and miRNA-phenotype association prediction. Availability The source code and data are available at https://github.com/Issingjessica/[email protected];[email protected];[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

VIDHOP, viral host prediction with deep learning

Bioinformatics ◽

10.1093/bioinformatics/btaa705 ◽

2020 ◽

Cited By ~ 1

Author(s):

Florian Mock ◽

Adrian Viehweger ◽

Emanuel Barth ◽

Manja Marz

Keyword(s):

Deep Learning ◽

Viral Genome ◽

Influenza A ◽

Deep Neural Networks ◽

Supplementary Information ◽

Learning Approach ◽

Virus Species ◽

Genome Sequences ◽

Rotavirus A ◽

The Core

Abstract Motivation Zoonosis, the natural transmission of infections from animals to humans, is a far-reaching global problem. The recent outbreaks of Zikavirus, Ebolavirus and Coronavirus are examples of viral zoonosis, which occur more frequently due to globalization. In case of a virus outbreak, it is helpful to know which host organism was the original carrier of the virus to prevent further spreading of viral infection. Recent approaches aim to predict a viral host based on the viral genome, often in combination with the potential host genome and arbitrarily selected features. These methods are limited in the number of different hosts they can predict or the accuracy of the prediction. Results Here, we present a fast and accurate deep learning approach for viral host prediction, which is based on the viral genome sequence only. We tested our deep neural network (DNN) on three different virus species (influenza A virus, rabies lyssavirus and rotavirus A). We achieved for each virus species an AUC between 0.93 and 0.98, allowing highly accurate predictions while using only fractions (100–400 bp) of the viral genome sequences. We show that deep neural networks are suitable to predict the host of a virus, even with a limited amount of sequences and highly unbalanced available data. The trained DNNs are the core of our virus–host prediction tool VIrus Deep learning HOst Prediction (VIDHOP). VIDHOP also allows the user to train and use models for other viruses. Availability and implementation VIDHOP is freely available under https://github.com/flomock/vidhop. Supplementary information Supplementary data are available at Bioinformatics online.

VIDHOP, viral host prediction with Deep Learning

10.1101/575571 ◽

2019 ◽

Cited By ~ 1

Author(s):

Florian Mock ◽

Adrian Viehweger ◽

Emanuel Barth ◽

Manja Marz

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Viral Genome ◽

Influenza A ◽

Deep Neural Networks ◽

Ebola Virus ◽

Training Data ◽

Supplementary Information ◽

Virus Species ◽

Rotavirus A

Abstract Motivation Zoonosis, the natural transmission of infections from animals to humans, is a far-reaching global problem. The recent outbreaks of Zika virus, Ebola virus and coronavirus are examples of viral zoonosis, which occur more frequently due to globalization. In the case of a virus outbreak, it is helpful to know which host organism was the original carrier of the virus. Once the reservoir or intermediate host is known, it can be isolated to prevent further spreading of the viral infection. Recent approaches aim to predict a viral host based on the viral genome, often in combination with the potential host genome and arbitrarily selected features. These methods have a clear limitation in either the number of different hosts they can predict or the accuracy of their prediction. Results Here, we present a fast and accurate deep learning approach for viral host prediction, which is based on the viral genome sequence only. To ensure a high prediction accuracy, we developed an effective selection approach for the training data to avoid biases due to a highly unbalanced number of known sequences per virus-host combination. We tested our deep neural network on three different virus species (influenza A, rabies lyssavirus, rotavirus A). For each virus species we reached an AUC between 0.93 and 0.98, outperforming previous approaches and allowing highly accurate predictions while using only fractions (100-400 bp) of the viral genome sequences. We show that deep neural networks are suitable to predict the host of a virus, even with a limited amount of sequences and highly unbalanced available data. The deep neural networks trained for this approach form the core of the virus-host prediction tool VIDHOP (Virus Deep learning HOst Prediction). Availability The trained models for predicting the hosts of influenza A, rabies lyssavirus and rotavirus A are implemented in the tool VIDHOP. This tool is freely available under https://github.com/flomock/vidhop. Supplementary information Supplementary data are available at DOI 10.17605/OSF.IO/UXT7N
