Kaiju: Fast and sensitive taxonomic classification for metagenomics

Mapping Intimacies ◽

10.1101/031229 ◽

2015 ◽

Cited By ~ 7

Author(s):

Peter Menzel ◽

Kim Lee Ng ◽

Anders Krogh

Keyword(s):

Large Scale ◽

Taxonomic Classification ◽

Greedy Heuristic ◽

Reference Database ◽

The Novel ◽

Sequencing Technologies ◽

A Genome ◽

Reference Databases ◽

Genome Exclusion ◽

Higher Sensitivity

The constantly decreasing cost and increasing output of current sequencing technologies enable large scale metagenomic studies of microbial communities from diverse habitats. Therefore, fast and accurate methods for taxonomic classification are needed, which can operate on increasingly larger datasets and reference databases. Recently, several fast metagenomic classifiers have been developed, which are based on comparison of genomic k-mers. However, nucleotide comparison using a fixed k-mer length often lacks the sensitivity to overcome the evolutionary distance between sampled species and genomes in the reference database. Here, we present the novel metagenome classifier Kaiju for fast assignment of reads to taxa. Kaiju finds maximum exact matches on the protein level using the Burrows-Wheeler transform, and can optionally allow amino acid substitutions in the search using a greedy heuristic. We show in a genome exclusion study that Kaiju can classify more reads with higher sensitivity and similar precision compared to fast k-mer based classifiers, especially in genera that are underrepresented in reference databases. We also demonstrate that Kaiju classifies more than twice as many reads in ten real metagenomes compared to programs based on genomic k-mers. Kaiju can process up to millions of reads per minute, and its memory footprint is below 6 GB of RAM, allowing the analysis on a standard PC. The program is available under the GPL3 license at: http://bioinformatics-centre.github.io/kaiju
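The exact-matching idea in this abstract can be illustrated with a toy sketch. It is not Kaiju's implementation: it builds a Burrows-Wheeler transform naively from sorted rotations and uses backward search to find the longest suffix of a query that occurs in a reference protein string. The function names (`bwt`, `build_fm_index`, `longest_suffix_match`) and the example sequences are hypothetical, and the sketch assumes queries never contain the sentinel character `$`.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel; sorts before all uppercase amino acid letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_str):
    """Return (first_row, occ) tables used by backward search."""
    alphabet = sorted(set(bwt_str))
    # first_row[c]: number of characters in the text that sort before c
    first_row, total = {}, 0
    for c in alphabet:
        first_row[c] = total
        total += bwt_str.count(c)
    # occ[c][i]: number of occurrences of c in bwt_str[:i]
    occ = {c: [0] for c in alphabet}
    for ch in bwt_str:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first_row, occ


def longest_suffix_match(query, first_row, occ, n_rows):
    """Extend a match right to left; return (match length, number of hits)."""
    lo, hi, matched = 0, n_rows, 0  # half-open interval of sorted rotations
    for ch in reversed(query):
        if ch not in first_row:
            break  # character never occurs in the reference
        new_lo = first_row[ch] + occ[ch][lo]
        new_hi = first_row[ch] + occ[ch][hi]
        if new_lo >= new_hi:
            break  # extending by ch would leave no occurrence
        lo, hi, matched = new_lo, new_hi, matched + 1
    return matched, (hi - lo if matched else 0)


# Hypothetical reference protein and translated read fragment.
reference = "MKVLAAGIVGLK"
transformed = bwt(reference)
first_row, occ = build_fm_index(transformed)
length, hits = longest_suffix_match("QAGIVGLK", first_row, occ, len(transformed))
print(length, hits)
```

A production index would sample the occurrence table and store a suffix array to report match positions; the dense tables here trade memory for readability.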


AUTOMATION OF UPDATE OF DIGITAL NATIONAL GEO‐REFERENCE DATABASES / NACIONALINIŲ GEOREFERENCINIŲ DUOMENŲ BAZIŲ ATNAUJINIMO AUTOMATIZAVIMAS

Technological and Economic Development of Economy ◽

10.3846/tede.2010.16 ◽

2010 ◽

Vol 16 (2) ◽

pp. 254-265 ◽

Cited By ~ 6

Author(s):

Žilvinas Stankevičius ◽

Giedrė Beconytė ◽

Aušra Kalantaitė

Keyword(s):

Large Scale ◽

Reference Data ◽

Economic Effect ◽

Geographic Information ◽

National Database ◽

Reference Database ◽

Unique Identifier ◽

Unique Object ◽

Reference Databases ◽

Definition Of

A unified geo‐reference data model is a very important part of national geographic information management. It was developed within the Lithuanian geographic information infrastructure project in 2006–2008. This model allows automated integration of large scale (mainly municipality) geo‐reference data into the unified national geo‐reference database. It is based on unique object identifiers across all geo‐reference databases and on standard update and harmonisation procedures. The common stages of harmonisation of geo‐reference databases at different scales include: implementation of a unique identifier of geographic objects across all databases concerned; definition of the life cycle of the objects; definition of the cohesion boundary and of the harmonisation points along the boundary; maintenance of the local database and automatic update of the national database using a special service. When implemented, such a model will significantly facilitate maintenance of the national geo‐reference database and, within five years of full implementation, will have a significant economic effect. Summary (translated from Lithuanian): An analysis of the spatial data collected by municipalities in Lithuania showed that only the municipalities of larger cities collect spatial data, and that the structures of these data differ. The spatial databases developed at the national level are not harmonised with one another, and the process of collecting spatial data is duplicated because it is oriented toward producing maps at different scales. The unified geo‐reference data model (VGDM) covers the conversion of geo‐reference data from official geographic datasets of various scales, and especially from municipal geo‐reference datasets, into the unified national geo‐reference database (VGDB), as well as procedures for continuously updating the VGDB. The VGDB update technology is based on the life cycle of geo‐objects (vector geographic data elements) and on change tracking.
The geo‐reference data model lays out a path toward effective interoperability of official databases at different scales.


Incorporating genome-based phylogeny and trait similarity into diversity assessments helps to resolve a global collection of human gut metagenomes

10.1101/2020.07.16.207845 ◽

2020 ◽

Author(s):

Nicholas D. Youngblut ◽

Jacobo de la Cuesta-Zuluaga ◽

Ruth E. Ley

Keyword(s):

Microbial Communities ◽

Large Scale ◽

Alpha Diversity ◽

Reference Database ◽

Rrna Gene ◽

Human Gut ◽

Shotgun Metagenomics ◽

Diversity Measures ◽

Microbiome Diversity ◽

A Genome

Abstract Tree-based diversity measures incorporate phylogenetic or phenotypic relatedness into comparisons of microbial communities. This improves the identification of explanatory factors compared to tree-agnostic diversity measures. However, applying tree-based diversity measures to metagenome data is more challenging than for single-locus sequencing (e.g., 16S rRNA gene). The Genome Taxonomy Database (GTDB) provides a genome-based reference database that can be used for species-level metagenome profiling, and a multi-locus phylogeny of all genomes that can be employed for diversity calculations. Moreover, traits can be inferred from the genomic content of each representative, allowing for trait-based diversity measures. Still, it is unclear how metagenome-based assessments of microbiome diversity benefit from incorporating phylogeny or phenotype into measures of diversity. We assessed this by measuring phylogeny-based, trait-based, and tree-agnostic diversity measures from a large, global collection of human gut metagenomes composed of 33 studies and 3348 samples. We found phylogeny- and trait-based alpha diversity to better differentiate samples by westernization, age, and gender. PCoA ordinations of phylogeny- or trait-based weighted UniFrac explained more variance than tree-agnostic measures, which was largely a result of these measures emphasizing inter-phylum differences between Bacteroidaceae (Bacteroidota) and Enterobacteriaceae (Proteobacteria) versus just differences within Bacteroidaceae (Bacteroidota). The disease state of samples was better explained by tree-based weighted UniFrac, especially the presence of Shiga toxin-producing E. coli (STEC) and hypertension. 
Our findings show that metagenome diversity estimation benefits from incorporating a genome-derived phylogeny or traits. Importance Estimations of microbiome diversity are fundamental to understanding spatiotemporal changes of microbial communities and identifying which factors mediate such changes. Tree-based measures of diversity are widespread for amplicon-based microbiome studies due to their utility relative to tree-agnostic measures; however, tree-based measures are seldom applied to shotgun metagenomics data. We evaluated the utility of phylogeny-based, trait-based, and tree-agnostic diversity measures on a large scale human gut metagenome dataset to help guide researchers with the complex task of evaluating microbiome diversity via metagenomics.


DeepMicrobes: taxonomic classification for metagenomics with deep learning

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa009 ◽

2020 ◽

Vol 2 (1) ◽

Cited By ~ 7

Author(s):

Qiaoxing Liang ◽

Paul W Bible ◽

Yu Liu ◽

Bin Zou ◽

Lai Wei

Keyword(s):

Deep Learning ◽

Large Scale ◽

Taxonomic Classification ◽

Reference Database ◽

Computational Framework ◽

Bowel Diseases ◽

Comparable Accuracy ◽

Inflammatory Bowel ◽

Genome Assemblies ◽

Taxonomic Tree

Abstract Large-scale metagenomic assemblies have uncovered thousands of new species greatly expanding the known diversity of microbiomes in specific habitats. To investigate the roles of these uncultured species in human health or the environment, researchers need to incorporate their genome assemblies into a reference database for taxonomic classification. However, this procedure is hindered by the lack of a well-curated taxonomic tree for newly discovered species, which is required by current metagenomics tools. Here we report DeepMicrobes, a deep learning-based computational framework for taxonomic classification that allows researchers to bypass this limitation. We show the advantage of DeepMicrobes over state-of-the-art tools in species and genus identification and comparable accuracy in abundance estimation. We trained DeepMicrobes on genomes reconstructed from gut microbiomes and discovered potential novel signatures in inflammatory bowel diseases. DeepMicrobes facilitates effective investigations into the uncharacterized roles of metagenomic species.


Chromosome assembly of large and complex genomes using multiple references

10.1101/088435 ◽

2016 ◽

Cited By ~ 7

Author(s):

Mikhail Kolmogorov ◽

Joel Armstrong ◽

Brian J. Raney ◽

Ian Streeter ◽

Matthew Dunn ◽

...

Keyword(s):

Large Scale ◽

Genome Rearrangement ◽

Rapid Development ◽

Mus Spretus ◽

Sequencing Technologies ◽

Mus Caroli ◽

A Genome ◽

Direct Assembly ◽

Multiple References ◽

Reference Genomes

Abstract Despite the rapid development of sequencing technologies, assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout, a reference-assisted assembly tool that now works for large and complex genomes. Taking one or more target assemblies (generated from an NGS assembler) and one or multiple related reference genomes, Ragout infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. Using Ragout, we transformed NGS assemblies of 15 different Mus musculus and one Mus spretus genomes into sets of complete chromosomes, leaving less than 5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realigning of long PacBio reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. Additionally, we applied Ragout to Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared to other genomes from the Muridae family. Chromosome color maps confirmed most large-scale rearrangements that Ragout detected.


Concatenation of paired-end reads improves taxonomic classification of amplicons for profiling microbial communities

BMC Bioinformatics ◽

10.1186/s12859-021-04410-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Daniel P. Dacey ◽

Frédéric J. J. Chain

Keyword(s):

Read Depth ◽

Taxonomic Composition ◽

Taxonomic Classification ◽

Read Length ◽

Reference Database ◽

Reference Databases ◽

Sequence Quality ◽

First Time ◽

Mock Communities

Abstract Background Taxonomic classification of genetic markers for microbiome analysis is affected by the numerous choices made from sample preparation to bioinformatics analysis. Paired-end read merging is routinely used to capture the entire amplicon sequence when the read ends overlap. However, the exclusion of unmerged reads from further analysis can result in underestimating the diversity in the sequenced microbial community and is influenced by bioinformatic processes such as read trimming and the choice of reference database. A potential solution to overcome this is to concatenate (join) reads that do not overlap and keep them for taxonomic classification. The use of concatenated reads can outperform taxonomic recovery from single-end reads, but it remains unclear how their performance compares to merged reads. Using various sequenced mock communities with different amplicons, read length, read depth, taxonomic composition, and sequence quality, we tested how merging and concatenating reads performed for genus recall and precision in bioinformatic pipelines combining different parameters for read trimming and taxonomic classification using different reference databases. Results The addition of concatenated reads to merged reads always increased pipeline performance. The top two performing pipelines both included read concatenation, with variable strengths depending on the mock community. The pipeline that combined merged and concatenated reads that were quality-trimmed performed best for mock communities with larger amplicons and higher average quality sequences. The pipeline that used length-trimmed concatenated reads outperformed quality trimming in mock communities with lower quality sequences but lost a significant amount of input sequences for taxonomic classification during processing. Genus level classification was more accurate using the SILVA reference database compared to Greengenes. 
Conclusions Merged sequences with the addition of concatenated sequences that were unable to be merged increased performance of taxonomic classifications. This was especially beneficial in mock communities with larger amplicons. We have shown for the first time, using an in-depth comparison of pipelines containing merged vs concatenated reads combined with different trimming parameters and reference databases, the potential advantages of concatenating sequences in improving resolution in microbiome investigations.
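The difference between merging and concatenating a read pair can be sketched as follows. This is an illustration only, not the pipelines evaluated in the study: the function names, the exact-match overlap test, the `min_overlap` threshold, the reverse-complementing of the second read, and the single `N` spacer are all assumptions made for the example.

```python
# Translation table for complementing DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_or_concatenate(r1, r2, min_overlap=10, spacer="N"):
    """Merge a read pair if the ends overlap exactly, else join them.

    r2 is assumed to come from the opposite strand, so it is
    reverse-complemented before comparison. Real mergers tolerate
    mismatches and use base qualities; this sketch does neither.
    """
    r2_rc = reverse_complement(r2)
    longest = min(len(r1), len(r2_rc))
    # Try the longest candidate overlap first.
    for k in range(longest, min_overlap - 1, -1):
        if r1[-k:] == r2_rc[:k]:
            return "merged", r1 + r2_rc[k:]
    # No overlap found: keep both reads as one concatenated sequence.
    return "concatenated", r1 + spacer + r2_rc


# Hypothetical 20 bp amplicon sequenced from both ends.
amplicon = "ACGTACGTAAGGCCTTAGGA"
forward_read = amplicon[:14]
reverse_read = reverse_complement(amplicon[8:])
print(merge_or_concatenate(forward_read, reverse_read, min_overlap=4))
```

Whether a spacer is inserted, and of what length, depends on the downstream classifier; the single `N` here is only a placeholder.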


Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

Frontiers in Microbiology ◽

10.3389/fmicb.2021.755101 ◽

2021 ◽

Vol 12 ◽

Author(s):

Valérian Lupo ◽

Mick Van Vlierberghe ◽

Hervé Vanderschuren ◽

Frédéric Kerff ◽

Denis Baurain ◽

...

Keyword(s):

Reference Sequence ◽

Reference Database ◽

Contamination Level ◽

Gene Markers ◽

Genome Wide ◽

A Genome ◽

Genomic Studies ◽

Reference Databases ◽

Single Method ◽

Divide And Rule

Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach by a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.


Covalent Atomic Bridges Enable Unidirectional Enhancement of Electronic Transport in Aligned Carbon Nanotubes

10.26434/chemrxiv.8044616 ◽

2019 ◽

Author(s):

Mingguang Chen ◽

Wangxiang Li ◽

Anshuman Kumar ◽

Guanghui Li ◽

Mikhail Itkis ◽

...

Keyword(s):

Carbon Nanotubes ◽

Electrical Transport ◽

Large Scale ◽

The Novel ◽

Electrical Measurements ◽

Metal Atoms ◽

Aligned Carbon Nanotubes ◽

Transport Channels ◽

Walled Carbon Nanotubes ◽

Bulk Structures

Interconnecting the surfaces of nanomaterials without compromising their outstanding mechanical, thermal, and electronic properties is critical in the design of advanced bulk structures that still preserve the novel properties of their nanoscale constituents. As such, bridging the π-conjugated carbon surfaces of single-walled carbon nanotubes (SWNTs) has special implications in next-generation electronics. This study presents a rational path towards improvement of the electrical transport in aligned semiconducting SWNT films by deposition of metal atoms. The formation of conducting Cr-mediated pathways between the parallel SWNTs increases the transverse (intertube) conductance, while having negligible effect on the parallel (intratube) transport. In contrast, doping with Li has a predominant effect on the intratube electrical transport of aligned SWNT films. Large-scale first-principles calculations of electrical transport on aligned SWNTs show good agreement with the experimental electrical measurements and provide insight into the changes that different metal atoms exert on the density of states near the Fermi level of the SWNTs and the formation of transport channels.


Application of Scan Based FA and Advanced Passive Voltage Contrast Technique in Defect Isolation

ISTFA 2003: Conference Proceedings from the 29th International Symposium for Testing and Failure Analysis ◽

10.31399/asm.cp.istfa2003p0391 ◽

2003 ◽

Author(s):

Michael B. Schmidt ◽

Noor Jehan Saujauddin

Keyword(s):

Failure Analysis ◽

Large Scale ◽

Fault Isolation ◽

Electronic Information ◽

Scan Testing ◽

Challenging Environment ◽

Contrast Technique ◽

Defect Isolation ◽

Voltage Contrast ◽

Higher Sensitivity

Abstract Scan testing and passive voltage contrast (PVC) techniques have been widely used as failure analysis fault isolation tools. Scan diagnosis can narrow a failure to a given net and passive voltage contrast can give real-time, large-scale electronic information about a sample at various stages of deprocessing. In the highly competitive and challenging environment of today, failure analysis cycle time is very important. By combining scan FA with a much higher sensitivity passive voltage contrast technique, one can quickly find defects that have traditionally posed a great challenge.


Adaptability of Ultrasonic Lamb Wave Touchscreen to the Variations in Touch Force and Touch Area

Sensors ◽

10.3390/s21051736 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1736

Author(s):

Zengchong Yang ◽

Xiucheng Liu ◽

Bin Wu ◽

Ren Liu

Keyword(s):

Lamb Wave ◽

Weight Coefficient ◽

The Self ◽

Reference Database ◽

Learning Method ◽

Improved Method ◽

Large Area ◽

Localization Model ◽

Reference Databases ◽

First Time

Previous studies on the Lamb wave touchscreen (LWT) assumed that the unknown touch had parameters consistent with the acoustic fingerprints in the reference database. The adaptability of the LWT to variations in touch force and touch area was investigated in this study for the first time. The automatic collection of the databases of acoustic fingerprints was realized with an experimental prototype of the LWT employing three transmitter–receiver pairs. A self-adaptively updated weight coefficient for the transmitter–receiver pairs in use was employed to improve the accuracy of the localization model, which was established with a learning method. The performance of the improved method in locating single- and two-touch actions with reference databases of different parameters was carefully evaluated. The robustness of the LWT to variation in touch force varied with the touch area. Moreover, it was feasible to locate touch actions of large area with reference databases of small touch areas as long as the unknown touch and the reference databases met the condition of equivalent averaged stress.


IoT Traffic: Modeling and Measurement Experiments

IoT ◽

10.3390/iot2010008 ◽

2021 ◽

Vol 2 (1) ◽

pp. 140-162

Author(s):

Hung Nguyen-An ◽

Thomas Silverston ◽

Taku Yamazaki ◽

Takumi Miyoshi

Keyword(s):

Smart Home ◽

Large Scale ◽

Traffic Modeling ◽

The Novel ◽

Network Behavior ◽

Performance Accuracy ◽

Traffic Generator ◽

Multiple Devices ◽

Iot Devices ◽

The Internet Of Things

We now use the Internet of things (IoT) in our everyday lives. The novel IoT devices collect cyber–physical data and provide information on the environment. Hence, IoT traffic will account for a major part of Internet traffic; however, its impact on the network is still widely unknown. IoT devices are prone to cyberattacks because of constrained resources or misconfigurations. It is essential to characterize IoT traffic and identify each device to monitor the IoT network and discriminate between legitimate and anomalous IoT traffic. In this study, we deployed a smart-home testbed comprising several IoT devices to study IoT traffic. We performed extensive measurement experiments using a novel IoT traffic generator tool called IoTTGen. This tool can generate traffic from multiple devices, emulating large-scale scenarios with different devices under different network conditions. We analyzed the IoT traffic properties by computing the entropy value of traffic parameters and visually observing the traffic on behavior shape graphs. We propose a new entropy-based method for identifying devices, which computes the entropy values of traffic features and relies on machine learning to classify the traffic. The proposed method succeeded in identifying devices with an accuracy of up to 94% and is robust to unpredictable network behavior, with traffic anomalies spreading in the network.
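The entropy computation mentioned in this abstract can be sketched as below. This is a minimal illustration that assumes Shannon entropy over the observed values of each traffic parameter; the abstract does not specify the exact entropy definition or feature set, and the field names (`dst_port`, `size`), the packet records, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_features(packets, fields):
    """Map each traffic parameter to the entropy of its observed values."""
    return {f: shannon_entropy([p[f] for p in packets]) for f in fields}


# Hypothetical packet records captured from one device.
camera_packets = [
    {"dst_port": 443, "size": 1400},
    {"dst_port": 443, "size": 1400},
    {"dst_port": 443, "size": 60},
    {"dst_port": 443, "size": 60},
]
print(entropy_features(camera_packets, ["dst_port", "size"]))
```

A vector of such per-parameter entropies, computed over a time window, could then serve as input to a classifier; the choice of window and classifier is left open here.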
