A Nano(publication) Approach Towards Big Data in Biodiversity

Biodiversity Information Science and Standards ◽

10.3897/biss.5.74351 ◽

2021 ◽

Vol 5 ◽

Author(s):

Mariya Dimitrova ◽

Teodor Georgiev ◽

Lyubomir Penev

Keyword(s):

De Novo ◽

Biotic Interactions ◽

Species Interaction ◽

Automatic Extraction ◽

Data Generation ◽

Data Interoperability ◽

Publication Process ◽

Related Data ◽

Biodiversity Knowledge ◽

Efficient Data

One of the major challenges in biodiversity informatics is the generation of machine-readable data that is interoperable between different biodiversity-related data infrastructures. Producers of such data have to comply with existing standards and to be resourceful enough to enable efficient data generation, management and availability. Conversely, nanopublications offer a decentralised approach (Kuhn et al. 2016) towards achieving data interoperability in a robust and standardised way. A nanopublication is a named RDF graph, which serves to communicate a single fact and its original source (provenance) through the use of identifiers and linked data (Groth et al. 2010). It is composed of three constituent graphs (assertion, provenance, and publication info), which are linked to one another in the nanopublication header (Kuhn et al. 2016). For instance, a nanopublication has been published to assert a species interaction in which a hairy woodpecker (Picoides villosus) ate a beetle (genus Ips), along with the license and related bibliographic citation. In biodiversity, nanopublications can be used to exchange information between infrastructures in a standardised way (Fig. 1) and to enable curation and correction of knowledge. They can be implemented within different workflows to formalise biodiversity knowledge in self-enclosed graphs. We have developed several nanopublication models for different biodiversity use cases: species occurrences, new species descriptions, biotic interactions, and links between taxonomy, sequences and institutions. Nanopublications can be generated by various means: semi-automatic extraction from the published literature with subsequent human curation and publication; generation during the publication process by the authors via a dedicated formalisation tool and published together with the article; de novo generation of a nanopublication through decentralised networks such as Nanobench (Kuhn et al. 2021). One of the possible uses of nanopublications in biodiversity is communicating new information in a standardised way so that it can be accessed and interpreted by multiple infrastructures that have a common agreement on how information is expressed through the use of particular ontologies, vocabularies and sets of identifiers. In addition, we envision nanopublications to be useful for curation or peer-review of published knowledge by enabling any researcher to publish a nanopublication containing a comment on an assertion made in a previously published nanopublication. With this talk, we aim to showcase several nanopublication formats for biodiversity and to discuss the possible applications of nanopublications in the biodiversity domain.
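
The three-graph structure described in this abstract (assertion, provenance and publication info, tied together by a head graph) can be illustrated with a minimal Python sketch. It is not the authors' model: the `Nanopub` class, the `term` helper, every `http://example.org/` identifier and the `eats` predicate are hypothetical placeholders. Only the nanopublication schema terms (`hasAssertion`, `hasProvenance`, `hasPublicationInfo`), `prov:wasDerivedFrom` and `dcterms:license` are existing vocabulary terms, and the output is a simplified TriG-like serialisation.

```python
import json
from dataclasses import dataclass, field

NP = "http://www.nanopub.org/nschema#"          # nanopublication schema
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PROV = "http://www.w3.org/ns/prov#"
DCT = "http://purl.org/dc/terms/"
EX = "http://example.org/"                      # hypothetical namespace


def term(value: str) -> str:
    """Render IRIs in angle brackets and anything else as a quoted literal."""
    return f"<{value}>" if value.startswith("http") else json.dumps(value)


@dataclass
class Nanopub:
    """Hypothetical container: three content graphs plus a generated head."""
    uri: str
    assertion: list = field(default_factory=list)
    provenance: list = field(default_factory=list)
    pubinfo: list = field(default_factory=list)

    def graphs(self) -> dict:
        # The head graph links the nanopublication to its three parts.
        head = [
            (self.uri, RDF_TYPE, NP + "Nanopublication"),
            (self.uri, NP + "hasAssertion", self.uri + "#assertion"),
            (self.uri, NP + "hasProvenance", self.uri + "#provenance"),
            (self.uri, NP + "hasPublicationInfo", self.uri + "#pubinfo"),
        ]
        return {
            self.uri + "#head": head,
            self.uri + "#assertion": self.assertion,
            self.uri + "#provenance": self.provenance,
            self.uri + "#pubinfo": self.pubinfo,
        }

    def to_trig(self) -> str:
        blocks = []
        for name, triples in self.graphs().items():
            body = "\n".join(
                f"  {term(s)} {term(p)} {term(o)} ." for s, p, o in triples
            )
            blocks.append(f"<{name}> {{\n{body}\n}}")
        return "\n\n".join(blocks)


uri = EX + "np1"
np1 = Nanopub(
    uri=uri,
    # Single fact: woodpecker eats beetle (placeholder taxon IRIs and predicate).
    assertion=[(EX + "taxon/Picoides_villosus", EX + "eats", EX + "taxon/Ips")],
    # Where the fact came from (placeholder source article).
    provenance=[(uri + "#assertion", PROV + "wasDerivedFrom", EX + "source-article")],
    # Metadata about the nanopublication itself.
    pubinfo=[(uri, DCT + "license", "https://creativecommons.org/licenses/by/4.0/")],
)
print(np1.to_trig())
```

Keeping the assertion in its own graph is what lets a second nanopublication refer to it by identifier, for example to comment on or correct it.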

Big Data and Big Data Analytics for Improved Healthcare Service and Management

International Journal of Privacy and Health Information Management ◽

10.4018/ijphim.2020010102 ◽

2020 ◽

Vol 8 (1) ◽

pp. 13-51

Author(s):

Pijush Kanti Dutta Pramanik ◽

Saurabh Pal ◽

Moutan Mukhopadhyay

Keyword(s):

Big Data ◽

Data Analytics ◽

Big Data Analytics ◽

Healthcare Services ◽

Healthcare Sector ◽

Future Market ◽

Data Generation ◽

Related Data ◽

Healthcare Data ◽

Different Types

Like other fields, the healthcare sector has also been greatly impacted by big data. A huge volume of healthcare data and other related data are being continually generated from diverse sources. Tapping and analysing these data suitably would open up new avenues and opportunities for healthcare services. In view of that, this paper aims to present a systematic overview of big data and big data analytics, applicable to modern-day healthcare. Acknowledging the massive upsurge in healthcare data generation, various 'V's, specific to healthcare big data, are identified. Different types of data analytics, applicable to healthcare, are discussed. Along with presenting the technological backbone of healthcare big data and analytics, the advantages and challenges of healthcare big data are meticulously explained. A brief report on the present and future market of healthcare big data and analytics is also presented. In addition, several applications and use cases are discussed in sufficient detail.

Positional Correlative Anatomy of Invertebrate Model Organisms Increases Efficiency of TEM Data Production

Microscopy and Microanalysis ◽

10.1017/s1431927614012999 ◽

2014 ◽

Vol 20 (5) ◽

pp. 1392-1403 ◽

Cited By ~ 15

Author(s):

Irina Kolotuev

Keyword(s):

Developmental Biology ◽

Cell Biology ◽

Sampling Rate ◽

Unmet Need ◽

Region Of Interest ◽

Model Organisms ◽

Data Generation ◽

Step Method ◽

Efficient Data ◽

Research Questions

Transmission electron microscopy (TEM) is an important tool for studies in cell biology, and is essential to address research questions from bacteria to animals. Recent technological innovations have advanced the entire field of TEM, yet classical techniques still prevail for most present-day studies. Indeed, the majority of cell and developmental biology studies that use TEM do not require cutting-edge methodologies, but rather fast and efficient data generation. Although access to state-of-the-art equipment is frequently problematic, standard TEM microscopes are typically available, even in modest research facilities. However, a major unmet need in standard TEM is the ability to quickly prepare and orient a sample to identify a region of interest. Here, I provide a detailed step-by-step method for a positional correlative anatomy approach to flat-embedded samples. These modifications make the TEM preparation and analytic procedures faster and more straightforward, supporting a higher sampling rate. To illustrate the modified procedures, I provide numerous examples addressing research questions in Caenorhabditis elegans and Drosophila. This method can be equally applied to address questions of cell and developmental biology in other small multicellular model organisms.

Information Freshness-Guaranteed and Energy-Efficient Data Generation Control System in Energy Harvesting Internet of Things

IEEE Access ◽

10.1109/access.2020.3023654 ◽

2020 ◽

Vol 8 ◽

pp. 168711-168720

Author(s):

Haneul Ko ◽

Hochan Lee ◽

Taeyun Kim ◽

Sangheon Pack

Keyword(s):

Control System ◽

Energy Harvesting ◽

Internet Of Things ◽

Energy Efficient ◽

Data Generation ◽

Efficient Data ◽

Generation Control

A Conceptual Model for Art Criticism

Život umjetnosti ◽

10.31664/zu.2019.105.06 ◽

2019 ◽

pp. 138-157

Author(s):

Maria Giovanna Mancini ◽

Luigi Sauro

Keyword(s):

Detailed Analysis ◽

Conceptual Model ◽

Art Criticism ◽

Data Retrieval ◽

Conceptual Modelling ◽

Related Data ◽

Efficient Data ◽

Cidoc Crm

In this work, we present a detailed analysis of the different acceptations and practices of art criticism. This investigation underpins a novel conceptual model that extends Cidoc CRM and has been specifically designed to semantically annotate art criticism-related data and documents, in order to enhance interoperability and enable more efficient data retrieval in this context.

The latitudinal gradient in rates of evolution for bird beaks, a species interaction trait

10.1101/2020.07.31.231142 ◽

2020 ◽

Author(s):

Benjamin G Freeman ◽

Dolph Schluter ◽

Joseph A Tobias

Keyword(s):

Meta Analysis ◽

Biotic Interactions ◽

Species Interaction ◽

Temperate Zone ◽

Ecological Opportunity ◽

Evolutionary Diversification ◽

Diversity Gradient ◽

Rates Of Evolution ◽

Speciation Rates ◽

Beak Size

Where is evolution fastest? The biotic interactions hypothesis proposes that greater species richness creates more ecological opportunity, driving faster evolution at low latitudes, whereas the “empty niches” hypothesis proposes that ecological opportunity is greater where diversity is low, spurring faster evolution at high latitudes. Here we tested these contrasting predictions by analyzing rates of bird beak evolution for a global dataset of 1141 sister pairs of birds. Beak size evolves at similar rates across latitudes, while beak shape evolves faster in the temperate zone, consistent with the empty niches hypothesis. We show in a meta-analysis that trait evolution and recent speciation rates are faster in the temperate zone, while rates of molecular evolution are slightly faster in the tropics. Our results suggest that drivers of evolutionary diversification are more potent at higher latitudes, thus calling into question multiple hypotheses invoking faster tropical evolution to explain the latitudinal diversity gradient.

Harmonization of whole-genome sequencing for outbreak surveillance of Enterobacteriaceae and Enterococci

Microbial Genomics ◽

10.1099/mgen.0.000567 ◽

2021 ◽

Vol 7 (7) ◽

Author(s):

Casper Jamin ◽

Sien De Koster ◽

Stefanie van Koeveringe ◽

Dieter De Coninck ◽

Klaas Mensaert ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Type Species ◽

De Novo ◽

Whole Genome ◽

Data Generation ◽

Sequencing Data ◽

Content Type ◽

Link Type ◽

Antimicrobial Resistance Genes

Whole-genome sequencing (WGS) is becoming the de facto standard for bacterial typing and outbreak surveillance of resistant bacterial pathogens. However, interoperability for WGS of bacterial outbreaks is poorly understood. We hypothesized that harmonization of WGS for outbreak surveillance is achievable through the use of identical protocols for both data generation and data analysis. A set of 30 bacterial isolates, comprising various species belonging to the Enterobacteriaceae family and the Enterococcus genus, was selected and sequenced using the same protocol on the Illumina MiSeq platform in each individual centre. All generated sequencing data were analysed by one centre using BioNumerics (6.7.3) for (i) genotyping origin of replications and antimicrobial resistance genes, and (ii) core-genome multi-locus sequence typing (cgMLST) for Escherichia coli and Klebsiella pneumoniae and whole-genome multi-locus sequence typing (wgMLST) for all species. Additionally, a split k-mer analysis was performed to determine the number of SNPs between samples. A precision of 99.0% and an accuracy of 99.2% were achieved for genotyping. Based on cgMLST, a discrepant allele was called only in 2/27 and 3/15 comparisons between two genomes, for E. coli and K. pneumoniae, respectively. Based on wgMLST, the number of discrepant alleles ranged from 0 to 7 (average 1.6). For SNPs, this ranged from 0 to 11 SNPs (average 3.4). Furthermore, we demonstrate that using different de novo assemblers to analyse the same dataset introduces up to 150 SNPs, which surpasses most thresholds for bacterial outbreaks. This shows the importance of harmonizing data processing for surveillance of bacterial outbreaks. In summary, multi-centre WGS for bacterial surveillance is achievable, but only if protocols are harmonized.
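
For readers unfamiliar with the two genotyping metrics reported in this abstract, the sketch below shows the standard confusion-matrix definitions of precision and accuracy in Python. The abstract does not say how the study computed them, so the definitions are an assumption, and the counts are purely illustrative and not taken from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of positive calls that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all calls that are correct: (TP + TN) / total."""
    return (tp + tn) / (tp + tn + fp + fn)


# Illustrative counts only (hypothetical, not data from the study).
tp, tn, fp, fn = 90, 85, 10, 15
print(f"precision = {precision(tp, fp):.1%}")
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.1%}")
```

Precision ignores true negatives and false negatives, so the two values can diverge when a centre misses genes rather than calling spurious ones.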

Automatic Classification Algorithm for Multisearch Data Association Rules in Wireless Networks

Wireless Communications and Mobile Computing ◽

10.1155/2021/5591387 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Cailing Li ◽

Wenjun Li

Keyword(s):

Wireless Network ◽

Association Rules ◽

Data Association ◽

Automatic Classification ◽

Classification Algorithm ◽

Fuzzy Classification ◽

Classification Error ◽

Data Generation ◽

High Coverage ◽

Efficient Data

To achieve efficient data processing in wireless networks, this paper designs an automatic classification algorithm for multisearch data association rules. The algorithm starts from the mining of multisearch data association rules and proceeds through the discretization of continuous attributes of the multisearch data, the generation of fuzzy classification rules, and the design of an association rule classifier, among other aspects; automatic classification is then completed by using the mining results. Experimental results show that this algorithm has the advantages of small classification error, good real-time performance, high coverage rate, and high feasibility.

Comprehensive Survey of Recent Drug Discovery Using Deep Learning

International Journal of Molecular Sciences ◽

10.3390/ijms22189983 ◽

2021 ◽

Vol 22 (18) ◽

pp. 9983

Author(s):

Jintae Kim ◽

Sera Park ◽

Dongbo Min ◽

Wankyu Kim

Keyword(s):

Deep Learning ◽

Drug Discovery ◽

Drug Design ◽

De Novo ◽

Molecular Structures ◽

De Novo Drug Design ◽

Related Data ◽

Benchmark Datasets ◽

Comprehensive Survey ◽

Model Training

Drug discovery based on artificial intelligence has been in the spotlight recently as it significantly reduces the time and cost required for developing novel drugs. With the advancement of deep learning (DL) technology and the growth of drug-related data, numerous deep-learning-based methodologies are emerging at all steps of drug development processes. In particular, pharmaceutical chemists have faced significant issues with regard to selecting and designing potential drugs for a target of interest to enter preclinical testing. The two major challenges are prediction of interactions between drugs and druggable targets and generation of novel molecular structures suitable for a target of interest. Therefore, we reviewed recent deep-learning applications in drug–target interaction (DTI) prediction and de novo drug design. In addition, we introduce a comprehensive summary of a variety of drug and protein representations, DL models, and commonly used benchmark datasets or tools for model training and testing. Finally, we present the remaining challenges for the promising future of DL-based DTI prediction and de novo drug design.

Assessment of human diploid genome assembly with 10x Linked-Reads data

10.1101/729608 ◽

2019 ◽

Cited By ~ 2

Author(s):

Lu Zhang ◽

Xin Zhou ◽

Ziming Weng ◽

Arend Sidow

Keyword(s):

Parameter Space ◽

Genome Assembly ◽

De Novo ◽

Cost Effective ◽

Personal Genome ◽

Dna Fragments ◽

Data Generation ◽

Library Preparation ◽

Assembly Quality ◽

Practical Guidelines

Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries. Findings: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332X and 823X and assembly quality worsened if it increased to greater than 1,000X for a given C. Long DNA fragments could significantly extend phase blocks, but decreased contig contiguity. The optimal length-weighted fragment length (WμFL) was around 50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ca. 80% of the genome was assembled in a diploid state. Conclusion: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.
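
The interplay of the three coverage parameters in this abstract can be sketched in a few lines of Python. The sketch assumes the usual Linked-Read relation C = CF × CR, which the abstract itself does not state. The function names and the example values are hypothetical; only the thresholds (332X, 823X and 1,000X) come from the text.

```python
def read_coverage_per_fragment(total_coverage: float, physical_coverage: float) -> float:
    """Solve the assumed relation C = CF * CR for CR."""
    return total_coverage / physical_coverage


def physical_coverage_note(cf: float) -> str:
    """Classify CF against the ranges reported in the abstract."""
    if 332 <= cf <= 823:
        return "within reported optimum (332X-823X)"
    if cf > 1000:
        return "above 1,000X, where assembly quality reportedly worsened"
    return "outside reported optimum"


# Hypothetical library: total coverage C = 56X, physical coverage CF = 400X.
c, cf = 56.0, 400.0
print(read_coverage_per_fragment(c, cf), physical_coverage_note(cf))
```

Under this relation, raising CF at a fixed C necessarily lowers CR, which would be consistent with quality dropping once CF exceeds 1,000X for a given C.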

Liberating Biodiversity Data From COVID-19 Lockdown: Toward a knowledge hub for mammal host-virus information

Biodiversity Information Science and Standards ◽

10.3897/biss.4.59199 ◽

2020 ◽

Vol 4 ◽

Author(s):

Nathan Upham ◽

Donat Agosti ◽

Jorrit Poelen ◽

Lyubomir Penev ◽

Deborah Paul ◽

...

Keyword(s):

Biotic Interactions ◽

Basic Knowledge ◽

Deer Mice ◽

Data Repository ◽

Ecological Data ◽

Viral Reservoirs ◽

Mammal Species ◽

Biodiversity Knowledge ◽

Biodiversity Science ◽

Mammal Host

A deep irony of COVID-19 likely originating from a bat-borne coronavirus (Boni et al. 2020) is that the global lockdown to quell the pandemic also locked up physical access to much basic knowledge regarding bat biology. Digital access to data on the ecology, geography, and taxonomy of potential viral reservoirs, from Southeast Asian horseshoe bats and pangolins to North American deer mice, was suddenly critical for understanding the disease's emergence and spread. However, much of this information lay inside rare books and personal files rather than as open, linked, and queryable resources on the internet. Even the world's experts on mammal taxonomy and zoonotic disease could not retrieve their data from shuttered laboratories. We were caught unprepared. Why, in this digitally connected age, were such fundamental data describing life on Earth not already freely accessible online? Understanding why biodiversity science was unprepared, and how to fix it before the next pandemic, has been the focus of our COVID-19 Taskforce since April 2020 and is continuing (organized by CETAF and DiSSCo). We are a group of museum-based and academic scientists with the goal of opening the rich ecological data stored in natural history collections to the research public. This information is rooted in what may seem an unlikely location: taxonomic names and their historical usages, which are the keys for searching literature and extracting linked ecological data (Fig. 1). This has been the core motivation of our group, enabled by the pioneering efforts of Plazi (Agosti and Egloff 2009) to build tools for literature digitization, extraction, and parsing (e.g., Synospecies, Ocellus) without which biodiversity science would be even less prepared. Our group led efforts to build an additional pipeline from Plazi to the Biodiversity Literature Repository at Zenodo, a free and unlimited data repository (Agosti et al. 2019), and then to GloBI, an open-source database of biotic interactions (Poelen et al. 2014, GloBI 2020). We also developed a direct integration from Pensoft Journals to GloBI, leveraging that publisher’s indexing of computer-readable terms (called semantic metadata; Senderov et al. 2018) to extract mammal host and virus information. Overall, considerable progress was made. In total, 85,492 new interactions were added to GloBI from 14 April to 21 May 2020 (see entire dataset on Zenodo: Poelen et al. 2020). Of those, 28,839 interactions are present when subset to "hasHost", "hostOf", "pathogenOf", "virus", and 4,101 unique name combinations are present after considering mammal species synonymies (from Meyer et al. 2015). Of those interactions, 892 species of mammals and 1,530 unique virus names are involved, which compares to 754 mammals and 586 viruses in the most recent data synthesis (Olival et al. 2017). While these liberated data may still include redundancies, they demonstrate the value of our approach and the expanse of known but digitally unconnected data that remains locked in publications. We can liberate host-virus data from publications, but doing so is expensive and does not scale to the continued influx of new articles that are inadequately digitized. Our efforts make it clear that Pensoft-style semantic publishing should be expanded to all major journals. The pandemic has created an opportunity for re-thinking the way we do science in the digital age. Thankfully, our future is not the past, so we do not have to keep wasting resources to digitally 'rediscover' biodiversity knowledge. We collectively call for changes to the publishing paradigm, so that research findings are directly accessible, citable, discoverable, and reusable for creating complete forms of digital knowledge.
