sequence databases

2022 ◽  
pp. 25-36
Author(s):  
Mohammad Yaseen Sofi ◽  
Afshana Shafi ◽  
Khalid Z. Masoodi

Author(s):  
Nesma Youssef ◽  
Hatem Abdulkader ◽  
Amira Abdelwahab

Sequential rule mining is one of the most common data mining techniques. It aims to find useful rules in large sequence databases, extracting the essential information needed to acquire knowledge from large search spaces and to select interesting rules. The key challenge is to keep execution time low, which is particularly difficult in large sequence databases. This paper studies rule mining from two representations of sequential patterns in order to obtain compact databases without affecting the final result. In addition, it applies a parallel approach that uses a multi-core processor architecture to mine non-redundant sequential rules, and it applies pruning techniques to improve efficiency when generating rules. The proposed algorithm was evaluated by comparing it with another non-redundant sequential rule algorithm, Non-Redundant with Dynamic Bit Vector (NRD-DBV). Both algorithms were run on four real datasets with different characteristics. Our experiments measure the performance of the proposed algorithm in terms of execution time and computational cost. It achieves the highest efficiency, especially for large datasets and low values of minimum support, where it takes approximately half the time consumed by the compared algorithm.
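
To make the terms in this abstract concrete, the sketch below computes the support and confidence of a single sequential rule X -> Y over a toy sequence database. It is only an illustration of the general definitions, assuming each sequence is a list of single items; it is not the parallel algorithm proposed in the paper, nor NRD-DBV, and the names `rule_occurs` and `rule_metrics` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def rule_occurs(seq: Sequence[str], antecedent: Set[str], consequent: Set[str]) -> bool:
    """Return True if every antecedent item appears before every consequent item."""
    seen: Set[str] = set()
    for k, item in enumerate(seq):
        seen.add(item)
        if antecedent <= seen:
            # Earliest point where the antecedent is complete: the consequent
            # must be fully contained in what remains of the sequence.
            return consequent <= set(seq[k + 1:])
    return False


def rule_metrics(
    db: List[Sequence[str]], antecedent: Set[str], consequent: Set[str]
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_rule = sum(rule_occurs(s, antecedent, consequent) for s in db)
    n_antecedent = sum(antecedent <= set(s) for s in db)
    support = n_rule / len(db) if db else 0.0
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy database (made up for illustration only).
toy_db = [["a", "b", "c"], ["a", "c", "b"], ["b", "a"], ["a", "b"]]
print(rule_metrics(toy_db, {"a"}, {"b"}))
```

A rule is usually reported only if its support reaches a user-chosen minimum support, which is why low minimum-support values enlarge the search space and increase run time.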


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e12625
Author(s):  
Yoonhee Cho ◽  
Ji Seon Kim ◽  
Yu-Cheng Dai ◽  
Yusufjon Gafforov ◽  
Young Woon Lim

Genus Xylodon consists of white-rot fungi that grow on both angiosperms and gymnosperms. With resupinate and adnate basidiomes, Xylodon species have been classified into other resupinate genera for a long time. Upon the integration of molecular assessments, the taxonomy of the genus has been revised multiple times over the years. However, the emendations were poorly reflected in studies and public sequence databases. In the present study, the genus Xylodon in Korea was evaluated using molecular and morphological analyses of 172 specimens collected from 2011 to 2018. The host types and geographical distributions were also determined for species delimitation. Furthermore, public sequences that correspond to the Xylodon species in Korea were assessed to validate their identities. Nine Xylodon species were identified in Korea, with three species new to the country. Morphological differentiation and identification of some species were challenging, but all nine species were clearly divided into well-resolved clades in the phylogenetic analyses. Detailed species descriptions, phylogeny, and a key to Xylodon species in Korea are provided in the present study. A total of 646 public ITS and nrLSU sequences corresponding to the nine Xylodon species were found, of which 404 ITS sequences (73.1%) and 57 nrLSU sequences (61.3%) were misidentified or labeled with synonymous names. In many cases, sequences released before the report of new names have not been revised or updated. Revisions of these sequences are compiled in the present study. These amendments may be used to avoid misidentification in future sequence-based identifications and, at the same time, to prevent the accumulation of misidentified sequences in GenBank.


2021 ◽  
Author(s):  
Cassidy Moeke

The greenshell mussel Perna canaliculus is considered to be a suitable biomonitor for heavy metal pollution because it can accumulate and tolerate heavy metals in its tissues. These characteristics make it useful for identifying protein biomarkers of heavy metal pollution, as well as proteins associated with heavy metal detoxification and homeostasis. However, the identification of such proteins is restricted by the greenshell mussel being poorly represented in sequence databases. Several strategies have previously been used to identify proteins in unsequenced species, but only one of these strategies has been applied to the greenshell mussel. The objective of this thesis was to examine different protein identification strategies using a combined two-dimensional gel electrophoresis and MALDI-TOF/TOF mass spectrometry approach. The protein identification strategies used include a Mascot database search, as well as de novo sequencing approaches using PEAKS DB and SPIDER homology searches. In total, 155 protein spots were excised and 68 were identified. Fifty-six proteins were identified using a Mascot search against the Mollusca, NCBInr and Invertebrate EST databases, with seven single-peptide identifications. De novo sequencing strategies identified additional proteins, with two from a PEAKS DB search and 10 from an error-tolerant SPIDER homology search. The most noticeable protein groups identified were cytoskeletal proteins, stress response proteins and those involved in protein biosynthesis. Actin and tubulin made up the bulk of the identifications, accounting for 39% of all proteins identified. This multifaceted approach was shown to be useful for identifying proteins in the greenshell mussel Perna canaliculus. Mascot and PEAKS DB performed equally well, while the error-tolerant functionality of SPIDER was useful for identifying additional proteins. A subsequent search against the Invertebrate EST database was also found to be useful for identifying additional proteins. Despite this, more than half of all proteins remained unidentified. Most of these proteins either failed to produce good-quality MS spectra or did not match a sequence in the database. Future research should first focus on obtaining quality MS spectra for all proteins concerned and then examine other strategies that may be more suitable for identifying proteins in species with poor representation in sequence databases.


Author(s):  
Cibele Sotero-Caio ◽  
Richard Challis ◽  
Sujai Kumar ◽  
Mark Blaxter

Genomic data are transforming our understanding of biodiversity and, under the umbrella of the Earth BioGenome Project (EBP - https://www.earthbiogenome.org), many initiatives seek to generate large numbers of reference genome sequences. The distributed nature of this work makes coordination essential to ensure optimal synergy between projects and to prevent duplication of effort. While public sequence databases hold data describing completed projects, there is currently no global source of information about projects in progress or planned. In addition, the scoping and delivery of sequencing projects benefits from prior estimates of genome size and karyotype, but existing data are scattered in the literature. To address these issues, the Tree of Life programme (https://www.sanger.ac.uk/programme/tree-of-life/) has developed Genomes on a Tree (GoaT), an ElasticSearch-powered, taxon-centred database that collates observed and estimated genome-relevant metadata—including genome sizes and karyotypes—for eukaryotic species. Missing values for individual species are estimated from phylogenetic comparison. GoaT also holds declarations of actual and planned activity, from priority lists and in-progress status, to submissions to the International Nucleotide Sequence Database Collaboration (INSDC https://www.insdc.org/), across genome sequencing consortia. GoaT can be queried through a mature API (application programming interface), and we have developed a web front-end that includes data summary visualisations (see https://goat.genomehubs.org/). We are currently transitioning this service into the Tree of Life production pipeline. GoaT currently reports priority lists from the Darwin Tree of Life project (focussed on the biodiversity of Britain and Ireland). 
We are actively soliciting additional data concerning progress and intent from other projects so that GoaT displays a real-time summary of the state of play in reference genome sequencing, and thus facilitates collaboration and cooperation among projects. We are developing standard formats and procedures so that any project can make explicit its intent and progress. Cross referencing to other data systems such as the INSDC sequence databases, the BOLD DNA barcodes resource and Global Biodiversity Information Facility- and Open Tree of Life-related taxonomic and distribution databases will further enhance the system’s utility. We also seek to incorporate additional kinds of metadata, such as sex chromosome systems, to augment the utility of GoaT in supporting the global genome sequencing effort.


2021 ◽  
Author(s):  
Jaya Srivastava ◽  
Ritu Hembrom ◽  
Ankita Kumawat ◽  
Petety V. Balaji

The UniProt and BFD databases together hold 2.5 billion protein sequences. A large majority of these proteins have been electronically annotated. Automated annotation pipelines, vis-à-vis manual curation, have the advantage of scale and speed but are fraught with relatively higher error rates. This is because sequence homology does not necessarily translate to functional homology, molecular function specification is hierarchical, and not all functional families have the same amount of experimental data that one can exploit for annotation. Consequently, customization of the annotation workflow is inevitable to minimize annotation errors. In this study, we illustrate possible ways of customizing the search of sequence databases for functional homologs using profile HMMs. Choosing an optimal bit score threshold is a critical step in the application of HMMs. We illustrate ways in which an optimal bit score threshold can be arrived at using four case studies. These are the single-domain nucleotide sugar 6-dehydrogenase and lysozyme-C families, and the SH3 and GT-A domains, which are typically found as part of multi-domain proteins. We also discuss the limitations of using profile HMMs for functional annotation and suggest some possible ways to partially overcome such limitations.
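
As a rough illustration of what choosing a bit score threshold can involve, the sketch below scans candidate thresholds over the bit scores of hits with known labels and keeps the one with the best F1 score. This is a generic approach under stated assumptions, not the procedure used in the study: the scores are made up, the function name `best_bit_score_threshold` is hypothetical, and it assumes a set of hits already labeled as true functional homologs or non-homologs.

```python
from typing import List, Tuple


def best_bit_score_threshold(
    positive_scores: List[float], negative_scores: List[float]
) -> Tuple[float, float]:
    """Return (threshold, f1) maximizing F1 when hits with score >= threshold are accepted.

    Ties are resolved in favor of the higher (more conservative) threshold.
    """
    if not positive_scores:
        raise ValueError("at least one positive score is required")
    # Only observed scores need to be tried; scan from high to low.
    candidates = sorted(set(positive_scores) | set(negative_scores), reverse=True)
    best_threshold, best_f1 = candidates[0], -1.0
    for t in candidates:
        tp = sum(s >= t for s in positive_scores)   # homologs accepted
        fp = sum(s >= t for s in negative_scores)   # non-homologs accepted
        fn = len(positive_scores) - tp              # homologs rejected
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best_f1:  # strict comparison keeps the higher threshold on ties
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Made-up bit scores for illustration only.
homolog_scores = [310.5, 280.0, 150.2, 95.0]
non_homolog_scores = [120.0, 60.3, 40.1]
print(best_bit_score_threshold(homolog_scores, non_homolog_scores))
```

Other criteria, such as requiring zero false positives, may suit a given family better; the choice depends on how much experimental data is available for that family.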


2021 ◽  
Vol 9 (6) ◽  
pp. 1229
Author(s):  
Corentine Alauzet ◽  
Fabien Aujoulat ◽  
Alain Lozniewski ◽  
Safa Ben Brahim ◽  
Chloé Domenjod ◽  
...  

Solobacterium moorei is an anaerobic Gram-positive bacillus present within the oral and the intestinal microbiota that has rarely been described in human infections. Besides its role in halitosis and oral infections, S. moorei is considered to be an opportunistic pathogen causing mainly bloodstream and surgical wound infections. We performed a retrospective study of 27 cases of infections involving S. moorei in two French university hospitals between 2006 and 2021 with the aim of increasing our knowledge of this unrecognized opportunistic pathogen. We also reviewed all the data available in the literature and in genetic and metagenomic sequence databases. In addition to previously reported infections, S. moorei had been isolated from various sites and involved in intra-abdominal, osteoarticular, and cerebral infections more rarely or not previously reported. Although mostly involved in polymicrobial infections, in seven cases, it was the only pathogen recovered. Not included in all mass spectrometry databases, its identification can require 16S rRNA gene sequencing. High susceptibility to antibiotics (apart from rifampicin, moxifloxacin, and clindamycin; 91.3%, 11.8%, and 4.3% of resistant strains, respectively) has been noted. Our global search strategy revealed S. moorei to be human-associated, widely distributed in the human microbiota, including the vaginal and skin microbiota, which may be other sources for infection in addition to the oral and gut microbiota.


2021 ◽  
Author(s):  
Dominik Pistorius ◽  
Kathrin Buntin ◽  
Caroline Bouquet ◽  
Etienne Richard ◽  
Eric Weber ◽  
...  

The depsipeptide FR900359 was first described in the literature in 1988 (Fujioka et al, 1988) as isolated from a methanol extract of the whole plant of Ardisia crenata. FR900359 can be isolated from the leaves of A. crenata, but the very low quantities and the complex matrix prevent access to sufficient amounts of FR900359 to enable drug development efforts and potential commercial manufacturing. Almost two decades later, it was discovered that FR900359 is in fact produced by a strictly obligate bacterial endosymbiont, Candidatus Burkholderia crenata, of the plant Ardisia crenata (Carlier et al, 2016). This study also identified the DNA sequence of the biosynthetic gene cluster (BGC) of FR900359. In order to identify alternative and scalable methods for production of FR900359, a genome mining effort was initiated on bacterial genomes from both public sequence databases and genome sequences generated from internal efforts at Novartis. Translated amino acid sequences of the FR900359-BGC from Candidatus B. crenata were used as query sequences. While the query of public sequence databases did not return highly similar sequences, a gene cluster with very high homology in translated amino acid sequence and identical prediction of protein functions was discovered in the genome of Chromobacterium vaccinii DSM 25150, which had been sequenced internally at Novartis. Here we describe the genetic engineering of Chromobacterium vaccinii DSM 25150 resulting in mutants that exhibit improved production of FR900359 and improved characteristics concerning downstream processing and purification.



