scholarly journals Optimized Focused Web Crawler with Natural Language Processing Based Relevance Measure in Bioinformatics Web Sources

2019 ◽  
Vol 19 (2) ◽  
pp. 146-158 ◽  
Author(s):  
S. R. Mani Sekhar ◽  
G. M. Siddesh ◽  
Sunilkumar S. Manvi ◽  
K. G. Srinivasa

Abstract In the fast growing of digital technologies, crawlers and search engines face unpredictable challenges. Focused web-crawlers are essential for mining the boundless data available on the internet. Web-Crawlers face indeterminate latency problem due to differences in their response time. The proposed work attempts to optimize the designing and implementation of Focused Web-Crawlers using Master-Slave architecture for Bioinformatics web sources. Focused Crawlers ideally should crawl only relevant pages, but the relevance of the page can only be estimated after crawling the genomics pages. A solution for predicting the page relevance, which is based on Natural Language Processing, is proposed in the paper. The frequency of the keywords on the top ranked sentences of the page determines the relevance of the pages within genomics sources. The proposed solution uses a TextRank algorithm to rank the sentences, as well as ensuring the correct classification of Bioinformatics web page. Finally, the model is validated by being compared with a breadth first search web-crawler. The comparison shows significant reduction in run time for the same harvest rate.

2020 ◽  
Author(s):  
Jared Ucherek ◽  
Tesleem Lawal ◽  
Matthew Prinz ◽  
Lisa Li ◽  
Pradeepkumar Ashok ◽  
...  

2020 ◽  
Vol 4 (Supplement_1) ◽  
Author(s):  
Lina Sulieman ◽  
Jing He ◽  
Robert Carroll ◽  
Lisa Bastarache ◽  
Andrea Ramirez

Abstract Electronic Health Records (EHR) contain rich data to identify and study diabetes. Many phenotype algorithms have been developed to identify research subjects with type 2 diabetes (T2D), but very few accurately identify type 1 diabetes (T1D) cases or more rare forms of monogenic and atypical metabolic presentations. Polygenetic risk scores (PRS) quantify risk of a disease using common genomic variants well for both T1D and T2D. In this study, we apply validated phenotyping algorithms to EHRs linked to a genomic biobank to understand the independent contribution of PRS to classification of diabetes etiology and generate additional novel markers to distinguish subtypes of diabetes in EHR data. Using a de-identified mirror of medical center’s electronic health record, we applied published algorithms for T1D and T2D to identify cases, and used natural language processing and chart review strategies to identify cases of maturity onset diabetes of the young (MODY) and other more rare presentations. This novel approach included additional data types such as medication sequencing, ratio and temporality of insulin and non-insulin agents, clinical genetic testing, and ratios of diagnostic codes. Chart review was performed to validate etiology. To calculate PRS, we used genome wide genotyping from our BioBank, the de-identified biobank linking EHR to genomic data using coefficients of 65 published T1D SNPS and 76,996 T2D SNPS using PLINK in Caucasian subjects. In the dataset, we identified 82,238 cases of T2D but only 130 cases of T1D using the most cited published algorithms. Adding novel structured elements and natural language processing identified an additional 138 cases of T1D and distinguished 354 cases as MODY. Among over 90,000 subjects with genotyping data available, we included 72,624 Caucasian subjects since PRS coefficients were generated in Caucasian cohorts. Among those subjects, 248, 6,488, and 21 subjects were identified as T1D, T2D, and MODY subjects respectively in our final PRS cohort. The T1D PRS did significantly discriminate well between cases and controls (Mann-Whitney p-value is 3.4 e-17). The PRS for T2D did not significantly discriminate between cases and controls using published algorithms. The atypical case count was too low to calculate PRS discrimination. Calculation of the PRS score was limited by quality inclusion of variants available, and discrimination may improve in larger data sets. Additionally, blinded physician case review is ongoing to validate the novel classification scheme and provide a gold standard for machine learning approaches that can be applied in validation sets.


Information ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 438
Author(s):  
Aiala Rosá ◽  
Luis Chiruzzo

The study of affective language has had numerous developments in the Natural Language Processing area in recent years, but the focus has been predominantly on Sentiment Analysis, an expression usually used to refer to the classification of texts according to their polarity or valence (positive vs. negative). The study of emotions, such as joy, sadness, anger, surprise, among others, has been much less developed and has fewer resources, both for English and for other languages, such as Spanish. In this paper, we present the most relevant existing resources for the study of emotions, mainly for Spanish; we describe some heuristics for the union of two existing corpora of Spanish tweets; and based on some experiments for classification of tweets according to seven categories (anger, disgust, fear, joy, sadness, surprise, and others) we analyze the most problematic classes.


Sign in / Sign up

Export Citation Format

Share Document