Optimized Focused Web Crawler with Natural Language Processing Based Relevance Measure in Bioinformatics Web Sources

S. R. Mani Sekhar; G. M. Siddesh; Sunilkumar S. Manvi; K. G. Srinivasa

doi:10.2478/cait-2019-0021


Cybernetics and Information Technologies ◽

10.2478/cait-2019-0021 ◽

2019 ◽

Vol 19 (2) ◽

pp. 146-158 ◽

Cited By ~ 1

Author(s):

S. R. Mani Sekhar ◽

G. M. Siddesh ◽

Sunilkumar S. Manvi ◽

K. G. Srinivasa

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Web Page ◽

Breadth First Search ◽

Web Crawler ◽

Web Crawlers ◽

Relevance Measure ◽

Focused Crawlers

Abstract With the rapid growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the vast amount of data available on the internet. Web crawlers face an indeterminate latency problem caused by differences in response time. The proposed work attempts to optimize the design and implementation of focused web crawlers using a master-slave architecture for Bioinformatics web sources. Ideally, focused crawlers should crawl only relevant pages, but the relevance of a page can only be estimated after crawling the genomics pages. The paper proposes a solution for predicting page relevance based on Natural Language Processing. The frequency of the keywords in the top-ranked sentences of a page determines the relevance of pages within genomics sources. The proposed solution uses the TextRank algorithm to rank the sentences and to ensure the correct classification of Bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first search web crawler. The comparison shows a significant reduction in run time for the same harvest rate.
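The relevance measure described in this abstract (rank sentences with TextRank, then count keyword occurrences in the top-ranked sentences) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the naive sentence splitter, the word-overlap similarity, the `top_k` cutoff and the normalisation by token count are all assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(a, b):
    """Word-overlap similarity normalised by log sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0 would make the denominator degenerate
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def page_relevance(text, keywords, top_k=3):
    """Fraction of tokens in the top_k ranked sentences that are keywords.

    The normalisation is an assumption; the abstract only says that keyword
    frequency in the top-ranked sentences determines relevance.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top_tokens = [t for i in ranked[:top_k] for t in tokenize(sentences[i])]
    if not top_tokens:
        return 0.0
    keyset = {k.lower() for k in keywords}
    hits = sum(1 for t in top_tokens if t in keyset)
    return hits / len(top_tokens)
```

In a focused crawler, a score like this would typically be compared against a threshold to decide whether a fetched page's outgoing links are worth enqueuing; the threshold value and the keyword list would have to be chosen for the target domain.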


Automatic Extraction and Classification of Patients’ Smoking Status from Free Text Using Natural Language Processing

Value in Health ◽

10.1016/j.jval.2016.09.158 ◽

2016 ◽

Vol 19 (7) ◽

pp. A373

Author(s):

A Caccamisi ◽

L Jörgensen ◽

H Dalianis ◽

M Rosenlund

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Smoking Status ◽

Free Text ◽

Automatic Extraction


A Comparison of Natural Language Processing Methods for the Classification of Lumbar Spine Imaging Findings Related to Lower Back Pain

Academic Radiology ◽

10.1016/j.acra.2021.09.005 ◽

2021 ◽

Author(s):

Chethan Jujjavarapu ◽

Vikas Pejaver ◽

Trevor A. Cohen ◽

Sean D. Mooney ◽

Patrick J. Heagerty ◽

...

Keyword(s):

Natural Language Processing ◽

Back Pain ◽

Lumbar Spine ◽

Natural Language ◽

Language Processing ◽

Lower Back Pain ◽

Imaging Findings ◽

Lower Back ◽

Spine Imaging


Understanding Legal Documents: Classification of Rhetorical Role of Sentences Using Deep Learning and Natural Language Processing

2020 IEEE 14th International Conference on Semantic Computing (ICSC) ◽

10.1109/icsc.2020.00089 ◽

2020 ◽

Author(s):

Syed Rameel Ahmad ◽

Deborah Harris ◽

Ibrahim Sahibzada

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Legal Documents


Identification of Symptoms Based on Natural Language Processing (NLP) for Disease Diagnosis Based on International Classification of Diseases and Related Health Problems (ICD-11)

2019 International Electronics Symposium (IES) ◽

10.1109/elecsym.2019.8901644 ◽

2019 ◽

Cited By ~ 1

Author(s):

Fariz Bramasta Putra ◽

Alviansyah Arman Yusuf ◽

Heri Yulianus ◽

Yogi Putra Pratama ◽

Dzakiyah Salma Humairra ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

International Classification Of Diseases ◽

Disease Diagnosis ◽

Health Problems ◽

International Classification ◽

Classification Of Diseases


Auto-Suggestive Real-Time Classification of Driller Memos into Activity Codes Using Natural Language Processing

10.2118/199593-ms ◽

2020 ◽

Author(s):

Jared Ucherek ◽

Tesleem Lawal ◽

Matthew Prinz ◽

Lisa Li ◽

Pradeepkumar Ashok ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Real Time ◽

Language Processing ◽

Real Time Classification


Natural language processing and machine learning to enable automatic extraction and classification of patients’ smoking status from electronic medical records

Upsala Journal of Medical Sciences ◽

10.1080/03009734.2020.1792010 ◽

2020 ◽

Vol 125 (4) ◽

pp. 316-324

Author(s):

Andrea Caccamisi ◽

Leif Jørgensen ◽

Hercules Dalianis ◽

Mats Rosenlund

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Electronic Medical Records ◽

Language Processing ◽

Medical Records ◽

Smoking Status ◽

Automatic Extraction


Automated Classification of Computer-Based Medical Device Recalls: An Application of Natural Language Processing and Statistical Learning

2014 IEEE 27th International Symposium on Computer-Based Medical Systems ◽

10.1109/cbms.2014.134 ◽

2014 ◽

Author(s):

Homa Alemzadeh ◽

Raymond Hoagland ◽

Zbigniew Kalbarczyk ◽

Ravishankar K. Iyer

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Medical Device ◽

Statistical Learning ◽

Language Processing ◽

Automated Classification ◽

Computer Based


Classification of CT pulmonary angiography reports by presence, chronicity, and location of pulmonary embolism with natural language processing

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2014.08.001 ◽

2014 ◽

Vol 52 ◽

pp. 386-393 ◽

Cited By ~ 20

Author(s):

Sheng Yu ◽

Kanako K. Kumamaru ◽

Elizabeth George ◽

Ruth M. Dunne ◽

Arash Bedayat ◽

...

Keyword(s):

Pulmonary Embolism ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Pulmonary Angiography ◽

Ct Pulmonary Angiography


SAT-LB111 Improving Classification of Diabetes Etiology in Electronic Resources Using Phenotype Algorithms and Polygenic Risk Scores

Journal of the Endocrine Society ◽

10.1210/jendso/bvaa046.2239 ◽

2020 ◽

Vol 4 (Supplement_1) ◽

Author(s):

Lina Sulieman ◽

Jing He ◽

Robert Carroll ◽

Lisa Bastarache ◽

Andrea Ramirez

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Chart Review ◽

Risk Scores ◽

P Value ◽

Learning Approaches ◽

Data Types ◽

Electronic Health

Abstract Electronic Health Records (EHR) contain rich data to identify and study diabetes. Many phenotype algorithms have been developed to identify research subjects with type 2 diabetes (T2D), but very few accurately identify type 1 diabetes (T1D) cases or rarer forms of monogenic and atypical metabolic presentations. Polygenic risk scores (PRS) quantify the risk of a disease well for both T1D and T2D using common genomic variants. In this study, we apply validated phenotyping algorithms to EHRs linked to a genomic biobank to understand the independent contribution of PRS to classification of diabetes etiology and to generate additional novel markers that distinguish subtypes of diabetes in EHR data. Using a de-identified mirror of a medical center’s electronic health record, we applied published algorithms for T1D and T2D to identify cases, and used natural language processing and chart review strategies to identify cases of maturity onset diabetes of the young (MODY) and other rarer presentations. This novel approach included additional data types such as medication sequencing, ratio and temporality of insulin and non-insulin agents, clinical genetic testing, and ratios of diagnostic codes. Chart review was performed to validate etiology. To calculate PRS, we used genome-wide genotyping from our BioBank, a de-identified biobank linking EHR to genomic data, applying coefficients of 65 published T1D SNPs and 76,996 T2D SNPs with PLINK in Caucasian subjects. In the dataset, we identified 82,238 cases of T2D but only 130 cases of T1D using the most cited published algorithms. Adding novel structured elements and natural language processing identified an additional 138 cases of T1D and distinguished 354 cases as MODY. Among over 90,000 subjects with genotyping data available, we included 72,624 Caucasian subjects since PRS coefficients were generated in Caucasian cohorts. Among those subjects, 248, 6,488, and 21 subjects were identified as T1D, T2D, and MODY subjects respectively in our final PRS cohort. The T1D PRS discriminated significantly between cases and controls (Mann-Whitney p-value 3.4e-17). The PRS for T2D did not significantly discriminate between cases and controls using published algorithms. The atypical case count was too low to calculate PRS discrimination. Calculation of the PRS was limited by quality inclusion of variants available, and discrimination may improve in larger data sets. Additionally, blinded physician case review is ongoing to validate the novel classification scheme and provide a gold standard for machine learning approaches that can be applied in validation sets.


Emotion Classification in Spanish: Exploring the Hard Classes

Information ◽

10.3390/info12110438 ◽

2021 ◽

Vol 12 (11) ◽

pp. 438

Author(s):

Aiala Rosá ◽

Luis Chiruzzo

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Sentiment Analysis ◽

Language Processing ◽

Emotion Classification

The study of affective language has had numerous developments in the Natural Language Processing area in recent years, but the focus has been predominantly on Sentiment Analysis, an expression usually used to refer to the classification of texts according to their polarity or valence (positive vs. negative). The study of emotions, such as joy, sadness, anger, surprise, among others, has been much less developed and has fewer resources, both for English and for other languages, such as Spanish. In this paper, we present the most relevant existing resources for the study of emotions, mainly for Spanish; we describe some heuristics for the union of two existing corpora of Spanish tweets; and based on some experiments for classification of tweets according to seven categories (anger, disgust, fear, joy, sadness, surprise, and others) we analyze the most problematic classes.
