Lithology prediction by support vector classifiers using inverted seismic attributes data and petrophysical logs as a new approach and investigation of training data set size effect on its performance in a heterogeneous carbonate reservoir

2015 ◽  
Vol 134 ◽  
pp. 143-149 ◽  
Author(s):  
Mohammad Ali Sebtosheikh ◽  
Ali Salehi

Kybernetes ◽
2014 ◽  
Vol 43 (5) ◽  
pp. 817-834
Author(s):  
Mohammad Amin Shayegan ◽  
Saeed Aghabozorgi

Purpose – Pattern recognition systems often have to handle the problem of very large training data sets that include duplicate and similar samples. This problem leads to large memory requirements for storing and processing data, and high time complexity for training algorithms. The purpose of this paper is to reduce the volume of the training part of a data set – in order to increase the system speed – without any significant decrease in system accuracy. Design/methodology/approach – A new technique for data set size reduction – using a version of the modified frequency diagram approach – is presented. To reduce processing time, the proposed method compares the samples of a class to other samples in the same class, instead of comparing samples from different classes. It removes only those patterns that are similar to the class template generated for each class. To this end, no feature extraction operation was carried out, so as to produce a more precise assessment of the proposed data size reduction technique. Findings – Experiments on Hoda, one of the largest standard handwritten-numeral optical character recognition (OCR) data sets, show a 14.88 percent decrease in data set volume without a significant decrease in performance. Practical implications – The proposed technique is effective for size reduction of all pictorial databases, such as OCR data sets. Originality/value – State-of-the-art algorithms for data set size reduction usually remove samples near class centers, or support vector (SV) samples between different classes. However, samples near a class center carry valuable information about class characteristics and are necessary to build a system model, and SVs are important samples for evaluating system efficiency. The proposed technique, unlike other available methods, keeps both outlier samples and samples close to the class centers.
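The within-class comparison described above can be sketched in a few lines. This is a minimal illustration only: it uses the class mean as the "class template" and cosine similarity as the similarity measure, both stand-ins for the paper's modified frequency diagram approach.

```python
import numpy as np

def reduce_by_class_template(X, y, threshold=0.95):
    """Drop samples that are highly similar to their own class template.

    Sketch only: the template here is the class mean and similarity is
    cosine similarity; the paper builds templates from a modified
    frequency diagram instead. Samples at or above the threshold are
    treated as redundant; outliers and less typical samples are kept.
    """
    keep = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        template = X[idx].mean(axis=0)          # stand-in class template
        for i in idx:
            num = float(X[i] @ template)
            den = np.linalg.norm(X[i]) * np.linalg.norm(template)
            sim = num / den if den > 0 else 0.0
            if sim < threshold:                 # keep only dissimilar samples
                keep.append(int(i))
    return sorted(keep)
```

Note that, in line with the paper's originality claim, this keeps outliers and samples far from the template rather than removing them.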


2018 ◽  
Vol 34 (3) ◽  
pp. 569-581 ◽  
Author(s):  
Sujata Rani ◽  
Parteek Kumar

Abstract In this article, an innovative approach to sentiment analysis (SA) is presented. The proposed system handles Romanized or abbreviated text and spelling variations when performing sentiment analysis. A training data set of 3,000 movie reviews and tweets was manually labeled by native speakers of Hindi into three classes, i.e. positive, negative, and neutral. The system uses the WEKA (Waikato Environment for Knowledge Analysis) tool to convert the string data into numerical matrices and applies three machine learning techniques, i.e. Naive Bayes (NB), J48, and support vector machine (SVM). The proposed system was tested on 100 movie reviews and tweets, and it was observed that SVM performed best among the classifiers, with an accuracy of 68% for movie reviews and 82% for tweets. The results of the proposed system are very promising and can be applied in emerging areas like SA of product reviews and social media analysis. Additionally, the proposed system can be used for other cultural/social benefits, such as predicting or countering human riots.
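The classifier stage of such a pipeline can be illustrated with a small multinomial naive Bayes, one of the three learners the study ran. This is a from-scratch sketch on toy English examples, not the actual WEKA run on the 3,000-review Hindi corpus:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens.
    Toy stand-in for the NB classifier the study ran in WEKA."""
    vocab = set()
    counts = defaultdict(Counter)       # label -> token frequency counts
    priors = Counter(labels)            # label -> document count
    for doc, lab in zip(docs, labels):
        toks = doc.split()
        vocab.update(toks)
        counts[lab].update(toks)
    return vocab, counts, priors, len(labels)

def predict_nb(model, doc):
    """Pick the label with the highest smoothed log-posterior."""
    vocab, counts, priors, n = model
    best, best_lp = None, float("-inf")
    for lab in priors:
        lp = math.log(priors[lab] / n)
        total = sum(counts[lab].values())
        for tok in doc.split():
            lp += math.log((counts[lab][tok] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = lab, lp
    return best
```

The same bag-of-words matrix would feed the J48 and SVM learners; only the classifier changes.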


2019 ◽  
Vol 2 (1) ◽  
Author(s):  
Jeffrey Micher

We present a method for building a morphological generator from the output of an existing analyzer for Inuktitut, in the absence of a two-way finite-state transducer that would normally provide this functionality. We make use of a sequence-to-sequence neural network which “translates” underlying Inuktitut morpheme sequences into surface character sequences. The neural network uses only the previous and the following morphemes as context. We report a morpheme accuracy of approximately 86%. We are able to increase this accuracy slightly by passing deep morphemes directly to the output when a morpheme is unknown. We do not see significant improvement when increasing the training data set size, and postulate possible causes for this.
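The pass-through fallback for unknown morphemes can be sketched directly. The lookup table below is a toy stand-in for the trained sequence-to-sequence network, and the morpheme strings are invented for illustration:

```python
def generate_surface(morphemes, surface_map):
    """Map underlying morphemes to surface forms, copying any unknown
    (deep) morpheme through to the output unchanged.

    `surface_map` stands in for the seq2seq model's learned
    'translation'; the fallback is the paper's trick of emitting the
    deep form directly when the model does not know the morpheme.
    """
    out = []
    for m in morphemes:
        out.append(surface_map.get(m, m))   # pass-through on unknown
    return "".join(out)
```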


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on an N×N data-similarity matrix, where N is the number of data points, so the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained from the training data set. A local curvature variation algorithm is used to sample a subset of data points as landmarks, and a manifold skeleton is then identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
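Landmark selection of this kind can be sketched as follows. The distance-variance score below is a simple stand-in for the paper's local curvature variation criterion, not a reproduction of it; the point is that the subsequent eigen-analysis runs on an m×m matrix (m landmarks) instead of N×N:

```python
import numpy as np

def select_landmarks(X, k, n_landmarks):
    """Score each point by the variance of its distances to its k nearest
    neighbours (a crude proxy for local curvature variation) and keep the
    highest-scoring points as landmarks."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dists
    scores = np.empty(n)
    for i in range(n):
        nn = np.sort(d[i])[1:k + 1]     # skip the zero distance to self
        scores[i] = nn.var()
    return np.argsort(scores)[-n_landmarks:]
```

With the landmarks in hand, the manifold skeleton is built from the landmark-landmark similarity matrix only, and the remaining points are placed incrementally.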


2019 ◽  
Vol 18 ◽  
pp. 153601211986353 ◽  
Author(s):  
Rui Zhang ◽  
Chao Cheng ◽  
Xuehua Zhao ◽  
Xuechen Li

Positron emission tomography (PET) imaging serves as one of the most competent methods for diagnosing various malignancies, such as lung tumors. However, with the growing utilization of PET scans, radiologists are considerably overburdened. Consequently, a new approach of “computer-aided diagnosis” is being contemplated to curtail the heavy workloads. In this article, we propose a multiscale Mask Region–Based Convolutional Neural Network (Mask R-CNN)–based method that uses PET imaging for the detection of lung tumors. First, we produced 3 Mask R-CNN models for lung tumor candidate detection. These 3 models were generated by fine-tuning the Mask R-CNN on training data consisting of images at 3 different scales; each training data set included 594 slices with lung tumors. The 3 models were then integrated using a weighted voting strategy to diminish false-positive outcomes. A total of 134 PET slices were employed as the test set in this experiment. The precision, recall, and F score of our proposed method were 0.90, 1, and 0.95, respectively. Experimental results exhibited strong evidence of the effectiveness of this method in detecting lung tumors, along with its capability to identify a healthy chest pattern and to reduce incorrect identification of tumors to a large extent.
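The fusion step can be sketched as a weighted vote over candidate regions. The weights and region identifiers below are hypothetical, since the paper does not publish its exact voting weights:

```python
def weighted_vote(detections, weights, threshold=0.5):
    """Fuse per-model detections with a weighted vote.

    `detections` maps each candidate region id to the set of model
    indices that flagged it; a region survives only if the weighted
    share of models that detected it reaches the threshold, which is
    how the ensemble suppresses single-model false positives.
    """
    total = sum(weights)
    kept = []
    for region, models in detections.items():
        score = sum(weights[m] for m in models) / total
        if score >= threshold:
            kept.append(region)
    return kept
```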


2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
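The Rule as Feature idea can be sketched in a few lines: run each hand-crafted rule over the raw text and append its firing as an extra input column, letting the downstream classifier learn how much to trust each rule. The rule predicate below is a toy stand-in for the pathology rules, not one of the study's actual rules:

```python
import numpy as np

def rules_as_features(X, texts, rules):
    """Augment a feature matrix with one binary column per rule.

    X      : (n_samples, n_features) base feature matrix
    texts  : the n raw text segments the rules inspect
    rules  : list of predicates text -> bool (hypothetical stand-ins
             for the hand-crafted pathology rules)
    Returns X with one extra 0/1 column per rule appended.
    """
    rule_cols = np.array([[1.0 if r(t) else 0.0 for r in rules]
                          for t in texts])
    return np.hstack([X, rule_cols])
```

The augmented matrix then feeds any of the tested learners (LR, SVM, XGB) unchanged; the Classifier Confidence variant instead falls back to the rules when the classifier's confidence is low.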


Diagnostics ◽  
2019 ◽  
Vol 9 (3) ◽  
pp. 104 ◽  
Author(s):  
Ahmed ◽  
Yigit ◽  
Isik ◽  
Alpkocak

Leukemia is a fatal cancer with two main types, acute and chronic, each of which has two subtypes, lymphoid and myeloid; hence, in total, there are four subtypes of leukemia. This study proposes a new approach for diagnosing all subtypes of leukemia from microscopic blood cell images using convolutional neural networks (CNN), which require a large training data set. Therefore, we also investigated the effects of data augmentation, synthetically increasing the number of training samples. We used two publicly available leukemia data sources: ALL-IDB and ASH Image Bank. Next, we applied seven different image transformation techniques as data augmentation. We designed a CNN architecture capable of recognizing all subtypes of leukemia and also explored other well-known machine learning algorithms such as naive Bayes, support vector machine, k-nearest neighbor, and decision tree. To evaluate our approach, we set up a series of experiments using 5-fold cross-validation. The experimental results showed that our CNN model achieves 88.25% and 81.74% accuracy in leukemia-versus-healthy and multiclass classification of all subtypes, respectively. Finally, we also showed that the CNN model performs better than the other well-known machine learning algorithms.
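A geometric augmentation step of this kind can be sketched with NumPy. The flips and rotations below are an illustrative subset standing in for the paper's seven transformation techniques:

```python
import numpy as np

def augment(img):
    """Return the original image plus simple geometric variants.

    Illustrative stand-ins for the paper's seven augmentation
    transforms; each variant is a new training sample with the same
    class label as the original blood-cell image.
    """
    return [
        img,
        np.fliplr(img),       # horizontal flip
        np.flipud(img),       # vertical flip
        np.rot90(img, 1),     # 90-degree rotation
        np.rot90(img, 2),     # 180-degree rotation
        np.rot90(img, 3),     # 270-degree rotation
    ]
```

Applied to every training image, this multiplies the effective training set size without collecting new labeled cells.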


Signals ◽  
2020 ◽  
Vol 1 (2) ◽  
pp. 188-208
Author(s):  
Mert Sevil ◽  
Mudassir Rashid ◽  
Mohammad Reza Askari ◽  
Zacharie Maloney ◽  
Iman Hajizadeh ◽  
...  

Wearable devices continuously measure multiple physiological variables to inform users of health and behavior indicators. The computed health indicators must rely on informative signals obtained by processing the raw physiological variables with powerful noise- and artifact-filtering algorithms. In this study, we aimed to elucidate the effects of signal processing techniques on the accuracy of detecting and discriminating physical activity (PA) and acute psychological stress (APS) using physiological measurements (blood volume pulse, heart rate, skin temperature, galvanic skin response, and accelerometer) collected from a wristband. Data from 207 experiments involving 24 subjects were used to develop signal processing, feature extraction, and machine learning (ML) algorithms that can detect and discriminate PA and APS when they occur individually or concurrently, classify different types of PA and APS, and estimate energy expenditure (EE). Training data were used to generate feature variables from the physiological variables and develop ML models (naïve Bayes, decision tree, k-nearest neighbor, linear discriminant, ensemble learning, and support vector machine). Results from an independent labeled testing data set demonstrate that PA was detected and classified with an accuracy of 99.3%, and APS was detected and classified with an accuracy of 92.7%, whereas simultaneous occurrences of both PA and APS were detected and classified with an accuracy of 89.9% (relative to actual class labels), and EE was estimated with a low mean absolute error of 0.02 metabolic equivalent of task (MET). The data filtering and adaptive noise cancellation techniques used to mitigate the effects of noise and artifacts on the classification results increased the detection and discrimination accuracy by 0.7% and 3.0% for PA and APS, respectively, and by 18% for EE estimation. The results demonstrate that the physiological measurements from wristband devices are susceptible to noise and artifacts, and they elucidate the effects of signal processing and feature extraction on the accuracy of detection, classification, and estimation of PA and APS.
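The noise-filtering stage can be illustrated with the simplest possible smoother, a centred moving average; the study's actual pipeline used more powerful filtering and adaptive noise cancellation:

```python
import numpy as np

def moving_average(signal, window=5):
    """Smooth a 1-D physiological signal with a centred moving average.

    A minimal stand-in for the wristband noise filtering: artifact
    spikes are attenuated before feature extraction. The normalised
    kernel keeps the signal's scale unchanged in steady regions.
    """
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")
```

Feature variables (means, variances, peak counts, etc.) would then be computed on the filtered signal rather than the raw one.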


2018 ◽  
Vol 2018 ◽  
pp. 1-9 ◽  
Author(s):  
Hyungsik Shin ◽  
Jeongyeup Paek

Automatic task classification is a core part of the personal assistant systems widely used on mobile devices such as smartphones and tablets. Even though many industry leaders provide their own personal assistant services, their proprietary internals and implementations are not well known to the public. In this work, we show through real implementation and evaluation that automatic task classification can be implemented for mobile devices by using the support vector machine algorithm and crowdsourcing. To train our task classifier, we collected a training data set via crowdsourcing on the Amazon Mechanical Turk platform. Our classifier can classify a short English sentence into one of thirty-two predefined tasks that are frequently requested while using personal mobile devices. Evaluation results show high prediction accuracy, ranging from 82% to 99%. Using this large amount of crowdsourced data, we also illustrate the relationship between training data size and the prediction accuracy of our task classifier.
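The training-size study can be sketched as a learning-curve loop. The nearest-centroid classifier below is a deliberately simple stand-in for the SVM used in the paper:

```python
import numpy as np

def learning_curve(X_train, y_train, X_test, y_test, sizes):
    """Fit a nearest-centroid classifier (stand-in for the paper's SVM)
    on growing prefixes of the training set and record test accuracy
    for each training size."""
    accs = []
    for n in sizes:
        Xs, ys = X_train[:n], y_train[:n]
        cents = {c: Xs[ys == c].mean(axis=0) for c in np.unique(ys)}
        preds = [min(cents, key=lambda c: np.linalg.norm(x - cents[c]))
                 for x in X_test]
        accs.append(float(np.mean(np.array(preds) == y_test)))
    return accs
```

Plotting accuracy against training size gives the curve the abstract refers to: accuracy typically climbs steeply with early data and flattens as the set grows.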


2019 ◽  
Vol 20 (1) ◽  
pp. 119-128
Author(s):  
NIK NUR WAHIDAH NIK HASHIM ◽  
MUHAMMAD AMIRUL AMIN AZMI ◽  
HAZLINA MD. YUSOF

This paper presents a pilot study of a novel application that converts tongue-clicking sounds to words for people who are unable to speak. Fifteen speech features related to timing patterns, amplitude modulation, zero crossings, and peak detection were extracted. Experiments were conducted on three different patterns using binary support vector machine (SVM) classification, with 10 recordings as training data and 10 recordings as development data. Peak size outperformed all other features with an 85% classification rate for pattern pair P1-P3, whereas multiple features produced a 100% classification rate for P1-P2 and P2-P3. A GUI-based system was developed to validate the trained classifier. A multiclass SVM was constructed from the best features identified in the binary SVM experiments, namely peak size and skewness of amplitude modulation, and was then tested on 15 recordings. The GUI-based multiclass SVM achieved a satisfactory performance of 67% correct classification on the test data set.
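The peak-size feature can be sketched as follows; this is a simplified reading of the feature (mean amplitude of local maxima), not necessarily the paper's exact definition:

```python
import numpy as np

def peak_size_feature(signal):
    """Return the mean amplitude of local maxima in a 1-D waveform.

    A hypothetical reconstruction of the 'peak size' feature that
    performed best in the binary SVM experiments; the paper's exact
    formulation may differ.
    """
    peaks = [signal[i] for i in range(1, len(signal) - 1)
             if signal[i] > signal[i - 1] and signal[i] > signal[i + 1]]
    return float(np.mean(peaks)) if peaks else 0.0
```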

