Application of the C4.5 Algorithm to Predict the Types of Disease in Pigs Based on Android

2021 ◽  
Vol 10 (1) ◽  
pp. 105
Author(s):  
I Gusti Ayu Purnami Indryaswari ◽  
Ida Bagus Made Mahendra

Many Indonesian people, especially in Bali, raise pigs as livestock. Pigs are susceptible to various diseases, and many cases of pig deaths from disease have caused losses for breeders. The authors therefore built an Android-based application that can predict the type of disease in pigs by applying the C4.5 algorithm. C4.5 is a classification algorithm that derives a set of rules from data, which can then be used for prediction. This study used 50 training records covering 8 types of pig disease and 31 disease symptoms, which were entered into the system and processed so that the resulting Android application can predict the type of disease in pigs. Testing with 15 test records produced an accuracy of 86.7%. The application features, built with the Kotlin programming language and an SQLite database, worked as expected.
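The abstract does not show how C4.5 chooses its splitting attributes; the core criterion is the gain ratio (information gain normalized by split information). A minimal sketch of that criterion, using a hypothetical two-symptom toy data set rather than the paper's 31-symptom data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    # C4.5 split criterion: information gain divided by split information
    n = len(labels)
    splits = {}
    for row, y in zip(rows, labels):
        splits.setdefault(row[attr], []).append(y)
    gain = entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in splits.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info else 0.0

# Hypothetical toy records: two symptoms, two diseases (not the paper's data)
rows = [{"fever": 1, "cough": 0}, {"fever": 1, "cough": 1},
        {"fever": 0, "cough": 1}, {"fever": 0, "cough": 0}]
labels = ["A", "A", "B", "B"]
best = max(("fever", "cough"), key=lambda a: gain_ratio(rows, labels, a))
# "fever" separates the two diseases perfectly, so it wins the split
```

C4.5 applies this criterion recursively to grow the decision tree; the sketch shows only a single split decision.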

2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54%, by 14.13% (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35% for the AIBL data set. An improvement of 24.86% (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set.
Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoE ε4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
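The data Shapley valuation used in the article averages a subject's marginal contribution over many training subsets; as a much simpler stand-in, the sketch below scores each training point by its leave-one-out change in validation accuracy of a 1-nearest-neighbor classifier (not the LR/RF valuation models used in the article) and drops points with negative value. All data are hypothetical:

```python
def nn_predict(X, y, x):
    # 1-nearest-neighbor by squared Euclidean distance
    i = min(range(len(X)), key=lambda j: sum((a - b) ** 2 for a, b in zip(X[j], x)))
    return y[i]

def accuracy(X, y, Xv, yv):
    return sum(nn_predict(X, y, x) == t for x, t in zip(Xv, yv)) / len(yv)

# Hypothetical toy data: the last training point is mislabeled ("noisy")
X = [(0, 1), (1, 0), (5, 5), (5, 6), (0, 0)]
y = ["a", "a", "b", "b", "b"]          # (0, 0) actually sits in the "a" cluster
Xv = [(0.2, 0.2), (5, 5.5)]
yv = ["a", "b"]

full = accuracy(X, y, Xv, yv)
# Leave-one-out value: how much validation accuracy drops without point i
values = [full - accuracy(X[:i] + X[i + 1:], y[:i] + y[i + 1:], Xv, yv)
          for i in range(len(X))]
keep = [i for i, v in enumerate(values) if v >= 0]  # drop harmful points
```

The mislabeled point receives a negative value and is excluded, mirroring (in miniature) how the article excludes low-value subjects before retraining XGBoost.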


2021 ◽  
Author(s):  
Xingang Jia ◽  
Qiuhong Han ◽  
Zuhong Lu

Abstract Background: Phages are the most abundant biological entities, but commonly used clustering techniques struggle to separate them from other virus families and to classify the different phage families. Results: This work uses GI-clusters to separate phages from other virus families and to classify the different phage families. GI-clusters are constructed from GI-features; GI-features are built from F-features together with the training data and the MG-Euclidean and Icc-cluster algorithms; F-features are the frequencies of multiple nucleotides generated from virus genomes; the MG-Euclidean algorithm places nearest neighbors in the same mini-groups, and the Icc-cluster algorithm places distant samples in different mini-clusters. Viruses whose GI-feature maxima occur at the same locations are assigned to the same GI-cluster; the families of viruses in the test data are identified by their GI-clusters, and the family of each GI-cluster is defined by the training-data viruses it contains. Conclusions: Analysis of 4 data sets constructed from viruses of different families demonstrates that GI-clusters are able to separate phages from other virus families, correctly classify the different phage families, and also correctly predict the families of unknown phages.
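The F-features described above are frequencies of multiple nucleotides (k-mers) computed from genome strings. A minimal sketch of that feature-extraction step, on a hypothetical 8-base genome:

```python
from collections import Counter

def kmer_freqs(genome, k=2):
    # Relative frequencies of overlapping k-mers ("multiple nucleotides")
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

f = kmer_freqs("ATGCATGC", 2)   # 7 overlapping 2-mers: AT, TG, GC, CA, AT, TG, GC
```

In the paper these frequency vectors are the raw input that the MG-Euclidean and Icc-cluster algorithms then group; the sketch covers only the feature construction.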


Author(s):  
Peter Gemmar

Abstract The pandemic spread of coronavirus has increased the burden on healthcare services worldwide. Experience shows that the required medical treatment can reach limits at local clinics, and fast and secure clinical assessment of disease severity becomes vital. In [1] a model is presented for predicting the mortality of COVID-19 patients from their biomarkers. Three biomarkers were selected by ranking with a supervised multi-tree XGBoost classifier. The prediction model is built as a binary decision tree of depth three and achieves AUC scores of up to 97.84±0.37 and 95.06±2.21 for the training and external test data sets, respectively. In human assessment and decision making, influencing parameters usually are not considered as sharp numbers but rather as fuzzy terms [2], and inferencing primarily yields fuzzy terms or continuous grades rather than binary decisions. Therefore, I examined a Sugeno-type fuzzy classifier [3] for disease assessment and decision support. In addition, I used an artificial neural network (SOM, [4]) for selecting the biomarkers. Modelling and validation were done with the identical database provided by [1]. With the complete training and test data sets, the fuzzy prediction model achieves improved AUC scores of up to 98.59 and 95.12. The improvements with the fuzzy classifier become clear as physicians can interpret output grades as belonging to the positive or negative class more or less strongly. An extension of the fuzzy model, which takes into account the trend in key features over time, provides excellent results with the training data; these, however, could not be finally verified due to the lack of suitable test data. The generation and training of the fuzzy models was fully automatic and required no additional adjustment, with the help of ANFIS from Matlab©.
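A zero-order Sugeno-type fuzzy classifier combines rule firing strengths (here, products of Gaussian memberships) into a continuous output grade by weighted averaging. The sketch below illustrates the mechanism only; the rule centers, widths, and consequents are hypothetical, not the ANFIS-trained parameters from the study:

```python
from math import exp

def gauss(x, c, s):
    # Gaussian membership of x in a fuzzy set centered at c with width s
    return exp(-((x - c) ** 2) / (2 * s ** 2))

def sugeno_grade(biomarkers, rules):
    # Zero-order Sugeno inference: weighted average of constant consequents,
    # with rule firing strength = product of per-input memberships.
    num = den = 0.0
    for centers, sigmas, z in rules:
        w = 1.0
        for x, c, s in zip(biomarkers, centers, sigmas):
            w *= gauss(x, c, s)
        num += w * z
        den += w
    return num / den if den else 0.0

# Hypothetical rules: low biomarker values -> grade 0, high values -> grade 1
rules = [((0, 0, 0), (1, 1, 1), 0.0),
         ((5, 5, 5), (1, 1, 1), 1.0)]
low = sugeno_grade((0.1, 0.1, 0.1), rules)    # near 0: low risk
high = sugeno_grade((4.9, 4.9, 4.9), rules)   # near 1: high risk
```

The continuous grade between 0 and 1 is what lets a physician read off how strongly a patient belongs to the positive or negative class, rather than receiving a binary decision.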


Author(s):  
Lai Lai Yee ◽  
Myo Ma Ma

Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories; it can be viewed as a result of the natural evolution of information technology. The key point is that data mining applies these and other AI and statistical techniques to common business problems in a way that makes them available to the skilled knowledge worker as well as the trained statistics professional. This paper presents a classification system for toxicology using C4.5. First, the input data are randomly partitioned into two independent sets, a training set and a test set: two thirds of the data are allocated to the training set and the remaining one third to the test set. The training data are then used to derive the C4.5 decision tree, and the test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data.
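The two-thirds/one-third random partition described above can be sketched as:

```python
import random

def split_two_thirds(data, seed=0):
    # Random partition: two thirds training data, one third test data
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)
    cut = (2 * len(items)) // 3
    return items[:cut], items[cut:]

train, test = split_two_thirds(range(30))   # 20 training, 10 test records
```

The fixed seed makes the split reproducible; in practice one would shuffle with a fresh seed or repeat the split several times to average the accuracy estimate.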


2021 ◽  
pp. 002203452110357
Author(s):  
T. Chen ◽  
P.D. Marsh ◽  
N.N. Al-Hebshi

An intuitive, clinically relevant index of microbial dysbiosis as a summary statistic of subgingival microbiome profiles is needed. Here, we describe a subgingival microbial dysbiosis index (SMDI) based on machine learning analysis of published periodontitis/health 16S microbiome data. The raw sequencing data, split into training and test sets, were quality filtered, taxonomically assigned to the species level, and centered log-ratio transformed. The training data set was subjected to random forest analysis to identify discriminating species (DS) between periodontitis and health. DS lists, compiled by various "Gini" importance score cutoffs, were used to compute the SMDI for samples in the training and test data sets as the mean centered log-ratio abundance of periodontitis-associated species subtracted by that of health-associated ones. Diagnostic accuracy was assessed with receiver operating characteristic analysis. An SMDI based on 49 DS provided the highest accuracy, with areas under the curve of 0.96 and 0.92 in the training and test data sets, respectively, and ranged from −6 (most normobiotic) to 5 (most dysbiotic), with a value around zero discriminating most of the periodontitis and healthy samples. The top periodontitis-associated DS were Treponema denticola, Mogibacterium timidum, Fretibacterium spp., and Tannerella forsythia, while Actinomyces naeslundii and Streptococcus sanguinis were the top health-associated DS. The index was highly reproducible by hypervariable region. Applying the index to additional test data sets in which nitrate had been used to modulate the microbiome demonstrated that nitrate has dysbiosis-lowering properties in vitro and in vivo. Finally, 3 genera (Treponema, Fretibacterium, and Actinomyces) were identified that could be used for calculation of a simplified SMDI with comparable accuracy.
In conclusion, we have developed a nonbiased, reproducible, and easy-to-interpret index that can be used to identify patients/sites at risk of periodontitis, to assess the microbial response to treatment, and, importantly, as a quantitative tool in microbiome modulation studies.
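A minimal sketch of the index computation as described: a centered log-ratio (CLR) transform of a sample's abundances, then the mean CLR abundance of periodontitis-associated species minus that of health-associated ones. The taxa names and counts below are hypothetical toy values, and a small pseudocount stands in for the paper's quality-filtering pipeline:

```python
from math import log
from statistics import mean

def clr(counts, pseudo=1e-6):
    # Centered log-ratio transform: log(x) minus the log geometric mean
    vals = [c + pseudo for c in counts]
    gm = mean(log(v) for v in vals)
    return [log(v) - gm for v in vals]

def smdi(sample_counts, taxa, perio_taxa, health_taxa):
    # Mean CLR abundance of periodontitis-associated species
    # minus that of health-associated species
    z = dict(zip(taxa, clr(sample_counts)))
    p = mean(z[t] for t in taxa if t in perio_taxa)
    h = mean(z[t] for t in taxa if t in health_taxa)
    return p - h

taxa = ["T_denticola", "S_sanguinis", "other"]          # toy taxa
dysbiotic = smdi([90, 5, 5], taxa, {"T_denticola"}, {"S_sanguinis"})
healthy = smdi([5, 90, 5], taxa, {"T_denticola"}, {"S_sanguinis"})
```

As in the article, samples dominated by periodontitis-associated species score above zero and health-dominated samples score below it.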


2006 ◽  
Vol 26 ◽  
pp. 101-126 ◽  
Author(s):  
H. Daume III ◽  
D. Marcu

The most basic assumption used in statistical learning theory is that training data and test data are drawn from the same underlying distribution. Unfortunately, in many applications, the "in-domain" test data is drawn from a distribution that is related, but not identical, to the "out-of-domain" distribution of the training data. We consider the common case in which labeled out-of-domain data is plentiful, but labeled in-domain data is scarce. We introduce a statistical formulation of this problem in terms of a simple mixture model and present an instantiation of this framework to maximum entropy classifiers and their linear chain counterparts. We present efficient inference algorithms for this special case based on the technique of conditional expectation maximization. Our experimental results show that our approach leads to improved performance on three real world tasks on four different data sets from the natural language processing domain.
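As a greatly simplified stand-in for the paper's formulation (which learns the mixture over maximum entropy models by conditional expectation maximization), the sketch below interpolates in-domain and out-of-domain conditional label distributions with a fixed weight λ; the feature/label data are hypothetical:

```python
from collections import Counter, defaultdict

def cond_dist(pairs):
    # Estimate p(label | feature) by counting (feature, label) pairs
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {x: {y: c / sum(cnt.values()) for y, c in cnt.items()}
            for x, cnt in counts.items()}

def mixture_predict(x, p_in, p_out, lam=0.7):
    # Fixed-weight mixture of in-domain and out-of-domain distributions
    labels = set(p_in.get(x, {})) | set(p_out.get(x, {}))
    def score(y):
        return lam * p_in.get(x, {}).get(y, 0.0) + (1 - lam) * p_out.get(x, {}).get(y, 0.0)
    return max(labels, key=score)

# Plentiful out-of-domain data vs. scarce in-domain data for the word "bank"
out_pairs = [("bank", "finance")] * 9 + [("bank", "river")]
in_pairs = [("bank", "river"), ("bank", "river")]
p_out, p_in = cond_dist(out_pairs), cond_dist(in_pairs)
pred = mixture_predict("bank", p_in, p_out, lam=0.7)
```

With enough weight on the scarce in-domain evidence, the prediction follows the in-domain label even though the out-of-domain data overwhelmingly disagrees; the paper's contribution is learning that trade-off rather than fixing λ by hand.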


JURTEKSI ◽  
2021 ◽  
Vol 8 (1) ◽  
pp. 59-68
Author(s):  
Christnatalis Christnatalis ◽  
Roni Rayandi Saragih ◽  
Bobby Christianto Tambunan

Abstract: This study uses the C4.5 classification algorithm to determine creditworthiness; classification aims to divide the assigned objects into a number of categories called classes. The authors use data mining with the C4.5 algorithm as the selection method. The criteria used are the loan installments, the prospective customer's income, the loan term, and the status of the prospective customer. The study produced a decision-tree classification model using the C4.5 algorithm that falls into the Excellent Classification category, with an accuracy of 98.33% and a classification error of 1.67%, using 70% training data and 30% test data. The results obtained show that the C4.5 algorithm can be used to determine the feasibility of granting credit to customers of Koperasi Jaya Bersama (KORJABE). Keywords: Analysis, Credit Eligibility, C4.5 Algorithm, Data Mining, Method


1991 ◽  
Vol 9 (5) ◽  
pp. 871-876 ◽  
Author(s):  
M J Ratain ◽  
J Robert ◽  
W J van der Vijgh

Although doxorubicin is one of the most commonly used antineoplastics, no studies to date have clearly related the area under the concentration-time curve (AUC) to toxicity or response. The limited sampling model has recently been shown to be a feasible method for estimating the AUC to facilitate pharmacodynamic studies. Data from two previous studies of doxorubicin pharmacokinetics were used, including 26 patients with sarcoma and five patients with breast cancer or unknown primary. The former were divided into a training data set of 15 patients and a test data set of 11 patients, and the latter patients formed a second test data set. The model was developed by stepwise multiple regression on the training data set: AUC (nanogram hour per milliliter) = 17.39 C2 + 163 C48 − 111.0 [dose/(50 mg/m2)], where C2 and C48 are the concentrations at 2 and 48 hours after the bolus dose. The model was subsequently validated on both test data sets: first test data set, mean predictive error (MPE) 4.7%, root mean square error (RMSE) 12.4%; second test data set, MPE 4.5%, RMSE 9.2%. An additional model was also generated using a simulated time point to estimate the total AUC for a daily × 3-day schedule: AUC (nanogram hour per milliliter) = 44.79 C2 + 175.65 C48 + 47.25 [dose/(25 mg/m2/d)], where C48 is obtained just prior to the third dose. We conclude that the AUC of doxorubicin after bolus administration can be adequately estimated from two timed plasma concentrations.
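The two published regression equations can be wrapped directly as functions; the units follow the abstract (concentrations in ng/mL, dose in mg/m2, AUC in ng·h/mL), and the example concentrations are illustrative, not patient data:

```python
def auc_bolus(c2, c48, dose_mg_m2):
    # Single-bolus limited sampling model; C2 and C48 are plasma
    # concentrations (ng/mL) at 2 h and 48 h after the dose
    return 17.39 * c2 + 163 * c48 - 111.0 * (dose_mg_m2 / 50)

def auc_daily3(c2, c48, dose_mg_m2_d):
    # Daily x 3-day schedule; C48 is drawn just prior to the third dose
    return 44.79 * c2 + 175.65 * c48 + 47.25 * (dose_mg_m2_d / 25)

est = auc_bolus(100, 10, 50)   # 1739 + 1630 - 111 = 3258 ng*h/mL
```

Two timed samples thus suffice to estimate the full AUC, which is the point of the limited sampling approach.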


2020 ◽  
Vol 10 (1) ◽  
pp. 55-60
Author(s):  
Owais Mujtaba Khanday ◽  
Samad Dadvandipour

Deep Neural Networks (DNN) have in the past few years revolutionized computer vision by providing the best results on a large number of problems such as image classification, pattern recognition, and speech recognition. One of the essential models in deep learning used for image classification is the convolutional neural network. These networks can integrate a number of different features, or so-called filters, in a multi-layer fashion in convolutional layers. The models use convolutional and pooling layers for feature abstraction and have neurons arranged in three dimensions: height, width, and depth. Filters of 3 different sizes were used: 3×3, 5×5, and 7×7. The accuracy on the training data decreased from 100% to 97.8% as the filter size increased, and the accuracy on the test data set also decreased: 98.7% for 3×3, 98.5% for 5×5, and 97.8% for 7×7. The loss on the training data and test data per 10 epochs increased drastically from 3.4% to 27.6% and from 12.5% to 23.02%, respectively. Thus it is clear that filters with smaller dimensions give lower loss than those with larger dimensions. However, using the smaller filter size comes with the cost of computational complexity, which is very crucial in the case of larger data sets.
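The effect of filter size can be made concrete by counting the parameters and output size of a single convolutional layer; the sketch below assumes a 28×28 single-channel input and 32 filters (illustrative values, not the paper's architecture):

```python
def conv2d_params(k, in_ch, out_ch, bias=True):
    # Weights per filter: k*k*in_ch (+1 bias), times out_ch filters
    return (k * k * in_ch + (1 if bias else 0)) * out_ch

def conv2d_out(size, k, stride=1, pad=0):
    # Spatial output size of a square convolution
    return (size + 2 * pad - k) // stride + 1

for k in (3, 5, 7):
    print(f"{k}x{k}: {conv2d_params(k, 1, 32)} params, "
          f"{conv2d_out(28, k)}x{conv2d_out(28, k)} output")
```

Going from 3×3 to 7×7 multiplies the weight count per filter roughly fivefold (9 vs. 49 weights per input channel), which is the computational trade-off the abstract refers to.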


2021 ◽  
Author(s):  
Hye-Won Hwang ◽  
Jun-Ho Moon ◽  
Min-Gyu Kim ◽  
Richard E. Donatelli ◽  
Shin-Jae Lee

ABSTRACT Objectives To compare an automated cephalometric analysis based on the latest deep learning method for automatically identifying cephalometric landmarks (AI) with previously published AIs, according to the test style of the worldwide AI challenges at the International Symposium on Biomedical Imaging conferences held by the Institute of Electrical and Electronics Engineers (IEEE ISBI). Materials and Methods This latest AI was developed using a total of 1983 cephalograms as training data. In the training procedures, a modification of a contemporary deep learning method, the YOLO version 3 algorithm, was applied. Test data consisted of 200 cephalograms. To follow the same test style as the AI challenges at IEEE ISBI, a human examiner manually identified the 19 IEEE ISBI-designated cephalometric landmarks in both the training and test data sets, and these were used as references for comparison. Then, the latest AI and another human examiner independently detected the same landmarks in the test data set. The test results were compared by the measures used at IEEE ISBI: the success detection rate (SDR) and the success classification rate (SCR). Results The SDR of the latest AI in the 2-mm range was 75.5% and the SCR was 81.5%. These were greater than those of any previous AIs. Compared to the human examiners, the AI showed a superior success classification rate in some cephalometric analysis measures. Conclusions This latest AI seems to have superior performance compared to previous AI methods. It also seems to demonstrate cephalometric analysis comparable to human examiners.
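The success detection rate reported above is the fraction of predicted landmarks falling within a tolerance (here 2 mm) of the reference positions. A minimal sketch with hypothetical 2-D coordinates:

```python
def success_detection_rate(pred, ref, tol_mm=2.0):
    # Fraction of predicted landmarks within tol_mm of the reference landmark
    hits = sum(((px - rx) ** 2 + (py - ry) ** 2) ** 0.5 <= tol_mm
               for (px, py), (rx, ry) in zip(pred, ref))
    return hits / len(ref)

# Two toy landmarks: the first is 1 mm off (hit), the second 3 mm off (miss)
sdr = success_detection_rate([(0, 0), (3, 0)], [(0, 1), (0, 0)])
```

In the IEEE ISBI challenges this rate is reported at several tolerance ranges (2, 2.5, 3, and 4 mm); the abstract quotes the 2-mm figure.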

