Predicting Breast Cancer Using Logistic Regression and Multi-Class Classifiers

Jabeen Sultana; Abdul Khader Jilani; . .

doi:10.14419/ijet.v7i4.20.22115


International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.20.22115 ◽

2018 ◽

Vol 7 (4.20) ◽

pp. 22 ◽


Author(s):

Jabeen Sultana ◽

Abdul Khader Jilani ◽

. .

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Logistic Regression ◽

Regression Method ◽

Breast Cancer Dataset ◽

Breast Cancer Data ◽

Data Set ◽

Cancer Data ◽

Logistic Regression Method ◽

Simple Logistic

Early identification and prediction of cancer type has become a necessity in cancer research, because it helps in assisting and monitoring patients. The importance of classifying cancer patients into high-risk or low-risk groups has led many research teams in biomedicine and bioinformatics to study the application of machine learning (ML) approaches. This paper proposes the logistic regression method and multi-class classifiers for predicting breast cancer, and explores the classification-based data mining approaches that can be applied to breast cancer data to build strong predictions. The study also identifies the best-performing model by evaluating the dataset on various classifiers. The breast cancer dataset, collected from the UCI Machine Learning Repository, has 569 instances with 31 attributes. The dataset is first pre-processed and then fed to various classifiers: Simple Logistic (logistic regression method), IBK, K-Star, Multi-Layer Perceptron (MLP), Random Forest, Decision Table, Decision Trees (DT), PART, Multi-Class Classifier and REP Tree. 10-fold cross-validation is applied, so that models are trained and tested on each fold. The results are evaluated on accuracy, RMSE, sensitivity, specificity, F-measure, ROC curve area, Kappa statistic and the time taken to build the model. The analysis shows that, among all the classifiers, Simple Logistic Regression yields the best model, with high and accurate results, followed by IBK (nearest neighbor classifier), K-Star (instance-based classifier) and MLP (neural network). The other methods obtained lower accuracy than the logistic regression method.
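Several of the evaluation measures named in this abstract (accuracy, sensitivity, specificity, F-measure and the Kappa statistic) can be derived from a two-class confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return common evaluation metrics for a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    expected = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only, not results from the paper.
metrics = binary_metrics(tp=50, fn=10, fp=5, tn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a 10-fold cross-validation setting, the four counts would typically be accumulated over the held-out folds before the metrics are computed.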


Synthesising artificial patient-level data for Open Science - an evaluation of five methods

10.1101/2020.10.09.20210138 ◽

2020 ◽

Author(s):

Michael Allen ◽

Andrew Salmon

Keyword(s):

Breast Cancer ◽

Logistic Regression ◽

Synthetic Data ◽

Original Data ◽

Classification Model ◽

Data Sets ◽

List Type ◽

Breast Cancer Data ◽

Data Set ◽

Cancer Data

ABSTRACT

Background: Open science is a movement seeking to make scientific research accessible to all, including publication of code and data. Publishing patient-level data may, however, compromise the confidentiality of that data if there is any significant risk that the data may later be associated with individuals. Synthetic data offers the potential to release data that may be used to evaluate methods or perform preliminary research without risk to patient confidentiality.

Methods: We have tested five synthetic data methods:
1. A technique based on Principal Component Analysis (PCA), which samples data from distributions derived from the transformed data.
2. Synthetic Minority Oversampling Technique (SMOTE), which is based on interpolation between near neighbours.
3. Generative Adversarial Network (GAN), an artificial neural network approach with competing networks: a discriminator network trained to distinguish between synthetic and real data, and a generator network trained to produce data that can fool the discriminator network.
4. CT-GAN, a refinement of GANs specifically for the production of structured tabular synthetic data.
5. Variational Auto Encoders (VAE), a method of encoding data in a reduced number of dimensions and sampling from distributions based on the encoded dimensions.

Two data sets are used to evaluate the methods:
1. The Wisconsin Breast Cancer data set, a histology data set where all features are continuous variables.
2. A stroke thrombolysis pathway data set, describing characteristics of patients for whom a decision is made whether to treat with clot-busting medication. Features are mostly categorical, binary, or integers.

Methods are evaluated in three ways:
1. The ability of synthetic data to train a logistic regression classification model.
2. A comparison of means and standard deviations between original and synthetic data.
3. A comparison of covariance between features in the original and synthetic data.

Results: Using the Wisconsin Breast Cancer data set, the original data gave 98% accuracy in a logistic regression classification model. Synthetic data sets gave between 93% and 99% accuracy. Performance (best to worst) was SMOTE > PCA > GAN > CT-GAN = VAE. All methods reproduced the original data means and standard deviations with high accuracy (R-square > 0.96 for all methods and data classes). CT-GAN and VAE suffered a significant loss of covariance between features in the synthetic data sets. Using the Stroke Pathway data set, the original data gave 82% accuracy in a logistic regression classification model. Synthetic data sets gave between 66% and 82% accuracy. Performance (best to worst) was SMOTE > PCA > CT-GAN > GAN > VAE. CT-GAN and VAE suffered loss of covariance between features in the synthetic data sets, though less pronounced than with the Wisconsin Breast Cancer data set.

Conclusions: The pilot work described here shows, as proof of concept, that synthetic data may be produced which is of sufficient quality to publish with open methodology, allowing people to better understand and test that methodology. The quality of the synthetic data also gives promise of data sets that may be used for screening of ideas, or for research projects (perhaps especially in an education setting). More work is required to further refine and test methods across a broader range of patient-level data sets.
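The abstract describes SMOTE as interpolation between near neighbours. The following is a simplified sketch of that idea only: it picks a row, picks one of its nearest neighbours, and samples a point on the segment between them. The function name `smote_like_samples`, the parameter `k`, and the toy rows are assumptions for illustration; this is not the authors' implementation, and it omits the class handling of full SMOTE.

```python
import random


def smote_like_samples(rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # Rank the other rows by squared Euclidean distance to the base row.
        others = [r for r in rows if r is not base]
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature rows, not values from either data set.
original = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [5.0, 8.0], [6.0, 9.0]]
synthetic = smote_like_samples(original, n_new=4)
for row in synthetic:
    print([round(v, 3) for v in row])
```

Because every synthetic row lies between two original rows, each feature stays within the range observed in the original data, which is one reason interpolation-based methods tend to preserve means and covariance reasonably well.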


Detection of Breast Cancer Using Machine Learning Support Vector Machine Algorithm

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2019.7747 ◽

2019 ◽

Vol 16 (2) ◽

pp. 441-444

Author(s):

D. V. Soundari ◽

R. Padmapriya ◽

C. Thirumariselvi ◽

N. Nanthini ◽

K. Priyadharsini

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Support Vector Machine ◽

Support Vector ◽

Learning Support ◽

Support Vector Machine Algorithm ◽

Breast Cancer Data ◽

Data Set ◽

Cancer Data ◽

Hormone Imbalance

Breast cancer, which is linked to hormone imbalance, is a major cause of suffering among women and has caused a large number of deaths in recent years. Early detection of breast cancer is important for saving lives. Image processing plays an important role in classifying and detecting it. This paper therefore proposes machine-learning-based cancer classification using a support vector machine with the Wisconsin breast cancer data set.


Hyperparameter Tuning and Pipeline Optimization via Grid Search Method and Tree-Based AutoML in Breast Cancer Prediction

Journal of Personalized Medicine ◽

10.3390/jpm11100978 ◽

2021 ◽

Vol 11 (10) ◽

pp. 978

Author(s):

Siti Fairuz Mat Radzi ◽

Muhammad Khalis Abdul Karim ◽

M Iqbal Saripan ◽

Mohd Amiruddin Abdul Rahman ◽

Iza Nurzawani Che Isa ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Model Selection ◽

Principal Component ◽

Receiver Operating Curve ◽

Support Vector ◽

Grid Search ◽

Breast Cancer Data ◽

Data Set ◽

Cancer Data

Automated machine learning (AutoML) has been recognized as a powerful tool for building systems that automate the design and optimize the model selection of machine learning (ML) pipelines. In this study, we present a tree-based pipeline optimization tool (TPOT) as a method for determining ML models with significant performance and less complex breast cancer diagnostic pipelines. Some features of pre-processors and ML models are defined as expression trees and optimized as pipelines by genetic programming (GP), a stochastic search system. Features of radiomics have been presented as a guide for the ML pipeline selection from the breast cancer data set based on TPOT. Breast cancer data were used in a comparative analysis of the TPOT-generated ML pipelines with the selected ML classifiers, optimized by a grid search approach. The principal component analysis (PCA) random forest (RF) classification was proven to be the most reliable pipeline with the lowest complexity. The TPOT model selection technique exceeded the performance of grid search (GS) optimization. The RF classifier showed an outstanding outcome amongst the models in combination with only two pre-processors, with a precision of 0.83. The grid search optimized for support vector machine (SVM) classifiers generated a difference of 12% in comparison, while the other two classifiers, naïve Bayes (NB) and artificial neural network multilayer perceptron (ANN-MLP), generated a difference of almost 39%. The method's performance was based on sensitivity, specificity, accuracy, precision, and receiver operating curve (ROC) analysis.
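The grid search approach mentioned in this abstract scores every combination of candidate hyperparameter values and keeps the best one. The following is a minimal sketch of that exhaustive search. The helper `grid_search`, the parameter names `c` and `gamma`, and the scoring function `toy_score` are hypothetical; in practice the score would be a cross-validated metric of a real classifier rather than a closed-form expression.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for the cross-validated score of a classifier.
def toy_score(c, gamma):
    return -((c - 10) ** 2) - (gamma - 0.1) ** 2


grid = {"c": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (here 4 x 3 = 12 evaluations), which is one motivation for stochastic search tools such as the TPOT approach the abstract compares against.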


Breast Cancer Detection with Revamped Dataset Using Machine Learning Techniques

Journal of Medical Imaging and Health Informatics ◽

10.1166/jmihi.2021.3892 ◽

2021 ◽

Vol 11 (12) ◽

pp. 2996-3009

Author(s):

Sundarambal Balaraman ◽

Ramesh Ramamoorthy ◽

Raja Krishnamoorthi

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Logistic Regression ◽

Learning Algorithm ◽

Machine Learning Techniques ◽

Support Vector ◽

Data Set ◽

Cancer Data ◽

Learning Techniques ◽

Incidence And Mortality

Machine learning is a current topic of interest in research and industry, with novel strategies being implemented all the time. The main purpose of this research is to determine the efficiency of machine learning techniques in the detection of breast cancer. The incidence and mortality of breast cancer in women are increasing day by day. Worldwide, researchers have worked hard to help clinicians by providing the best model for detecting and diagnosing breast cancer. In this work, the Wisconsin breast cancer data set from the UCI machine learning repository is used to build models, and their performance is analyzed and compared with existing work on the same data set. The dataset is analyzed, and a revamped dataset is constructed by eliminating redundant features and appending new features essential for prediction. Eight machine learning algorithms, namely logistic regression, K nearest neighbors (KNN), support vector machine (SVM), decision trees, random forest, XGBoost, AdaBoost and artificial neural network (ANN), are applied to the revamped data set to build prediction models. Accuracy is used as the standard measure for the analysis. In the experiments, these classifiers were shown to detect breast cancer with >97% accuracy. Logistic regression, XGBoost and AdaBoost stand on top with 99.28 percent accuracy. The experiments also show that balancing the data set and removing outliers have a significant impact on the models' prediction performance.


Emperical Evaluation of Machine Learning algorithms for Breast Cancer Data Classification

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i10.346351 ◽

2018 ◽

Vol 6 (10) ◽

pp. 346-351

Author(s):

S. Kumaravel ◽

S. Ophilia Domanica Vithya

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Learning Algorithms ◽

Data Classification ◽

Machine Learning Algorithms ◽

Breast Cancer Data ◽

Cancer Data


Design of novel multi filter union feature selection framework for breast cancer dataset

Concurrent Engineering ◽

10.1177/1063293x211016046 ◽

2021 ◽

pp. 1063293X2110160

Author(s):

Dinesh Morkonda Gunasekaran ◽

Prabha Dhandayudam

Keyword(s):

Breast Cancer ◽

Feature Selection ◽

Care Center ◽

Feature Selection Method ◽

Selection Method ◽

Cancer Center ◽

Breast Cancer Dataset ◽

Data Set ◽

Health Care Center ◽

Cancer Data

Breast cancer is now commonly diagnosed in women. Feature selection is an important step in constructing a classification framework. We propose a multi filter union (MFU) feature selection method for breast cancer data sets. The feature selection process uses a union model based on the random forest algorithm and the logistic regression (LG) algorithm to select the important features in the dataset. The performance of the data analysis is evaluated using the optimal feature subset of the selected dataset. The experiments are run first on the data set of the Wisconsin diagnostic breast cancer center and then on a real data set from a women's health care center. The results show that the proposed approach achieves high performance and efficiency compared with existing feature selection algorithms.
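The abstract does not spell out how the union of the random forest and logistic regression selections is formed. One plausible reading, sketched below purely as an assumption, is to rank features separately under each scorer, keep the top k from each ranking, and take the union. The helper `union_of_top_k`, the feature names and all scores are hypothetical and are not the authors' MFU procedure.

```python
def union_of_top_k(score_maps, k):
    """Return the sorted union of the k best-scoring features per scorer."""
    selected = set()
    for scores in score_maps:
        # Rank feature names from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)


# Hypothetical scores: random forest importances and absolute
# logistic regression coefficients for four made-up feature names.
rf_importance = {"radius": 0.40, "texture": 0.10, "area": 0.35, "smoothness": 0.15}
lr_abs_coef = {"radius": 0.9, "texture": 1.2, "area": 0.3, "smoothness": 0.1}

selected = union_of_top_k([rf_importance, lr_abs_coef], k=2)
print(selected)
```

A union keeps any feature that at least one scorer considers important, so the selected subset is never smaller than k and can be as large as k times the number of scorers.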


A Comparison Study of Goodness of Fit Tests of Logistic Regression in R: Simulation and Application to Breast Cancer Data

Academic Journal of Applied Mathematical Sciences ◽

10.32861/ajams.71.50.59 ◽

2020 ◽

pp. 50-59

Author(s):

El-Housainy A. Rady ◽

Mohamed R. Abonazel ◽

Mariam H. Metawe’e

Keyword(s):

Breast Cancer ◽

Logistic Regression ◽

Sample Size ◽

Null Hypothesis ◽

Goodness Of Fit ◽

Quadratic Term ◽

Breast Cancer Dataset ◽

Cancer Data ◽

Interaction Term ◽

Test Package

Goodness of fit (GOF) tests of logistic regression assess how well the model suits the data. The null hypothesis of all GOF tests is that the model fits. R, a free software package, offers many GOF tests in different packages. A Monte Carlo simulation was conducted to study two situations: first, the ability of each test, under its default settings, to accept the null hypothesis when the model truly fits; second, the power of these tests when the assumption of a sufficient linear combination of the explanatory variables is violated (by omitting a linear covariate term, a quadratic term, or an interaction term). The study also checked whether the same test in different R packages gave the same results. Because sample size is expected to affect simulation results, the pattern of change in GOF test results under different sample sizes and different model settings was estimated. All tests accepted the null hypothesis (in more than 95% of simulation trials) when the model truly fitted, except the modified Hosmer-Lemeshow test in the "LogisticDx" package under all model settings, and Osius and Rojek's (OsRo) test when the true model had an interaction term between binary and categorical covariates. In addition, the le Cessie-van Houwelingen-Copas-Hosmer unweighted sum of squares (CHCH) test gave unexpectedly different results in different packages. Concerning the power study, all tests had very low power when a covariate was missing. Generally, Stukel's test (package "LogisticDx") and the CHCH test (package "RMS") reached a power greater than 80% in detecting a missing quadratic term at lower sample sizes, while the OsRo test (package "LogisticDx") was better at detecting a missing interaction term. Besides the simulation study, we evaluated the performance of the GOF tests using the breast cancer dataset.
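To make the idea of a GOF test concrete, the following is a minimal sketch of the classical Hosmer-Lemeshow statistic: observations are sorted by predicted probability, split into groups, and observed event counts are compared with expected counts in each group. This is the textbook form, not the modified variant or any R package implementation discussed in the abstract; the function name and the toy inputs are assumptions.

```python
def hosmer_lemeshow_statistic(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic for binary outcomes."""
    # Sort observations by predicted probability (stable for ties).
    pairs = sorted(zip(y_prob, y_true), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        observed = sum(y for _, y in chunk)   # events seen in the group
        expected = sum(p for p, _ in chunk)   # events the model predicts
        mean_prob = expected / len(chunk)
        variance = len(chunk) * mean_prob * (1 - mean_prob)
        if variance > 0:  # skip degenerate groups to avoid dividing by zero
            statistic += (observed - expected) ** 2 / variance
    return statistic


# Toy inputs: the same predicted probability for all 20 observations.
probs = [0.5] * 20
well_calibrated = hosmer_lemeshow_statistic([0, 1] * 10, probs, groups=2)
poorly_calibrated = hosmer_lemeshow_statistic([1] * 10 + [0] * 10, probs, groups=2)
print(well_calibrated, poorly_calibrated)
```

The statistic is conventionally compared against a chi-square distribution with (groups - 2) degrees of freedom; a large value leads to rejecting the null hypothesis that the model fits.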


Prediksi Not Operational Transaction Menggunakan Logistic Regression pada Bank XYZ di Kota Kupang

AITI ◽

10.24246/aiti.v17i1.42-55 ◽

2020 ◽

Vol 17 (1) ◽

pp. 42-55

Author(s):

Radius Tanone ◽

Arnold B Emmanuel

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Regression Method ◽

Learning Approach ◽

Know How ◽

Independent Variables ◽

Machine Learning Approach ◽

Logistic Regression Method ◽

Python Programming

Bank XYZ is a bank in Kupang City, East Nusa Tenggara Province, that has several ATMs placed at a number of merchant locations. These ATMs are used by both customers and non-customers to conduct transactions. Because of where they are placed, some ATMs are not used optimally for transactions, which wastes machine resources and leads to a condition called Not Operational Transaction (NOP). Given data consisting of several numeric independent variables, it is necessary to know how to classify the dependent variable, NOP. A machine learning approach with the logistic regression method is the solution for this classification. The research steps comprise collecting data, analyzing it with machine learning in Python, and writing reports. The result obtained with this machine learning approach is a prediction value of 0.507 for the classification. This means that in the future Bank XYZ can classify NOP conditions based on the behavior of customers or non-customers making transactions at Bank XYZ ATMs.
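The classification step described here, numeric independent variables mapped to a binary dependent variable, can be sketched with a small logistic regression fitted by batch gradient descent. This is an illustration under stated assumptions: the functions `fit_logistic` and `predict_proba`, the learning rate, the epoch count and the toy data are all hypothetical, and the paper's own Python code and ATM data are not reproduced.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row under the current parameters.
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(weights, bias, x):
    """Predicted probability that the dependent variable equals 1."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy data with one numeric independent variable (hypothetical values).
rows = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
weights, bias = fit_logistic(rows, labels)
for x in rows:
    print(x, round(predict_proba(weights, bias, x), 3))
```

A row is assigned to class 1 when its predicted probability exceeds a chosen threshold, commonly 0.5.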


Using new artificial bee colony as probabilistic neural network for breast cancer data classification

Frontiers in Engineering and Built Environment ◽

10.1108/febe-03-2021-0015 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Habib Shah

Keyword(s):

Breast Cancer ◽

Neural Network ◽

Artificial Bee Colony ◽

Probabilistic Neural Network ◽

Breast Cancers ◽

Breast Cancer Data ◽

Data Set ◽

Content Type ◽

Cancer Data ◽

Bee Colony

Purpose: Breast cancer is an important medical disorder, which is not a single disease but a cluster of more than 200 different serious medical complications.

Design/methodology/approach: The new artificial bee colony (ABC) implementation has been applied to a probabilistic neural network (PNN) for training and testing, in order to classify the breast cancer data set.

Findings: The new ABC algorithm, together with the PNN, has been successfully applied to the breast cancer data set for prediction with a minimal number of iterations.

Originality/value: The new implementation of ABC with PNN can be easily applied to time series problems for accurate prediction or classification.


A Markov chain model of a longitudinal breast cancer data set.

Journal of Clinical Oncology ◽

10.1200/jco.2014.32.15_suppl.11040 ◽

2014 ◽

Vol 32 (15_suppl) ◽

pp. 11040-11040

Author(s):

Paul K. Newton ◽

Jorge J. Nieva ◽

Peter Kuhn ◽

Larry Norton ◽

Elizabeth Anne Comen ◽

...

Keyword(s):

Breast Cancer ◽

Markov Chain ◽

Markov Chain Model ◽

Chain Model ◽

Breast Cancer Data ◽

Data Set ◽

Cancer Data
