Spam Mail Filtering Using Data Mining Approach

Author(s):  
Ajay Kumar Gupta

This chapter presents an overview of spam email as a serious problem on the internet and builds a spam filter that addresses the weaknesses of earlier filters, providing better identification accuracy with less complexity. Because the J48 decision tree is a widely used classification technique with a simple structure, high classification accuracy, and low time complexity, it is used here as the spam mail classifier. With lower complexity, however, it becomes difficult to achieve high accuracy on a large number of records. To overcome this problem, particle swarm optimization is used to optimize the spam base dataset, thereby optimizing the decision tree model and reducing the time complexity. Once the records have been standardized, the decision tree is applied again to check the classification accuracy. The chapter also surveys spam-related issues, the filters in use, related work, and the potential scope of spam filtering.
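
The abstract does not say how particle swarm optimization is applied to the dataset, so the following is only a minimal sketch of the generic global-best PSO loop, assuming a continuous search space. The function name `pso_minimize`, the toy `sphere` objective, and all parameter values are illustrative assumptions, not the chapter's method; in a spam-filtering setting the objective would instead be something like the decision tree's validation error for a candidate set of features or records.

```python
import random

def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimal global-best PSO (illustrative). Returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random start positions inside the search box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (pbest[i][d] - pos[i][d])
                             + c_global * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Toy objective with its minimum of 0 at the origin (a stand-in for classifier error)."""
    return sum(v * v for v in x)

best_pos, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best_pos, best_val)
```

Selecting features or records rather than tuning real-valued parameters would typically need a binary PSO variant, which is not shown here.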

2016 ◽  
Vol 26 (03) ◽  
pp. 1750007 ◽  
Author(s):  
S. Dinakaran ◽  
P. Ranjit Jeba Thangaiah

This article introduces a novel ensemble method named eAdaBoost (Effective Adaptive Boosting), a meta classifier developed by enhancing the existing AdaBoost algorithm to reduce time complexity and improve classification accuracy. eAdaBoost reduces the error rate compared with existing methods and improves accuracy by reweighting each feature for further processing. The results of an extensive experimental evaluation of the proposed method on datasets from the UCI Machine Learning Repository are presented. Classifier accuracy and statistical tests are compared against various boosting algorithms. The proposed eAdaBoost has also been implemented with different tree-based base classifiers: C4.5, Decision Stump, NB Tree, and Random Forest. The algorithm has been run on various datasets with different weight thresholds, and its performance is analyzed. For a few datasets, the proposed method produces better results with Random Forest and NB Tree as base classifiers than with Decision Stump and C4.5. eAdaBoost gives better classification and prediction accuracy, and a shorter execution time, than the other classifiers.
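
The abstract does not describe eAdaBoost's modifications in detail, so the sketch below shows only the classic discrete AdaBoost baseline it builds on, with one-dimensional decision stumps as weak learners. The function names and the six-point toy dataset are illustrative assumptions; nothing here reproduces the feature reweighting that eAdaBoost adds.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: predict `polarity` when x >= threshold, else the opposite."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    """Classic AdaBoost with labels in {-1, +1}. Returns [(alpha, threshold, polarity)]."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified samples, down-weight correct ones, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy labels that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(xs, ys, n_rounds=3)
print([adaboost_predict(model, x) for x in xs])
```

On this toy data a single stump misclassifies at least two points, while the weighted vote of three stumps fits all six labels, which is the basic effect that boosting methods exploit.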


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Ersen Yılmaz ◽  
Çağlar Kılıkçıer

We use a least squares support vector machine (LS-SVM) with a binary decision tree to classify cardiotocograms and determine the fetal state. The parameters of the LS-SVM are optimized by particle swarm optimization. The robustness of the method is examined by running 10-fold cross-validation. The performance of the method is evaluated in terms of overall classification accuracy. Additionally, receiver operating characteristic analysis and a cobweb representation are presented in order to analyze and visualize the performance of the method. Experimental results demonstrate that the proposed method achieves a classification accuracy rate of 91.62%.
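
As a rough illustration of the 10-fold cross-validation and overall-accuracy evaluation mentioned above, the sketch below splits sample indices into k folds and pools the test predictions. It is a sketch under stated assumptions: the helper names are hypothetical, and a trivial majority-class classifier stands in for the LS-SVM, which is not implemented here.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validated_accuracy(fit, predict, X, y, k=10, seed=0):
    """Overall accuracy: correct test predictions pooled over all folds / n_samples."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

# Stand-in classifier (an assumption for illustration): always predict the majority label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

X = list(range(20))          # hypothetical feature values
y = [0] * 15 + [1] * 5       # hypothetical labels
overall_accuracy = cross_validated_accuracy(fit_majority, predict_majority, X, y, k=10)
print(overall_accuracy)
```

Each sample is used for testing exactly once, so the pooled accuracy is an estimate over the whole dataset rather than over a single held-out split.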


2020 ◽  
Vol 9 (1) ◽  
pp. 102
Author(s):  
Hendra Hendra ◽  
Mochammad Abdul Azis ◽  
Suhardjono Suhardjono

Good accreditation results are a goal of every college. With good accreditation, a tertiary institution is more attractive to prospective students. Several aspects affect accreditation results, one of which is graduating students, who play an important role in determining accreditation. Students who graduate on time benefit both the college and themselves. Whether a student will graduate on time can be predicted before the final semester using methods such as the decision tree. The decision tree is simple and easy to understand because it produces rules in the form of a tree, but a decision tree model alone is not enough to produce optimal results. An optimization method is therefore needed, namely particle swarm optimization, which can improve accuracy by eliminating unused features. Research on primary data of 2000-2003 graduate students at Amik PPMI Tangerang shows that particle swarm optimization raises accuracy to 87.56%, an increase of 1.01 percentage points over the 86.55% achieved by the decision tree method alone. Particle swarm optimization also reveals which unused attributes carry no weight, and removing them is what improves accuracy. The improved model can be used by the Amik University of Tangerang to help prevent students from failing to graduate on time.


2020 ◽  
Vol 4 (2) ◽  
pp. 296-302
Author(s):  
Ikhsan Romli ◽  
Fairuz Kharida ◽  
Chandra Naya

A Tax Service Office is a work unit of the Directorate General of Taxes that provides taxation services to the public, both registered and unregistered taxpayers, within the working area of the Directorate General of Taxes. The many Primary Tax Service Offices in Indonesia, one of which is the Primary Tax Service Office in Bekasi, use various means to increase taxpayers' satisfaction with the services provided. This study aims to determine the accuracy of predicting taxpayer satisfaction using data mining, specifically the C4.5 decision tree algorithm with Particle Swarm Optimization (PSO) feature selection. Validation uses cross-validation, and accuracy is measured with a confusion matrix. The level of service satisfaction was determined by distributing 500 questionnaires to taxpayers at the Primary Tax Service Office in Bekasi. The results show that the C4.5 decision tree algorithm with PSO feature selection achieves an accuracy of 98.85%, a precision of 98.85%, and a recall of 100% for taxpayer service satisfaction at the Primary (Pratama) Tax Service Office.
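
For readers unfamiliar with how accuracy, precision, and recall are read off a confusion matrix, here is a minimal sketch. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data and do not reproduce its reported figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of true positives that the classifier found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts: no false negatives, a few false positives.
metrics = classification_metrics(tp=90, fp=5, fn=0, tn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

A recall of 100% means only that no positive case was missed; as the hypothetical counts show, precision and accuracy can still fall below 100% when some negatives are classified as positive.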


2021 ◽  
Vol 5 (2) ◽  
pp. 556
Author(s):  
Firman Syahputra ◽  
Hartono Hartono ◽  
Rika Rosnelly

This study evaluates the availability of money in ATMs using data mining. Data mining with the C4.5 algorithm is used to predict cash demand, or total cash withdrawals, at ATMs and so determine the need for ATM cash based on cash transaction data. It is hoped that this forecasting can help the monitoring department decide how much money must be allocated to each ATM. The results are expected to assist the ATM management unit in optimizing and monitoring the availability of money at an ATM, so that it can provide optimal service to customers. C4.5 is an algorithm that builds a decision tree, from which new knowledge is then derived. The test results matched the data on the availability of money at the ATMs. The C4.5 model of money availability is based on the travel time to the ATM location and the remaining balance in the machine. The resulting decision tree uses the balance variable as the root, travel time as the branch at Level 1 with the values fast, medium, and long, and the bank as the branch at the last level (Level 2). The C4.5 algorithm was then tested using k-fold cross-validation with k = 10, giving an accuracy of 85%, a precision of 80%, and a recall of 66.67%. The AUC (area under the curve) value is 0.833; the closer the AUC value is to 1, the better the classification performance.
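
C4.5 chooses the attribute at each node, including the root, by gain ratio. The sketch below computes that criterion for categorical attributes; the six rows and the label names are hypothetical, invented only to show why an attribute such as the balance could end up as the root, and they are not the study's data.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """C4.5 split criterion: information gain divided by split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected label entropy after splitting on this attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalizes attributes with many distinct values
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical ATM records (illustration only).
attributes = {
    "balance":     ["low", "low", "low", "high", "high", "high"],
    "travel_time": ["fast", "medium", "long", "fast", "medium", "long"],
}
labels = ["refill", "refill", "refill", "ok", "ok", "ok"]

scores = {name: gain_ratio(values, labels) for name, values in attributes.items()}
root = max(scores, key=scores.get)
print(scores, root)
```

The attribute with the highest gain ratio becomes the split at that node, and the same calculation is repeated on each resulting subset to grow the lower levels of the tree.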


Author(s):  
Alice Constance Mensah ◽  
Isaac Ofori Asare

Breast cancer is the most common of all cancers and is the leading cause of cancer deaths in women worldwide. The classification of breast cancer data can be useful to predict the outcome of some diseases or discover the genetic behavior of tumors. Data mining helps classify cancer patients and identify potential cancer patients simply by analyzing the data. This study examines the determinant factors of breast cancer and uses breast cancer patient data to build a useful classification model with a data mining approach. Of the 2397 women in this study, 1022 (42.64%) were diagnosed with breast cancer. Four main learning techniques were used: Random Forest, Naive Bayes, Classification and Regression Trees (CART), and a Boosted Tree model. The Random Forest technique had the best accuracy, 0.9892 (95% CI, 0.9832-0.9935), and a sensitivity of about 92%. This suggests that, among the models compared, Random Forest is the best at classifying and predicting breast cancer based on associated factors.
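
The abstract reports a 95% confidence interval for accuracy but does not say how it was computed or how large the test set was. As one common option, the sketch below computes a Wilson score interval for a proportion; the choice of method, the function name, and the counts used are assumptions for illustration and are not claimed to reproduce the reported interval.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical test set: 480 of 500 cases classified correctly.
low, high = wilson_interval(480, 500)
print(f"accuracy = {480 / 500:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

The interval narrows as the number of test cases grows, which is why an accuracy figure is easier to interpret when the test set size is reported alongside it.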


Author(s):  
Esra Aksoy ◽  
Serkan Narli ◽  
Mehmet Akif Aksoy

The aim of this chapter is to illustrate both the uses of data mining methods and the ways these methods can be applied in education, using students' multiple intelligences. Data mining is a data analysis methodology that has been successfully used in different areas, including the educational domain. In this context, an application of educational data mining (EDM) is illustrated using multiple intelligences and some other variables (e.g., learning styles and personality types). A decision tree model was built from students' learning styles, multiple intelligences, and personality types to identify gifted students. The sample consisted of 735 middle school students. The constructed decision tree model, with 70% validity, revealed that examining mathematically gifted students using data mining techniques may be possible if specific characteristics are included.


2019 ◽  
Vol 11 (7) ◽  
pp. 155 ◽  
Author(s):  
Yufeng Wang ◽  
Shuangrong Liu ◽  
Songqian Li ◽  
Jidong Duan ◽  
Zhihao Hou ◽  
...  

Social network services for self-media, such as Weibo, Blog, and WeChat Public, constitute a powerful medium that allows users to publish posts every day. Due to insufficient information transparency, malicious internet marketing in self-media posts can harm society. It is therefore necessary to identify news with marketing intentions in everyday life. We follow the idea of text classification to identify marketing intentions. Although some current methods address intention detection, the challenges are how to make the extracted text features reflect semantic information and how to improve the time and space complexity of the recognition model. To this end, this paper proposes a machine learning method to identify marketing intentions from large-scale We-Media data. First, the proposed Latent Semantic Analysis (LSI)-Word2vec model reflects the semantic features. Second, the decision tree model is simplified by pruning to save computing resources and reduce the time complexity. Third, this paper examines the effects of classifier associations and uses the optimal configuration to help people efficiently identify marketing intention. Finally, a detailed experimental evaluation on several metrics shows that our approaches are effective and efficient. The F1 value is increased by about 5%, and the running time is improved by 20%, which shows that the proposed method can effectively improve the accuracy of marketing news recognition.

