The Research of Key Data Classification Optimal Mining Methods for Massive Data

Data classification is an important issue in massive data mining. This paper proposes an inter-cell classification algorithm based on phase-recombination neighbor-point convergence. The algorithm analyzes the convergence value weights of inter-cell characteristic points and filters out the interference of the minority of locally optimal characteristic points, which promotes the convergence of neighbor points in the inter-cell classification data. Simulation experiments test the model on three types of actually collected data sets, and the results illustrate that the model has better classification performance.

A systematical approach to classification problems with feature space heterogeneity

Kybernetes ◽

10.1108/k-06-2018-0313 ◽

2019 ◽

Vol 48 (9) ◽

pp. 2006-2029

Author(s):

Hongshan Xiao ◽

Yu Wang

Keyword(s):

Factor Analysis ◽

Meta Analysis ◽

Feature Space ◽

Classification Performance ◽

Classification Algorithm ◽

Significant Feature ◽

Data Sets ◽

Data Set ◽

Classification Techniques ◽

Content Type

Purpose: Feature space heterogeneity exists widely in various application fields of classification techniques, such as customs inspection decisions, credit scoring and medical diagnosis. This paper aims to study the relationship between feature space heterogeneity and classification performance.

Design/methodology/approach: A measurement is first developed for measuring and identifying any significant heterogeneity that exists in the feature space of a data set. The main idea of this measurement is derived from meta-analysis. For a data set with significant feature space heterogeneity, a classification algorithm based on factor analysis and clustering is proposed to learn the data patterns, which, in turn, are used for data classification.

Findings: The proposed approach has two main advantages over previous methods. The first lies in the feature transform using orthogonal factor analysis, which results in new features without redundancy and irrelevance. The second rests on partitioning the samples to capture the feature space heterogeneity reflected by differences in factor scores. The validity and effectiveness of the proposed approach are verified on a number of benchmark data sets.

Research limitations/implications: The measurement should be used to guide the heterogeneity elimination process, which is an interesting topic for future research. In addition, developing a classification algorithm that enables scalable and incremental learning for large data sets with significant feature space heterogeneity is also an important issue.

Practical implications: Measuring and eliminating the feature space heterogeneity that may exist in the data are important for accurate classification. This study provides a systematic approach to feature space heterogeneity measurement and elimination for better classification performance, which is favorable for applications of classification techniques to real-world problems.

Originality/value: A measurement based on meta-analysis for measuring and identifying any significant feature space heterogeneity in a classification problem is developed, and an ensemble classification framework is proposed to deal with the feature space heterogeneity and improve the classification accuracy.

Chicken swarm foraging algorithm for big data classification using the deep belief network classifier

Data Technologies and Applications ◽

10.1108/dta-08-2019-0146 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Sathyaraj R ◽

Ramanathan L ◽

Lavanya K ◽

Balasubramanian V ◽

Saira Banu J

Keyword(s):

Big Data ◽

Data Classification ◽

Massive Data ◽

Data Sets ◽

Jaccard Coefficient ◽

Training Phase ◽

MapReduce Framework ◽

Massive Data Sets ◽

Content Type ◽

Big Data Classification

Purpose: Innovation in big data is increasing day by day, to the point that conventional software tools face several problems in managing big data. Moreover, the occurrence of imbalanced data in massive data sets is a major constraint for the research industry.

Design/methodology/approach: The purpose of the paper is to introduce a big data classification technique using the MapReduce framework based on an optimization algorithm. The big data classification is enabled using the MapReduce framework, which utilizes the proposed optimization algorithm, named the chicken-based bacterial foraging (CBF) algorithm. The proposed algorithm is generated by integrating the bacterial foraging optimization (BFO) algorithm with the cat swarm optimization (CSO) algorithm. The proposed model executes the process in two stages, namely, the training and testing phases. In the training phase, the big data produced by different distributed sources is processed in parallel by the mappers in the mapper phase, which perform the preprocessing and feature selection based on the proposed CBF algorithm. The preprocessing step eliminates redundant and inconsistent data, whereas the feature selection step is applied to the preprocessed data to extract the significant features and thereby improve classification accuracy. The selected features are fed into the reducer for data classification using the deep belief network (DBN) classifier, which is trained using the proposed CBF algorithm such that the data are classified into various classes. At the end of the training process, the individual reducers present the trained models. Thus, the incremental data are handled effectively based on the model obtained in the training phase. In the testing phase, the incremental data are split into different subsets and fed into the different mappers for classification. Each mapper contains a trained model obtained from the training phase, which is used to classify the incremental data. After classification, the output obtained from each mapper is fused and fed into the reducer for the classification.

Findings: The maximum accuracy and Jaccard coefficient are obtained using the epileptic seizure recognition database. The proposed CBF-DBN produces a maximal accuracy of 91.129%, whereas the accuracy values of the existing neural network (NN), DBN and naive Bayes classifier-term frequency–inverse document frequency (NBC-TFIDF) methods are 82.894%, 86.184% and 86.512%, respectively. The proposed CBF-DBN produces a maximal Jaccard coefficient of 88.928%, whereas the Jaccard coefficient values of the existing NN, DBN and NBC-TFIDF methods are 75.891%, 79.850% and 81.103%, respectively.

Originality/value: In this paper, a big data classification method is proposed for categorizing massive data sets while meeting the constraints of huge data. The big data classification is performed on the MapReduce framework based on training and testing phases in such a way that the data are handled in parallel. In the training phase, the big data is obtained, partitioned into different subsets and fed into the mappers. In the mapper, the feature extraction step is performed to extract the significant features. The obtained features are passed to the reducers, which use them to classify the data. The DBN classifier is utilized for the classification, wherein the DBN is trained using the proposed CBF algorithm. The trained model is obtained as an output after the classification. In the testing phase, the incremental data are considered for the classification. New data are first split into subsets and fed into the mappers for classification. The trained models obtained from the training phase are used for the classification. The classified results from each mapper are fused and fed into the reducer for the classification of big data.
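The testing-phase data flow described above (split the incremental data, classify each subset in a mapper, fuse the outputs in a reducer) can be sketched in a few lines of plain Python. This is only an illustration of the flow, not the authors' implementation: the function names, the round-robin split and the `threshold_model` stand-in for a trained CBF-DBN model are all assumptions made for the example.

```python
from collections import Counter

def split_into_subsets(records, n_mappers):
    """Partition records round-robin into n_mappers subsets (assumed split rule)."""
    return [records[i::n_mappers] for i in range(n_mappers)]

def mapper(subset, model):
    """Apply a trained model to every record in one subset."""
    return [(record, model(record)) for record in subset]

def reducer(mapper_outputs):
    """Fuse the per-mapper outputs and count the records assigned to each class."""
    fused = [pair for output in mapper_outputs for pair in output]
    counts = Counter(label for _, label in fused)
    return fused, counts

def threshold_model(x):
    """Hypothetical stand-in for a trained model: thresholds a single number."""
    return "high" if x >= 5 else "low"

incremental_data = [1, 7, 3, 9, 5, 2]
subsets = split_into_subsets(incremental_data, n_mappers=3)
outputs = [mapper(s, threshold_model) for s in subsets]
fused, counts = reducer(outputs)
print(subsets)
print(dict(counts))
```

In a real MapReduce deployment the mappers would run on separate nodes; here they run sequentially so that the sketch stays self-contained.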

A Method of Feature Automatic Selection Based on Mutual Information Grouping and Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.543-547.1613 ◽

2014 ◽

Vol 543-547 ◽

pp. 1613-1618

Author(s):

Man Sheng Xiao ◽

Zhe Xiao ◽

Zhi Liu

Keyword(s):

Mutual Information ◽

Clustering Algorithm ◽

Data Classification ◽

Optimal Number ◽

Massive Data ◽

Data Sets ◽

Maximum Correlation ◽

Automatic Selection ◽

Feature Correlation ◽

Fuzzy C Means Clustering

To address the problem that a large number of irrelevant and redundant features may reduce the performance of data classification in massive data sets, a method of automatic feature selection based on mutual information and a fuzzy clustering algorithm is proposed. The method proceeds in two steps. The first step computes the feature correlation based on mutual information and groups the data according to the feature with the maximum correlation. The second step automatically determines the optimal number of features and compresses the feature dimension using the fuzzy c-means clustering algorithm within the data groups. Theoretical analysis and experiments indicate that the method achieves higher efficiency in data classification.
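As a rough illustration of the first step, the sketch below computes the mutual information between each discrete feature and the class labels and picks the feature with the largest value. It assumes discrete features and uses the textbook definition of mutual information; the helper names and the toy data are hypothetical, and the grouping and fuzzy c-means steps of the paper are not reproduced.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

def most_correlated_feature(features, labels):
    """Return the name of the feature with maximum mutual information, plus all scores."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): f1 determines the label, f2 is independent of it.
labels = [0, 0, 1, 1]
features = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1]}
best, scores = most_correlated_feature(features, labels)
print(best, scores)
```

Continuous features would first need to be discretized, or the counts replaced by a density estimate, which is outside the scope of this sketch.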

Fundamental resource trade-offs for encoded distributed optimization

Information and Inference: A Journal of the IMA ◽

10.1093/imaiai/iaaa026 ◽

2020 ◽

Author(s):

A Salman Avestimehr ◽

Seyed Mohammadreza Mousavi Kalan ◽

Mahdi Soltanolkotabi

Keyword(s):

Computational Time ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Computational Framework ◽

Data Set ◽

Trade Offs ◽

Major Bottleneck ◽

Computing Environments ◽

Analyze Data

Dealing with the sheer size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slowly. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.
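The straggler effect can be made concrete with a small timing sketch. It assumes a simplified setting in which the data redundancy is such that the results of any k out of n workers suffice for one iteration; this is a simplification for illustration, not the encoding scheme analyzed in the paper, and the worker times are made up.

```python
def time_waiting_for_all(worker_times):
    """Without redundancy, an iteration finishes only when the slowest worker does."""
    return max(worker_times)

def time_waiting_for_fastest(worker_times, k):
    """With redundancy such that any k results suffice (assumption),
    an iteration finishes when the k-th fastest worker does."""
    if not 1 <= k <= len(worker_times):
        raise ValueError("k must be between 1 and the number of workers")
    return sorted(worker_times)[k - 1]

# Hypothetical per-worker times in seconds; the fourth worker is a straggler.
times = [1.0, 1.2, 0.9, 6.5, 1.1]
t_all = time_waiting_for_all(times)
t_k = time_waiting_for_fastest(times, k=4)
print(t_all, t_k)
```

The gain in iteration time is paid for with extra computational load per worker, which is one of the trade-offs the abstract refers to.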

An Incremental Classification Algorithm for Mining Data with Feature Space Heterogeneity

Mathematical Problems in Engineering ◽

10.1155/2014/327142 ◽

2014 ◽

Vol 2014 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Yu Wang

Keyword(s):

Feature Space ◽

Classification Problem ◽

Classification Algorithm ◽

Data Sets ◽

Real World Data ◽

Supervised Clustering ◽

Online Classification ◽

Efficiency And Effectiveness ◽

Feature Relevance ◽

Incremental Classification

Feature space heterogeneity often exists in real-world data sets, so that some features are of different importance for classification over different subsets. Moreover, the pattern of feature space heterogeneity might change dynamically over time as more and more data are accumulated. In this paper, we develop an incremental classification algorithm, Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH), to address this problem. In our approach, supervised clustering is implemented to obtain a number of clusters such that the samples in each cluster are from the same class. After the removal of outliers, the relevance of the features in each cluster is calculated based on their variations within that cluster. The feature relevance is incorporated into the distance calculation for classification. The main advantage of SCCFSH lies in the fact that it is capable of solving a classification problem with feature space heterogeneity in an incremental way, which is favorable for online classification tasks with continuously changing data. Experimental results on a series of data sets and an application to a database marketing problem show the efficiency and effectiveness of the proposed approach.
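The idea of folding per-cluster feature relevance into the distance calculation might look like the sketch below. The abstract does not give the relevance formula, so the example assumes that a feature's relevance in a cluster is inversely proportional to its variance there, and that a sample is assigned the class of the nearest cluster centroid under the weighted distance. All names and data are hypothetical, and supervised clustering and outlier removal are taken as already done.

```python
from statistics import mean, pvariance

def cluster_profile(samples, eps=1e-9):
    """Centroid and normalized feature weights for one single-class cluster.
    Assumption: low within-cluster variance means high relevance."""
    columns = list(zip(*samples))
    centroid = [mean(col) for col in columns]
    raw = [1.0 / (pvariance(col) + eps) for col in columns]
    total = sum(raw)
    weights = [r / total for r in raw]
    return centroid, weights

def weighted_distance(x, centroid, weights):
    """Euclidean distance with each squared difference scaled by a feature weight."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, centroid)) ** 0.5

def classify(x, clusters):
    """clusters is a list of (class_label, samples); return the label of the nearest cluster."""
    best_label, best_dist = None, float("inf")
    for label, samples in clusters:
        centroid, weights = cluster_profile(samples)
        dist = weighted_distance(x, centroid, weights)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# In cluster "A" only the first feature is stable, so it dominates the distance.
clusters = [
    ("A", [(1.0, 0.0), (1.0, 10.0), (1.0, 5.0)]),
    ("B", [(5.0, 5.0), (5.0, 5.2), (5.2, 5.0)]),
]
print(classify((1.1, 9.0), clusters), classify((5.1, 5.1), clusters))
```

An incremental variant would update the centroids and variances as new samples arrive instead of recomputing them from stored samples.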

A methodology for supporting collaborative exploratory analysis of massive data sets in tele-immersive environments

Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469) ◽

10.1109/hpdc.1999.805283 ◽

2003 ◽

Cited By ~ 8

Author(s):

J. Leigh ◽

A.E. Johnson ◽

T.A. DeFanti ◽

S. Bailey ◽

R. Grossman

Keyword(s):

Exploratory Analysis ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Immersive Environments

Imbalanced data classification algorithm based on boosting and cascade model

2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC) ◽

10.1109/icsmc.2012.6378183 ◽

2012 ◽

Author(s):

Xiaolong Zhang ◽

Chao Cheng

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Classification Algorithm ◽

Cascade Model ◽

Imbalanced Data Classification

Sparse Matrix Approach in Neural Networks for Effective Medical Data Sets Classifications

Journal of Basic and Applied Research in Biomedicine ◽

10.51152/jbarbiomed.v6i2.113 ◽

2020 ◽

Vol 6 (2) ◽

pp. 90-97

Author(s):

Sagir Masanawa ◽

Hamza Abubakar

Keyword(s):

Intelligent System ◽

Sparse Matrix ◽

Data Classification ◽

Medical Data ◽

Data Sets ◽

Matrix Approach ◽

Neural Network Learning ◽

Network Learning ◽

Hybrid Intelligent System ◽

Medical Data Classification

In this paper, a hybrid intelligent system that incorporates the sparse matrix approach into a neural network learning model is presented as a decision support tool for medical data classification. The main objective of this research is to develop an effective intelligent system that can be used by medical practitioners to accelerate diagnosis and treatment processes. The sparse matrix approach is incorporated into the neural network learning algorithm to improve scalability, reduce memory usage, shorten implementation time and speed up the analysis of the medical data classification problem. The hybrid intelligent system aims to exploit the advantages of the constituent models and, at the same time, alleviate their limitations. The proposed intelligent classification system maximizes the intelligent classification of medical data and minimizes the number of trends inaccurately identified. To evaluate the effectiveness of the hybrid intelligent system, three benchmark medical data sets, viz., Hepatitis, SPECT Heart and Cleveland Heart from the UCI Repository of Machine Learning, are used. Performance metrics useful in medical applications, including accuracy, sensitivity and specificity, are used. The results were analyzed and compared with those from other methods published in the literature. The experimental outcomes demonstrate that the hybrid intelligent system was effective in undertaking medical data classification tasks.
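For reference, the three metrics named above can be computed from the confusion counts of a binary classifier as in the following sketch. The function name and the toy labels are made up for the example; they are not results from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Toy labels (hypothetical): 1 = disease present, 0 = disease absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity and specificity are reported separately because in medical data a missed positive and a false alarm usually carry different costs, which accuracy alone hides.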

An Associate Rules Mining Algorithm Based on Artificial Immune Network for SAR Image Segmentation

Mathematical Problems in Engineering ◽

10.1155/2015/839081 ◽

2015 ◽

Vol 2015 ◽

pp. 1-14

Author(s):

Mengling Zhao ◽

Hongwei Liu

Keyword(s):

Association Rules ◽

Large Scale ◽

Classification Algorithm ◽

Data Sets ◽

Immune Network ◽

Artificial Immune ◽

Mining Algorithm ◽

SAR Image Segmentation ◽

Artificial Immune Network ◽

Adaptive PSO

As a computational intelligence method, the artificial immune network (AIN) algorithm has been widely applied to pattern recognition and data classification. In existing artificial immune network algorithms, the affinity used for classification is computed from a distance measure, which may lead to unsatisfactory results when dealing with data with nominal attributes. To overcome this shortcoming, association rules are introduced into the AIN algorithm, and we propose a new classification algorithm: an associate rules mining algorithm based on artificial immune network (ARM-AIN). The new method uses association rules to represent immune cells and mines the best association rules rather than searching for optimal clustering centers. The proposed algorithm has been extensively compared with the artificial immune network classification (AINC) algorithm, the artificial immune network classification algorithm based on self-adaptive PSO (SPSO-AINC), and PSO-AINC over several large-scale data sets, target recognition in remote sensing images, and segmentation of three different SAR images. The experimental results indicate the superiority of ARM-AIN in classification accuracy and running time.
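Association rules handle nominal attributes naturally because they rely on counting co-occurrences rather than on distances. The sketch below computes the standard support and confidence of a rule over records encoded as sets of `attribute=value` items. The records and attribute names are invented for the example, and the immune network search that ARM-AIN uses to find the best rules is not shown.

```python
def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for record in records if items <= set(record)) / len(records)

def confidence(records, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    s_antecedent = support(records, antecedent)
    if s_antecedent == 0:
        return 0.0
    return support(records, set(antecedent) | set(consequent)) / s_antecedent

# Hypothetical records with nominal attributes and a class item.
records = [
    {"tone=dark", "texture=smooth", "class=water"},
    {"tone=dark", "texture=smooth", "class=water"},
    {"tone=dark", "texture=rough", "class=land"},
    {"tone=bright", "texture=rough", "class=land"},
]
rule_support = support(records, {"tone=dark", "class=water"})
rule_confidence = confidence(records, {"tone=dark"}, {"class=water"})
print(rule_support, rule_confidence)
```

A rule-based classifier would typically keep only rules whose support and confidence exceed chosen thresholds and whose consequent is a class item.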

Stamping Plant 4.0 – Basics for the Application of Data Mining Methods in Manufacturing Car Body Parts

Key Engineering Materials ◽

10.4028/www.scientific.net/kem.639.21 ◽

2015 ◽

Vol 639 ◽

pp. 21-30 ◽

Cited By ~ 7

Author(s):

Stephan Purr ◽

Josef Meinhardt ◽

Arnulf Lipp ◽

Axel Werner ◽

Martin Ostermair ◽

...

Keyword(s):

Data Mining ◽

Data Acquisition ◽

Data Driven ◽

Quality Analysis ◽

Process Conditions ◽

Data Sets ◽

Body Parts ◽

Car Body ◽

Sample Data ◽

Mining Methods

Data-driven quality evaluation in the stamping process of car body parts is quite promising because dependencies in the process have not yet been sufficiently researched. However, applying data mining methods to the process in stamping plants would require a large number of sample data sets. Today, acquiring these data represents a major challenge, because the necessary data are inadequately measured, recorded or stored. Thus, the preconditions for sample data acquisition must first be created before any correlations can be investigated. In addition, the process conditions change over time due to wear mechanisms. Therefore, the results do not remain valid and continuous data acquisition is required. In this publication, the current situation in stamping plants regarding process robustness is first discussed and the need for data-driven methods is shown. Subsequently, the state of technology regarding the possibility of collecting sample data sets for quality analysis in the production of car body parts is examined. At the end of this work, an overview is provided of how this data collection was implemented at BMW and what kind of potential can be expected.
