attribute clustering
Recently Published Documents


Total documents: 51 (five years: 10)

H-index: 7 (five years: 2)

2021 ◽  
Vol 7 ◽  
pp. e671
Author(s):  
Shilpi Bose ◽  
Chandra Das ◽  
Abhik Banerjee ◽  
Kuntal Ghosh ◽  
Matangini Chattopadhyay ◽  
...  

Background: Machine learning is a machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Because of this capability, machine learning techniques are widely used in medical applications, especially those involving large-scale genomic and proteomic data. Cancer classification based on bio-molecular profiling data is an important topic for medical applications because it improves the diagnostic accuracy of cancer and supports successful treatment. Hence, machine learning techniques are widely used in cancer detection and prognosis.

Methods: In this article, a new ensemble machine learning classification model, the Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC), is proposed to handle the class imbalance problem and the high dimensionality of microarray datasets. The model first generates a number of bootstrapped datasets from the original training data, applying an oversampling procedure to handle class imbalance. The proposed MFSAC method is then applied to each bootstrapped dataset to generate a sub-dataset containing a subset of the most relevant/informative attributes of the original dataset. MFSAC is a feature selection technique that combines multiple filters with a new supervised attribute clustering algorithm. A base classifier is then constructed separately for every sub-dataset, and finally the predictions of these base classifiers are combined by majority voting to form the MFSAC-based ensemble classifier. In addition, the most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets.

Results: To assess its performance, the proposed MFSAC-EC model is applied to different high-dimensional microarray gene expression datasets for cancer sample classification. The proposed model is compared with well-known existing models to establish its effectiveness. The experimental results show that the generalization performance/testing accuracy of the proposed classifier is significantly better than that of other well-known existing models. The proposed model can also identify many important attributes/biomarker genes.
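The pipeline described in this abstract (bootstrap with oversampling, per-bag attribute selection, one base classifier per sub-dataset, majority voting, and frequency-based feature ranking) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the MFSAC filters or the base classifier, so a simple class-mean-spread filter and a nearest-centroid classifier are used as stand-ins, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_oversampled(X, y, rng):
    """Draw a bootstrap sample, then duplicate minority rows until classes balance."""
    by_class = {}
    for i in (rng.randrange(len(X)) for _ in X):
        by_class.setdefault(y[i], []).append(i)
    for label in sorted(set(y)):  # make sure no class is missing from the bag
        if label not in by_class:
            by_class[label] = [rng.choice([i for i, t in enumerate(y) if t == label])]
    target = max(len(members) for members in by_class.values())
    for members in by_class.values():
        while len(members) < target:
            members.append(rng.choice(members))  # random oversampling
    chosen = [i for members in by_class.values() for i in members]
    return [X[i] for i in chosen], [y[i] for i in chosen]

def top_k_features(X, y, k):
    """Stand-in for MFSAC (assumption): rank features by spread of per-class means."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        means = [mean(r[j] for r, t in zip(X, y) if t == label) for label in labels]
        scores.append((max(means) - min(means), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

def fit_centroids(X, y, features):
    """Stand-in base classifier (assumption): one centroid per class."""
    return {label: [mean(r[j] for r, t in zip(X, y) if t == label) for j in features]
            for label in sorted(set(y))}

def predict_one(centroids, features, row):
    def sq_dist(center):
        return sum((row[j] - c) ** 2 for j, c in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def fit_ensemble(X, y, n_bags=15, k=2, seed=0):
    rng = random.Random(seed)
    members, usage = [], Counter()
    for _ in range(n_bags):
        Xb, yb = bootstrap_oversampled(X, y, rng)
        feats = top_k_features(Xb, yb, k)
        members.append((feats, fit_centroids(Xb, yb, feats)))
        usage.update(feats)  # how often each attribute was selected across bags
    return members, usage

def predict(members, row):
    """Majority vote over the base classifiers' predictions."""
    votes = Counter(predict_one(c, f, row) for f, c in members)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: attributes 0 and 1 separate the classes, 2 and 3 are noise.
X_TOY = [[1.0, 1.2, 0.3, 0.5], [0.8, 0.9, 0.6, 0.4], [1.1, 1.0, 0.4, 0.6],
         [0.9, 1.1, 0.5, 0.3], [1.2, 0.8, 0.2, 0.5], [1.0, 1.0, 0.6, 0.6],
         [5.0, 4.1, 0.4, 0.4], [5.2, 3.9, 0.5, 0.6], [4.8, 4.0, 0.3, 0.5]]
Y_TOY = ["normal"] * 6 + ["tumor"] * 3
```

Calling `fit_ensemble(X_TOY, Y_TOY)` returns the list of base classifiers together with a counter of how often each attribute was selected; the most frequently selected attributes play the role of the "important features" mentioned in the abstract.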


Author(s):  
Shihua Liu ◽  
Hao Zhang ◽  
Xianghua Liu

A two-stage clustering framework and a clustering algorithm for mixed attribute data based on density peaks and Goodall distance are proposed. First, the subset of numerical attributes of the dataset is clustered; the result is then mapped to a one-dimensional categorical attribute and added to the subset of categorical attribute data. Finally, the new dataset is clustered by the density peaks clustering algorithm to obtain the final result. Experiments on three commonly used UCI datasets show that this algorithm can effectively realize mixed attribute clustering and produces better clustering results than the traditional K-prototypes algorithm. The clustering accuracy on the Acute, Heart and Credit datasets is on average 17%, 24%, and 21% higher, respectively, than that of K-prototypes.
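A minimal sketch of the two-stage idea may help make the data flow concrete. It rests on several assumptions not stated in the abstract: stage 1 uses plain k-means with a deterministic initialization, and the categorical distance is simple mismatch counting rather than the Goodall distance used by the authors. The function names and the cutoff parameter `dc` are illustrative.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(rows, k, iters=20):
    """Stage 1 stand-in (assumption): k-means on the numerical attributes."""
    ordered = sorted(rows)
    centers = [ordered[(2 * c + 1) * len(rows) // (2 * k)] for c in range(k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(r, centers[c])) for r in rows]
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def mismatch(a, b):
    """Share of attributes that differ (stand-in for the Goodall distance)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def density_peaks(points, k, dc, dist):
    """Density peaks clustering: centers have high density and lie far from denser points."""
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and d[i][j] <= dc) for i in range(n)]
    order = sorted(range(n), key=lambda i: (-rho[i], i))  # densest first, ties by index
    delta, parent = [0.0] * n, [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(d[i])  # densest point: use its largest distance
        else:
            parent[i] = min(order[:pos], key=lambda j: d[i][j])
            delta[i] = d[i][parent[i]]
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:k]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:  # inherit the label of the nearest denser point
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

def two_stage_cluster(numeric, categorical, k, dc):
    stage1 = kmeans_labels(numeric, k)
    # Map the stage 1 result to one extra categorical attribute.
    merged = [tuple(cat) + (f"n{lab}",) for cat, lab in zip(categorical, stage1)]
    return density_peaks(merged, k, dc, mismatch)

# Hypothetical toy records: two numerical and two categorical attributes each.
NUMERIC = [[1.0, 1.2], [1.1, 0.9], [0.9, 1.0], [1.2, 1.1],
           [9.0, 9.1], [8.8, 9.0], [9.1, 8.9], [9.2, 9.2]]
CATEGORICAL = [("red", "round"), ("red", "round"), ("red", "oval"), ("red", "round"),
               ("blue", "square"), ("blue", "square"), ("blue", "flat"), ("blue", "square")]
```

On this toy input, `two_stage_cluster(NUMERIC, CATEGORICAL, k=2, dc=0.4)` places the first four records in one cluster and the last four in the other. Replacing `mismatch` with a frequency-aware measure such as a Goodall variant would bring the sketch closer to the method described.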


2020 ◽  
Vol 12 (4) ◽  
pp. 506-511
Author(s):  
Min Sun ◽  
Jiang Duan

To enhance the feature extraction capacity of nanofibers, a method of feature detection based on nonlinear mapping pattern recognition is proposed. The characteristic distribution model of nanofibers is constructed, and the spectral characteristic decomposition method is used to recognize the nonlinear mapping pattern of nanofibers at current density. Spatial spectrum beamforming of nanofiber features is carried out using a cluster–cluster hybrid molecular reconstruction method, and association rule feature decomposition of nanofibers is carried out by a recursive graph analysis method, realizing nonlinear mapping pattern recognition of nanofiber features. Classification and recognition of nanofiber features are carried out in combination with the correlation attribute clustering method, which optimizes nanofiber feature detection. The proposed method has higher accuracy than other methods. Its nonlinear mapping pattern recognition performance is good, and it recognizes the crystal structure characteristics of nanofibers more accurately.


Author(s):  
Alireza Rahimi ◽  
Ghazaleh Azimi ◽  
Hamidreza Asgari ◽  
Xia Jin

Heterogeneity of crash data masks the underlying crash patterns and complicates crash analysis. This paper explores an advanced high-dimensional clustering approach to investigate heterogeneity in large datasets. Detailed records of crashes involving large trucks in the state of Florida between 2007 and 2016 were examined to identify truck crash patterns and the significant conditions contributing to them. The block clustering method was applied to more than 220,000 crash records with nearly 200 attributes. The analysis showed promising results in segmenting a large heterogeneous dataset into meaningful subgroups (with a 95.72% average degree of homogeneity for selected blocks). The goodness of fit of the clustering methods was evaluated, and both the integrated completed likelihood (ICL) and pseudo-likelihood values improved significantly (20.8% and 21.1%, respectively). Attribute clustering showed distinct characteristics for each cluster. Crash clustering revealed significant differences among the clusters and suggested that this crash dataset could be partitioned into same-direction, opposing-direction, and single-vehicle crashes. Individual blocks defined by both row and column clustering were further investigated to better understand the set of contributing conditions that lead to large truck crashes. Major features of each of the three major crash types were analyzed, which may provide additional insights for developing countermeasures and strategies that target specific segments. The clustering approach could be used as a pre-analysis method to identify homogeneous subgroups for further analysis, which will help enhance the effectiveness of safety programs.
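To illustrate what a block is in this setting, the sketch below takes a binary record-by-attribute matrix together with row (crash) and column (attribute) cluster labels and scores each block. The abstract does not define its "degree of homogeneity", so the share of the majority value within a block is assumed here purely for illustration; the function name and the toy matrix are hypothetical.

```python
from collections import defaultdict

def block_homogeneity(matrix, row_labels, col_labels):
    """Share of the majority value (0 or 1) inside each (row cluster, column cluster) block."""
    counts = defaultdict(lambda: [0, 0])  # block -> [number of zeros, number of ones]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            counts[(row_labels[i], col_labels[j])][value] += 1
    return {block: max(c) / sum(c) for block, c in counts.items()}

# Hypothetical toy data: 4 records x 4 binary attributes, 2 row and 2 column clusters.
MATRIX = [[1, 1, 0, 0],
          [1, 1, 0, 1],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]
ROW_LABELS = [0, 0, 1, 1]
COL_LABELS = [0, 0, 1, 1]
```

Here `block_homogeneity(MATRIX, ROW_LABELS, COL_LABELS)` reports three perfectly homogeneous blocks and one block, `(0, 1)`, in which three of the four cells share the same value.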

