High-Dimensional Text Datasets Clustering Algorithm Based on Cuckoo Search and Latent Semantic Indexing

Saida Ishak Boushaki; Nadjet Kamel; Omar Bendjeghaba

doi:10.1142/s0219649218500338

High-Dimensional Text Datasets Clustering Algorithm Based on Cuckoo Search and Latent Semantic Indexing

Journal of Information & Knowledge Management ◽

10.1142/s0219649218500338 ◽

2018 ◽

Vol 17 (03) ◽

pp. 1850033 ◽

Cited By ~ 2

Author(s):

Saida Ishak Boushaki ◽

Nadjet Kamel ◽

Omar Bendjeghaba

Keyword(s):

Clustering Algorithm ◽

Sparse Matrix ◽

Clustering Algorithms ◽

Document Clustering ◽

Cuckoo Search ◽

Latent Semantic Indexing ◽

Computational Time ◽

High Dimensional ◽

Semantic Indexing ◽

Number Of Clusters

Clustering is an important data analysis technique. However, clustering high-dimensional data such as documents requires more effort to extract the rich, relevant information hidden in the multidimensional space. Recently, document clustering algorithms based on metaheuristics have demonstrated their efficiency in exploring the search space and reaching the global best solution rather than a local one. However, most of these algorithms are not practical and suffer from several limitations: they require the number of clusters to be known in advance, they are neither incremental nor extensible, and the documents are indexed by a high-dimensional, sparse matrix. To overcome these limitations, we propose in this paper a new dynamic and incremental approach (CS_LSI) for document clustering based on the recent cuckoo search (CS) optimization and latent semantic indexing (LSI). Experiments conducted on four well-known high-dimensional text datasets show the efficiency of the LSI model in reducing the dimensionality of the space with more precision and less computational time. In addition, the proposed CS_LSI determines the number of clusters automatically by employing a newly proposed index based on a significant distance measure. The latter is also used in the incremental mode and to detect outlier documents, thereby maintaining more coherent clusters. Furthermore, comparison with conventional document clustering algorithms shows the superiority of CS_LSI in achieving high-quality clustering.
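To make the cuckoo search component more concrete, the following is a minimal sketch of a generic cuckoo search loop (Lévy flights plus abandonment of the worst nests). It is not the CS_LSI algorithm from the paper: the function names, the step scale of 0.01, the search bounds, and the sphere objective are all assumptions chosen for illustration. In a document clustering setting, a nest would presumably encode cluster centroids and the objective would measure clustering quality.

```python
import math
import random


def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step length with Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)


def cuckoo_search(objective, dim, n_nests=15, pa=0.25, iters=200, seed=0):
    """Minimise `objective` over R^dim; returns (best_solution, best_value)."""
    rng = random.Random(seed)
    # Hypothetical search bounds [-5, 5]; adjust to the problem at hand.
    nests = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_nests)]
    fitness = [objective(n) for n in nests]
    for _ in range(iters):
        best = nests[min(range(n_nests), key=fitness.__getitem__)]
        for i in range(n_nests):
            # Levy flight from nest i, scaled by its offset from the best nest.
            cand = [x + 0.01 * levy_step(rng) * (x - b)
                    for x, b in zip(nests[i], best)]
            value = objective(cand)
            j = rng.randrange(n_nests)  # compare with a randomly chosen nest
            if value < fitness[j]:
                nests[j], fitness[j] = cand, value
        # Abandon the worst fraction `pa` of nests and rebuild them at random.
        worst = sorted(range(n_nests), key=fitness.__getitem__, reverse=True)
        for j in worst[:int(pa * n_nests)]:
            nests[j] = [rng.uniform(-5, 5) for _ in range(dim)]
            fitness[j] = objective(nests[j])
    k = min(range(n_nests), key=fitness.__getitem__)
    return nests[k], fitness[k]


def sphere(x):
    """Toy objective (sum of squares), used only to exercise the loop."""
    return sum(v * v for v in x)


best, value = cuckoo_search(sphere, dim=3)
print(best, value)
```

Because a nest is only overwritten by a better candidate and the best nest is never among those abandoned, the best value found cannot get worse from one iteration to the next.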

Biomedical Document Clustering Based on Accelerated Symbiotic Organisms Search Algorithm

International Journal of Swarm Intelligence Research ◽

10.4018/ijsir.2021100109 ◽

2021 ◽

Vol 12 (4) ◽

pp. 169-185

Author(s):

Saida Ishak Boushaki ◽

Omar Bendjeghaba ◽

Nadjet Kamel

Keyword(s):

Clustering Algorithm ◽

Search Algorithm ◽

Clustering Algorithms ◽

Document Clustering ◽

Latent Semantic Indexing ◽

Research Area ◽

Semantic Indexing ◽

Local Optima ◽

Symbiotic Organisms Search ◽

Symbiotic Organisms

Clustering is an important unsupervised analysis technique for big data mining. It finds application in several domains, including the biomedical documents of the MEDLINE database. Document clustering based on metaheuristics is an active research area. However, these algorithms tend to get trapped in local optima, require many parameters to be tuned, and need the documents to be indexed by a high-dimensional matrix using the traditional vector space model. To overcome these limitations, this paper proposes a new parameter-free document clustering algorithm (ASOS-LSI). It is based on the recent symbiotic organisms search (SOS) metaheuristic and enhanced by an acceleration technique. Furthermore, the documents are represented by semantic indexing based on the well-known latent semantic indexing (LSI). Experiments conducted on well-known biomedical document datasets show the significant superiority of ASOS-LSI over five well-known algorithms in terms of compactness, F-measure, purity, misclassified documents, entropy, and runtime.
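The dimensionality problem of the traditional vector space model mentioned above can be illustrated with a small TF-IDF sketch. This is an assumption-laden toy (whitespace tokenisation, raw term frequency, natural-log IDF, three made-up documents), not the indexing pipeline used in the paper. It shows that each document becomes a vector with one coordinate per vocabulary term, most of which are zero.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of text documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical mini-corpus, for illustration only.
docs = [
    "gene protein expression",
    "protein binding gene",
    "heart surgery patient",
]
vocab, vectors = tfidf_vectors(docs)
zeros = sum(1 for vec in vectors for x in vec if x == 0)
print(len(vocab), "dimensions;", zeros, "of", len(vocab) * len(docs), "entries are zero")
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

With a realistic corpus the vocabulary grows to tens of thousands of terms, which is the motivation for projecting documents into a lower-dimensional latent space with LSI.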

A novel bidirectional clustering algorithm based on local density

Scientific Reports ◽

10.1038/s41598-021-93244-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Baicheng Lyu ◽

Wenhua Wu ◽

Zhiqiang Hu

Keyword(s):

Clustering Algorithm ◽

Local Density ◽

Clustering Algorithms ◽

Cluster Number ◽

Denoising Method ◽

Number Of Clusters ◽

Data Points ◽

Cutoff Distance ◽

Large Clusters ◽

Small Clusters

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting indicators for judging the number of clusters. In addition, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes connections between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be adjusted to a minimum. Based on the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.
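The abstract does not define how BCALoD computes local density, so the following is only a sketch of one common definition: the number of neighbours within a cutoff distance, as used in density-peak style methods. The function name, the strict inequality, and the sample points are assumptions.

```python
import math


def local_density(points, cutoff):
    """For each point, count the other points closer than `cutoff`."""
    densities = []
    for i, p in enumerate(points):
        count = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) < cutoff
        )
        densities.append(count)
    return densities


# Three tightly grouped points and one isolated point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(local_density(points, cutoff=0.5))
```

Points with a density of zero under a given cutoff are natural candidates for noise or for very small clusters, which is why the choice of cutoff distance matters so much in density-based methods.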

An Adaptive Multiobjective Genetic Algorithm with Fuzzy c-Means for Automatic Data Clustering

Mathematical Problems in Engineering ◽

10.1155/2018/6123874 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 2

Author(s):

Ze Dong ◽

Hao Jia ◽

Miao Liu

Keyword(s):

Genetic Algorithm ◽

Fuzzy Clustering ◽

Clustering Algorithm ◽

Majority Vote ◽

Clustering Algorithms ◽

Nsga Ii ◽

Number Of Clusters ◽

Automatic Data ◽

Multiobjective Genetic Algorithm ◽

Fuzzy Clustering Method

This paper presents a fuzzy clustering method based on a multiobjective genetic algorithm. The ADNSGA2-FCM algorithm was developed to solve the clustering problem by combining the fuzzy clustering algorithm (FCM) with the multiobjective genetic algorithm (NSGA-II) and introducing an adaptive mechanism. The algorithm does not require the number of clusters to be given in advance. After the initial number of clusters and the center coordinates are set randomly, the optimal solution set is found by the multiobjective evolutionary algorithm. After the optimal number of clusters is determined by majority vote, the Jm value is continuously optimized through a combination of the Canonical Genetic Algorithm and FCM, and the best clustering result is finally obtained. Verification on standard UCI datasets and comparison with existing single-objective and multiobjective clustering algorithms demonstrate the effectiveness of the method.
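As a rough illustration of the quantities named above, the sketch below computes standard fuzzy c-means memberships, the usual Jm objective (membership-weighted sum of squared distances), and a majority vote over candidate cluster counts. It does not reproduce ADNSGA2-FCM: the genetic search is omitted, and the function names, the fuzzifier m = 2, and the sample data are assumptions.

```python
import math
from collections import Counter


def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership of each point in each cluster (rows sum to 1)."""
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0 for d in dists):
            # The point sits exactly on a center: split membership among those.
            row = [1.0 if d == 0 else 0.0 for d in dists]
        else:
            row = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(row)
        memberships.append([value / total for value in row])
    return memberships


def jm_objective(points, centers, memberships, m=2.0):
    """Jm = sum over points i and clusters j of u_ij^m * d_ij^2."""
    return sum(
        (u ** m) * math.dist(p, c) ** 2
        for p, row in zip(points, memberships)
        for u, c in zip(row, centers)
    )


def majority_vote(cluster_counts):
    """Most frequent cluster count among a set of candidate solutions."""
    return Counter(cluster_counts).most_common(1)[0][0]


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
u = fcm_memberships(points, centers)
print(u)
print(jm_objective(points, centers, u))
# Hypothetical cluster counts proposed by several candidate solutions.
print(majority_vote([3, 4, 3, 3, 5]))
```

Lower Jm means points lie closer to the centers they mostly belong to, which is why it serves as the quantity to minimise once the number of clusters is fixed.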

Fine-Tuning an Algorithm for Semantic Document Clustering Using a Similarity Graph

International Journal of Semantic Computing ◽

10.1142/s1793351x16400195 ◽

2016 ◽

Vol 10 (04) ◽

pp. 527-555

Author(s):

Lubomir Stanchev

Keyword(s):

English Language ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Document Clustering ◽

Fine Tuning ◽

Human Judgment ◽

Multiple Parameters ◽

Similarity Graph ◽

Multiple Metrics

In this article, we examine an algorithm for document clustering using a similarity graph. The graph stores words and common phrases from the English language as nodes, and it can be used to compute the degree of semantic similarity between any two phrases. One application of the similarity graph is semantic document clustering, that is, grouping documents based on the meaning of the words in them. Since our algorithm for semantic document clustering relies on multiple parameters, we examine how fine-tuning these values affects the quality of the result. Specifically, we use the Reuters-21578 benchmark, which contains [Formula: see text] newswire stories that are grouped into 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using two similarity metrics: one based on keyword matching and one that uses the similarity graph. We evaluate the results of the clustering algorithms using multiple metrics, such as precision, recall, F-score, entropy, and purity.
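Two of the evaluation metrics listed above, purity and entropy, can be sketched in a few lines. The definitions below are the common textbook ones (majority-class share per cluster, and size-weighted Shannon entropy of the class distribution per cluster, in bits); the article may use a differently normalised variant, and the function names and labels are assumptions.

```python
import math
from collections import Counter, defaultdict


def _group_by_cluster(labels_true, labels_pred):
    """Map each predicted cluster to the list of true labels it contains."""
    clusters = defaultdict(list)
    for truth, cluster in zip(labels_true, labels_pred):
        clusters[cluster].append(truth)
    return clusters


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = _group_by_cluster(labels_true, labels_pred)
    majority_total = sum(max(Counter(m).values()) for m in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted average entropy (bits) of the class mix in each cluster."""
    clusters = _group_by_cluster(labels_true, labels_pred)
    n = len(labels_true)
    total = 0.0
    for members in clusters.values():
        size = len(members)
        cluster_entropy = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * cluster_entropy
    return total


truth = ["grain", "grain", "trade", "trade"]
print(purity(truth, [0, 0, 1, 1]), entropy(truth, [0, 0, 1, 1]))
print(purity(truth, [0, 0, 0, 0]), entropy(truth, [0, 0, 0, 0]))
```

Higher purity and lower entropy both indicate clusters that align more closely with the human-assigned categories.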

Optimizing K-means text document clustering using latent semantic indexing and pillar algorithm

2017 5th International Symposium on Computational and Business Intelligence (ISCBI) ◽

10.1109/iscbi.2017.8053549 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sigit Adinugroho ◽

Yuita Arum Sari ◽

M. Ali Fauzi ◽

Putra Pandu Adikara

Keyword(s):

Document Clustering ◽

Latent Semantic Indexing ◽

Semantic Indexing ◽

Text Document

Design of an Unsupervised Machine Learning-Based Movie Recommender System

10.20944/preprints202001.0124.v1 ◽

2020 ◽

Author(s):

Debby Cintia Ganesha Putri ◽

Jenq-Shiou Leu ◽

Pavel Seda

Keyword(s):

Recommender System ◽

Clustering Algorithm ◽

System Development ◽

Clustering Algorithms ◽

Mean Shift ◽

Computational Time ◽

Agglomerative Clustering ◽

Method Performance ◽

Cluster Validity Indices ◽

Validity Indices

This research aims to identify similarities among groups of people in order to build a movie recommender system for users. Users often have difficulty finding suitable movies because of the growing amount of movie information. A recommender system is very useful for helping customers choose a preferred movie based on the existing features. In this study, the recommender system is developed using several clustering algorithms to obtain groupings: K-Means, BIRCH, mini-batch K-Means, mean-shift, affinity propagation, agglomerative clustering, and spectral clustering. We propose methods for optimizing K so that each cluster does not significantly increase variance. We limit the groupings to those based on Genre and Tags for movies. This research can discover better methods for evaluating clustering algorithms. To verify the quality of the recommender system, we adopted the mean square error (MSE), such as the Dunn Matrix and Cluster Validity Indices, and social network analysis (SNA), such as Degree Centrality, Closeness Centrality, and Betweenness Centrality. We also used Average Similarity, Computational Time, Association Rule with the Apriori algorithm, and Clustering Performance Evaluation as evaluation measures to compare the performance of recommender system methods using the Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index.
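Of the validity indices named above, the Silhouette Coefficient is the simplest to sketch from its definition: for each point, compare the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), and average (b - a) / max(a, b) over all points. The code below is a plain-Python illustration with Euclidean distance and made-up points, not the evaluation code used in the study.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two distinct labels."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, per cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own:
            # Singleton cluster: silhouette is defined as 0 by convention.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # compact, well-separated grouping
print(silhouette_score(points, [0, 1, 0, 1]))  # grouping that mixes the two blobs
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that points sit closer to another cluster than to their own.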

A Quantitative Discriminant Method of Elbow Point for the Optimal Number of Clusters in Clustering Algorithm

10.21203/rs.3.rs-58011/v3 ◽

2021 ◽

Author(s):

Congming Shi ◽

Bingtao Wei ◽

Shoulin Wei ◽

Wen Wang ◽

Hai Liu ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Machine Learning Method ◽

Cluster Number ◽

Number Of Clusters ◽

Public Dataset ◽

Optimal Cluster ◽

Better Than ◽

Optimal Number Of Clusters

Clustering, a traditional machine learning method, plays a significant role in data analysis. Most clustering algorithms depend on a predetermined exact number of clusters, whereas, in practice, the number of clusters is usually unpredictable. Although the Elbow method is one of the most commonly used methods for determining the optimal cluster number, it depends on manually identifying the elbow point on the visualization curve. Thus, even experienced analysts cannot clearly identify the elbow point when the plotted curve is fairly smooth. To solve this problem, a new elbow point discriminant method is proposed that yields a statistical metric for estimating the optimal cluster number when clustering a dataset. First, the average degree of distortion obtained by the Elbow method is normalized to the range of 0 to 10. Second, the normalized results are used to calculate the cosine of the intersection angles between elbow points. Third, these cosines and the arccosine are used to compute the intersection angles between elbow points. Finally, the index of the minimal intersection angle is used as the estimated optimal cluster number. Experimental results on simulated datasets and a well-known public dataset (the Iris dataset) demonstrate that the optimal cluster number estimated by the proposed method is better than that obtained by the widely used Silhouette method.
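The four steps described above can be sketched as follows. This is one plausible reading of the abstract rather than the authors' implementation: it assumes the angle at each candidate k is measured between the segments joining it to its two neighbours on the (k, normalized distortion) curve, and that k itself is left unscaled. The function name and the sample distortions are hypothetical.

```python
import math


def elbow_by_angle(ks, distortions):
    """Return the k whose point on the normalized elbow curve has the sharpest bend."""
    low, high = min(distortions), max(distortions)
    if high == low:
        raise ValueError("distortions are constant; no elbow to find")
    # Step 1: normalize distortions to the range 0..10.
    norm = [10.0 * (d - low) / (high - low) for d in distortions]
    best_k, best_angle = None, math.inf
    for i in range(1, len(ks) - 1):
        # Vectors from the current point to its left and right neighbours.
        ax, ay = ks[i - 1] - ks[i], norm[i - 1] - norm[i]
        bx, by = ks[i + 1] - ks[i], norm[i + 1] - norm[i]
        # Step 2: cosine of the angle between the two segments.
        cos_angle = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        # Step 3: recover the angle, clamping against rounding error.
        angle = math.acos(max(-1.0, min(1.0, cos_angle)))
        # Step 4: keep the k with the minimal angle.
        if angle < best_angle:
            best_k, best_angle = ks[i], angle
    return best_k


# Made-up distortion values for k = 1..6, for illustration only.
ks = [1, 2, 3, 4, 5, 6]
distortions = [100.0, 40.0, 10.0, 8.0, 7.0, 6.5]
print(elbow_by_angle(ks, distortions))
```

Normalizing the distortions before measuring angles keeps the result from depending on the raw scale of the distortion values, which otherwise would flatten or exaggerate every bend in the curve.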

Latent Semantic Indexing Analysis of K-Means Document Clustering for Changing Index Terms Weighting

The KIPS Transactions PartB ◽

10.3745/kipstb.2003.10b.7.735 ◽

2003 ◽

Vol 10B (7) ◽

pp. 735-742

Keyword(s):

Document Clustering ◽

Latent Semantic Indexing ◽

Semantic Indexing ◽

Index Terms

A novel bidirectional clustering algorithm based on local density

10.21203/rs.3.rs-141525/v1 ◽

2021 ◽

Author(s):

BAICHENG LV ◽

WENHUA WU ◽

ZHIQIANG HU

Keyword(s):

Clustering Algorithm ◽

Local Density ◽

Clustering Algorithms ◽

Cluster Number ◽

Denoising Method ◽

Number Of Clusters ◽

Data Points ◽

Cutoff Distance ◽

Large Clusters ◽

Small Clusters

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting indicators for judging the number of clusters. In addition, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes connections between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and reduces the number of parameters to be adjusted to a minimum. Based on the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.

Enhancing Machine Learning Aptitude Using Significant Cluster Identification for Augmented Image Refining

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s021800142051009x ◽

2019 ◽

Vol 34 (09) ◽

pp. 2051009

Author(s):

Dhanalakshmi Samiappan ◽

S. Latha ◽

T. Rama Rao ◽

Deepak Verma ◽

CSA Sriharsha

Keyword(s):

Machine Learning ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Speckle Noise ◽

Computational Time ◽

Significant Cluster ◽

Edge Preservation ◽

Cluster Identification ◽

Base Method ◽

Window Selection

Enhancing an image by removing noise while preserving its useful features and edges is among the most important tasks in image analysis. In this paper, Significant Cluster Identification for Maximum Edge Preservation (SCI-MEP) is proposed; it works in parallel with clustering algorithms and improves the efficiency of the machine learning aptitude. Affinity propagation (AP) is used as the base method to obtain clusters from a learnt dictionary with adaptive window selection; these clusters are then refined using SCI-MEP to preserve the semantic components of the image. Since only the significant clusters are processed, the computational time is drastically reduced. The flexibility of SCI-MEP allows it to be integrated with any clustering algorithm to improve its efficiency. The method is tested and verified on the removal of Gaussian noise, rain noise, and speckle noise from images. Our results show that SCI-MEP considerably improves on existing algorithms in terms of performance evaluation metrics.
