Parallel N-Path Quantification Hierarchical K-Means Clustering Algorithm for Video Retrieval

Kaiyang Liao; Fan Zhao; Yuanlin Zheng; Congjun Cao; Mingzhu Zhang

doi:10.1142/s021800141750029x

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s021800141750029x ◽

2017 ◽

Vol 31 (09) ◽

pp. 1750029 ◽

Cited By ~ 1

Author(s):

Kaiyang Liao ◽

Fan Zhao ◽

Yuanlin Zheng ◽

Congjun Cao ◽

Mingzhu Zhang

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Retrieval System ◽

Video Retrieval ◽

Large Datasets ◽

Large Scale Data ◽

Image Retrieval System ◽

Labeling Method ◽

Parallel Clustering ◽

Scale Data

Using clustering methods to detect useful patterns in large datasets has attracted considerable interest recently. The hierarchical K-means (HKM) clustering algorithm is very efficient for large-scale data analysis and has been widely used to build visual vocabularies for large-scale video and image retrieval systems. However, both the speed and the accuracy of hierarchical K-means still leave room for improvement. In this paper, we propose a parallel N-path quantification hierarchical K-means clustering algorithm that improves on hierarchical K-means in three ways. First, we replace the Euclidean kernel with the Hellinger kernel to improve accuracy. Second, we adopt the Greedy N-best Paths Labeling method to improve clustering accuracy. Third, we propose a parallel clustering algorithm for multi-core processors. Our results confirm that the proposed clustering algorithm is much faster and more effective.
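As a rough illustration of the first two improvements named in this abstract, the Python sketch below shows (a) the standard identity that taking square roots of L1-normalised, non-negative descriptors turns a Euclidean comparison into a Hellinger-kernel comparison, and (b) a simple greedy N-best paths descent of a vocabulary tree. It is not the authors' implementation: the `Node` class, the function names, and the toy one-dimensional tree are assumptions made for illustration, since the abstract gives no implementation details.

```python
import math


def hellinger_map(v):
    """L1-normalise a NON-NEGATIVE vector, then take element-wise square roots.

    Assumption: inputs are non-negative histograms (e.g. SIFT-like descriptors).
    """
    total = sum(v)
    return [math.sqrt(x / total) if total > 0 else 0.0 for x in v]


def hellinger_kernel(x, y):
    """Hellinger kernel of two non-negative vectors after L1 normalisation."""
    return sum(a * b for a, b in zip(hellinger_map(x), hellinger_map(y)))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


class Node:
    """Hypothetical vocabulary-tree node: a cluster centre plus child nodes."""

    def __init__(self, center, children=()):
        self.center = center
        self.children = list(children)


def greedy_n_best_leaf(root, x, n):
    """Descend the tree keeping the n closest candidates at each level.

    With n == 1 this is the usual single-path hierarchical quantisation.
    Returns the closest leaf among all leaves reached.
    """
    frontier = [root]
    leaves = []
    while frontier:
        candidates = []
        for node in frontier:
            if node.children:
                candidates.extend(node.children)
            else:
                leaves.append(node)
        candidates.sort(key=lambda c: sq_dist(x, c.center))
        frontier = candidates[:n]
    return min(leaves, key=lambda leaf: sq_dist(x, leaf.center))


# Toy 1-D tree (illustrative values): two branches with two leaves each.
branch_a = Node([0.0], [Node([-3.0]), Node([3.0])])
branch_b = Node([10.0], [Node([5.5]), Node([14.5])])
toy_root = Node([5.0], [branch_a, branch_b])

query = [4.6]
single_path = greedy_n_best_leaf(toy_root, query, 1).center  # follows branch_a only
two_paths = greedy_n_best_leaf(toy_root, query, 2).center    # also explores branch_b
```

In this toy tree the query is slightly closer to the centre of `branch_a`, so a single-path descent ends at leaf 3.0, while keeping two paths reaches the nearer leaf 5.5. That is the kind of quantisation error multi-path labeling is meant to reduce.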


A stratified sampling based clustering algorithm for large-scale data

Knowledge-Based Systems ◽

10.1016/j.knosys.2018.09.007 ◽

2019 ◽

Vol 163 ◽

pp. 416-428 ◽

Cited By ~ 11

Author(s):

Xingwang Zhao ◽

Jiye Liang ◽

Chuangyin Dang

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Stratified Sampling ◽

Large Scale Data ◽

Scale Data


The Research on Large Scale Data Set Clustering Algorithm Based on Tag Set

Communications in Computer and Information Science - Computational Intelligence and Intelligent Systems ◽

10.1007/978-981-10-0356-1_38 ◽

2016 ◽

pp. 365-372

Author(s):

Qiang Chen

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Data Set ◽

Large Scale Data ◽

Scale Data


An Efficient K-Medoids Clustering Algorithm for Large Scale Data

Machine Learning-based Natural Scene Recognition for Mobile Robot Localization in An Unknown Environment ◽

10.1007/978-981-13-9217-7_5 ◽

2019 ◽

pp. 85-108

Author(s):

Xiaochun Wang ◽

Xiali Wang ◽

Don Mitchell Wilkes

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Large Scale Data ◽

Scale Data


Using Uncertain DM-Chameleon Clustering Algorithm Based on Machine Learning to Predict Landslide Hazards

Journal of Robotics and Mechatronics ◽

10.20965/jrm.2019.p0329 ◽

2019 ◽

Vol 31 (2) ◽

pp. 329-338 ◽

Cited By ~ 1

Author(s):

Jian Hu ◽

Haiwan Zhu ◽

Yimin Mao ◽

Canlong Zhang ◽

Tian Liang ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Clustering Algorithm ◽

Uncertain Data ◽

Landslide Hazard ◽

Data Sets ◽

Large Scale Data ◽

Landslide Hazards ◽

Hazard Levels ◽

Scale Data

Landslide hazard prediction is a difficult, time-consuming process when traditional methods are used. This paper presents a method that uses machine learning to predict landslide hazard levels automatically. Because rainfall data are difficult to obtain and process effectively in landslide hazard prediction, and because the M-chameleon algorithm is limited in dealing with large-scale data sets, a new method based on an uncertain DM-chameleon algorithm (developed M-chameleon) is proposed to assess the landslide susceptibility model. First, the method designs a new two-phase clustering algorithm based on M-chameleon that processes large-scale data sets effectively. Second, a new E-H distance formula is designed by combining the Euclidean and Hausdorff distances, which enables the method to manage uncertain data effectively. An uncertain data model is presented at the same time to quantify triggering factors effectively. Finally, the model for predicting landslide hazards is constructed and verified using data from the Baota district of the city of Yan’an, China. The experimental results show that the uncertain DM-chameleon algorithm can effectively improve the accuracy of landslide prediction and is highly feasible. Furthermore, the relationships between hazard factors and landslide hazard levels can be extracted from the clustering results.
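The abstract does not state the E-H distance formula, so the Python sketch below only illustrates its two named ingredients: the Hausdorff distance between two point sets (a common way to compare uncertain objects represented by several samples) and the Euclidean distance. The `blended_distance` function, its use of centroids, and the weight `alpha` are purely hypothetical stand-ins for "combining" the two, not the authors' formula.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def blended_distance(a, b, alpha=0.5):
    """HYPOTHETICAL combination: weighted sum of the Euclidean distance
    between centroids and the Hausdorff distance between the sets.
    The weighting scheme is an assumption, not taken from the paper.
    """
    euclid = math.dist(centroid(a), centroid(b))
    return alpha * euclid + (1.0 - alpha) * hausdorff(a, b)


# Two uncertain objects, each represented by two sample points (toy values).
obj_a = [[0.0, 0.0], [1.0, 0.0]]
obj_b = [[0.0, 0.0], [4.0, 0.0]]
h = hausdorff(obj_a, obj_b)          # driven by the point (4, 0)
d = blended_distance(obj_a, obj_b)   # mixes centroid gap and Hausdorff gap
```

The Hausdorff term reacts to the single far-away sample at (4, 0), while the centroid term only sees the average shift, which is one plausible reason to combine the two when samples of an uncertain object are spread out.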


Large-scale data clustering algorithm based on quantum immune regulation network

2017 IEEE Symposium Series on Computational Intelligence (SSCI) ◽

10.1109/ssci.2017.8285302 ◽

2017 ◽

Author(s):

Yangyang Li ◽

Xiaoyu Bai ◽

Xiaoju Hou ◽

Licheng Jiao

Keyword(s):

Immune Regulation ◽

Data Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Large Scale Data ◽

Regulation Network ◽

Scale Data


A EM Probabilistic Clustering Algorithm for Large Scale Data Sets based on Partial Constraints Information

INTERNATIONAL JOURNAL ON Advances in Information Sciences and Service Sciences ◽

10.4156/aiss.vol3.issue10.3 ◽

2011 ◽

Vol 3 (10) ◽

pp. 20-29

Author(s):

Shen Yan ◽

Song Shunlin ◽

Zhu Yuquan

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Data Sets ◽

Probabilistic Clustering ◽

Large Scale Data ◽

Scale Data ◽

Large Scale Data Sets


Part Priority Clustering Algorithm for Large-Scale Data Set

2013 5th International Conference on Intelligent Human-Machine Systems and Cybernetics ◽

10.1109/ihmsc.2013.100 ◽

2013 ◽

Author(s):

Zhihao Yin ◽

Bencheng Yu ◽

Zhifeng Wang ◽

Wang Ran

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Data Set ◽

Large Scale Data ◽

Scale Data


Privacy-preserving constrained spectral clustering algorithm for large-scale data sets

IET Information Security ◽

10.1049/iet-ifs.2019.0255 ◽

2020 ◽

Vol 14 (3) ◽

pp. 321-331 ◽

Cited By ~ 1

Author(s):

Ji Li ◽

Jianghong Wei ◽

Mao Ye ◽

Wenfen Liu ◽

Xuexian Hu

Keyword(s):

Spectral Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Privacy Preserving ◽

Data Sets ◽

Large Scale Data ◽

Spectral Clustering Algorithm ◽

Scale Data ◽

Large Scale Data Sets


Stochastic Gradient Descent Based K-Means Algorithm on Large Scale Data Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.687-691.1342 ◽

2014 ◽

Vol 687-691 ◽

pp. 1342-1345 ◽

Cited By ~ 1

Author(s):

Jie Ding ◽

Li Peng Zhu ◽

Bin Hu ◽

Ren Long Hang ◽

Yu Bao Sun

Keyword(s):

Gradient Descent ◽

Large Scale ◽

Clustering Algorithm ◽

Distance Matrix ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Data Sets ◽

Human Beings ◽

Large Scale Data ◽

Scale Data

With the rapid advance of data collection and storage techniques, it is easy to acquire tens of millions or even billions of data sets. How to explore and exploit the information in these data sets that is useful or interesting to human beings has become an urgent issue. The traditional k-means clustering algorithm has been widely used in the data mining community. It first randomly initializes k cluster centres. Then all instances are assigned to k classes according to their distances to the cluster centres. Lastly, each cluster centre is updated to the mean of its constituent instances. This whole process is iterated until convergence. At each iteration, the distance matrix from all instances to the k cluster centres must be calculated, which costs a great deal of time on large-scale data sets. To address this issue, we propose a fast optimization algorithm based on stochastic gradient descent (SGD). At each iteration, the algorithm randomly chooses an instance, finds its corresponding cluster centre, and updates that centre immediately. Experimental results show that the proposed method achieves competitive clustering results at a lower time cost.
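The per-instance update described in this abstract can be sketched in Python as follows. This is a minimal online k-means under stated assumptions, not the authors' code: the abstract does not give a learning-rate schedule, so the sketch assumes the common per-centre rate 1/count, and the function name `sgd_kmeans`, its parameters, and the seeding from random data points are illustrative choices.

```python
import random


def sgd_kmeans(points, k, n_steps, seed=0):
    """Online (SGD-style) k-means: one randomly chosen instance per step.

    Assumptions (not from the abstract): centres start at k distinct
    positions in `points`, and the step size for a centre is 1 / (number
    of updates that centre has received so far).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_steps):
        x = rng.choice(points)
        # Find the nearest centre to x. Only k distances are computed here,
        # instead of the full instances-by-centres distance matrix.
        j = min(
            range(k),
            key=lambda i: sum((a - c) ** 2 for a, c in zip(x, centers[i])),
        )
        # Move that centre towards x immediately.
        counts[j] += 1
        lr = 1.0 / counts[j]
        centers[j] = [c + lr * (a - c) for a, c in zip(x, centers[j])]
    return centers


toy_points = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [9.0, 9.0], [10.0, 8.5]]
toy_centers = sgd_kmeans(toy_points, k=2, n_steps=200, seed=1)
```

Each step costs k distance evaluations regardless of the data set size, which is where the time saving over batch k-means comes from; the result depends on the random seed and on the assumed learning-rate schedule.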


Parallel Implementation of Improved K-Means Based on a Cloud Platform

Information Technology And Control ◽

10.5755/j01.itc.48.4.23881 ◽

2019 ◽

Vol 48 (4) ◽

pp. 673-681

Author(s):

Shufen Zhang ◽

Zhiyu Liu ◽

Xuebin Chen ◽

Changyin Luo

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Programming Model ◽

Parallel Implementation ◽

Clustering Algorithms ◽

Data Set ◽

Large Scale Data ◽

Sample Density ◽

Scale Data ◽

Selection Of

To address the difficulties that the traditional K-Means clustering algorithm has with large-scale data sets, a Hadoop K-Means (referred to as HKM) clustering algorithm is proposed. First, the algorithm uses sample density to eliminate the effects of noise points in the data set. Second, it optimizes the selection of the initial centre points using the max-min distance idea. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but also solves the scalability problems that traditional clustering algorithms encounter with large-scale data.
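The max-min distance idea for choosing initial centres can be sketched in Python as below. This is a generic farthest-first selection under stated assumptions, not the paper's procedure: the abstract does not say how the first centre is chosen or whether a stopping threshold is used, so the sketch assumes the first point seeds the selection and that exactly k centres are wanted. The density-based noise filtering and the MapReduce parallelization are not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def max_min_init(points, k):
    """Pick k initial centres by the max-min (farthest-first) rule.

    Assumption: the first point seeds the selection. Each further centre is
    the point whose distance to its nearest already-chosen centre is largest.
    """
    centers = [points[0]]
    # min_d[i] = squared distance from points[i] to its nearest chosen centre
    min_d = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=lambda i: min_d[i])
        centers.append(points[idx])
        min_d = [min(d, sq_dist(p, points[idx])) for d, p in zip(min_d, points)]
    return centers


toy_points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.0, 1.0], [5.0, 8.0]]
initial_centers = max_min_init(toy_points, 3)
```

On the toy data the rule first jumps to the far corner at (10, 1) and then to the isolated point at (5, 8), so the initial centres are spread across the data rather than clustered together. Because the rule favours extreme points, it is sensitive to outliers, which is consistent with the abstract removing noise points before this step.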
