A Novel Scalable Signature Based Subspace Clustering Approach for Big Data

“Big data” as the name suggests is a collection of large and complicated data sets which are usually hard to process with on-hand data management tools or other conventional processing applications. A scalable signature based subspace clustering approach is presented in this article that would avoid identification of redundant clusters. Various distance measures are utilized to perform experiments that validate the performance of the proposed algorithm. Also, for the same purpose of validation, the synthetic data sets that are chosen have different dimensions, and their size will be distributed when opened with Weka. The F1 quality measure and the runtime of these synthetic data sets are computed. The performance of the proposed algorithm is compared with other existing clustering algorithms such as CLIQUE.INSCY and SUNCLU.

Download Full-text

Big Data Management Tools for Hadoop

NoSQL: Database for Storage and Retrieval of Data in Cloud ◽

10.1201/9781315155579-11 ◽

2017 ◽

pp. 199-214

Author(s):

V. P. Lijo ◽

Lydia J. Gnanasigamani ◽

Hari Seetha ◽

B. K. Tripathy

Keyword(s):

Big Data ◽

Data Management ◽

Management Tools

Download Full-text

Uncertainty-Based Clustering Algorithms for Large Data Sets

Modern Technologies for Big Data Classification and Clustering - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-2805-0.ch001 ◽

2018 ◽

pp. 1-33 ◽

Cited By ~ 1

Author(s):

B. K. Tripathy ◽

Hari Seetha ◽

M. N. Murty

Keyword(s):

Big Data ◽

Data Clustering ◽

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Mining Machine ◽

Data Sets ◽

Fuzzy C Means ◽

Intuitionistic Fuzzy ◽

New Algorithms

Data clustering plays a very important role in Data mining, machine learning and Image processing areas. As modern day databases have inherent uncertainties, many uncertainty-based data clustering algorithms have been developed in this direction. These algorithms are fuzzy c-means, rough c-means, intuitionistic fuzzy c-means and the means like rough fuzzy c-means, rough intuitionistic fuzzy c-means which base on hybrid models. Also, we find many variants of these algorithms which improve them in different directions like their Kernelised versions, possibilistic versions, and possibilistic Kernelised versions. However, all the above algorithms are not effective on big data for various reasons. So, researchers have been trying for the past few years to improve these algorithms in order they can be applied to cluster big data. The algorithms are relatively few in comparison to those for datasets of reasonable size. It is our aim in this chapter to present the uncertainty based clustering algorithms developed so far and proposes a few new algorithms which can be developed further.

Download Full-text

A Novel Hierarchical Clustering Approach Based on Universal Gravitation

Mathematical Problems in Engineering ◽

10.1155/2020/6748056 ◽

2020 ◽

Vol 2020 ◽

pp. 1-15

Author(s):

Peng Zhang ◽

Kun She

Keyword(s):

Hierarchical Clustering ◽

Clustering Analysis ◽

Gravitational Force ◽

Clustering Algorithms ◽

Influence Coefficient ◽

Data Sets ◽

Universal Gravitation ◽

Real World Data ◽

Gravitational Influence ◽

Clustering Approach

The target of the clustering analysis is to group a set of data points into several clusters based on the similarity or distance. The similarity or distance is usually a scalar used in numerous traditional clustering algorithms. Nevertheless, a vector, such as data gravitational force, contains more information than a scalar and can be applied in clustering analysis to promote clustering performance. Therefore, this paper proposes a three-stage hierarchical clustering approach called GHC, which takes advantage of the vector characteristic of data gravitational force inspired by the law of universal gravitation. In the first stage, a sparse gravitational graph is constructed based on the top k data gravitations between each data point and its neighbors in the local region. Then the sparse graph is partitioned into many subgraphs by the gravitational influence coefficient. In the last stage, the satisfactory clustering result is obtained by merging these subgraphs iteratively by using a new linkage criterion. To demonstrate the performance of GHC algorithm, the experiments on synthetic and real-world data sets are conducted, and the results show that the GHC algorithm achieves better performance than the other existing clustering algorithms.

Download Full-text

The fast clustering algorithm for the big data based on K-means

International Journal of Wavelets Multiresolution and Information Processing ◽

10.1142/s0219691320500538 ◽

2020 ◽

Vol 18 (06) ◽

pp. 2050053

Author(s):

Ting Xie ◽

Taiping Zhang

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Feature Space ◽

Data Sets ◽

Benchmark Data ◽

Clustering Model ◽

Alternating Direction ◽

Learning Technique ◽

Noise Data

As a powerful unsupervised learning technique, clustering is the fundamental task of big data analysis. However, many traditional clustering algorithms for big data that is a collection of high dimension, sparse and noise data do not perform well both in terms of computational efficiency and clustering accuracy. To alleviate these problems, this paper presents Feature K-means clustering model on the feature space of big data and introduces its fast algorithm based on Alternating Direction Multiplier Method (ADMM). We show the equivalence of the Feature K-means model in the original space and the feature space and prove the convergence of its iterative algorithm. Computationally, we compare the Feature K-means with Spherical K-means and Kernel K-means on several benchmark data sets, including artificial data and four face databases. Experiments show that the proposed approach is comparable to the state-of-the-art algorithm in big data clustering.

Download Full-text

Dynamic and Optimized Prototype Clustering for Relational Data based on Multiple Prototypes

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.l2795.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 5630-5634

Keyword(s):

Artificial Intelligence ◽

Data Clustering ◽

Synthetic Data ◽

Medical Data ◽

Relational Data ◽

Data Sets ◽

Optimization Approach ◽

Complex Task ◽

Data Points ◽

Clustering Approach

In artificial intelligence related applications such as bio-medical, bio-informatics, data clustering is an important and complex task with different situations. Prototype based clustering is the reasonable and simplicity to describe and evaluate data which can be treated as non-vertical representation of relational data. Because of Barycentric space present in prototype clustering, maintain and update the structure of the cluster with different data points is still challenging task for different data points in bio-medical relational data. So that in this paper we propose and introduce A Novel Optimized Evidential C-Medoids (NOEC) which is relates to family o prototype based clustering approach for update and proximity of medical relational data. We use Ant Colony Optimization approach to enable the services of similarity with different features for relational update cluster medical data. Perform our approach on different bio-medical related synthetic data sets. Experimental results of proposed approach give better and efficient results with comparison of different parameters in terms of accuracy and time with processing of medical relational data sets.

Download Full-text

Big data management: Security and privacy concerns

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.05.009 ◽

2021 ◽

Vol 8 (5) ◽

pp. 73-83

Author(s):

Ibrahim A. Atoum ◽

◽

Ismail M. Keshta ◽

Keyword(s):

Big Data ◽

Data Management ◽

Large Data ◽

Security And Privacy ◽

Data Sets ◽

Privacy Concerns ◽

Software Companies ◽

Systemic Analysis ◽

Functional Areas ◽

Privacy Issues

Big data has been used by different companies to deliver simple products and provide enhanced customer insights through predictive technology such as artificial intelligence. Big data is a field that mainly deals with the extraction and systemic analysis of large data sets to help businesses discover trends. Today, many companies use Big Data to facilitate growth in different functional areas as well as expand their ability to handle large customer databases. Big data has grown the demand for information management experts such that many software companies are increasingly investing in firms that specialize in data management and analytics. Nevertheless, the issue of data protection or privacy is a threat to big data management. This article presents some of the major concerns surrounding the application and use of Big Data about challenges of security and privacy of data stored on technological devices. The paper also discusses some of the current studies being undertaken aimed at addressing security and privacy issues in Big Data.

Download Full-text

EBSeq: improving mixing computations for multi-group differential expression analysis

10.1101/2020.06.19.162180 ◽

2020 ◽

Author(s):

Xiuyu Ma ◽

Christina Kendziorski ◽

Michael A. Newton

Keyword(s):

Differential Expression ◽

Clustering Algorithms ◽

Differential Expression Analysis ◽

Synthetic Data ◽

Data Sets ◽

Empirical Bayesian ◽

False Discovery Rates ◽

Data Analyst ◽

False Discovery ◽

Multiple Sample

ABSTRACTEBSeq is a Bioconductor package designed to calculate empirical-Bayesian inference summaries from sequence-based gene-expression (RNA-Seq) data. It produces gene or isoform-specific scores that measure various patterns of differential expression among a set of sample groups, and is most commonly deployed to measure differential expression between two groups. Its use of local posterior probabilities from a fitted mixture model provides the data analyst a direct way to score the false discovery rate of any reported list of genes, and it is one of the only tools that can address local false discovery rates when analyzing multiple sample groups. Contemporary applications have increasing numbers of sample groups, and the algorithms deployed in EBSeq are neither space nor time efficient in this important case. We describe a version update utilizing code improvements and novel pruning and clustering algorithms in order to reduce the complexity of mixture computations. The algorithms are supported by a theoretical analysis and tested empirically on a variety of benchmark and synthetic data sets.

Download Full-text

A Novel Approach for Clustering Big Data based on MapReduce

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v8i3.pp1711-1719 ◽

2018 ◽

Vol 8 (3) ◽

pp. 1711 ◽

Cited By ~ 1

Author(s):

Gourav Bathla ◽

Himanshu Aggarwal ◽

Rinkle Rani

Keyword(s):

Big Data ◽

Categorical Data ◽

Large Scale ◽

Clustering Algorithms ◽

Numerical Data ◽

Large Data ◽

Data Sets ◽

Single Node ◽

Novel Approach ◽

Network Analytics

Clustering is one of the most important applications of data mining. It has attracted attention of researchers in statistics and machine learning. It is used in many applications like information retrieval, image processing and social network analytics etc. It helps the user to understand the similarity and dissimilarity between objects. Cluster analysis makes the users understand complex and large data sets more clearly. There are different types of clustering algorithms analyzed by various researchers. Kmeans is the most popular partitioning based algorithm as it provides good results because of accurate calculation on numerical data. But Kmeans give good results for numerical data only. Big data is combination of numerical and categorical data. Kprototype algorithm is used to deal with numerical as well as categorical data. Kprototype combines the distance calculated from numeric and categorical data. With the growth of data due to social networking websites, business transactions, scientific calculation etc., there is vast collection of structured, semi-structured and unstructured data. So, there is need of optimization of Kprototype so that these varieties of data can be analyzed efficiently.In this work, Kprototype algorithm is implemented on MapReduce in this paper. Experiments have proved that Kprototype implemented on Mapreduce gives better performance gain on multiple nodes as compared to single node. CPU execution time and speedup are used as evaluation metrics for comparison.Intellegent splitter is proposed in this paper which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms proves that proposed algorithm works better for large scale of data.

Download Full-text

Importance of Big Data

Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8505-5.ch001 ◽

2015 ◽

pp. 1-19 ◽

Cited By ~ 1

Author(s):

Seema Ansari ◽

Radha Mohanlal ◽

Javier Poncela ◽

Adeel Ansari ◽

Komal Mohanlal

Keyword(s):

Big Data ◽

Heterogeneous Data ◽

Data Sets ◽

Data Types ◽

It Industry ◽

Management Tools ◽

Location Data ◽

Processing Power ◽

Multiple Data ◽

Big Data Applications

Combining vast amounts of heterogeneous data and increasing the processing power of existing database management tools is no doubt the emerging need of IT industry in coming years. The complexity and size of data sets that need to be acquired, analyzed, stored, sorted or transferred has spiked in the recent years. Due to the tremendously increasing volume of multiple data types, creating Big Data applications that can extract the valuable trends and relationships required for further processes or deriving useful results is quite challenging task. Companies, corporate organizations or be it government agencies, all need to analyze and execute Big Data implementation to pave new paths of productivity and innovation. This chapter discusses the emerging technology of modern era: Big Data with detailed description of the three V's (Variety, Velocity and Volume). Further chapters will enable to understand the concepts of data mining and big data analysis, Potentials of Big Data in five domains i.e. Healthcare, Public sector, Retail, Manufacturing and Personal location Data.

Download Full-text

Clustering Big Data streams: recent challenges and contributions

it - Information Technology ◽

10.1515/itit-2016-0007 ◽

2016 ◽

Vol 58 (4) ◽

Cited By ~ 1

Author(s):

Marwan Hassani ◽

Thomas Seidl

Keyword(s):

Big Data ◽

Data Streams ◽

Clustering Algorithms ◽

Subspace Clustering ◽

High Dimensional ◽

Evaluation Measures ◽

Stream Clustering ◽

Density Based Clustering ◽

Static Data ◽

Big Data Streams

AbstractTraditional clustering algorithms merely considered static data. Today'sSince the growth of dataIn this article, novel methods for an efficient subspace clustering of high-dimensional big data streams are presented. Approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm are discussed. Additionally, efficient and adaptive density-based clustering algorithms are presented for high-dimensional data streams. Novel open-source assessment framework and evaluation measures are additionally presented for subspace stream clustering.

Download Full-text