Efficiently Hiding Sensitive Itemsets with Transaction Deletion Based on Genetic Algorithms

Data mining is used to extract meaningful and useful information or knowledge from very large databases. Because data mining techniques can also uncover secure or private information, they carry an inherent risk to privacy. Privacy-preserving data mining (PPDM) has thus arisen in recent years to sanitize the original database and hide sensitive information; the sanitization process can be regarded as an NP-hard problem. In this paper, a compact prelarge GA-based (cpGA2DT) algorithm that deletes transactions to hide sensitive itemsets is proposed. It addresses the limitations of the evolutionary process by adopting both the compact GA-based (cGA) mechanism and the prelarge concept. A flexible fitness function with three adjustable weights is designed to find the appropriate transactions to delete so that sensitive itemsets are hidden with minimal side effects in terms of hiding failure, missing cost, and artificial cost. Experiments compare the performance of the proposed cpGA2DT algorithm with the simple GA-based (sGA2DT) algorithm and the greedy approach in terms of execution time and the three side effects.
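
The three side effects named in this abstract can be illustrated with a small sketch. The abstract does not give the fitness formula, so the weighted sum below, the function names, and the toy database are all assumptions made for illustration; the brute-force itemset enumeration is only practical for tiny examples.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of itemsets with relative support >= min_support."""
    if not transactions:
        return set()
    items = sorted({item for t in transactions for item in t})
    result = set()
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / len(transactions) >= min_support:
                result.add(frozenset(candidate))
    return result

def fitness(transactions, deleted, sensitive, min_support, w1=1.0, w2=1.0, w3=1.0):
    """Hypothetical weighted sum of the three side effects (lower is better)."""
    before = frequent_itemsets(transactions, min_support)
    kept = [t for idx, t in enumerate(transactions) if idx not in deleted]
    after = frequent_itemsets(kept, min_support)
    hiding_failure = len(sensitive & after)            # sensitive itemsets still frequent
    missing_cost = len((before - sensitive) - after)   # non-sensitive itemsets lost
    artificial_cost = len(after - before)              # itemsets that became frequent
    return w1 * hiding_failure + w2 * missing_cost + w3 * artificial_cost

# Toy database (assumed data): five transactions over items a, b, c.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
sensitive = {frozenset({"a", "b"})}

no_deletion = fitness(db, set(), sensitive, 0.6)   # sensitive itemset stays frequent
delete_one = fitness(db, {1}, sensitive, 0.6)      # deleting {a, b} hides it cleanly
delete_two = fitness(db, {0, 4}, sensitive, 0.6)   # hides it but loses {a, c} and {b, c}
```

In this toy case, deleting transaction 1 scores better than deleting transactions 0 and 4, because the latter also removes two non-sensitive frequent itemsets. A genetic algorithm would search over candidate values of `deleted` using a score of this kind.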

Legal and Technical Issues of Privacy Preservation in Data Mining

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch180 ◽

2011 ◽

pp. 1158-1163 ◽

Cited By ~ 6

Author(s):

Kirsten Wahlstrom ◽

John F. Roddick ◽

Rick Sarre ◽

Vladimir Estivill-Castro ◽

Denise de Vries

Keyword(s):

Data Mining ◽

Social Responsibility ◽

Private Information ◽

New Technologies ◽

Ethical Issues ◽

Sensitive Information ◽

Personal Privacy ◽

Customer Data ◽

Moral Principles ◽

Very Large Databases

To paraphrase Winograd (1992), we bring to our communities a tacit comprehension of right and wrong that makes social responsibility an intrinsic part of our culture. Our ethics are the moral principles we use to assert social responsibility and to perpetuate safe and just societies. Moreover, the introduction of new technologies can have a profound effect on our ethical principles. The emergence of very large databases, and the associated automated data analysis tools, presents yet another set of ethical challenges to consider. Socio-ethical issues have been identified as pertinent to data mining and there is a growing concern regarding the (ab)use of sensitive information (Clarke, 1999; Clifton et al., 2002; Clifton and Estivill-Castro, 2002; Gehrke, 2002). Estivill-Castro et al. (1999) discuss surveys of public opinion on personal privacy that show a raised level of concern about the use of private information. There is some justification for this concern: a 2001 survey in InfoWeek found that over 20% of companies store customer data with information about medical profile and/or customer demographics with salary and credit information, and over 15% store information about customers’ legal histories.

Reducing Side Effects of Hiding Sensitive Itemsets in Privacy Preserving Data Mining

The Scientific World JOURNAL ◽

10.1155/2014/235837 ◽

2014 ◽

Vol 2014 ◽

pp. 1-12 ◽

Cited By ~ 10

Author(s):

Chun-Wei Lin ◽

Tzung-Pei Hong ◽

Hung-Chuan Hsu

Keyword(s):

Data Mining ◽

Side Effects ◽

Execution Time ◽

Privacy Preserving ◽

Sensitive Information ◽

Privacy Preserving Data Mining ◽

Confidential Data

Data mining is traditionally used to retrieve and analyze knowledge from large amounts of data. Private or confidential data may need to be sanitized or suppressed before it is shared or published. Privacy-preserving data mining (PPDM) has thus become an important issue in recent years. The most common approach to PPDM is to sanitize the database to hide sensitive information. In this paper, a novel hiding-missing-artificial utility (HMAU) algorithm is proposed to hide sensitive itemsets through transaction deletion. The transaction with the maximal ratio of sensitive to nonsensitive itemsets is selected and deleted entirely. Three side effects (hiding failures, missing itemsets, and artificial itemsets) are considered when evaluating whether a transaction should be deleted to hide sensitive itemsets. Three weights are also assigned to indicate the importance of the three factors, and they can be set according to user requirements. Experiments are then conducted to show the performance of the proposed algorithm in terms of execution time, number of deleted transactions, and number of side effects.
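
The selection rule in this abstract (pick the transaction with the highest ratio of sensitive to nonsensitive itemsets) can be sketched as follows. This is a simplified illustration, not the HMAU utility from the paper: the `+ 1` in the denominator, the function name `pick_transaction`, and the toy data are assumptions.

```python
def pick_transaction(transactions, sensitive, nonsensitive):
    """Return the index of the transaction with the highest
    sensitive-to-nonsensitive itemset ratio, or None if no transaction
    contains a sensitive itemset."""
    best_idx, best_ratio = None, 0.0
    for idx, t in enumerate(transactions):
        s = sum(1 for itemset in sensitive if itemset <= t)
        if s == 0:
            continue  # deleting this transaction cannot help hide anything
        n = sum(1 for itemset in nonsensitive if itemset <= t)
        ratio = s / (n + 1)  # +1 avoids division by zero (assumption)
        if ratio > best_ratio:
            best_idx, best_ratio = idx, ratio
    return best_idx

# Toy data (assumed): {a, b} is sensitive, the rest are non-sensitive frequent itemsets.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
sensitive = [frozenset({"a", "b"})]
nonsensitive = [frozenset(x) for x in ({"a"}, {"b"}, {"c"}, {"a", "c"}, {"b", "c"})]

victim = pick_transaction(db, sensitive, nonsensitive)
```

Here transaction 1 is chosen over transaction 0: both contain the sensitive itemset, but transaction 1 supports fewer non-sensitive itemsets, so deleting it causes less collateral damage.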

A Multi-Threshold Ant Colony System-based Sanitization Model in Shared Medical Environments

ACM Transactions on Internet Technology ◽

10.1145/3408296 ◽

2021 ◽

Vol 21 (2) ◽

pp. 1-26

Author(s):

Jimmy Ming-Tai Wu ◽

Gautam Srivastava ◽

Jerry Chun-Wei Lin ◽

Qian Teng

Keyword(s):

Data Mining ◽

Private Information ◽

Ant Colony ◽

Security And Privacy ◽

Legal Issue ◽

Ant Colony System ◽

Sensitive Information ◽

Data Mining Algorithm ◽

Useful Knowledge ◽

Global Pandemic

In recent years, the choice between revealing useful knowledge and protecting individuals' private information in an identifiable health dataset (i.e., within an Electronic Health Record) has become a tradeoff. Especially in this era of a global pandemic, security and privacy are often overlooked in favor of usability. Privacy-preserving data mining (PPDM) is expected to play an important role in resolving this problem. Nevertheless, mining information in an identifiable health dataset is considerably more complex than traditional PPDM problems. Leaking individuals' private information from an identifiable health dataset has become a serious legal issue. In this article, the proposed Ant Colony System to Data Mining algorithm applies a multi-threshold constraint to secure and sanitize patients' records of different lengths, which is applicable in a real medical situation. The experimental results show that the proposed algorithm not only hides all sensitive information but also retains useful knowledge for mining in the sanitized database.

A Grid-Based Swarm Intelligence Algorithm for Privacy-Preserving Data Mining

Applied Sciences ◽

10.3390/app9040774 ◽

2019 ◽

Vol 9 (4) ◽

pp. 774 ◽

Cited By ~ 6

Author(s):

Tsu-Yang Wu ◽

Jerry Lin ◽

Yuyu Zhang ◽

Chun-Hao Chen

Keyword(s):

Data Mining ◽

Side Effects ◽

Evolutionary Process ◽

Privacy Preserving ◽

Optimal Solutions ◽

Confidential Information ◽

Nsga Ii ◽

Privacy Preserving Data Mining ◽

Single Objective ◽

Grid Based

Privacy-preserving data mining (PPDM) has become an interesting and emerging topic in recent years because it helps hide confidential information while still allowing useful knowledge to be discovered. Data sanitization is a common way to perturb a database so that sensitive or confidential information is hidden. PPDM is not a trivial task and can be regarded as a non-deterministic polynomial-time (NP)-hard problem. Many algorithms have been studied to derive optimal solutions using the evolutionary process, although most are based on straightforward or single-objective methods for discovering the candidate transactions/items for sanitization. In this paper, we present a multi-objective algorithm using a grid-based method (called GMPSO) to find optimal solutions as candidates for sanitization. The designed GMPSO uses two strategies for updating gbest and pbest during the evolutionary process. Moreover, the pre-large concept is adapted to speed up the evolutionary process by reducing the multiple database scans otherwise needed in each evolutionary iteration. With the designed GMPSO, multiple Pareto solutions, rather than the single solution produced by single-objective algorithms, can be derived based on Pareto dominance. In addition, the side effects of the sanitization process can be significantly reduced. Experiments have shown that the designed GMPSO produces fewer side effects than the previous single-objective algorithm and the NSGA-II-based approach, and that the pre-large concept also reduces the computational cost compared to the NSGA-II-based algorithm.
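
Pareto dominance, which this abstract relies on, can be shown with a minimal sketch. It is not the GMPSO algorithm itself: the function names and the candidate score tuples are assumptions, and each tuple is read as (hiding failure, missing cost, artificial cost) with all three objectives minimized.

```python
def dominates(a, b):
    """True if solution a is no worse than b on every objective
    and strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the solutions that no other solution dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Assumed candidate sanitizations, scored as
# (hiding failure, missing cost, artificial cost).
candidates = [(0, 3, 1), (1, 1, 0), (0, 4, 2), (2, 0, 0), (1, 2, 0)]
front = pareto_front(candidates)
```

In this example `(0, 4, 2)` is dominated by `(0, 3, 1)` and `(1, 2, 0)` is dominated by `(1, 1, 0)`, so three non-dominated trade-offs remain. A multi-objective method returns such a set and leaves the final choice to the user, whereas a single-objective method collapses the scores into one number first.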

Deep active reinforcement learning for privacy preserve data mining in 5G environments

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219262 ◽

2021 ◽

pp. 1-8

Author(s):

Usman Ahmed ◽

Jerry Chun-Wei Lin ◽

Gautam Srivastava ◽

Hsing-Chung Chen

Keyword(s):

Data Mining ◽

Active Learning ◽

Private Information ◽

Pattern Mining ◽

Research Area ◽

High Dimensional ◽

Data Sets ◽

Sensitive Information ◽

Using Data ◽

Transactional Data

Finding frequent patterns identifies the most important patterns in data sets. Because transactional data is huge and high-dimensional, classical pattern mining techniques suffer from limitations related to dimensionality and data annotation. In recent decades, data mining while preserving privacy has come to be considered an important research area. Information privacy is a tradeoff that must be considered when using data. For many years, privacy-preserving data mining (PPDM) relied on methods that are mostly based on heuristics, and the deletion operation was used to hide sensitive information. In this study, we used deep active learning to hide sensitive operations and protect private information. This paper combines entropy-based active learning with an attention-based approach to effectively detect sensitive patterns. The constructed models are then validated on high-dimensional transactional data with attention-based and active learning methods in a reinforcement environment. The results show that the proposed model can support and improve the decision boundaries by increasing the number of training instances through the use of a pooling technique and an entropy uncertainty measure. The proposed paradigm can sanitize the data by hiding sensitive items while avoiding non-sensitive items. The model outperforms greedy, genetic, and particle swarm optimization approaches.
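
The entropy uncertainty measure mentioned in this abstract is commonly used in pool-based active learning to decide which unlabeled instances to annotate next. The sketch below shows that general idea only; the function names and the predicted probabilities are assumed, and it does not reproduce the paper's attention-based model.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def most_uncertain(pool_probs, k=1):
    """Indices of the k pool instances whose predictions have the highest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Assumed model outputs for three unlabeled instances (sensitive vs. non-sensitive).
pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
query = most_uncertain(pool, k=2)
```

The instance with the 50/50 prediction is queried first because the model is least sure about it; labeling such instances is what grows the training set in an active learning loop.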

DISTORTION-BASED HEURISTIC METHOD FOR SENSITIVE ASSOCIATION RULE HIDING

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/35/4/14131 ◽

2019 ◽

Vol 35 (4) ◽

pp. 337-354

Author(s):

Bac Le ◽

Lien Kieu ◽

Dat Tran

Keyword(s):

Data Mining ◽

Side Effects ◽

Association Rule ◽

Heuristic Method ◽

Sensitive Information ◽

Data Loss ◽

Individual Privacy ◽

Maximal Frequent Itemsets ◽

Sensitive Knowledge ◽

Privacy Issues

In the past few years, privacy issues have received considerable attention in the data mining literature. However, the problem of data security cannot be solved simply by restricting data collection or preventing unauthorized access; it requires solutions that protect sensitive information without degrading the accuracy of data mining results and without exposing sensitive knowledge related to individual privacy or competitive business advantage. Sensitive association rule hiding is an important issue in privacy-preserving data mining. The aim of association rule hiding is to minimize the side effects on the sanitized database, that is, to reduce the number of missing non-sensitive rules and the number of generated ghost rules. Current methods for hiding sensitive rules cause side effects and data loss. In this paper, we introduce a new distortion-based method to hide sensitive rules. The method determines critical transactions based on the number of non-sensitive maximal frequent itemsets that contain at least one item in the consequent of the sensitive rule, since these itemsets can be directly affected by the modified transactions. Using this set, the number of non-sensitive itemsets that need to be considered is reduced dramatically. We compute the smallest number of transactions to modify in advance in order to minimize the damage to the database. Comparative experimental results on real datasets show that the proposed method achieves better results than other methods, with fewer side effects and less data loss.

SEASONAL VARIATIONS IN LIPID PROFILES FROM 2.8 MILLION US ADULTS: THE VERY LARGE DATABASE OF LIPIDS (VLDL 14)

Journal of the American College of Cardiology ◽

10.1016/s0735-1097(14)61458-3 ◽

2014 ◽

Vol 63 (12) ◽

pp. A1458

Author(s):

Parag Joshi ◽

Seth Martin ◽

Michael Blaha ◽

John McEvoy ◽

Raul Santos ◽

...

Keyword(s):

Seasonal Variations ◽

Lipid Profiles ◽

Large Database ◽

Very Large Database

Study A Public Key in RSA Algorithm

European Journal of Engineering Research and Science ◽

10.24018/ejers.2020.5.4.1843 ◽

2020 ◽

Vol 5 (4) ◽

pp. 395-398

Author(s):

Taleb Samad Obaid

Keyword(s):

Private Information ◽

Original Data ◽

Prime Numbers ◽

Public Key ◽

The Internet ◽

Sensitive Information ◽

Rsa Algorithm ◽

Encryption And Decryption ◽

Main Disadvantage ◽

Encryption Decryption

When sensitive information is transmitted over an unsafe communication network such as the internet, protecting that information is a critical task. There is always a real chance that information sent through network terminals or the internet will be uncovered by professional or amateur intruders. A secure method is therefore needed to safeguard transferred information, and encryption/decryption, steganography, and cryptography may be adopted to protect it. In a cryptographic system, the information transferred between sender and receiver must be scrambled using an encryption algorithm, and the receiver recovers the original data using the decryption algorithm. Some encryption techniques use a single key for both encryption and decryption; when the same key is used in both procedures, the algorithm is called symmetric. Other techniques use two different keys for encryption and decryption, which is known as asymmetric-key cryptography. In general, the algorithms that use asymmetric keys are much more secure than those using one key. The RSA algorithm uses asymmetric keys: one, known as the public key, encrypts the message, and the other, called the private key, decrypts the encrypted message. The main disadvantage of the RSA algorithm is the extra time taken to perform the encryption process. In this study, MATLAB library functions are used to implement the work. The software handles very large prime numbers to generate the required keys, which enhances the security of transmitted information and is expected to make it difficult for a hacker to interfere with the private information. The algorithms were implemented successfully on message files of different sizes.
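
The public/private key relationship described in this abstract can be seen in a textbook RSA sketch. It is written in Python rather than the MATLAB used in the study, the tiny primes and the function names are chosen for illustration only, and it omits padding, so it is not secure and does not reflect the paper's implementation.

```python
def make_keys(p, q, e):
    """Build a textbook RSA key pair from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e (requires Python 3.8+)
    return (e, n), (d, n)

def encrypt(message, public_key):
    e, n = public_key
    return pow(message, e, n)

def decrypt(ciphertext, private_key):
    d, n = private_key
    return pow(ciphertext, d, n)

# Toy parameters (assumed); real keys use primes hundreds of digits long.
public_key, private_key = make_keys(p=61, q=53, e=17)
ciphertext = encrypt(65, public_key)
recovered = decrypt(ciphertext, private_key)
```

Anyone holding `public_key` can produce `ciphertext`, but recovering the message requires `private_key`, whose exponent can only be derived by someone who knows the factors of `n`. The modular exponentiation on large numbers is also where the extra running time noted in the abstract comes from.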

Privacy Preservation using (L, D) Inference Model Based on Dependency Identification Information Gain

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f1196.0986s319 ◽

2019 ◽

Vol 8 (6S3) ◽

pp. 1170-1173

Keyword(s):

Data Mining ◽

Information Gain ◽

Original Data ◽

Perturbation Approach ◽

Sensitive Information ◽

Functional Dependencies ◽

Inference Model ◽

Data Set ◽

Data Mining Techniques ◽

Original Dataset

With improvements in information processing and memory capacity, vast amounts of data are collected for various data analysis purposes. Data mining techniques are used to extract useful knowledge. When data is extracted using data mining techniques, it may be disclosed publicly, which leads to breaches of private data. Privacy-preserving data mining is used to protect sensitive information from unwanted or unsanctioned disclosure. In this paper, we analyze the problem of discovering similarity checks for functional dependencies from a given dataset such that applying the (l, d) inference algorithm with generalization can anonymize the microdata without loss of utility [8]. This work presents a functional-dependency-based perturbation approach that hides sensitive information from the user by applying the (l, d) inference model to the dependency attributes based on information gain. This approach works on both categorical and numerical attributes. The perturbed dataset does not affect the original dataset; it maintains the same or very comparable patterns as the original dataset. Hence the utility of the application remains high compared to other data mining techniques. The accuracy on the original and perturbed datasets is compared and analyzed using data mining classification algorithms.
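
Information gain, which this abstract uses to rank dependency attributes, has a standard definition: the entropy of the target attribute minus the expected entropy after splitting on another attribute. The sketch below computes that standard quantity; the function names and the toy records are assumptions, and the (l, d) inference model itself is not shown.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Reduction in entropy of `target` obtained by splitting rows on `attr`."""
    base = label_entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * label_entropy(g) for g in groups.values())
    return base - remainder

# Assumed micro data: "zip" fully determines "disease", "age" tells nothing.
rows = [
    {"zip": "A", "age": "young", "disease": "flu"},
    {"zip": "A", "age": "old", "disease": "flu"},
    {"zip": "B", "age": "young", "disease": "cold"},
    {"zip": "B", "age": "old", "disease": "cold"},
]
gain_zip = information_gain(rows, "zip", "disease")
gain_age = information_gain(rows, "age", "disease")
```

An attribute with high gain, like `zip` here, lets an observer infer the sensitive value, so it is a natural candidate for generalization or perturbation.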

Efficient and Privacy-Preserving Multi-User Outsourced K-Means Clustering

Computer and Information Science ◽

10.5539/cis.v14n2p26 ◽

2021 ◽

Vol 14 (2) ◽

pp. 26

Author(s):

Na Li ◽

Lianguan Huang ◽

Yanling Li ◽

Meng Sun

Keyword(s):

Data Mining ◽

Big Data ◽

Clustering Algorithm ◽

Privacy Preserving ◽

Locality Sensitive Hashing ◽

Sensitive Information ◽

The Public ◽

Big Data Mining ◽

Euclidean Distances ◽

Computational Resources

In recent years, with the development of the Internet, the volume of data on the network has grown explosively. Big data mining aims to obtain useful information through data processing such as clustering and classification. Clustering is an important branch of big data mining, and it is popular because of its simplicity. A new trend for clients who lack storage and computational resources is to outsource the data and the clustering task to public cloud platforms. However, because datasets used for clustering may contain sensitive information (e.g., identity information, health information), simply outsourcing them to cloud platforms cannot protect privacy. Clients therefore tend to encrypt their databases before uploading them to the cloud for clustering. In this paper, we focus on privacy protection and efficiency improvement for k-means clustering, and we propose a new privacy-preserving multi-user outsourced k-means clustering algorithm based on locality sensitive hashing (LSH). In this algorithm, we use the Paillier cryptosystem to encrypt databases and combine it with LSH to prune unnecessary computations during clustering. That is, we do not need to compute the Euclidean distance between every data record and every cluster center. Finally, theoretical and experimental results show that our algorithm is more efficient than most existing privacy-preserving k-means clustering algorithms.
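
The pruning idea in this abstract (use LSH so that a record is compared only with nearby cluster centers) can be sketched on plaintext data. The sketch leaves out the Paillier encryption entirely, uses fixed axis-aligned projections in place of the random projections a real LSH scheme would draw, and all names and values are assumptions.

```python
import math

def lsh_key(point, projections, width):
    """Bucket key: quantized projection of the point onto each direction."""
    return tuple(
        math.floor(sum(a * x for a, x in zip(proj, point)) / width)
        for proj in projections
    )

def candidate_centers(point, centers, projections, width):
    """Centers sharing the point's bucket; fall back to all centers if none do."""
    key = lsh_key(point, projections, width)
    same_bucket = [i for i, c in enumerate(centers)
                   if lsh_key(c, projections, width) == key]
    return same_bucket or list(range(len(centers)))

def nearest_center(point, centers, projections, width):
    """Assign the point to the closest candidate center (Euclidean distance)."""
    candidates = candidate_centers(point, centers, projections, width)
    return min(candidates, key=lambda i: math.dist(point, centers[i]))

# Assumed setup: two projections (fixed here for reproducibility) and two centers.
projections = [(1.0, 0.0), (0.0, 1.0)]
centers = [(1.0, 1.0), (8.0, 8.0)]

pruned = candidate_centers((2.0, 3.0), centers, projections, width=5.0)
assigned = nearest_center((6.0, 9.0), centers, projections, width=5.0)
fallback = nearest_center((4.0, 6.0), centers, projections, width=5.0)
```

For the first point only one center lands in the same bucket, so a single distance is computed instead of two. With many centers and expensive distance computations on encrypted data, skipping comparisons this way is where the efficiency gain would come from.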
