numerical attributes
Recently Published Documents

TOTAL DOCUMENTS: 92 (five years: 21)
H-INDEX: 10 (five years: 2)

Author(s):  
Sujatha Krishna ◽  
Udayarani Vinayaka Murthy

Big data has remodeled the way organizations supervise, examine, and leverage data in every industry. To safeguard sensitive data from public contraventions, several countries have investigated this issue and implemented privacy protection mechanisms. With quasi-identifiers alone, privacy is not preserved to any great extent. This paper proposes a method called evolutionary tree-based quasi-identifier and federated gradient (ETQI-FD) for privacy preservation over big healthcare data. The first step in ETQI-FD is learning quasi-identifiers. Learning quasi-identifiers by employing an information loss function separately for categorical and numerical attributes accomplishes both the largest dissimilarities and partitioning without an exhaustive search between tuples of features or attributes. Next, with the learnt quasi-identifiers, privacy preservation of data items is achieved by applying a federated gradient arbitrary privacy preservation learning model, which attains an optimal balance between privacy and accuracy. In this model, we evaluate the contribution of each attribute to the outputs. Then, by injecting adaptive Lorentz noise into the data attributes, ETQI-FD significantly minimizes the influence of noise on the final results, thereby contributing to both privacy and accuracy. An experimental evaluation shows that ETQI-FD achieves better accuracy and privacy than existing methods.
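
The abstract does not include an implementation, so as a rough illustration of the noise-injection idea only: the sketch below adds Cauchy-distributed (Lorentz) noise to each numerical attribute, scaling the noise inversely to a per-attribute influence score. The function name, the influence weighting, and the scale parameters are all assumptions for illustration, not the paper's actual ETQI-FD procedure.

```python
import numpy as np

def inject_lorentz_noise(X, influence, base_scale=0.1, rng=None):
    """Add Cauchy (Lorentz) noise to each attribute column of X, with the
    noise scale adapted inversely to the attribute's estimated influence
    on the model output: influential attributes get less noise, which is
    one plausible reading of 'adaptive' noise that preserves accuracy."""
    rng = np.random.default_rng() if rng is None else rng
    influence = np.asarray(influence, dtype=float)
    # Normalize influences to (0, 1] so the noise scale stays bounded.
    weights = influence / (influence.max() + 1e-12)
    scales = base_scale * (1.0 - 0.9 * weights)  # high influence -> small scale
    noise = rng.standard_cauchy(size=X.shape) * scales  # broadcast per column
    return X + noise

# Example: 100 records, 3 numerical attributes with differing influence.
X = np.random.rand(100, 3)
noisy_X = inject_lorentz_noise(X, influence=[0.9, 0.2, 0.5])
```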


2021 ◽  
Vol 12 (4) ◽  
pp. 101-124
Author(s):  
Makhlouf Ledmi ◽  
Hamouma Moumen ◽  
Abderrahim Siam ◽  
Hichem Haouassi ◽  
Nabil Azizi

Association rules are specific data mining methods aiming to discover explicit relations between the different attributes in a large dataset. In practice, however, many datasets contain both numeric and categorical attributes. Recently, many nature-inspired meta-heuristic algorithms have been developed for solving continuous problems. This article proposes a new algorithm, DCSA-QAR, for mining quantitative association rules based on the crow search algorithm (CSA). To accomplish this, new operators are defined to increase the ability to explore the search space and to ensure the transition from the continuous to the discrete version of CSA. Moreover, a new discretization algorithm is adopted for numerical attributes, taking into account the dependencies that probably exist between attributes. Finally, to evaluate its performance, DCSA-QAR is compared with particle swarm optimization and with mono- and multi-objective evolutionary approaches for mining association rules. The results obtained over real-world datasets show the outstanding performance of DCSA-QAR in terms of quality measures.
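
For context, here is a minimal sketch of the rule-evaluation step in quantitative association rule mining, the task DCSA-QAR addresses: a rule maps attributes to numeric intervals, and its quality is commonly measured by support and confidence. The representation and function below are generic illustrations, not the paper's CSA operators or quality measures.

```python
import numpy as np

def rule_support_confidence(data, antecedent, consequent):
    """Evaluate a quantitative association rule antecedent -> consequent,
    where each side is a dict mapping an attribute (column index) to a
    (low, high) interval. Returns (support, confidence) over the numeric
    data matrix."""
    def covers(conditions):
        mask = np.ones(len(data), dtype=bool)
        for col, (low, high) in conditions.items():
            mask &= (data[:, col] >= low) & (data[:, col] <= high)
        return mask
    ante = covers(antecedent)
    both = ante & covers(consequent)
    support = both.mean()
    confidence = both.sum() / ante.sum() if ante.sum() else 0.0
    return support, confidence

# Example rule: attribute 0 in [0.2, 0.8] -> attribute 1 in [0.5, 1.0].
data = np.random.rand(500, 2)
print(rule_support_confidence(data, {0: (0.2, 0.8)}, {1: (0.5, 1.0)}))
```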


2021 ◽  
Vol 28 (2) ◽  
pp. 66-89
Author(s):  
Alexandra Virgínia Valente da Silva ◽  
Carlos Manoel Pedro Vaz ◽  
Ednaldo José Ferreira ◽  
Rafael Galbieri

The advance of cotton farming in the Brazilian savannah boosted and made possible a highly technified, efficient, and profitable production, elevating the country from the condition of cotton fiber importer in the 1970s to one of the main exporters today. Despite the increasing contribution of technologies such as transgenic cultivars, machines, inputs, and more efficient data management, in recent years there has been a stagnation of cotton productivity in the State of Mato Grosso (MT). Data mining (DM) techniques offer an excellent opportunity to assess this problem. Through rule-based classification applied to a real database (DB) of cotton production in MT, factors were identified that were affecting, and consequently limiting, the increase in productivity. In the pre-processing of the data, we performed attribute selection, transformation, and identification of outliers. Numerical attributes were discretized using the automatic techniques Kononenko (KO) and Better Encoding (BE), and their combination KO + BE. In the modeling step, the rule algorithms used were PART and JRip, both implemented in the WEKA tool. Performance was assessed using the statistical metrics accuracy, recall, and cost, and their combination through the I_FC index (created by the authors). Results showed better performance for the PART classifier with discretization by the KO + BE technique, followed by binary conversion. The analysis of the rules made it possible to identify the attributes that most impact productivity. This article is an excerpt from an ICMC/USP Professional Master's dissertation carried out in São Carlos-SP, Brazil.
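
As a hedged stand-in for the pre-processing described above: the sketch below discretizes numerical attributes with simple equal-frequency binning (the paper uses Kononenko's MDL-based criterion and Better Encoding instead) and then applies the binary conversion step via one-hot encoding, ready for rule learners such as PART or JRip. The attribute names are hypothetical.

```python
import numpy as np
import pandas as pd

def discretize_and_binarize(df, numeric_cols, n_bins=4):
    """Discretize each numerical attribute into equal-frequency bins
    (a stand-in for the KO/BE techniques in the paper), then one-hot
    (binary) encode the resulting categories for the rule learners."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = pd.qcut(out[col], q=n_bins, duplicates="drop")
    return pd.get_dummies(out, columns=numeric_cols)

# Example with two hypothetical agronomic attributes.
df = pd.DataFrame({"rainfall": np.random.rand(50) * 100,
                   "plant_density": np.random.rand(50) * 10})
binary = discretize_and_binarize(df, ["rainfall", "plant_density"])
```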


Author(s):  
Mathias Gröbe ◽  
Dirk Burghardt

In cartographic generalization, selection is an often-used method to adjust the information density in a map. This paper deals with methods for selecting point features with numerical attributes, such as population, elevation, or visitor counts, for a specific scale. With the Label Grid approach and the method of Functional Importance, two existing approaches are described that have not been published in the scientific literature so far; they are explained and illustrated in the methods chapter for better understanding. Furthermore, a new approach based on the Discrete Isolation measure is introduced. It combines the spatial position and the attribute's value and is defined as the minimum distance to the nearest point with a higher value. All described selection methods are implemented and made available as QGIS plugins named “Point selection algorithms”. Based on this implementation, the three methods are compared regarding runtime, parameterization, legibility, and degree of generalization. Finally, recommendations are given on the data and use cases for which the approaches are suitable. We see digital maps with multiple scales as the main application of these methods. The labeling of the selected points is not considered within the scope of this work.
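
The Discrete Isolation measure is defined precisely enough in the abstract to sketch: for each point, the minimum distance to the nearest point with a strictly higher attribute value. Below is a straightforward O(n²) illustration; the QGIS plugin's actual implementation may use spatial indexing and differ in detail.

```python
import numpy as np

def discrete_isolation(points, values):
    """Discrete Isolation of each point: minimum Euclidean distance to
    the nearest point with a strictly higher attribute value. The point
    with the global maximum value gets infinity (most isolated)."""
    pts = np.asarray(points, dtype=float)
    vals = np.asarray(values, dtype=float)
    iso = np.full(len(pts), np.inf)
    for i in range(len(pts)):
        higher = vals > vals[i]
        if higher.any():
            iso[i] = np.linalg.norm(pts[higher] - pts[i], axis=1).min()
    return iso

# Selecting the k most isolated points keeps locally dominant features,
# e.g. the highest peaks, at a small scale.
pts = np.random.rand(200, 2)
elevation = np.random.rand(200)
iso = discrete_isolation(pts, elevation)
top_k = np.argsort(iso)[::-1][:20]
```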


2021 ◽  
Vol 7 ◽  
pp. e512
Author(s):  
Reynald Eugenie ◽  
Erick Stattner

In this paper, we focus on the problem of searching for subgroups in numerical data. This approach aims to identify subsets of objects, called subgroups, which exhibit interesting characteristics compared to the average, according to a quality measure calculated on a target variable. We present DISGROU, a new approach that identifies subgroups whose attribute intervals may be discontinuous. Unlike the main algorithms in the field, the originality of our proposal lies in the way it breaks down the intervals of the attributes during the subgroup search phase. The basic assumption of our approach is that the ranges of the attributes defining the groups can be disjoint, which improves the quality of the identified subgroups. Indeed, the traditional methods in the field perform the subgroup search only over continuous intervals, which results in subgroups defined over wider intervals that contain irrelevant objects degrading the quality function. A further advantage of our approach is that it does not require a prior discretization of the attributes, since it works directly on the numerical attributes. The efficiency of our proposal is demonstrated first by comparing the results with two reference algorithms in the field and then by applying it to a case study.
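
To make the setting concrete, here is a minimal sketch in which a subgroup description maps each attribute to a union of possibly disjoint intervals, DISGROU's key idea. The mean-shift quality measure below is a common generic choice for numeric targets and is an assumption, not necessarily the measure used in the paper.

```python
import numpy as np

def subgroup_quality(data, target, description):
    """Quality of a subgroup whose description maps each attribute
    (column index) to a list of intervals, which may be disjoint.
    Quality = coverage * (subgroup mean of target - overall mean)."""
    mask = np.ones(len(data), dtype=bool)
    for col, intervals in description.items():
        col_mask = np.zeros(len(data), dtype=bool)
        for low, high in intervals:  # union of possibly disjoint intervals
            col_mask |= (data[:, col] >= low) & (data[:, col] <= high)
        mask &= col_mask
    if not mask.any():
        return 0.0
    return mask.mean() * (target[mask].mean() - target.mean())

# Subgroup: attribute 0 in [0, 0.2] OR [0.7, 1.0] (a disjoint description).
data = np.random.rand(300, 2)
target = data[:, 0] ** 2
print(subgroup_quality(data, target, {0: [(0.0, 0.2), (0.7, 1.0)]}))
```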


Author(s):  
Shihua Liu ◽  
Hao Zhang ◽  
Xianghua Liu

A two-stage clustering framework and a clustering algorithm for mixed-attribute data based on density peaks and Goodall distance are proposed. First, the subset of numerical attributes of the dataset is clustered, and the result is mapped into a one-dimensional categorical attribute that is added to the subset of categorical attributes. The new all-categorical dataset is then clustered with the density peaks clustering algorithm to obtain the final result. Experiments on three commonly used UCI datasets show that this algorithm can effectively cluster mixed-attribute data and produces better results than the traditional K-prototypes algorithm does: the clustering accuracy on the Acute, Heart, and Credit datasets is on average 17%, 24%, and 21% higher, respectively.
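
A minimal sketch of the two-stage framework as described: cluster the numerical subset, append the resulting labels as a new categorical attribute, then cluster the resulting all-categorical data. KMeans stands in for both stages purely for illustration; the paper uses density peaks clustering with Goodall distance for the final stage.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_mixed_clustering(X_num, X_cat, k_first=5, k_final=3):
    """Two-stage framework: (1) cluster the numerical attribute subset,
    (2) map its labels into a one-dimensional categorical attribute
    appended to the categorical subset, (3) cluster the all-categorical
    dataset (here via one-hot encoding + KMeans as a stand-in)."""
    labels_num = KMeans(n_clusters=k_first, n_init=10).fit_predict(X_num)
    # Append the numeric-stage labels as an extra categorical column.
    X_cat_aug = np.column_stack([X_cat, labels_num.astype(str)])
    # One-hot encode so a generic clusterer can consume the data.
    onehot = np.column_stack([
        (X_cat_aug[:, j:j + 1] == np.unique(X_cat_aug[:, j])).astype(float)
        for j in range(X_cat_aug.shape[1])
    ])
    return KMeans(n_clusters=k_final, n_init=10).fit_predict(onehot)

X_num = np.random.rand(120, 3)
X_cat = np.random.choice(["a", "b", "c"], size=(120, 2))
labels = two_stage_mixed_clustering(X_num, X_cat)
```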


Author(s):  
Huisheng Zhu ◽  
Bin Yu

The rapid development of technology and increasing numbers of customers have saturated the communication market, so communication operators must pay focused attention to the problem of customer churn. Analyzing customers' communication behavior and building a churn prediction model can give operators advance evidence with which to minimize churn. This paper describes how to design an HMM (hidden Markov model) to predict customer churn from communication data. First, we oversample churners to increase the number of positive samples and establish a relative balance of positive and negative samples. Second, the continuous numerical attributes that affect churn are discretized, and their monthly values are converted into monthly change tendencies. Next, we select the communication features by calculating the information gains and information gain ratios of all communication attributes. We then construct and optimize an HMM-based churn prediction model. Finally, we test and evaluate the model using a Spark cluster and the communication dataset of the Taizhou Branch of China Telecom. The experimental evaluation shows that our prediction model is highly reliable.
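
The conversion of monthly values into change tendencies lends itself to a short sketch: each month-over-month change is discretized into a symbol an HMM can consume. The tolerance threshold and the three-symbol alphabet below are illustrative assumptions; the paper's exact discretization may differ.

```python
import numpy as np

def monthly_tendencies(series, stable_tol=0.05):
    """Convert a customer's monthly numerical attribute (e.g., call
    minutes) into a discrete change-tendency sequence: 0 = down,
    1 = stable, 2 = up, judged by relative month-over-month change."""
    series = np.asarray(series, dtype=float)
    change = np.diff(series) / (np.abs(series[:-1]) + 1e-9)
    symbols = np.ones(len(change), dtype=int)  # default: stable
    symbols[change > stable_tol] = 2
    symbols[change < -stable_tol] = 0
    return symbols

# Example: six months of usage for one subscriber -> [1 0 0 0 2].
print(monthly_tendencies([320, 310, 280, 150, 90, 95]))
```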


Author(s):  
Indriyani Indriyani ◽  
M. Ihsan Alfani Putera

A database can contain both numerical and non-numerical attributes. However, several data processing algorithms, such as K-means clustering, can only be applied to datasets with numerical attributes. Data generalization with the Naïve Bayes and K-means clustering methods usually employs the WEKA (Waikato Environment for Knowledge Analysis) application. Although WEKA's strength lies in its increasingly complete and sophisticated algorithms, the success of data mining still depends on the knowledge of the human implementer: collecting high-quality data, modeling appropriately, and choosing suitable algorithms are all needed to guarantee the accuracy of the expected results. In this paper, we propose a simple web-based application that can be used like WEKA. The methodology of this study comprises several stages. The first stage is data preparation, in which the tic-tac-toe game dataset is converted to CSV (comma-separated values) format. The next stage modifies the data from non-numeric to numeric form, specifically for clustering with the K-means algorithm. Afterward, the distances between data points are calculated and the data are clustered. The final stage summarizes these processes and results. The experimental results show that clustering can be performed on categorical attributes that are first transformed into numerical form using the web-based application.
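
As an illustration of the non-numeric-to-numeric conversion step, the sketch below maps the tic-tac-toe board symbols (x, o, and b for blank, as in the UCI dataset) to numbers and clusters the boards with K-means. The particular mapping values are assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative symbol-to-number mapping; any consistent encoding works,
# since K-means only needs a numeric feature space.
MAPPING = {"x": 1.0, "o": -1.0, "b": 0.0}  # b = blank cell

def cluster_tictactoe(rows, k=2):
    """rows: list of 9-symbol board states, e.g. parsed from the CSV."""
    X = np.array([[MAPPING[cell] for cell in row] for row in rows])
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)

boards = [list("xxxoobbob"), list("oxoxxobbx"), list("bbbxoxoxo")]
print(cluster_tictactoe(boards))
```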


Author(s):  
J. E. Wolff

While the determinable/determinates model of quantities treats quantities as (special cases of) variable attributes, the approaches considered in this chapter focus on quantities as numerical attributes: being numerically representable is what makes quantities special, if anything does. I distinguish three different attitudes to quantities thus conceived: restrictive realism, restrictive empiricism, and permissive empiricism. Restrictive realists hold that quantitativeness is a feature of attributes, not concepts, and that not all attributes are quantitative; restrictive empiricists hold that quantitativeness is a feature of concepts, not attributes, but that only some concepts are quantitative; permissivists hold that there is nothing special about quantitative concepts, since any attribute can be numerically represented. This chapter argues that we should reject the idea that quantities are numerical attributes or concepts and suggests that we should focus instead on the uniqueness of the numerical representation, a claim made more precise in Chapter 6.

