Program Comprehension through Data Mining

Author(s):  
Ioannis N. Kouris

Software development has various stages that can be conceptually grouped into two phases, namely development and production (Figure 1). The development phase includes requirements engineering, architecting, design, implementation and testing. The production phase, on the other hand, includes the actual deployment of the end product and its maintenance. Software maintenance is the last and most difficult stage in the software lifecycle (Sommerville, 2001), as well as the most costly one. According to Zelkowitz, Shaw and Gannon (1979) the production phase accounts for 67% of the costs of the whole process, whereas according to Van Vliet (2000) the actual cost of software maintenance has been estimated at more than half of the total software development cost. The development phase is therefore critical for facilitating efficient and simple software maintenance: the earlier stages should take into consideration not only the functional requirements but also the later maintenance task. For example, the design stage should plan the structure in a way that can be easily altered. Similarly, the implementation stage should create code that can be easily read, understood, and changed, and should also keep the code length to a minimum. According to Van Vliet (2000) the final source code length is the determinant factor for the total cost of maintenance, since the less code is written, the easier the maintenance becomes. According to Erdil et al. (2003) there are four major problems that can slow down the whole maintenance process: unstructured code; maintenance programmers having insufficient knowledge of the system; documentation that is absent, out of date, or at best insufficient; and software maintenance having a bad image. Thus the success of the maintenance phase relies on these problems being fixed earlier in the life cycle. In real life, however, when programmers decide to perform some maintenance task on a program, such as fixing bugs, making modifications or creating software updates, they usually do so under time and commercial pressure and with a cost-reduction mindset, ultimately producing a problematic system of ever-increasing complexity. As a consequence, maintainers spend from 50% up to almost 90% of their time trying to comprehend the program (Erdös & Sneed, 1998; Von Mayrhauser & Vans, 1994; Pigoski, 1996). Providing maintainers with tools and techniques for comprehending programs has consequently attracted considerable financial and research interest, given the widespread use of computers and software in all aspects of life. In this work we briefly present some of the most important techniques proposed in the field thus far, focusing primarily on data mining techniques in general and on association rules in particular, and we give some possible solutions to problems faced by these methods.
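To make the association-rule idea concrete, the sketch below mines co-change rules from version-history "transactions" (the sets of files modified together in one commit), so that a rule such as "parser.c -> ast.h" hints that a maintainer editing one file should also inspect the other. The commit data, file names, and thresholds are invented for illustration; this is not the paper's own algorithm.

```python
from itertools import combinations
from collections import Counter

# Hypothetical co-change history: each "transaction" is the set of
# source files modified together in one maintenance commit.
commits = [
    {"parser.c", "lexer.c", "ast.h"},
    {"parser.c", "ast.h"},
    {"lexer.c", "ast.h"},
    {"parser.c", "lexer.c", "ast.h"},
    {"gui.c", "menu.c"},
]

MIN_SUPPORT = 0.4     # fraction of commits a pair must appear in
MIN_CONFIDENCE = 0.7  # P(rhs changed | lhs changed)

n = len(commits)
item_counts = Counter()
pair_counts = Counter()
for tx in commits:
    item_counts.update(tx)
    pair_counts.update(combinations(sorted(tx), 2))

# Emit rules lhs -> rhs: "commits touching lhs usually also touch rhs".
for (a, b), cnt in pair_counts.items():
    if cnt / n < MIN_SUPPORT:
        continue
    for lhs, rhs in ((a, b), (b, a)):
        conf = cnt / item_counts[lhs]
        if conf >= MIN_CONFIDENCE:
            print(f"{lhs} -> {rhs}  support={cnt/n:.2f} confidence={conf:.2f}")
```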

Author(s):  
Krzysztof Jurczuk
Marcin Czajkowski
Marek Kretowski

This paper concerns the evolutionary induction of decision trees (DTs) for large-scale data. Such a global approach is one of the alternatives to top-down inducers: it searches for the tree structure and tests simultaneously and thus, in many situations, improves the prediction and size of the resulting classifiers. However, this population-based, iterative approach can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines the knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets, and in both cases the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that data-size boundaries for evolutionary DT mining are fading.
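The following sketch illustrates only the data-parallel decomposition idea: the dataset is split into shards (one per device), each worker computes a partial misclassification count for a candidate tree, and the partial results are reduced on the CPU. Python multiprocessing stands in for the paper's CUDA/multi-GPU implementation, and the "tree" is a trivial one-node stump, so every name and detail here is an assumption for illustration.

```python
import numpy as np
from multiprocessing import Pool

N_WORKERS = 4  # stands in for the 4 GPUs of the paper's workstation

def shard_errors(args):
    """Partial fitness: misclassification count of one candidate on one shard."""
    (X, y), tree = args
    # 'tree' is a trivial one-node stump (feature, threshold, left, right),
    # a placeholder for an evolved decision tree.
    f, t, left, right = tree
    pred = np.where(X[:, f] <= t, left, right)
    return int((pred != y).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400_000, 10))
    y = (X[:, 3] > 0).astype(int)
    # Shard the data once, one shard per device. In a real multi-GPU
    # implementation the shards stay resident in GPU memory; here they
    # are shipped with each task for simplicity.
    shards = list(zip(np.array_split(X, N_WORKERS), np.array_split(y, N_WORKERS)))
    tree = (3, 0.0, 0, 1)  # candidate individual from the evolutionary loop
    with Pool(N_WORKERS) as pool:
        partial = pool.map(shard_errors, [(s, tree) for s in shards])
    fitness = 1.0 - sum(partial) / len(y)  # accuracy-style fitness, reduced on CPU
    print(f"fitness = {fitness:.4f}")
```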


2003, Vol 9 (3-4), pp. 361-386
Author(s):
V. J. Modi
A. Akinturk
W. Tse

Bluff structures in the form of tall buildings, smokestacks, control towers, bridges, etc., are susceptible to vortex resonance and galloping-type instabilities. One approach to the vibration control of such systems is energy dissipation using sloshing liquid dampers. In this paper we focus on enhancing the energy dissipation efficiency of a rectangular liquid damper through the introduction of two-dimensional obstacles as well as floating particles. The investigation has two phases. To begin with, a parametric free-vibration study aimed at optimizing the obstacle geometry is undertaken to arrive at configurations promising an increased damping ratio and hence higher energy dissipation. The study is complemented by an extensive wind tunnel test program, which substantiates the effectiveness of this class of damper in suppressing both vortex resonance and galloping instabilities. Simplicity of design, ease of implementation, minimal maintenance, reliability and high efficiency make such liquid dampers quite attractive for real-life applications.


Author(s):  
J. L. ÁLVAREZ-MACÍAS
J. MATA-VÁZQUEZ
J. C. RIQUELME-SANTOS

In this paper we present a new method for applying data mining tools to the management phase of the software development process. Specifically, we describe two tools, the first based on supervised learning and the second on unsupervised learning. The goal of this method is to induce a set of management rules that ease the development process for managers. Depending on how and to what this method is applied, it permits an a priori analysis, monitoring of the project, or a post-mortem analysis.
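As an illustration of such a two-step method (the paper's concrete tools are not reproduced here), the sketch below first clusters hypothetical project metrics into management profiles and then induces human-readable rules for those profiles with a decision tree; all feature names and data are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical project metrics: [team size, estimated KLOC, schedule months]
projects = np.array([
    [3, 10, 4], [4, 12, 5], [12, 80, 18],
    [10, 70, 16], [5, 15, 6], [14, 90, 20],
])

# Unsupervised step: group projects into management profiles.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(projects)

# Supervised step: learn readable management rules that characterize
# each profile, usable for a priori analysis of a new project.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(projects, labels)
print(export_text(tree, feature_names=["team", "kloc", "months"]))
```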


2015, Vol 28 (3), pp. 1-14
Author(s):
Ehsan Saghehei
Azizollah Memariani

This paper implements a data mining process on real-life debit-card transactions with the aim of detecting suspicious behavior. The framework designed for this purpose merges supervised and unsupervised models. First, because the data are unlabeled, TwoStep and Self-Organizing Map algorithms are used to cluster the transactions. A C5.0 classification algorithm is then applied to evaluate the supervised models and to detect suspicious behavior. An innovative plan is designed to evaluate the hybrid models and select the most appropriate one for the fraud detection problem. The evaluation of the models and the final analysis of the data took place in four stages, and the appropriate hybrid model was selected from among 16 candidates. The results show a high ability of the selected model to detect suspicious behavior in debit-card transactions.
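A minimal sketch of the hybrid idea follows, with scikit-learn stand-ins for the paper's algorithms: KMeans takes the place of TwoStep/SOM for the unsupervised stage, and a CART decision tree takes the place of C5.0 for the supervised stage. The transaction features, data, and thresholds are all invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical debit-card features: [amount, hour of day, txns in last 24h]
X = np.vstack([
    rng.normal([50, 14, 2], [20, 3, 1], size=(500, 3)),   # typical usage
    rng.normal([900, 3, 15], [200, 1, 4], size=(20, 3)),  # rare pattern
])

# Stage 1 (unsupervised): cluster the unlabeled transactions and treat
# the small outlying cluster as "suspicious" pseudo-labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
suspicious = (km.labels_ == np.bincount(km.labels_).argmin()).astype(int)

# Stage 2 (supervised): train a tree on the pseudo-labels so that new
# transactions can be scored online.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, suspicious)
new_txn = [[1200.0, 2.0, 18.0]]
print("suspicious" if clf.predict(new_txn)[0] else "normal")
```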


Author(s):  
Suma B.
Shobha G.

Privacy-preserving data mining has become the focus of attention of government statistical agencies and the database security research community, who are concerned with preventing privacy disclosure during data mining. Repositories of large datasets include sensitive rules that need to be concealed from unauthorized access. Hence, association rule hiding has emerged as one of the powerful techniques for hiding sensitive knowledge that exists in data before it is published. In this paper, we present a constraint-based optimization approach for hiding a set of sensitive association rules, using a well-structured integer linear program formulation. The proposed approach reduces the database sanitization problem to an instance of the integer linear programming problem. The solution of the integer linear program determines the transactions that need to be sanitized in order to conceal the sensitive rules while minimizing the impact of sanitization on the non-sensitive rules. We also present a heuristic sanitization algorithm that performs hiding by reducing the support or the confidence of the sensitive rules. The results of an experimental evaluation on real-life datasets indicate the promising performance of the approach in terms of side effects on the original database.
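A much-simplified sketch of the ILP reduction follows, using the PuLP modeling library. It hides a single sensitive itemset by choosing which supporting transactions to sanitize, with the number of sanitized transactions as a crude proxy for the side-effect objective; the paper's actual formulation is richer, so the database, variables, and constraints here should be read as assumptions.

```python
# pip install pulp
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

# Toy database; the sensitive rule to hide is {A} -> {B}, i.e. the
# itemset {A, B} must fall below the support threshold after hiding.
db = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
sensitive = {"A", "B"}
min_support_count = 2  # the rule counts as "hidden" below this count

supporting = [i for i, tx in enumerate(db) if sensitive <= tx]

prob = LpProblem("rule_hiding", LpMinimize)
# x[i] = 1 if transaction i is sanitized (an item of the rule removed).
x = {i: LpVariable(f"x{i}", cat="Binary") for i in supporting}

# Minimizing the number of sanitized transactions is a simple proxy
# for minimizing side effects on the non-sensitive rules.
prob += lpSum(x.values())
# After sanitization, the sensitive itemset's support must drop
# strictly below the threshold.
prob += lpSum(1 - x[i] for i in supporting) <= min_support_count - 1

prob.solve()
print("sanitize transactions:", [i for i in supporting if x[i].value() == 1])
```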


2014, Vol 2014, pp. 1-11
Author(s):
Lopamudra Dey
Sanjay Chakraborty

The significance and applications of clustering span various fields. Because clustering is an unsupervised process in data mining, properly evaluating the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity assessment. Different indices are used to solve different types of problems, and the choice of index depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on a real-time air pollution database and on wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the clusters, compares the performances of these clustering algorithms according to the validity assessment, and identifies which algorithm is most desirable for forming properly compact clusters on these particular real-life datasets. In effect, it examines the behaviour of these clustering algorithms with respect to validation indices and presents the evaluation results in mathematical and graphical form.
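To illustrate what intracluster and intercluster validity measures look like in practice, the sketch below computes simple versions of both (plus the silhouette score) for a plain K-means run on the wine dataset mentioned above; the exact index definitions used in the paper may differ from these.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Intracluster measure: mean distance from each point to its own
# centroid (lower means more compact clusters).
intra = np.mean(np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1))

# Intercluster measure: minimum distance between centroids
# (higher means better-separated clusters).
c = km.cluster_centers_
inter = min(np.linalg.norm(c[i] - c[j])
            for i in range(len(c)) for j in range(i + 1, len(c)))

print(f"intra={intra:.3f}  inter={inter:.3f}  "
      f"silhouette={silhouette_score(X, km.labels_):.3f}")
```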


Author(s):  
Naveen Dahiya
Vishal Bhatnagar
Manjeet Singh
Neeti Sangwan

Data mining has proven to be an important technique for efficient information extraction, classification, clustering, and prediction of future trends from a database. Its valuable properties have been put to use in many applications. One such application is the Software Development Life Cycle (SDLC), where researchers have made effective use of data mining techniques. An exhaustive survey on the application of data mining in the SDLC has not previously been done. In this chapter, the authors carry out an in-depth survey of the existing literature on the application of data mining in the SDLC and propose a framework that classifies the work done by various researchers, identifies the prominent data mining techniques used in the various phases of the SDLC, and paves the way for future research in this emerging area.

