SAMLDroid: A Static Taint Analysis and Machine Learning Combined High-Accuracy Method for Identifying Android Apps with Location Privacy Leakage Risks

Insecure applications (apps) are increasingly used to steal users’ location information for illegal purposes, which has aroused great concern in recent years. Although the existing methods, i.e., static and dynamic taint analysis, have shown great merit for identifying such apps, which mainly rely on statically analyzing source code or dynamically monitoring the location data flow, identification accuracy is still under research, since the analysis results contain a certain false positive or true negative rate. In order to improve the accuracy and reduce the misjudging rate in the process of vetting suspicious apps, this paper proposes SAMLDroid, a combined method of static code analysis and machine learning for identifying Android apps with location privacy leakage, which can effectively improve the identification rate compared with existing methods. SAMLDroid first uses static analysis to scrutinize source code to investigate apps with location acquiring intentions. Then it exploits a well-trained classifier and integrates an app’s multiple features to dynamically analyze the pattern and deliver the final verdict about the app’s property. Finally, it is proved by conducting experiments, that the accuracy rate of SAMLDroid is up to 98.4%, which is nearly 20% higher than Apparecium.

Download Full-text

Identification Author of Source Code by Machine Learning Methods

SPIIRAS Proceedings ◽

10.15622/sp.2019.18.3.741-765 ◽

2019 ◽

Vol 18 (3) ◽

pp. 742-766 ◽

Cited By ~ 3

Author(s):

Anna Kurtukova ◽

Alexander Romanov

Keyword(s):

Neural Network ◽

Machine Learning ◽

Support Vector Machine ◽

Source Code ◽

Educational Process ◽

Machine Learning Algorithms ◽

Identification Accuracy ◽

Support Vector ◽

Hybrid Neural Network ◽

Source Codes

The paper is devoted to the analysis of the problem of determining the source code author , which is of interest to researchers in the field of information security, computer forensics, assessment of the quality of the educational process, protection of intellectual property. The paper presents a detailed analysis of modern solutions to the problem. The authors suggest two new identification techniques based on machine learning algorithms: support vector machine, fast correlation filter and informative features; the technique based on hybrid convolutional recurrent neural network. The experimental database includes samples of source codes written in Java, C ++, Python, PHP, JavaScript, C, C # and Ruby. The data was obtained using a web service for hosting IT-projects – Github. The total number of source codes exceeds 150 thousand samples. The average length of each of them is 850 characters. The case size is 542 authors. The experiments were conducted with source codes written in the most popular programming languages. Accuracy of the developed techniques for different numbers of authors was assessed using 10-fold cross-validation. An additional series of experiments was conducted with the number of authors from 2 to 50 for the most popular Java programming language. The graphs of the relationship between identification accuracy and case size are plotted. The analysis of result showed that the method based on hybrid neural network gives 97% accuracy, and it’s at the present time the best-known result. The technique based on the support vector machine made it possible to achieve 96% accuracy. The difference between the results of the hybrid neural network and the support vector machine was approximately 5%.

Download Full-text

Engrossing Prosecution of Code Smells Type Identification and Rectification using Machine Learning AdaBoost Classifier

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.d1331.029420 ◽

2020 ◽

Vol 9 (4) ◽

pp. 624-631

Keyword(s):

Machine Learning ◽

Decision Tree ◽

False Positive Rate ◽

Source Code ◽

Minimum Cost ◽

Space Complexity ◽

Identification Accuracy ◽

Code Smells ◽

Code Smell ◽

Software Code

Software code smells are the structural features which reside in a software source code. Code smell detection is an established method to discover the problems in source code and reorganize the inner structure of object-oriented software for improving the quality of such software, particularly in terms of maintainability, reusability and cost minimization. The developer identified where the code smell is identified and rectified within a system is a major challenging issue. The various code smell detection technique has been designed but it failed to classify the code type and minimum rectification cost. In order to perform classification with minimum cost, an efficient technique called Machine Learning Ada-Boost Classifier (MLABC) technique is introduced. The MLABC technique improves the software quality by identifying and rectifying the different types of software code smell in source code. Initially, MLABC technique uses decision tree as base classifier to identify the code smell type. The decision tree is used to classify the code smell type based on the certain rule. After that, the base classifiers are combined to make a strong classifier using adaboost machine learning technique. The output of strong classifier is used to identify the code smell type. Finally, the code smell type rectification is performed by applying the refactoring technique where the code smell is identified with minimum cost and space complexity. Experimental results shows that the proposed MLABC technique improves the software code quality in terms of code smell type identification accuracy, false positive rate, code smell type rectification cost and space complexity with the source code

Download Full-text

Enhanced Bug Prediction in JavaScript Programs with Hybrid Call-Graph Based Invocation Metrics

Technologies ◽

10.3390/technologies9010003 ◽

2020 ◽

Vol 9 (1) ◽

pp. 3

Author(s):

Gábor Antal ◽

Zoltán Tóth ◽

Péter Hegedűs ◽

Rudolf Ferenc

Keyword(s):

Software Maintenance ◽

Positive Impact ◽

Source Code ◽

Code Analysis ◽

Static Source ◽

Static Code Analysis ◽

Function Calls ◽

Hybrid Code ◽

Code Metrics ◽

Scripting Language

Bug prediction aims at finding source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited amount of testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. In this paper, we propose a function level JavaScript bug prediction model based on static source code metrics with the addition of a hybrid (static and dynamic) code analysis based metric of the number of incoming and outgoing function calls (HNII and HNOI). Our motivation for this is that JavaScript is a highly dynamic scripting language for which static code analysis might be very imprecise; therefore, using a purely static source code features for bug prediction might not be enough. Based on a study where we extracted 824 buggy and 1943 non-buggy functions from the publicly available BugsJS dataset for the ESLint JavaScript project, we can confirm the positive impact of hybrid code metrics on the prediction performance of the ML models. Depending on the ML algorithm, applied hyper-parameters, and target measures we consider, hybrid invocation metrics bring a 2–10% increase in model performances (i.e., precision, recall, F-measure). Interestingly, replacing static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performances; however, using them all together yields the best results.

Download Full-text

Automatic detection of Long Method and God Class code smells through neural source code embeddings

10.36227/techrxiv.17206010.v1 ◽

2021 ◽

Author(s):

Aleksandar Kovačević ◽

Jelena Slivka ◽

Dragan Vidaković ◽

Katarina-Glorija Grujić ◽

Nikola Luburić ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Negative Impact ◽

Source Code ◽

Systematic Evaluation ◽

Small Scale ◽

Code Smells ◽

Code Metrics ◽

Code Smell ◽

F Measure

Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging and researchers proposed many automatic code smell detectors. Most of the studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluating the detectors using small-scale case studies and an inconsistent experimental setting. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection. This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for detection of God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT). We perform our experiments on the large-scale, manually labeled MLCQ dataset. We consider the binary classification problem – we classify the code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as a measure of performance. In our experiments, the ML classifier trained using CuBERT source code embeddings achieved the best performance for both God Class (F-measure of 0.53) and Long Method detection (F-measure of 0.75). With the help of a domain expert, we perform the error analysis to discuss the advantages of the CuBERT approach. This study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection to the best of our knowledge. A secondary contribution of our study is the systematic evaluation of the effectiveness of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.

Download Full-text

An Efficient Anonymous Communication Scheme to Protect the Privacy of the Source Node Location in the Internet of Things

Security and Communication Networks ◽

10.1155/2021/6670847 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

Fengyin Li ◽

Pei Ren ◽

Guoyu Yang ◽

Yuhong Sun ◽

Yilei Wang ◽

...

Keyword(s):

Wireless Sensor Networks ◽

Sensor Networks ◽

Location Privacy ◽

Routing Algorithm ◽

Residual Energy ◽

Source Node ◽

Wireless Sensor ◽

Anonymous Communication ◽

Privacy Leakage ◽

Communication Scheme

Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as data analytics, autonomous systems, and security diagnostics. As an important part of the Internet of Things (IoT), wireless sensor networks (WSNs) have been widely used in military, transportation, medical, and household fields. However, in the applications of wireless sensor networks, the adversary can infer the location of a source node and an event by backtracking attacks and traffic analysis. The location privacy leakage of a source node has become one of the most urgent problems to be solved in wireless sensor networks. To solve the problem of source location privacy leakage, in this paper, we first propose a proxy source node selection mechanism by constructing the candidate region. Secondly, based on the residual energy of the node, we propose a shortest routing algorithm to achieve better forwarding efficiency. Finally, by combining the proposed proxy source node selection mechanism with the proposed shortest routing algorithm based on the residual energy, we further propose a new, anonymous communication scheme. Meanwhile, the performance analysis indicates that the anonymous communication scheme can effectively protect the location privacy of the source nodes and reduce the network overhead.

Download Full-text

Encoding Health Records into Pathway Representations for Deep Learning

10.3233/shti210800 ◽

2021 ◽

Author(s):

Marco Luca Sbodio ◽

Natasha Mulligan ◽

Stefanie Speichert ◽

Vanessa Lopez ◽

Joao Bettencourt-Silva

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Source Code ◽

Training Dataset ◽

Health Records ◽

Learning Tasks ◽

Patient Pathways ◽

Computational Resources ◽

The Impact

There is a growing trend in building deep learning patient representations from health records to obtain a comprehensive view of a patient’s data for machine learning tasks. This paper proposes a reproducible approach to generate patient pathways from health records and to transform them into a machine-processable image-like structure useful for deep learning tasks. Based on this approach, we generated over a million pathways from FAIR synthetic health records and used them to train a convolutional neural network. Our initial experiments show the accuracy of the CNN on a prediction task is comparable or better than other autoencoders trained on the same data, while requiring significantly less computational resources for training. We also assess the impact of the size of the training dataset on autoencoders performances. The source code for generating pathways from health records is provided as open source.

Download Full-text

Enabling Machine Learning on Resource Constrained Devices by Source Code Generation of the Learned Models

Lecture Notes in Computer Science - Computational Science – ICCS 2018 ◽

10.1007/978-3-319-93701-4_54 ◽

2018 ◽

pp. 682-694 ◽

Cited By ~ 3

Author(s):

Tomasz Szydlo ◽

Joanna Sendorek ◽

Robert Brzoza-Woch

Keyword(s):

Machine Learning ◽

Code Generation ◽

Source Code ◽

Resource Constrained ◽

Resource Constrained Devices ◽

Constrained Devices

Download Full-text

Logging Analysis and Prediction in Open Source Java Project

Research Anthology on Usage and Development of Open Source Software ◽

10.4018/978-1-7998-9158-1.ch038 ◽

2021 ◽

pp. 733-761

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Content Analysis ◽

Software Development ◽

Anomaly Detection ◽

Open Source ◽

Large Scale ◽

Source Code ◽

Scale Analysis ◽

Large Scale Analysis ◽

Research Questions

Log statements present in source code provide important information to the software developers because they are useful in various software development activities such as debugging, anomaly detection, and remote issue resolution. Most of the previous studies on logging analysis and prediction provide insights and results after analyzing only a few code constructs. In this chapter, the authors perform an in-depth, focused, and large-scale analysis of logging code constructs at two levels: the file level and catch-blocks level. They answer several research questions related to statistical and content analysis. Statistical and content analysis reveals the presence of differentiating properties among logged and nonlogged code constructs. Based on these findings, the authors propose a machine-learning-based model for catch-blocks logging prediction. The machine-learning-based model is found to be effective in catch-blocks logging prediction.

Download Full-text

A Case Study on Testing for Software Security

Advanced Automated Software Testing ◽

10.4018/978-1-4666-0089-8.ch005 ◽

2012 ◽

pp. 89-112

Author(s):

Natarajan Meghanathan ◽

Alexander Roy Geoghegan

Keyword(s):

Software Security ◽

Denial Of Service ◽

Source Code ◽

Code Analysis ◽

Software Vulnerabilities ◽

Static Code Analysis ◽

Software Program ◽

Automated Tools ◽

High Level

The high-level contribution of this book chapter is to illustrate how to conduct static code analysis of a software program and mitigate the vulnerabilities associated with the program. The automated tools used to test for software security are the Source Code Analyzer and Audit Workbench, developed by Fortify, Inc. The first two sections of the chapter are comprised of (i) An introduction to Static Code Analysis and its usefulness in testing for Software Security and (ii) An introduction to the Source Code Analyzer and the Audit Workbench tools and how to use them to conduct static code analysis. The authors then present a detailed case study of static code analysis conducted on a File Reader program (developed in Java) using these automated tools. The specific software vulnerabilities that are discovered, analyzed, and mitigated include: (i) Denial of Service, (ii) System Information Leak, (iii) Unreleased Resource (in the context of Streams), and (iv) Path Manipulation. The authors discuss the potential risk in having each of these vulnerabilities in a software program and provide the solutions (and the Java code) to mitigate these vulnerabilities. The proposed solutions for each of these four vulnerabilities are more generic and could be used to correct such vulnerabilities in software developed in any other programming language.

Download Full-text

Atomicity Violation in Multithreaded Applications and Its Detection in Static Code Analysis Process

Applied Sciences ◽

10.3390/app10228005 ◽

2020 ◽

Vol 10 (22) ◽

pp. 8005

Author(s):

Damian Giebas ◽

Rafał Wojszczyk

Keyword(s):

Parallel Computing ◽

Source Code ◽

Code Analysis ◽

Atomicity Violation ◽

Code Model ◽

Analysis Process ◽

Static Code Analysis ◽

Violation Detection ◽

Definition Of

This paper is a contribution to the field of research dealing with the parallel computing, which is used in multithreaded applications. The paper discusses the characteristics of atomicity violation in multithreaded applications and develops a new definition of atomicity violation based on previously defined relationships between operations, that can be used to atomicity violation detection. A method of detection of conflicts causing atomicity violation was also developed using the source code model of multithreaded applications that predicts errors in the software.

Download Full-text