Machine-Learning-Based Android Malware Family Classification Using Built-In and Custom Permissions

Minki Kim; Daehan Kim; Changha Hwang; Seongje Cho; Sangchul Han; Minkyu Park

doi:10.3390/app112110244

Applied Sciences ◽

10.3390/app112110244 ◽

2021 ◽

Vol 11 (21) ◽

pp. 10244

Author(s):

Minki Kim ◽

Daehan Kim ◽

Changha Hwang ◽

Seongje Cho ◽

Sangchul Han ◽

...

Keyword(s):

Machine Learning ◽

Positive Impact ◽

Imbalanced Data ◽

Classification Performance ◽

Malware Analysis ◽

Learning Approaches ◽

Android Malware ◽

Open Questions ◽

Imbalanced Data Classification ◽

Family Classification

Malware family classification is the grouping of malware samples that have the same or similar characteristics into the same family. It plays a crucial role in understanding notable malicious patterns and recovering from malware infections. Although many machine learning approaches have been devised for this problem, several open questions remain, including “Which features, classifiers, and evaluation metrics are better for malware familial classification?” In this paper, we propose a machine learning approach to Android malware family classification using built-in and custom permissions. Each Android app must declare proper permissions to access restricted resources or to perform restricted actions. Permission declaration is an efficient and obfuscation-resilient feature for malware analysis. We developed a malware family classification technique using permissions and conducted extensive experiments with several classifiers on a well-known dataset, DREBIN. We then evaluated the classifiers in terms of four metrics: macro-level F1-score, accuracy, balanced accuracy (BAC), and the Matthews correlation coefficient (MCC). BAC and the MCC are known to be appropriate for evaluating imbalanced data classification. Our experimental results showed that: (i) custom permissions had a positive impact on classification performance; (ii) even when the same classifier and the same feature information were used, there was a difference of up to 3.67% between accuracy and BAC; (iii) LightGBM and AdaBoost performed better than the other classifiers we considered.
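The four metrics named in the abstract can be illustrated with a minimal pure-Python sketch. It uses the standard definitions (BAC as the mean per-class recall, and the multiclass generalization of MCC); the function name `classification_metrics` and the toy labels `"A"` and `"B"` are assumptions made for illustration and are not taken from the paper or from DREBIN.

```python
from collections import Counter
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Return accuracy, balanced accuracy, macro F1, and multiclass MCC."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    true_count = Counter(y_true)   # samples that really belong to each class
    pred_count = Counter(y_pred)   # samples predicted as each class
    hit_count = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    correct = sum(hit_count.values())

    accuracy = correct / n

    # Balanced accuracy: unweighted mean of per-class recall.
    recalls = [hit_count[k] / true_count[k] for k in labels if true_count[k]]
    bac = sum(recalls) / len(recalls)

    # Macro F1: unweighted mean of per-class F1 = 2TP / (2TP + FP + FN).
    f1_scores = []
    for k in labels:
        denom = true_count[k] + pred_count[k]
        f1_scores.append(2 * hit_count[k] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    # Multiclass MCC computed from the class totals.
    cov_tp = correct * n - sum(pred_count[k] * true_count[k] for k in labels)
    cov_pp = n * n - sum(pred_count[k] ** 2 for k in labels)
    cov_tt = n * n - sum(true_count[k] ** 2 for k in labels)
    mcc = cov_tp / sqrt(cov_pp * cov_tt) if cov_pp and cov_tt else 0.0

    return {"accuracy": accuracy, "bac": bac, "macro_f1": macro_f1, "mcc": mcc}


# Hypothetical imbalanced toy data: 8 samples of family "A", 2 of family "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]   # one "B" sample is misclassified as "A"
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On this toy data accuracy is 0.9 while BAC is 0.75, which shows how accuracy can hide poor recall on a small family when classes are imbalanced.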

Detection of Malicious Software by Analyzing Distinct Artifacts Using Machine Learning and Deep Learning Algorithms

Electronics ◽

10.3390/electronics10141694 ◽

2021 ◽

Vol 10 (14) ◽

pp. 1694

Author(s):

Mathew Ashik ◽

A. Jyothish ◽

S. Anandaram ◽

P. Vinod ◽

Francesco Mercaldo ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Deep Learning ◽

Support Vector ◽

Malware Analysis ◽

Learning Approaches ◽

Dynamic Features ◽

System Calls ◽

Prevention Methods ◽

Structural Aspects

Malware is one of the most significant threats in today’s computing world since the number of websites distributing malware is increasing at a rapid rate. Malware analysis and prevention methods are increasingly becoming necessary for computer systems connected to the Internet. Malware exploits a system’s vulnerabilities to steal valuable information without the user’s knowledge and stealthily sends it to remote servers controlled by attackers. Traditionally, anti-malware products use signatures for detecting known malware. However, the signature-based method does not scale in detecting obfuscated and packed malware. Because the cause of a problem is often best understood by studying the structural aspects of a program, such as mnemonics, instruction opcodes, and API calls, in this paper we investigate the relevance of the features of unpacked malicious and benign executables (mnemonics, instruction opcodes, and API calls) to identify a feature that classifies the executable. Prominent features are extracted using Minimum Redundancy and Maximum Relevance (mRMR) and Analysis of Variance (ANOVA). Experiments were conducted on four datasets using machine learning and deep learning approaches such as Support Vector Machine (SVM), Naïve Bayes, J48, Random Forest (RF), and XGBoost. In addition, we evaluate the performance of a collection of deep neural networks, such as a Deep Dense network, a One-Dimensional Convolutional Neural Network (1D-CNN), and a CNN-LSTM, in classifying unknown samples, and we observed promising results using APIs and system calls. On combining APIs/system calls with static features, a marginal performance improvement was attained compared with models trained only on dynamic features. Moreover, to improve accuracy, we implemented our solution using distinct deep learning methods and demonstrated a fine-tuned deep neural network that resulted in an F1-score of 99.1% and 98.48% on Dataset-2 and Dataset-3, respectively.

Android Malware Analysis Using Machine Learning Classifiers

Algorithms for Intelligent Systems - Proceedings of International Conference on Computational Intelligence and Emerging Power System ◽

10.1007/978-981-16-4103-9_15 ◽

2021 ◽

pp. 171-179

Author(s):

Sakshi Jain ◽

Tarul Khandelwal ◽

Yash Jain ◽

Jyoti Gajrani

Keyword(s):

Machine Learning ◽

Malware Analysis ◽

Android Malware ◽

Machine Learning Classifiers ◽

Learning Classifiers

A Novel Model for Imbalanced Data Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6145 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6680-6687

Author(s):

Jian Yin ◽

Chunjing Gan ◽

Kaiqi Zhao ◽

Xuan Lin ◽

Zhe Quan ◽

...

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Classification Performance ◽

Classification Model ◽

Proposed Model ◽

Imbalanced Data Classification ◽

Public Datasets ◽

Distribution Cost ◽

Novel Model ◽

Learning Data

Recently, imbalanced data classification has received much attention due to its wide applications. In the literature, existing research has attempted to improve classification performance by considering various factors such as the imbalanced distribution, cost-sensitive learning, data space improvement, and ensemble learning. Nevertheless, most of the existing methods focus on only some of these main factors. In this work, we propose a novel imbalanced data classification model that considers all of them. To evaluate the performance of our proposed model, we have conducted experiments based on 14 public datasets. The results show that our model outperforms the state-of-the-art methods in terms of recall, G-mean, F-measure and AUC.

Multi Labeled Imbalanced Data Classification Based on Advanced Min-Max Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l3718.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 1776-1778

Keyword(s):

Machine Learning ◽

Uncertain Data ◽

Imbalanced Data ◽

Data Classification ◽

Data Sets ◽

Imbalanced Data Classification ◽

Different Types ◽

Measured System ◽

Traditional Approaches

Some real-world applications, for example text categorization and subcellular localization of protein sequences, involve multi-label classification with imbalanced data. Various traditional approaches have been introduced to describe the relation between heuristic and task formulations and to classify different attributes of imbalanced, uncertain data sets. This work addresses these issues by using the min-max modular network. The min-max modular network can decompose a multi-label problem into a series of small two-class subproblems, which can then be combined by two simple principles. We also present several decomposition strategies to improve the performance of min-max modular networks. Experimental results on subcellular localization show that our method has better generalization performance than traditional SVMs in solving multi-label and imbalanced data problems. In addition, it is also much faster than traditional SVMs.

An Improved Random Forest Algorithm for Class-Imbalanced Data Classification and its Application in PAD Risk Factors Analysis

The Open Electrical & Electronic Engineering Journal ◽

10.2174/1874129001307010062 ◽

2013 ◽

Vol 7 (1) ◽

pp. 62-70 ◽

Cited By ~ 9

Author(s):

Dengju Yao ◽

Jing Yang ◽

Xiaojuan Zhan

Keyword(s):

Machine Learning ◽

Random Forest ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Majority Voting ◽

Training Dataset ◽

Random Forest Algorithm ◽

Research Subjects ◽

Minority Class ◽

Imbalanced Data Classification

The classification problem is one of the important research subjects in the field of machine learning. However, most machine learning algorithms train a classifier on the assumption that the classes have almost equal numbers of training examples. When a classifier is trained on imbalanced data, its performance declines markedly. To resolve the class-imbalance problem, an improved random forest algorithm based on sampling with replacement was proposed. We randomly extracted multiple example subsets with replacement from the majority class, with each extracted subset containing the same number of examples as the minority class dataset. Then, multiple new training datasets were constructed by combining each extracted majority example subset with the minority class dataset, and multiple random forest classifiers were trained on these training datasets. For a prediction example, the class was determined by majority voting of the multiple random forest classifiers. The experimental results on five groups of UCI datasets and a real clinical dataset show that the proposed method could deal with the class-imbalanced data problem and that the improved random forest algorithm outperformed the original random forest and other methods in the literature.
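The resampling and voting steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: to stay self-contained it uses a trivial nearest-centroid learner as a stand-in for the random forest base classifier, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_training_sets(majority, minority, n_sets, rng):
    """Yield n_sets training sets, each pairing the full minority set with a
    same-sized sample drawn with replacement from the majority set."""
    for _ in range(n_sets):
        sampled = rng.choices(majority, k=len(minority))
        yield sampled + minority


def train_centroid_classifier(examples):
    """Stand-in base learner (NOT a random forest): nearest class centroid."""
    sums, counts = {}, Counter()
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

    def predict(x):
        # Pick the label whose centroid has the smallest squared distance to x.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    return predict


def train_ensemble(majority, minority, n_sets=5, seed=0,
                   train_fn=train_centroid_classifier):
    """Train one base classifier per balanced training set."""
    rng = random.Random(seed)
    return [train_fn(ts)
            for ts in balanced_training_sets(majority, minority, n_sets, rng)]


def majority_vote(classifiers, x):
    """Predict the class chosen by most of the base classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 50 majority ("neg") examples and 3 minority ("pos").
majority = [((float(i % 5), 0.0), "neg") for i in range(50)]
minority = [((10.0, 1.0), "pos"), ((11.0, 1.0), "pos"), ((12.0, 1.0), "pos")]
ensemble = train_ensemble(majority, minority)
print(majority_vote(ensemble, (11.0, 1.0)), majority_vote(ensemble, (1.0, 0.0)))
```

Passing a different `train_fn` (for example, a wrapper around a real random forest library) would bring the sketch closer to the method in the abstract; an odd `n_sets` avoids tied votes in the two-class case.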

Dcmd: Distance-based classification using mixture distributions on microbiome data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008799 ◽

2021 ◽

Vol 17 (3) ◽

pp. e1008799

Author(s):

Konstantin Shestopaloff ◽

Mei Dong ◽

Fan Gao ◽

Wei Xu

Keyword(s):

Machine Learning ◽

Count Data ◽

Human Microbiome ◽

Mixture Distribution ◽

Classification Performance ◽

Study Data ◽

Mixture Distributions ◽

Learning Approaches ◽

Nearest Neighbours ◽

Microbiome Data

Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented into a k-means classification and k-nearest neighbours framework. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.

Android malware analysis approach based on control flow graphs and machine learning algorithms

2016 4th International Symposium on Digital Forensic and Security (ISDFS) ◽

10.1109/isdfs.2016.7473512 ◽

2016 ◽

Cited By ~ 6

Author(s):

Mehmet Ali Atici ◽

Seref Sagiroglu ◽

Ibrahim Alper Dogru

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Control Flow ◽

Machine Learning Algorithms ◽

Analysis Approach ◽

Malware Analysis ◽

Android Malware ◽

Flow Graphs

Two Anatomists Are Better than One—Dual-Level Android Malware Detection

Symmetry ◽

10.3390/sym12071128 ◽

2020 ◽

Vol 12 (7) ◽

pp. 1128 ◽

Cited By ~ 1

Author(s):

Vasileios Kouliaridis ◽

Georgios Kambourakis ◽

Dimitris Geneiatakis ◽

Nektaria Potha

Keyword(s):

Machine Learning ◽

Static Analysis ◽

Web Application ◽

Classification Performance ◽

Detection Accuracy ◽

Android Malware ◽

Static And Dynamic Analysis ◽

Dynamic Instrumentation ◽

Android Malware Detection ◽

Hybrid Solutions

The openness of the Android operating system and its immense penetration into the market make it a hot target for malware writers. This work introduces Androtomist, a novel tool capable of symmetrically applying static and dynamic analysis of applications on the Android platform. Unlike similar hybrid solutions, Androtomist capitalizes on a wealth of features stemming from static analysis along with rigorous dynamic instrumentation to dissect applications and decide if they are benign or not. The focus is on anomaly detection using machine learning, but the system is able to autonomously conduct signature-based detection as well. Furthermore, Androtomist is publicly available as open source software and can be straightforwardly installed as a web application. The application itself is dual mode, that is, fully automated for the novice user and configurable for the expert one. As a proof-of-concept, we meticulously assess the detection accuracy of Androtomist against three different popular malware datasets and a handful of machine learning classifiers. We particularly concentrate on the classification performance achieved when the results of static analysis are combined with dynamic instrumentation vis-à-vis static analysis only. Our study also introduces an ensemble approach by averaging the output of all base classification models per malware instance separately, and provides deeper insight into the most influential features in the classification process. Depending on the employed dataset, for hybrid analysis, we report notably promising to excellent results in terms of the accuracy, F1, and AUC metrics.
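The ensemble step mentioned in this abstract, averaging the output of all base models for each instance, can be sketched in a few lines. The sketch assumes each base model outputs a malicious-class score in [0, 1] and that a 0.5 threshold turns the average into a decision; both assumptions, the function name `average_ensemble`, and the example scores are hypothetical and not taken from Androtomist.

```python
def average_ensemble(model_scores, threshold=0.5):
    """Average per-instance scores across base models.

    model_scores[m][i] is the score that base model m assigns to instance i.
    Returns the averaged scores and a list of booleans (True = flagged).
    """
    n_models = len(model_scores)
    n_instances = len(model_scores[0])
    if any(len(scores) != n_instances for scores in model_scores):
        raise ValueError("every model must score the same instances")

    averaged = [
        sum(scores[i] for scores in model_scores) / n_models
        for i in range(n_instances)
    ]
    flagged = [score >= threshold for score in averaged]
    return averaged, flagged


# Hypothetical scores from three base models for two app instances.
scores = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]
averaged, flagged = average_ensemble(scores)
print(averaged, flagged)
```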

Classification of malignant tumours in breast ultrasound using unsupervised machine learning approaches

Scientific Reports ◽

10.1038/s41598-021-81008-x ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Wei-Chung Shia ◽

Li-Sheng Lin ◽

Dar-Ren Chen

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Performance Enhancement ◽

Feature Selection Method ◽

Classification Performance ◽

Significant Feature ◽

Learning Approaches ◽

Malignant Tumours ◽

Local Weight ◽

Predictive Values

Traditional computer-aided diagnosis (CAD) processes include feature extraction, selection, and classification. Effective feature extraction in CAD is important in improving the classification’s performance. We introduce a machine-learning method and have designed an analysis procedure for classifying benign and malignant breast tumours in ultrasound (US) images without a need for a priori tumour region-selection processing, thereby decreasing clinical diagnosis efforts while maintaining high classification performance. Our dataset comprised 677 US images (benign: 312, malignant: 365). From the two-dimensional US images, the pyramid histogram of oriented gradients descriptor was extracted and utilised to obtain feature vectors. The correlation-based feature selection method was used to evaluate and select significant feature sets for further classification. Sequential minimal optimisation, combined with local weight learning, was utilised for classification and performance enhancement. The image dataset’s classification performance showed an 81.64% sensitivity and 87.76% specificity for malignant images (area under the curve = 0.847). The positive and negative predictive values were 84.1% and 85.8%, respectively. Here, a new workflow utilising machine learning to recognise malignant US images was proposed. Comparison of physician diagnoses and the automatic classifications made using machine learning yielded similar outcomes. This indicates the potential applicability of machine learning in clinical diagnoses.
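The sensitivity, specificity, and predictive values reported in this abstract all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `diagnostic_metrics` and the counts in the example are hypothetical and are not the study's confusion matrix, which the abstract does not give.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard diagnostic metrics from confusion-matrix counts.

    tp/fn: malignant cases classified correctly / missed.
    tn/fp: benign cases classified correctly / flagged as malignant.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of malignant cases detected
        "specificity": tn / (tn + fp),  # share of benign cases cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
result = diagnostic_metrics(tp=80, fn=20, tn=90, fp=10)
print(result)
```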

Identification of Diagnostic Plasma Biomarkers of Coronary Microvascular Disease in Postmenopausal Women Using Machine Learning Methods

Journal of the Endocrine Society ◽

10.1210/jendso/bvab048.586 ◽

2021 ◽

Vol 5 (Supplement_1) ◽

pp. A288-A288

Author(s):

Alicia Arredondo Eve ◽

Elif Tunc ◽

Yu-Jeh Liu ◽

Saumya Agrawal ◽

Huriye Erbak Yilmaz ◽

...

Keyword(s):

Machine Learning ◽

Postmenopausal Women ◽

Microvascular Disease ◽

Classification Performance ◽

Learning Approaches ◽

Plasma Samples ◽

Coronary Syndrome ◽

Healthy Women ◽

Metabolite Profiles ◽

Plasma Metabolite

Introduction: Coronary microvascular disease (CMD) affects small arteries that feed the heart and is more prevalent in postmenopausal women. Since CMD and coronary artery disease (CAD) have distinct pathologies but are treated the same way, the majority of patients with CMD do not receive a proper diagnosis and treatment, which in turn results in higher rates of adverse future events such as heart failure, sudden cardiac death, and acute coronary syndrome (ACS). Previously, we performed full metabolite profiling of plasma samples using GC-MS analysis and tested their classification performance using machine learning approaches. This initial proof-of-concept study showed that plasma metabolite profiles can be used to develop diagnostic signatures for CMD. In the current study, we hypothesize that plasma metabolite and protein composition is different for postmenopausal women with no heart disease, with CAD, or with CMD. Methods: We obtained plasma samples from 70 postmenopausal women who are healthy, women who have CMD, and women who have CAD at the time of blood collection. In addition to GC-MS metabolite profiles, we performed LC-MS metabolomic profiling and proteomic profiling of a panel of 92 proteins that were implicated in cardiometabolic disease. We identified a combination of metabolites and proteins, and further tested their classification performance using machine learning approaches to identify potential circulating biomarkers for CMD. Results: We identified a comprehensive list of metabolites and proteins involved in endothelial cell function, nitric oxide metabolism and inflammation, which were significantly different in plasma from women with CMD. We further validated differences in the levels of several protein biomarkers, such as RAGE, PTX3, AGRP, CNTN1, and MMP-3, which were statistically significantly higher in postmenopausal women with CMD when compared with healthy women or women with CAD.
Conclusion: Our research identified a group of potential molecules that can be used in the design of easy and low-cost blood biomarkers for the clinical diagnosis of CMD.
