Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms

2011 ◽  
Vol 30 (2) ◽  
pp. 19-50 ◽  
Author(s):  
Johan Perols

SUMMARY: This study compares the performance of six popular statistical and machine learning models in detecting financial statement fraud under different assumptions of misclassification costs and ratios of fraud firms to nonfraud firms. The results show, somewhat surprisingly, that logistic regression and support vector machines perform well relative to an artificial neural network, bagging, C4.5, and stacking. The results also reveal some diversity in predictors used across the classification algorithms. Out of 42 predictors examined, only six are consistently selected and used by different classification algorithms: auditor turnover, total discretionary accruals, Big 4 auditor, accounts receivable, meeting or beating analyst forecasts, and unexpected employee productivity. These findings extend financial statement fraud research and can be used by practitioners and regulators to improve fraud risk models. Data Availability: A list of fraud companies used in this study is available from the author upon request. All other data sources are described in the text.
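
As a rough illustration of the kind of cost-sensitive comparison described above, the sketch below pits logistic regression against an SVM on a synthetic, imbalanced stand-in for the fraud/nonfraud firm sample; the real data and the study's exact cost settings are not reproduced here, and the 10:1 cost ratio and all parameters are assumptions.

```python
# Hypothetical sketch: comparing classifiers under asymmetric
# misclassification costs, in the spirit of the study above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data: a rare "fraud" class (5% positives) stands in for
# the real fraud/nonfraud firm sample, which is not public here.
X, y = make_classification(n_samples=2000, n_features=42,
                           weights=[0.95, 0.05], random_state=0)

# Encode an assumed 10:1 cost of missing a fraud firm via class weights.
models = {
    "logistic regression": LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}),
    "SVM": SVC(kernel="rbf", class_weight={0: 1, 1: 10}),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```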

Author(s):  
Muskan Patidar

Abstract: Social networking platforms have given us more opportunities than ever before, and their benefits are undeniable. Despite these benefits, people may be humiliated, insulted, bullied, and harassed by anonymous users, strangers, or peers. Cyberbullying refers to the use of technology to humiliate and slander other people, and it often takes the form of hate messages sent through social media and email. With the exponential increase in social media users, cyberbullying has emerged as a form of bullying carried out through electronic messages. We propose a possible solution to this problem: our project aims to detect cyberbullying in tweets using machine learning classification algorithms such as Naïve Bayes, KNN, Decision Tree, Random Forest, and Support Vector Machine. We also apply NLTK (Natural Language Toolkit) unigram, bigram, trigram, and n-gram features to Naïve Bayes to check its accuracy. Finally, we compare the results of the proposed and baseline features with other machine learning algorithms. The findings of the comparison indicate the significance of the proposed features in cyberbullying detection. Keywords: Cyberbullying, Machine Learning Algorithms, Twitter, Natural Language Toolkit
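
As a minimal sketch of the n-gram experiments mentioned above, the snippet below builds unigram, bigram, and trigram bag-of-words features with scikit-learn's CountVectorizer and feeds them to a Naïve Bayes classifier; the tweets and labels are toy placeholders, not the project's dataset.

```python
# Minimal sketch with placeholder tweets: comparing unigram, bigram,
# and trigram features with a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["you are great", "nobody likes you", "have a nice day", "go away loser"]
labels = [0, 1, 0, 1]  # toy labels: 1 = cyberbullying, 0 = benign

for n in (1, 2, 3):  # unigrams, up to bigrams, up to trigrams
    model = make_pipeline(CountVectorizer(ngram_range=(1, n)), MultinomialNB())
    model.fit(tweets, labels)
    print(f"n={n}: training accuracy = {model.score(tweets, labels):.2f}")
```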


2020 ◽  
Vol 8 (5) ◽  
pp. 4624-4627

In recent years, a large amount of data has been generated about students, which can be utilized to decide a student's career path. This paper discusses some of the machine learning techniques that can be used to predict the performance of a student and help decide his or her career path. Some of the key Machine Learning (ML) algorithms applied in our research work are Linear Regression, Logistic Regression, Support Vector Machine, Naïve Bayes Classifier, and K-means Clustering. The aim of this paper is to predict the student's career path using machine learning algorithms. We compare the efficiencies of different ML classification algorithms on a real dataset obtained from university students.
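
The following sketch shows one way such a comparison could be run; the synthetic three-class dataset stands in for the (non-public) university student data, and the chosen classifiers and parameters are assumptions.

```python
# Illustrative sketch: comparing the cross-validated accuracy of
# several of the classifiers mentioned above on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Stand-in for the student dataset: 3 hypothetical career classes.
X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

for name, clf in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("SVM", SVC()),
                  ("Naive Bayes", GaussianNB())]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```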


The P300 speller in a Brain-Computer Interface (BCI) allows locked-in or completely paralyzed patients to communicate with others. To improve characterization performance and increase accuracy, machine learning techniques are used. This study addresses the detection and classification of the event-related potential (ERP) P300 signal using various machine learning algorithms. Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) classifiers are used to separate P300 from non-P300 segments of the electroencephalography (EEG) signal. The performance of the system is evaluated based on the F1-score using BCI Competition III dataset II. Both classifiers achieved 91.0% classification accuracy.
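
A hedged sketch of the LDA-versus-SVM comparison is given below; the simulated feature matrix stands in for epochs extracted from BCI Competition III dataset II, which is not bundled here, and the F1-score is computed on toy labels.

```python
# Sketch on simulated EEG feature vectors: LDA vs. SVM for P300 / non-P300.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))    # stand-in epoch features (e.g., downsampled channels)
y = rng.integers(0, 2, size=1000)  # toy labels: 1 = P300 present, 0 = absent
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()), ("SVM", SVC())]:
    clf.fit(X_train, y_train)
    print(name, "F1 =", round(f1_score(y_test, clf.predict(X_test)), 3))
```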


Author(s):  
Charu Latkar

For the protection and safety of railway networks, it is essential to promptly detect and identify faults in railway tracks. In this paper, railway track faults are diagnosed from the vertical and lateral acceleration measured by an MPU6050 inertial measurement unit, whose gyroscope and accelerometer are used to distinguish line and level irregularities in a railway track. A GSM module is used to report the location of faults on the tracks. The Arduino UNO microcontroller is programmed using the Arduino IDE. The results show that railway track irregularities and striations can be estimated constructively. The processed data is uploaded to the open-source cloud provider thingspeak.com. The use of various machine learning algorithms is proposed to accomplish the above tasks based on the commonly available measured signals. By considering the signals from multiple railway tracks in a geographic location, faults are diagnosed from their spatial and temporal dependencies. The irregularities in the railway tracks are detected using the inertial measurement unit, providing the necessary data about future deformities through machine learning. Using Python 3.0, a generative model is developed to show that the AdaBoost network can learn these dependencies directly from the data. The seven classification algorithms used in this project are Logistic Regression, Naïve Bayes, Support Vector Machine, an ensemble (averaging) learning algorithm, XGBoost Classifier, Extreme Learning Machine, and AdaBoost Classifier. Among these seven classification algorithms, AdaBoost gave the highest accuracy, i.e., 93.93%. The AdaBoost model is therefore used throughout the system.
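
The sketch below illustrates, on simulated acceleration features, how an AdaBoost classifier of the kind reported above might be trained and scored; the feature layout, labels, and parameters are assumptions, not the project's actual pipeline.

```python
# Sketch on simulated sensor features: training an AdaBoost classifier
# on per-segment acceleration statistics.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Toy feature matrix: e.g., mean/std of vertical and lateral acceleration per track segment.
X = rng.normal(size=(600, 4))
y = rng.integers(0, 2, size=600)  # toy labels: 1 = faulty segment, 0 = healthy

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```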


2020 ◽  
Vol 15 ◽  
Author(s):  
Shivani Aggarwal ◽  
Kavita Pandey

Background: Polycystic ovary syndrome, commonly known as PCOS, affects up to 18% of women of reproductive age and is the most commonly occurring hormone-related disorder. Symptoms of PCOS include irregular periods, increased facial and body hair growth, weight gain, darkening of the skin, diabetes, and trouble conceiving (infertility). It has also come to light that patients suffering from PCOS possess a range of metabolic abnormalities. These metabolic abnormalities may lead to disorders that increase the risk of insulin resistance, type 2 diabetes, and impaired glucose tolerance (a sign of prediabetes). Family members of women suffering from PCOS are also at higher risk of developing the same metabolic abnormalities. Obesity and overweight status contribute to insulin resistance in PCOS. Objective: In the modern era, several new technologies are available to diagnose PCOS, and machine learning algorithms are one of them, because they learn from past data to produce reliable and repeatable decisions. In this article, machine learning algorithms are used to identify the important features for diagnosing PCOS. Methods: Several classification algorithms, such as Support Vector Machine (SVM), Logistic Regression, Gradient Boosting, Random Forest, Decision Tree, and K-Nearest Neighbor (KNN), are applied to well-organized test datasets to classify large numbers of records. Initially, a dataset of 541 instances and 41 attributes was taken to apply the prediction models, and a manual feature selection was performed over it. Results: After the feature selection, a set of 12 attributes was identified that plays a crucial role in diagnosing PCOS. Conclusion: Several research efforts are progressing in the direction of diagnosing PCOS, but until now the relevant features had not been identified.
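
As one illustrative (and hedged) alternative to the manual feature selection described above, the sketch below ranks candidate attributes by random-forest importance on a synthetic stand-in for the 541-instance, 41-attribute dataset.

```python
# Hypothetical sketch: ranking attributes by random-forest importance
# as one way to narrow 41 features down to a short list of 12.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 541-instance, 41-attribute PCOS dataset.
X, y = make_classification(n_samples=541, n_features=41, n_informative=12, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top12 = np.argsort(rf.feature_importances_)[::-1][:12]
print("Indices of the 12 most important attributes:", top12)
```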


2017 ◽  
Author(s):  
Woo-Young Ahn ◽  
Paul Hendricks ◽  
Nathaniel Haines

Abstract: The easyml (easy machine learning) package lowers the barrier to entry to machine learning and is ideal for undergraduate/graduate students and practitioners who want to quickly apply machine learning algorithms to their research without having to worry about the best practices of implementing each algorithm. The package provides standardized recipes for regression and classification algorithms in R and Python and implements them in a functional, modular, and extensible framework. This package currently implements recipes for several common machine learning algorithms (e.g., penalized linear models, random forests, and support vector machines) and provides a unified interface to each one. Importantly, users can run and evaluate each machine learning algorithm with a single line of code. Each recipe is robust, implements best practices specific to each algorithm, and generates a report with details about the model and its performance, as well as journal-quality visualizations. The package’s functional, modular, and extensible framework also allows researchers and more advanced users to easily implement new recipes for other algorithms.
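
The snippet below does not reproduce easyml's actual API; it is only a hypothetical scikit-learn-based wrapper meant to illustrate the "one line per algorithm" idea the package aims for (the easy_fit helper is invented for this example).

```python
# Hypothetical one-line "recipe" wrapper (not easyml's real interface):
# a single call that cross-validates, reports, and fits a model.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

def easy_fit(model, X, y, cv=5):
    """Cross-validate a model, print a one-line report, and return it fitted."""
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{type(model).__name__}: mean CV accuracy = {scores.mean():.3f}")
    return model.fit(X, y)

X, y = load_iris(return_X_y=True)
easy_fit(RandomForestClassifier(random_state=0), X, y)  # single-line usage
```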


Author(s):  
Meenu Gupta ◽  
Vijender Kumar Solanki ◽  
Vijay Kumar Singh ◽  
Vicente García-Díaz

Data mining is used in various domains of research to identify new causes for effects observed in society across the globe. This article uses data mining for the same purpose: to identify accident occurrences in different regions and the most likely reasons for accidents happening around the world. Data mining and advanced machine learning algorithms are used in this research approach, and the article discusses hyperplanes, classification, pre-processing of the data, and training the machine with sample datasets collected from different regions, comprising both structured and semi-structured data. We dive deep into machine learning and data mining classification algorithms to find or predict something novel about accident occurrences around the globe. To keep the research and its tasks manageable, we concentrate on two basic but important classification algorithms: SVM (Support Vector Machine) and the CNB classifier. The discussion covers the WEKA tool for the CNB classifier, bag-of-words identification, word counts, and frequency calculation.
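
Assuming CNB here refers to Complement Naïve Bayes, the sketch below mirrors the described workflow in scikit-learn rather than WEKA: bag-of-words counts feeding an SVM and a CNB classifier. The accident reports and labels are toy placeholders.

```python
# Sketch with placeholder accident reports; CNB is assumed to mean
# Complement Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

reports = ["collision at wet intersection", "vehicle skidded on icy road",
           "rear-end crash in heavy traffic", "overturned truck on congested highway"]
labels = ["weather", "weather", "traffic", "traffic"]  # toy accident-cause labels

for clf in (ComplementNB(), LinearSVC()):
    model = make_pipeline(CountVectorizer(), clf)  # bag-of-words + word counts
    model.fit(reports, labels)
    print(type(clf).__name__, model.predict(["crash on icy highway"]))
```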


Author(s):  
SUNDARAMBAL BALARAMAN

Classification algorithms are widely used for the study of various categories of data located in multiple databases and have real-world applications. The main purpose of this research work is to assess the efficiency of classification algorithms in breast cancer analysis. The mortality rate among women increases due to frequent cases of breast cancer. The conventional method of diagnosing breast cancer is time consuming, and hence research is being carried out in multiple directions to address this issue. In this research work, Google Colab, an excellent environment for Python coders, is used as a tool to implement machine learning algorithms for predicting the type of cancer. The performance of the machine learning algorithms is analyzed based on the accuracy obtained from various classification models such as Logistic Regression, K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naïve Bayes, Decision Tree, and Random Forest. Experiments show that these classifiers work well for the classification of breast cancers, with accuracy > 90%, and logistic regression stood at the top with an accuracy of 98.5%. Implementation in Google Colab also made the task much easier, avoiding the hours of environment and supporting-library installation we used to spend earlier.
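
A runnable sketch of this kind of Colab workflow is shown below, using scikit-learn's built-in breast cancer dataset as a stand-in for the data used in the study and logistic regression as the classifier.

```python
# Sketch: train/test split and accuracy for a logistic regression
# classifier on scikit-learn's bundled breast cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000)  # logistic regression, the top performer reported above
clf.fit(X_train, y_train)
print("Accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 3))
```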


2019 ◽  
Vol 11 (11) ◽  
pp. 1351 ◽  
Author(s):  
Tsitsi Bangira ◽  
Silvia Maria Alfieri ◽  
Massimo Menenti ◽  
Adriaan van Niekerk

Small reservoirs play an important role in mining, industries, and agriculture, but storage levels or stage changes are very dynamic. Accurate and up-to-date maps of surface water storage and distribution are invaluable for informing decisions relating to water security, flood monitoring, and water resources management. Satellite remote sensing is an effective way of monitoring the dynamics of surface waterbodies over large areas. The European Space Agency (ESA) has recently launched constellations of Sentinel-1 (S1) and Sentinel-2 (S2) satellites carrying C-band synthetic aperture radar (SAR) and a multispectral imaging radiometer, respectively. The constellations improve global coverage of remotely sensed imagery and enable the development of near real-time operational products. This unprecedented data availability leads to an urgent need for the application of fully automatic, feasible, and accurate retrieval methods for mapping and monitoring waterbodies. The mapping of waterbodies can take advantage of the synthesis of SAR and multispectral remote sensing data in order to increase classification accuracy. This study compares automatic thresholding to machine learning, when applied to delineate waterbodies with diverse spectral and spatial characteristics. Automatic thresholding was applied to near-concurrent normalized difference water index (NDWI) (generated from S2 optical imagery) and VH backscatter features (generated from S1 SAR data). Machine learning was applied to a comprehensive set of features derived from S1 and S2 data. During our field surveys, we observed that the waterbodies visited had different sizes and varying levels of turbidity, sedimentation, and eutrophication. Five machine learning algorithms (MLAs), namely decision tree (DT), k-nearest neighbour (k-NN), random forest (RF), and two implementations of the support vector machine (SVM) were considered. Several experiments were carried out to better understand the complexities involved in mapping spectrally and spatially complex waterbodies. It was found that the combination of multispectral indices with SAR data is highly beneficial for classifying complex waterbodies and that the proposed thresholding approach classified waterbodies with an overall classification accuracy of 89.3%. However, the varying concentrations of suspended sediments (turbidity), dissolved particles, and aquatic plants negatively affected the classification accuracies of the proposed method, whereas the MLAs (SVM in particular) were less sensitive to such variations. The main disadvantage of using MLAs for operational waterbody mapping is the requirement for suitable training samples, representing both water and non-water land covers. The dynamic nature of reservoirs (many reservoirs are depleted at least once a year) makes the re-use of training data unfeasible. The study found that aggregating (combining) the thresholding results of two SAR and multispectral features, namely the S1 VH polarisation and the S2 NDWI, respectively, provided better overall accuracies than when thresholding was applied to any of the individual features considered. The accuracies of this dual thresholding technique were comparable to those of machine learning and may thus offer a viable solution for automatic mapping of waterbodies.
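
As a hedged sketch of the dual-thresholding idea, the snippet below computes NDWI from placeholder green and NIR reflectances, thresholds it and a placeholder VH backscatter layer automatically (Otsu's method is used here as an assumed stand-in for the paper's thresholding algorithm), and combines the two masks.

```python
# Sketch on synthetic reflectance arrays: NDWI from green/NIR bands,
# automatic thresholding, and aggregation with a VH backscatter mask.
import numpy as np
from skimage.filters import threshold_otsu

rng = np.random.default_rng(1)
green = rng.uniform(0.01, 0.4, size=(100, 100))  # placeholder S2 green-band reflectance
nir = rng.uniform(0.01, 0.4, size=(100, 100))    # placeholder S2 NIR-band reflectance
vh_db = rng.uniform(-30.0, -5.0, size=(100, 100))  # placeholder S1 VH backscatter (dB)

ndwi = (green - nir) / (green + nir + 1e-9)
water_optical = ndwi > threshold_otsu(ndwi)   # high NDWI suggests water
water_sar = vh_db < threshold_otsu(vh_db)     # low VH backscatter suggests water
water_combined = water_optical & water_sar    # one simple way to aggregate the two masks
print("Water fraction:", water_combined.mean())
```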


Author(s):  
Harsha A K ◽  
Thyagaraja Murthy A

The introduction of Transport Layer Security has been one of the most important contributors to the privacy and security of internet communications during the last decade. Malware authors have followed suit, using TLS to hide potentially dangerous network connections. Because of the growing use of encryption and other evasion measures, traditional content-based network traffic categorization is becoming more challenging. In this paper, we provide a malware classification technique that uses packet information and machine learning algorithms to detect malware. We employ classification algorithms such as support vector machine and random forest. We start by eliminating characteristics that are highly correlated. After discarding the unnecessary features, we use the random forest method to choose only the 10 best characteristics from those remaining. Following the feature selection phase, we employ several classification algorithms and evaluate their performance. The random forest algorithm performed exceptionally well in our experiments, achieving an accuracy score of over 0.99.
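
The sketch below retraces the described pipeline on a synthetic flow-feature table: dropping highly correlated columns, keeping the ten features ranked highest by random-forest importance, and scoring a classifier on the result; the 0.9 correlation cutoff and all other parameters are assumptions.

```python
# Sketch: correlation-based pruning, random-forest feature ranking,
# and cross-validated scoring on synthetic flow features.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=12, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# 1) Drop one of each pair of features with |correlation| > 0.9 (assumed cutoff).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# 2) Keep the 10 features ranked highest by random-forest importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(df, y)
top10 = df.columns[np.argsort(rf.feature_importances_)[::-1][:10]]

# 3) Evaluate a classifier on the selected features.
acc = cross_val_score(RandomForestClassifier(random_state=0), df[top10], y, cv=5).mean()
print("Mean CV accuracy:", round(acc, 3))
```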

