Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data

2020 ◽  
pp. 234094442095733
Author(s):  
Catia Nicodemo ◽  
Albert Satorra

New challenges arise in data visualization when research involves a sizable database. With many data points, classical scatterplots become uninformative due to overplotting. In contrast, simple displays such as the boxplot, which are of limited use in small samples, offer great potential for group comparison in extensive samples. This article presents exploratory data analysis methods useful for inspecting variation across groups in key variables and for detecting heterogeneity. Exploratory data analysis (introduced by Tukey in his seminal book of 1977) encompasses a set of statistical tools aimed at extracting information from data using simple graphical displays. In this article, some of these methods, such as the boxplot and the scatterplot, are revisited and enhanced with modern computational graphics (for example, the heat map), and their use is illustrated with Spanish Social Security data. We explore how earnings vary across factors such as age, gender, and type of occupation and contract; in particular, the gender gap in salaries is visualized along several dimensions related to the type of occupation. The exploratory methods are also applied to assessing and refining competing regressions by plotting residuals versus fitted values. The methods discussed should help researchers assess heterogeneity in data, across-group variation, and classical diagnostic plots of residuals from alternative model fits. JEL CLASSIFICATION: C55; J01; J08; Y10; C80
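The displays the abstract describes are easy to sketch. Below is a minimal, hypothetical Python example using synthetic salary data in place of the non-public Spanish Social Security records; every column name and coefficient is an illustrative assumption, not taken from the paper.

```python
# Boxplots by group and a residuals-versus-fitted heat map,
# on synthetic data standing in for large administrative records.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "gender": rng.choice(["F", "M"], n),
    "occupation": rng.choice(["manual", "clerical", "technical"], n),
})
df["log_wage"] = (2.0 + 0.01 * df["age"]
                  + 0.15 * (df["gender"] == "M")   # toy gender gap
                  + rng.normal(0, 0.3, n))

# Boxplots scale to large n where scatterplots clutter:
df.boxplot(column="log_wage", by=["gender", "occupation"], rot=45)

# Residuals versus fitted values for a candidate regression, drawn as
# a 2-D density (heat map) rather than a scatter of 100,000 points:
fit = smf.ols("log_wage ~ age + gender + occupation", data=df).fit()
plt.figure()
plt.hexbin(fit.fittedvalues, fit.resid, gridsize=60, cmap="viridis")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```

The hexbin heat map replaces the cluttered residual scatterplot in the same spirit as the boxplot replaces the raw scatterplot for group comparison.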


Author(s):  
Xiaohui Liu

Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the effective analysis of data. IDA draws techniques from diverse fields, including artificial intelligence, databases, high-performance computing, pattern recognition, and statistics. These fields often complement each other (e.g., many statistical methods, particularly those for large data sets, rely on computation, but brute computing power is no substitute for statistical knowledge) (Berthold & Hand, 2003; Liu, 1999).


2019 ◽  
Vol 8 (3) ◽  
pp. 8929-8936

Government-initiated social security schemes in countries such as India target a large proportion of the population, providing various types of benefits and involving a number of stakeholders. Such schemes are executed through a large number of transactions between government agencies and the other stakeholders in real time, resulting in large data sets. Current research on social security schemes covers the analysis of sequential activities and debt occurrences for such transactions at the national level only. Monitoring and evaluating the performance of such gigantic schemes, which also involve financial decision making at several levels, has been a challenge in recent times. This paper proposes an innovative framework that combines data mining strategies with actuarial techniques to evaluate, at the family level, one of the popular schemes in India, AB-PMJAY ("Ayushman Bharat–Pradhan Mantri Jan Arogya Yojana"), launched by the Government in 2018. In the proposed framework, the scheme is divided into a number of sub-processes, and data mining techniques such as clustering, classification, and anomaly detection, together with actuarial techniques for pricing, are proposed to evaluate the scheme effectively at the micro level.
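As a rough illustration of two of the proposed sub-process analyses, the hedged Python sketch below clusters synthetic family-level utilization data and flags anomalous claims. The features, distributions, and model choices are invented stand-ins; the actual AB-PMJAY data and framework are not reproduced here.

```python
# Clustering and anomaly detection on synthetic family-level records.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Illustrative features per family: claim amount, visits, family size.
X = np.column_stack([
    rng.lognormal(8, 1, 5000),   # claim amount
    rng.poisson(3, 5000),        # hospital visits per year
    rng.integers(1, 8, 5000),    # family size
]).astype(float)

Xs = StandardScaler().fit_transform(X)

# Segment families into utilization profiles.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Xs)

# Flag records that deviate from the bulk, as candidates for audit.
flags = IsolationForest(random_state=0).fit_predict(Xs)  # -1 = anomaly

print("cluster sizes:", np.bincount(labels))
print("flagged records:", int((flags == -1).sum()))
```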




2012 ◽  
Vol 18 (4) ◽  
pp. 399-423 ◽  
Author(s):  
Amy de Buitléir ◽  
Michael Russell ◽  
Mark Daly

We describe the initial phase of a research project to develop an artificial life framework designed to extract knowledge from large data sets with minimal preparation or ramp-up time. In this phase, we evolved an artificial life population with a new brain architecture. The agents have sufficient intelligence to discover patterns in data and to make survival decisions based on those patterns. The species uses diploid reproduction, Hebbian learning, and Kohonen self-organizing maps, in combination with novel techniques such as using pattern-rich data as the environment and framing the data analysis as a survival problem for artificial life. The first generation of agents mastered the pattern discovery task well enough to thrive. Evolution further adapted the agents to their environment by making them a little more pessimistic, and also by making their brains more efficient.
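One ingredient of the agents' brains, the Kohonen self-organizing map, can be sketched in a few lines. The NumPy toy below illustrates a plain SOM only, under assumed grid size and learning schedules; it does not reproduce the paper's diploid reproduction, Hebbian learning, or survival framing.

```python
# A minimal Kohonen self-organizing map in NumPy.
import numpy as np

rng = np.random.default_rng(2)
grid, dim = 10, 3                   # 10x10 map over 3-D inputs
W = rng.random((grid, grid, dim))   # codebook vectors
ii, jj = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")

def train(W, X, epochs=20, lr0=0.5, sigma0=3.0):
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        sigma = sigma0 * (1 - t / epochs) + 0.5
        for x in X:
            # Best-matching unit: node whose weights are closest to x.
            d = np.linalg.norm(W - x, axis=2)
            bi, bj = np.unravel_index(d.argmin(), d.shape)
            # Pull the BMU and its neighbours toward the input.
            h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2)
                       / (2 * sigma ** 2))
            W += lr * h[..., None] * (x - W)
    return W

W = train(W, rng.random((500, dim)))
```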


Antibiotics ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 225 ◽  
Author(s):  
Antonio Gnoni ◽  
Emanuele De Nitto ◽  
Salvatore Scacco ◽  
Luigi Santacroce ◽  
Luigi Leonardo Palese

Sepsis is a life-threatening condition that accounts for numerous deaths worldwide, usually arising as a complication of common community infections (e.g., pneumonia) or of infections acquired during a hospital stay. Sepsis and septic shock, its most severe evolution, involve the whole organism, recruiting and producing a large number of molecules, mostly proteins. Proteins are dynamic entities, and many techniques and studies have been devoted to elucidating the relationship between the conformations proteins adopt and their function. Although molecular dynamics has a key role in understanding these relationships, the number of protein structures available in the databases is now so high that data sets can be built directly from experimentally determined structures. Techniques for dimensionality reduction and clustering can be applied in exploratory data analysis to obtain information on the function of these molecules, which may be very useful in immunology for better understanding the structure-activity relationships of the numerous proteins involved in host defense, particularly in septic patients. The large number of degrees of freedom that characterizes biomolecules requires techniques able to analyze data sets of this kind (with a small number of entries relative to the number of degrees of freedom). In this work we analyzed the ability of two different types of algorithm to provide information on the structures present in three data sets built from the experimental structures of allosteric proteins involved in sepsis. The results obtained by a principal component analysis algorithm and those obtained by a random projection algorithm are largely comparable, demonstrating the effectiveness of random projection methods in structural bioinformatics. The usefulness of random projection in exploratory data analysis is discussed, including validation of the obtained clusters. We chose these proteins because of their involvement in sepsis and septic shock, aiming to highlight the potential of bioinformatics to point out new diagnostic and prognostic tools for patients.
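The comparison the abstract reports can be mimicked on synthetic data. The scikit-learn sketch below contrasts principal component analysis with Gaussian random projection; the clustered blob data are an assumed stand-in for the protein-structure coordinates, which are not reproduced here.

```python
# PCA versus Gaussian random projection on few-samples,
# many-degrees-of-freedom data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import GaussianRandomProjection

# 60 "structures", 3000 coordinates each, with three latent clusters.
X, y = make_blobs(n_samples=60, n_features=3000, centers=3,
                  random_state=3)

Xp = PCA(n_components=2).fit_transform(X)   # for visual inspection
Xr = GaussianRandomProjection(n_components=10,
                              random_state=0).fit_transform(X)

# Random projection approximately preserves pairwise distances
# (Johnson-Lindenstrauss), which is why the cluster structure it
# yields tends to agree with PCA's on such data sets.
corr = np.corrcoef(pairwise_distances(X).ravel(),
                   pairwise_distances(Xr).ravel())[0, 1]
print(f"distance correlation, original vs projected: {corr:.2f}")
```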


Big Data analytics and Deep Learning are not two entirely separate concepts. Big Data refers to extremely large data sets that can be analyzed to find patterns and trends. Deep Learning is one technique that can help us find abstract patterns in such data. Applying Deep Learning to Big Data can uncover unknown and useful patterns that were previously out of reach, and it is one of the ways AI systems are becoming more capable. A working hypothesis in this regard is that more data yields more abstract knowledge. A concise survey of Big Data, Deep Learning, and the application of Deep Learning to Big Data is therefore warranted.
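As a toy illustration of deep learning extracting abstract patterns without labels, the hedged sketch below trains a small autoencoder in PyTorch; the architecture, data, and bottleneck size are illustrative assumptions, not drawn from the text above.

```python
# A tiny autoencoder: the bottleneck learns an abstract representation.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(4096, 32)               # stand-in for a large data set

model = nn.Sequential(
    nn.Linear(32, 8), nn.ReLU(),        # encoder: compress to 8 dims
    nn.Linear(8, 32),                   # decoder: reconstruct input
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()

codes = model[:2](X)                    # the learned abstract features
print("reconstruction loss:", float(loss))
```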


Psychology ◽  
2020 ◽  
Author(s):  
Jeffrey Stanton

The term “data science” refers to an emerging field of research and practice that focuses on obtaining, processing, visualizing, analyzing, preserving, and re-using large collections of information. A related term, “big data,” has been used to refer to one of the important challenges faced by data scientists in many applied environments: the need to analyze large data sources, in certain cases using high-speed, real-time data analysis techniques. Data science encompasses much more than big data, however, as a result of many advancements in cognate fields such as computer science and statistics. Data science has also benefited from the widespread availability of inexpensive computing hardware—a development that has enabled “cloud-based” services for the storage and analysis of large data sets. The techniques and tools of data science have broad applicability in the sciences. Within the field of psychology, data science offers new opportunities for data collection and data analysis that have begun to streamline and augment efforts to investigate the brain and behavior. The tools of data science also enable new areas of research, such as computational neuroscience. As an example of the impact of data science, psychologists frequently use predictive analysis as an investigative tool to probe the relationships between a set of independent variables and one or more dependent variables. While predictive analysis has traditionally been accomplished with techniques such as multiple regression, recent developments in the area of machine learning have put new predictive tools in the hands of psychologists. These machine learning tools relax distributional assumptions and facilitate exploration of non-linear relationships among variables. These tools also enable the analysis of large data sets by opening options for parallel processing. In this article, a range of relevant areas from data science is reviewed for applicability to key research problems in psychology including large-scale data collection, exploratory data analysis, confirmatory data analysis, and visualization. This bibliography covers data mining, machine learning, deep learning, natural language processing, Bayesian data analysis, visualization, crowdsourcing, web scraping, open source software, application programming interfaces, and research resources such as journals and textbooks.
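The contrast drawn above between multiple regression and machine-learning alternatives can be made concrete. The sketch below, on synthetic data with a deliberately non-linear effect, compares cross-validated fit for ordinary least squares and a random forest; all variables and settings are illustrative assumptions.

```python
# OLS versus a random forest when one predictor acts non-linearly.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 5))                    # five predictors
y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.3, 2000)

for name, model in [("OLS", LinearRegression()),
                    ("random forest",
                     RandomForestRegressor(random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:14s} cross-validated R^2 = {r2:.2f}")
```

The forest captures the sinusoidal effect that the linear model misses, illustrating why such tools relax distributional and linearity assumptions.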

