Exploratory Data Analysis on Large Data Sets: The Example of Salary Variation in Spanish Social Security Data

2020 ◽  
pp. 234094442095733
Author(s):  
Catia Nicodemo ◽  
Albert Satorra

New challenges arise in data visualization when research involves a sizable database. With many data points, classical scatterplots become uninformative due to overplotting. In contrast, simple displays such as the boxplot, which are of limited use in small samples, offer great potential for group comparison in extensive samples. This article presents exploratory data analysis methods useful for inspecting variation across groups in key variables and for detecting heterogeneity. Exploratory data analysis (introduced by Tukey in his seminal book of 1977) encompasses a set of statistical tools aimed at extracting information from data using simple graphical displays. In this article, some of these methods, such as the boxplot and the scatterplot, are revisited and enhanced with modern computational graphics (for example, the heat map), and their use is illustrated with Spanish Social Security data. We explore how earnings vary across factors such as age, gender, and type of occupation and contract; in particular, the gender gap in salaries is visualized along several dimensions related to the type of occupation. The exploratory methods are also applied to assessing and refining competing regressions by plotting residuals versus fitted values. The methods discussed should help researchers assess heterogeneity in data, across-group variation, and classical diagnostic plots of residuals from alternative model fits. JEL CLASSIFICATION: C55; J01; J08; Y10; C80
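The displays the abstract describes are easy to sketch. Below is a minimal, hypothetical Python example using synthetic salary data in place of the non-public Spanish Social Security records; every column name and coefficient is an illustrative assumption, not taken from the paper.

```python
# Boxplots by group and a residuals-versus-fitted heat map,
# on synthetic data standing in for large administrative records.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "gender": rng.choice(["F", "M"], n),
    "occupation": rng.choice(["manual", "clerical", "technical"], n),
})
df["log_wage"] = (2.0 + 0.01 * df["age"]
                  + 0.15 * (df["gender"] == "M")   # toy gender gap
                  + rng.normal(0, 0.3, n))

# Boxplots scale to large n where scatterplots clutter:
df.boxplot(column="log_wage", by=["gender", "occupation"], rot=45)

# Residuals versus fitted values for a candidate regression, drawn as
# a 2-D density (heat map) rather than a scatter of 100,000 points:
fit = smf.ols("log_wage ~ age + gender + occupation", data=df).fit()
plt.figure()
plt.hexbin(fit.fittedvalues, fit.resid, gridsize=60, cmap="viridis")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```

The hexbin heat map replaces the cluttered residual scatterplot in the same spirit as the boxplot replaces the raw scatterplot for group comparison.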


Author(s):  
Xiaohui Liu

Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the effective analysis of data. IDA draws techniques from diverse fields, including artificial intelligence, databases, high-performance computing, pattern recognition, and statistics. These fields often complement each other (e.g., many statistical methods, particularly those for large data sets, rely on computation, but brute computing power is no substitute for statistical knowledge) (Berthold & Hand, 2003; Liu, 1999).


2019 ◽  
Vol 8 (3) ◽  
pp. 8929-8936

Government-initiated social security schemes in countries such as India target a large proportion of the population, providing various types of benefits and involving a number of stakeholders. Such schemes are executed through a large number of transactions between government agencies and the other stakeholders in real time, resulting in large data sets. Current research on social security schemes covers the analysis of sequential activities and debt occurrences for such transactions at the national level only. Monitoring and evaluating the performance of such gigantic schemes, which also involve financial decision making at several levels, has been a challenge in recent times. This paper proposes an innovative framework that combines data mining strategies with actuarial techniques to evaluate, at the family level, one of the popular schemes in India, AB-PMJAY ("Ayushman Bharat–Pradhan Mantri Jan Arogya Yojana"), launched by the Government in 2018. In the proposed framework, the scheme is divided into a number of sub-processes, and data mining techniques such as clustering, classification, and anomaly detection, together with actuarial techniques for pricing, are proposed to evaluate the scheme effectively at the micro level.
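As a rough illustration of two of the proposed sub-process analyses, the hedged Python sketch below clusters synthetic family-level utilization data and flags anomalous claims. The features, distributions, and model choices are invented stand-ins; the actual AB-PMJAY data and framework are not reproduced here.

```python
# Clustering and anomaly detection on synthetic family-level records.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Illustrative features per family: claim amount, visits, family size.
X = np.column_stack([
    rng.lognormal(8, 1, 5000),   # claim amount
    rng.poisson(3, 5000),        # hospital visits per year
    rng.integers(1, 8, 5000),    # family size
]).astype(float)

Xs = StandardScaler().fit_transform(X)

# Segment families into utilization profiles.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Xs)

# Flag records that deviate from the bulk, as candidates for audit.
flags = IsolationForest(random_state=0).fit_predict(Xs)  # -1 = anomaly

print("cluster sizes:", np.bincount(labels))
print("flagged records:", int((flags == -1).sum()))
```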




2012 ◽  
Vol 18 (4) ◽  
pp. 399-423 ◽  
Author(s):  
Amy de Buitléir ◽  
Michael Russell ◽  
Mark Daly

We describe the initial phase of a research project to develop an artificial life framework designed to extract knowledge from large data sets with minimal preparation or ramp-up time. In this phase, we evolved an artificial life population with a new brain architecture. The agents have sufficient intelligence to discover patterns in data and to make survival decisions based on those patterns. The species uses diploid reproduction, Hebbian learning, and Kohonen self-organizing maps, in combination with novel techniques such as using pattern-rich data as the environment and framing the data analysis as a survival problem for artificial life. The first generation of agents mastered the pattern discovery task well enough to thrive. Evolution further adapted the agents to their environment by making them a little more pessimistic, and also by making their brains more efficient.
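One ingredient of the agents' brains, the Kohonen self-organizing map, can be sketched in a few lines. The NumPy toy below illustrates a plain SOM only, under assumed grid size and learning schedules; it does not reproduce the paper's diploid reproduction, Hebbian learning, or survival framing.

```python
# A minimal Kohonen self-organizing map in NumPy.
import numpy as np

rng = np.random.default_rng(2)
grid, dim = 10, 3                   # 10x10 map over 3-D inputs
W = rng.random((grid, grid, dim))   # codebook vectors
ii, jj = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")

def train(W, X, epochs=20, lr0=0.5, sigma0=3.0):
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)
        sigma = sigma0 * (1 - t / epochs) + 0.5
        for x in X:
            # Best-matching unit: node whose weights are closest to x.
            d = np.linalg.norm(W - x, axis=2)
            bi, bj = np.unravel_index(d.argmin(), d.shape)
            # Pull the BMU and its neighbours toward the input.
            h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2)
                       / (2 * sigma ** 2))
            W += lr * h[..., None] * (x - W)
    return W

W = train(W, rng.random((500, dim)))
```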


Antibiotics ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 225 ◽  
Author(s):  
Antonio Gnoni ◽  
Emanuele De Nitto ◽  
Salvatore Scacco ◽  
Luigi Santacroce ◽  
Luigi Leonardo Palese

Sepsis is a life-threatening condition that accounts for numerous deaths worldwide, usually arising as a complication of common community infections (e.g., pneumonia) or of infections acquired during a hospital stay. Sepsis and septic shock, its most severe evolution, involve the whole organism, recruiting and producing a large number of molecules, mostly proteins. Proteins are dynamic entities, and many techniques and studies have been devoted to elucidating the relationship between the conformations proteins adopt and their function. Although molecular dynamics has a key role in understanding these relationships, the number of protein structures available in the databases is now so high that data sets can be built directly from experimentally determined structures. Techniques for dimensionality reduction and clustering can be applied in exploratory data analysis to obtain information on the function of these molecules, which may be very useful in immunology for better understanding the structure-activity relationships of the numerous proteins involved in host defense, particularly in septic patients. The large number of degrees of freedom that characterizes biomolecules requires techniques able to analyze data sets of this kind (with a small number of entries relative to the number of degrees of freedom). In this work we analyzed the ability of two different types of algorithm to provide information on the structures present in three data sets built from the experimental structures of allosteric proteins involved in sepsis. The results obtained by a principal component analysis algorithm and those obtained by a random projection algorithm are largely comparable, demonstrating the effectiveness of random projection methods in structural bioinformatics. The usefulness of random projection in exploratory data analysis is discussed, including validation of the obtained clusters. We chose these proteins because of their involvement in sepsis and septic shock, aiming to highlight the potential of bioinformatics to point out new diagnostic and prognostic tools for patients.
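The comparison the abstract reports can be mimicked on synthetic data. The scikit-learn sketch below contrasts principal component analysis with Gaussian random projection; the clustered blob data are an assumed stand-in for the protein-structure coordinates, which are not reproduced here.

```python
# PCA versus Gaussian random projection on few-samples,
# many-degrees-of-freedom data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import GaussianRandomProjection

# 60 "structures", 3000 coordinates each, with three latent clusters.
X, y = make_blobs(n_samples=60, n_features=3000, centers=3,
                  random_state=3)

Xp = PCA(n_components=2).fit_transform(X)   # for visual inspection
Xr = GaussianRandomProjection(n_components=10,
                              random_state=0).fit_transform(X)

# Random projection approximately preserves pairwise distances
# (Johnson-Lindenstrauss), which is why the cluster structure it
# yields tends to agree with PCA's on such data sets.
corr = np.corrcoef(pairwise_distances(X).ravel(),
                   pairwise_distances(Xr).ravel())[0, 1]
print(f"distance correlation, original vs projected: {corr:.2f}")
```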


Big Data analytics and Deep Learning are not two entirely separate concepts. Big Data refers to extremely large data sets that can be analyzed to find patterns and trends. Deep Learning is one technique that can help us find abstract patterns in such data. Applying Deep Learning to Big Data can uncover unknown and useful patterns that were previously out of reach, and it is one of the ways AI systems are becoming more capable. A working hypothesis in this regard is that more data yields more abstract knowledge. A concise survey of Big Data, Deep Learning, and the application of Deep Learning to Big Data is therefore warranted.
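As a toy illustration of deep learning extracting abstract patterns without labels, the hedged sketch below trains a small autoencoder in PyTorch; the architecture, data, and bottleneck size are illustrative assumptions, not drawn from the text above.

```python
# A tiny autoencoder: the bottleneck learns an abstract representation.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(4096, 32)               # stand-in for a large data set

model = nn.Sequential(
    nn.Linear(32, 8), nn.ReLU(),        # encoder: compress to 8 dims
    nn.Linear(8, 32),                   # decoder: reconstruct input
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    opt.step()

codes = model[:2](X)                    # the learned abstract features
print("reconstruction loss:", float(loss))
```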


Psychology ◽  
2020 ◽  
Author(s):  
Jeffrey Stanton

The term “data science” refers to an emerging field of research and practice that focuses on obtaining, processing, visualizing, analyzing, preserving, and re-using large collections of information. A related term, “big data,” has been used to refer to one of the important challenges faced by data scientists in many applied environments: the need to analyze large data sources, in certain cases using high-speed, real-time data analysis techniques. Data science encompasses much more than big data, however, as a result of many advancements in cognate fields such as computer science and statistics. Data science has also benefited from the widespread availability of inexpensive computing hardware—a development that has enabled “cloud-based” services for the storage and analysis of large data sets. The techniques and tools of data science have broad applicability in the sciences. Within the field of psychology, data science offers new opportunities for data collection and data analysis that have begun to streamline and augment efforts to investigate the brain and behavior. The tools of data science also enable new areas of research, such as computational neuroscience. As an example of the impact of data science, psychologists frequently use predictive analysis as an investigative tool to probe the relationships between a set of independent variables and one or more dependent variables. While predictive analysis has traditionally been accomplished with techniques such as multiple regression, recent developments in the area of machine learning have put new predictive tools in the hands of psychologists. These machine learning tools relax distributional assumptions and facilitate exploration of non-linear relationships among variables. These tools also enable the analysis of large data sets by opening options for parallel processing. In this article, a range of relevant areas from data science is reviewed for applicability to key research problems in psychology including large-scale data collection, exploratory data analysis, confirmatory data analysis, and visualization. This bibliography covers data mining, machine learning, deep learning, natural language processing, Bayesian data analysis, visualization, crowdsourcing, web scraping, open source software, application programming interfaces, and research resources such as journals and textbooks.
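The contrast drawn above between multiple regression and machine-learning alternatives can be made concrete. The sketch below, on synthetic data with a deliberately non-linear effect, compares cross-validated fit for ordinary least squares and a random forest; all variables and settings are illustrative assumptions.

```python
# OLS versus a random forest when one predictor acts non-linearly.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 5))                    # five predictors
y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.3, 2000)

for name, model in [("OLS", LinearRegression()),
                    ("random forest",
                     RandomForestRegressor(random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:14s} cross-validated R^2 = {r2:.2f}")
```

The forest captures the sinusoidal effect that the linear model misses, illustrating why such tools relax distributional and linearity assumptions.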

