Questioning the legitimacy of data

2020 ◽  
Vol 40 (3) ◽  
pp. 259-272
Author(s):  
danah boyd

This paper is based on the closing keynote presentation given by danah boyd at the inaugural NISO Plus conference, held February 23–25, 2020 in Baltimore, MD (USA). It focuses on how data are used, and how they can be manipulated to meet specific objectives, both good and bad. The paper reinforces the importance of understanding the biases and limitations of any data set. Topics covered include data quality, data voids, data infrastructures, alternative facts, and agnotology. The paper stresses that data become legitimate because we collectively believe that those data are sound, valid, and fit for use. This means not only that there is power in collecting and disseminating data, but also that there is power in interpreting and manipulating them. The struggle over data's legitimacy says more about our society, and our values, than it says about the data themselves.

Trials ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Jessica E. Lockery ◽  
Taya A. Collyer ◽  
Christopher M. Reid ◽  
Michael E. Ernst ◽  
...  

Abstract
Background: Large-scale studies risk generating inaccurate and missing data due to the complexity of data collection. Technology has the potential to improve data quality by providing operational support to data collectors, but this potential is under-explored in community-based trials. The Aspirin in Reducing Events in the Elderly (ASPREE) trial developed a data suite specifically designed to support data collectors: the ASPREE Web Accessible Relational Database (AWARD). This paper describes AWARD and the impact of system design on data quality.
Methods: AWARD's operational requirements, conceptual design, key challenges and design solutions for data quality are presented. The impact of design features is assessed by comparing baseline data collected prior to implementation of key functionality (n = 1000) with data collected post implementation (n = 18,114). Overall data quality is assessed by data category.
Results: At baseline, implementation of user-driven functionality reduced staff error (from 0.3% to 0.01%), out-of-range data entry (from 0.14% to 0.04%) and protocol deviations (from 0.4% to 0.08%). In the longitudinal data set, which contained more than 39 million data values collected within AWARD, 96.6% of values were entered within the specified query range or found to be accurate upon querying; the remaining 3.4% were missing. Participant non-attendance at scheduled study activity was the most common cause of missing data. Costs associated with cleaning data in ASPREE were lower than expected compared with reports from other trials.
Conclusions: Clinical trials undertake complex operational activity in order to collect data, but technology rarely provides sufficient support. The AWARD suite provides proof of principle that designing technology to support data collectors can mitigate known causes of poor data quality and produce higher-quality data. Health information technology (IT) products that support the conduct of scheduled activity in addition to traditional data entry will enhance community-based clinical trials. A standardised framework for reporting data quality would aid comparisons across trials.
Trial registration: International Standard Randomized Controlled Trial Number Register, ISRCTN83772183. Registered on 3 March 2005.
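The abstract credits much of the improvement to user-driven functionality such as checks applied at the point of data entry. The sketch below illustrates that idea in Python; the field names, limits and messages are assumptions for the example, not details of the AWARD system.

```python
# Illustrative sketch of point-of-entry range validation of the kind the
# abstract credits with reducing out-of-range entry and staff error.
# Field names and limits are assumptions, not AWARD's actual configuration.

EXPECTED_RANGES = {
    "systolic_bp_mmHg": (70, 250),
    "weight_kg": (30, 250),
}

def validate_entry(field, value):
    """Return (accepted, message); out-of-range or missing values raise a
    query instead of being stored silently."""
    low, high = EXPECTED_RANGES[field]
    if value is None:
        return False, f"{field}: missing value - record reason (e.g. non-attendance)"
    if not low <= value <= high:
        return False, f"{field}: {value} outside expected range [{low}, {high}] - query before saving"
    return True, f"{field}: accepted"

print(validate_entry("systolic_bp_mmHg", 300))  # flagged for query
print(validate_entry("weight_kg", 82.5))        # accepted
```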


10.2196/18366 ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. e18366
Author(s):  
Maryam Zolnoori ◽  
Mark D Williams ◽  
William B Leasure ◽  
Kurt B Angstman ◽  
Che Ngufor

Background: Patient-centered registries are essential in population-based clinical care for patient identification and monitoring of outcomes. Although registry data may be used in real time for patient care, the same data may also be used for secondary analysis to assess disease burden, evaluate disease management and health care services, and support research. The design of a registry has major implications for the ability to use these clinical data effectively in research.
Objective: This study aims to develop a systematic framework to address the data and methodological issues involved in analyzing data from clinically designed patient-centered registries.
Methods: The systematic framework comprises 3 major components: visualizing the multifaceted and heterogeneous patient-centered registry using a data flow diagram, assessing and managing data quality issues, and identifying patient cohorts for addressing specific research questions.
Results: Using a clinical registry designed as part of a collaborative care program for adults with depression at Mayo Clinic, we demonstrate the impact of the proposed framework on data integrity. By following the data cleaning and refining procedures of the framework, we were able to generate high-quality data for research questions about the coordination and management of depression in a primary care setting. We describe the steps involved in converting clinically collected data into a viable research data set, using registry cohorts of depressed adults to assess the impact on high-cost service use.
Conclusions: The systematic framework discussed in this study sheds light on the inconsistency and data quality issues in patient-centered registries. The study provides a step-by-step procedure for addressing these challenges and for generating high-quality data for both quality improvement and research, which may enhance care and outcomes for patients.
International Registered Report Identifier (IRRID): DERR1-10.2196/18366
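As a loose illustration of the second and third components (managing data quality issues and extracting a research cohort), the following pandas sketch shows what such steps could look like; the column names and thresholds are assumptions for the example, not the Mayo Clinic registry schema.

```python
# Illustrative sketch: cleaning a clinically collected registry extract and
# selecting an analysis cohort. Column names (patient_id, enrollment_date,
# phq9_baseline, age) are assumed for the example.
import pandas as pd

def clean_registry(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicates, remove records with out-of-range values, and
    keep the earliest enrollment record per patient."""
    df = df.drop_duplicates()
    df = df[df["phq9_baseline"].between(0, 27)]   # PHQ-9 valid range is 0-27
    df = df[df["age"].between(18, 120)]           # adults only
    return (df.sort_values("enrollment_date")
              .groupby("patient_id", as_index=False)
              .first())

def depression_cohort(df: pd.DataFrame, min_phq9: int = 10) -> pd.DataFrame:
    """Research cohort: adults with at least moderate baseline depression."""
    return df[df["phq9_baseline"] >= min_phq9]
```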


2008 ◽  
Vol 2 (1) ◽  
pp. 192-216 ◽  
Author(s):  
R.A. Peppler ◽  
C.N. Long ◽  
D.L. Sisterson ◽  
D.D. Turner ◽  
C.P. Bahrmann ◽  
...  

We present an overview of key aspects of the Atmospheric Radiation Measurement (ARM) Program Climate Research Facility (ACRF) data quality assurance program. Processes described include instrument deployment and calibration; instrument and facility maintenance; data collection and processing infrastructure; data stream inspection and assessment; problem reporting, review and resolution; data archival, display and distribution; data stream reprocessing; engineering and operations management; and the roles of value-added data processing and targeted field campaigns in specifying data quality and characterizing field measurements. The paper also includes a discussion of recent directions in ACRF data quality assurance. A comprehensive, end-to-end data quality assurance program is essential for producing a high-quality data set from measurements made by automated weather and climate networks. The processes developed during the ARM Program offer a possible framework for use by other instrumentation- and geographically-diverse data collection networks and highlight the myriad aspects that go into producing research-quality data.
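To make the "data stream inspection and assessment" step concrete, here is a minimal sketch of automated quality-control flagging; the variable names, physical limits and flag codes are illustrative assumptions, not ARM/ACRF operational values.

```python
# Minimal sketch of automated data stream checks: physically plausible limits
# plus a "stuck value" test for a possibly frozen instrument. Limits and flag
# codes are assumptions for the example.
import numpy as np

LIMITS = {"temperature_C": (-90.0, 60.0), "shortwave_down_Wm2": (0.0, 1500.0)}

def qc_flags(name, values, stuck_window=12):
    """Return one integer flag per sample: 0 = ok, 1 = out of limits, 2 = suspect (stuck)."""
    values = np.asarray(values, dtype=float)
    low, high = LIMITS[name]
    flags = np.where((values < low) | (values > high), 1, 0)
    # Flag runs of identical values as suspect (possible stuck sensor).
    for i in range(len(values) - stuck_window + 1):
        window = values[i:i + stuck_window]
        if np.all(window == window[0]):
            flags[i:i + stuck_window] = np.maximum(flags[i:i + stuck_window], 2)
    return flags
```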


Methodology ◽  
2015 ◽  
Vol 11 (3) ◽  
pp. 81-88 ◽  
Author(s):  
Suzette M. Matthijsse ◽  
Edith D. de Leeuw ◽  
Joop J. Hox

Abstract. Most web surveys collect data through nonprobability or opt-in online panels, which are characterized by self-selection. A concern in online research is the emergence of professional respondents, who participate in surveys frequently and do so mainly for the incentives. This study investigates whether professional respondents can be distinguished in online panels and whether they provide lower-quality data than nonprofessionals. We analyzed a data set from the NOPVO (Netherlands Online Panel Comparison) study, which includes 19 panels that together capture 90% of the respondents in online market research in the Netherlands. Latent class analysis showed that four types of respondents can be distinguished, ranging from the professional respondent to the altruistic respondent. A profile of professional respondents is presented. Professional respondents appear not to be a great threat to data quality.
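The study itself fitted a latent class model to panel data. As a loose illustration of grouping respondents into latent types, the sketch below fits a Gaussian mixture with four components to hypothetical participation features; both the features and the data are invented for the example, and a Gaussian mixture is only a stand-in for a proper latent class model on categorical indicators.

```python
# Illustrative stand-in for classifying respondents into latent types.
# Features (panels joined, surveys per month, incentive-motivation score)
# and data are hypothetical; four components mirror the reported result.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.poisson(2, 500),        # number of panels joined
    rng.poisson(4, 500),        # surveys completed per month
    rng.integers(1, 6, 500),    # incentive-motivation score (1-5)
]).astype(float)

model = GaussianMixture(n_components=4, random_state=0).fit(X)
respondent_type = model.predict(X)      # latent type label per respondent
print(np.bincount(respondent_type))     # size of each respondent type
```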


IUCrJ ◽  
2019 ◽  
Vol 6 (5) ◽  
pp. 868-883 ◽  
Author(s):  
Magdalena Woinska ◽  
Monika Wanat ◽  
Przemyslaw Taciak ◽  
Tomasz Pawinski ◽  
Wladek Minor ◽  
...  

In this work, two methods of high-resolution X-ray data refinement, multipole refinement (MM) and Hirshfeld atom refinement (HAR), together with X-ray wavefunction refinement (XWR), are applied to investigate the refinement of positions and anisotropic thermal motion of hydrogen atoms, experiment-based reconstruction of electron density, refinement of anharmonic thermal vibrations, and the effects of excluding the weakest reflections from the refinement. The study is based on X-ray data sets of varying quality collected for crystals of four quinoline derivatives with Cl, Br and I atoms and the -S-Ph group as substituents. Energetic investigations are performed, comprising the calculation of intermolecular interaction energies, cohesive energy and geometrical relaxation energy. The results obtained for experimentally derived structures are verified against values calculated for structures optimized using dispersion-corrected periodic density functional theory. For the high-quality data sets (the Cl and -S-Ph compounds), both MM and XWR could be used successfully to refine the atomic displacement parameters and positions of hydrogen atoms; however, the bond lengths obtained with XWR were more precise and closer to the theoretical values. For the more challenging data sets (the Br and I compounds), only XWR enabled free refinement of the hydrogen atom geometrical parameters; nevertheless, the results clearly reflected the poor data quality. For both refinement methods, the energy values (intermolecular interaction, cohesive and relaxation energies) calculated for the experimental structures were in similar agreement with the values for the optimized structures; the most significant divergences were observed when the experimental geometries were biased by poor data quality. XWR was found to be more robust in avoiding incorrect distortions of the reconstructed electron density caused by data quality issues. The analysis of anharmonic thermal motion refinement shows that, for correct interpretation of the results, the complete data set must be used, including the weak reflections.


2017 ◽  
Vol 4 (1) ◽  
pp. 25-31 ◽  
Author(s):  
Diana Effendi

The Information Product approach (IP approach) is an information management approach that can be used to manage information products and to analyze data quality. An IP-Map can be used by organizations to organize how data are collected, stored, maintained, and used. The data management process for academic activities at X University has not yet used the IP approach: the university has paid little attention to the quality of its information and has so far concentrated only on the application systems used to automate data management in its academic processes. The IP-Map produced in this paper can be used as a basis for analyzing the quality of data and information. With the IP-Map, X University is expected to identify which parts of the process need improvement in data and information quality management.

Index terms: IP approach, IP-Map, information quality, data quality.


Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. The MCAR, MAR, and NMAR missingness mechanisms are applied to the data set, and five levels of missing data are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to daily values. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. High levels of missing values were found for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy: missForest had the lowest imputation error (RMSE and MAE) among the compared imputation methods and can therefore be considered appropriate for analyzing air quality data.
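A minimal sketch of missForest-style imputation is shown below, using scikit-learn's IterativeImputer with a random forest estimator as an analogue of missForest (the study used the missForest package itself); the column layout and the evaluation helper are assumptions for the example.

```python
# missForest-style iterative imputation sketch: a random forest predicts each
# column with missing values from the others, cycling until convergence.
# Assumes pollutant columns are already log-transformed and climatological
# columns are included as auxiliary predictors.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def impute_air_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing values imputed."""
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=0),
        max_iter=10,
        random_state=0,
    )
    imputed = imputer.fit_transform(df)
    return pd.DataFrame(imputed, columns=df.columns, index=df.index)

def rmse_mae(true_vals: np.ndarray, imputed_vals: np.ndarray):
    """Errors on entries that were masked out for evaluation."""
    diff = imputed_vals - true_vals
    return float(np.sqrt(np.mean(diff ** 2))), float(np.mean(np.abs(diff)))
```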


Author(s):  
Sebastian Hoppe Nesgaard Jensen ◽  
Mads Emil Brix Doest ◽  
Henrik Aanæs ◽  
Alessio Del Bue

Abstract
Non-rigid structure from motion (nrsfm) is a long-standing and central problem in computer vision, and its solution is necessary for obtaining 3D information from multiple images when the scene is dynamic. A main issue for the further development of this important computer vision topic is the lack of high-quality data sets. We address this issue by presenting a data set created for this purpose, which is made publicly available and is considerably larger than the previous state of the art. To validate the applicability of this data set, and to provide an investigation into the state of the art of nrsfm, including potential directions forward, we present a benchmark and a scrupulous evaluation using this data set. The benchmark evaluates 18 different methods with available code that reasonably span the state of the art in sparse nrsfm. This new public data set and evaluation protocol will provide benchmark tools for further development in this challenging field.
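Benchmarks of this kind typically score a reconstruction by aligning it to ground truth before measuring error. The sketch below shows a generic per-frame metric (rigid alignment via orthogonal Procrustes, then mean per-point error); it is not necessarily the exact protocol used in the paper.

```python
# Generic nrsfm-style evaluation: align a reconstructed 3D shape to ground
# truth (rotation + translation) and report mean per-point error.
import numpy as np

def aligned_error(gt: np.ndarray, rec: np.ndarray) -> float:
    """gt, rec: (N, 3) arrays of 3D points for one frame."""
    gt_c = gt - gt.mean(axis=0)
    rec_c = rec - rec.mean(axis=0)
    # Orthogonal Procrustes: best rotation mapping rec_c onto gt_c.
    U, _, Vt = np.linalg.svd(rec_c.T @ gt_c)
    R = U @ Vt
    if np.linalg.det(R) < 0:      # avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    aligned = rec_c @ R
    return float(np.mean(np.linalg.norm(aligned - gt_c, axis=1)))

if __name__ == "__main__":
    pts = np.random.rand(20, 3)
    theta = 0.3
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    print(aligned_error(pts, pts @ Rz + 1.0))  # ~0 after alignment
```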

