Extending TSE to Administrative Data: A Quality Framework and Case Studies from Stats NZ

2017, Vol 33 (2), pp. 477-511
Author(s): Giles Reid, Felipa Zabala, Anders Holmberg

Abstract Many national statistics offices acknowledge that making better use of existing administrative data can reduce the cost of meeting ongoing statistical needs. Stats NZ has developed a framework to help facilitate this reuse. The framework is an adapted Total Survey Error (TSE) paradigm for understanding how the strengths and limitations of different data sets flow through a statistical design to affect final output quality. Our framework includes three phases: 1) a single source assessment, 2) an integrated data set assessment, and 3) an estimation and output assessment. We developed a process and guidelines for applying this conceptual framework to practical decisions about statistical design, and used these in recent redevelopment projects. We discuss how we used the framework with data sources that have a non-statistical primary purpose, and how it has helped us spread total survey error ideas to non-methodologists.

BMJ Open, 2019, Vol 9 (6), pp. e026759
Author(s): John T Y Soong, Jurgita Kaubryte, Danny Liew, Carol Jane Peden, Alex Bottle, ...

Objectives: This study aimed to examine the prevalence of frailty coding within the Dr Foster Global Comparators (GC) international database. We then aimed to develop and validate a risk prediction model, based on frailty syndromes, for key outcomes using the GC data set.
Design: A retrospective cohort analysis of data from patients over 75 years of age from the GC international administrative data. A risk prediction model was developed from the initial analysis based on seven frailty syndrome groups and their relationship to outcome metrics. A weighting was then created for each syndrome group and summated to create the Dr Foster Global Frailty Score. The predictive performance of the score was compared with an established prognostic comorbidity model (Elixhauser) and tested on another administrative database, Hospital Episode Statistics (2011-2015), for external validation.
Setting: 34 hospitals from nine countries across Europe, Australia, the UK and the USA.
Results: Of 6.7 million patient records in the GC database, 1.4 million (20%) were from patients aged 75 years or more. There was marked variation in the coding of frailty syndromes between countries and hospitals. Frailty syndromes were coded in 2% to 24% of patient spells. Falls and fractures was the most common syndrome coded (24%). The Dr Foster Global Frailty Score was significantly associated with in-hospital mortality, 30-day non-elective readmission and long length of hospital stay. The score had significant predictive capacity beyond that of other known predictors of poor outcome in older persons, such as comorbidity and chronological age. The score's predictive capacity was higher in the elective group than in the non-elective group, which may reflect improved performance in lower-acuity states.
Conclusions: Frailty syndromes can be coded in international secondary-care administrative data sets. The Dr Foster Global Frailty Score significantly predicts key outcomes. This methodology may feasibly be utilised for case-mix adjustment for older persons internationally.
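
The scoring approach described here, a weight per frailty syndrome group summed over the syndromes coded in a patient spell, can be sketched as below. The group names and weights are placeholders for illustration only; the published Dr Foster Global Frailty Score weights are not reproduced.

```python
# Illustrative syndrome-weighted frailty score (hypothetical weights and group names).
SYNDROME_WEIGHTS = {
    "falls_and_fractures": 3,
    "delirium": 4,
    "dementia": 5,
    "incontinence": 2,
    "pressure_ulcers": 4,
    "functional_dependence": 3,
    "mobility_problems": 2,
}

def frailty_score(coded_syndromes):
    """Sum the weights of the syndrome groups coded in one patient spell."""
    return sum(SYNDROME_WEIGHTS.get(s, 0) for s in coded_syndromes)

# Example: a spell coded with falls/fractures and delirium.
print(frailty_score(["falls_and_fractures", "delirium"]))  # -> 7
```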


2011, Vol 219-220, pp. 151-155
Author(s): Hua Ji, Hua Xiang Zhang

Learning from imbalanced data sets is a common challenge in many real-world domains. Because a skewed class distribution leads traditional classifiers to much lower accuracy on rare classes, we propose a novel classification method that uses local clustering guided by the data distribution of the imbalanced data set. First, we divide the whole data set into several groups according to the data distribution. Then we perform local clustering within each group, on both the normal class and the disjoint rare class. For the rare class, over-sampling is subsequently applied at different rates. Finally, we apply support vector machines (SVMs) for classification, using the traditional cost-matrix tactic to improve classification accuracy. Experimental results on several UCI data sets show that this method produces much higher prediction accuracy on the rare class than state-of-the-art methods.
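
A rough, simplified stand-in for the pipeline described (local clustering of the rare class, over-sampling toward balance, then a cost-sensitive SVM) is sketched below. KMeans, the resampling rate, and `class_weight="balanced"` are illustrative choices, not the authors' implementation, and binary 0/1 labels are assumed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def cluster_oversample_svm(X, y, rare_label=1, n_clusters=3, random_state=0):
    """Cluster the rare class locally, over-sample each cluster, and fit a
    cost-sensitive SVM. A simplified stand-in for the paper's pipeline."""
    X_rare, X_maj = X[y == rare_label], X[y != rare_label]
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X_rare)
    rng = np.random.default_rng(random_state)
    target = len(X_maj) // n_clusters        # over-sample each cluster toward balance
    synthetic = []
    for c in range(n_clusters):
        members = X_rare[km.labels_ == c]
        if len(members) == 0 or len(members) >= target:
            continue
        idx = rng.integers(0, len(members), size=target - len(members))
        synthetic.append(members[idx])       # replicate rare-class members
    X_new = np.vstack([X_maj, X_rare] + synthetic)
    y_new = np.concatenate([
        np.full(len(X_maj), 1 - rare_label),          # assumes binary 0/1 labels
        np.full(len(X_new) - len(X_maj), rare_label),
    ])
    # class_weight stands in for the cost matrix used to penalise rare-class errors.
    return SVC(kernel="rbf", class_weight="balanced").fit(X_new, y_new)
```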


2019, Vol 35 (1), pp. 137-165
Author(s): Jack Lothian, Anders Holmberg, Allyson Seyb

Abstract The linking of disparate data sets across time, space and sources is probably the foremost issue currently facing Central Statistical Agencies (CSAs). A review of the current literature on the challenges facing CSAs highlights three issues: 1) using administrative data effectively; 2) big data and what it means for CSAs; and 3) integrating disparate data sets (such as health, education and wealth) to provide measurable facts that can guide policy makers. CSAs are being challenged to address the same kinds of problems faced by Google, Facebook, and Yahoo, which use graphical/semantic web models for organizing, searching and analysing data. Additionally, time and space (geography) are becoming more important dimensions (domains) for CSAs as they start to explore new data sources and ways to integrate them to study relationships. Central agency methodologists are being pushed to include these new perspectives in their standard theories, practices and policies. Like most methodologists, the authors see surveys and the publication of their results as a process in which estimation is the key tool for achieving the final goal of an accurate statistical output. Randomness and sampling exist to support this goal, and early on it was clear to us that the incoming "it-is-what-it-is" data sources were not randomly selected. These sources were obviously biased and would thus produce biased estimates, so we set out to design a strategy to deal with this issue. This article presents a schema for integrating and linking traditional and non-traditional datasets. Like all survey methodologies, this schema addresses the fundamental issues of representativeness, estimation and total survey error measurement.
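
The point that non-randomly selected "it-is-what-it-is" sources yield biased estimates can be made concrete with a toy simulation. The numbers and the inclusion mechanism below are assumptions for illustration, not from the article.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy population: an income-like variable with a long right tail.
population = rng.lognormal(mean=10.0, sigma=0.6, size=1_000_000)

# A probability sample covers the whole population at random.
random_sample = rng.choice(population, size=10_000, replace=False)

# An "it-is-what-it-is" administrative source that over-covers high values
# (inclusion probability grows with the value itself).
p_include = np.clip(population / population.max(), 0.01, 1.0)
admin_source = population[rng.random(population.size) < p_include]

print(f"population mean:    {population.mean():,.0f}")
print(f"random-sample mean: {random_sample.mean():,.0f}")  # close to the truth
print(f"admin-source mean:  {admin_source.mean():,.0f}")   # biased upward
```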


2011, Vol 2 (4), pp. 12-23
Author(s): Rekha Kandwal, Prerna Mahajan, Ritu Vijay

This paper revisits the problem of active learning and decision making when labeling incurs a cost and unlabeled data is available in abundance. In many real-world applications large amounts of data are available, but the cost of labeling them correctly prohibits their use. In such cases, active learning can be employed. In this paper the authors propose rough-set-based clustering using an active learning approach. They extend the basic notion of Hamming distance to propose a dissimilarity measure that helps in finding the approximations of clusters in a given data set; the underlying theoretical foundation for this choice is rough set theory. The authors evaluated their algorithm on benchmark data sets from the UCI machine learning repository, with promising results.
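
A minimal sketch of a Hamming-style dissimilarity and the rough-set idea of lower/upper cluster approximations might look as follows. The authors' exact measure and clustering procedure are not reproduced here; the threshold and example records are illustrative.

```python
def dissimilarity(a, b):
    """Hamming-style dissimilarity: the fraction of attributes on which two
    categorical records disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def rough_assign(record, prototypes, threshold=0.5):
    """Return the set of cluster indices whose prototype lies within the threshold.
    A single index -> the record belongs to that cluster's lower approximation;
    several indices -> it lies in the boundary region (upper approximations only)."""
    d = [dissimilarity(record, p) for p in prototypes]
    close = {i for i, di in enumerate(d) if di <= threshold}
    return close or {min(range(len(d)), key=lambda i: d[i])}

# Example with three categorical attributes and two cluster prototypes.
prototypes = [("red", "small", "round"), ("red", "large", "square")]
print(rough_assign(("red", "small", "round"), prototypes))   # {0} -> lower approximation
print(rough_assign(("red", "small", "square"), prototypes))  # {0, 1} -> boundary region
```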


2006, Vol 17 (09), pp. 1313-1325
Author(s): Nikita A. Sakhanenko, George F. Luger, Hanna E. Makaruk, Joysree B. Aubrey, David B. Holtkamp

This paper considers a set of shock physics experiments that investigate how materials respond to extremes of deformation, pressure, and temperature when exposed to shock waves. Because of the complexity and cost of these tests, the available experimental data set is often very sparse. A support vector machine (SVM) technique for regression is used to estimate velocity measurements from the underlying experiments. Owing to its good generalization performance, the SVM method successfully interpolates the experimental data. Analysis of the resulting velocity surface provides more information on the physical phenomena of the experiment. Additionally, the estimated data can be used to identify outlier data sets, as well as to increase understanding of the other data from the experiment.
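
A rough sketch of SVM regression interpolating a sparse set of velocity measurements is shown below; the data are synthetic stand-ins for the experimental records, and the kernel parameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Sparse, noisy velocity-versus-time measurements standing in for experimental data.
t = np.sort(rng.uniform(0.0, 10.0, size=25))
v = 2.0 * np.tanh(t - 5.0) + rng.normal(scale=0.05, size=t.size)

# RBF-kernel support vector regression interpolates between the sparse points.
model = SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma=0.5).fit(t.reshape(-1, 1), v)

# Densely sampled estimate of the velocity curve (a 1-D analogue of the surface).
t_dense = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
v_hat = model.predict(t_dense)
print(v_hat[:5])
```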


2021, Vol 13 (18), pp. 3741
Author(s): Haifeng Zhang, Alexander Ignatov

In situ sea surface temperatures (SST) are the key component of the calibration and validation (Cal/Val) of satellite SST retrievals and of data assimilation (DA). The NOAA in situ SST Quality Monitor (iQuam) aims to collect, from various sources, all available in situ SST data and integrate them into a maximally complete, uniform, and accurate dataset to support these applications. For each in situ data type, iQuam strives to ingest data from several independent sources, to ensure the most complete coverage, at the cost of some redundancy in data feeds. The relative completeness of the various inputs and their consistency and mutual complementarity are often unknown and are the focus of this study. For four platform types customarily employed in satellite Cal/Val and DA (drifting buoys, tropical moorings, ships, and Argo floats), five widely known data sets are analyzed: (1) International Comprehensive Ocean-Atmosphere Data Set (ICOADS), (2) Fleet Numerical Meteorology and Oceanography Center (FNMOC), (3) Atlantic Oceanographic and Meteorological Laboratory (AOML), (4) Copernicus Marine Environment Monitoring Service (CMEMS), and (5) Argo Global Data Assembly Centers (GDACs). Each data set reports SSTs from one or more platform types. It is found that drifting buoys are more fully represented in FNMOC and CMEMS. Ships are reported in FNMOC and ICOADS, which are best used in conjunction with each other, but not in CMEMS. Tropical moorings are well represented in ICOADS, FNMOC, and CMEMS. Some CMEMS mooring reports are sampled every 10 min (compared with the standard 1 h sampling in all other datasets). The CMEMS Argo profiling data set is, as expected, nearly identical to the data from the two Argo GDACs.


Author(s): Leslie Roos, Elizabeth Wall-Wieler, Mahmoud Torabi

Introduction: Large population-based data sets present similar analytic issues across fields such as population health, clinical epidemiology, education, justice, and children's services. Step-wise approaches and generalized tools can bring together several pillars: big (typically administrative) data, programming, and study design/analysis. How can we improve efficiency and explore alternative designs?
Objectives and Approach: Linked data sets typically contain: 1) files presenting longitudinal histories; and 2) substantive files noting various events (concussions, burns, loss of a loved one, public housing entry) and several possible covariates and outcomes. Step-wise approaches enable automating tasks by developing general tools (decreasing programmer input) and facilitating alternative designs. Macros improve upon the classic 'one design, one data set' perspective. Two case studies highlight trade-offs in retrospective cohort studies (quasi-experiments) among sample size, length of follow-up, and the number of time periods.
Results: Study 1: Step 1 calculated the number of mothers with a child placed in care during various index years. Taking 1 year before and after placement generated 5,991 eligible mothers; selecting 5 years before/after decreased the N to 2,281. Step 2 selected appropriate in-province residents. Step 3 handled missing covariates and outcomes, while Step 4 ran alternative designs. One example (of several) compared maternal mental health outcomes using 8 time periods (in 2 years) before/after the event with outcomes using 16 time periods (in 4 years) before/after. Besides showing increasing maternal problems, the 4-year follow-up sometimes produced different statistically significant periods than the 2-year follow-up. Study 2: Swedish/Canadian comparisons of mothers with children placed in foster care highlighted growing differences in maternal pharmaceutical use.
Conclusion/Implications: Presenting design alternatives is straightforward and applicable across disciplines. Ongoing work is facilitating comparisons of 'experimental' and control groups. Literature-derived guidelines and simulation-based techniques should lead to better design decisions. Automated model assessment can help analyze robustness, statistical power, residuals, and bias, suggesting artificial intelligence approaches.
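
The trade-off illustrated in Study 1, where longer pre/post observation windows shrink the eligible cohort, reduces to a simple coverage filter over a linked file. The column names below are hypothetical, and the counts in the abstract come from the underlying registry data, not from this sketch.

```python
import pandas as pd

def eligible_mothers(cohort: pd.DataFrame, years_before: int, years_after: int) -> pd.DataFrame:
    """Keep mothers observable for the full window around the placement date.
    `cohort` is assumed to hold one row per mother with hypothetical columns
    placement_date, coverage_start, coverage_end (in-province registry coverage)."""
    start_needed = cohort["placement_date"] - pd.DateOffset(years=years_before)
    end_needed = cohort["placement_date"] + pd.DateOffset(years=years_after)
    ok = (cohort["coverage_start"] <= start_needed) & (cohort["coverage_end"] >= end_needed)
    return cohort[ok]

# Widening the window from 1 to 5 years before/after can only shrink the cohort:
# len(eligible_mothers(cohort, 1, 1)) >= len(eligible_mothers(cohort, 5, 5))
```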


2014, Vol 32 (30_suppl), pp. 212-212
Author(s): Stephen Flaherty, Robert Savage, Ingrid Stendhal, Susan Roston, Abhijeet Makhe, ...

212 Background: The ability to link cancer registry data to clinical and administrative data sets for quality improvement has long been desired. We sought to integrate registry data into a central data warehouse, in an effort to make consistent and reliable diagnosis and staging data available for the first time to a broad hospital user group. Methods: After a short period of data analysis, the tables for Cancer Registry data (Oracle) were modeled. The source data (SQL Server) were conformed and integrated using an ETL tool (Informatica). All ETL QA work was performed with SQL queries. Cancer Registry data were integrated into the reporting architecture (Microstrategy) to facilitate the design of standardized and ad hoc reports. Results: 140 distinct fields covering demographics, staging (clinical, pathological, collaborative), site-specific categories, diagnosis, and treatment were integrated into the Dana-Farber Analytics Reporting Tool (DART) for historic Cancer Registry data, beginning with cancers newly diagnosed in January 2010. All Cancer Registry data and patient files (new and old) are updated in DART on a monthly basis. Conclusions: Individuals across the hospital now have the ability to link clinical and administrative data from our EMR, institutional QI data from varied systems, and pharmacy data to Cancer Registry data in the DART tool. One example of the integration of these multiple data sets is the linkage of staging data from the Cancer Registry data set and time-to-referral data from the administrative data set by patient MRN. As DFCI aims to cohort its patients based on their primary diagnosis for quality improvement and other internal reporting needs, the ability to analyze patients in this way becomes critical. This project sets an example for other centers as they integrate Cancer Registry data into user-friendly business intelligence systems to help meet federal reporting mandates and aid internal improvement work. [Table: see text]
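
The linkage example mentioned in the conclusions, registry staging joined to administrative time-to-referral by patient MRN, reduces to a keyed merge. The table layouts and column names below are hypothetical illustrations, not the DART schema.

```python
import pandas as pd

# Hypothetical extracts; in DART these would come from the warehouse tables.
registry = pd.DataFrame({
    "mrn": ["A001", "A002", "A003"],
    "primary_site": ["breast", "lung", "colon"],
    "clinical_stage": ["II", "III", "I"],
})
admin = pd.DataFrame({
    "mrn": ["A001", "A002", "A004"],
    "days_to_referral": [12, 34, 9],
})

# Join staging data to time-to-referral data on the patient MRN.
linked = registry.merge(admin, on="mrn", how="inner")
print(linked)
```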


2020, Vol 75 (9-10), pp. 549-561
Author(s): Christian Beyer, Vishnu Unnikrishnan, Robert Brüggemann, Vincent Toulouse, Hafez Kader Omar, ...

Abstract Many current and future applications plan to provide entity-specific predictions. These range from individualized healthcare applications to user-specific purchase recommendations. In our previous stream-based work on Amazon review data, we could show that error-weighted ensembles that combine entity-centric classifiers, which are only trained on reviews of one particular product (entity), and entity-ignorant classifiers, which are trained on all reviews irrespective of the product, can improve prediction quality. This came at the cost of storing multiple entity-centric models in primary memory, many of which would never be used again as their entities would not receive future instances in the stream. To overcome this drawback and make entity-centric learning viable in these scenarios, we investigated two different methods of reducing the primary memory requirement of our entity-centric approach. Our first method uses the lossy counting algorithm for data streams to identify entities whose instances make up a certain percentage of the total data stream within an error-margin. We then store all models which do not fulfil this requirement in secondary memory, from which they can be retrieved in case future instances belonging to them should arrive later in the stream. The second method replaces entity-centric models with a much more naive model which only stores the past labels and predicts the majority label seen so far. We applied our methods on the previously used Amazon data sets which contained up to 1.4M reviews and added two subsets of the Yelp data set which contain up to 4.2M reviews. Both methods were successful in reducing the primary memory requirements while still outperforming an entity-ignorant model.
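
The lossy counting step, which keeps in primary memory only entities whose share of the stream exceeds a support threshold within an error margin, can be sketched as follows. This is a generic rendering of the Manku-Motwani lossy counting algorithm over entity identifiers, not the authors' code.

```python
import math

class LossyCounter:
    """Lossy counting over a stream of entity ids (Manku-Motwani style)."""

    def __init__(self, epsilon: float):
        self.epsilon = epsilon
        self.width = math.ceil(1.0 / epsilon)   # bucket width
        self.n = 0                              # items seen so far
        self.counts = {}                        # entity -> (count, delta)

    def add(self, entity) -> None:
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        count, delta = self.counts.get(entity, (0, bucket - 1))
        self.counts[entity] = (count + 1, delta)
        if self.n % self.width == 0:            # prune at bucket boundaries
            self.counts = {e: (c, d) for e, (c, d) in self.counts.items()
                           if c + d > bucket}

    def frequent(self, support: float):
        """Entities whose frequency may exceed `support` of the stream so far."""
        threshold = (support - self.epsilon) * self.n
        return [e for e, (c, _) in self.counts.items() if c >= threshold]

# Entities returned by `frequent` would keep their models in primary memory;
# the remaining entity-centric models could be swapped to secondary storage,
# as the paper describes.
```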


Author(s): Władysław Homenda, Agnieszka Jastrzębska, Witold Pedrycz, Fusheng Yu

Abstract In this paper, we look closely at the issue of contaminated data sets, where, apart from legitimate (proper) patterns, we encounter erroneous patterns. In a typical scenario, the classification of a contaminated data set is negatively influenced by garbage patterns (referred to as foreign patterns); ideally, we would like to remove them from the data set entirely. The paper is devoted to the comparison and analysis of three different models capable of classifying proper patterns while rejecting foreign patterns. It should be stressed that the studied models are constructed using proper patterns only, and no knowledge about the characteristics of foreign patterns is needed. The methods are illustrated with a case study of handwritten digit recognition, but the proposed approach itself is formulated in a general manner and can therefore be applied to different problems. We distinguish three structures: global, local, and embedded, all capable of eliminating foreign patterns while classifying proper patterns at the same time. A comparison of the proposed models shows that the embedded structure provides the best results, but at the cost of relatively high model complexity. The local architecture provides satisfying results and is at the same time relatively simple.
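
One way to realize a global rejection architecture of this kind, a single rejector trained only on proper patterns placed in front of a regular classifier, is sketched below with a one-class SVM gate. The paper's models are more elaborate; the components and parameters here are placeholders.

```python
from sklearn.datasets import load_digits
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, OneClassSVM

# Proper patterns only: handwritten digits.
X, y = load_digits(return_X_y=True)

# Global rejector: a one-class model of "proper" patterns, trained without any
# foreign (garbage) examples, mirroring the paper's assumption.
rejector = make_pipeline(StandardScaler(), OneClassSVM(nu=0.05, gamma="scale")).fit(X)

# Regular multi-class classifier for the accepted patterns.
classifier = make_pipeline(StandardScaler(), SVC(gamma="scale")).fit(X, y)

def classify_with_rejection(samples):
    """Reject patterns the one-class model flags as foreign; classify the rest."""
    accepted = rejector.predict(samples) == 1          # +1 = proper, -1 = foreign
    labels = classifier.predict(samples)
    return [int(label) if ok else None for label, ok in zip(labels, accepted)]

# Most proper digits are accepted and labelled; foreign patterns would map to None.
print(classify_with_rejection(X[:5]))
```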

