Exploring Alternative Designs using ‘Big’ Administrative Data

Introduction: Large population-based data sets present similar analytic issues across fields such as population health, clinical epidemiology, education, justice, and children’s services. Step-wise approaches and generalized tools can bring together several pillars: big (typically administrative) data, programming, and study design/analysis. How can we improve efficiency and explore alternative designs? Objectives and Approach: Linked data sets typically contain (1) files presenting longitudinal histories and (2) substantive files noting various events (concussions, burns, loss of a loved one, public housing entry) together with several possible covariates and outcomes. Step-wise approaches enable automating tasks by developing general tools (decreasing programmer input) and facilitating alternative designs. Macros improve upon the classic ‘one design, one data set’ perspective. Two case studies highlight tradeoffs in retrospective cohort studies (quasi-experiments) among sample size, length of follow-up, and the number of time periods. Results: Study 1: Step 1 calculated the number of mothers with a child placed in care during various index years. Taking 1 year before and after placement generated 5,991 eligible mothers; selecting 5 years before/after decreased the N to 2,281. Step 2 selected appropriate in-province residents. Step 3 handled missing covariates and outcomes, while Step 4 ran alternative designs. One example (of several) compared maternal mental health outcomes using 8 time periods (in 2 years) before/after the event with outcomes using 16 time periods (in 4 years) before/after. Besides showing increasing maternal problems, the 4-year follow-up sometimes produced different statistically significant periods than the 2-year follow-up. Study 2: Swedish/Canadian comparisons of mothers with children placed in foster care highlighted growing differences in maternal pharmaceutical use.
Conclusion/Implications: Presenting design alternatives is straightforward and applicable across disciplines. Ongoing work is facilitating comparisons of ‘experimental’ and control groups. Literature-derived guidelines and simulation-based techniques should lead to better design decisions. Automated model assessment can help analyze robustness, statistical power, residuals, and bias, suggesting a role for artificial intelligence approaches.
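The tradeoff between window length and sample size described in Study 1 can be illustrated with a small sketch. This is not the authors' macro code: the record layout (`index_year`, `coverage_start`, `coverage_end`) and the toy data are assumptions made up for illustration, and real eligibility rules (residency, missing covariates) would add further steps.

```python
from dataclasses import dataclass


@dataclass
class Mother:
    # All three fields are hypothetical names, not taken from the study.
    index_year: int      # year the child was placed in care
    coverage_start: int  # first year with continuous registry coverage
    coverage_end: int    # last year with continuous registry coverage


def eligible(m: Mother, years_before: int, years_after: int) -> bool:
    """True if the mother is observed for the whole window around the index year."""
    return (m.index_year - years_before >= m.coverage_start
            and m.index_year + years_after <= m.coverage_end)


def count_by_window(mothers, windows):
    """Return {window length: number of eligible mothers} for symmetric windows."""
    return {w: sum(eligible(m, w, w) for m in mothers) for w in windows}


# Toy data, invented for illustration only.
mothers = [
    Mother(2005, 1998, 2012),
    Mother(2005, 2003, 2008),
    Mother(2009, 2006, 2011),
    Mother(2010, 2009, 2011),
]
counts = count_by_window(mothers, windows=[1, 5])
print(counts)
```

Running the same function over several window lengths gives the kind of table from which a design can be chosen: longer windows allow more time periods but leave fewer eligible subjects.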


Dr Foster global frailty score: an international retrospective observational study developing and validating a risk prediction model for hospitalised older persons from administrative data sets

BMJ Open ◽

10.1136/bmjopen-2018-026759 ◽

2019 ◽

Vol 9 (6) ◽

pp. e026759 ◽

Cited By ~ 2

Author(s):

John T Y Soong ◽

Jurgita Kaubryte ◽

Danny Liew ◽

Carol Jane Peden ◽

Alex Bottle ◽

...

Keyword(s):

Prediction Model ◽

Risk Prediction ◽

Administrative Data ◽

Older Persons ◽

Cohort Analysis ◽

Risk Prediction Model ◽

Data Sets ◽

Predictive Capacity ◽

Data Set ◽

Frailty Score

Objectives: This study aimed to examine the prevalence of frailty coding within the Dr Foster Global Comparators (GC) international database. We then aimed to develop and validate a risk prediction model, based on frailty syndromes, for key outcomes using the GC data set. Design: A retrospective cohort analysis of data from patients over 75 years of age from the GC international administrative data. A risk prediction model was developed from the initial analysis based on seven frailty syndrome groups and their relationship to outcome metrics. A weighting was then created for each syndrome group and summated to create the Dr Foster Global Frailty Score. Performance of the score for predictive capacity was compared with an established prognostic comorbidity model (Elixhauser) and tested on another administrative database, Hospital Episode Statistics (2011-2015), for external validation. Setting: 34 hospitals from nine countries across Europe, Australia, the UK and USA. Results: Of 6.7 million patient records in the GC database, 1.4 million (20%) were from patients aged 75 years or more. There was marked variation in coding of frailty syndromes between countries and hospitals. Frailty syndromes were coded in 2% to 24% of patient spells. Falls and fractures was the most common syndrome coded (24%). The Dr Foster Global Frailty Score was significantly associated with in-hospital mortality, 30-day non-elective readmission and long length of hospital stay. The score had significant predictive capacity beyond that of other known predictors of poor outcome in older persons, such as comorbidity and chronological age. The score’s predictive capacity was higher in the elective group compared with non-elective, and may reflect improved performance in lower acuity states. Conclusions: Frailty syndromes can be coded in international secondary care administrative data sets. The Dr Foster Global Frailty Score significantly predicts key outcomes. This methodology may be feasibly utilised for case-mix adjustment for older persons internationally.
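The scoring step, a weight per syndrome group summed into a single score, can be sketched as follows. The abstract reports neither the weights nor the names of all seven groups, so every weight and every group name other than "falls and fractures" is a placeholder assumption.

```python
# Hypothetical weights. Only "falls and fractures" is named in the abstract;
# the remaining group names and all numeric values are placeholders.
WEIGHTS = {
    "falls_and_fractures": 2.0,
    "group_2": 1.5,
    "group_3": 1.0,
    "group_4": 1.0,
    "group_5": 0.5,
    "group_6": 0.5,
    "group_7": 0.5,
}


def frailty_score(coded_groups):
    """Sum the weights of the distinct syndrome groups coded in one patient spell."""
    groups = set(coded_groups)  # a group counts once even if coded repeatedly
    unknown = groups - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown syndrome groups: {sorted(unknown)}")
    return sum(WEIGHTS[g] for g in groups)


score = frailty_score(["falls_and_fractures", "group_2", "group_2"])
print(score)
```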


Defusing Technology: Technology Diffusion in British Columbia

International Journal of Technology Assessment in Health Care ◽

10.1017/s0266462300003020 ◽

1993 ◽

Vol 9 (1) ◽

pp. 46-61 ◽

Cited By ~ 5

Author(s):

Arminée Kazanjian ◽

Kathryn Friesen

Keyword(s):

British Columbia ◽

Historical Data ◽

Current Data ◽

Fiscal Year ◽

Data Sets ◽

Canadian Province ◽

Data Set ◽

Time Periods ◽

Institutional Profile ◽

Areas Of Interest

Abstract: In order to explore the diffusion of selected technologies in one Canadian province (British Columbia), two administrative data sets were analyzed. The data included over 40 million payment records for each fiscal year on medical services provided to British Columbia residents (2,968,769 in 1988) and information on physical facilities, services, and personnel from 138 hospitals in the province. Three specific time periods were examined in each data set, starting with 1979–80 and ending with the most current data available at the time. The detailed retrospective analysis of laboratory and imaging technologies provides historical data in three areas of interest: (a) patterns of diffusion and volume of utilization, (b) institutional profile, and (c) provider profile. The framework for the analysis focused, where possible, on the examination of determinants of diffusion that may be amenable to policy influence.


A practical introduction to methods for analyzing longitudinal data in the presence of missing data using a marijuana price survey

Journal of Criminal Psychology ◽

10.1108/jcp-02-2015-0007 ◽

2015 ◽

Vol 5 (2) ◽

pp. 137-148 ◽

Cited By ~ 2

Author(s):

Jeremy N.V Miles ◽

Priscillia Hunt

Keyword(s):

Missing Data ◽

Statistical Power ◽

Criminal Psychology ◽

Likelihood Estimation ◽

Data Sets ◽

Applied Psychology ◽

Artificial Data ◽

Data Set ◽

Content Type ◽

Biased Estimates

Purpose – In applied psychology research settings, such as criminal psychology, missing data are to be expected. Missing data can cause problems with both biased estimates and lack of statistical power. The paper aims to discuss these issues. Design/methodology/approach – Recently, sophisticated methods for appropriately dealing with missing data, so as to minimize bias and maximize power, have been developed. In this paper the authors use an artificial data set to demonstrate the problems that can arise with missing data, and make naïve attempts to handle data sets where some data are missing. Findings – With the artificial data set, and a data set comprising the results of a survey investigating prices paid for recreational and medical marijuana, the authors demonstrate the use of multiple imputation and maximum likelihood estimation for obtaining appropriate estimates and standard errors when data are missing. Originality/value – Missing data are ubiquitous in applied research. This paper demonstrates that techniques for handling missing data are accessible and should be employed by researchers.
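With multiple imputation, the estimates from the m imputed data sets must be combined into one estimate and one standard error. The abstract does not describe its pooling step; the sketch below uses the standard Rubin's rules (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance), and the numbers are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors.

    Returns (pooled estimate, pooled standard error) using Rubin's rules.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)      # pooled point estimate
    within = mean(variances)     # average within-imputation variance
    between = variance(estimates)  # sample variance of the estimates (n - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)


# Invented example: three imputed data sets, each giving a mean price
# estimate and the squared standard error of that estimate.
pooled, se = pool_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(pooled, se)
```

The pooled standard error is larger than any single-imputation standard error here because it also carries the uncertainty due to the missing values.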


An examination of the effect of the TESS extended mission on southern hemisphere monotransits

Astronomy and Astrophysics ◽

10.1051/0004-6361/201936703 ◽

2019 ◽

Vol 631 ◽

pp. A83 ◽

Cited By ~ 4

Author(s):

Benjamin F. Cooke ◽

Don Pollacco ◽

Daniel Bayliss

Keyword(s):

Southern Hemisphere ◽

Large Time ◽

First Year ◽

Data Sets ◽

Data Set ◽

Window Functions ◽

Time Gap ◽

Initial Survey

Context. NASA recently announced an extended mission for TESS. As a result it is expected that the southern ecliptic hemisphere will be re-observed approximately two years after the initial survey. Aims. We aim to explore how TESS re-observing the southern ecliptic hemisphere will impact the number and distribution of monotransits discovered during the first year of observations. This simulation can be scaled to any future TESS re-observations. Methods. We carry out an updated simulation of TESS detections in the southern ecliptic hemisphere. This simulation includes realistic Sector window-functions based on the first 11 sectors of SPOC 2 min SAP lightcurves. We then extend this simulation to cover the expected Year 4 of the mission when TESS will re-observe the southern ecliptic fields. For recovered monotransits we also look at the possibility of predicting the period based on the coverage in the TESS data. Results. We find an updated prediction of 339 monotransits from the TESS Year 1 southern ecliptic hemisphere, and that approximately 80% of these systems (266/339) will transit again in the Year 4 observations. The Year 4 observations will also contribute new monotransits not seen in Year 1, resulting in a total of 149 monotransits from the combined Year 1 and Year 4 data sets. We find that 75% (189/266) of recovered Year 1 monotransits will only transit once in the Year 4 data set. For these systems we will be able to constrain possible periods, but period aliasing due to the large time gap between Year 1 and Year 4 observations means that the true period will remain unknown without further spectroscopic or photometric follow-up.
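Period aliasing can be made concrete with a short sketch. If a system shows one transit in Year 1 and one in Year 4, the gap between them must be a whole number of orbital periods, so the candidate periods are gap/n for n = 1, 2, 3, and so on. The transit times and the minimum period below are invented, and the sketch ignores the detailed window functions that the paper uses to rule out individual aliases.

```python
def period_aliases(t1, t2, p_min, p_max=None):
    """List candidate periods gap/n (n = 1, 2, ...) between p_min and p_max.

    t1, t2 : mid-transit times of the two observed transits (days)
    p_min  : shortest period still allowed, e.g. from the continuous
             coverage around each transit (an assumed input here)
    p_max  : longest period allowed; defaults to the full gap
    """
    gap = t2 - t1
    if gap <= 0 or p_min <= 0:
        raise ValueError("need t2 > t1 and p_min > 0")
    if p_max is None:
        p_max = gap
    aliases = []
    n = 1
    while gap / n >= p_min:
        period = gap / n
        if period <= p_max:
            aliases.append(period)
        n += 1
    return aliases


# Invented example: transits 720 days apart, periods under 90 days excluded.
aliases = period_aliases(t1=100.0, t2=820.0, p_min=90.0)
print(len(aliases), aliases)
```

Each remaining alias is a discrete hypothesis that follow-up observations can test, which is why the true period stays ambiguous until further data are taken.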


Should I stay or should I go? A retrospective propensity score-matched analysis using administrative data of hospital-at-home for older people in Scotland

BMJ Open ◽

10.1136/bmjopen-2018-023350 ◽

2019 ◽

Vol 9 (5) ◽

pp. e023350 ◽

Cited By ~ 1

Author(s):

Apostolos Tsiachristas ◽

Graham Ellis ◽

Scott Buchanan ◽

Peter Langhorne ◽

David J Stott ◽

...

Keyword(s):

Propensity Score ◽

Administrative Data ◽

Healthcare Costs ◽

Data Set ◽

Risk Of Death ◽

Level Data ◽

Increased Risk ◽

Hospital At Home ◽

At Home

Objectives: To compare the characteristics of populations admitted to hospital-at-home services with the population admitted to hospital and assess the association of these services with healthcare costs and mortality. Design: In a retrospective observational cohort study of linked patient level data, we used propensity score matching in combination with regression analysis. Participants: Patients aged 65 years and older admitted to hospital-at-home or hospital. Interventions: Three geriatrician-led admission avoidance hospital-at-home services in Scotland. Outcome measures: Healthcare costs and mortality. Results: Patients in hospital-at-home were older and more socioeconomically disadvantaged, had higher rates of previous hospitalisation and there was a greater proportion of women and people with several chronic conditions compared with the population admitted to hospital. The cost of providing hospital-at-home varied between the three sites from £628 to £2928 per admission. Hospital-at-home was associated with 18% lower costs during the follow-up period in site 1 (ratio of means 0.82; 95% CI: 0.76 to 0.89). Limiting the analysis to costs during the 6 months following index discharge, patients in the hospital-at-home cohorts had 27% higher costs (ratio of means 1.27; 95% CI: 1.14 to 1.41) in site 1, 9% (ratio of means 1.09; 95% CI: 0.95 to 1.24) in site 2 and 70% in site 3 (ratio of means 1.70; 95% CI: 1.40 to 2.07) compared with patients in the control cohorts. Admission to hospital-at-home was associated with an increased risk of death during the follow-up period in all three sites (1.09, 95% CI: 1.00 to 1.19 site 1; 1.29, 95% CI: 1.15 to 1.44 site 2; 1.27, 95% CI: 1.06 to 1.54 site 3). Conclusions: Our findings indicate that in these three cohorts, the populations admitted to hospital-at-home and hospital differ. We cannot rule out the risk of residual confounding, as our analysis relied on an administrative data set and we lacked data on disease severity and type of hospitalised care received in the control cohorts.
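The abstract does not say which matching algorithm was used. As one common possibility, the sketch below shows greedy 1:1 nearest-neighbour matching with a caliper, applied to propensity scores that are assumed to have been estimated already (for example by logistic regression). The patient identifiers, scores, and caliper width are invented.

```python
def greedy_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on the propensity score.

    treated, controls : dicts mapping patient id -> propensity score
    caliper           : maximum allowed score difference for a match
    Returns a list of (treated_id, control_id) pairs; each control is used
    at most once, and treated patients with no control inside the caliper
    stay unmatched.
    """
    available = dict(controls)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control to this treated patient.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Invented scores: "h" = hospital-at-home, "c" = hospital (control).
treated = {"h1": 0.30, "h2": 0.62, "h3": 0.90}
controls = {"c1": 0.28, "c2": 0.33, "c3": 0.60, "c4": 0.10}
pairs = greedy_match(treated, controls)
print(pairs)
```

Patient h3 stays unmatched because no control has a similar score. Matching can only balance measured covariates, which is the residual confounding concern raised in the conclusions.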


Application of time-lapse ERT imaging to watershed characterization

Geophysics ◽

10.1190/1.2907156 ◽

2008 ◽

Vol 73 (3) ◽

pp. G7-G17 ◽

Cited By ~ 115

Author(s):

Carlyle R. Miller ◽

Partha S. Routh ◽

Troy R. Brosten ◽

James P. McNamara

Keyword(s):

Reference Model ◽

Time Lapse ◽

Data Sets ◽

Data Set ◽

Resistivity Tomography ◽

Practical Applications ◽

Data Inversion ◽

Time Periods ◽

The Difference ◽

Base Data

Time-lapse electrical resistivity tomography (ERT) has many practical applications to the study of subsurface properties and processes. When inverting time-lapse ERT data, it is useful to proceed beyond straightforward inversion of data differences and take advantage of the time-lapse nature of the data. We assess various approaches for inverting and interpreting time-lapse ERT data and determine that two approaches work well. The first approach is model subtraction after separate inversion of the data from two time periods, and the second approach is to use the inverted model from a base data set as the reference model or prior information for subsequent time periods. We prefer this second approach. Data inversion methodology should be considered when designing data acquisition; i.e., to utilize the second approach, it is important to collect one or more data sets for which the bulk of the subsurface is in a background or relatively unperturbed state. A third and commonly used approach to time-lapse inversion, inverting the difference between two data sets, localizes the regions of the model in which change has occurred; however, varying noise levels between the two data sets can be problematic. To further assess the various time-lapse inversion approaches, we acquired field data from a catchment within the Dry Creek Experimental Watershed near Boise, Idaho, U.S.A. We combined the complementary information from individual static ERT inversions, time-lapse ERT images, and available hydrologic data in a robust interpretation scheme to aid in quantifying seasonal variations in subsurface moisture content.
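The three approaches can be written schematically as follows. This is a generic regularized-inversion notation, not the paper's own: the forward operator F, the weighting matrices W_d and W_m, the trade-off parameter β, and the symbols for the base and later data sets (d_0, d_t) and their inverted models (m_0^*, m_t^*) are all assumptions introduced for illustration.

```latex
% Approach 1: invert each data set separately, then subtract the models.
\[ \Delta m = m_t^{\ast} - m_0^{\ast} \]

% Approach 2 (preferred in the abstract): use the inverted base model
% m_0^{\ast} as the reference model when inverting the later data d_t.
\[ \phi(m_t) = \lVert W_d \, ( d_t - F(m_t) ) \rVert_2^2
             + \beta \, \lVert W_m \, ( m_t - m_0^{\ast} ) \rVert_2^2 \]

% Approach 3: invert the data difference directly; differing noise levels
% in d_t and d_0 both enter \Delta d.
\[ \Delta d = d_t - d_0 \]
```

In the second form the regularization term penalizes departures from the background state, which is why the abstract stresses collecting at least one data set while the subsurface is relatively unperturbed.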


Integration of a cancer registry dataset into a hospital-wide central data warehouse.

Journal of Clinical Oncology ◽

10.1200/jco.2014.32.30_suppl.212 ◽

2014 ◽

Vol 32 (30_suppl) ◽

pp. 212-212

Author(s):

Stephen Flaherty ◽

Robert Savage ◽

Ingrid Stendhal ◽

Susan Roston ◽

Abhijeet Makhe ◽

...

Keyword(s):

Quality Improvement ◽

Data Warehouse ◽

Administrative Data ◽

Cancer Registry ◽

Registry Data ◽

Data Sets ◽

Data Set ◽

Cancer Registry Data ◽

Multiple Data Sets ◽

Central Data

Background: The ability to link cancer registry data to clinical and administrative data sets for quality improvement has long been desired. We sought to integrate registry data into a central data warehouse in an effort to make available for the first time consistent and reliable diagnosis and staging data to a broad hospital user group. Methods: After a short period of data analysis, the tables for Cancer Registry data (Oracle) were modeled. The source data (SQL Server) was conformed and integrated using an ETL tool (Informatica). All ETL QA work was performed with SQL queries. Cancer registry data was integrated into reporting architecture (Microstrategy) to facilitate design of standardized and ad hoc reports. Results: 140 distinct fields on demographics, staging (clinical, pathological, collaborative), site specific categories, diagnosis, and treatment were integrated into the Dana-Farber Analytics Reporting Tool (DART) for historic Cancer Registry data beginning with January 2010 newly diagnosed cancers. All Cancer Registry data and patient files (new and old) are updated in DART on a monthly basis. Conclusions: Individuals across the hospital now have the ability to link clinical and administrative data from our EMR, institutional QI data from varied systems, and pharmacy data to Cancer Registry data in the DART tool. One example of the integration of these multiple data sets is the linkage of staging data from the Cancer Registry data set and time to referral data from the administrative data set by patient MRN. As DFCI aims to cohort its patients based on their primary diagnosis for quality improvement and other internal reporting needs, the ability to analyze patients in this way becomes critical. This project sets an example for other centers as they integrate Cancer Registry data into user-friendly business intelligence systems to help meet federal reporting mandates and aid internal improvement work.
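The linkage example in the conclusions, staging data joined to time-to-referral data by patient MRN, amounts to a keyed join. The project itself used SQL, Informatica, and MicroStrategy; the Python below is only a stand-in to show the logic, and the field names (`mrn`, `stage`, `days_to_referral`) and rows are hypothetical.

```python
def link_by_mrn(registry_rows, admin_rows):
    """Inner-join registry and administrative rows on the shared 'mrn' key.

    Assumes at most one administrative row per MRN; a later duplicate
    would overwrite an earlier one in the lookup table.
    """
    admin_by_mrn = {row["mrn"]: row for row in admin_rows}
    linked = []
    for reg in registry_rows:
        adm = admin_by_mrn.get(reg["mrn"])
        if adm is not None:
            linked.append({**reg, **adm})  # merged record with both field sets
    return linked


# Invented rows for illustration only.
registry = [
    {"mrn": "A001", "stage": "II"},
    {"mrn": "A002", "stage": "IV"},
    {"mrn": "A003", "stage": "I"},
]
admin = [
    {"mrn": "A001", "days_to_referral": 12},
    {"mrn": "A003", "days_to_referral": 30},
]
linked = link_by_mrn(registry, admin)
print(linked)
```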


A Three-Sample Test for Introgression

Molecular Biology and Evolution ◽

10.1093/molbev/msz178 ◽

2019 ◽

Vol 36 (12) ◽

pp. 2878-2882 ◽

Cited By ~ 11

Author(s):

Matthew W Hahn ◽

Mark S Hibbins

Keyword(s):

Statistical Power ◽

Data Sets ◽

Data Set ◽

Sample Test ◽

Multiple Sequences ◽

Single Sequence

Abstract: Many methods exist for detecting introgression between nonsister species, but the most commonly used require either a single sequence from four or more taxa or multiple sequences from each of three taxa. Here, we present a test for introgression that uses only a single sequence from three taxa. This test, denoted D3, uses logic similar to the standard D-test for introgression, but by using pairwise distances instead of site patterns it is able to detect the same signal of introgression with fewer species. We use simulations to show that D3 has statistical power almost equal to D, demonstrating its use on a data set of wild bananas (Musa). The new test is easy to apply and easy to interpret, and should find wide use among currently available data sets.
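A distance-based statistic of this kind can be sketched as below. The abstract does not give the formula; the sketch assumes the form D3 = (d_BC − d_AC) / (d_BC + d_AC) for a species tree ((A,B),C), which should be checked against the paper. The p-distance (proportion of differing sites) and the toy sequences are choices made here for illustration, and significance testing is left out.

```python
def p_distance(s1, s2):
    """Proportion of differing sites between two aligned DNA sequences.

    Sites where either sequence has a gap or ambiguity code are skipped.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            diffs += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def d3(seq_a, seq_b, seq_c):
    """Assumed form of D3 for sister taxa A, B and a third taxon C.

    With no introgression, A and B are expected to be equally distant from C,
    so the statistic is expected to be near zero.
    """
    d_ac = p_distance(seq_a, seq_c)
    d_bc = p_distance(seq_b, seq_c)
    if d_ac + d_bc == 0:
        raise ValueError("both distances are zero")
    return (d_bc - d_ac) / (d_bc + d_ac)


# Toy alignment: A differs from C at 1 site, B differs from C at 3 sites.
seq_c = "ACGTACGTAC"
seq_a = "ACGTACGTAA"
seq_b = "TCGTTCGTAA"
value = d3(seq_a, seq_b, seq_c)
print(value)
```

Under the assumed sign convention, a positive value means A is closer to C than B is, a pattern consistent with gene flow between A and C.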


The Southern Annular Mode: a comparison of indices

Hydrology and Earth System Sciences ◽

10.5194/hess-16-967-2012 ◽

2012 ◽

Vol 16 (3) ◽

pp. 967-982 ◽

Cited By ~ 66

Author(s):

M. Ho ◽

A. S. Kiem ◽

D. C. Verdon-Kidd

Keyword(s):

Southern Annular Mode ◽

Data Sets ◽

Data Set ◽

Annular Mode ◽

Time Periods ◽

Development Approach ◽

The Impact ◽

The Relationship ◽

Mode A ◽

Australian Rainfall

Abstract. The Southern Annular Mode (SAM) has been identified as a climate mechanism with potentially significant impacts on the Australian hydroclimate. However, despite the identification of relationships between SAM and Australia's hydroclimate using certain data sets, and focussed on certain time periods, the association has not been extensively explored and significant uncertainties remain. One reason for this is the existence of numerous indices, methods and data sets by which SAM has been approximated. In this paper, the various SAM definitions and indices are reviewed and the similarities and discrepancies are discussed, along with the strengths and weaknesses of each index development approach. Further, the sensitivity of the relationship between SAM and Australian rainfall to choice of SAM index is quantified and recommendations are given as to the most appropriate index to use when assessing the impacts of the SAM on Australia's hydroclimate. Importantly this study highlights the need to consider the impact that the choice of SAM index, and data set used to calculate the index, has on the outcomes of any SAM attribution study.


Pilot study of the ability to probabilistically link clinical trial patients to administrative data and determine long-term outcomes

Clinical Trials ◽

10.1177/1740774518815653 ◽

2018 ◽

Vol 16 (1) ◽

pp. 14-17 ◽

Cited By ~ 1

Author(s):

Annette E Hay ◽

Joseph L Pater ◽

Elyse Corn ◽

Lei Han ◽

Ximena Camacho ◽

...

Keyword(s):

Clinical Trial ◽

Data Collection ◽

Administrative Data ◽

Clinical Trial Data ◽

Administrative Databases ◽

Data Sets ◽

Probabilistic Linkage ◽

Cancer Trials

Background: Clinical trials are important but extremely costly. Utilization of routinely collected administrative data may simplify and enhance clinical trial data collection. Purpose: The aim of this study was to test the feasibility of use of administrative databases in Ontario, Canada, for long-term clinical trial follow-up, specifically (a) to determine whether limited patient identifiers held by the Canadian Cancer Trials Group can be used to probabilistically link with individuals in the Institute for Clinical Evaluative Sciences databases and if so, (b) the level of concordance between the two data sets. Methods: This retrospective study was conducted through collaboration of established health service (Institute for Clinical Evaluative Sciences) and clinical trial (Canadian Cancer Trials Group) research groups in the province of Ontario, Canada, where healthcare is predominantly funded by the government. Adults with pre-treated metastatic colorectal cancer previously enrolled in the Canadian Cancer Trials Group CO.17 and CO.20 randomized phase III trials were included, limited to those in Ontario. The main outcomes were rate of successful probabilistic linkage and concordance of survival data, stated a priori. Results: Probabilistic linkage was successful in 266/293 (90.8%) participants. In those patients for whom linkage was successful, the Canadian Cancer Trials Group (trial) and the Institute for Clinical Evaluative Sciences (administrative) data sets were concordant with regard to the occurrence of death during the period of clinical trial follow-up in 206/209 (98.6%). Death was recorded in the Institute for Clinical Evaluative Sciences, but not the Canadian Cancer Trials Group, for 57 cases, where the event occurred after the clinical trial cut-off dates. The recorded date of death matched closely between both databases. During the period of clinical trial conduct, administrative databases contained details of hospitalizations and emergency room visits not captured in the clinical trial electronic database. Conclusion: Prospective use of administrative data could enhance clinical trial data collection, both for long-term follow-up and resource utilization for economic analyses and do so less expensively than current primary data collection. Recording a unique identifier (e.g. health insurance number) in trial databases would allow deterministic linkage for all participants.
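The abstract does not describe the linkage algorithm. One common formulation of probabilistic linkage (the Fellegi-Sunter approach) scores a candidate pair by summing a log-likelihood-ratio weight per identifier: log2(m/u) when the field agrees and log2((1−m)/(1−u)) when it disagrees, where m and u are the agreement probabilities among true matches and non-matches. The fields and probabilities below are hypothetical, since the study's "limited patient identifiers" are not listed. The last lines recompute the two percentages reported in the results.

```python
from math import log2

# Hypothetical identifiers with assumed (m, u) probabilities:
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELDS = {
    "birth_date": (0.95, 0.01),
    "sex": (0.98, 0.50),
    "postal_prefix": (0.90, 0.05),
}


def match_weight(agreements):
    """Total match weight for one candidate pair.

    agreements : dict mapping each field name to True (agrees) or False.
    """
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agreements[field]:
            total += log2(m / u)              # agreement supports a match
        else:
            total += log2((1 - m) / (1 - u))  # disagreement counts against it
    return total


all_agree = match_weight({"birth_date": True, "sex": True, "postal_prefix": True})
date_differs = match_weight({"birth_date": False, "sex": True, "postal_prefix": True})
print(all_agree, date_differs)

# Percentages reported in the abstract, recomputed from the raw counts.
linkage_rate = round(100 * 266 / 293, 1)
death_concordance = round(100 * 206 / 209, 1)
print(linkage_rate, death_concordance)
```

Pairs whose total weight exceeds a chosen threshold are accepted as links. A unique identifier recorded in both sources, as the conclusion recommends, would make this scoring unnecessary.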
