An Evolutionary Schema for Using “it-is-what-it-is” Data in Official Statistics

2019 ◽  
Vol 35 (1) ◽  
pp. 137-165
Author(s):  
Jack Lothian ◽  
Anders Holmberg ◽  
Allyson Seyb

Abstract The linking of disparate data sets across time, space and sources is probably the foremost current issue facing Central Statistical Agencies (CSAs). If one reviews the current literature looking for the prevalent challenges facing CSAs, three issues stand out: 1) using administrative data effectively; 2) big data and what it means for CSAs; and 3) integrating disparate data sets (such as health, education and wealth) to provide measurable facts that can guide policy makers. CSAs are being challenged to confront the same kinds of problems faced by Google, Facebook, and Yahoo, which are using graphical/semantic web models for organizing, searching and analysing data. Additionally, time and space (geography) are becoming more important dimensions (domains) for CSAs as they start to explore new data sources and ways to integrate them to study relationships. Central agency methodologists are being pushed to include these new perspectives in their standard theories, practices and policies. Like most methodologists, the authors see surveys and the publication of their results as a process where estimation is the key tool to achieve the final goal of an accurate statistical output. Randomness and sampling exist to support this goal, and early on it was clear to us that the incoming “it-is-what-it-is” data sources were not randomly selected. These sources were obviously biased and thus would produce biased estimates. So, we set out to design a strategy to deal with this issue. This article presents a schema for integrating and linking traditional and non-traditional datasets. Like all survey methodologies, this schema addresses the fundamental issues of representativeness, estimation and total survey error measurement.
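To make the bias concern concrete, here is a minimal simulation, entirely separate from the authors' schema, in which a hypothetical administrative source over-covers large values and therefore yields a biased mean, while an equal-probability sample does not (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: income-like variable, where larger values are more
# likely to appear in an administrative ("it-is-what-it-is") source.
population = rng.lognormal(mean=10.0, sigma=0.8, size=100_000)

# Probability sample: every unit has the same inclusion probability.
random_sample = rng.choice(population, size=2_000, replace=False)

# Non-random source: inclusion probability grows with the value itself,
# mimicking coverage that depends on the measured quantity.
inclusion_prob = population / population.max()
selected = rng.random(population.size) < inclusion_prob
admin_source = population[selected]

print(f"True mean:          {population.mean():,.0f}")
print(f"Random-sample mean: {random_sample.mean():,.0f}")   # close to the truth
print(f"Admin-source mean:  {admin_source.mean():,.0f}")    # systematically too high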

2007 ◽  
Vol 64 (5) ◽  
pp. 1053-1065 ◽  
Author(s):  
Mashkoor A. Malik ◽  
Larry A. Mayer

Abstract Malik, M. A., and Mayer, L. A. 2007. Investigation of seabed fishing impacts on benthic structure using multi-beam sonar, sidescan sonar, and video. – ICES Journal of Marine Science, 64: 1053–1065. Long, linear furrows of lengths up to several kilometres were observed during a recent high-resolution, multi-beam bathymetry survey of Jeffreys Ledge, a prominent fishing ground in the Gulf of Maine located about 50 km from Portsmouth, NH, USA. These features, which have a relief of only a few centimetres, are presumed to be caused either directly by dredging gear used in the area for scallop and clam fisheries, or indirectly through the dragging of boulders by bottom gear. Extraction of these features with very small vertical expression from a noisy data set, including several instrumental artefacts, presented a number of challenges. To enhance the detection and identification of the features, data artefacts were identified and removed selectively using spatial frequency filtering. Verification of the presence of the features was carried out with repeated multi-beam bathymetry surveys and sidescan sonar surveys. Seabed marks that were clearly detected on multi-beam and sidescan sonar records were not discernible on a subsequent video survey. The inability to see the seabed marks with video may be related to their age. The fact that with time, the textural contrasts discernible by video imagery are lost has important ramifications for the appropriateness of methodologies for quantifying gear impact. The results imply that detailed investigations of seabed impact are best done with a suite of survey tools (multi-beam bathymetry, sidescan sonar, and video) and software to integrate the disparate data sets geographically.
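The artefact removal described above relies on spatial frequency filtering. The sketch below is a generic illustration of that idea on synthetic data, not the authors' processing chain: a 2-D FFT notch filter suppresses a periodic instrumental ripple while leaving low-relief furrow-like features intact (grid size, wavelengths and amplitudes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bathymetry grid: gentle regional slope plus shallow linear furrows
# and a periodic "instrumental" ripple along the other axis.
ny, nx = 256, 256
y, x = np.mgrid[0:ny, 0:nx]
seabed = 0.02 * x                               # regional slope (m)
furrows = 0.03 * np.sin(2 * np.pi * y / 40)     # few-cm relief features
artefact = 0.10 * np.sin(2 * np.pi * x / 8)     # high-frequency instrumental ripple
depth = seabed + furrows + artefact + 0.01 * rng.standard_normal((ny, nx))

# Notch out the artefact's spatial frequency band in the 2-D spectrum.
spectrum = np.fft.fft2(depth)
fx = np.fft.fftfreq(nx)                         # cycles per pixel along x
fy = np.fft.fftfreq(ny)
FX, FY = np.meshgrid(fx, fy)
notch = ~((np.abs(np.abs(FX) - 1 / 8) < 0.01) & (np.abs(FY) < 0.01))
cleaned = np.fft.ifft2(spectrum * notch).real

# The furrow signal survives; the ripple's spectral band is strongly suppressed.
print(f"artefact-band power before: {np.abs(spectrum[~notch]).sum():.1f}")
print(f"artefact-band power after : {np.abs(np.fft.fft2(cleaned)[~notch]).sum():.1f}")
```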


Ocean Science ◽  
2019 ◽  
Vol 15 (2) ◽  
pp. 249-268 ◽  
Author(s):  
Johannes Schulz-Stellenfleth ◽  
Joanna Staneva

Abstract. In many coastal areas there is an increasing number and variety of observation data available, which are often very heterogeneous in their temporal and spatial sampling characteristics. With the advent of new systems, like the radar altimeter on board the Sentinel-3A satellite, a lot of questions arise concerning the accuracy and added value of different instruments and numerical models. Quantification of errors is a key factor for applications like data assimilation and forecast improvement. In the past, the triple collocation method to estimate systematic and stochastic errors of measurements and numerical models was successfully applied to different data sets. This method relies on the assumption that three independent data sets provide estimates of the same quantity. In coastal areas with strong gradients even small distances between measurements can lead to larger differences and this assumption can become critical. In this study the triple collocation method is extended in different ways with the specific problems of the coast in mind. In addition to nearest-neighbour approximations considered so far, the presented method allows for use of a large variety of interpolation approaches to take spatial variations in the observed area into account. Observation and numerical model errors can therefore be estimated, even if the distance between the different data sources is too large to assume that they measure the same quantity. If the number of observations is sufficient, the method can also be used to estimate error correlations between certain data source components. As a second novelty, an estimator for the uncertainty in the derived observation errors is derived as a function of the covariance matrices of the input data and the number of available samples. In the first step, the method is assessed using synthetic observations and Monte Carlo simulations. The technique is then applied to a data set of Sentinel-3A altimeter measurements, in situ wave observations, and numerical wave model data with a focus on the North Sea. Stochastic observation errors for the significant wave height, as well as bias and calibration errors, are derived for the model and the altimeter. The analysis indicates a slight overestimation of altimeter wave heights, which becomes more pronounced at higher sea states. The smallest stochastic errors are found for the in situ measurements. Different observation geometries of in situ data and altimeter tracks are furthermore analysed, considering 1-D and 2-D interpolation approaches. For example, the geometry of an altimeter track passing between two in situ wave instruments is considered with model data being available at the in situ locations. It is shown that for a sufficiently large sample, the errors of all data sources, as well as the error correlations of the model, can be estimated with the new method.
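For readers unfamiliar with the baseline method being extended, a minimal sketch of classic covariance-based triple collocation is given below (nearest-neighbour case only, with simulated wave-height-like data; the error magnitudes and calibration factor are invented, and this is not the paper's extended estimator):

```python
import numpy as np

rng = np.random.default_rng(1)

# Common geophysical signal (e.g. significant wave height) plus three
# independent error realisations for model, altimeter and in situ data.
n = 50_000
truth = 2.0 + 0.8 * rng.standard_normal(n)
model = truth + 0.30 * rng.standard_normal(n)
altimeter = 1.05 * truth + 0.25 * rng.standard_normal(n)   # small calibration error
insitu = truth + 0.15 * rng.standard_normal(n)

def triple_collocation(x, y, z):
    """Classic covariance-based TC: error variances of x, y, z under the
    assumptions of a common truth and mutually uncorrelated errors."""
    C = np.cov(np.vstack([x, y, z]))
    var_x = C[0, 0] - C[0, 1] * C[0, 2] / C[1, 2]
    var_y = C[1, 1] - C[0, 1] * C[1, 2] / C[0, 2]
    var_z = C[2, 2] - C[0, 2] * C[1, 2] / C[0, 1]
    return np.sqrt([var_x, var_y, var_z])

print("estimated error std devs:", triple_collocation(model, altimeter, insitu).round(3))
# Expected roughly [0.30, 0.25, 0.15]; the calibration factor does not bias the estimates.
```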


2015 ◽  
Vol 5 (2) ◽  
pp. 137-148 ◽  
Author(s):  
Jeremy N. V. Miles ◽  
Priscillia Hunt

Purpose – In applied psychology research settings, such as criminal psychology, missing data are to be expected. Missing data can cause problems with both biased estimates and lack of statistical power. The paper aims to discuss these issues. Design/methodology/approach – Recently, sophisticated methods for dealing appropriately with missing data, so as to minimize bias and maximize power, have been developed. In this paper the authors use an artificial data set to demonstrate the problems that can arise with missing data, and make naïve attempts to handle data sets where some data are missing. Findings – With the artificial data set, and a data set comprising the results of a survey investigating prices paid for recreational and medical marijuana, the authors demonstrate the use of multiple imputation and maximum likelihood estimation for obtaining appropriate estimates and standard errors when data are missing. Originality/value – Missing data are ubiquitous in applied research. This paper demonstrates that techniques for handling missing data are accessible and should be employed by researchers.
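A minimal numerical sketch of the contrast the paper demonstrates, using invented data rather than the authors' survey: complete-case analysis is biased when missingness depends on an observed covariate, whereas pooling several stochastic regression imputations recovers the mean (a simplified form of multiple imputation that pools only point estimates, without the parameter redraws needed for valid standard errors):

```python
import numpy as np

rng = np.random.default_rng(7)

# Artificial data: y depends on x; y is missing more often when x is large (MAR).
n = 5_000
x = rng.standard_normal(n)
y = 10 + 2 * x + rng.standard_normal(n)
missing = rng.random(n) < 1 / (1 + np.exp(-2 * x))      # P(missing) rises with x
y_obs = np.where(missing, np.nan, y)

print(f"true mean of y:           {y.mean():.2f}")
print(f"complete-case mean:       {np.nanmean(y_obs):.2f}   (biased low)")

# Multiple imputation via stochastic regression imputation, pooled over M draws.
M = 20
obs = ~missing
beta = np.polyfit(x[obs], y_obs[obs], deg=1)            # regression of y on x
resid_sd = np.std(y_obs[obs] - np.polyval(beta, x[obs]))
means = []
for _ in range(M):
    y_imp = y_obs.copy()
    y_imp[missing] = np.polyval(beta, x[missing]) + resid_sd * rng.standard_normal(missing.sum())
    means.append(y_imp.mean())
print(f"multiple-imputation mean: {np.mean(means):.2f}   (close to the true mean)")
```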


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0242923
Author(s):  
P. J. Stephenson ◽  
Carrie Stengel

Many conservation managers, policy makers, businesses and local communities cannot access the biodiversity data they need for informed decision-making on natural resource management. A handful of databases are used to monitor indicators against global biodiversity goals but there is no openly available consolidated list of global data sets to help managers, especially those in high-biodiversity countries. We therefore conducted an inventory of global databases of potential use in monitoring biodiversity states, pressures and conservation responses at multiple levels. We uncovered 145 global data sources, as well as a selection of global data reports, links to which we will make available on an open-access website. We describe trends in data availability and actions needed to improve data sharing. If the conservation and science community made a greater effort to publicise data sources, and make the data openly and freely available for the people who most need it, we might be able to mainstream biodiversity data into decision-making and help stop biodiversity loss.


Healthcare ◽  
2018 ◽  
Vol 6 (4) ◽  
pp. 136 ◽  
Author(s):  
Stephanie Partridge ◽  
Eloise Howse ◽  
Gwynnyth Llewellyn ◽  
Margaret Allman-Farinelli

Young adulthood is a period of transition, which for many includes higher education. Higher education is associated with specific risks to wellbeing. Understanding the available data on wellbeing in this group may help guide the future collection of data to inform policy and practice in the sector. This scoping review aimed to identify the availability of data sources on the wellbeing of the Australian young adult population attending tertiary education. Using the methods of Arksey and O’Malley, data from three primary sources, i.e., the Australian Bureau of Statistics, the Australian Institute of Health and Welfare and relevant longitudinal studies, were identified. Data sources were screened and coded, and relevant information was extracted. Key data sources were identified for eight areas related to wellbeing, namely family and community, health, education and training, work, economic wellbeing, housing, crime and justice, and culture and leisure. Forty individual data sets from 16 surveys and six active longitudinal studies were identified. Two data sets covered seven of the eight areas of wellbeing; one was specific to young adults in tertiary education, while the other was not limited to young adults. Both data sets lacked information concerning crime and justice variables, which have recently been identified as being of major concern among Australian university students. We recommend that government policy address the collection of a comprehensive data set encompassing each of the eight areas of wellbeing to inform future policy and practice.


2021 ◽  
Vol 14 (11) ◽  
pp. 2519-2532
Author(s):  
Fatemeh Nargesian ◽  
Abolfazl Asudeh ◽  
H. V. Jagadish

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation-based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.
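A toy sketch of the unknown-distribution setting, not the paper's algorithm: an epsilon-greedy acquisition loop estimates each source's group proportions online and buys batches from the source with the best expected yield per unit cost (the group counts, costs and batch size are invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: two demographic groups, three data sources with unknown
# group proportions and per-record costs; we need `target` records per group.
true_p_group1 = np.array([0.05, 0.30, 0.60])   # unknown to the algorithm
cost_per_record = np.array([1.0, 2.0, 5.0])
target = np.array([500, 500])                  # required counts for groups 0 and 1

counts = np.zeros(2)
spent = 0.0
successes = np.ones(3)        # pseudo-counts for each source's group-1 rate
trials = np.ones(3) * 2
eps, batch = 0.1, 50

while (counts < target).any():
    # Steer toward the scarcer group (group 1) until its quota is met.
    need_group1 = counts[1] < target[1]
    est_p = successes / trials
    value = (est_p if need_group1 else (1 - est_p)) / cost_per_record
    src = rng.integers(3) if rng.random() < eps else int(np.argmax(value))

    got_group1 = rng.binomial(batch, true_p_group1[src])   # acquire a batch
    counts += [batch - got_group1, got_group1]
    spent += batch * cost_per_record[src]
    successes[src] += got_group1
    trials[src] += batch

print(f"records acquired per group: {counts.astype(int)}, total cost: {spent:.0f}")
```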


2018 ◽  
Author(s):  
Johannes Schulz-Stellenfleth ◽  
Joanna Staneva

Abstract. In many coastal areas there is an increasing number and variety of observation data available, which are often very heterogeneous in their temporal and spatial sampling characteristics. With the advent of new systems, like the radar altimeter on board the Sentinel-3A satellite, a lot of questions arise concerning the accuracy and added value of different instruments and numerical models. Quantification of errors is a key factor for applications like data assimilation and forecast improvement. In the past, the triple collocation method to estimate systematic and stochastic errors of measurements and numerical models was successfully applied to different data sets. This method relies on the assumption that three independent data sets provide estimates of the same quantity. In coastal areas with strong gradients even small distances between measurements can lead to larger differences and this assumption can become critical. In this study the triple collocation method is extended in different ways with the specific problems of the coast in mind. In addition to the nearest-neighbour approximations considered so far, the presented method allows the use of a large variety of interpolation approaches to take spatial variations in the observed area into account. Observation and numerical model errors can therefore be estimated, even if the distance between the different data sources is too large to assume that they measure the same quantity. If the number of observations is sufficient, the method can also be used to estimate error correlations between certain data source components. As a second novelty, an estimator for the uncertainty of the derived observation errors is derived as a function of the covariance matrices of the input data and the number of available samples. In the first step, the method is assessed using synthetic observations and Monte Carlo simulations. The technique is then applied to a data set of Sentinel-3A altimeter measurements, in situ wave observations, and numerical wave model data with a focus on the North Sea. Stochastic observation errors for the significant wave height, as well as bias and calibration errors, are derived for the model and the altimeter. The analysis indicates a slight overestimation of altimeter wave heights, which becomes more pronounced at higher sea states. The smallest stochastic errors are found for the in situ measurements. Different observation geometries of in situ data and altimeter tracks are furthermore analysed, considering 1-D and 2-D interpolation approaches. For example, the geometry of an altimeter track passing between two in situ wave instruments is considered with model data being available at the in situ locations. It is shown that for a sufficiently large sample, the errors of all data sources, as well as the error correlations of the model, can be estimated with the new method.


2012 ◽  
Vol 19 (1) ◽  
pp. 69-80 ◽  
Author(s):  
S. Zwieback ◽  
K. Scipal ◽  
W. Dorigo ◽  
W. Wagner

Abstract. The validation of geophysical data sets (e.g. derived from models, exploration techniques or remote sensing) presents a formidable challenge as all products are inherently different and subject to errors. The collocation technique permits the retrieval of the error variances of different data sources without the need to specify one data set as a reference. In addition, calibration constants can be determined to account for biases and different dynamic ranges. The method is frequently applied to the study and comparison of remote sensing, in-situ and modelled data, particularly in hydrology and oceanography. Previous studies have almost exclusively focussed on the validation of three data sources; in this paper it is shown how the technique generalizes to an arbitrary number of data sets. It turns out that only parts of the covariance structure can be resolved by the collocation technique, thus emphasizing the necessity of expert knowledge for the correct validation of geophysical products. Furthermore, the bias and error variance of the estimators are derived with particular emphasis on the assumptions necessary for establishing those characteristics. Important properties of the method, such as the structural deficiencies, dependence of the accuracy on the number of measurements and the impact of violated assumptions, are illustrated by application to simulated data.
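A toy numerical illustration of the generalisation to more than three data sets, under the idealised assumptions of a common signal, unit calibration and mutually uncorrelated errors (it is not the paper's estimator): each error variance is obtained by averaging the covariance-based collocation expression over all triplets containing that data set. As the abstract notes, the full error covariance structure is not identifiable from the collocation covariances alone.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)

# Five collocated data sets: common signal plus independent errors with
# different (made-up) standard deviations.
n, true_err_sd = 100_000, np.array([0.10, 0.20, 0.30, 0.15, 0.25])
signal = rng.standard_normal(n)
data = signal + true_err_sd[:, None] * rng.standard_normal((5, n))

C = np.cov(data)
N = C.shape[0]

def error_variance(i):
    """Average the covariance-based collocation expression over every pair
    (j, k) that completes a triplet with data set i."""
    estimates = [C[i, i] - C[i, j] * C[i, k] / C[j, k]
                 for j, k in itertools.combinations([m for m in range(N) if m != i], 2)]
    return np.mean(estimates)

est_sd = np.sqrt([error_variance(i) for i in range(N)])
print("true error std devs:     ", true_err_sd)
print("estimated error std devs:", est_sd.round(3))
```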


2018 ◽  
Vol 10 (3) ◽  
pp. 1473-1490 ◽  
Author(s):  
Birgit Hassler ◽  
Stefanie Kremser ◽  
Greg E. Bodeker ◽  
Jared Lewis ◽  
Kage Nesbit ◽  
...  

Abstract. An updated and improved version of a global, vertically resolved, monthly mean zonal mean ozone database has been calculated – hereafter referred to as the BSVertOzone (Bodeker Scientific Vertical Ozone) database. Like its predecessor, it combines measurements from several satellite-based instruments and ozone profile measurements from the global ozonesonde network. Monthly mean zonal mean ozone concentrations in mixing ratio and number density are provided in 5° latitude bins, spanning 70 altitude levels (1 to 70 km), or 70 pressure levels that are approximately 1 km apart (878.4 to 0.046 hPa). Different data sets or “tiers” are provided: Tier 0 is based only on the available measurements and therefore does not completely cover the whole globe or the full vertical range uniformly; the Tier 0.5 monthly mean zonal means are calculated as a filled version of the Tier 0 database where missing monthly mean zonal mean values are estimated from correlations against a total column ozone (TCO) database. The Tier 0.5 data set includes the full range of measurement variability and is created as an intermediate step for the calculation of the Tier 1 data where a least squares regression model is used to attribute variability to various known forcing factors for ozone. Regression model fit coefficients are expanded in Fourier series and Legendre polynomials (to account for seasonality and latitudinal structure, respectively). Four different combinations of contributions from selected regression model basis functions result in four different Tier 1 data sets that can be used for comparisons with chemistry–climate model (CCM) simulations that do not exhibit the same unforced variability as reality (unless they are nudged towards reanalyses). Compared to previous versions of the database, this update includes additional satellite data sources and ozonesonde measurements to extend the database period to 2016. Additional improvements over the previous version of the database include the following: (i) adjustments of measurements to account for biases and drifts between different data sources (using a chemistry-transport model, CTM, simulation as a transfer standard), (ii) a more objective way to determine the optimum number of Fourier and Legendre expansions for the basis function fit coefficients, and (iii) methodological and measurement uncertainties on each database value that are traced through all data modification steps. Comparisons with the ozone database from SWOOSH (Stratospheric Water and OzOne Satellite Homogenized data set) show good agreement in many regions of the globe. Minor differences are caused by different bias adjustment procedures for the two databases. However, compared to SWOOSH, BSVertOzone additionally covers the troposphere. Version 1.0 of BSVertOzone is publicly available at https://doi.org/10.5281/zenodo.1217184.
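The regression step described above expands fit coefficients in Fourier series and Legendre polynomials. The sketch below shows one generic way to construct such a combined basis and fit it by least squares; the truncation orders, latitude scaling and synthetic data are assumptions for illustration, not the BSVertOzone configuration:

```python
import numpy as np
from numpy.polynomial import legendre as leg

def fourier_legendre_basis(month, lat, n_fourier=2, n_legendre=4):
    """Combined basis: Fourier terms in month (seasonality) multiplied by
    Legendre polynomials in scaled latitude. Truncation orders are illustrative."""
    t = 2 * np.pi * (month - 0.5) / 12.0        # month mapped onto the annual cycle
    mu = lat / 90.0                             # latitude scaled to [-1, 1]
    fourier = [np.ones_like(t)]
    for k in range(1, n_fourier + 1):
        fourier += [np.sin(k * t), np.cos(k * t)]
    legendres = [leg.Legendre.basis(p)(mu) for p in range(n_legendre + 1)]
    # Every seasonal term is allowed its own latitudinal structure.
    return np.column_stack([f * P for f in fourier for P in legendres])

# Synthetic monthly zonal means on 5-degree latitude bins, then a least-squares fit.
rng = np.random.default_rng(8)
month_grid, lat_grid = np.meshgrid(np.arange(1, 13), np.arange(-87.5, 90, 5.0), indexing="ij")
months, lats = month_grid.ravel(), lat_grid.ravel()
truth = 3 + 2 * np.cos(2 * np.pi * months / 12) * (lats / 90.0) ** 2
zonal_means = truth + 0.1 * rng.standard_normal(months.size)

X = fourier_legendre_basis(months, lats)
coef, *_ = np.linalg.lstsq(X, zonal_means, rcond=None)
print(f"basis columns: {X.shape[1]}, residual std: {(zonal_means - X @ coef).std():.3f}")
```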


2017 ◽  
Vol 33 (2) ◽  
pp. 477-511 ◽  
Author(s):  
Giles Reid ◽  
Felipa Zabala ◽  
Anders Holmberg

Abstract Many national statistics offices acknowledge that making better use of existing administrative data can reduce the cost of meeting ongoing statistical needs. Stats NZ has developed a framework to help facilitate this reuse. The framework is an adapted Total Survey Error (TSE) paradigm for understanding how the strengths and limitations of different data sets flow through a statistical design to affect final output quality. Our framework includes three phases: 1) a single source assessment, 2) an integrated data set assessment, and 3) an estimation and output assessment. We developed a process and guidelines for applying this conceptual framework to practical decisions about statistical design, and used these in recent redevelopment projects. We discuss how we used the framework with data sources that have a non-statistical primary purpose, and how it has helped us spread total survey error ideas to non-methodologists.

