Multiple imputation with missing data indicators

2021 ◽  
pp. 096228022110473
Author(s):  
Lauren J Beesley ◽  
Irina Bondarenko ◽  
Michael R Elliott ◽  
Allison W Kurian ◽  
Steven J Katz ◽  
...  

Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation, also called chained equations multiple imputation. In this approach, we impute missing values using regression models for each variable, conditional on the other variables in the data. This approach, however, assumes that the missingness mechanism is missing at random, and it is not well-justified under not-at-random missingness without additional modification. In this paper, we describe how we can generalize the sequential regression multiple imputation procedure to handle missingness not at random in the setting where missingness may depend on other variables that are also missing but not on the missing variable itself, conditioning on fully observed variables. We provide algebraic justification for several generalizations of standard sequential regression multiple imputation using Taylor series and other approximations of the target imputation distribution under missingness not at random. Resulting regression model approximations include indicators for missingness, interactions, or other functions of the missingness not at random missingness model and observed data. In a simulation study, we demonstrate that the proposed sequential regression multiple imputation modifications result in reduced bias in the final analysis compared to standard sequential regression multiple imputation, with an approximation strategy involving inclusion of an offset in the imputation model performing the best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.
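The indicator-based idea can be sketched in a few lines of Python. This is a single hand-rolled imputation pass, not the authors' full SRMI algorithm; the data-generating model and all variable names are illustrative. Missingness in y depends on whether x is missing (not on y itself), matching the setting described above, so the imputation model for y includes the missingness indicator of x as a predictor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# x is missing for a random subset; y is more likely to be missing
# when x is missing (missingness depends on another missing variable,
# not on y itself).
x_miss = rng.random(n) < 0.3
r_x = x_miss.astype(float)                  # missingness indicator for x
y_miss = rng.random(n) < (0.1 + 0.4 * r_x)

x_obs = np.where(x_miss, np.nan, x)
y_obs = np.where(y_miss, np.nan, y)

# One imputation step for y: regress observed y on mean-filled x plus the
# indicator r_x, so the imputation model reflects the dependence of y's
# missingness on x's missingness.
x_filled = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)
X = np.column_stack([x_filled, r_x])
model = LinearRegression().fit(X[~y_miss], y_obs[~y_miss])
y_imp = y_obs.copy()
y_imp[y_miss] = model.predict(X[y_miss])
```

In a full chained-equations run this step would be iterated over all variables with missing values, with draws rather than point predictions.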

2019 ◽  
Vol 44 (5) ◽  
pp. 625-641
Author(s):  
Timothy Hayes

Multiple imputation is a popular method for addressing data that are presumed to be missing at random. To obtain accurate results, one’s imputation model must be congenial to (appropriate for) one’s intended analysis model. This article reviews and demonstrates two recent software packages, Blimp and jomo, to multiply impute data in a manner congenial with three prototypical multilevel modeling analyses: (1) a random intercept model, (2) a random slope model, and (3) a cross-level interaction model. Following these analysis examples, I review and discuss both software packages.


2018 ◽  
Vol 26 (4) ◽  
pp. 480-488 ◽  
Author(s):  
Thomas B. Pepinsky

This letter compares the performance of multiple imputation and listwise deletion using a simulation approach. The focus is on data that are “missing not at random” (MNAR), in which case both multiple imputation and listwise deletion are known to be biased. In these simulations, multiple imputation yields results that are frequently more biased, less efficient, and with worse coverage than listwise deletion when data are MNAR. This is the case even with very strong correlations between fully observed variables and variables with missing values, such that the data are very nearly “missing at random.” These results recommend caution in comparing results from multiple imputation and listwise deletion when the true data generating process is unknown.
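The core point can be reproduced in a stripped-down Python simulation. This is not the letter's design (which uses full multiple imputation); it substitutes a single regression imputation to show the mechanism: under MNAR, both the complete-case estimate and an imputation from a strongly correlated observed covariate remain biased.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = 0.9 * x + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)  # corr(x, y) ≈ 0.9

# MNAR: the probability that y is missing depends on y itself.
miss = rng.random(n) < 1 / (1 + np.exp(-2 * y))

listwise_mean = y[~miss].mean()  # complete-case estimate of E[y] = 0

# Single regression imputation from the strongly correlated, fully
# observed x (a stand-in for a full multiple-imputation run).
reg = LinearRegression().fit(x[~miss].reshape(-1, 1), y[~miss])
y_imp = y.copy()
y_imp[miss] = reg.predict(x[miss].reshape(-1, 1))
imputed_mean = y_imp.mean()
# Both estimates stay biased below the true mean of 0: even a 0.9
# correlation with an observed covariate does not make the data MAR.
```

Which estimator is *more* biased depends on the data-generating process, which is exactly the letter's warning.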


2020 ◽  
Vol 8 (1) ◽  
pp. 249-271
Author(s):  
Nathan Corder ◽  
Shu Yang

Abstract The problem of missingness in observational data is ubiquitous. When the confounders are missing at random, multiple imputation is commonly used; however, the method requires congeniality conditions for valid inferences, which may not be satisfied when estimating average causal treatment effects. Alternatively, fractional imputation, proposed by Kim (2011), has been implemented to handle missing values in regression contexts. In this article, we develop fractional imputation methods for estimating the average treatment effects with confounders missing at random. We show that the fractional imputation estimator of the average treatment effect is asymptotically normal, which permits a consistent variance estimate. Via a simulation study, we compare fractional imputation’s accuracy and precision with that of multiple imputation.
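The basic fractional-imputation idea, stripped of the causal-inference machinery above, can be sketched as follows: each missing value gets several candidate imputations, and each candidate receives a fractional weight proportional to the likelihood of the observed data at that candidate. The model below (known marginal for x, unit-variance regression for y) is assumed purely for illustration; the authors' estimator for treatment effects is more involved.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M = 5000, 50
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# x (a confounder, say) is missing at random: missingness depends only
# on the fully observed y.
miss = rng.random(n) < 1 / (1 + np.exp(-y))

# Fractional imputation: for each missing x_i, draw M candidates from the
# (here known) marginal N(0, 1) and give each a fractional weight
# proportional to the likelihood of the observed y_i at that candidate.
cand = rng.normal(size=(miss.sum(), M))
w = np.exp(-0.5 * (y[miss, None] - cand) ** 2)  # ∝ N(y | x, 1) density
w /= w.sum(axis=1, keepdims=True)               # weights sum to 1 per unit

mean_cc = x[~miss].mean()                          # complete-case, biased
mean_fi = (x[~miss].sum() + (w * cand).sum()) / n  # fractional imputation
```

The weighted candidates approximate the conditional distribution of x given y, so the fractional-imputation estimate of E[x] is close to the truth while the complete-case mean is visibly biased.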


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Giulia Carreras ◽  
◽  
Guido Miccinesi ◽  
Andrew Wilcock ◽  
Nancy Preston ◽  
...  

Abstract
Background: Missing data are common in end-of-life care studies, but there has been relatively little exploration of how best to handle them and, in particular, of whether the missing at random (MAR) assumption is valid or missing not at random (MNAR) mechanisms should be assumed. In this paper we investigated this issue through a sensitivity analysis within the ACTION study, a multicenter cluster randomized controlled trial testing advance care planning in patients with advanced lung or colorectal cancer.
Methods: Multiple imputation procedures under MAR and MNAR assumptions were implemented. Possible violation of the MAR assumption was addressed with reference to variables measuring quality of life and symptoms. The MNAR model assumed that patients in worse health were more likely to have missing questionnaires, distinguishing between single missing items, which were assumed to satisfy the MAR assumption, and missing values due to a completely missing questionnaire, for which an MNAR mechanism was hypothesized. We explored the sensitivity to possible departures from MAR on gender differences between key indicators and on simple correlations.
Results: Up to 39% of follow-up data were missing. Results under MAR reflected that missingness was related to poorer health status. Correlations between variables, although very small, changed according to the imputation method, as did the differences in scores by gender, indicating a certain sensitivity of the results to violation of the MAR assumption.
Conclusions: The findings confirm the importance of undertaking this kind of analysis in end-of-life care studies.
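A common way to implement this kind of MNAR sensitivity analysis is delta adjustment: impute under MAR, then shift the imputed scores by a range of sensitivity parameters reflecting the hypothesized "worse health, more missingness" mechanism. The sketch below uses a single regression imputation and made-up score scales; it is not the ACTION study's model, only the shape of the procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)                        # fully observed covariate
q = 50 + 5 * x + rng.normal(scale=3, size=n)  # quality-of-life score

# Whole questionnaires are more likely to be missing when the (unseen)
# score is low, i.e. for patients in worse health: an MNAR mechanism.
miss = rng.random(n) < 1 / (1 + np.exp(0.15 * (q - 50)))

# Step 1: MAR imputation from the observed covariate.
reg = LinearRegression().fit(x[~miss].reshape(-1, 1), q[~miss])
q_mar = q.copy()
q_mar[miss] = (reg.predict(x[miss].reshape(-1, 1))
               + rng.normal(scale=3, size=miss.sum()))

# Step 2: delta adjustment -- shift the MAR-imputed scores downward by a
# range of sensitivity parameters and re-estimate the quantity of interest.
means = {}
for delta in (0.0, 2.0, 4.0):
    q_mnar = q_mar.copy()
    q_mnar[miss] -= delta
    means[delta] = q_mnar.mean()
```

If conclusions are stable across plausible deltas, the analysis is robust to the MAR violation; here the MAR imputation overstates the scores of the sicker, unobserved patients, which the negative deltas correct.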


2021 ◽  
Author(s):  
Sara Javadi ◽  
Abbas Bahrampour ◽  
Mohammad Mehdi Saber ◽  
Mohammad Reza Baneshi

Abstract
Background: Among the newer multiple imputation methods, Multiple Imputation by Chained Equations (MICE) is a popular approach for implementing multiple imputation because of its flexibility. Our main focus in this study is to compare the performance of parametric imputation models based on predictive mean matching (PMM) and of recursive partitioning methods within multiple imputation by chained equations in the presence of interaction in the data.
Methods: We compared the performance of parametric and tree-based imputation methods via simulation using two data generation models. For each combination of data generation model and imputation method, the following steps were performed: data generation, removal of observations, imputation, logistic regression analysis, and calculation of bias, coverage probability (CP), and confidence interval (CI) width for each coefficient. Furthermore, model-based and empirical standard errors, and the estimated proportion of the variance attributable to the missing data (λ), were calculated.
Results: We have shown by simulation that, to impute a binary response in observations involving an interaction, manually entering the interaction term into the predictive mean matching imputation model improves the performance of the PMM method compared to the recursive partitioning models in multiple imputation by chained equations. The parametric method in which we entered the interaction term into the imputation model (MICE-Interaction) led to smaller bias and slightly higher coverage probability for the interaction effect, but it had slightly wider confidence intervals than tree-based imputation (especially classification and regression trees).
Conclusions: The application of MICE-Interaction led to better performance than recursive partitioning methods in MICE. However, if the user is interested in estimating the interaction but does not know enough about the structure of the observations, recursive partitioning methods can be suggested for imputing the missing values.
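The "manually enter the interaction term" idea can be illustrated with scikit-learn's chained-equations implementation. The study above uses MICE with PMM (typically the R `mice` package); the sketch below substitutes `IterativeImputer` and a continuous outcome, and all data-generating choices are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + 2 * x1 * x2 + rng.normal(size=n)  # strong interaction

x1_miss = x1.copy()
x1_miss[rng.random(n) < 0.3] = np.nan

# "MICE-Interaction" style: manually enter the interaction term as an
# extra column so the conditional imputation models can use it. (The
# interaction is imputed as just another variable, so the identity
# inter = x1 * x2 is not enforced on imputed rows -- a known limitation
# of this shortcut.)
inter = x1_miss * x2
data = np.column_stack([x1_miss, x2, y, inter])
imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(data)
x1_hat = imputed[:, 0]
```

Because the chained models see both x2 and the interaction column, the imputed x1 values track the true values much better than an imputation model with main effects only would under this data-generating process.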


2011 ◽  
Vol 26 (S2) ◽  
pp. 572-572
Author(s):  
N. Resseguier ◽  
H. Verdoux ◽  
F. Clavel-Chapelon ◽  
X. Paoletti

Introduction: The CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases.
Objectives: To explore reasons for not completing items of the CES-D scale and to perform a sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses.
Methods: 71,412 women included in the French E3N cohort returned a questionnaire containing the CES-D scale in 2005; 45% presented at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete case analysis, single imputation, and multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions.
Results: The interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons for nonresponse were identified. MAR and MNAR hypotheses remained plausible and were explored. Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under the MAR assumption, it was 28.6%, 29.8% and 31.7% among women presenting up to 4, up to 10 and up to 20 missing values, respectively. The estimates were robust after applying various MNAR scenarios in the sensitivity analysis.
Conclusions: The CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under the MAR assumption allows missing values to be handled reliably.


2019 ◽  
Vol 8 (5) ◽  
pp. 965-989
Author(s):  
M Quartagno ◽  
J R Carpenter ◽  
H Goldstein

Abstract Multiple imputation is now well established as a practical and flexible method for analyzing partially observed data, particularly under the missing at random assumption. However, when the substantive model is a weighted analysis, there is concern about the empirical performance of Rubin’s rules and also about how to appropriately incorporate possible interaction between the weights and the distribution of the study variables. One approach that has been suggested is to include the weights in the imputation model, potentially also allowing for interactions with the other variables. We show that the theoretical criterion justifying this approach can be approximately satisfied if we stratify the weights to define level-two units in our data set and include random intercepts in the imputation model. Further, if we let the covariance matrix of the variables have a random distribution across the level-two units, we also allow imputation to reflect any interaction between weight strata and the distribution of the variables. We evaluate our proposal in a number of simulation scenarios, showing it has promising performance both in terms of coverage levels of the model parameters and bias of the associated Rubin’s variance estimates. We illustrate its application to a weighted analysis of factors predicting reception-year readiness in children in the UK Millennium Cohort Study.
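The core of the proposal above — stratify the weights and let the imputation model vary across weight strata — can be caricatured in Python. The paper fits a multilevel imputation model with random intercepts (and a random covariance matrix) across weight strata; the sketch below substitutes fixed stratum effects as a crude stand-in, and the data-generating model is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 2000
w = rng.lognormal(size=n)                  # survey weights
x = rng.normal(size=n) + 0.5 * np.log(w)   # distribution varies with weight
x_obs = x.copy()
x_obs[rng.random(n) < 0.25] = np.nan

# Stratify the weights (here into quintiles) to define level-two units,
# and enter stratum indicators in the imputation model -- a fixed-effects
# stand-in for the paper's random-intercept-by-weight-stratum proposal.
strata = pd.qcut(w, 5, labels=False)
D = pd.get_dummies(strata).to_numpy(dtype=float)
obs = ~np.isnan(x_obs)
reg = LinearRegression(fit_intercept=False).fit(D[obs], x_obs[obs])
x_imp = x_obs.copy()
x_imp[~obs] = reg.predict(D[~obs])
```

Because the distribution of x shifts with the weights, stratum-specific imputations recover that shift, which a single pooled imputation model ignoring the weights would flatten out.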


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Cattram D. Nguyen ◽  
John B. Carlin ◽  
Katherine J. Lee

Abstract Multiple imputation is a recommended method for handling incomplete data problems. One of the barriers to its successful use is the breakdown of the multiple imputation procedure, often due to numerical problems with the algorithms used within the imputation process. These problems frequently occur when imputation models contain large numbers of variables, especially with the popular approach of multivariate imputation by chained equations. This paper describes common causes of failure of the imputation procedure including perfect prediction and collinearity, focusing on issues when using Stata software. We outline a number of strategies for addressing these issues, including imputation of composite variables instead of individual components, introducing prior information and changing the form of the imputation model. These strategies are illustrated using a case study based on data from the Longitudinal Study of Australian Children.
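A quick pre-imputation screen catches the collinearity failure mode described above. The paper works in Stata; the Python sketch below (variable names invented) flags exact linear dependence via the condition number of the correlation matrix and applies one of the listed remedies: impute the composite and drop its components.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
df = pd.DataFrame({"item_a": a, "item_b": b, "total": a + b})

# A huge condition number of the correlation matrix flags the (near-)
# perfect collinearity that makes chained-equations algorithms break down.
cond = np.linalg.cond(df.corr().to_numpy())

# One remedy from the paper: impute the composite score and drop its
# components from the imputation model.
impute_df = df.drop(columns=["item_a", "item_b"]) if cond > 1e6 else df
```

The same check helps with near-collinearity (e.g. subscale scores alongside a total), where the imputation may run but produce unstable regression coefficients.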


2014 ◽  
Vol 26 (2) ◽  
pp. 707-723 ◽  
Author(s):  
Kyoji Furukawa ◽  
Dale L. Preston ◽  
Munechika Misumi ◽  
Harry M. Cullings

While data are unavoidably missing or incomplete in most observational studies, consequences of mishandling such incompleteness in analysis are often overlooked. When time-varying information is collected irregularly and infrequently over a long period, even precisely obtained data may implicitly involve substantial incompleteness. Motivated by an analysis to quantitatively evaluate the effects of smoking and radiation on lung cancer risks among Japanese atomic-bomb survivors, we provide a unique application of multiple imputation to incompletely observed smoking histories under the assumption of missing at random. Predicting missing values for the age of smoking initiation and, given initiation, smoking intensity and cessation age, analyses can be based on complete, though partially imputed, smoking histories. A simulation study shows that multiple imputation appropriately conditioned on the outcome and other relevant variables can produce consistent estimates when data are missing at random. Our approach is particularly appealing in large cohort studies where a considerable amount of time-varying information is incomplete under a mechanism depending in a complex manner on other variables. In application to the motivating example, this approach is expected to reduce estimation bias that might be unavoidable in naive analyses, while keeping efficiency by retaining known information.

