Mean Imputation
Recently Published Documents

TOTAL DOCUMENTS: 49 (FIVE YEARS: 23)
H-INDEX: 6 (FIVE YEARS: 1)

Author(s):  
A. Audu ◽  
A. Danbaba ◽  
S. K. Ahmad ◽  
N. Musa ◽  
A. Shehu ◽  
...  

Human-assisted surveys, such as medical and social science surveys, are frequently plagued by non-response or missing observations. Several authors have devised imputation algorithms to account for missing observations during analysis. Nonetheless, several of these imputation schemes' estimators rely on a known population mean of the auxiliary variable. In this paper, a new class of almost unbiased imputation methods that uses the sample mean of the auxiliary variable as an estimate of its population mean is suggested. Using the Taylor series expansion technique, the MSE of the proposed class of estimators was derived up to a first-order approximation. Conditions were also specified under which the new estimators are more efficient than the existing estimators considered. Numerical examples based on simulations revealed that the suggested class of estimators is more efficient than the alternatives studied.
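For orientation, a minimal sketch of the classical setup such estimators build on (not the new class proposed in this paper): with r respondents out of n sampled units, mean imputation reduces the point estimator to the respondent mean, while a ratio-type variant substitutes the full-sample mean of the auxiliary variable for its (often unknown) population mean; the first-order MSE shown is the standard textbook result obtained by the same Taylor expansion technique.

```latex
% Orientation only: classical mean and ratio imputation point estimators,
% with \bar y_r,\bar x_r the respondent means and \bar x_n the full-sample
% mean of the auxiliary variable replacing the known population mean \bar X.
\bar y_{\mathrm{mean}} = \bar y_r, \qquad
\bar y_{\mathrm{rat}}  = \bar y_r\,\frac{\bar x_n}{\bar x_r}.
% First-order MSE via the Taylor (e-notation) expansion
% \bar y_r=\bar Y(1+e_0),\ \bar x_r=\bar X(1+e_1),\ \bar x_n=\bar X(1+e_2):
\mathrm{MSE}(\bar y_{\mathrm{rat}})
  \approx \bar Y^{2}\Big[\big(\tfrac1r-\tfrac1N\big)C_y^{2}
        + \big(\tfrac1r-\tfrac1n\big)\big(C_x^{2}-2\rho\,C_yC_x\big)\Big].
```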


2021 ◽  
Vol 6 (2) ◽  
pp. 134-143
Author(s):  
Bijanto Bijanto ◽  
Ryan Yunus

The impact of missing data on the research process can be serious, leading to biased parameter estimates, loss of statistical information, decreased quality, increased standard errors, and weak generalization of the findings. In this paper, we discuss a limitation of the Naive Bayes Kernel algorithm: it cannot process data containing missing values. To process such data, we propose the mean imputation method. The data we use are public data from UCI, namely the HCV (Hepatitis C Virus) dataset. The imputation fills the missing entries with the average of the observed values; before mean imputation, the dataset is first bootstrapped. The data corrected using the mean imputation method are then processed with the Naive Bayes Kernel algorithm. The tests carried out yielded an accuracy of 96.05% with a computation time of 1 second.
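A minimal sketch of this kind of pipeline, assuming a local copy of the UCI HCV dataset named hcvdat0.csv and using scikit-learn's GaussianNB as a stand-in for the kernel Naive Bayes variant used in the paper (this is not the authors' code):

```python
# Sketch: mean-impute missing values, then fit a Naive Bayes classifier.
# GaussianNB stands in for the kernel-density Naive Bayes variant; the file
# name 'hcvdat0.csv' is an assumption about the local copy of the UCI data.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

df = pd.read_csv("hcvdat0.csv")                      # UCI HCV dataset (local copy)
X = df.select_dtypes("number").drop(columns=["Unnamed: 0"], errors="ignore")
y = df["Category"]

X_imp = SimpleImputer(strategy="mean").fit_transform(X)   # mean imputation
X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```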


2021 ◽  
pp. e1-e9
Author(s):  
Elizabeth A. Erdman ◽  
Leonard D. Young ◽  
Dana L. Bernson ◽  
Cici Bauer ◽  
Kenneth Chui ◽  
...  

Objectives. To develop an imputation method to produce estimates for suppressed values within a shared government administrative data set to facilitate accurate data sharing and statistical and spatial analyses. Methods. We developed an imputation approach that incorporated known features of suppressed Massachusetts surveillance data from 2011 to 2017 to predict missing values more precisely. Our methods for 35 de-identified opioid prescription data sets combined modified previous or next substitution followed by mean imputation and a count adjustment to estimate suppressed values before sharing. We modeled 4 methods and compared the results to baseline mean imputation. Results. We assessed performance by comparing root mean squared error (RMSE), mean absolute error (MAE), and proportional variance between imputed and suppressed values. Our method outperformed mean imputation; we retained 46% of the suppressed values' proportional variance with better precision (22% lower RMSE and 26% lower MAE) than simple mean imputation. Conclusions. Our easy-to-implement imputation technique largely overcomes the adverse effects of low-count value suppression with superior results to simple mean imputation. This novel method is generalizable to researchers sharing protected public health surveillance data. (Am J Public Health. Published online ahead of print September 16, 2021: e1–e9. https://doi.org/10.2105/AJPH.2021.306432)
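A small sketch of the evaluation metrics named in the Results (RMSE, MAE, and the share of the suppressed values' variance retained by the imputations); this is illustrative, not the authors' code:

```python
# Compare imputed estimates against the true (suppressed) values using RMSE,
# MAE, and the proportional variance retained by the imputations.
import numpy as np

def evaluate(true_vals, imputed_vals):
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    rmse = np.sqrt(np.mean((imputed_vals - true_vals) ** 2))
    mae = np.mean(np.abs(imputed_vals - true_vals))
    prop_var = np.var(imputed_vals) / np.var(true_vals)  # variance retained
    return rmse, mae, prop_var
```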


2021 ◽  
Vol 29 (2) ◽  
Author(s):  
Nurul Azifah Mohd Pauzi ◽  
Yap Bee Wah ◽  
Sayang Mohd Deni ◽  
Siti Khatijah Nor Abdul Rahim ◽  
Suhartono

High-quality data are essential in every field of research for valid findings. Missing data are common in datasets and occur for a variety of reasons, such as incomplete responses, equipment malfunction and data entry errors. Single and multiple imputation methods have been developed for imputing missing values. This study investigated the performance of single imputation using the mean and of multiple imputation using Multivariate Imputation by Chained Equations (MICE) via a simulation study. Data missing completely at random (MCAR) were generated for ten levels of missing rates (proportions of missing data), from 5% to 50%, and for different sample sizes. Mean Square Error (MSE) was used to evaluate the performance of the imputation methods. The choice of imputation method depends on the data type: mean imputation is commonly used to impute missing values of a continuous variable, while the MICE method can handle both continuous and categorical variables. The simulation results indicate that group mean imputation (GMI) performed better than overall mean imputation (OMI) and MICE, with the lowest MSE for all sample sizes and missing rates. The MSE of OMI, GMI and MICE increases as the missing rate increases. The MICE method performed worst (i.e. highest MSE) when the percentage of missing data exceeded 15%. Overall, GMI is superior to OMI and MICE for all missing rates and sample sizes under the MCAR mechanism. An application to a real dataset confirmed the findings of the simulation study. These findings can help researchers and practitioners decide which imputation method is more suitable when data contain missing values.
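A minimal sketch contrasting overall mean imputation (OMI) and group mean imputation (GMI) under MCAR, with MSE computed on the imputed cells; the column names and the simulated data are hypothetical, not the study's design:

```python
# OMI fills gaps with the overall column mean; GMI fills them with the mean
# of the record's group. MSE is evaluated only on the cells that were masked.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"group": rng.integers(0, 3, 500),
                   "y": rng.normal(50, 10, 500)})
y_true = df["y"].copy()
mask = rng.random(500) < 0.20            # 20% missing completely at random (MCAR)
df.loc[mask, "y"] = np.nan

omi = df["y"].fillna(df["y"].mean())                               # overall mean
gmi = df["y"].fillna(df.groupby("group")["y"].transform("mean"))   # group mean

for name, imp in [("OMI", omi), ("GMI", gmi)]:
    mse = np.mean((imp[mask] - y_true[mask]) ** 2)
    print(name, "MSE on imputed cells:", round(mse, 2))
```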


2021 ◽  
Author(s):  
Shuo Feng ◽  
Celestin Hategeka ◽  
Karen Ann Grépin

Abstract Background: Poor data quality is limiting the greater use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important part of this problem is missing values, where health facilities, for a variety of reasons, fail to submit their reports to the central system. Methods: Using data from the Health Management Information System (HMIS) and the advent of the COVID-19 pandemic in the Democratic Republic of the Congo (DRC) as an illustrative case study, we implemented six commonly used imputation methods on the DRC's HMIS datasets and evaluated their performance through various statistical techniques: simple linear regression, segmented regression (widely used in interrupted time series studies), parametric comparisons through t-tests, and non-parametric comparisons through Wilcoxon rank-sum tests. We also examined the performance of these six imputation methods under different missingness mechanisms and tested their stability to changes in the data. Results: For the regression analyses, there was no substantial difference in the results generated by the methods, except mean imputation and exclusion & interpolation, when the RHIS dataset contained less than 20% missing values. However, as the missing proportion grew, machine learning methods such as missForest and k-NN started to produce biased estimates, and they also lacked robustness to minimal changes in the data and to consecutive missingness. On the other hand, multiple imputation generated the least biased estimates overall and was the most robust to all changes in the data. For comparing group means through t-tests, the results from mean imputation and exclusion & interpolation disagreed with the true inference obtained from the complete data, suggesting that these two methods not only lead to biased regression estimates but also generate unreliable t-test results. Conclusions: We recommend the use of multiple imputation to address missing values in RHIS datasets. In cases where the computing resources necessary for multiple imputation are unavailable, seasonal decomposition may be considered the next best method. Mean imputation and exclusion & interpolation, however, consistently produced biased and misleading results in the subsequent analyses, so their use for handling missing values should be discouraged. Keywords: Missing Data; Routine Health Information Systems (RHIS); Health Management Information System (HMIS); Health Services Research; Low and middle-income countries (LMICs); Multiple imputation
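As a rough sketch of the recommended multiple-imputation approach, the example below uses scikit-learn's IterativeImputer with posterior sampling to create several completed versions of a facility-by-month reporting matrix; the matrix, its shape and the number of completions are assumptions, and full multiple imputation would additionally pool the downstream estimates (e.g., via Rubin's rules):

```python
# Multiple stochastic completions of a small reporting matrix
# (rows = facilities, columns = monthly report counts, NaN = missing report).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

reports = np.array([[120., 115., np.nan, 130.],
                    [ 80., np.nan,  85.,  90.],
                    [200., 210., 205., np.nan]])

imputed_sets = []
for seed in range(5):                       # several stochastic completions
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_sets.append(imp.fit_transform(reports))

# Downstream analyses (e.g., segmented regression) would be run on each
# completed set and the estimates pooled, not on a single fill-in.
print(np.mean(imputed_sets, axis=0))
```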


Author(s):  
Madeline D. Cabauatan et al.

The main objective of the study was to evaluate item nonresponse procedures through a simulation study of different nonresponse levels, or missing rates. The simulation explored how each procedure performs under a variety of circumstances and investigated the performance of procedures suggested for item nonresponse under various conditions and variable trends. The imputation methods considered were cell mean imputation, random hot-deck, nearest neighbor, and simple regression imputation. The variables studied are among the major indicators for measuring productive labor and decent work in the country. For the purpose of this study, the researcher evaluated methods for imputing missing data on the number of workers and the total cost of labor per establishment from the World Bank's 2015 Enterprise Survey for the Philippines. The performance of the imputation techniques for item nonresponse was evaluated in terms of bias and coefficient of variation, for accuracy and precision respectively. Based on the results, cell mean imputation was the most appropriate method for imputing missing values of the total number of workers and the total cost of labor per establishment. Since the study was limited to the variables cited, exploring other labor indicators is recommended. Moreover, exploring other choices of clustering groups is highly recommended, as the clustering groups have a large effect on the resulting imputation estimates. It is also recommended to explore other imputation techniques, such as multiple regression, and other parametric models for nonresponse, such as the Bayes estimation method. For regression-based imputation, since the study was limited to using the cluster groupings for estimation, it is highly recommended to use other variables that might be related to the variable of interest to verify the results of this study.
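A minimal sketch of two of the procedures compared (cell mean imputation and random hot-deck within the same cell); the 'cell' grouping column and the variable argument are hypothetical stand-ins for the survey's clustering groups and labor variables:

```python
# Cell mean imputation fills each gap with the mean of its cell (cluster);
# random hot-deck draws a donor value at random from observed records in the
# same cell. Both return a completed copy of the chosen column.
import numpy as np
import pandas as pd

def cell_mean_impute(df, cell, var):
    return df[var].fillna(df.groupby(cell)[var].transform("mean"))

def random_hotdeck_impute(df, cell, var, seed=0):
    rng = np.random.default_rng(seed)
    out = df[var].copy()
    for _, idx in df.groupby(cell).groups.items():
        values = out.loc[idx]
        donors = values.dropna()
        missing = values[values.isna()].index
        if len(donors) and len(missing):
            out.loc[missing] = rng.choice(donors.values, size=len(missing))
    return out
```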


2021 ◽  
Author(s):  
Jesse M. Vance ◽  
Kim Currie ◽  
John Zeldis ◽  
Peter Dillingham ◽  
Cliff S. Law

Abstract. Regularized time series of ocean carbon data are necessary for assessing seasonal dynamics, annual budgets, interannual variability and long-term trends. There are, however, no standardized methods for imputing gaps in ocean carbon time series, and only limited evaluation of the numerous methods available for constructing uninterrupted time series. A comparative assessment of eight imputation models was performed using data from seven long-term monitoring sites. Multivariate linear regression (MLR), mean imputation, linear interpolation, spline interpolation, Stineman interpolation, Kalman filtering, weighted moving average and multiple imputation by chained equations (MICE) models were compared using cross-validation to determine error and bias. A bootstrapping approach was employed to determine model sensitivity to varied degrees of data gaps, and secondary time series with artificial gaps were used to evaluate impacts on seasonality and annual summations and to estimate uncertainty. All models were fit to dissolved inorganic carbon (DIC) time series, with the MLR and MICE models also applied to field measurements of temperature, salinity and remotely sensed chlorophyll, with model coefficients fit for monthly mean conditions. MLR estimated DIC with a mean error of 8.8 µmol kg−1 among the 5 oceanic sites and 20.0 µmol kg−1 among the 2 coastal sites. The empirical methods of MLR, MICE and mean imputation retained observed seasonal cycles over greater amounts and durations of gaps, resulting in lower error in annual budgets and outperforming the other statistical methods. MLR had lower bias and sampling sensitivity than MICE and mean imputation and provided the most robust option for imputing time series with gaps of various durations.
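A rough sketch of the MLR gap filling described above: regress DIC on the predictor fields where DIC is observed, then predict the gaps. The column names are assumptions, and the paper additionally fits coefficients for monthly mean conditions rather than a single global fit:

```python
# Fill gaps in a DIC column by regressing it on co-located predictors
# (temperature, salinity, chlorophyll) and predicting where DIC is missing.
import pandas as pd
from sklearn.linear_model import LinearRegression

def mlr_fill(df, target="DIC", predictors=("temp", "sal", "chl")):
    cols = list(predictors)
    known = df.dropna(subset=[target, *cols])
    model = LinearRegression().fit(known[cols], known[target])
    filled = df[target].copy()
    gaps = df[target].isna() & df[cols].notna().all(axis=1)
    filled[gaps] = model.predict(df.loc[gaps, cols])
    return filled
```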


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Kaja Wasik ◽  
Tomaz Berisa ◽  
Joseph K. Pickrell ◽  
Jeremiah H. Li ◽  
Dana J. Fraser ◽  
...  

Abstract Background: Low-pass sequencing has been proposed as a cost-effective alternative to genotyping arrays for identifying genetic variants that influence multifactorial traits in humans. For common diseases, this has typically required both large sample sizes and comprehensive variant discovery. Genotyping arrays are also routinely used to perform pharmacogenetic (PGx) experiments, where sample sizes are likely to be significantly smaller but clinically relevant effect sizes are likely to be larger. Results: To assess how low-pass sequencing compares to array-based genotyping for PGx, we compared a low-pass assay (in which 1x coverage or less of a target genome is sequenced) along with software for genotype imputation to standard approaches. We sequenced 79 individuals to 1x genome coverage and genotyped the same samples on the Affymetrix Axiom Biobank Precision Medicine Research Array (PMRA). We then down-sampled the sequencing data to 0.8x, 0.6x, and 0.4x coverage and performed imputation. Both the genotype data and the sequencing data were further used to impute human leukocyte antigen (HLA) genotypes for all samples. We compared the sequencing data and the genotyping array data in terms of four metrics: overall concordance, concordance at single nucleotide polymorphisms in pharmacogenetics-related genes, concordance in imputed HLA genotypes, and imputation r2. Overall concordance between the two assays ranged from 98.2% (for 0.4x coverage sequencing) to 99.2% (for 1x coverage sequencing), with qualitatively similar numbers for the subsets of variants most important in pharmacogenetics. At common single nucleotide polymorphisms (SNPs), the mean imputation r2 from the genotyping array was 0.90, which was comparable to the imputation r2 from 0.4x coverage sequencing, while the mean imputation r2 from 1x sequencing data was 0.96. Conclusions: These results indicate that low-pass sequencing to a depth above 0.4x coverage attains higher power for association studies than the PMRA and should be considered a competitive alternative to genotyping arrays for trait mapping in pharmacogenetics.
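A small sketch of the two headline metrics (overall genotype concordance and imputation r2, the squared correlation between imputed dosages and truth); the 0/1/2 genotype coding and the function names are assumptions for illustration, not the study's software:

```python
# Overall concordance: fraction of sites where two genotype calls agree.
# Imputation r2: squared Pearson correlation between imputed and true dosages.
import numpy as np

def concordance(geno_a, geno_b):
    geno_a, geno_b = np.asarray(geno_a), np.asarray(geno_b)
    return np.mean(geno_a == geno_b)

def imputation_r2(imputed_dosage, true_dosage):
    return np.corrcoef(imputed_dosage, true_dosage)[0, 1] ** 2
```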


2021 ◽  
Vol 2020 (1) ◽  
pp. 511-518
Author(s):  
Iman Jihad Fadillah ◽  
Chaterina Dwi Puspita

In 2020, almost every country in the world faced the COVID-19 outbreak, including Indonesia. One impact of the COVID-19 pandemic has been the disruption of statistical activities, such as the postponement or suspension of survey and census data collection and of other data collection. Meanwhile, to meet the demand and need for data during the COVID-19 pandemic, national statistical offices must continue to collect data and provide official statistics. National statistical offices have therefore had to adapt their census and survey processes, for example by seeking alternative data collection modes, reducing sample sizes, modifying sample designs, reducing the number of questionnaire items, and so on. These adaptations of census/survey data collection during the COVID-19 pandemic affect the quality of the resulting data, one consequence being missing data. One method that can be used to address missing data is data imputation, and a frequently used machine-learning-based imputation method is Weighted K-Nearest Neighbor Imputation (Weighted KNNI). The Weighted KNNI method achieved better accuracy than the other two imputation methods compared (Unweighted KNNI and Mean Imputation) at every percentage of missing data, in terms of both RMSE and MAPE. Based on these results, the Weighted KNNI method can be used as one solution for handling incomplete data during the current COVID-19 pandemic.
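A minimal sketch of the comparison described, using scikit-learn's KNNImputer with distance weighting (Weighted KNNI) versus uniform weighting (Unweighted KNNI), and SimpleImputer as the mean-imputation baseline; the toy matrix is illustrative only:

```python
# Three imputations of the same incomplete matrix: distance-weighted KNN,
# unweighted KNN, and column-mean imputation.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [4.0, 5.0, 9.0],
              [2.0, 3.0, 5.0]])

weighted_knni   = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)
unweighted_knni = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
mean_imp        = SimpleImputer(strategy="mean").fit_transform(X)
```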

