Comparison of Imputation Methods for Handling Missing Categorical Data with Univariate Pattern
DOI:
https://doi.org/10.46661/revmetodoscuanteconempresa.2196
Keywords:
Imputation methods, hot-deck, polytomous regression, random forests, smoking habits, missing categorical data
Abstract
This paper examines sample proportion estimates in the presence of univariate missing categorical data. A database on smoking habits (the 2011 National Addiction Survey of Mexico) was used to create simulated yet realistic datasets with 5% and 15% missingness, each under the missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) mechanisms. The performance of six methods for addressing missingness is then evaluated: listwise deletion, mode imputation, random imputation, hot-deck, imputation by polytomous regression, and random forests. Results showed that hot-deck and polytomous regression were the most effective methods for handling missing categorical data in most of the scenarios assessed.
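The simulation design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the category labels, population weights, sample size and seed below are hypothetical and do not come from the survey data. The sketch covers only the MCAR mechanism at a 15% rate and three of the simpler methods (listwise deletion, mode imputation, random imputation); hot-deck, polytomous regression and random forests are omitted.

```python
import random
from collections import Counter


def proportions(values):
    """Sample proportion of each category, ignoring missing entries (None)."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    return {cat: counts[cat] / len(observed) for cat in sorted(counts)}


def make_mcar(values, rate, rng):
    """Blank out a fraction `rate` of entries chosen uniformly at random (MCAR)."""
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_mode(values):
    """Replace every missing entry with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def impute_random(values, rng):
    """Replace every missing entry with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical smoking-status variable; weights are illustrative only.
categories = ["never", "former", "current"]
complete = rng.choices(categories, weights=[0.5, 0.3, 0.2], k=2000)
incomplete = make_mcar(complete, rate=0.15, rng=rng)

results = {
    "complete": proportions(complete),
    "listwise": proportions(incomplete),  # listwise deletion: use observed cases only
    "mode": proportions(impute_mode(incomplete)),
    "random": proportions(impute_random(incomplete, rng)),
}
for name, props in results.items():
    print(name, {cat: round(p, 3) for cat, p in props.items()})
```

Comparing each row with the `complete` row gives a rough sense of the bias each method introduces into the proportion estimates. In this setup, mode imputation can never lower the share of the modal category, because every missing entry is assigned to it.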
License
Copyright (c) 2014 Revista de Métodos Cuantitativos para la Economía y la Empresa

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Submission of a manuscript implies that the work described has not been published before (except in the form of an abstract or as part of a thesis), that it is not under consideration for publication elsewhere and that, in case of acceptance, the authors agree to the automatic transfer of copyright to the Journal for its publication and dissemination. Authors retain the right to use and share the article for personal, institutional or scholarly sharing purposes; in addition, they retain patent, trademark and other intellectual property rights (including research data).
All articles are published in the Journal under the Creative Commons license CC-BY-SA (Attribution-ShareAlike). Commercial use of the work is allowed (always with author attribution), as are derivative works, which must be released under the same license as the original work.
Up to Volume 21, this Journal licensed its articles under the Creative Commons license CC-BY-SA 3.0 ES. Starting from Volume 22, the Creative Commons license CC-BY-SA 4.0 is used.