Comparison of Imputation Methods for Handling Missing Categorical Data with Univariate Pattern // Una comparación de métodos de imputación de variables categóricas con patrón univariado


  • Juan Armando Torres Munguía, Maestría en Estadística Aplicada, Instituto Tecnológico y de Estudios Superiores de Monterrey (Mexico)

Keywords:

Imputation methods, hot-deck, polytomous regression, random forests, smoking habits, missing categorical data


This paper examines the estimation of sample proportions in the presence of univariate missing categorical data. A database on smoking habits (the 2011 National Addiction Survey of Mexico) was used to create simulated yet realistic datasets with 5% and 15% missingness under each of three mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The performance of six methods for addressing missingness was then evaluated: listwise deletion, mode imputation, random imputation, hot-deck imputation, imputation by polytomous regression, and random forests. The results showed that, in most of the scenarios assessed, the most effective methods for dealing with missing categorical data were the hot-deck and polytomous regression approaches.
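The simulation design summarized above can be illustrated with a minimal sketch. It is not the author's code and does not use the survey data: the category labels, the covariate (sex), the category weights, and the missingness probabilities are all hypothetical values chosen for illustration. Only four of the six methods are shown (listwise deletion, mode imputation, random imputation, and a simple within-class hot-deck); polytomous regression and random forests are left out to keep the example dependency-free.

```python
import random
from collections import Counter

CATEGORIES = ["never", "former", "current"]  # hypothetical smoking-status levels

def make_data(n, rng):
    """Synthetic (sex, status) records; the status distribution depends on sex."""
    weights = {"F": [0.70, 0.18, 0.12], "M": [0.45, 0.25, 0.30]}  # made-up values
    data = []
    for _ in range(n):
        sex = rng.choice(["F", "M"])
        status = rng.choices(CATEGORIES, weights=weights[sex])[0]
        data.append((sex, status))
    return data

def make_missing(data, rate, mechanism, rng):
    """Return the status column with None wherever the value is deleted."""
    out = []
    for sex, status in data:
        if mechanism == "MCAR":      # same probability for every record
            p = rate
        elif mechanism == "MAR":     # depends only on the observed covariate
            p = rate * (1.5 if sex == "M" else 0.5)
        elif mechanism == "MNAR":    # depends on the value that goes missing
            p = rate * (2.0 if status == "current" else 0.75)
        else:
            raise ValueError(mechanism)
        out.append(None if rng.random() < p else status)
    return out

def listwise(data, y, rng):
    """Drop incomplete records."""
    return [v for v in y if v is not None]

def mode_impute(data, y, rng):
    """Fill every gap with the most frequent observed category."""
    mode = Counter(v for v in y if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in y]

def random_impute(data, y, rng):
    """Fill each gap with a draw from all observed values."""
    observed = [v for v in y if v is not None]
    return [rng.choice(observed) if v is None else v for v in y]

def hot_deck(data, y, rng):
    """Fill each gap with a random donor from respondents of the same sex."""
    donors = {"F": [], "M": []}
    for (sex, _), v in zip(data, y):
        if v is not None:
            donors[sex].append(v)
    return [rng.choice(donors[sex]) if v is None else v
            for (sex, _), v in zip(data, y)]

def proportions(values):
    counts = Counter(values)
    return {c: counts[c] / len(values) for c in CATEGORIES}

METHODS = {"listwise": listwise, "mode": mode_impute,
           "random": random_impute, "hot-deck": hot_deck}

def run(n=5000, seed=1):
    rng = random.Random(seed)
    data = make_data(n, rng)
    truth = proportions([s for _, s in data])
    results = {}
    for mechanism in ("MCAR", "MAR", "MNAR"):
        for rate in (0.05, 0.15):
            y = make_missing(data, rate, mechanism, rng)
            for name, method in METHODS.items():
                est = proportions(method(data, y, rng))
                # largest absolute error over the three categories
                results[(mechanism, rate, name)] = max(
                    abs(est[c] - truth[c]) for c in CATEGORIES)
    return truth, results

if __name__ == "__main__":
    truth, results = run()
    print("true proportions:", truth)
    for key, err in sorted(results.items()):
        print(key, round(err, 4))
```

Each scenario is scored here by the largest absolute difference between the estimated and the complete-data proportions, which is only one possible criterion. A single replicate per scenario is run for brevity; a fuller study would repeat each scenario many times and average the errors.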









How to cite

Torres Munguía, J. A. (2016). Comparison of Imputation Methods for Handling Missing Categorical Data with Univariate Pattern // Una comparación de métodos de imputación de variables categóricas con patrón univariado. Revista De Métodos Cuantitativos Para La Economía Y La Empresa, 17, pp. 101–120. Retrieved from