ABSTRACT

We present parametric imputation methods for data with a single incomplete variable Y and fully observed covariates X. For a continuous incomplete variable, imputation can be based on a normal linear regression model (i.e., regressing Y on X). For non-continuous variables, imputation can be based on generalized linear models. For binary data in particular, a commonly used imputation model is the logistic regression model. We compare logistic regression imputation with discriminant analysis imputation and conclude that the former is more robust to model misspecification. In addition, we show that the classic imputation algorithm for logistic models works well even though it only approximates the posterior distribution of the parameters. We compare several strategies for handling data separation (or perfect prediction) in logistic regression imputation, which often occurs when the distribution of the binary data is highly unbalanced. We then begin to address the missing covariate problem, in which one covariate in a targeted regression analysis is incomplete. We demonstrate that, when imputing the missing covariate, the outcome variable of the regression needs to be included as a predictor in the imputation model. All of these imputation methods can be implemented using either the R package mice or SAS PROC MI. Real-data examples include U.S. birth data (e.g., missing gestational age) and BRFSS survey data.
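
As an illustration of the logistic regression imputation summarized above, the following Python sketch fits the model to the complete cases, draws the coefficients from a normal approximation to their posterior centered at the maximum likelihood estimate, and then draws the missing binary values from the resulting Bernoulli distributions. It is a minimal sketch under stated assumptions, not the implementation used in mice or PROC MI: the function names `fit_logistic` and `impute_binary`, the coding of missing values as NaN, and the plain Newton-Raphson fit are hypothetical choices made for illustration, and the code does not handle data separation.

```python
import numpy as np


def fit_logistic(design, y, n_iter=25):
    """Newton-Raphson fit of a logistic regression (illustrative helper).

    Returns the maximum likelihood estimate and the inverse of the
    observed information, used here as an approximate covariance.
    Assumes the data are not separated; otherwise the iterations diverge.
    """
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        weights = p * (1.0 - p)
        information = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(information, design.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    weights = p * (1.0 - p)
    cov = np.linalg.inv(design.T @ (design * weights[:, None]))
    return beta, cov


def impute_binary(X, y, rng):
    """Impute missing entries (NaN) of a binary y once, given covariates X.

    Hypothetical helper: call it repeatedly to obtain multiple imputations.
    """
    missing = np.isnan(y)
    design = np.column_stack([np.ones(len(y)), X])  # add an intercept
    # Step 1: fit the imputation model to the complete cases.
    beta_hat, cov = fit_logistic(design[~missing], y[~missing])
    # Step 2: draw coefficients from the normal approximation N(beta_hat, cov).
    beta_star = rng.multivariate_normal(beta_hat, cov)
    # Step 3: draw each missing value from Bernoulli(p) using the drawn coefficients.
    p_missing = 1.0 / (1.0 + np.exp(-design[missing] @ beta_star))
    y_imputed = y.copy()
    y_imputed[missing] = rng.binomial(1, p_missing)
    return y_imputed
```

Calling `impute_binary` M times with the same random generator yields M completed data sets. Drawing the coefficients in the second step, rather than reusing the point estimate, is what carries parameter uncertainty into the imputations.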