ABSTRACT

We provide an overview of the multiple imputation approach to analyzing data with missing values. It consists of three stages. In the imputation stage, missing values are replaced by draws from their posterior predictive distributions; this process is repeated M times, yielding M completed datasets. In the analysis stage, each completed dataset is analyzed by standard complete-data statistical procedures, producing M sets of results. Finally, in the combining stage, these M sets of results are combined into a single set of inferences using Rubin's combining rules. We provide a theoretical justification for multiple imputation, showing that it approximates a fully Bayesian analysis, and we discuss the between-imputation and within-imputation variances used in the combining step. The combining rules are very general and can be applied to hypothesis tests (e.g., p-values) for multivariate estimands. A principled imputation algorithm is data augmentation, which iterates between drawing missing values and drawing parameters from their posterior (predictive) distributions. Data augmentation can sometimes be approximated by replacing the posterior distribution with the sampling distribution (or bootstrap distribution) of the maximum likelihood estimates. All of these algorithms ensure that the multiple imputation is proper, that is, that it correctly accounts for parameter uncertainty. Many imputation methods and algorithms have been implemented in major statistical software packages such as SAS and R.
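For a scalar estimand, the combining stage described above can be sketched as follows. This is a minimal illustration in Python, not an implementation from any particular package; the function name `rubin_combine` and its interface are hypothetical, and the inputs are assumed to be the M per-dataset point estimates and their estimated variances.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine M complete-data results via Rubin's combining rules.

    estimates: the M point estimates, one per completed dataset.
    variances: the M within-imputation (complete-data) variance estimates.
    Returns the combined estimate, total variance, and Rubin's
    reference degrees of freedom. (Hypothetical helper for illustration.)
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    q_bar = estimates.mean()            # combined point estimate
    w = variances.mean()                # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / M) * b             # total variance
    df = (M - 1) * (1 + w / ((1 + 1 / M) * b)) ** 2  # reference df
    return q_bar, t, df
```

The `(1 + 1/M)` factor inflates the between-imputation variance to account for using a finite number of imputations M.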