*by Paul von Hippel*

*January 9, 2018*

When using multiple imputation, you may wonder how many imputations you need. A simple answer is that more imputations are better. As you add more imputations, your estimates get more *precise*, meaning they have smaller standard errors. And your estimates get more *replicable*, meaning they would not change too much if you imputed the data again.

There are limits, though. No matter how many imputations you use, multiple imputation estimates can never be more precise or replicable than maximum likelihood estimates. And beyond a certain number of imputations, any improvement in precision and replicability becomes negligible.

So how many imputations are enough? An old rule of thumb is that 3 to 10 imputations typically suffice (Rubin 1987). But that advice only addresses the precision and replicability of your *point estimates*. If you also want your *standard error (SE) estimates* to be replicable, you may need more imputations (Bodner 2008; Graham, Olchowski, & Gilreath 2007; White, Royston, & Wood 2011).

I’ve just published an article (von Hippel 2018) that calculates how many imputations you need for replicable SE estimates. I implemented the calculation in the new Stata command **how_many_imputations**, which you can install from the Stata command line by typing **ssc install how_many_imputations**. I also implemented the calculation in the new SAS macro **%mi_combine**, which you can download from this Google Drive folder; the folder also contains code and data illustrating the macro’s use.

Here’s a brief summary of what the new software does. For replicable SE estimates, the number of imputations you need is about

*M* = 1 + ½ (*FMI* / *CV(se)*)²

(von Hippel, 2018), where *FMI* is the fraction of missing information, and *CV(se)* is a coefficient of variation, which you can think of as roughly the percentage by which you’d be willing to see the SE estimate change if the data were imputed again. If you have *FMI*=30% missing information, for example, and you would accept the SE estimate changing by 10% if you imputed the data again, then you’ll need *M*=5 or 6 imputations. But if you’d only accept the SE changing by 5%, then you’ll need *M*=19 imputations.
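To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function name `imputations_needed` is my own label for illustration; it is not part of the Stata command or the SAS macro.

```python
def imputations_needed(fmi, cv_se):
    """Quadratic rule from the text: M = 1 + 0.5 * (FMI / CV(se))**2.

    fmi   -- fraction of missing information, as a proportion (0.30 for 30%)
    cv_se -- acceptable coefficient of variation of the SE estimate (0.10 for 10%)
    """
    if not 0 <= fmi <= 1:
        raise ValueError("fmi must be a proportion between 0 and 1")
    if cv_se <= 0:
        raise ValueError("cv_se must be positive")
    return 1 + 0.5 * (fmi / cv_se) ** 2


# The two examples from the text, rounded to hide floating-point noise.
print(round(imputations_needed(0.30, 0.10), 2))  # 5.5, i.e. 5 or 6 imputations
print(round(imputations_needed(0.30, 0.05), 2))  # 19.0
```

Because *CV(se)* enters as a squared term, halving the tolerated change in the SE roughly quadruples the number of imputations beyond the first.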

The only problem with this formula is that *FMI* is not known in advance. *FMI* is not the fraction of *values* that are missing; instead, *FMI* is the fraction of *information* that is missing about the parameters. And *FMI* has to be estimated, typically by multiple imputation.
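As a rough illustration of how *FMI* can be estimated from imputed data, the sketch below applies Rubin's combining rules to the point estimates and standard errors obtained from each imputed dataset. It uses the simple large-sample ratio of between-imputation variance to total variance; the Stata and SAS software may apply refinements that are not shown here, and the function name and example numbers are made up for illustration.

```python
from statistics import mean, variance


def estimate_fmi(estimates, std_errors):
    """Large-sample FMI estimate for one parameter.

    estimates  -- the parameter's point estimate from each imputed dataset
    std_errors -- the matching standard error from each imputed dataset
    """
    m = len(estimates)
    if m < 2 or len(std_errors) != m:
        raise ValueError("need at least two imputations, with one SE per estimate")
    within = mean([se ** 2 for se in std_errors])  # average squared SE
    between = variance(estimates)                  # variance across imputations
    inflated_between = (1 + 1 / m) * between       # finite-M correction
    total = within + inflated_between              # total variance
    return inflated_between / total


# Hypothetical results from five imputed datasets.
fmi_hat = estimate_fmi([1.0, 1.2, 0.8, 1.1, 0.9], [0.2, 0.2, 0.2, 0.2, 0.2])
print(round(fmi_hat, 3))
```

Note that the inputs are estimates and standard errors, not the missing values themselves, which is why *FMI* can differ sharply from the fraction of values that are missing.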

For that reason, I recommend a two-step recipe (von Hippel, 2018):

- First, carry out a pilot analysis. Impute the data using a convenient number of imputations. (20 imputations is a reasonable default, if it doesn’t take too long.) Estimate the *FMI* by analyzing the imputed data.
- Next, plug the estimated *FMI* into the formula above to figure out how many imputations you need to achieve a certain value of *CV(se)*. If you need more imputations than you had in the pilot, then add those imputations and analyze the data again.

There is a small wrinkle: when you plug an estimate of *FMI* into the formula, you shouldn’t use a point estimate. Instead, you should use the upper bound of a 95% confidence interval for *FMI*. That way you’ll only have a 2.5% chance of not having enough imputations in your final analysis.
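The two-step recipe, including the upper-bound adjustment, can be sketched in Python as follows. One assumption goes beyond this post: the confidence interval is built on the logit scale with a standard error of sqrt(2 / M), where M is the number of pilot imputations. That is my reading of the approach in von Hippel (2018), so check the article before relying on it. The function names are hypothetical and do not mirror the Stata or SAS interfaces.

```python
import math


def fmi_upper_bound(fmi_hat, pilot_m, z=1.96):
    """Upper limit of an approximate 95% confidence interval for FMI.

    ASSUMPTION (not stated in this post): logit(FMI estimate) is treated as
    roughly normal with standard error sqrt(2 / pilot_m).
    """
    if not 0 < fmi_hat < 1:
        raise ValueError("fmi_hat must be strictly between 0 and 1")
    logit = math.log(fmi_hat / (1 - fmi_hat))
    upper_logit = logit + z * math.sqrt(2 / pilot_m)
    return 1 / (1 + math.exp(-upper_logit))  # back-transform to a proportion


def final_imputations(fmi_hat, pilot_m, cv_se):
    """Step 2: plug the upper bound for FMI into M = 1 + 0.5 * (FMI / CV(se))**2."""
    upper = fmi_upper_bound(fmi_hat, pilot_m)
    needed = math.ceil(1 + 0.5 * (upper / cv_se) ** 2)
    # Never report fewer imputations than the pilot already produced.
    return max(needed, pilot_m)


# Pilot with 20 imputations and an estimated FMI of 30%; target CV(se) of 5%.
total = final_imputations(fmi_hat=0.30, pilot_m=20, cv_se=0.05)
print("imputations needed in total:", total)
print("imputations to add to the pilot:", total - 20)
```

Using the upper bound rather than the point estimate makes the recipe conservative: it asks for more imputations than the point estimate alone would suggest.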

This two-step recipe is implemented in my new Stata and SAS software. There’s more explanation in von Hippel (2018).

## Reference

von Hippel, Paul T. (2018). “How many imputations do you need? A two-stage calculation using a quadratic rule.” *Sociological Methods and Research*, published online, behind a paywall. A free pre-publication version is available as an arXiv e-print.

## See also

Bodner, T. E. (2008). What Improves with Increased Missing Data Imputations? *Structural Equation Modeling*, *15*(4), 651–675. https://doi.org/10.1080/10705510802339072

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. *Prevention Science*, *8*(3), 206–213. https://doi.org/10.1007/s11121-007-0070-9

Rubin, D. B. (1987). *Multiple imputation for nonresponse in surveys*. New York: Wiley.

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. *Statistics in Medicine*, *30*(4), 377–399. https://doi.org/10.1002/sim.4067

Dear Mr Von Hippel, An R implementation of your formula was written by Jos Herrickson: https://gist.github.com/josherrickson/db8f828556a9dce2e3013c8be0ca5e05. This might help others who use R. Kind regards, TB


Unfortunately, it is not available for the latest version of R


Thank you for your macro! I just have a quick question. How does this macro work if I have categorical data? Specifically, I have been using a GEE model to look at a binary outcome with categorical confounding factors. How can I use the macro to combine estimates if I cannot use the “mean” function, since these are categorical variables? Should I just use PROC MIANALYZE?

thank you!


Hi, Katy. The formula, Stata command, and SAS macro are quite general and should work for any situation where you have point estimates and standard errors. They don’t assume that you’re estimating a mean and won’t be bothered by the fact that you’re estimating a GEE model. DM me if you have trouble.
