by Paul von Hippel
January 9, 2018
When using multiple imputation, you may wonder how many imputations you need. A simple answer is that more imputations are better. As you add more imputations, your estimates get more precise, meaning they have smaller standard errors. And your estimates get more replicable, meaning they would not change too much if you imputed the data again.
There are limits, though. No matter how many imputations you use, multiple imputation estimates can never be more precise or replicable than maximum likelihood estimates. And beyond a certain number of imputations, any improvement in precision and replicability becomes negligible.
So how many imputations are enough? An old rule of thumb is that 3 to 10 imputations typically suffice (Rubin 1987). But that advice only addresses the precision and replicability of your point estimates. If you also want your standard error (SE) estimates to be replicable, you may need more imputations (Bodner 2008; Graham, Olchowski, & Gilreath 2007; White, Royston, & Wood 2011).
I’ve just published an article (von Hippel 2018) that calculates how many imputations you need for replicable SE estimates. I implemented the calculation in the new Stata command how_many_imputations, which you can install from the Stata command line by typing ssc install how_many_imputations. I also implemented it in the new SAS macro %mi_combine, which you can download from this Google Drive folder; the folder also contains code and data illustrating the macro’s use.
Here’s a brief summary of what the new software does. For replicable SE estimates, the number of imputations you need is about
$$M = 1 + \frac{1}{2}\left(\frac{\mathrm{FMI}}{\mathrm{CV}(\mathrm{se})}\right)^{2}$$
(von Hippel, 2018), where FMI is the fraction of missing information, and CV(se) is a coefficient of variation, which you can think of as roughly the percentage by which you’d be willing to see the SE estimate change if the data were imputed again. If you have FMI=30% missing information, for example, and you would accept the SE estimate changing by 10% if you imputed the data again, then you’ll need M=5 or 6 imputations. But if you’d only accept the SE changing by 5%, then you’ll need M=19 imputations.
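To make the arithmetic concrete, here is a minimal Python sketch of the quadratic rule. The function names raw_imputations and imputations_needed are hypothetical and are not part of the Stata command or the SAS macro. Rounding up to a whole number is my assumption about how to turn the formula into a count.

```python
import math


def raw_imputations(fmi: float, cv_se: float) -> float:
    """Quadratic rule M = 1 + 0.5 * (FMI / CV(se))**2, before rounding."""
    if not 0.0 <= fmi <= 1.0:
        raise ValueError("fmi must be a fraction between 0 and 1")
    if cv_se <= 0.0:
        raise ValueError("cv_se must be positive")
    return 1.0 + 0.5 * (fmi / cv_se) ** 2


def imputations_needed(fmi: float, cv_se: float) -> int:
    """Round the quadratic rule up to a whole number of imputations.

    The inner round() guards against floating-point noise pushing a value
    such as 18.999999999999996 or 19.000000000000004 to the wrong integer.
    """
    return math.ceil(round(raw_imputations(fmi, cv_se), 9))


if __name__ == "__main__":
    # The two cases from the text: FMI = 30%, CV(se) = 10% and 5%.
    print(raw_imputations(0.30, 0.10))     # about 5.5, i.e. "5 or 6"
    print(imputations_needed(0.30, 0.05))  # 19
```

Because CV(se) enters through a square, halving the SE variation you will tolerate roughly quadruples the number of imputations beyond the first.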
The only problem with this formula is that FMI is not known in advance. FMI is not the fraction of values that are missing; instead, FMI is the fraction of information that is missing about the parameters. And FMI has to be estimated, typically by multiple imputation.
For that reason, I recommend a two-step recipe (von Hippel, 2018):
- First, carry out a pilot analysis. Impute the data using a convenient number of imputations. (20 imputations is a reasonable default, if it doesn’t take too long.) Estimate the FMI by analyzing the imputed data.
- Next, plug the estimated FMI into the formula above to figure out how many imputations you need to achieve a certain value of CV(se). If you need more imputations than you had in the pilot, then add those imputations and analyze the data again.
There is a small wrinkle: when you plug an estimate of FMI into the formula, you shouldn’t use a point estimate. Instead, you should use the upper bound of a 95% confidence interval for FMI. That way you’ll only have a 2.5% chance of not having enough imputations in your final analysis.
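The two-step recipe, including the upper-bound adjustment, can be sketched in Python as follows. This post does not say how the confidence interval for FMI is computed, so the sketch assumes that logit(FMI) is approximately normal with standard error sqrt(2/M) in a pilot of M imputations. Treat that interval as an assumption and check von Hippel (2018) for the exact form. The function names are hypothetical, and the pilot FMI estimate would come from your own imputation software.

```python
import math
from statistics import NormalDist


def fmi_upper_bound(fmi_hat: float, m_pilot: int, level: float = 0.95) -> float:
    """Upper limit of a two-sided confidence interval for FMI.

    ASSUMPTION (not stated in the post): logit(FMI estimate) is roughly
    normal with standard error sqrt(2 / m_pilot).
    """
    if not 0.0 < fmi_hat < 1.0:
        raise ValueError("fmi_hat must be strictly between 0 and 1")
    if m_pilot < 2:
        raise ValueError("the pilot needs at least 2 imputations")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    logit_upper = math.log(fmi_hat / (1.0 - fmi_hat)) + z * math.sqrt(2.0 / m_pilot)
    return 1.0 / (1.0 + math.exp(-logit_upper))  # back-transform to a fraction


def two_step_imputations(fmi_hat: float, m_pilot: int, cv_se: float = 0.05):
    """Return (total imputations needed, extra imputations beyond the pilot)."""
    fmi_up = fmi_upper_bound(fmi_hat, m_pilot)
    # Plug the upper bound, not the point estimate, into the quadratic rule.
    m_needed = math.ceil(round(1.0 + 0.5 * (fmi_up / cv_se) ** 2, 9))
    return m_needed, max(0, m_needed - m_pilot)


if __name__ == "__main__":
    # Hypothetical pilot: 20 imputations gave an estimated FMI of 0.30.
    total, extra = two_step_imputations(fmi_hat=0.30, m_pilot=20, cv_se=0.05)
    print(f"total imputations needed: {total}, extra beyond pilot: {extra}")
```

Using the upper bound makes the answer noticeably more conservative than the point estimate alone, which is the purpose of the adjustment: it protects against a pilot that happened to underestimate FMI.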
This two-step recipe is implemented in my new Stata and SAS software. There’s more explanation in von Hippel (2018).
von Hippel, P. T. (2018). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, published online, behind a paywall. A free pre-publication version is available as an arXiv e-print.
Bodner, T. E. (2008). What Improves with Increased Missing Data Imputations? Structural Equation Modeling, 15(4), 651–675. https://doi.org/10.1080/10705510802339072
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. Prevention Science, 8(3), 206–213. https://doi.org/10.1007/s11121-007-0070-9
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399. https://doi.org/10.1002/sim.4067
4 thoughts on “How many imputations do you need?”
Dear Mr Von Hippel, An R implementation of your formula was written by Josh Errickson: https://gist.github.com/josherrickson/db8f828556a9dce2e3013c8be0ca5e05. This might help others who use R. Kind regards, TB
Unfortunately, it is not available for the latest version of R
Thank you for your macro! I just have a quick question: how does this macro work if I have categorical data? Specifically, I have been using a GEE model to look at a binary outcome with categorical confounding factors. How can I use the macro to combine estimates if I cannot use the “mean” function, since these are categorical variables? Should I just use PROC MIANALYZE?
Hi, Katy. The formula, Stata command, and SAS macro are quite general and should work for any situation where you have point estimates and standard errors. They don’t assume that you’re estimating a mean and won’t be bothered by the fact that you’re estimating a GEE model. DM me if you have trouble.