Multiple imputation of missing links and missing values depends on simulating random draws from
complex probability distributions. Because the algorithms used in these calculations are not
deterministic, results obtained through multiple imputation will vary depending on your choice of
random number generator, random number seed values, initial parameter values, number of
iterations, and even the order of the records in your original dataset. That is, there is no
single right answer. Because of such technical variations, you will probably not be able to
reproduce exactly the results presented at the Data Network Tutorial. This note describes several
imputations that you should be able to reproduce if you perform the analysis steps in the order
described here. We also illustrate how you can assess the amount of uncertainty caused by these
variations as part of your sensitivity analysis.

The examples presented at the Data Network Tutorial used five missing link imputations. There is
nothing magic about doing five imputations, and you should do a sensitivity analysis to test
whether five is sufficient to capture most of the uncertainty in the imputed linked pairs. Do
this by imputing a second and then a third set of five imputations during the same Access
session. Analyze each set of imputed linked pairs using the same SAS MI, REG, and MIANALYZE
procedures.

Your specific results will depend on the order of the records when you concatenate the separate
linkage imputations in a single query or table. For reproducibility, the following imputation
results were obtained by concatenating the record pairs created by the Linkage Imputation Wizard
with a UNION query and ordering all motorcycle cases by imputation and Crash CODESNbr. This gives
a specific record order that you can duplicate using your copy of the test data.
Union query, saved as qryUnion1to5:

SELECT * FROM LinkedPairs1
UNION SELECT * FROM LinkedPairs2
UNION SELECT * FROM LinkedPairs3
UNION SELECT * FROM LinkedPairs4
UNION SELECT * FROM LinkedPairs5;

Select query, saved as qryMCInpatients5:

SELECT qryUnion1to5.Imputation,
       IIf(IsNull([Crash].[Hour]),-1,[Crash].[Hour]) AS [Hour],
       IIf(IsNull([Crash].[Age]),-1,[Crash].[Age]) AS Age,
       IIf(IsNull([Crash].[Safety]),-1,[Crash].[Safety]) AS Helmet,
       Hospital.Charges, Crash.CODESNbr AS SortKey, Crash.Vehicle AS Vehicle
FROM (Crash INNER JOIN qryUnion1to5 ON Crash.CODESNbr = qryUnion1to5.CODESNbr)
     INNER JOIN Hospital ON qryUnion1to5.CODESNbr_B = Hospital.CODESNbr
WHERE (((Crash.Vehicle)='MC') AND ((Hospital.Death)='N'))
ORDER BY qryUnion1to5.Imputation, Crash.CODESNbr;
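
The effect of the ORDER BY clause on record order can be tried outside Access. The following is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for Access; the two tables and their rows are made up for illustration and are not the test data. It shows that a UNION of the per-imputation tables followed by ORDER BY on imputation and CODESNbr returns the records in one fixed order, regardless of the order in which they were inserted.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Access file (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical rows: (Imputation, CODESNbr, CODESNbr_B), inserted out of order.
toy_tables = {
    1: [(1, 20, 7), (1, 10, 3)],
    2: [(2, 10, 3), (2, 20, 9)],
}
for n, rows in toy_tables.items():
    cur.execute(
        f"CREATE TABLE LinkedPairs{n} "
        "(Imputation INTEGER, CODESNbr INTEGER, CODESNbr_B INTEGER)"
    )
    cur.executemany(f"INSERT INTO LinkedPairs{n} VALUES (?, ?, ?)", rows)

# Concatenate the tables, then impose an explicit record order.
ordered = cur.execute(
    "SELECT * FROM LinkedPairs1 UNION SELECT * FROM LinkedPairs2 "
    "ORDER BY Imputation, CODESNbr"
).fetchall()
print(ordered)
conn.close()
```

Without the ORDER BY clause, the order of the concatenated records is left to the database engine, which is why the note fixes it explicitly.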
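FILENAME test DDE 'MSAccess|HelmetUse.mdb;QUERY qryMCInpatients5!Data';

data mi_input;
  infile test;
  input Imputed Hour Age Helmet Charges SortKey Vehicle $;
  /* The Access query codes missing values as -1; convert them to SAS missing values. */
  if Hour = -1 then Hour = .;
  if Age = -1 then Age = .;
  if Helmet = -1 then Helmet = .;
  if Charges = -1 then Charges = .;
  /* Shift hours 0-3 to 24-27 so that the hours after midnight follow hour 23. */
  hour4 = hour;
  if hour4 < 4 then hour4 = hour4 + 24;
run;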

proc mi data=mi_input seed=37851 out=outmi nimpute=5 round=1;
  title 'MI Example - 1st 5 X 5';
  transform log(charges);
  mcmc nbiter=5000 niter=2000;
  var charges age hour4 Helmet;
  *by Imputed;
run;
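
data MI25;
  set outmi;
  /* Number the combined imputations 1-25: 5 value imputations for each link imputation. */
  _Imputation_ = _Imputation_ + 5*(Imputed - 1);
  /* Restrict imputed helmet values to the range 0-1. */
  Helmet = max(Helmet,0);
  Helmet = min(Helmet,1);
  /* Undo the 24-hour shift applied when hour4 was created, then restrict to 0-23. */
  hour = hour4;
  if hour > 23 then hour = hour - 24;
  hour = max(hour,0);
  hour = min(hour,23);
run;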

proc sort data=MI25;
  by _Imputation_;
run;

proc reg data=MI25 outest=outreg covout noprint;
  by _Imputation_;
  model charges = Helmet;
run;

proc print data=outreg(where=(_Type_ = 'PARMS'));
  var _Imputation_ _DEPVAR_ Intercept Helmet;
run;

proc mianalyze data=outreg;
  var intercept Helmet;
run;
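
PROC MIANALYZE combines the per-imputation regression results using Rubin's combining rules: the pooled estimate is the mean of the individual estimates, and the pooled variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The Python sketch below illustrates only that part of the calculation. The function name pool_rubin and the input numbers are hypothetical and are not results from the test data.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool m point estimates and their standard errors with Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Average of the per-imputation variances (squared standard errors).
    within = mean([se ** 2 for se in std_errors])
    # Sample variance of the estimates across imputations (divisor m - 1).
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Made-up inputs for three imputations.
estimate, std_error = pool_rubin(
    [-2000.0, -1000.0, -3000.0], [4000.0, 4000.0, 4000.0]
)
print(estimate, round(std_error, 1))
```

The between-imputation term is what makes the pooled standard error larger than any single-imputation standard error when the estimates disagree.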
The first 5 missing link imputations, each with 5 missing value
imputations, estimate the effect of helmet use as -3586 with standard error of
4556.
The second 5 missing link imputations, each with 5 missing value
imputations, estimate the effect of helmet use as -1283 with standard error of
4262.
The third 5 missing link imputations, each with 5 missing value
imputations, estimate the effect of helmet use as -2472 with standard error of
4469.
The standard error values are approximately equal, but there are large differences between
estimates (about 0.52 standard error). These differences reflect random variation when only 5
imputations are used. They suggest that we should perform and average more imputations to be
confident that our estimate is not randomly very high or very low. Here are the results with 10
imputations.
Union query, saved as qryUnion1to10:

SELECT * FROM LinkedPairs1
UNION SELECT * FROM LinkedPairs2
UNION SELECT * FROM LinkedPairs3
UNION SELECT * FROM LinkedPairs4
UNION SELECT * FROM LinkedPairs5
UNION SELECT * FROM LinkedPairs6
UNION SELECT * FROM LinkedPairs7
UNION SELECT * FROM LinkedPairs8
UNION SELECT * FROM LinkedPairs9
UNION SELECT * FROM LinkedPairs10;

Select query, saved as qryMCInpatients10:

SELECT qryUnion1to10.Imputation,
       IIf(IsNull([Crash].[Hour]),-1,[Crash].[Hour]) AS [Hour],
       IIf(IsNull([Crash].[Age]),-1,[Crash].[Age]) AS Age,
       IIf(IsNull([Crash].[Safety]),-1,[Crash].[Safety]) AS Helmet,
       Hospital.Charges, Crash.CODESNbr AS SortKey, Crash.Vehicle AS Vehicle
FROM (Crash INNER JOIN qryUnion1to10 ON Crash.CODESNbr = qryUnion1to10.CODESNbr)
     INNER JOIN Hospital ON qryUnion1to10.CODESNbr_B = Hospital.CODESNbr
WHERE (((Crash.Vehicle)='MC') AND ((Hospital.Death)='N'))
ORDER BY qryUnion1to10.Imputation, Crash.CODESNbr;
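FILENAME test DDE 'MSAccess|HelmetUse.mdb;QUERY qryMCInpatients10!Data';

data mi_input;
  infile test;
  input Imputed Hour Age Helmet Charges SortKey Vehicle $;
  /* The Access query codes missing values as -1; convert them to SAS missing values. */
  if Hour = -1 then Hour = .;
  if Age = -1 then Age = .;
  if Helmet = -1 then Helmet = .;
  if Charges = -1 then Charges = .;
  /* Shift hours 0-3 to 24-27 so that the hours after midnight follow hour 23. */
  hour4 = hour;
  if hour4 < 4 then hour4 = hour4 + 24;
run;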

proc mi data=mi_input seed=37851 out=outmi nimpute=10 round=1;
  title 'MI Example - 1st 10 X 10';
  transform log(charges);
  mcmc nbiter=5000 niter=2000;
  var charges age hour4 Helmet;
  *by Imputed;
run;
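
data MI25;
  set outmi;
  /* Number the combined imputations 1-100: 10 value imputations for each link imputation. */
  _Imputation_ = _Imputation_ + 10*(Imputed - 1);
  /* Restrict imputed helmet values to the range 0-1. */
  Helmet = max(Helmet,0);
  Helmet = min(Helmet,1);
  /* Undo the 24-hour shift applied when hour4 was created, then restrict to 0-23. */
  hour = hour4;
  if hour > 23 then hour = hour - 24;
  hour = max(hour,0);
  hour = min(hour,23);
run;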

proc sort data=MI25;
  by _Imputation_;
run;

proc reg data=MI25 outest=outreg covout noprint;
  by _Imputation_;
  model charges = Helmet;
run;

proc print data=outreg(where=(_Type_ = 'PARMS'));
  var _Imputation_ _DEPVAR_ Intercept Helmet;
run;

proc mianalyze data=outreg;
  var intercept Helmet;
run;
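
The data steps in the listings above shift the early-morning hours before imputation and bring the imputed values back into their valid ranges afterwards. The Python sketch below mirrors that logic so it can be checked in isolation. It assumes that the intent of the back-transformation is to undo the 24-hour shift (subtracting 24 from values above 23); the function names are hypothetical.

```python
def shift_hour(hour):
    """Move hours 0-3 to 24-27 so the hours after midnight follow hour 23."""
    return hour + 24 if hour < 4 else hour


def clamp(value, low, high):
    """Restrict a value to the closed interval [low, high]."""
    return min(max(value, low), high)


def unshift_hour(hour4):
    """Undo the shift (assumed intent), then restrict the result to 0-23."""
    hour = hour4 - 24 if hour4 > 23 else hour4
    return clamp(hour, 0, 23)


# Every observed hour survives the round trip unchanged.
round_trip_ok = all(unshift_hour(shift_hour(h)) == h for h in range(24))
print(round_trip_ok)

# Imputed values outside the valid range are pulled back to the nearest bound.
print(clamp(-1, 0, 1), clamp(2, 0, 1), unshift_hour(29))
```

Shifting the hours keeps late evening and early morning next to each other on the scale used by the imputation model, instead of at opposite ends of a 0-23 range.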
The first 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -1993 with standard error of
4409.
The second 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -2027 with standard error of
4357.
The third 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -2545 with standard error of
4119.
Again, the standard error values are approximately equal. Differences between estimates due to
random effects when using 10 imputations (about 0.13 standard error) are smaller than when using
5 imputations (about 0.52 standard error), suggesting that we can be more confident that any
estimate is not an extreme value.
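
The note does not spell out how the 0.52 and 0.13 figures are computed. One reading that reproduces both, offered here as an assumption rather than as the authors' stated method, is the range of the three estimates divided by the mean of the three standard errors. A minimal Python sketch (the function name spread_in_se is hypothetical):

```python
from statistics import mean


def spread_in_se(estimates, std_errors):
    """Range of the estimates expressed in units of the mean standard error."""
    return (max(estimates) - min(estimates)) / mean(std_errors)


# Helmet effects and standard errors reported above for the three sets.
five = spread_in_se([-3586, -1283, -2472], [4556, 4262, 4469])
ten = spread_in_se([-1993, -2027, -2545], [4409, 4357, 4119])
print(f"{five:.2f} {ten:.2f}")  # prints 0.52 0.13
```

The same function can be applied to any further sets of imputations you run as part of your own sensitivity analysis.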
The SAS random number generator is reinitialized each time that you run the
MI procedure. You can set the seed value used for the
initialization. Different seed values produce different random numbers
and different imputed values. The results shown above were produced with a
seed of 37851. Here are the results with a seed value of 55417.
The first 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -1433 with standard error of
4436. With a random number seed of 37851,
the estimate is -1993 with standard error 4409. The effect is different
by about 0.13 standard error.
The second 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -2039 with standard error of
4525. With a random number seed of 37851,
the estimate is -2027 with standard error 4357. The effect is different
by less than 0.01 standard error.
The third 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -2683 with standard error of
4292. With a random number seed of 37851,
the estimate is -2545 with standard error 4119. The effect is different
by about 0.03 standard error.
Again, the standard error values are approximately equal. Estimated
helmet effects are similar to those obtained with the original random number
seed. Differences between estimates due to random effects (about 0.28 standard error) are somewhat
greater than with a seed of 37851 (about 0.13 standard error).
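
The role of the seed described above can be illustrated outside SAS. The Python sketch below uses the standard-library random module, which is a different generator from the one used by the MI procedure, so the numbers it draws have no relation to the imputed values. It only shows the general behavior: the same seed reproduces the same sequence of draws, and a different seed produces a different sequence. The function name draws is hypothetical.

```python
import random


def draws(seed, n=3):
    """Return n standard normal draws from a generator initialized with seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same_seed_matches = draws(37851) == draws(37851)
different_seed_differs = draws(37851) != draws(55417)
print(same_seed_matches, different_seed_differs)
```

This is why the seed is recorded alongside each set of results in this note.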
The SAS MI procedure performs a specified number of iterations of the MCMC algorithm between
imputations to obtain approximately independent random draws from the target posterior
distributions. Changing the number of iterations between draws changes the imputation results.
The above results were produced with 2000 iterations between imputations. Here are the results
with 2300 iterations between imputations.
The first 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -2022 with standard error of
4474. With 2000 iterations,
the estimate is -1993 with standard error 4409. The effect is different
by about 0.01 standard error.
The second 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -1675 with standard error of
4415. With 2000 iterations,
the estimate is -2027 with standard error 4357. The effect is different
by about 0.08 standard error.
The third 10 missing link imputations, each with 10 missing value
imputations, estimate the effect of helmet use as -2633 with standard error of
4277. With 2000 iterations,
the estimate is -2545 with standard error 4119. The effect is different
by about 0.02 standard error.
Again, the standard error values are approximately equal. Estimated
helmet effects are similar to those obtained with the original number of
iterations. Differences between estimates due to random effects are somewhat
greater (about 0.22 standard error) than with 2000 iterations (about 0.13 standard error).
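
The set-by-set comparisons above can be checked with a few lines of arithmetic. The Python sketch below assumes that each difference is the absolute change in the helmet estimate divided by the standard error obtained with 2000 iterations; the note does not say which standard error is used as the divisor, and the function name shift_in_se is hypothetical.

```python
# (estimate, standard error) for the three sets of 10 x 10 imputations.
baseline = [(-1993, 4409), (-2027, 4357), (-2545, 4119)]  # 2000 iterations
changed = [(-2022, 4474), (-1675, 4415), (-2633, 4277)]   # 2300 iterations


def shift_in_se(base, new):
    """Absolute change in the estimate in units of the baseline standard error."""
    base_estimate, base_se = base
    new_estimate, _ = new
    return abs(new_estimate - base_estimate) / base_se


shifts = [shift_in_se(b, c) for b, c in zip(baseline, changed)]
print([round(s, 2) for s in shifts])  # prints [0.01, 0.08, 0.02]
```
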
SAS is a registered trademark of SAS Institute, Inc.