CODES 2000 User Forum -- Data Network Note #8

Variations in Multiple Imputation Results 

Applies to: SAS 8.1 & 8.2.
Last updated: Friday November 23, 2001.

SUMMARY

Multiple imputation of missing links and missing values depends on simulating random draws from complex probability distributions.  Since the algorithms used in these calculations are not deterministic, results obtained through multiple imputation will vary depending on your choice of random number generator, random number seed values, initial parameter values, number of iterations repeated, and even on the order of the records in your original dataset.  That is, there is no single right answer.  Because of such technical variations, you will not be able to reproduce the results presented at the Data Network tutorial.  This note describes several imputations that you should be able to reproduce if you do the analysis steps in the order described here.  We also illustrate how you can assess the amount of uncertainty caused by these variations as part of your sensitivity analysis.

PROCEDURES

  1. Effect of Random Numbers for Imputing Missing Links.

The CODES 2000 Linkage Imputation Wizard calculates random numbers when imputing missing links.  The random number function is reinitialized each time you start Access.  You will get the same imputed links each time you run the Wizard for the first time after starting Access.  You will get different results if you run the Wizard a second time during the same Access session.  You cannot set the random number seed used for initialization.  Running the Wizard on two different computer systems but starting with the same Match Pairs In Sets table will produce the same random numbers and the same imputed links.

  2. Effect of the Number of Missing Link Imputations.

The examples presented at the Data Network Tutorial used five missing link imputations.  There is nothing magic about doing five imputations, and you should do a sensitivity analysis to test if five is sufficient to capture most of the uncertainty in the imputed linked pairs.  Do this by imputing a second and then a third set of five imputations during the same Access session.  Analyze each set of imputed linked pairs using the same SAS MI, REG, and MIANALYZE procedures.

Your specific results will depend on the order of the records when you concatenate the separate linkage imputations in a single query or table.  For reproducibility, the following imputation results were obtained by concatenating the record pairs created by the Linkage Imputation Wizard with a UNION query and ordering all motorcycle cases by imputation and Crash CODESNbr.  This gives a specific record order that you can duplicate using your copy of the test data.

SELECT * FROM LinkedPairs1
UNION SELECT * FROM LinkedPairs2
UNION SELECT * FROM LinkedPairs3
UNION SELECT * FROM LinkedPairs4
UNION SELECT * FROM LinkedPairs5;

SELECT qryUnion1to5.Imputation,
    IIf(IsNull([Crash].[Hour]),-1,[Crash].[Hour]) AS [Hour],
    IIf(IsNull([Crash].[Age]),-1,[Crash].[Age]) AS Age,
    IIf(IsNull([Crash].[Safety]),-1,[Crash].[Safety]) AS Helmet,
    Hospital.Charges,
    Crash.CODESNbr AS SortKey,
    Crash.Vehicle AS Vehicle
FROM (Crash INNER JOIN qryUnion1to5 ON Crash.CODESNbr = qryUnion1to5.CODESNbr) INNER JOIN Hospital ON qryUnion1to5.CODESNbr_B = Hospital.CODESNbr
WHERE (((Crash.Vehicle)='MC') AND ((Hospital.Death)='N'))
ORDER BY qryUnion1to5.Imputation, Crash.CODESNbr;

FILENAME test DDE 'MSAccess|HelmetUse.mdb;QUERY qryMCInpatients5!Data';

data mi_input;
    infile test;
    input Imputed Hour Age Helmet Charges SortKey Vehicle $;
    /* The Access query codes missing values as -1; restore SAS missing values. */
    if Hour = -1 then Hour = .;
    if Age = -1 then Age = .;
    if Helmet = -1 then Helmet = .;
    if Charges = -1 then Charges = .;
    /* Shift hours 0-3 to 24-27 so that hour4 runs from 4 to 27. */
    hour4 = hour;
    if hour4<4 then hour4 = hour4 + 24;
run;

proc mi
    data = mi_input
    seed = 37851
    out=outmi
    nimpute = 5
    round = 1;
    title 'MI Example - 1st 5 X 5';
    transform log(charges);
    mcmc
        nbiter = 5000
        niter = 2000
        ;
    var charges age hour4 Helmet;
    /* The BY statement is commented out, so all link imputations are processed together. */
    *by Imputed;
run;

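data MI25;
    set outmi;
    /* Give each link imputation / value imputation pair its own index (1-25). */
    _Imputation_ = _Imputation_ + 5*(Imputed - 1);
    /* Keep the imputed helmet indicator within 0-1. */
    Helmet = max(Helmet,0);
    Helmet = min(Helmet,1);
    /* Undo the 24-hour shift applied when hour4 was created. */
    hour = hour4;
    if hour > 23 then hour = hour - 24;
    hour = max(hour,0);
    hour = min(hour,23);
run;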

proc sort data=MI25;
    by _Imputation_;
run;

proc reg data=MI25 outest = outreg covout noprint;
    by _Imputation_;
    model charges = Helmet;
run;

proc print data=outreg(where =(_Type_ = 'PARMS'));
    var _Imputation_ _DEPVAR_ Intercept Helmet;
run;

proc mianalyze data=outreg;
    var intercept Helmet;
run;

The first 5 missing link imputations, each with 5 missing value imputations, estimate the effect of helmet use as -3586 with standard error of 4556.

The second 5 missing link imputations, each with 5 missing value imputations, estimate the effect of helmet use as -1283 with standard error of 4262.

The third 5 missing link imputations, each with 5 missing value imputations, estimate the effect of helmet use as -2472 with standard error of 4469.
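PROC MIANALYZE pools the per-imputation regression results with Rubin's combining rules. As a cross-check, the following sketch does the same pooling by hand for the Helmet coefficient, starting from the outreg data set created above. The data set names (est_q, var_u, q_and_u, rubin) and variable names (q, u, qbar, w, b, t, se) are illustrative and are not part of the original analysis.

```sas
/* Split OUTREG into point estimates (PARMS rows) and variances (COV rows). */
data est_q(keep=_Imputation_ q) var_u(keep=_Imputation_ u);
    set outreg;
    if _TYPE_ = 'PARMS' then do;
        q = Helmet;                 /* Helmet coefficient for this imputation */
        output est_q;
    end;
    else if _TYPE_ = 'COV' and upcase(_NAME_) = 'HELMET' then do;
        u = Helmet;                 /* variance of the Helmet coefficient */
        output var_u;
    end;
run;

/* One row per imputation with both the estimate and its variance. */
data q_and_u;
    merge est_q var_u;
    by _Imputation_;
run;

/* qbar = pooled estimate, w = within-imputation variance,
   b = between-imputation variance, m = number of imputations. */
proc means data=q_and_u noprint;
    var q u;
    output out=rubin mean(q)=qbar var(q)=b mean(u)=w n(q)=m;
run;

data rubin;
    set rubin;
    t  = w + (1 + 1/m) * b;         /* total variance */
    se = sqrt(t);                   /* pooled standard error */
run;

proc print data=rubin;
    var m qbar w b t se;
run;
```

The printed qbar and se should agree with the Helmet estimate and standard error reported by PROC MIANALYZE for the same outreg data set.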

The standard error values are approximately equal, but there are large differences between estimates (about 0.52 standard error).  These differences are due to random effects when using 5 imputations.  They suggest that we should perform and average more imputations to be confident that our estimate is not randomly very high or very low.  Here are the results with 10 imputations.

SELECT * FROM LinkedPairs1
UNION SELECT * FROM LinkedPairs2
UNION SELECT * FROM LinkedPairs3
UNION SELECT * FROM LinkedPairs4
UNION SELECT * FROM LinkedPairs5
UNION SELECT * FROM LinkedPairs6
UNION SELECT * FROM LinkedPairs7
UNION SELECT * FROM LinkedPairs8
UNION SELECT * FROM LinkedPairs9
UNION SELECT * FROM LinkedPairs10;

SELECT qryUnion1to10.Imputation,
    IIf(IsNull([Crash].[Hour]),-1,[Crash].[Hour]) AS [Hour],
    IIf(IsNull([Crash].[Age]),-1,[Crash].[Age]) AS Age,
    IIf(IsNull([Crash].[Safety]),-1,[Crash].[Safety]) AS Helmet,
    Hospital.Charges,
    Crash.CODESNbr AS SortKey,
    Crash.Vehicle AS Vehicle
FROM (Crash INNER JOIN qryUnion1to10 ON Crash.CODESNbr = qryUnion1to10.CODESNbr) INNER JOIN Hospital ON qryUnion1to10.CODESNbr_B = Hospital.CODESNbr
WHERE (((Crash.Vehicle)='MC') AND ((Hospital.Death)='N'))
ORDER BY qryUnion1to10.Imputation, Crash.CODESNbr;

FILENAME test DDE 'MSAccess|HelmetUse.mdb;QUERY qryMCInpatients10!Data';

data mi_input;
    infile test;
    input Imputed Hour Age Helmet Charges SortKey Vehicle $;
    /* The Access query codes missing values as -1; restore SAS missing values. */
    if Hour = -1 then Hour = .;
    if Age = -1 then Age = .;
    if Helmet = -1 then Helmet = .;
    if Charges = -1 then Charges = .;
    /* Shift hours 0-3 to 24-27 so that hour4 runs from 4 to 27. */
    hour4 = hour;
    if hour4<4 then hour4 = hour4 + 24;
run;

proc mi
    data = mi_input
    seed = 37851
    out=outmi
    nimpute = 10
    round = 1;
    title 'MI Example - 1st 10 X 10';
    transform log(charges);
    mcmc
        nbiter = 5000
        niter = 2000
        ;
    var charges age hour4 Helmet;
    /* The BY statement is commented out, so all link imputations are processed together. */
    *by Imputed;
run;

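data MI25;
    set outmi;
    /* Give each link imputation / value imputation pair its own index (1-100). */
    _Imputation_ = _Imputation_ + 10*(Imputed - 1);
    /* Keep the imputed helmet indicator within 0-1. */
    Helmet = max(Helmet,0);
    Helmet = min(Helmet,1);
    /* Undo the 24-hour shift applied when hour4 was created. */
    hour = hour4;
    if hour > 23 then hour = hour - 24;
    hour = max(hour,0);
    hour = min(hour,23);
run;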

proc sort data=MI25;
    by _Imputation_;
run;

proc reg data=MI25 outest = outreg covout noprint;
    by _Imputation_;
    model charges = Helmet;
run;

proc print data=outreg(where =(_Type_ = 'PARMS'));
    var _Imputation_ _DEPVAR_ Intercept Helmet;
run;

proc mianalyze data=outreg;
    var intercept Helmet;
run;

The first 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -1993 with standard error of 4409.

The second 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -2027 with standard error of 4357.

The third 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -2545 with standard error of 4119.

Again, the standard error values are approximately equal.  Differences between estimates due to random effects when using 10 imputations (about 0.13 standard error) are smaller than when using 5 imputations  (about 0.52 standard error), suggesting that we can be more confident that any estimate is not an extreme value.
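The spread figures quoted above (about 0.52 and about 0.13 standard error) appear to be the range of the three estimates divided by their average standard error. That reading is an assumption, but it reproduces both numbers. A minimal sketch of the arithmetic, using the estimates and standard errors reported in this section (the data set and variable names are illustrative):

```sas
/* Reported Helmet estimates and standard errors for each set of link imputations. */
data estimates;
    input n_imp linkset estimate se;
    datalines;
5 1 -3586 4556
5 2 -1283 4262
5 3 -2472 4469
10 1 -1993 4409
10 2 -2027 4357
10 3 -2545 4119
;
run;

/* Range of the estimates and mean standard error, separately for 5 and 10 imputations. */
proc means data=estimates noprint;
    by n_imp;
    var estimate se;
    output out=spread range(estimate)=est_range mean(se)=mean_se;
run;

data spread;
    set spread;
    spread_in_se = est_range / mean_se;   /* about 0.52 for 5, about 0.13 for 10 */
run;

proc print data=spread;
    var n_imp est_range mean_se spread_in_se;
    format spread_in_se 5.2;
run;
```

The same steps can be reused for any of the later comparisons by replacing the values after the datalines statement.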

  3. Effect of Random Numbers for Imputing Missing Values.

The SAS random number generator is reinitialized each time that you run the MI procedure.  You can set the seed value used for the initialization.  Different seed values produce different random numbers and different imputed values.  The results shown above were produced with a seed of 37851.  Here are the results with a seed value of 55417.

The first 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -1433 with standard error of 4436.  With a random number seed of 37851, the estimate is -1993 with standard error 4409.  The effect is different by about 0.13 standard error.

The second 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -2039 with standard error of 4525.  With a random number seed of 37851, the estimate is -2027 with standard error 4357.  The effect is different by less than 0.01 standard error.

The third 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -2683 with standard error of 4292.  With a random number seed of 37851, the estimate is -2545 with standard error 4119.  The effect is different by about 0.03 standard error.

Again, the standard error values are approximately equal.  Estimated helmet effects are similar to those obtained with the original random number seed.  Differences between estimates due to random effects (about 0.28 standard error) are somewhat greater than with a seed of 37851 (about 0.13 standard error).
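Only the SEED= option has to change to repeat the missing value imputation with a different seed. The sketch below wraps the PROC MI step used earlier in a small macro so that both seeds can be run from the same mi_input data set. The macro name mi_with_seed and the output data set names are hypothetical; the remaining options are copied from the 10 X 10 example.

```sas
%macro mi_with_seed(seed=, out=);
    /* Same imputation model as the 10 X 10 example; only the seed varies. */
    proc mi data=mi_input seed=&seed out=&out nimpute=10 round=1;
        transform log(charges);
        mcmc nbiter=5000 niter=2000;
        var charges age hour4 Helmet;
    run;
%mend mi_with_seed;

%mi_with_seed(seed=37851, out=outmi_s1);
%mi_with_seed(seed=55417, out=outmi_s2);
```

Each output data set would then go through the same DATA, SORT, REG, and MIANALYZE steps shown earlier before the estimates are compared.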

  4. Effect of the Number of Iterations for MCMC Independence.

The SAS MI procedure performs a specified number of iterations of the MCMC algorithm to obtain independent random draws from target posterior distributions.  Changing the number of iterations between draws changes the imputation results.  The above results were produced with 2000 iterations between imputations.  Here are the results with 2300 iterations between imputations.

The first 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -2022 with standard error of 4474.  With 2000 iterations, the estimate is -1993 with standard error 4409.  The effect is different by about 0.01 standard error.

The second 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -1675 with standard error of 4415.  With 2000 iterations, the estimate is -2027 with standard error 4357.  The effect is different by about 0.08 standard error.

The third 10 missing link imputations, each with 10 missing value imputations, estimate the effect of helmet use as -2633 with standard error of 4277.  With 2000 iterations, the estimate is -2545 with standard error 4119.  The effect is different by about 0.02 standard error.

Again, the standard error values are approximately equal.  Estimated helmet effects are similar to those obtained with the original number of iterations.  Differences between estimates due to random effects are somewhat greater (0.22 standard error) than with 2000 iterations (about 0.13 standard error).
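The per-set differences quoted above can be checked with a short DATA step. This sketch assumes that "different by about X standard error" means the absolute difference between the two estimates divided by the average of their two standard errors; the note does not state the formula, but this reading is consistent with the three figures given. The data set and variable names are illustrative.

```sas
/* Reported estimates and standard errors with 2000 and 2300 iterations. */
data niter_compare;
    input linkset est2000 se2000 est2300 se2300;
    /* Difference between the two estimates in units of standard error. */
    diff_in_se = abs(est2300 - est2000) / mean(se2000, se2300);
    datalines;
1 -1993 4409 -2022 4474
2 -2027 4357 -1675 4415
3 -2545 4119 -2633 4277
;
run;

proc print data=niter_compare;
    format diff_in_se 5.2;
run;
```

With these inputs diff_in_se prints as 0.01, 0.08, and 0.02 for the three sets.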

  5. Effect of the Number of Missing Value Imputations.

We can reduce differences between estimates due to random effects when imputing missing values by increasing the number of imputations.  Here are the results when we use 20 missing value imputations.

The first 10 missing link imputations, each with 20 missing value imputations using a random number seed of 37851, estimate the effect of helmet use as -1817 with standard error of 4369.  With a random number seed of 55417, the estimate is -1946 with standard error 4388.  The effect is different by about 0.03 standard error.

The second 10 missing link imputations, each with 20 missing value imputations using a random number seed of 37851, estimate the effect of helmet use as -2158 with standard error of 4326.  With a random number seed of 55417, the estimate is -2395 with standard error 4425.  The effect is different by about 0.06 standard error.

The third 10 missing link imputations, each with 20 missing value imputations using a random number seed of 37851, estimate the effect of helmet use as -2581 with standard error of 4132.  With a random number seed of 55417, the estimate is -2566 with standard error 4253.  The effect is different by less than 0.01 standard error.

Most of the remaining variation (about 0.15 standard error) probably is due to random effects when imputing missing links.
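A sketch of the changes needed to run 20 missing value imputations, assuming the same mi_input data set as in the earlier examples. The title text and the data set name MI200 are hypothetical. The point to note is that the offset used to build the combined imputation index presumably has to match the NIMPUTE= value, as it does in the 5 X 5 and 10 X 10 examples.

```sas
proc mi data=mi_input seed=37851 out=outmi nimpute=20 round=1;
    title 'MI Example - 10 X 20';
    transform log(charges);
    mcmc nbiter=5000 niter=2000;
    var charges age hour4 Helmet;
run;

data MI200;
    set outmi;
    /* PROC MI now numbers its imputations 1-20, so the offset must be 20
       (not 10) to keep the combined index unique across link imputations. */
    _Imputation_ = _Imputation_ + 20*(Imputed - 1);
    /* Keep the imputed helmet indicator within 0-1. */
    Helmet = min(max(Helmet, 0), 1);
run;
```

MI200 can then be sorted by _Imputation_ and passed to PROC REG and PROC MIANALYZE exactly as before.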

SAS is a registered trademark of SAS Institute, Inc.

 
© Copyright 2000 - 2008 Strategic Matching, Inc. All rights reserved. Microsoft, Windows, and Access are trademarks of Microsoft Corporation. Last modified: Monday January 28, 2008.