The EM algorithm can be used to calculate a maximum likelihood estimate for
all missing values in a dataset. A simple EM algorithm for two binary
variables was used for the Data Network tutorial to provide a complete data
cross tabulation of reported helmet use by police and EMS.
Using the five imputed complete Crash to Inpatient linkages prepared
earlier, compare police reported helmet use to EMS reported helmet use for all
inpatients.
Here is an Access query that counts all agreements and disagreements:
SELECT [Crash].[Vehicle1] AS Vehicle, [HospitalDb].[Death], [Crash].[Safety1] AS CrashHelmet, [Crash].[SafetyEMS] AS EMSHelmet, Count([Crash].[CODESNbr]) AS Freq, CInt(Avg([HospitalDb].[Charges])) AS AvgChgs
FROM (Crash INNER JOIN ImputedMatchPairs1to5 ON [Crash].[CODESNbr]=[ImputedMatchPairs1to5].[CODESNbr]) INNER JOIN HospitalDb ON [ImputedMatchPairs1to5].[CODESNbr_B]=[HospitalDb].[SID]
GROUP BY [Crash].[Vehicle1], [HospitalDb].[Death], [Crash].[Safety1], [Crash].[SafetyEMS]
HAVING (((Crash.Vehicle1)='MC') AND ((HospitalDb.Death)='N'));

The following SAS data step implements Schafer's EM algorithm as listed in
his Analysis of Incomplete Multivariate Data (1997), pp 42-45. No cases
were observed in the linked crash records for Crash Helmet = "N" and
EMS Helmet = "Y" but this is probably not a structural zero because
such cases were observed in the complete crash dataset. Consequently, we
added 0.5 to all counts.
*--------------
Helmet to Helmet Comparison ---------------*
|
Comparison of Crash vs. EMS reported helmet
use..... |
|
Implements Schafer's EM algorithm (1997, pp
42-45) |
*------------------------------------------------------*;
Data
EM Sim;
*Observed
frequencies;
input
x00 x01 x02 x10 x11 x12 x20 x21 x22;
array
thetaOld[2,2];
array
thetaNew[2,2];
array
xA[2,2];
array
xB[2];
array
xC[2];
* Both Y1 and Y2
observed;
xA[1,1]
= x11;
xA[1,2]
= x12;
xA[2,1]
= x21;
xA[2,2]
= x22;
* Y2 missing;
xB[1]
= x10;
xB[2]
= x20;
* Y1 missing;
xC[1]
= x01;
xC[2]
= x02;
* Cases with both
missing are not included because they do not affect the MLE;
n = x01 + x02 + x10 + x11 + x12 +
x20 + x21 + x22;
* Prior
probabilities uniform = 0.25;
t = 0;
Do
i=1 to
2;
Do
j=1 to
2;
thetaOld[i,j] = 0.25;
end;
end;
ProbNN = thetaOld[1,1];
ProbNY = thetaOld[1,2];
ProbYN = thetaOld[2,1];
ProbYY = thetaOld[2,2];
* Post t = 0 values;
keep
t ProbNN ProbNY ProbYN ProbYY;
output
EM;
* Calculate 50 EM
iterations;
Do
t=1 to
50;
Do
i=1 to
2;
Do
j=1 to
2;
thetaI = thetaOld[i,1]
+ thetaOld[i,2];
thetaJ = thetaOld[1,j]
+ thetaOld[2,j];
B = thetaOld[i,j] / thetaI;
C = thetaOld[i,j] / thetaJ;
thetaNew[i,j] = (1/n)
* (xA[i,j] + xB[i] * B + xC[j] * C);
end;
end;
Do
i=1 to
2;
Do
j=1 to
2;
thetaOld[i,j] = thetaNew[i,j];
end;
end;
ProbNN = thetaOld[1,1];
ProbNY = thetaOld[1,2];
ProbYN = thetaOld[2,1];
ProbYY = thetaOld[2,2];
keep
t ProbNN ProbNY ProbYN ProbYY;
output
EM;
end;
* Create dummy
observations for cross tabulation;
Do
rep=1 to
round(1000*ProbNN);
Crash = 'N';
EMS = 'N';
keep
Crash EMS;
output
Sim;
end;
Do
rep=1 to
round(1000*ProbNY);
Crash = 'N';
EMS = 'Y';
keep
Crash EMS;
output
Sim;
end;
Do
rep=1 to
round(1000*ProbYN);
Crash = 'Y';
EMS = 'N';
keep
Crash EMS;
output
Sim;
end;
Do
rep=1 to
round(1000*ProbYY);
Crash = 'Y';
EMS = 'Y';
keep
Crash EMS;
output
Sim;
end;
datalines;
163.5 14.5 36.5 89.5 8.5 0.5 27.5
4.5 19.5
;
proc
print data=EM;
title
'EM of Crash vs EMS Helmet Use for
Inpatients';
run;
proc
freq data=sim;
tables
Crash*EMS;
run;
The SAS print procedure lists EM iterations so that convergence of the EM
algorithm can be assessed.

The SAS freq procedure cross tabulates 1000 simulated observations of the
MLE complete data relationship between police and EMS reported helmet use.
