Best Practices -- Use all of your prior knowledge

Use all of your prior knowledge when you set up data specifications and match specifications so that calculated probabilities are as accurate as possible. The example discussed here linked simulated Crash and Hospital data, but the practice applies to real data as well. Three pieces of prior knowledge were left out of the initial linkage model. First, Total Matches was set to 1,200 to agree with earlier linkage results, even though a select query found 1,830 true links for the new data sets. Second, error probabilities were left at the default of 0.01 when preparing data sources Crash and Hospital, even though the data fields were simulated with error probabilities of 0.02. Third, it was expected that Hospital County would not always equal Crash County and that Hospital Hour would not always equal Crash Hour, but the probability of correct but different was set to 0.00 for both fields. The prior estimates of model parameters were inaccurate because they did not incorporate all prior knowledge. Consequently, linkage counts by decile showed poor goodness of fit:

 

CrashHospital__Fit10
Chi Square p Value = 0.03

Decile  PairsInDecile  ActualTrue  ExpectedTrue
     1            193       17.33          7.64
     2            194      128           100.67
     3            194      188.67        187.04
     4            194      194           193.49
     5            194      194           193.92
     6            193      193           192.98
     7            194      194           193.99
     8            194      194           194
     9            194      194           194
    10            194      194           194
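As a quick check, the ActualTrue column can be totaled and compared with the Total Matches value specified for the model. The sketch below uses Python, which is not part of the original workflow; the variable names are illustrative and the values are copied from the table above.

```python
# ActualTrue by decile, copied from the first CrashHospital__Fit10 table.
actual_true = [17.33, 128, 188.67, 194, 194, 193, 194, 194, 194, 194]

total_matches_specified = 1200  # Total Matches used in the initial model
true_links_by_query = 1830      # true links found by the select query

observed_true = sum(actual_true)

print(f"Observed Actual True: {observed_true:.0f}")
print(f"Ratio to Total Matches specified: {observed_true / total_matches_specified:.2f}")
print(f"Share of true links by query: {observed_true / true_links_by_query:.2f}")
```

A total well above the specified Total Matches is one sign that the prior estimate of Total Matches was set too low.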

 

The Bayesian Model Check Report provided clues about errors in the linkage model. Observed Actual True equaled about 1,691, much greater than the 1,200 specified. Observed combined error probabilities for County and Hour were 0.156 and 0.133, respectively, much greater than the 0.02 specified.

The linkage model was revised to reflect more prior knowledge. Total Matches was set to 1,830. Error probabilities were set to 0.02 for Crash fields and Hospital fields. Probability of correct but different was set to 0.10 for the County and Hour fields (based on prior anecdotal evidence, not derived from linkage results). After the revisions, observed Actual True equaled 1,738 and the linkage model had much better goodness of fit:

 

CrashHospital__Fit10
Chi Square p Value = 0.80

Decile  PairsInDecile  ActualTrue  ExpectedTrue
     1            276        1             1.80
     2            276        7.33          4.27
     3            276       18            12.13
     4            276       77.33         84.45
     5            276      253.33        259.25
     6            276      276           275.71
     7            276      276           275.98
     8            276      276           276
     9            276      276           276
    10            277      277           277
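The report does not say how the Chi Square p Value is calculated. One plausible reading, offered here only as an assumption, is a Pearson statistic summed over the ten deciles and referred to a chi-square distribution with 10 degrees of freedom. The Python sketch below applies that reading to the ActualTrue and ExpectedTrue columns of both tables; the function names and the choice of 10 degrees of freedom are illustrative and are not taken from the software.

```python
import math

def pearson_chi_square(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over all deciles."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series

# Columns copied from the two CrashHospital__Fit10 tables.
initial_actual = [17.33, 128, 188.67, 194, 194, 193, 194, 194, 194, 194]
initial_expected = [7.64, 100.67, 187.04, 193.49, 193.92, 192.98, 193.99, 194, 194, 194]
revised_actual = [1, 7.33, 18, 77.33, 253.33, 276, 276, 276, 276, 277]
revised_expected = [1.80, 4.27, 12.13, 84.45, 259.25, 275.71, 275.98, 276, 276, 277]

DF = 10  # assumption: one degree of freedom per decile

stat_initial = pearson_chi_square(initial_actual, initial_expected)
stat_revised = pearson_chi_square(revised_actual, revised_expected)
p_initial = chi_square_sf_even_df(stat_initial, DF)
p_revised = chi_square_sf_even_df(stat_revised, DF)

print(f"Initial model: statistic = {stat_initial:.2f}, p = {p_initial:.2f}")
print(f"Revised model: statistic = {stat_revised:.2f}, p = {p_revised:.2f}")
```

Under these assumptions the p values come out close to the reported 0.03 and 0.80. Nearly all of the statistic for the initial model comes from deciles 1 and 2, where ActualTrue is well above ExpectedTrue.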

 

One strategy to correct the shortfall in Actual True (1,738 is 95% of 1,830) would be to add a match pass using County and Home Zip as join fields. County and Home Zip are equal on 55 of the 92 missing true links.
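The arithmetic behind this suggestion can be sketched in a few lines of Python. The variable names are illustrative, and the last figure is only an upper bound: it assumes, hypothetically, that the additional match pass would recover every one of the 55 links.

```python
true_links = 1830   # true links found by the select query
actual_true = 1738  # observed Actual True after the revisions
recoverable = 55    # missing true links with equal County and Home Zip

missing = true_links - actual_true
coverage_now = actual_true / true_links
share_recoverable = recoverable / missing
# Upper bound, assuming the extra pass finds all 55 of those links.
coverage_upper_bound = (actual_true + recoverable) / true_links

print(f"Missing true links: {missing}")
print(f"Current coverage: {coverage_now:.0%}")
print(f"Share of missing links the extra pass could reach: {share_recoverable:.0%}")
print(f"Coverage if all were recovered: {coverage_upper_bound:.0%}")
```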