Best Practices -- Develop Linkage Models that Fit Your Data

It is important to know that, if you follow these guidelines, your linkage model will be accurate in two respects: all true links are included as candidate record pairs, and the calculated probability of being a true match is accurate for every candidate pair.

The Bayesian Model Check report applies to both real and simulated data. Use this report to confirm that posterior estimates for model parameters are not very different from your prior estimates. If you do find differences, check that they are not due to errors in your linkage model.

Linking simulated data can help establish that a linkage model fits the data well, because you can tell by inspection which links are true and which are false. The first goodness of fit test is to confirm that almost all true links (say, at least 95%) are included as candidate pairs. If so, you can proceed to the second goodness of fit test. If not, revise your linkage specs to pick up true links that were skipped or that were dropped because they fell below the cutoff. You might have to add a match pass, add a match field, increase a tolerance, etc.

Calculated probabilities must be accurate because they are used to select imputed pairs for outcome studies. CODES2000 also uses imputed pairs to revise your prior estimates of match parameters during the Markov Chain process. In the end, high probability links must be almost all true, medium probability links must be an accurate mixture of true and false links, and low probability links must be almost all false. Otherwise, imputations will not be accurate. This is the reason for the second goodness of fit test.
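The first goodness of fit test can be checked with a few lines of code once the simulated files and the candidate pairs are exported. The following is a minimal sketch, not part of CODES2000: the function name, the input layout (lists of UniqueIDs and candidate pairs given as UniqueID tuples), and the assumption that each UniqueID occurs at most once per file are all hypothetical.

```python
def true_link_coverage(unique_ids_a, unique_ids_b, candidate_pairs):
    """Fraction of true links that appear among the candidate pairs.

    Assumption: in simulated data a true link is a pair of records,
    one from each file, that share the same UniqueID.
    candidate_pairs holds (unique_id_a, unique_id_b) tuples.
    """
    true_ids = set(unique_ids_a) & set(unique_ids_b)
    if not true_ids:
        raise ValueError("the two files share no UniqueIDs")
    # A candidate pair is a true link when both sides carry the same UniqueID.
    found = {a for a, b in candidate_pairs if a == b}
    return len(found & true_ids) / len(true_ids)


# Small made-up example: 4 true links exist, 3 of them are candidates.
ids_a = [1, 2, 3, 4, 5]
ids_b = [2, 3, 4, 5, 6]
candidates = [(2, 2), (3, 3), (4, 4), (1, 6), (5, 6)]
coverage = true_link_coverage(ids_a, ids_b, candidates)
print(f"coverage = {coverage:.2f}")
```

With the 95% guideline from the text, a coverage of 0.75 would mean the linkage specs need revising before moving on to the second test.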

To get the following table, all candidate pairs above the cutoff were ranked by probability (low to high) and divided into 10 groups (deciles) with approximately the same number of record pairs in each. Ties were ranked randomly. It is easy to count the actual number of true links in each decile because UniqueIDs are equal for true links in the simulated data. The expected number of true links is the sum of the match probabilities for all candidate pairs in the decile. Because the linked data are multiply imputed, the counts shown here are averages of the counts in the tables for each imputation. This is why actual counts are not integers.

 

CrashEMS__Fit10

Decile  PairsInDecile  ActualTrue  ExpectedTrue
1       302            4.8         3.262
2       302            10.4        10.012
3       302            52.6        45.92
4       302            220.6       228.24
5       302            296.6       298.082
6       302            302         301.544
7       302            301.2       301.932
8       302            301.8       301.976
9       302            302         301.99
10      303            303         303
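The way such a table is built can be sketched as follows. This is an illustration under stated assumptions, not the CODES2000 implementation: the function name and the input layout (one `(probability, is_true_link)` tuple per candidate pair above the cutoff) are hypothetical, and the sketch handles a single imputation. For multiply imputed data the resulting counts would be averaged over the imputations.

```python
import random


def decile_fit_table(pairs, n_groups=10, seed=0):
    """Return rows of (decile, pairs_in_decile, actual_true, expected_true).

    pairs: iterable of (match_probability, is_true_link) tuples.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)                          # random order first ...
    ranked = sorted(shuffled, key=lambda p: p[0])  # ... so the stable sort breaks ties randomly
    n = len(ranked)
    rows = []
    for g in range(n_groups):
        # Integer boundaries give groups of approximately equal size.
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        actual_true = sum(1 for _, is_true in group if is_true)
        expected_true = sum(prob for prob, _ in group)  # sum of match probabilities
        rows.append((g + 1, len(group), actual_true, expected_true))
    return rows


# Made-up data: 3021 pairs, the same total as in the table above.
demo_pairs = [(i / 3020, i % 2 == 0) for i in range(3021)]
demo_rows = decile_fit_table(demo_pairs)
for row in demo_rows:
    print(row)
```

With 3021 pairs this boundary rule yields nine groups of 302 and a last group of 303, the same group sizes as in the table.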

 

The idea is to have a model for which the expected true counts (available for both real and simulated data) are close to the actual true counts (available only for simulated data), as is the case in this example, so that you can trust the probabilities when linking real data. How close is close enough? This is a standard statistical question, answered by the p value of a chi-squared test. If p is not very small, say not less than 0.05, then the differences in counts are considered not statistically significant.
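As an illustration, the comparison can be carried out on the table above with a Pearson chi-squared statistic. This is a sketch under assumptions and may differ from what CODES2000 reports: the statistic here uses the true-link counts only, and the degrees of freedom are assumed to equal the number of deciles. The function names are hypothetical, and the tail probability is computed with the standard library only.

```python
import math

# (ActualTrue, ExpectedTrue) for deciles 1 to 10, copied from the table.
FIT_TABLE = [
    (4.8, 3.262), (10.4, 10.012), (52.6, 45.92), (220.6, 228.24),
    (296.6, 298.082), (302, 301.544), (301.2, 301.932),
    (301.8, 301.976), (302, 301.99), (303, 303),
]


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)               # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))    # exact tail for df = 1
    while a < df / 2.0:
        # Recurrence for the regularized upper incomplete gamma function.
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def fit_test(table, df=None):
    """Return (chi-squared statistic, p value) for actual vs expected counts."""
    stat = sum((actual - expected) ** 2 / expected for actual, expected in table)
    if df is None:
        df = len(table)  # assumption: one degree of freedom per decile
    return stat, chi2_sf(stat, df)


stat, p = fit_test(FIT_TABLE)
print(f"chi-squared = {stat:.3f}, p = {p:.3f}")
```

For this table the p value is far above 0.05, which agrees with the statement that the expected and actual counts are close in this example.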