CODES 2000 User Forum -- Data Network Note #11

Multiple Imputation and One-to-One Links (Revised)

Applies to: CODES 2000 Version 2.2.338.
Last updated: Tuesday February 19, 2002.

SUMMARY

For Phase I of the helmet use study, the first step for multiple imputation of linked record pairs was to run several match passes in CODES 2000 at a very low cutoff probability, say 0.001, and then merge the results.  This produced a Match Pairs In Sets table that approximated the posterior distribution of match probabilities for all record pairs given observed agreements, disagreements, and missing values.  A supplementary Linkage Imputation Wizard was introduced to impute multiple complete sets of linked pairs by simulating random draws from this posterior distribution.  The Wizard also reduced many-to-many matches to one-to-one matches so that one Crash record was linked to at most one Inpatient record, and vice versa.

Some Data Network states discovered that this methodology produced far fewer imputed links than expected.  In a few cases, the imputation process produced even fewer links than the original high-probability match.  The reasons behind these unexpected results have been identified, and new versions of CODES 2000 have been developed that address the problems.  This Note describes the current methodology for linkage imputation.

ANALYSIS

  1. CODES 2000 should tabulate maximum weights.

CODES 2000 often finds the same matched pair of records in more than one match pass.  If match specifications for the passes are not identical, then a common matched pair can be assigned different weights in different passes.  This might happen if one pass matches on county location while another pass matches on town location, or if one pass matches crash date to admit date while another pass matches the day after the crash date to admit date.

When record A is linked to record B in two different passes with two different weights, the pre-imputation version of CODES 2000 tabulated the lower weight when the passes were merged. When all match weights corresponded to probabilities above 0.9, this made little difference. For imputation, we now accept much lower probabilities. So, if A-B has probability 0.9 in pass 1 and probability 0.5 in pass 2, the pre-imputation version of CODES 2000 tabulated the weight for 0.5 when you imputed but tabulated the weight for 0.9 when you did the traditional linkage.  Understating the weight in this way resulted in fewer imputed links when drawing from the posterior distribution. To remove the inconsistency, CODES 2000 was changed in version 2.1.317 to tabulate the highest weight.  This new version was distributed to all Data Network states on December 3, 2001.
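
The merge rule described above can be sketched in a few lines of Python. This is only an illustration of "keep the highest weight per pair", not the CODES 2000 implementation; the function name merge_passes, the tuple layout, and the sample weights are all assumptions made for the example.

```python
def merge_passes(passes):
    """Merge match passes, keeping the highest weight seen for each pair.

    `passes` is a list of passes; each pass is a list of
    (crash_id, inpatient_id, weight) tuples (a hypothetical layout).
    """
    best = {}
    for match_pass in passes:
        for crash_id, inpatient_id, weight in match_pass:
            key = (crash_id, inpatient_id)
            # Keep the larger weight when the pair appears in several passes.
            if key not in best or weight > best[key]:
                best[key] = weight
    return best


# Pair A-B is found in both passes with different (made-up) weights.
pass_1 = [("A", "B", 25.0), ("A", "C", 4.0)]
pass_2 = [("A", "B", 12.0)]
merged = merge_passes([pass_1, pass_2])
```

Here merged keeps the weight from pass 1 for A-B. Replacing the comparison with "weight < best[key]" would reproduce the older behavior of tabulating the lower weight.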

Utah and Kentucky tested the new version of CODES 2000. Numbers from Mike Singleton illustrate the possible effects. Before the change, he performed the old-style match with a 0.9 probability cutoff, and found 1,500 matches with a weight of 25 or greater. Then he matched with the same specifications at a 0.01 probability cutoff, and performed the multiple imputation. He estimated there should be a total of 3,000 matches. The imputation routine found about 2,500 matches. He tabulated the match weights for the LinkedPairs1 table, and found only 644 matches with a weight of 25 or greater. Also, the median match probability after imputation was only 0.35.

After the change, the average number of matches found by imputation was 2,829. The increase from 1,500 to 2,829 was 1,329, which is very close to the number of hospital cases E-coded as motor vehicle injuries that failed to link at 0.9. The median match probability after imputation was around 0.85, and most of the high-probability matches survived.

Note that a few high probability matches may be dropped when you do any of the imputations -- a 0.9 probability match has a 0.9 probability of being selected and a 0.1 probability of being dropped in each imputation.  So, a 0.9 probability match will be selected for most imputations while a 0.1 probability match will be selected only rarely.
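
The selection behavior described above can be simulated with a short Python sketch. It assumes each pair is kept or dropped independently with its own match probability, which is a simplification for illustration and not a description of the Wizard's internal code; the function name impute_links and the sample pairs are hypothetical.

```python
import random


def impute_links(pair_probs, n_imputations, seed=None):
    """Draw n_imputations sets of links from per-pair match probabilities.

    `pair_probs` maps (crash_id, inpatient_id) to a match probability.
    Each pair is selected independently in each imputation (an assumption).
    """
    rng = random.Random(seed)
    imputations = []
    for _ in range(n_imputations):
        selected = [pair for pair, p in pair_probs.items() if rng.random() < p]
        imputations.append(selected)
    return imputations


pair_probs = {("A", "B"): 0.9, ("C", "D"): 0.1}
draws = impute_links(pair_probs, 1000, seed=1)

# Share of imputations in which each pair was selected.
share_high = sum(("A", "B") in d for d in draws) / len(draws)
share_low = sum(("C", "D") in d for d in draws) / len(draws)
```

Over many imputations, share_high should settle near 0.9 and share_low near 0.1, which is why a few high-probability matches are missing from any single imputation.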

  2. CODES 2000 should assign set numbers after imputation.

The pre-imputation version of CODES 2000 assigned each matched pair to a set of pairs as part of the merge process.  For example, if Crash record A was linked to both Inpatient record B and Inpatient record C, then matched pairs A-B and A-C were both assigned to the same set.  The purpose of assigning set numbers was to allow identification and review of many-to-many matches in which one Crash record was linked to more than one Inpatient record or more than one Crash record was linked to the same Inpatient record.  Set numbers were designed to allow selection of one-to-one matches from many-to-many matches because any many-to-many match must be reduced to one or more one-to-one matches before you can analyze the linked record pairs.
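
One way to picture set assignment is as grouping pairs that share a record, directly or through a chain of other pairs. The Python sketch below does this with a small union-find structure. It is an assumption about how sets could be formed, not the method CODES 2000 uses; assign_set_numbers and the record labels are hypothetical.

```python
def assign_set_numbers(pairs):
    """Give each matched pair a set number; pairs sharing a record share a set."""
    parent = {}

    def find(node):
        # Follow parent links to the representative of the node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Join the Crash record and the Inpatient record of every pair.
    for crash_id, inpatient_id in pairs:
        root_crash = find(("crash", crash_id))
        root_inpatient = find(("inpatient", inpatient_id))
        parent[root_crash] = root_inpatient

    # Number the groups in order of first appearance.
    set_numbers = {}
    result = {}
    for crash_id, inpatient_id in pairs:
        root = find(("crash", crash_id))
        result[(crash_id, inpatient_id)] = set_numbers.setdefault(
            root, len(set_numbers) + 1
        )
    return result


sets = assign_set_numbers([("A", "B"), ("A", "C"), ("D", "E")])
```

In this example A-B and A-C fall in the same set because they share Crash record A, while D-E forms a set of its own.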

The process of creating a one-to-one match from a many-to-many match consists of selecting one or more matched pairs from each set.  For example, suppose a set contains three matched pairs A-C, B-C, and B-D.  First, we might select A-C.  Second, we eliminate B-C because C is already linked.  Third, we select the remaining pair, B-D.  The Linkage Imputation Wizard incorporated an existing algorithm, designed for use with high cutoff probability matches, that looked for up to 10 unique one-to-one pairs in each set.  For some states, we discovered that when merging match passes with very low cutoff probability matches, some sets contained hundreds, or even thousands, of unique one-to-one matches.  This happens because some records link to many other records at very low probability, so adding thousands of very low probability links to the Match Pairs In Sets table can produce very long chains of linked pairs.  For such large sets, many of the unique matched pairs were lost when creating the one-to-one matches.
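
The A-C, B-C, B-D example can be traced with a short Python sketch. It assumes pairs are considered in the order given and that the search simply stops once a cap is reached; both are assumptions for illustration, and select_one_to_one and max_pairs are hypothetical names rather than part of CODES 2000.

```python
def select_one_to_one(pairs, max_pairs=None):
    """Pick pairs from one set so that no record is linked more than once.

    `pairs` is a list of (crash_id, inpatient_id) tuples in selection order.
    `max_pairs` optionally caps how many pairs are taken from the set.
    """
    used_crash = set()
    used_inpatient = set()
    selected = []
    for crash_id, inpatient_id in pairs:
        if max_pairs is not None and len(selected) >= max_pairs:
            break  # cap reached; remaining pairs in the set are lost
        if crash_id in used_crash or inpatient_id in used_inpatient:
            continue  # one of the two records is already linked
        selected.append((crash_id, inpatient_id))
        used_crash.add(crash_id)
        used_inpatient.add(inpatient_id)
    return selected


pairs_in_set = [("A", "C"), ("B", "C"), ("B", "D")]
linked = select_one_to_one(pairs_in_set)
capped = select_one_to_one(pairs_in_set, max_pairs=1)
```

Without a cap, A-C and B-D are selected and B-C is eliminated. With a cap of 1, B-D is lost even though it is a valid one-to-one match, which is the kind of loss a very large set can suffer under a fixed limit. Sorting the pairs by descending weight before the call would favor the highest-weight pairs.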

CODES 2000 has been changed in version 2.2.336 to assign set numbers separately for each imputation rather than once for all pairs in the Match Pairs In Sets table.  Each imputation contains only a few very low probability links.  Consequently, the potential for very large sets is substantially reduced.  In addition, the Linkage Imputation Wizard has been changed to look for up to 50 unique one-to-one pairs.  In version 2.2.336, the Imputation Wizard has been incorporated into the standard CODES 2000 Perform Match Wizard for ease of use.  It appears when you click on the Merge button.  Version 2.2.336 will be distributed to Data Network states prior to starting Phase II of the helmet study.

Maryland and others tested the new version of CODES 2000. Numbers from Shiu Ho illustrate the possible effects. Before the change, she performed the old-style match with a 0.9 probability cutoff, and found 3,749 matches. Then she matched with the same specs at a 0.001 probability cutoff, and performed the multiple imputation. She estimated there should be a total of 8,200 matches. The imputation routine found only about 4,350 matches (53%).

She tried several minor variations in the match specifications but could do no better than 4,900 matches.  In fact, some rejected trials with very loose join specifications produced as few as 1,300 imputed links.  After some investigation, we found that 3,742 matched pairs had been assigned to set number 229, and that most of these pairs were dropped when the Linkage Imputation Wizard created one-to-one matches.  The new Imputation Wizard was designed to correct this problem.  After installing the new Wizard, over 6,500 matches (about 80%) were found by imputation using the same match specifications as used earlier. 

Because of the potential impact on imputation results, all Data Network states were notified about this problem on January 17, 2002, and asked to count the number of matched pairs in each set using a specified SQL command.
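
A count of this kind can be expressed as a simple GROUP BY query. The Python sketch below runs one against an in-memory SQLite table as a stand-in. It is not the command that was circulated to the states; the column names CrashID, InpatientID, and SetNumber and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory stand-in for the MatchPairsInSets table (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MatchPairsInSets "
    "(CrashID TEXT, InpatientID TEXT, SetNumber INTEGER)"
)
conn.executemany(
    "INSERT INTO MatchPairsInSets VALUES (?, ?, ?)",
    [("A", "B", 1), ("A", "C", 1), ("D", "E", 2)],
)

# Count matched pairs per set, largest sets first.
rows = conn.execute(
    "SELECT SetNumber, COUNT(*) AS PairCount "
    "FROM MatchPairsInSets "
    "GROUP BY SetNumber "
    "ORDER BY PairCount DESC, SetNumber"
).fetchall()
conn.close()
```

A set with an unusually large count at the top of the result, such as the 3,742 pairs in set number 229 described above, signals that many pairs may be dropped when one-to-one matches are created.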

  3. New CODES 2000 Linkage Imputation Wizard.

When you click on the usual Merge button, you see the usual confirmation message.  When the Merge is complete, you see the usual information message.  After you acknowledge the message, you see the new Linkage Imputation Wizard.

Enter the number of imputations that you want the Wizard to create.  Imputations will be tabulated and set numbers assigned in tables named ImputedPairsInSets1, ImputedPairsInSets2, etc.  If you enter 0, the Wizard will not create any imputation tables.  The Imputation Wizard will tabulate imputed one-to-one matches in tables named LinkedPairs1, LinkedPairs2, etc.  If you choose not to impute links, the Wizard will tabulate one-to-one links from the entire MatchPairsInSets table in a table named LinkedPairs0.  In this case, the Wizard picks the highest weight pairs from each set.

 
 
© Copyright 2000 - 2008 Strategic Matching, Inc. All rights reserved. Microsoft, Windows, and Access are trademarks of Microsoft Corporation. Last modified: Monday January 28, 2008.