CODES 2000 User Forum -- Tech Note #8
Choosing a Match Strategy
Applies to: CODES 2000.
SUMMARY
This note gives heuristic guidelines for choosing a match strategy that can help ensure that your matches are as complete as possible.
PROCEDURE
The odds are very low that a pair of records chosen at random from among all possible record pairs is a true match. Because the odds of a true match can rise to an acceptable level only with one or more high-weight agreements, every candidate pair that the Fellegi and Sunter methodology classifies as matched must contain at least one high-weight agreement. In other words, we will not miss any high-probability links by considering only candidate pairs with at least one high-weight agreement.

Choosing which fields must agree for each pass in a series of match passes is a major element in your overall match strategy. The fields that must agree in a given match pass are identified as join fields in the link specifications.

First, determine whether your match fields describe a single entity such as a person or multiple entities such as a person plus an event. Without loss of generality, we will assume that there are two distinct entities, a person plus an event, and that they can be treated independently: knowing about a person tells you nothing about an event, and vice versa.

Second, choose a reliable and discriminating field for each entity as a join field. For example, you might choose person age and date of event. When you test the match pass, have the Link Specification Wizard count the number of candidate pairs. Aim for approximately 10 to 100 times the expected number of true matches. If your count is well below this range, substitute a field with less discriminating power. If your count is well above this range, substitute a field with more discriminating power or add an additional join field.

Third, repeat the process of choosing join fields for each new match pass. Make sure that at least one of your new join fields is nearly independent of your earlier choices.
For example, you could join on person age and event date in your first pass, person age and event location in your second pass, person residence zip and event date in your third, and person residence zip and event location in your fourth. Analyze the number of new links found with each new match pass to determine when your match is effectively complete.

Fourth, adjust comparison tolerances and weight factors of match fields in each pass to reflect your choices for join fields. For example, if you join on event date then you should match on event date with zero tolerance and full agree weight. Allow a reasonable tolerance for event location and reduce agree weights for event location to reflect that tolerance and any dependency of event location on event date. If you join on event location then you should match on event location with zero tolerance and full agree weight. Allow a reasonable tolerance for event date and reduce agree weights for event date to reflect that tolerance and any dependency of event date on event location.
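
The candidate-pair check in the second step and the new-link analysis in the third step can be tried outside the product on small samples. The Python sketch below is only an illustration and is not part of CODES 2000 or the Link Specification Wizard. The field names (age, event_date, zip, event_location), the function names, and the rule that records with a missing join value form no candidate pairs are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical four-pass plan from the example above; the field names are
# assumptions for this sketch, not CODES 2000 identifiers.
PASSES = [
    ("age", "event_date"),
    ("age", "event_location"),
    ("zip", "event_date"),
    ("zip", "event_location"),
]


def count_candidate_pairs(file_a, file_b, join_fields):
    """Count record pairs that agree exactly on every join field.

    Records are plain dicts. Assumption: a record with a missing join
    value contributes no candidate pairs.
    """
    def key(record):
        return tuple(record.get(field) for field in join_fields)

    counts_a = Counter(key(r) for r in file_a)
    counts_b = Counter(key(r) for r in file_b)
    # Each shared key value yields (count in A) x (count in B) pairs.
    return sum(n * counts_b[k] for k, n in counts_a.items() if None not in k)


def assess_pass(candidate_pairs, expected_true_matches, low=10, high=100):
    """Compare a count with the 10x to 100x guideline from this note."""
    ratio = candidate_pairs / expected_true_matches
    if ratio < low:
        return "too few: substitute a less discriminating join field"
    if ratio > high:
        return "too many: substitute a more discriminating field or add one"
    return "within range"


def new_links_per_pass(links_by_pass):
    """Return how many links each pass adds beyond all earlier passes.

    links_by_pass is a list with one collection of (id_a, id_b) pairs
    per match pass, in the order the passes were run.
    """
    seen = set()
    new_counts = []
    for links in links_by_pass:
        fresh = set(links) - seen
        new_counts.append(len(fresh))
        seen |= fresh
    return new_counts
```

Calling count_candidate_pairs once per entry in PASSES gives a rough preview of each pass before it is run. When the values returned by new_links_per_pass drop to a small fraction of the first pass, that suggests the match is effectively complete.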
© Copyright 2000 - 2008 Strategic Matching, Inc. All rights reserved. Microsoft, Windows, and Access are trademarks of Microsoft Corporation. Last modified: Monday January 28, 2008.