Understand the Goals of Probabilistic Record Linkage. Information from multiple data files can be used in observational outcome studies if records for the same person and event can be linked (associated). Probabilistic record linkage is a technique for comparing data values on pairs of records and calculating the probability that each candidate pair is a true link, given all comparison outcomes. Bayesian statistical models are the basis for the calculations. Agreement on a compared field increases the probability that a pair is a true link, and disagreement decreases it. Data values to be compared in linkage models must be standardized so that equivalent information is coded in the same way.
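As an illustration of how agreements and disagreements move a link probability, here is a minimal sketch of a Bayesian update. It is not the CODES2000 implementation. It assumes that comparison fields are conditionally independent and that each field has a known probability of agreeing on true links (m) and on false links (u); the function name, the prior, and the numeric values are hypothetical.

```python
import math

def link_probability(prior, comparisons):
    """Posterior probability that a candidate pair is a true link.

    prior: assumed prior probability that a candidate pair is a true link.
    comparisons: list of (agrees, m, u) tuples, where
        m = P(field agrees | true link), u = P(field agrees | false link).
    Assumption: fields are conditionally independent given link status.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for agrees, m, u in comparisons:
        if agrees:
            log_odds += math.log(m / u)                   # agreement raises the odds
        else:
            log_odds += math.log((1.0 - m) / (1.0 - u))   # disagreement lowers the odds
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values: prior of 0.01, two fields with m = 0.95 and u = 0.05.
both_agree = link_probability(0.01, [(True, 0.95, 0.05), (True, 0.95, 0.05)])
one_disagrees = link_probability(0.01, [(True, 0.95, 0.05), (False, 0.95, 0.05)])
```

With these made-up values, the pair that agrees on both fields ends up with a much higher probability than the pair that disagrees on one of them.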
Understand the Characteristics of Your Data. Become familiar with the data files to be linked and the data fields reported on each record. Review documentation for the data files and interview data owners. Learn about coding standards, completeness, and accuracy of reported data. Investigate how you might standardize and compare different data fields.
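Standardization can be sketched as follows. The code map and cleaning rules are hypothetical examples; the coding standards documented for your own files determine the real rules.

```python
import re

# Hypothetical code map; real files will use their own documented codes.
SEX_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "2": "F"}

def standardize_sex(value):
    """Map equivalent codings to one code; unknown codes become None (missing)."""
    if value is None:
        return None
    return SEX_CODES.get(str(value).strip().lower())

def standardize_name(value):
    """Uppercase a name, drop punctuation, and collapse repeated spaces."""
    if value is None:
        return None
    cleaned = re.sub(r"[^A-Za-z ]", "", value)
    cleaned = " ".join(cleaned.upper().split())
    return cleaned or None  # an empty result is treated as missing
```

Mapping unrecognized values to missing, rather than guessing, keeps a coding problem from being counted as a disagreement.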
Understand the Linkage Process. Become familiar with the record linkage process by flying the Auto Pilot and reviewing linkage results.
Create and Link Simulated Data. Become familiar with the data simulator and create simulated (artificial) data similar to your real data. The great advantage of using simulated data is that you can tell by inspection which record pairs are true links, because true links have matching record ID numbers. Investigate different linkage models until you develop one that is effective -- that is, one for which calculated probabilities are approximately equal to actual probabilities. Simulated data will never be exactly the same as real data, but this model will be the starting point for your real linkage.
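One way to check whether calculated probabilities are approximately equal to actual probabilities is to group simulated pairs by calculated probability and compare each group's average with its observed proportion of true links. The sketch below assumes each pair is available as a (calculated probability, record ID from file A, record ID from file B) tuple; that layout and the function name are hypothetical, not a simulator output format.

```python
def calibration_table(pairs, n_bins=5):
    """Compare calculated and actual link probabilities on simulated data.

    pairs: iterable of (calculated_probability, id_a, id_b).
    A pair is a true link when its simulated record IDs match.
    Returns rows of (bin_low, bin_high, count, mean_calculated, actual_proportion).
    """
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # count, sum of probabilities, true links
    for prob, id_a, id_b in pairs:
        k = min(int(prob * n_bins), n_bins - 1)  # a probability of 1.0 goes in the top bin
        bins[k][0] += 1
        bins[k][1] += prob
        bins[k][2] += int(id_a == id_b)
    rows = []
    for k, (count, prob_sum, true_links) in enumerate(bins):
        if count:  # skip empty bins
            rows.append((k / n_bins, (k + 1) / n_bins, count,
                         prob_sum / count, true_links / count))
    return rows
```

For an effective model, the last two values in each row should be close. Large gaps suggest that the model needs revision.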
Create and Link Sample Data. Select and link sample records from your data files, for example all records for one month from a year of records. You may have to make small revisions to your linkage model because your real data are slightly different from your simulated data. Measure the effects of comparisons with tolerances and comparisons with dependent outcomes. Inaccurate estimates of model parameters can lead to inaccurate values for linkage probabilities. Linking a sample of your records helps you improve the parameter estimates used for the full linkage because you can compare prior and posterior estimates for the sample linkage and detect poor model fit.
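A comparison with a tolerance treats values that are close, but not identical, as a separate outcome from exact agreement. The sketch below is a hypothetical illustration for date fields; the outcome labels and the one-day default are assumptions, not settings from the linkage software.

```python
from datetime import date

def compare_dates(a, b, tolerance_days=1):
    """Classify a date comparison as exact, within tolerance, disagree, or missing."""
    if a is None or b is None:
        return "missing"            # a missing value is neither agreement nor disagreement
    diff = abs((a - b).days)
    if diff == 0:
        return "exact"
    if diff <= tolerance_days:
        return "within_tolerance"   # e.g. an event recorded just after midnight
    return "disagree"
```

Tabulating these outcomes for the sample linkage, with and without the tolerance, is one way to measure the effect of the tolerance on the results.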
Create and Link All Data. Select and link all records from your data files. You will have to make small revisions to your linkage model because your complete data files have slightly different characteristics from your sample data files. You will always have to change your estimate of total true links and your search criteria for candidate pairs.
Impute Complete Linked Data Files. Administrative data files linked for observational studies are usually incomplete -- they have unintended missing values in fields of interest. Also, missing values and incorrect values in linkage comparison fields result in low probabilities for some true links and high probabilities for some false links. The best statistical technique for analyzing such incomplete datasets is multiple imputation. CODES2000 treats unknown true link status as missing data in the first step of a hierarchical imputation model. This gives you multiply-imputed linked datasets. Then, impute missing values for each linkage imputation using standard hierarchical models in SAS PROC MI. This gives you multiply-imputed complete linked datasets for analysis.
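The idea of treating unknown true link status as missing data can be sketched as drawing each pair's link status from its calculated probability, once per imputation. This is a deliberately simplified illustration, not the CODES2000 procedure: it assumes pairs are drawn independently and ignores constraints such as one-to-one linkage. The function name and input layout are hypothetical.

```python
import random

def impute_link_status(pair_probs, n_imputations=5, seed=0):
    """Draw multiply-imputed sets of linked pairs.

    pair_probs: dict mapping a pair identifier to its calculated link probability.
    Returns a list with one list of linked pairs per imputation.
    Assumption: pairs are drawn independently of one another.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    datasets = []
    for _ in range(n_imputations):
        linked = [pair for pair, p in pair_probs.items() if rng.random() < p]
        datasets.append(linked)
    return datasets
```

Pairs with probabilities near 1 appear in almost every imputed dataset, while uncertain pairs appear in only some of them, so the variation between datasets reflects linkage uncertainty.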
Analyze Imputed Data Files. Analyze each multiply-imputed complete linked dataset using standard techniques such as SAS PROC REG or SAS PROC LOGISTIC. Combine multiple analysis results such as regression coefficients or population proportions using SAS PROC MIANALYZE.
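To show what combining multiple analysis results involves, here is a minimal sketch of the standard combining rules for multiple imputation (Rubin's rules), applied to a single estimate such as one regression coefficient. It is not a substitute for SAS PROC MIANALYZE: it omits degrees of freedom and confidence intervals, and the function name is hypothetical.

```python
import math
from statistics import mean, variance

def combine_estimates(estimates, variances):
    """Combine one parameter's results across imputed datasets.

    estimates: point estimates, one per imputed dataset (at least two).
    variances: squared standard errors, one per imputed dataset.
    Returns (combined_estimate, combined_standard_error).
    """
    m = len(estimates)
    combined = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return combined, math.sqrt(total)
```

The combined standard error is larger than the average standard error from the individual analyses whenever the estimates differ across imputations, which is how the uncertainty due to missing values and uncertain links carries into the final result.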