CODES 2000 User Forum -- Data Network Note #10

Effect of Helmet Use on Inpatient Charges - Phase I 

Applies to: CODES Data Network.
Last updated: Wednesday February 27, 2002.

SUMMARY

In Phase I of this study, CODES Data Network participants reported the effect of helmet use on hospital inpatient charges for motorcycle riders injured in crashes and discharged alive.  They estimated the helmet use effect following two different methodologies in order to compare the results.  The first methodology was a traditional approach.  They found a set of high probability links between police reported crash records and hospital discharge records using a Fellegi and Sunter probabilistic record linkage model as implemented in CODES 2000 software from Strategic Matching.  Then they conducted a linear regression analysis on those high probability linked record pairs that had complete values for all regression variables using either SAS/STAT REG software from the SAS Institute or Excel Data Analysis software from Microsoft.  The major disadvantage of this first methodology is that the population analyzed may not represent the population of interest unless the percentage of missing links and the percentage of missing values are very low, say less than 5%.  All Data Network participants reported rates of missing links and missing values that were much higher than this level.
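As an illustration only, the weight-to-probability step of a Fellegi and Sunter model can be sketched in Python as follows.  The field names, the m- and u-probabilities, and the function names are hypothetical; they are not taken from CODES 2000, whose internal calculations are not described in this note.  The record counts are those reported for the Demo data in Table 1.

```python
import math

# Hypothetical per-field probabilities (assumptions for illustration):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELDS = {
    "birth_date": (0.95, 0.0001),
    "sex": (0.98, 0.5),
    "crash_date": (0.90, 0.003),
    "county": (0.85, 0.05),
}


def match_weight(agreements):
    """Sum the log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight, expected_links, n_crash, n_hospital):
    """Convert a match weight to a posterior match probability.

    The prior is the expected number of links divided by the number of
    possible record pairs (an assumption about how the prior is set).
    """
    prior = expected_links / (n_crash * n_hospital)
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2.0 ** weight
    return posterior_odds / (1 + posterior_odds)


# A pair that agrees on every field, using the Demo counts from Table 1.
pair = {"birth_date": True, "sex": True, "crash_date": True, "county": True}
probability = match_probability(match_weight(pair), 1200, 56689, 117394)
print(f"match probability: {probability:.3f}")
```

In the traditional methodology, only pairs whose probability exceeds 0.9 would be kept for analysis.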

In order to better describe the population of interest, the second methodology was a two-stage Bayesian multiple imputation approach similar to those described by Rubin, Schafer, and others.  Participants created multiple complete sets of links between crash records and discharge records using an extended Fellegi and Sunter model and Bayesian multiple imputation of missing links as implemented in CODES 2000.  Then, for each imputed set of links, they created multiple complete datasets using Markov Chain Monte Carlo multiple imputation for missing values as implemented in either SAS/STAT MI software or Schafer's NORM software.  Next, they conducted linear regression analysis for each complete dataset using either REG or Data Analysis.  Finally, they combined parameter estimates derived from each imputation into a single estimate for the population of interest using Rubin's algorithms as implemented in either SAS/STAT MIANALYZE or NORM.
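The final combining step can be sketched with Rubin's standard rules for a scalar estimate.  This is a minimal sketch, not the MIANALYZE or NORM implementation: the function name is hypothetical, and it pools all completed datasets as a single set, which is a simplification of the two-stage design (links first, then values).

```python
import math
from statistics import mean, variance


def combine_rubin(estimates, std_errors):
    """Pool one parameter estimate across m completed datasets.

    Returns (pooled estimate, pooled standard error, degrees of freedom).
    """
    m = len(estimates)
    q_bar = mean(estimates)                         # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance
    total = within + (1 + 1 / m) * between          # total variance
    if between > 0:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    else:
        df = math.inf                               # no between-imputation variation
    return q_bar, math.sqrt(total), df


# Hypothetical helmet effects and standard errors from three completed datasets.
effect, std_error, df = combine_rubin([-3100.0, -2900.0, -3300.0],
                                      [2800.0, 2900.0, 2850.0])
print(round(effect), round(std_error), round(df, 1))
```

The between-imputation term is what carries the extra uncertainty due to missing links and missing values into the pooled standard error.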

Here, we tabulate and compare results reported by CODES Data Network participants.  In addition, we combine the multiple imputation estimates of the effect of helmet use in a meta-analysis and tabulate the results.  The meta-analysis follows the Bayesian approach presented by Rubin for normally distributed effects.  Finally, we discuss apparent strengths and weaknesses of the imputation methodology, and suggest changes to improve the methodology for Phase II of the study.

SAS and SAS/STAT are trademarks of SAS Institute, Inc.

RESULTS

  1. Traditional High-Probability Linkage and Complete Case Regression

Table 1 - Reported High Probability Linkage Results

State  Crash Recs.  Inpatient Recs.  Est. Total Links  Act. Links > 0.9
Demo        56,689          117,394             1,200               527
DE          81,928           10,419             4,000             1,646
KY         307,773           39,344             2,800             1,485
MD         224,111          626,955             8,100             3,749
ME          99,712           10,878             1,812             1,006
MN         254,024           39,005             5,000             2,731
NE          83,525           24,907             1,340               940
NV         106,803           12,844                               1,034
OK         180,253          355,918             2,500             1,737
PA       2,533,317          114,979           112,500            66,515
SC         270,463           32,530             6,500
SD          71,080            6,409               700               476
UT         428,165           77,570             7,186             4,380
WI         357,058          621,236             5,400             3,900
 
Table 2 - Reported Complete Case Regression Results
State  MC Links > 0.9  Complete Cases  Charges Intercept  Helmet Effect  Std. Error
Demo               56              19             21,323         -8,746       9,107
DE                 66              44             26,705         -2,952      11,223
KY                 86              53             17,163         -6,247       8,175
MD                229             195             17,083           -236       4,859
ME                 52              52             23,568         -8,016       7,353
MN                164             135             26,204         -4,096       5,676
NE                 38              21              6,540         18,632      20,345
NV                 51              47             51,926        -30,878
OK                108              98             29,733         -1,406       1,213
PA              4,014           2,664             36,629         -5,128       5,649
SC                 11              10              6,808         -3,300       3,919
SD                 65              65             21,284         12,242       8,921
UT                338             189             17,953         -6,394       3,277
WI                446             316             25,953         -7,472       4,383

See results for Demo, DE, KY, MD, ME, MN, NE, OK, PA, SC, SD, UT, and WI in the CODES Data Network discussion group.

  2. Multiple Imputation Results
Table 3 - Reported Linkage Imputation Results
State Crash Recs. Inpatient Recs. Est. Total Links Total Links Imp 1 Avg. Prob. Imp 1
Demo 56,689 117,394 1,200 1,060 0.70
DE 81,928 10,419 2,000 1,752 0.85
KY 307,773 39,344 2,800 2,470 0.70
MD 224,111 15,419 8,200 6,525 0.75
ME 99,712 10,878 1,812 1,654 0.78
MN 254,024 39,005 5,000 3,045 0.62
NE 83,525 24,907 1,340 1,477 0.75
OK 180,253 355,918 2,500 2,172 0.89
PA 2,533,317 114,979 112,500 67,564 0.85
RI 43,898 126,411 900 710 0.52
SC 270,463 32,530 6,500 3,820 0.88
SD 80,079 6,409 2,000 912 0.70
UT 428,165 77,570 7,186 7,537 0.68
WI 357,058 621,236 5,400 2,635 0.72
 
Table 4 - Reported Value Imputation and Regression Results
State Imputed MC Links (Imp. 1) Charges Intercept Helmet Effect Std. Error
Demo 66 14,101 -1,288 4,096
DE 63 24,523 -5,754 8,986
KY 101 16,437 -3,059 5,088
MD 226 14,774 860 4,474
ME 51 22,049 -7,307 7,187
MN 108 26,482 -10,375 8,406
NE 33 8,682 16,380 20,977
OK 122 29,308 -1,735 1,200
PA 3,308 35,051 -2,498 16,077
RI 16 18,807 9,455 17,537
SD 81 23,671 6,684 10,118
UT 401 17,051 -3,616 2,864
WI 378 24,104 -7,269 4,195

See results for Demo 1 & 2, DE 1 & 2, KY 1 & 2, MD 1 & 2, ME 1 & 2, MN 1 & 2, NE 1 & 2, OK 1 & 2, PA 1 & 2, SD 1 & 2, UT 1 & 2, and WI 1 & 2 in the CODES Data Network discussion group.

  3. Meta-Analysis Results

Following Gelman, Carlin, Stern, and Rubin (1995), Section 5.4, our meta-analysis constructs:

A simple hierarchical model based on the normal distribution, in which observed data are normally distributed with a different mean for each [state], with known observation variance, and a normal population distribution for the [state] means.  This model is sometimes termed the one-way normal random-effects model with known data variance and is widely applicable, being an important special case of the hierarchical normal linear model...

For this model, computation of the posterior distribution of [helmet effects] is most conveniently performed via simulation [using 10,000 random draws], following the factorization [given for the joint posterior distribution of model parameters].  The first step, simulating [the population standard deviation] tau, is easily performed numerically using the [given] inverse cdf method on a grid of [100] uniformly spaced values of tau, with [the given posterior distribution of tau].  The second and third steps, simulating [the population mean] mu and then [the vector of state helmet effects] theta, can both be done easily by sampling from [given] normal distributions.

For analytical details, see Data Network Note #12 - Methodology for Meta-Analysis of Helmet Use Effects
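The three simulation steps quoted above can be sketched in plain Python as follows.  This is a sketch under stated assumptions, not the code used for this note: it uses flat priors on mu and tau, an assumed grid upper bound (tau_max) of 10,000, grid midpoints rather than endpoints, and only the imputation-based estimates from Table 4, so its output should not be expected to reproduce Tables 5 and 6 exactly.

```python
import bisect
import math
import random

# Helmet effects and standard errors as reported in Table 4.
REPORTED = {
    "Demo": (-1288, 4096), "DE": (-5754, 8986), "KY": (-3059, 5088),
    "MD": (860, 4474), "ME": (-7307, 7187), "MN": (-10375, 8406),
    "NE": (16380, 20977), "OK": (-1735, 1200), "PA": (-2498, 16077),
    "RI": (9455, 17537), "SD": (6684, 10118), "UT": (-3616, 2864),
    "WI": (-7269, 4195),
}


def mu_given_tau(y, se, tau):
    """Mean and variance of the conditional posterior of mu given tau."""
    precisions = [1.0 / (s * s + tau * tau) for s in se]
    v_mu = 1.0 / sum(precisions)
    mu_hat = v_mu * sum(p * yj for p, yj in zip(precisions, y))
    return mu_hat, v_mu


def log_post_tau(y, se, tau):
    """Log of the unnormalized marginal posterior of tau (flat prior assumed)."""
    mu_hat, v_mu = mu_given_tau(y, se, tau)
    log_p = 0.5 * math.log(v_mu)
    for yj, s in zip(y, se):
        var = s * s + tau * tau
        log_p += -0.5 * math.log(var) - (yj - mu_hat) ** 2 / (2.0 * var)
    return log_p


def simulate(y, se, n_draws=10_000, grid_size=100, tau_max=10_000.0, seed=1):
    """Return a list of (tau, mu, theta) posterior draws."""
    rng = random.Random(seed)
    step = tau_max / grid_size
    grid = [(i + 0.5) * step for i in range(grid_size)]  # midpoints avoid tau = 0
    log_p = [log_post_tau(y, se, t) for t in grid]
    top = max(log_p)
    weights = [math.exp(lp - top) for lp in log_p]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    draws = []
    for _ in range(n_draws):
        # Step 1: draw tau by the inverse cdf method on the grid.
        index = min(bisect.bisect_left(cdf, rng.random()), grid_size - 1)
        tau = grid[index]
        # Step 2: draw mu from its normal distribution given tau.
        mu_hat, v_mu = mu_given_tau(y, se, tau)
        mu = rng.gauss(mu_hat, math.sqrt(v_mu))
        # Step 3: draw each state effect from its normal distribution given mu, tau.
        theta = []
        for yj, s in zip(y, se):
            precision = 1.0 / (s * s) + 1.0 / (tau * tau)
            center = (yj / (s * s) + mu / (tau * tau)) / precision
            theta.append(rng.gauss(center, math.sqrt(1.0 / precision)))
        draws.append((tau, mu, theta))
    return draws


def quantile(values, q):
    """Simple empirical quantile of a list of draws."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


states = list(REPORTED)
y = [REPORTED[s][0] for s in states]
se = [REPORTED[s][1] for s in states]
draws = simulate(y, se)
mu_draws = [d[1] for d in draws]
print([round(quantile(mu_draws, q)) for q in (0.05, 0.25, 0.50, 0.75, 0.95)])
```

Quantiles of the tau draws and of each state's theta draws can be tabulated the same way.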

Table 5 - Estimated Quantiles for the Normal Population Distribution of Helmet Use Effects
Param 0.05 0.25 0.50 0.75 0.95
Mean -4,735 -3,332 -2,514 -1,725 -554
StdDev 29 501 1,168 2,125 4,114

Here, "Population" means all reporting Data Network states.

Figure 1 - Histogram of Simulated Values for the Mean of the Population Helmet Use Effect

Figure 2 - Histogram of Simulated Values for the Standard Deviation of the Population Helmet Use Effect

Table 6 - Estimated Quantiles for State Helmet Use Effects
State 0.05 0.25 0.50 0.75 0.95
Demo -5,329 -3,424 -2,369 -1,371 548
DE -6,269 -3,666 -2,537 -1,491 413
KY -5,921 -3,593 -2,527 -1,498 378
MD -5,046 -3,213 -2,205 -1,108 1,244
ME -6,636 -3,812 -2,637 -1,599 151
MN -6,976 -3,886 -2,677 -1,619 275
NE -5,927 -3,495 -2,385 -1,302 1,082
NV* -6,509 -3,656 -2,503 -1,427 794
OK -3,807 -2,837 -2,152 -1,428 -391
PA -6,301 -3,621 -2,463 -1,395 857
RI -5,993 -3,532 -2,420 -1,353 1,163
SC* -5,822 -3,621 -2,558 -1,559 164
SD -5,569 -3,381 -2,301 -1,205 1,496
UT -5,593 -3,676 -2,638 -1,694 -270
WI -7,111 -4,075 -2,851 -1,867 -455
*Estimate based on complete case analysis, not imputation

Figure 3 - Histogram of Simulated Values for Demo Helmet Use Effect

Figure 4 - Histogram of Simulated Values for DE Helmet Use Effect

Figure 5 - Histogram of Simulated Values for KY Helmet Use Effect

Figure 6 - Histogram of Simulated Values for MD Helmet Use Effect

Figure 7 - Histogram of Simulated Values for ME Helmet Use Effect

Figure 8 - Histogram of Simulated Values for MN Helmet Use Effect

Figure 9 - Histogram of Simulated Values for NE Helmet Use Effect

Figure 10 - Histogram of Simulated Values for NV Helmet Use Effect

Figure 11 - Histogram of Simulated Values for OK Helmet Use Effect

Figure 12 - Histogram of Simulated Values for PA Helmet Use Effect

Figure 13 - Histogram of Simulated Values for RI Helmet Use Effect

Figure 14 - Histogram of Simulated Values for SC Helmet Use Effect

Figure 15 - Histogram of Simulated Values for SD Helmet Use Effect

Figure 16 - Histogram of Simulated Values for UT Helmet Use Effect

Figure 17 - Histogram of Simulated Values for WI Helmet Use Effect

Following Gelman, Carlin, Stern, and Rubin (1995), Section 8.5, we checked the fit of the statistical model estimated by our meta-analysis by simulating 10,000 repetitions of the reported data.

Figure 18 - Histogram of Simulated Values for Minimum Reported Helmet Use Effect

Figure 19 - Histogram of Simulated Values for Maximum Reported Helmet Use Effect

Figure 20 - Histogram of Simulated Values for Mean Reported Helmet Use Effect
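The model check behind Figures 18 through 20 can be sketched as a posterior predictive comparison.  This is a minimal sketch: the function name and inputs are hypothetical, and it assumes that each replication draws new state effects from the population distribution before drawing replicated reports, which this note does not state explicitly.

```python
import random


def predictive_pvalue(mu_tau_draws, std_errors, observed, stat, seed=1):
    """Share of replicated datasets whose test statistic is at least as
    large as the statistic of the observed state reports.

    mu_tau_draws: list of (mu, tau) posterior draws
    std_errors:   reported standard error for each state
    observed:     reported helmet effect for each state
    stat:         test statistic, e.g. min, max, or statistics.mean
    """
    rng = random.Random(seed)
    observed_stat = stat(observed)
    exceed = 0
    for mu, tau in mu_tau_draws:
        replicated = []
        for se in std_errors:
            theta = rng.gauss(mu, tau)               # replicated state effect
            replicated.append(rng.gauss(theta, se))  # replicated state report
        if stat(replicated) >= observed_stat:
            exceed += 1
    return exceed / len(mu_tau_draws)


# Hypothetical posterior draws and reports, for illustration only.
example_draws = [(-2500.0, 1200.0)] * 1000
p_min = predictive_pvalue(example_draws, [4000.0, 5000.0, 1200.0],
                          [-1300.0, -3100.0, -1700.0], min)
print(p_min)
```

A value near 0.5 means the reported statistic sits in the middle of its simulated distribution; values near 0 or 1 would signal a poor fit.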

DISCUSSION

  1. For most states, estimates of helmet use effects obtained by analyzing high-probability complete cases were not the same as estimates obtained by multiple imputation of missing links and missing values.  This suggests that high-probability complete cases are not representative of the total study populations.
Table 7 - Comparison of Estimated State Helmet Use Effects
State Complete Cases Complete Effect  Imputed Cases Imputed Effect
Demo 19 -8,746 66 -1,288
DE 44 -2,952 63 -5,754
KY 53 -6,247 101 -3,059
MD 195 -236 226 860
ME 52 -8,016 51 -7,307
MN 135 -4,096 108 -10,375
NE 21 18,632 33 16,380
OK 98 -1,406 122 -1,735
PA 2,664 -5,128 3,308 -2,498
SD 65 12,242 81 6,684
UT 189 -6,394 401 -3,616
WI 316 -7,472 378 -7,269
 
  2. Meta-analysis results suggest that helmet use is protective: the central 90% posterior interval for the population-wide helmet use effect lies entirely below zero.  That is, helmet users incur lower inpatient charges, on average, although there is wide variation from case to case and state to state.  The median (50th percentile) estimate for the population-wide helmet use effect is -$2,514.  The central 90% interval for the population-wide helmet use effect is -$4,735 to -$554.

The median (50th percentile) estimate for the state-to-state standard deviation in mean helmet use effect is $1,168.  Consequently, it is likely that the state-to-state variation in true mean helmet use effect is much less than the apparent variation based on one specific reporting period (-$10,375 to $16,380).  The range of median estimates for the state effects is only -$2,152 to -$2,851.  However, only Oklahoma, Utah, and Wisconsin show statistically significant helmet protection, with 90% intervals completely below zero.
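The statements about the state intervals can be checked directly against Table 6.  The short sketch below copies the 0.05, 0.50, and 0.95 quantiles from that table; the variable names are ours.

```python
# (0.05 quantile, 0.50 quantile, 0.95 quantile) for each state, from Table 6.
TABLE_6 = {
    "Demo": (-5329, -2369, 548), "DE": (-6269, -2537, 413),
    "KY": (-5921, -2527, 378), "MD": (-5046, -2205, 1244),
    "ME": (-6636, -2637, 151), "MN": (-6976, -2677, 275),
    "NE": (-5927, -2385, 1082), "NV": (-6509, -2503, 794),
    "OK": (-3807, -2152, -391), "PA": (-6301, -2463, 857),
    "RI": (-5993, -2420, 1163), "SC": (-5822, -2558, 164),
    "SD": (-5569, -2301, 1496), "UT": (-5593, -2638, -270),
    "WI": (-7111, -2851, -455),
}

# States whose central 90% interval lies entirely below zero.
below_zero = [state for state, (q05, q50, q95) in TABLE_6.items() if q95 < 0]

# Range of the state medians.
medians = [q50 for (q05, q50, q95) in TABLE_6.values()]

print(below_zero)                  # ['OK', 'UT', 'WI']
print(max(medians), min(medians))  # -2152 -2851
```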

Based on 10,000 simulated replications of state reports, the statistical model estimated by our meta-analysis fits the data.  For all three test statistics (minimum, maximum, and mean reported state helmet use effect), actual reported values fall near the p=0.5 values of the simulated distributions.

  3. Kentucky reported finding fewer imputed links than high-probability links in preliminary tests.  This was caused by CODES 2000 tabulating the lowest weight when the same record pair was found in multiple passes.  The problem was corrected when CODES 2000 was changed to tabulate the highest weight.  The new software version was distributed to all Data Network states.

See Data Network Note # 11 - Multiple Imputation and One-to-One Links (Revised)

  4. Most states found significantly more links by using the imputation methodology.  However, only Nebraska and Utah were able to impute all of their estimated total links.  Among all other states, only Maine was able to impute over 90% of its estimated links, and some states were below 50%.  Linked datasets must be nearly complete (over 90%) for accurate analysis of study populations.
Table 8 - Comparison of Estimated Versus Actual Link Counts
State Crash Records Estimated Total Links % of Crash Actual Imputed Links % of Est.
Demo 56,689 1,200 2.1 1,060 88
DE 81,928 2,000 2.4 1,752 88
KY 307,773 2,800 0.9 2,470 88
MD 224,111 8,200 3.7 6,525 80
ME 99,712 1,812 1.8 1,654 91
MN 254,024 5,000 2.0 3,045 61
NE 83,525 1,340 1.6 1,477 110
OK 180,253 2,500 1.4 2,172 87
PA 2,533,317 112,500 4.4 67,564 60
RI 43,898 900 2.1 710 79
SC 270,463 6,500 2.4 3,820 59
SD 80,079 2,000 2.5 912 46
UT 428,165 7,186 1.7 7,537 105
WI 357,058 5,400 1.5 2,635 49

Eleven states reported estimated total links as a percent of crash records in a fairly narrow range between 1.4% and 2.5%.  Kentucky, Maryland, and Pennsylvania were outliers.
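The coverage percentages in Table 8 are simple ratios of imputed links to estimated total links.  As a sketch, they can be recomputed and screened against the 90% target as follows; the variable names are ours, and the counts are copied from Table 8.

```python
# (estimated total links, actual imputed links) for each state, from Table 8.
LINKS = {
    "Demo": (1200, 1060), "DE": (2000, 1752), "KY": (2800, 2470),
    "MD": (8200, 6525), "ME": (1812, 1654), "MN": (5000, 3045),
    "NE": (1340, 1477), "OK": (2500, 2172), "PA": (112500, 67564),
    "RI": (900, 710), "SC": (6500, 3820), "SD": (2000, 912),
    "UT": (7186, 7537), "WI": (5400, 2635),
}

# Imputed links as a whole-number percent of estimated total links.
coverage = {state: round(100 * imputed / estimated)
            for state, (estimated, imputed) in LINKS.items()}

meets_target = [state for state, pct in coverage.items() if pct >= 90]
below_half = [state for state, pct in coverage.items() if pct < 50]

print(meets_target)  # ['ME', 'NE', 'UT']
print(below_half)    # ['SD', 'WI']
```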

One reason for the shortfalls may be incomplete linkage strategies.  Not all productive match passes have been explored.  Also, not all shared information has been coded for linkage.  Finding appropriate changes to improve the data preparation and linkage strategies used in Phase I so that they produce more complete linked datasets is an open issue.

Figure 21 - Link Specifications Report for DE

Figure 22 - Link Specifications Report for KY

Figure 23 - Link Specifications Report for MD

Figure 24 - Link Specifications Report for ME

Figure 25 - Link Specifications Report for MN

Figure 26 - Link Specifications Report for OK

Figure 27 - Link Specifications Report for PA

Figure 28 - Link Specifications Report for UT

Another reason for the shortfalls may be a known weakness with the initial CODES 2000 linkage imputation algorithms presented in November.  Arizona, Maryland, and Utah found that sometimes hundreds or thousands of matched pairs were assigned to the same set because of very low probability links.  Many of these pairs were dropped by the Imputation Wizard when one-to-one matches were selected from the sets.  Most states did not find this problem.  CODES 2000 was changed to avoid the problem by assigning set numbers after linkage imputation rather than before.  The new software version was distributed to those states reporting high-count sets, but other states may have similar but less severe problems.

See Data Network Note # 11 - Multiple Imputation and One-to-One Links (Revised)

  5. Only Delaware and Maine reported adjusting their linkage probability models to account for field dependencies or comparison tolerances.  This suggests that the models used in Phase I by most states could be improved to produce more accurate probability estimates.  It also suggests that the current mechanisms in CODES 2000 for making such adjustments should be simplified or automated to encourage broader use.

Finding appropriate changes to improve the linkage probability models used in Phase I so that they produce more accurate estimates is an open issue.

  6. Utah reported sensitivity to the random number sequences producing multiple imputations for the test data.  A sensitivity analysis suggested that 10 linkage imputations and 10 value imputations would produce more stable results for these data.  However, the appropriate number of imputations for each state's data must be determined individually through a similar sensitivity analysis.

See Data Network Note # 8 - Variations in Multiple Imputation Results

  7. Kentucky reported sensitivity to unusual inpatient charges ($0, outliers).  The appropriate way to handle such charges is an open issue.

See KY Regression Excluding Charges = $0

As expected, reported hospital inpatient charges are highly skewed for all states.  Consequently, a logarithmic transformation would be appropriate in the Phase II regression analysis.

Table 9 - Hospital Inpatient Charges for Motorcycle Riders
State Min. Charges Mean Charges Max. Charges
Demo 532 14,748 64,089
DE 775 21,029 203,472
KY 43 9,125 165,391
MD 794 15,599 155,344
ME 1,016 20,256 111,976
MN 1,724 23,149 262,605
NE 1,733 23,917 140,209
OK 527 23,454 442,378
PA 16 32,198 1,150,294
SC 1,740 26,589 439,356
SD 1,489 24,214 161,989
UT 432 20,429 201,680
WI 807 22,237 284,821

Mean charges for 9 states fall in a fairly narrow range from $20,256 to $26,589.  Kentucky, Maryland, Pennsylvania, and the Demo data are outliers.
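To illustrate the logarithmic transformation suggested above, the sketch below regresses the logarithm of charges on a 0/1 helmet indicator.  It is an illustration only: the function name and the data are hypothetical, the Phase II model would include more covariates, and skipping $0 charges here is a simplification standing in for treating them as missing.

```python
import math


def fit_log_charges(charges, helmet):
    """Least-squares fit of log(charges) on a 0/1 helmet indicator.

    Records with missing or $0 charges are skipped, because the logarithm
    is undefined for them.  Returns (intercept, slope, records used).
    """
    rows = [(h, math.log(c)) for c, h in zip(charges, helmet)
            if c is not None and c > 0]
    n = len(rows)
    x_bar = sum(h for h, _ in rows) / n
    y_bar = sum(y for _, y in rows) / n
    sxx = sum((h - x_bar) ** 2 for h, _ in rows)
    sxy = sum((h - x_bar) * (y - y_bar) for h, y in rows)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope, n


# Hypothetical charges (dollars) and helmet use (1 = helmet, 0 = no helmet).
charges = [1000, 10000, 100, 1000, 0]
helmet = [0, 0, 1, 1, 1]
intercept, slope, used = fit_log_charges(charges, helmet)

# On the log scale the helmet effect is multiplicative: exp(slope) is the
# ratio of typical (geometric mean) charges, helmeted versus unhelmeted.
print(used, math.exp(slope))
```

Because the effect becomes a ratio rather than a dollar difference, state estimates on this scale are less dominated by a few very large charges.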

  8. Maryland reported sensitivity to the definition of helmet use.  Utah reported concerns about the definition of helmet use given available information.  Rhode Island reported no missing values for helmet use because their reporting system defaults to "No."  The appropriate way to define helmet use is an open issue.

See MD Alternate Helmet Use Definition

See UT Helmet Use Definition

  9. Missing data values contributed to uncertainty about the true effect of helmet use on inpatient charges.  States reported various levels of missing helmet use data ranging from 0% to 55%.
Table 10 - Missing Helmet Use Data in Linked Datasets
State Imputed MC Links (Imp. 1)  Helmet Use Missing % Missing
Demo 71 39 55
DE 62 18 29
KY 97 13 13
MD 226 30 13
ME 51 0 0
MN 108 22 20
NE 33 14 42
OK 122 12 10
PA 941 350 37
SC 266 9 3
SD 81 1 1
UT 387 181 47
WI 338 117 31

Maine, South Carolina, and South Dakota had nearly complete helmet use reporting.  Only the Demo data had over 50% missing helmet use values.  Consequently, multiple imputation and simulation algorithms are likely to perform well for most state analyses.  Schafer notes on page 137 that if "rates of missing information are moderate, say 40% or less, we may expect the simulations to proceed without much difficulty."

  10. Most states were able to use CODES 2000 to construct an adequate linkage probability model for the Phase I analysis.  Arizona and South Carolina reported difficulties completing their linkage imputation processes.  Arizona's imputed one-to-one links included several times as many matched pairs as their estimated total.  Most of the pairs were tabulated with high probabilities and assigned to the same set, even after installing the software fixes mentioned earlier.  In addition, doing linkage imputations sometimes caused the PC to crash.

South Carolina reported doing linkage imputation as a two-step process.  First, they linked only crash records with names to all hospital records.  Second, they linked only crash records without names to unlinked hospital records.  Combining and analyzing these separate linkage imputations added more complexity to the process.

  11. Maine and Rhode Island reported that using the SAS MI procedure for value imputation produced errors when there were no missing values.  South Dakota reported that examples of required text file formats would be useful in the instructions for Schafer's NORM procedure.  No other states reported value imputation issues with either SAS MI or NORM.

Rhode Island and South Carolina reported errors when using DDE and ODBC procedures for directly exchanging data between Microsoft Access and SAS.  They had to resort to creating ancillary files in order to transfer data between these systems.

RECOMMENDATIONS

  1. Upgrade to CODES 2000 Version 2.2.350 from a new distribution CD.  This will provide all states with the latest linkage imputation algorithms and other enhancements that have been developed to address reported problems.
  2. Revise link join specifications to obtain at least 90% coverage of estimated total links.  This should reduce sensitivity to any remaining missing links.
  3. Revise tables and match specifications to add new match fields so that all available information is used for matching.  For example:  injury date, crash type, vehicle type, driver flag, injured flag, fatality flag, etc.  This should improve the accuracy of linkage probability models as well as increase the number of links found.
  4. Revise match specifications to reflect important field reliabilities, field dependencies, and comparison tolerances.  This should improve the accuracy of linkage probability models.
  5. Revise imputation methodology to do 10 linkage imputations and 10 missing value imputations.  This should reduce sensitivity to the random number sequences used for imputation.  Conduct sensitivity analysis of random number sequences in selected states.
  6. Treat $0 charges as missing values.  Otherwise, accept all reported charges.  For high outlier charges, identify specific procedures that contribute most to the total charges.  Conduct sensitivity analysis of outlier charges in selected states.
  7. Revise regression models to use rider age, rider sex, and logarithm of inpatient charges.  The regression model used for Phase I was intentionally simplistic.  In addition, transformed charges should have closer to normal distributions.
  8. Revise imputation models to use rider sex.  Imputation models should be at least as complex as regression models.
  9. Revise the meta-analysis to combine regressions on logarithm of inpatient charges.  Transformed charges should have closer to normal distributions.
  10. Conduct sensitivity analyses of helmet use definitions in selected states.  We cannot correct poor helmet use information, but at least we can describe how it affects analysis results. 
  11. Revise any non-standard approach to linkage or imputation that is not consistent with the recommended approaches.  
 
 
© Copyright 2000 - 2008 Strategic Matching, Inc. All rights reserved. Microsoft, Windows, and Access are trademarks of Microsoft Corporation. Last modified: Monday January 28, 2008.