CODES 2000 User Forum -- Data Network Note #9

Using Schafer’s Stand-Alone NORM Software 

Applies to: NORM 2.03.
Last updated: Thursday January 03, 2002.

SUMMARY

You should analyze any incomplete CODES linked dataset by multiply imputing missing links and missing data values, analyzing each of the resulting complete CODES linked datasets, and then combining all separate parameter estimates to reflect the uncertainty due to having an incomplete dataset. You can impute missing links using CODES 2000 and the new Linkage Imputation Wizard. If you have access to SAS/STAT*, you can impute missing data values with MI, conduct regression analysis with REG, and combine parameter estimates with MIANALYZE. If you do not have SAS, you can impute missing data values with Schafer's NORM software, conduct regression analysis with Microsoft Excel Data Analysis, and combine parameter estimates with NORM. The procedures presented cover the latter approach.

*SAS and SAS/STAT are trademarks of SAS Institute, Inc.

PROCEDURES

  1. Download Software and Test Data.

Schafer’s Stand-Alone NORM Program – Download NORM from Schafer’s website free of charge: www.stat.psu.edu/~jls/misoftwa.html.  Scroll down to Stand-alone packages for Windows 95/98/NT. Click on Download NORM Version 2.03 for Windows. Among the downloaded files is a help file, NORM.HLP.

CODES 2000 Imputation Wizard – Follow instructions in Data Network Note #2 -- Multiple Imputation and One-to-One Links. Save downloaded file to desired folder.

Test Data – Follow instructions in Data Network Note #3 -- Test Data for Multiple Imputation to download testdata.mdb. Save downloaded file to desired folder.

Note that you can save time later by saving all downloaded files in one folder.

  2. Prepare Data for Multiple Imputation.

Follow the instructions in Data Network Note #3 -- Test Data for Multiple Imputation through to the end. Note that you must create five tables: Crash-linkpairs1-hospital, Crash-linkpairs2-hospital, Crash-linkpairs3-hospital, Crash-linkpairs4-hospital, and Crash-linkpairs5-hospital.

For each join query created to link Crash to Inpatient data you must:

Convert or remove non-numeric test data.

Denote all missing values by a single numeric code, such as -9, -99, or 10000, not by blank spaces, periods, or other non-numeric characters.

Export the data as a text file named as *.dat.

Note the following information when preparing CODES 2000 data for use in NORM:

Limitations. There is essentially no limit to the number of variables or the number of cases in NORM. Very large data sets are not a problem, provided that your computer has enough memory (RAM) to process the data. The only firm limit is that each line of the data file must be less than 2000 characters long, including spaces.

Non-numeric data. Non-numeric or character data are not allowed in the data file. If your data set contains non-numeric variables, you should either (a) convert them to numeric codes or (b) remove them from the data file before using NORM.

Number format. The numbers in the *.dat file may be integers, decimals, or in exponential format (such as 1.56E-02). Embedded commas (as in 10,042) are not allowed.
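The rules above can be checked mechanically before a file is handed to NORM. The following Python sketch shows one hypothetical way to do so; it is not part of NORM or CODES 2000. The function names, the space delimiter, and the choice of -9 as the missing value code are assumptions made for illustration.

```python
import math

MISSING_CODE = -9        # assumed numeric missing-value code; use whatever you tell NORM
MAX_LINE_LENGTH = 2000   # each line must be shorter than this, including spaces


def format_row(values):
    """Return one space-delimited line of numbers for a *.dat file.

    None and non-finite floats are written as MISSING_CODE.
    Non-numeric values are rejected, as are real values that
    collide with the missing-value code.
    """
    fields = []
    for value in values:
        if value is None:
            fields.append(str(MISSING_CODE))
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric value: {value!r}")
        if isinstance(value, float) and not math.isfinite(value):
            fields.append(str(MISSING_CODE))
            continue
        if value == MISSING_CODE:
            raise ValueError("observed value equals the missing-value code")
        # repr() never inserts thousands separators, so no embedded commas.
        fields.append(repr(value))
    line = " ".join(fields)
    if len(line) >= MAX_LINE_LENGTH:
        raise ValueError("line is 2000 characters or longer")
    return line


def write_dat(rows, path):
    """Write all rows to a text file, one case per line."""
    with open(path, "w", encoding="ascii") as handle:
        for row in rows:
            handle.write(format_row(row) + "\n")
```

For example, format_row([1, None, 2.5]) returns the line "1 -9 2.5", while a row containing a character value such as "M" raises an error so that it can be recoded or dropped first.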

  3. Run NORM.EXE.

Navigate to the folder where you downloaded the NORM software.  Double click on the NORM application (not the NORM203 application).

The following steps are outlined from Schafer’s NORM help file. You may access the help file at any time. Schafer’s help file includes more detailed information when you click on an underlined topic.

  4. Open a NORM Session.

After starting the NORM program, you may either begin a new session or open an existing session.

New Session. Select "New" from the File menu. You will be prompted for the name of the file that contains your data (presumably named *.dat).

  5. Import CODES 2000 Data into NORM.

NORM reads data from any specified text data file. The data file is displayed in a small window on the Data file tab in your NORM session. As soon as the data file is displayed, you should make sure that the numeric missing value code is correct.

The variables in your NORM data set are managed by the variables grid. You may provide names for your variables by typing them into NORM after your data file has been read. Alternatively, you may provide them through a variable names file, allowing NORM to read them automatically when your data file is read. On this grid you may do the following:

Edit variable names. Select a variable, double click or press Enter to edit the name.

Apply Transformations. When variables are not normally distributed, it often helps to apply transformations before imputing. If a variable is right-skewed, for example, it may be sensible to impute the variable on a square root or log scale, and then transform it back to the original scale. NORM can perform these functions automatically.

For variables with limited ranges, NORM suggests using logit transformations so that imputed values stay within the same minimum and maximum as the observed data. To set a transformation, double click on "none" for the variable.

Double click on "none" corresponding to the variable "charges". Select power transformation and then select log for log transformation. Power transformations are useful for correcting skewness.
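NORM applies these transformations and their inverses internally. The Python sketch below only illustrates the idea of imputing on a transformed scale and transforming back; the function names and the bounds used are hypothetical and are not taken from NORM.

```python
import math


def log_transform(x):
    """Move a positive, right-skewed value (such as charges) to the log scale."""
    return math.log(x)


def log_back(z):
    """Return a log-scale value to the original scale; the result is always positive."""
    return math.exp(z)


def logit_transform(x, low, high):
    """Map a value strictly between low and high onto the whole real line."""
    p = (x - low) / (high - low)
    return math.log(p / (1.0 - p))


def logit_back(z, low, high):
    """Map any real value back into the interval between low and high."""
    return low + (high - low) / (1.0 + math.exp(-z))
```

A value imputed on the log scale can never come back negative, and a value imputed on the logit scale always falls between the chosen bounds, which is the reason for transforming before imputing.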

Apply Rounding. For each variable, double click on its rounding setting, "integer", and select "to nearest observed value".

Using this option, each imputed value will be rounded to the nearest observed value for that variable. This option is very useful for imputing binary and ordinal variables. For example, suppose a variable takes values 1, 2, 3, 4, and 5. Rounding to the nearest integer could occasionally produce imputed values of 0 or 6. Rounding to the nearest observed value, however, will ensure that imputed values are 1, 2, 3, 4, or 5.
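The difference between the two rounding rules can be seen in a few lines of Python. This is an illustrative sketch, not NORM's own code; the function name is hypothetical, and breaking ties toward the smaller value is an assumption.

```python
def round_to_nearest_observed(value, observed):
    """Return the observed value closest to an imputed value.

    Ties are broken in favor of the smaller observed value,
    because min() keeps the first of several equally close candidates.
    """
    candidates = sorted(set(observed))
    return min(candidates, key=lambda candidate: abs(candidate - value))


observed_values = [1, 2, 3, 4, 5]

# Rounding to the nearest integer can leave the observed range ...
outside_high = round(5.7)   # 6
outside_low = round(0.2)    # 0

# ... while rounding to the nearest observed value cannot.
inside_high = round_to_nearest_observed(5.7, observed_values)   # 5
inside_low = round_to_nearest_observed(0.2, observed_values)    # 1
```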

  6. Summarize your data.

Important features of any one variable can be seen by tabulating and plotting the variable from the variables grid. But NORM can also report important features of all variables at once, including means, standard deviations, rates and patterns of missingness.

To produce a summary, go to the summarize sheet by clicking on the "Summarize" tab. Select an appropriate name for the file where the output is to be stored and press the "Run" button. The file will be created and displayed in a small window. This summary contains information on all the variables currently in the model.
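If you want to cross-check the summary outside NORM, the same basic quantities can be computed directly. The Python sketch below is a hypothetical helper, not NORM's report format; it assumes the data are already in memory as rows of numbers and that -9 is the missing value code.

```python
import statistics


def summarize(rows, names, missing_code=-9):
    """Return the observed count, percent missing, mean, and SD of each variable.

    rows  -- list of cases, each a list of numbers in variable order
    names -- variable names, in the same order as the columns
    """
    summary = {}
    for index, name in enumerate(names):
        column = [row[index] for row in rows]
        observed = [value for value in column if value != missing_code]
        summary[name] = {
            "n_observed": len(observed),
            "pct_missing": 100.0 * (len(column) - len(observed)) / len(column),
            "mean": statistics.mean(observed) if observed else None,
            # The sample standard deviation needs at least two observed values.
            "sd": statistics.stdev(observed) if len(observed) > 1 else None,
        }
    return summary
```

The means and standard deviations here use only the observed values of each variable, so they describe the incomplete data and are not a substitute for the EM estimates in the next step.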

  7. Run the Expectation Maximization (EM) algorithm.

The EM algorithm in NORM calculates maximum likelihood estimates of means, variances and covariances using all of the cases in your dataset, including those that are partially missing. Before using NORM to impute missing data values, it’s almost always a good idea to run EM first. Running EM first will provide a good estimate of starting values for all model parameters.

To run EM, go to the EM sheet by clicking on the "EM algorithm" tab in your NORM session. Then click on the "Run" button.

Any run of the EM algorithm will create two files: an output (*.out) file reporting the results of EM, and a parameter (*.prm) file where the resulting parameter estimates are stored. When EM is finished running, the output file is automatically displayed but the parameter file is not.

  8. Run the Data Augmentation (DA) algorithm.

The data augmentation (DA) algorithm in NORM simulates random values of parameters and missing data values from their posterior distribution. It is the method by which NORM creates proper multiple imputations for the missing data values.

Before running DA, it’s a good idea to run EM first. Running EM first will provide a good estimate of starting values for all model parameters (see Step 7).

To run DA, go to the DA sheet by clicking on the "Data augmentation" tab in your NORM session. Then click on the "Run" button.

Any run of DA will create two files: an output (*.out) file reporting the results of DA, and a parameter (*.prm) file where the final simulated values of the parameters are stored. When DA is finished running, the output file is automatically displayed but the parameter file is not.

The number of DA cycles and various other computing options may be set via the "Computing…" button. Imputation options may be set via the "Imputation…" button.

DA Computing Options.  Number of iterations determines the number of cycles of data augmentation to be performed. Change the number of iterations to 5000.

Imputation Options.  These options determine how imputed datasets (*.imp files) are generated and stored. Select the option "Impute at every kth iteration". This option saves the imputations from every kth cycle of data augmentation. By setting k large enough to ensure convergence and independence, you can produce any number of proper multiple imputations. Click on "Once at every kth iteration", set k=1000 and click "Run".

  9. Create multiple imputations.

In NORM, proper multiple imputations are created through data augmentation. Running data augmentation for k iterations, where k is large enough to guarantee convergence and independence, produces random draws of parameters from their posterior distribution. Imputing missing data values under these random parameter values results in one imputation. Repeating the whole process m times produces m proper multiple imputations.

First guess a value for k (1000 in this example). A reasonable guess can be obtained by running EM first and setting k equal to or greater than the number of iterations EM needed to converge. Then run data augmentation for a total of N = m x k iterations (cycles), here 5 x 1000 = 5000, producing an imputation at every kth cycle and therefore m (5) imputations.
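The arithmetic behind these settings is small enough to check by hand, but a short Python sketch makes the relationship between m, k, and the total run length explicit. The function name is hypothetical and the code does not interact with NORM.

```python
def imputation_schedule(m, k):
    """Return the total number of DA cycles and the cycles at which imputations are saved.

    m -- number of imputations wanted
    k -- number of cycles between imputations
    """
    total_cycles = m * k
    save_at = [k * index for index in range(1, m + 1)]
    return total_cycles, save_at


total_cycles, save_at = imputation_schedule(m=5, k=1000)
# total_cycles is 5000; save_at is [1000, 2000, 3000, 4000, 5000]
```

If EM needed more than 1000 iterations to converge, raising k and keeping m fixed lengthens the run in proportion.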

  10. Save your NORM session.

You may save your NORM session at any time by selecting "Save" or "Save As" from the File menu. The session is saved to a file called *.nrm.

  11. Prepare imputed values for analysis.

Assuming 5 imputed data files were created for each of the 5 imputed link data sets (see Step 8), you have a total of 25 imputed data sets. For each link data set, import the m (5) imputed data files, *_1.imp, *_2.imp ... *_5.imp, into MS Excel spreadsheets for analysis. This process creates 25 spreadsheets.

  12. Analyze 25 imputed link datasets.

In MS Excel, go to "Tools" and select "Data Analysis". Select "Regression" and enter the Y variable and X variable ranges (in our example, charges = Y and safety = X). Repeat the analysis for each of the 25 imputed data sets.

Below are the results from one of the 25 imputed data sets:

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept    17404.23      2254.236        7.720678  5.97E-11  12908.3    21900.15   12908.3      21900.15
X Variable   -7195.98      3614.819        -1.99069  0.05042   -14405.5   13.54592   -14405.5     13.54592

On a separate spreadsheet, copy the estimated regression coefficient and standard error for both the intercept and the X variable. Repeating this for each of the 25 data sets produces a spreadsheet with 50 regression coefficient estimates (25 intercepts and 25 X variable coefficients) and their corresponding standard errors. Save this file as a text file (*.txt).
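The coefficients and standard errors copied from Excel can be spot-checked independently. The Python sketch below computes an ordinary least squares fit with one X variable using the standard textbook formulas; it is an illustration with a hypothetical function name and made-up numbers, not a reproduction of Excel's Data Analysis tool or of the test data.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x and return estimates with standard errors."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    residual_variance = sse / (n - 2)
    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": math.sqrt(residual_variance * (1.0 / n + mean_x ** 2 / sxx)),
        "se_slope": math.sqrt(residual_variance / sxx),
    }


# Made-up values: x is a 0/1 indicator, y is an outcome.
fit = simple_ols([0, 0, 1, 1], [10, 14, 4, 8])
```

Dividing each coefficient by its standard error gives the t statistic that Excel reports in the "t Stat" column.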

  13. Combine separate regression parameters into a single estimate.

Using NORM, run MI Inference. Go to Analyze and select "MI Inference: Scalar". Select the text file that contains the estimates and standard errors. Select "Stacked column", set Number of estimands to 2 and Number of imputations to 25 (here m counts all 25 imputed data sets), and click "Run". The output gives the combined estimate of each regression parameter and the corresponding standard error, which takes into account the uncertainty due to imputation of missing links and missing values.
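The combination step follows the standard rules for scalar estimands from the multiple imputation literature (Rubin's rules). The Python sketch below shows those rules for one parameter, assuming NORM applies them in the usual form; the function name is hypothetical and NORM's output may report additional quantities.

```python
import math


def combine_scalar(estimates, std_errors):
    """Combine m estimates of one parameter and their standard errors.

    Returns the combined estimate, its standard error, and the
    degrees of freedom for the t reference distribution.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    combined = sum(estimates) / m
    # Within-imputation variance: average of the squared standard errors.
    within = sum(se ** 2 for se in std_errors) / m
    # Between-imputation variance: sample variance of the estimates.
    between = sum((q - combined) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    if between > 0:
        df = (m - 1) * (1.0 + within / ((1.0 + 1.0 / m) * between)) ** 2
    else:
        df = math.inf   # no variation between imputations
    return combined, math.sqrt(total), df


# Made-up numbers for three imputations of one parameter.
estimate, std_error, df = combine_scalar([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what makes the combined standard error larger than any single-data-set standard error whenever the estimates differ across imputations.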

 
 
© Copyright 2000 - 2008 Strategic Matching, Inc. All rights reserved. Microsoft, Windows, and Access are trademarks of Microsoft Corporation. Last modified: Monday January 28, 2008.