CODES 2000 User Forum -- Data Network Note #9
Using Schafer’s Stand-Alone NORM Software

Applies to: NORM 2.03.
Last updated: Thursday January 03, 2002.

SUMMARY
You should analyze any incomplete CODES linked dataset by multiply imputing
missing links and missing data values, analyzing each of the resulting complete
CODES linked datasets, and then combining all separate parameter estimates to
reflect the uncertainty due to having an incomplete dataset. You can impute
missing links using CODES 2000 and the new Linkage Imputation Wizard. If you
have access to SAS/STAT*, you can impute missing data values with MI, conduct
regression analysis with REG, and combine parameter estimates with MIANALYZE. If
you do not have SAS, you can impute missing data values with Schafer's NORM
software, conduct regression analysis with Microsoft Excel Data Analysis, and
combine parameter estimates with NORM. The procedures presented cover the latter
approach.
*SAS and SAS/STAT are trademarks of SAS Institute, Inc.

PROCEDURES
- Download Software and Test Data.
Schafer’s Stand-Alone NORM Program – Download NORM from Schafer’s website
free of charge: www.stat.psu.edu/~jls/misoftwa.html. Scroll down to
"Stand-alone packages for Windows 95/98/NT" and click on "Download NORM
Version 2.03 for Windows". Among the downloaded files is a help file,
NORM.HLP.
CODES 2000 Imputation Wizard – Follow the instructions in Data Network Note
#2 -- Multiple Imputation and One-to-One Links. Save the downloaded file to
the desired folder.
Test Data – Follow the instructions in Data Network Note #3 -- Test Data for
Multiple Imputation to download testdata.mdb. Save the downloaded file to the
desired folder.
Note that you can save time later by downloading all of the software to one
location.
- Prepare Data for Multiple Imputation.
Follow the instructions in Data Network Note #3 -- Test Data for Multiple
Imputation to the end. Note that you must create five tables:
Crash-linkpairs1-hospital, Crash-linkpairs2-hospital,
Crash-linkpairs3-hospital, Crash-linkpairs4-hospital, and
Crash-linkpairs5-hospital.
For each join query created to link Crash to Inpatient data, you must:
Convert or remove non-numeric data.
Denote all missing values by a single numeric code, such as -9, -99, or
10000, not by blank spaces, periods, or other non-numeric characters.
Export the data as a text file named *.dat.
Note the following information when preparing CODES 2000 data for use in
NORM:
Limitations. There is essentially no limit to the number of variables or the
number of cases in NORM. Very large data sets are not a problem, provided that
your computer has enough memory (RAM) to process the data. The only firm limit
is that each line of the data file must be less than 2000 characters long,
including spaces.
Non-numeric data. Non-numeric or character data are not allowed in the data
file. If your data set contains non-numeric variables, you should either (a)
convert them to numeric codes or (b) remove them from the data file before
using NORM.
Number format. The numbers in the *.dat file may be integers,
decimals, or in exponential format (such as 1.56E-02). Embedded commas
(as in 10,042) are not allowed.
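If you would rather script these checks than do them by hand, the rules above can be sketched in Python. This is only an illustration, not part of NORM or CODES 2000: the function names are hypothetical, and it assumes that values on a line are separated by single spaces and that -9 never occurs as a real data value.

```python
import math

MISSING_CODE = -9        # numeric missing-value code (the note suggests -9, -99, or 10000)
MAX_LINE_LENGTH = 2000   # each line must be shorter than this, spaces included


def to_norm_value(value, missing_code=MISSING_CODE):
    """Return one field as purely numeric text (hypothetical helper)."""
    text = "" if value is None else str(value).strip()
    if text in ("", "."):
        # Blanks and periods are not allowed, so substitute the numeric code.
        return str(missing_code)
    # Drop embedded commas (10,042 -> 10042); float() raises ValueError
    # for character data such as "male".
    number = float(text.replace(",", ""))
    if not math.isfinite(number):
        raise ValueError(f"not a finite number: {value!r}")
    return str(int(number)) if number.is_integer() else repr(number)


def format_row(values, missing_code=MISSING_CODE):
    """Join one record into a single line and enforce the length limit."""
    line = " ".join(to_norm_value(v, missing_code) for v in values)
    if len(line) >= MAX_LINE_LENGTH:
        raise ValueError("line must be less than 2000 characters long")
    return line


def write_dat(rows, path, missing_code=MISSING_CODE):
    """Write all records to a plain text file, for example mydata.dat."""
    with open(path, "w", encoding="ascii") as handle:
        for row in rows:
            handle.write(format_row(row, missing_code) + "\n")
```

A record containing character data raises an error instead of being written, so you can decide whether to recode that variable or remove it.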
- Run NORM.EXE.
Navigate to the folder where you downloaded the NORM software. Double
click on the NORM application (not the NORM203 application).
The following steps are outlined from Schafer’s NORM help file. You may
access the help file at any time. Schafer’s help file includes more detailed
information when you click on an underlined topic.
- Open a NORM Session.
After starting the NORM program, you may either begin a new session or open
an existing session.
New Session. Select "New" from the File menu. You will be
prompted for the name of the file that contains your data (presumably named
*.dat).
- Import CODES 2000 Data into NORM.
NORM reads data from any specified text data file. The data
file is displayed in a small window on the Data file tab in your NORM session. As soon as
the data file is displayed, you should make sure that the numeric
missing value code is correct.

The variables in your NORM data set are managed by the variables grid. You may
provide names for your variables by typing them into NORM after your data file
has been read. Alternatively, you may provide them through a variable names
file, allowing NORM to read them automatically when your data file is read. On
this grid you may do the following:
Edit variable names. Select a variable, then double click or press Enter to
edit the name.

Apply Transformations. When variables are not normally distributed, it often helps to apply
transformations before imputing. If a variable is right-skewed, for example,
it may be sensible to impute the variable on a square root or log scale, and
then transform it back to the original scale. NORM can perform these
functions automatically.
For variables with limited ranges, NORM suggests using logit transformations
to ensure that imputed values fall within the same minimum and maximum values.

To transform the variable "charges", double click on "none" in its row. Select
power transformation and then select log for log transformation. Power
transformations are useful for correcting skewness.
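
To see why a log transformation helps with a right-skewed variable such as charges, the following Python sketch compares skewness before and after the transformation and then transforms back to the original scale. The charge values are made up for illustration, and the skewness helper is a simple moment-based formula, not NORM's own code.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up hospital charges with a long right tail (illustration only).
charges = [500, 800, 1200, 2000, 3500, 9000, 40000]

# Imputation would be carried out on the log scale ...
log_charges = [math.log(c) for c in charges]

# ... and the values transformed back to the original scale afterwards.
back_transformed = [math.exp(v) for v in log_charges]

print("skewness on original scale:", round(skewness(charges), 2))
print("skewness on log scale:     ", round(skewness(log_charges), 2))
```

The log-scale values are much closer to symmetric, which is the situation the normal model used for imputation handles best.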

Apply Rounding. For each variable, double click on "integer" to change its
rounding. Select "to nearest observed value".
Using this option, each imputed value will be rounded to the nearest
observed value for that variable. This option is very useful for imputing
binary and ordinal variables. For example, suppose a variable takes values
1, 2, 3, 4, and 5. Rounding to the nearest integer could occasionally
produce imputed values of 0 or 6. Rounding to the nearest observed value,
however, will ensure that imputed values are 1, 2, 3, 4, or 5.
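
The difference between the two rounding rules can be shown with a short Python sketch. The function name is hypothetical, and the tie-breaking rule (the lower value wins) is an assumption of this sketch; NORM's help file should be consulted for how NORM itself treats ties.

```python
def round_to_nearest_observed(imputed, observed):
    """Return the observed value closest to the imputed value.

    Ties go to the lower observed value (an assumption of this sketch).
    """
    candidates = sorted(set(observed))
    return min(candidates, key=lambda v: abs(v - imputed))


observed_values = [1, 2, 3, 4, 5]   # an ordinal variable coded 1 to 5

for imputed in (0.2, 2.6, 5.7):
    print(imputed,
          "nearest integer:", round(imputed),
          "nearest observed:", round_to_nearest_observed(imputed, observed_values))
```

For the imputed values 0.2 and 5.7, rounding to the nearest integer gives the out-of-range codes 0 and 6, while rounding to the nearest observed value gives 1 and 5.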

- Summarize your data.
Important features of any one variable can be seen by tabulating and plotting
the variable from the variables grid. But NORM can also report important
features of all variables at once, including means, standard deviations, rates
and patterns of missingness.
To produce a summary, go to the summarize sheet by clicking on the
"Summarize" tab. Select an appropriate name for the file where the
output is to be stored and press the "Run" button. The file will be
created and displayed in a small window. This summary contains information on
all the variables currently in the model.
- Run the Expectation Maximization (EM) algorithm.
The EM algorithm in NORM calculates maximum likelihood estimates of means, variances and covariances using all
of the cases in your dataset, including those that are partially missing.
Before using NORM to impute missing data values, it’s almost always a good idea
to run EM first. Running EM first will
provide a good estimate of starting values for all model parameters.
To run EM, go to the EM sheet by clicking on the "EM algorithm" tab
in your NORM session. Then click on the "Run" button.

Any run of the EM algorithm will create two files: an output (*.out) file
reporting the results of EM, and a parameter (*.prm) file where the resulting
parameter estimates are stored. When EM is finished running, the output file is
automatically displayed but the parameter file is not.
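
For readers who want to see what EM does, here is a deliberately small Python sketch for two variables, where x is fully observed and y is missing for some cases. It is a toy illustration only, not NORM's implementation (NORM handles any number of variables and any pattern of missingness), and the function name and the -9 missing code are assumptions.

```python
def em_bivariate(x, y, missing=-9, tol=1e-8, max_iter=1000):
    """EM for a bivariate normal with x complete and y partly missing."""
    n = len(x)
    observed_y = [v for v in y if v != missing]
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Starting values: use the observed y values and assume no covariance.
    mu_y = sum(observed_y) / len(observed_y)
    s_yy = sum((v - mu_y) ** 2 for v in observed_y) / len(observed_y)
    s_xy = 0.0
    for iteration in range(1, max_iter + 1):
        # E-step: expected y and y squared for each case, given x.
        beta = s_xy / s_xx
        residual_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi == missing:
                ey = mu_y + beta * (xi - mu_x)
                eyy = ey * ey + residual_var
            else:
                ey, eyy = yi, yi * yi
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M-step: re-estimate the parameters from the expected totals.
        new_mu_y = sum_y / n
        new_s_yy = sum_yy / n - new_mu_y ** 2
        new_s_xy = sum_xy / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy),
                     abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:
            break
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_yy": s_yy,
            "s_xy": s_xy, "iterations": iteration}


# Made-up data: y is missing (coded -9) for the last three cases.
estimates = em_bivariate([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, -9, -9, -9])
print(estimates)
```

The iteration count reported by a run like this is the kind of number you can use later as a starting guess for k.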
- Run the Data Augmentation (DA) algorithm.
The data augmentation (DA) algorithm in NORM simulates random values of
parameters and missing data values from their posterior distribution. It is the
method by which NORM creates proper multiple imputations for the missing data
values.
Before running DA, it’s a good idea to run EM (see Step 7), which provides
good starting values for all model parameters.
To run DA, go to the DA sheet by clicking on the "Data
augmentation" tab in your NORM session. Then click on the "Run"
button.

Any run of DA will create two files: an output (*.out) file reporting the
results of DA, and a parameter (*.prm) file where the final simulated values of
the parameters are stored. When DA is finished running, the output file is
automatically displayed but the parameter file is not.
The number of DA cycles and various other computing options may be set via
the "Computing…" button. Imputation options may be set via the
"Imputation…" button.
DA Computing Options. The number of iterations determines the number of cycles
of data augmentation to be performed. Change the number of iterations to 5000.
Imputation Options. These options determine how imputed datasets (*.imp files)
are generated and stored. Select the option "Impute at every kth iteration",
which saves the imputations from every kth cycle of data augmentation. By
setting k large enough to ensure convergence and independence, you can produce
any number of proper multiple imputations. Set k=1000 and click "Run".
- Create multiple imputations.
In NORM, proper multiple imputations are created through data
augmentation. Running data augmentation for k iterations, where k is
large enough to guarantee convergence and independence, produces random draws of parameters from
their posterior distribution. Imputing missing data values under these random
parameter values results in one imputation. Repeating the whole process m times
produces m proper multiple imputations.
First guess a value for k (here, 1000). A reasonable guess can be obtained
by running EM first and setting k equal to or greater than the number of
iterations needed for EM to converge. Then run data augmentation for a total of
N = mk iterations or cycles (here, 5 x 1000 = 5000), producing an imputation at
every kth cycle and thus m (here, 5) imputations.
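The arithmetic behind these settings can be checked with a few lines of Python. The function name is hypothetical; the values m = 5 and k = 1000 are the ones used in this note.

```python
def imputation_schedule(m, k):
    """Return the DA cycles at which an imputed data set is saved."""
    return [k * i for i in range(1, m + 1)]


m = 5       # number of imputations wanted
k = 1000    # cycles between imputations (at least the EM iteration count)

print("total DA iterations N = m x k:", m * k)
print("imputations saved at cycles:  ", imputation_schedule(m, k))
```

With these values, the total matches the 5000 iterations entered under the DA computing options, and an imputed data set is saved at cycles 1000, 2000, 3000, 4000, and 5000.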
- Save your NORM session.
You may save your NORM session at any time by selecting "Save" or
"Save As" from the File menu. The session is saved to a file called *.nrm.
- Prepare imputed values for analysis.
Assuming 5 imputed data files were created for each of the 5 imputed link data
sets, you will have a total of 25 imputed data sets (see Step 8). For each
imputed link data set, import the m (5) imputed data files, *_1.imp, *_2.imp
... *_5.imp, into MS Excel spreadsheets to perform data analysis. This process
creates 25 spreadsheets.
- Analyze 25 imputed link datasets.
In MS Excel, go to "Tools" and select "Data Analysis".
Select "regression" and input Y variable and X variable ranges (in our
example, charges = Y and safety = X). Perform the data analysis for each of the
25 imputed data sets.
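
If you want to cross-check the spreadsheet output, a regression with one X variable can be sketched in plain Python. The function below is a hypothetical helper that uses the standard least-squares formulas for the slope, intercept, and standard error of the slope, and the data values are made up; it is not a replacement for the Excel procedure described here.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for a single X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two estimated terms).
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(sse / (n - 2) / s_xx)
    return {"slope": slope, "intercept": intercept, "slope_se": slope_se}


# Made-up example: safety = 0 or 1, charges in thousands of dollars.
safety = [0, 0, 1, 1]
charges = [10, 12, 6, 8]
print(simple_regression(safety, charges))
```

The slope and its standard error from each of the 25 data sets are the quantities that are later combined into a single estimate.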
Below are the results from one of the 25 imputed data sets: