Data Screening and Missing Value Analysis
James H. Steiger
Theorem (The Fundamental Theorem of Modern Data Analysis). Storage media are cheap. Data are expensive.
Corollary 1. In some cases, the data are priceless.
Corollary 2. You will suffer a major data loss event. There are many ways this can happen. The only questions are (a) when, and (b) whether the data will be restorable from a backup.
Don’t Lose Data • Own a CD-DVD burner. • Have a backup plan. Back up all your research-related manuscripts and data files every week. • Every time you work on a data file, start by saving it under a new name, according to a precise code, like this: PROJECT5_2006_01_30_A.SAV
• After several weeks, you will have a succession of data files that will allow you to restore to any earlier time. This can be useful in case one of your “data analysis” sessions is discovered, after the fact, to be a “data destruction” session!
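The dated-naming scheme above can be automated. A minimal Python sketch, assuming a working file named PROJECT5.SAV (the filename is hypothetical; point `src` at your own data file):

```python
# Sketch of the dated-backup habit. PROJECT5.SAV is a hypothetical name;
# substitute your own working data file.
import shutil
from datetime import date

src = "PROJECT5.SAV"
open(src, "a").close()                      # stand-in: ensure a file exists for this demo
stamp = date.today().strftime("%Y_%m_%d")   # e.g. 2006_01_30
dst = f"PROJECT5_{stamp}_A.SAV"
shutil.copyfile(src, dst)                   # save under a new, dated name
```

Run once per session (bumping the trailing letter for multiple sessions in a day) and you accumulate the restorable succession of files described above.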
Don’t Throw Away Data • Never dichotomize or categorize continuous data. You are simply throwing away information and increasing error variance. • Use missing data imputation to replace missing data, rather than deleting data casewise. • As data become available, screen them immediately for univariate and multivariate outliers and missing values. You may be able to recover some lost information and correct recording errors.
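The cost of dichotomizing can be seen in a small simulation (all data here are made up): a median split on a continuous predictor attenuates its correlation with the outcome — classical theory puts the attenuation near a factor of 0.8 for a median split of a normal variable.

```python
# Simulation (hypothetical data) showing that dichotomizing a continuous
# predictor throws away information: the correlation with the outcome
# drops after a median split.
import random
import statistics

random.seed(1)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]      # outcome linearly related to x

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

med = statistics.median(x)
x_dich = [1 if xi > med else 0 for xi in x]    # median split discards information

r_full = corr(x, y)       # near the true correlation, about .71 here
r_dich = corr(x_dich, y)  # noticeably smaller after dichotomization
```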
Plan Your Data Files Carefully • Add descriptive information to the variable names • Use long variable names • Remember that multivariate data are indeed multivariate! Never destroy the multivariate structure by splitting the file unless you make very careful use of unique ID values so that you can reconstruct the complete file.
Document All Analysis Activities • You will be amazed how quickly you will forget what you did during a complex data analysis. • Yet, what you did determines the appropriate probability model for analyzing your data! (Not all textbook authors recognize this, and we will comment extensively on this later in the course.) • Analysis documentation can be done in SPSS by logging commands and by placing text fields with notes about the analysis in the SPSS output files you create. • There is an overwhelming tendency for most people to put insufficient detail in such comments. Resist this tendency. You will almost always save time in the long run by putting extremely careful (“overly detailed”) notes in your data analyses.
• Advanced, modern systems (like R combined with Sweave) allow you to produce one file that contains a narrative about your data analysis, along with all the commands that produce your analysis! If you discover an error, you can redo months of work in seconds by simply rerunning the file.
Dealing with Missing Data Many basic multivariate procedures assume complete data. But many data sets have some missing values. What to do? One approach is to delete cases with missing data, and perform the analysis on the reduced data set. Another approach is to try to replace missing data to “fill out” the data set.
Casewise Deletion Often the default procedure is “casewise deletion.” Casewise deletion simply discards any case that has missing data on any of the variables in the analysis. Casewise deletion can, in many studies, reduce the operational sample size to only 70-75% of its initial value.
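A toy illustration of how casewise deletion shrinks the sample (the data below are made up):

```python
# Casewise (listwise) deletion on a small, made-up data set.
# Each tuple is a case on three variables; None marks a missing value.
rows = [
    (1.0, 2.0, 3.0),
    (4.0, None, 6.0),   # missing on variable 2 -> whole case dropped
    (7.0, 8.0, None),   # missing on variable 3 -> whole case dropped
    (2.0, 5.0, 1.0),
]

# Keep only cases with no missing values on any variable.
complete = [r for r in rows if None not in r]
# Two of four cases survive: one missing cell costs the entire case.
```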
Missing Data Imputation An alternative is to “impute” or reconstruct missing data values, using procedures that range from extraordinarily simple to very complex.
Dangers of Casewise Deletion Casewise deletion seems, at first glance, to be “unbiased” and “fair” since it does not involve “making up data” to replace the missing values. However, it can actually be quite dangerous, resulting in biased estimates that are far worse than any problem resulting from “making up data.”
Missing Data Theory When missing data are a problem in an analysis, we try to develop an understanding of an underlying mechanism that led to the data being missing. The classic reference is Little & Rubin (1987), Statistical Analysis with Missing Data . For a superb introductory chapter, see Chapter 3 of Frank Harrell’s Regression Modeling Strategies.
Missing Data Theory Several potential mechanisms for missing data can be discussed, and several technical terms are commonly employed to discuss them:
MCAR (Missing Completely At Random) • Data are missing for reasons completely unrelated to the values of any variable in the study or characteristic of the subject, including the value of the missing data itself. • Examples: A sensor failed at random during a trial. A survey response was lost in the mail. • This is the easiest type of missing data to deal with.
MAR (Missing at Random) • The probability that a value is missing depends on values of the variables that were actually measured. • Given the values of other observed variables, subjects having missing values are only randomly different from other subjects.
MAR (Missing at Random) • Example (Harrell, 2001, p. 41). Consider a survey in which females are less likely to provide their personal income than males, but the likelihood of responding is independent of a woman’s actual income. If we have the sexes of all subjects, and we have income data for some females, then we can still construct unbiased income estimates.
IM (Informative Missing) • Data are more likely to be missing if their true values are higher or lower. • Example: People with very high incomes in a survey are more likely to refuse to provide income data. • This is the most difficult kind of missing data to deal with.
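The three mechanisms can be made concrete with a small simulation built around the income-survey example (all data and rates below are simulated and hypothetical):

```python
# Simulated illustration of MCAR, MAR, and IM missingness on an
# "income" variable, with "sex" fully observed. All numbers are made up.
import random
random.seed(42)

n = 10000
sex = [random.choice(("F", "M")) for _ in range(n)]
income = [random.gauss(50 if s == "F" else 55, 10) for s in sex]

# MCAR: 20% of values lost completely at random.
mcar = [None if random.random() < 0.2 else inc for inc in income]
# MAR: females less likely to respond, independent of their actual income.
mar = [None if (s == "F" and random.random() < 0.4) else inc
       for s, inc in zip(sex, income)]
# IM: high incomes preferentially missing -- depends on the value itself.
im = [None if (inc > 65 and random.random() < 0.8) else inc for inc in income]

def mean_observed(v):
    obs = [x for x in v if x is not None]
    return sum(obs) / len(obs)
# Under MCAR the observed mean stays essentially unbiased; under IM it is
# pulled down, because the largest values are the ones most often missing.
```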
Modeling and Dealing with Missing Data Some potential approaches and issues (Harrell, 2001, p. 45): • Imputation of missing values for one of the variables can ignore all other information. For example, missing values can be replaced with a constant such as the mean or the median of nonmissing values on that variable. • Imputation can be based on information not otherwise used.
• Imputations can be based on information obtained only by analyzing interrelationships among the X’s. • Imputations can be based on relationships between the X’s, and between X and Y. • Imputations can take into account the reason for non-response, if known. • Ignoring the known relationship between X and Y for non-missing variables during imputation can bias the regression coefficient toward zero.
Imputation Algorithms
Single Imputation of Conditional Means. For a single X that is unrelated to other X’s, the mean or median may be substituted with little loss of efficiency.
Multiple Imputation. Uses random draws from the estimated conditional distribution of the X value given the other X values and (possibly) Y. Usually these draws are repeated (and the analysis repeated) several times.
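Both strategies can be sketched in a few lines, assuming a toy variable x with missing entries and one fully observed covariate z (all names, data, and missingness rates here are hypothetical):

```python
# Single mean imputation vs. one multiple-imputation draw, on toy data.
import random
import statistics
random.seed(0)

n = 500
z = [random.gauss(0, 1) for _ in range(n)]
x_true = [2 * zi + random.gauss(0, 1) for zi in z]   # true slope of 2 on z
x = [None if random.random() < 0.3 else xi for xi in x_true]  # 30% missing

obs = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]

# --- Single imputation of the unconditional mean ---
xbar = statistics.mean([xi for _, xi in obs])
x_mean_imp = [xi if xi is not None else xbar for xi in x]

# --- One multiple-imputation draw: regress x on z over complete cases,
#     then fill each missing x with a random draw from the estimated
#     conditional distribution (repeat m times in a real analysis). ---
zo = [zi for zi, _ in obs]
xo = [xi for _, xi in obs]
zm, xm = statistics.mean(zo), statistics.mean(xo)
b = sum((zi - zm) * (xi - xm) for zi, xi in obs) / sum((zi - zm) ** 2 for zi in zo)
a = xm - b * zm
resid_sd = statistics.pstdev([xi - (a + b * zi) for zi, xi in obs])

x_mi_draw = [xi if xi is not None else a + b * zi + random.gauss(0, resid_sd)
             for zi, xi in zip(z, x)]
```

Mean imputation shrinks the variance of x and flattens its relationship with z; the conditional draws preserve both, which is why the multiple-imputation approach is preferred when x is related to other variables.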
General Guidelines
Proportion of missing ≤ .05: It doesn’t matter much how you impute missings or whether you adjust variance of regression coefficient estimates for missingness. Casewise deletion analysis is a reasonable option.
Proportion of missing .05 to .15: If the predictor is unrelated to other predictors, you can use a reasonable constant value; otherwise impute using a customized model to predict the predictor from all other predictors. Variance estimates of coefficients may need to be adjusted.
Proportion of missing greater than .15: Same as above, but the need to adjust variance estimates is even greater.
Generating a Correlation Matrix with Pairwise Deletion One common procedure is to compute each correlation on the complete cases for those two variables only. This procedure generates correlations that are generally based on larger sample sizes than would be obtained with casewise deletion. One problem is that a matrix of such correlations may not be “positive definite.”
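Why such a matrix can fail to be positive definite: each correlation is computed on a different subset of cases, so the values need not be jointly attainable by any single complete data set. A numerical sketch (the correlation values are made up for illustration):

```python
# Each entry below is a legal correlation on its own (all in [-1, 1]),
# and pairwise deletion could produce such a set, since each correlation
# comes from a different subset of cases. Yet no single data set can
# produce all three at once: the matrix is not positive definite.
R = [[ 1.0,  0.9, -0.9],
     [ 0.9,  1.0,  0.9],
     [-0.9,  0.9,  1.0]]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

d = det3(R)
# A positive definite matrix has a positive determinant; here det(R) < 0,
# so R cannot be the correlation matrix of any real data set.
```

Procedures that require inverting the correlation matrix (e.g., multiple regression, factor analysis) can fail or give nonsensical results when handed such a matrix.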