Distributed Practice!
• Emil would like to examine cortisol as a function of both external (temperature in °C) and internal factors (excitement seeking, on a scale of 1 to 7). The initial model:
model1 <- lmer(CortisolNMol ~ 1 + TempC + ExcitementSeeking + (1|Subject), data=stress)
• It would be possible to center TempC here as well, but this variable at least has a meaningful 0 already (0 °C is possible)
Rank Deficiency
• We also have some other measures
• Let's try adding temperature in Fahrenheit (TempF) to the model:
model2 <- lmer(CortisolNMol ~ 1 + TempC + TempF + ExcitementSeeking + (1|Subject), data=stress)
• This looks scary!
Rank Deficiency
• Our model: E(Y_i(j)) = γ000 + γ100·x1_i(j) + γ200·x2_i(j)
  (γ000 = baseline stress; x1 = temperature in Celsius; x2 = temperature in Fahrenheit)
• What does γ100 represent here?
• The effect of a 1-unit change in degrees Celsius … while holding degrees Fahrenheit constant
• This makes no sense—if °C changes, so does °F
• In fact, one change perfectly predicts the other: °F = (9/5)·°C + 32
• Problem if one column is a linear combination of other(s)—it can be perfectly formed from other columns by adding, subtracting, multiplying, or dividing
Rank Deficiency
[Figure: scatterplot of Temperature in Fahrenheit (30–90, y-axis) against Temperature in Celsius (0–30, x-axis); all points fall on a single straight line]
• Linear combinations result in a perfect correlation
• Here: a perfect correlation between TempC and TempF
Rank Deficiency
• You: "R won't do what I want! I need to fit this model, but I can't! This program is broken!"
• Scott's view: This really isn't a coherent research question
• "The effect of changing °C while holding °F constant" is a nonsensical question
• We would have this same problem in any software package, or even if we computed the regression by hand
Rank Deficiency
• In fact, it's mathematically impossible to perform this regression
• A matrix in which some columns are linear combinations of others is "rank deficient"
• If a matrix is rank-deficient, you can't compute its inverse (because the inverse doesn't exist)
• But, you need to calculate an inverse to perform linear regression: (X'X)^-1
• Therefore, the regression can't be performed
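• A quick illustration in R (a minimal sketch with made-up temperatures, not the stress dataset) of why the inverse doesn't exist:
tempC <- c(10, 15, 20, 25, 30)
tempF <- (9/5) * tempC + 32              # exact linear combination of tempC
X <- cbind(Intercept = 1, tempC, tempF)  # design matrix for the model
qr(X)$rank                               # 2, not 3: one column is redundant
# solve(t(X) %*% X)                      # errors: the matrix is singular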
Rank Deficiency: Solutions
• Is it unexpected that columns would be related in this way?
• Check your experiment script & data-processing pipeline (e.g., is something saved in the wrong column?)
• Or is it expected? (TempF should be predictable from TempC)
• Rethink the analysis—would it ever be sensible to try to distinguish these effects?
• Maybe this is just not a coherent research question
• Or, a different design might make these independent
Rank Deficiency
• We have some other personality variables:
• Gregariousness
• Assertiveness
• Extraversion, the sum of the gregariousness, assertiveness, and excitement-seeking facets
• Emil's next model:
modelExtra <- lmer(CortisolNMol ~ 1 + TempC + Gregariousness + Assertiveness + ExcitementSeeking + Extraversion + (1|Subject), data=stress)
• Where is the rank deficiency in this case?
Rank Deficiency
• "Problem if one column is a linear combination of other(s)—it can be perfectly formed from other columns by adding, subtracting, multiplying, or dividing"
• Similar problem: One predictor variable is the mean or sum of other predictors in the model
• We already know exactly what Extraversion is when we know the three facet scores
• Doesn't make sense to ask about the effect of Extraversion while holding constant Gregariousness, Assertiveness, and ExcitementSeeking
Rank Deficiency
• Can also get this with the average of other variables
• Exam 1 score, Exam 2 score, Average score: Average = (Exam1 + Exam2) / 2
• Again, the Average column can be perfectly predicted from the other columns (see the sketch below)
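• A small sketch (made-up exam scores, not course data) of how R reacts: lm() detects the redundancy and returns NA for the aliased coefficient
exam1 <- c(80, 90, 75, 88, 95, 70)
exam2 <- c(70, 85, 95, 80, 60, 90)
average <- (exam1 + exam2) / 2   # perfect linear combination
outcome <- rnorm(6)              # arbitrary DV, just for illustration
coef(lm(outcome ~ exam1 + exam2 + average))
# the coefficient for 'average' comes back NA: it adds no new information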
Week 11: Missing Data l Unbalanced Factors l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Incomplete Designs
• The last variable that Emil is interested in is anxiety
• Two relevant columns:
• Anxiety: Does this person have a diagnosis of GAD (Generalized Anxiety Disorder) or not?
• Severity: Is the anxiety severe or not?
• Let's look at these two factors and their interaction:
model4 <- lmer(CortisolNMol ~ 1 + Anxiety * Severity + (1|Subject), data=stress)
Incomplete Designs
• Why is this model rank-deficient?
• Some cells of the interaction are not represented in our current design:
xtabs(~ Anxiety + Severity, data=stress)
• No observations for people with No anxiety and Severe symptoms
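• A sketch of the kind of cross-tabulation xtabs() returns (the counts here are hypothetical; the point is the empty cell):
xtabs(~ Anxiety + Severity, data = stress)
#          Severity
# Anxiety   No  Yes
#     No   200    0   # <- the empty cell: nobody has No anxiety + Severe symptoms
#     Yes  150  105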
Incomplete Designs
• Again, not a bug. It doesn't make sense to ask about the Anxiety × Severity interaction here
[Design diagram: "Anxiety: No, Severity: No" vs. "Anxiety: Yes, Severity: No" is the Anxiety effect; "Anxiety: Yes, Severity: No" vs. "Anxiety: Yes, Severity: Yes" is the Severity effect]
• Interaction of Anxiety & Severity: "Effect of Anxiety and Severe anxiety over and above the effects of Anxiety alone and Severe anxiety alone"
• But, there is no effect of severe anxiety "alone"—you have to have anxiety to have severe anxiety
Incomplete Designs
• In fact, it's mathematically impossible to calculate an interaction here!

  Anxiety   Severity   Anxiety * Severity
     1         0               0
     1         1               1
     0         0               0

• The interaction column is identical to the Severity column
• We can't distinguish them!
• So, this is really just another example of the same linear-combination problem
Incomplete Designs: Solutions
• If the missing cell is intended:
• Often makes more sense to think of this as a single factor with > 2 categories
• "Not anxious," "Moderate anxiety," "Severe anxiety"
• Can use Contrast 1 to compare Moderate & Severe to Not Anxious, and Contrast 2 to compare Moderate to Severe (see the sketch below)
• If the missing cell is accidental:
• Might need to collect more data!
• Check your experimental lists
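• A sketch of that recoding in R, assuming a hypothetical three-level factor AnxietyLevel built from the two original columns:
stress$AnxietyLevel <- factor(
  ifelse(stress$Anxiety == "No", "NotAnxious",
         ifelse(stress$Severity == "No", "Moderate", "Severe")),
  levels = c("NotAnxious", "Moderate", "Severe"))
# Contrast 1: Moderate & Severe vs. Not Anxious; Contrast 2: Moderate vs. Severe
contrasts(stress$AnxietyLevel) <- cbind(AnxiousVsNot     = c(-2/3, 1/3, 1/3),
                                        SevereVsModerate = c(0, -1/2, 1/2))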
Week 11: Missing Data l Unbalanced Factors l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Source Confusion
• So far, we've handled cases where combinations of the predictor variables are missing
• Another scenario: With a categorical DV, particular categories of the DV might be rare or non-existent
• Overall or in certain conditions
• A relevant categorical DV might be memory:
• Lots of theoretical interest in source memory
• Remembering the context or source where you learned something ("WHO SAID IT?")
• Reminder: Odds are not the same thing as probabilities!
Source Confusion
• sourceconfusion.csv: Source confusions in the cued recall task
• [Example items—Study: VIKING—VODKA; Test: VIKING—COLLEGE, SCOTCH—VODKA]
• Two independent variables:
• AssocStrength: within-subjects but between-items
• Strategy (maintenance or elaborative rehearsal): within-items but between-subjects
• Here, items = WordPairs
Source Confusion
• Two independent variables:
• AssocStrength: within-subjects but between-items
• Strategy (maintenance or elaborative rehearsal): within-items but between-subjects
• Here, items = WordPairs
• Factorial design … apply the coding scheme you think would be most appropriate
• Effects coding:
contrasts(sourceconfusion$AssocStrength) <- c(0.5, -0.5)
contrasts(sourceconfusion$Strategy) <- c(0.5, -0.5)
Source Confusion
• Now try to model SourceConfusion (0=no, 1=yes) as a function of these 2 variables & their interaction
• Use the maximal random effects structure
• Hint 1: Is this glmer() or lmer()?
• Hint 2: Remember the maximal random effects structure includes:
• By-subjects slopes for only the within-subjects variables
• By-items slopes for only the within-items variables
Source Confusion Model
• Let's model what causes people to make source confusions:
model.Source <- glmer(SourceConfusion ~ AssocStrength*Strategy +
                        (1+AssocStrength|Subject) +
                        (1+Strategy|WordPair),
                      data=sourceconfusion, family=binomial)
• This looks bad! ☹
Low & High Probabilities • Problem: These are low frequency events • In fact, lots of theoretically interesting things have low frequency • Clinical diagnoses that are not common • Various kinds of cognitive errors • Language production, memory, language comprehension… • Learners’ errors in math or other educational domains
Low & High Probabilities • A problem for our model: • Model was trying to find the odds of making a source confusion within each study condition • But: Source confusions were never observed with elaborative rehearsal, ever! • How small are the odds? They are infinitely small in this dataset! • Note that not all failures to converge reflect low frequency events. But when very low frequencies exist, they are likely to cause convergence problems.
Low & High Probabilities
• The logit is undefined if probability = 1:
  logit = log[ p(confusion) / (1 − p(confusion)) ] = log[ 1/0 ]   Division by zero!
• The logit is also undefined if probability = 0:
  logit = log[ p(confusion) / (1 − p(confusion)) ] = log[ 0/1 ] = log(0)
• log(0) is undefined: e^??? = 0
• But there is nothing to which you can raise e to get 0
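• A quick check in R (a minimal sketch) makes the same point:
logit <- function(p) log(p / (1 - p))
logit(c(0, 0.5, 1))
# [1] -Inf    0  Inf   -- the endpoints are not finite; only moderate p behaves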
Low & High Probabilities
• When close to 0 or 1, the logit is defined but unstable
[Figure: log odds of recall (y-axis, −4 to 4) plotted against probability of recall (x-axis, 0 to 1); the curve changes fast at extreme probabilities and relatively gradually at moderate probabilities]
• e.g., p(0.6) → 0.41; p(0.8) → 1.39; p(0.95) → 2.94; p(0.98) → 3.89
Low & High Probabilities
• A problem for our model:
• The question was how much less common source confusions become with elaborative rehearsal
• But: Source confusions were never observed with elaborative rehearsal, ever!
• Why we think this happened:
• In theory, elaborative subjects would probably make at least one of these errors eventually, given infinite trials (e.g., a true probability of 1% with N = ∞)—not impossible
• But, empirically, the probability was low enough that we didn't see the error in our limited sample (N = 8)
Empirical Logit
• Empirical logit: An adjustment to the regular logit to deal with probabilities near (or at) 0 or 1
  logit = log[ p(confusion) / (1 − p(confusion)) ]
Empirical Logit
• Written in terms of counts, where A = source confusion occurred and B = source confusion did not occur:
  logit = log[ Num of "A"s / Num of "B"s ]
  emp. logit = log[ (Num of "A"s + 0.5) / (Num of "B"s + 0.5) ]
• Makes extreme values (close to 0 or 1) less extreme
Empirical Logit
• Comparing the two at different counts:

              A=10, B=0   A=9, B=1   A=6, B=4
  logit       undefined     2.20       0.41
  emp. logit     3.04       1.85       0.37

• Makes extreme values (close to 0 or 1) less extreme
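• The adjustment as a one-line R function:
emp.logit <- function(numA, numB) log((numA + 0.5) / (numB + 0.5))
emp.logit(10, 0)   # 3.04 -- finite, where the ordinary logit is undefined
emp.logit(6, 4)    # 0.37 -- nearly identical to the ordinary logit (0.41)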
[Figure: empirical logit plotted against the "true" logit across probabilities]
• The empirical logit doesn't go as high or as low as the "true" logit
• At moderate values, they're essentially the same
• With larger samples, the difference gets much smaller (as long as the probability isn't 0 or 1)
Empirical Logit: Implementation
• The empirical logit requires summing up events and then adding 0.5 to the numerator & denominator:
  empirical logit = log[ (Num A + 0.5) / (Num B + 0.5) ]
  e.g., Num A = number of "A"s for subject S10 in the Low associative strength, Maintenance rehearsal condition
• Thus, we have to (1) sum across individual trials, and then (2) calculate the empirical logit
• Result: a single empirical logit value for each subject in each condition—not a sequence of one YES or NO for every item
Empirical Logit: Implementation
• Can't have multiple random effects with the empirical logit
• Would have to do separate by-subjects and by-items analyses (see the sketch below)
• Collecting more data can be another solution
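• A sketch of steps (1) and (2) for the by-subjects analysis, assuming dplyr and the column names used above:
library(dplyr)
bySubj <- sourceconfusion %>%
  group_by(Subject, AssocStrength, Strategy) %>%
  summarize(numA = sum(SourceConfusion == 1),   # trials with a confusion
            numB = sum(SourceConfusion == 0),   # trials without
            empLogit = log((numA + 0.5) / (numB + 0.5)),
            .groups = "drop")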
Empirical Logit: Implementation
• Scott's psycholing package can help calculate the empirical logit & run the model
• Example script on CourseWeb
• Two notes:
• 1. No longer using glmer() with family=binomial. We're now running the model on the empirical logit value, which isn't just a 0 or 1 (e.g., a DV value here might be −1.49)
Empirical Logit: Implementation
• 2. Because we calculate the empirical logit beforehand, the model doesn't know how many observations went into that value
• A value of −2.46 could be the average across 10 trials or across 100 trials
• Want to appropriately weight the model
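• One common way to do that weighting (a sketch using the bySubj data frame from above, not the psycholing package's own code): weight each cell by the inverse variance of its empirical logit
library(lme4)
bySubj$wt <- 1 / (1 / (bySubj$numA + 0.5) + 1 / (bySubj$numB + 0.5))
model.elogit <- lmer(empLogit ~ AssocStrength * Strategy + (1|Subject),
                     data = bySubj, weights = wt)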
Week 11: Missing Data l Unbalanced Factors l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Missing Data • So far, we’ve looked at cases where: • Two predictor variables are always confounded • A particular category of the dependent variable never (or almost never ) appears in a particular cell • But lots of cases where some observations are missing more haphazardly
Missing Data
• Lots of cases where a model makes sense in principle, but part of the dataset is missing
• Computer crashes
• Some people didn't entirely fill out the questionnaire
• Participants dropped out
• Implausible values that we excluded
• Non-codable data (e.g., we're looking at whether L2 learners produce the correct plural, but someone says fish)
• Remember that missing data is indicated in R with NA
Missing Data • Big issue: Is our sample still representative ? • Basic goal in inferential statistics is to generalize from limited sample to a population • If sample is truly random , this is justified
Missing Data • Problem: If data from certain types of people always go missing, our sample will no longer be representative of the broader population • It’s not a random sample if we systematically lose certain kinds of data
Missing Data • In fact, it’s a problem even if certain types of people are somewhat more likely to have missing data • Still not a fully random sample
Big Issue: WHY data is missing l We will see several techniques for dealing with missing data l The degree to which these techniques are appropriate depends on how the missing data relates to the other variables l Missingness of data may not be arbitrary (and often isn’t) l Let’s first look at some hypothetical patterns of missing data l This is a conceptual distinction l In any given actual data set, we might not know the pattern for certain
Week 11: Missing Data l Unbalanced Factors l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Hypothetical Scenario #1 l Sometimes, the fact that a data point is NA may be related to what the value would have been … if we’d been able to measure it
Hypothetical Scenario #1
l Sometimes, the fact that a data point is NA may be related to what the value would have been … if we'd been able to measure it
l A health psychologist is surveying high school students about their marijuana use
l Students who've tried marijuana may be more likely to leave this question blank than those who haven't
l The remaining data is a biased sample:

  ACTUAL STATE OF WORLD   WHAT WE SEE
        Yes                  Yes
        No                   No
        No                   NA
        Yes                  NA
        No                   No
        Yes                  NA
     (Yes = 50%)          (Yes = 33%)
Hypothetical Scenario #1
l Sometimes, the fact that a data point is NA may be related to what the value would have been … if we'd been able to measure it
l In other words, some values are more likely than others to end up as NAs
l All that's relevant here is that there is a statistical contingency
l The actual causal chain might be more complex
l e.g., marijuana use → fear of legal repercussions → omitted response
Hypothetical Scenario #1 l Further examples: l Clinical study where we’re measuring health outcome. People who are very ill might drop out of the study. l Experiment where you have to press a key within 3 seconds or the trial ends without a response time being recorded l People with low high school GPA decline to report it l These are all examples of nonignorable missingness
Hypothetical Scenario #1
l Nonignorable missingness is bad ☹
l The remaining observations (those without NAs) are not representative of the full population
l We can't fully account for what the missing data were, or why they're missing
l We simply don't know what the missing RT would have been if people had been allowed more time to respond
Week 11: Missing Data l Unbalanced Factors l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Hypothetical Scenario #2 l In other cases, data might go missing at “random” or for reasons completely unrelated to the study l Computer crash l Inclement weather l Experimenter error l Random subsampling of people for a follow-up
Hypothetical Scenario #2
l In these cases, there is no reason to think that the missing data would look any different from the remaining data
l Ignorable missingness
Hypothetical Scenario #2
l Ignorable missingness isn't nearly as bad! ☺
l It's a bummer that we lost some data, but what's left is still representative of the population
l Our data is still a random sample of the population—it's just a smaller random sample
l Still valid to make inferences about the population
Hypothetical Scenario #3 l Another case of ignorable missingness is when the fact that the data is NA can be fully explained by other, known variables l Examples: l People who score high on a pretest are excluded from further participation in an intervention study l We’re looking at child SES as a predictor of physical growth, but lower SES families are less likely to return for the post-test l DV is whether people say a plural vs singular noun; we discard ambiguous words (e.g., “fish”). Rate of ambiguous words differs across conditions
Hypothetical Scenario #3
l Another case of ignorable missingness is when the fact that the data is NA can be fully explained by other, known variables
l This is also ignorable because there's no mystery about why the data is missing, nor about the values of the variables associated with NA-ness
l We know why the high-pretest people were excluded from the intervention. It has nothing to do with unobserved variables
l Again, we're referring to statistical contingencies, not direct causal links
l e.g., Low SES → Transportation less affordable → NA
Week 11: Missing Data l Unbalanced Factors l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Wrap-Up
l Ignorableness is more like a continuum:
  (IGNORABLE) All of the missingness can be accounted for by known variables ←——→ All of the missingness depends on unknown values (NONIGNORABLE)
l As long as we're relatively ignorable, we're OK
l "We should expect departures from [ignorable missingness] … in many realistic cases … may often have only a minor impact on estimates and standard errors." (Schafer & Graham, 2002, p. 152)
l "In many psychological situations the departures … are probably not serious." (Schafer & Graham, 2002, p. 154)
Ignorable or Non-ignorable?
l Where is my dataset on this continuum?
l "In general, there is no way to test whether [ignorable missingness] holds in a data set." (Schafer & Graham, 2002, p. 152)
l Definitely ignorable if you used known variable(s) to decide to discard data or to stop data collection
l Kids who get a low score on Task 1 aren't given Task 2
l People who are sufficiently healthy are excluded from a clinical study
l We realized we made a typo in one of our stimulus items and discarded all trials using that item
l Other cases: Use your knowledge of the domain
l Are certain values less likely to be measured & recorded; e.g., poor health in a clinical study? (non-ignorable)
l Or, is the missingness basically happening at random? Can it be accounted for by things we measured? (ignorable)
l Big picture: Most departures from ignorable missingness aren't terrible, but be aware of the possibility that certain values are systematically more likely to be missing
Ignorable or Non-ignorable? l The post-experiment manipulation-check questionnaires for five participants were accidentally thrown away. l In a 2-day memory experiment, people who know they would do poorly on the memory test are discouraged and don’t want to return for the second session. l There was a problem with one of the auditory stimulus files in the “Passive Sentence” condition (but not the corresponding version in the “Active” condition); we discarded data from those trials.
Ignorable or Non-ignorable? l The post-experiment manipulation-check questionnaires for five participants were accidentally thrown away. l Ignorable—not related to any variable l In a 2-day memory experiment, people who know they would do poorly on the memory test are discouraged and don’t want to return for the second session. l Non-ignorable. Missingness depends on what your memory score would have been if we had observed it. l There was a problem with one of the auditory stimulus files in the “Passive Sentence” condition (but not the corresponding version in the “Active” condition); we discarded data from those trials. l Ignorable; this depends on a known variable (condition)
Ignorable or Non-ignorable? l We are comparing life satisfaction among a sample of students known to live on-campus vs. a sample of students known to live off-campus. But students off- campus are less likely to return their surveys because it’s more inconvenient for them to do so.
Ignorable or Non-ignorable? l We are comparing life satisfaction among a sample of students known to live on-campus vs. a sample of students known to live off-campus. But students off- campus are less likely to return their surveys because it’s more inconvenient for them to do so. l Ignorable if missingness depends only on this known variable. Fewer off-campus students might return their surveys, but the off-campus students from whom we have data don’t differ from the off-campus students for whom we don’t have data. l If we think that there is also a relation to the unmeasured life- satisfaction variable (e.g., people unhappy with their lives don’t return the survey), although not mentioned above, then this would be non-ignorable. l Assumptions we are willing to make about the missing data (and why it’s missing) affect how we can then use it.
Week 11: Missing Data l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Intro l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Casewise Deletion
l In each comparison, delete observations only if the missing data is relevant to this comparison
l Correlating Extraversion & Conscientiousness → delete/ignore the red rows
Casewise Deletion
l In each comparison, delete observations only if the missing data is relevant to this comparison
l Correlating Extraversion & ReadingSpan → delete/ignore the blue row
Casewise Deletion
l Avoids data loss
l But, results are not completely consistent/comparable because they're based on different observations
l e.g., possible to have A > B > C > A
cor.test(stress$ReadingSpan, stress$Extraversion)
cor.test(stress$Conscientiousness, stress$Extraversion)  # df=453
l The d.f.s don't match because they're based on different subsets of the data
Week 11: Missing Data l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Intro l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Listwise Deletion l Delete any observation where data is missing anywhere l e.g., stress2 <- na.omit(stress) l Default in lmer and many other programs
Listwise Deletion
l Avoids inconsistency
l In some cases, could result in a lot of data loss
l However, mixed effects models do well even with moderate data loss (25%; Quené & van den Bergh, 2004)
l Unlike ANOVA, MEMs properly account for some subjects or conditions having fewer observations
l Produces the correct parameter estimates if missingness is ignorable
l Although some other things (R²) may be incorrect
l Estimates will be wrong if missingness is non-ignorable
Week 11: Missing Data l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Intro l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Unconditional Imputation
l Replace missing values with the mean of the observed values:
  5, 8, 3, ?, ?  (M = 5.33; S² = 12.5)   →   5, 8, 3, 5.33, 5.33  (M = 5.33; S² = 3.17)
l Imputing the mean reduces the variance
l This increases the chance of detecting spurious effects
l Also distorts the correlations with other variables
l Bad. Don't do this!
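l A sketch of the shrinkage in R; with only the three observed values the variance estimate differs from the slide, but the pattern is the same:
x <- c(5, 8, 3, NA, NA)
xImputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
mean(x, na.rm = TRUE)   # 5.33
mean(xImputed)          # 5.33 -- the mean is unchanged...
var(x, na.rm = TRUE)    # 6.33 -- variance of the observed values
var(xImputed)           # 3.17 -- ...but the variance shrinks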
Week 11: Missing Data l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Intro l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Conditional Imputation
l Replace missing values with the values predicted by a model using known variable(s): ReadingSpan ~ 1 + OperationSpan, for rows where ReadingSpan = NA
l Create a model that predicts ReadingSpan from OperationSpan:
interimModel <- lmer(ReadingSpan ~ 1 + OperationSpan + (1|Subject), data=stress)
l Use the model to predict the missing ReadingSpan values:
predictedValues <- predict(interimModel, stress[is.na(stress$ReadingSpan),])
l Replace the missing ReadingSpan values with the predicted values:
stress[is.na(stress$ReadingSpan),'ReadingSpan'] <- predictedValues
Conditional Imputation
l If the missingness is ignorable, we get the correct parameter estimates
l And, the standard errors are not as distorted
l Especially if we add some noise to the fitted values:
predictedValues <- predictedValues +
  rnorm(length(predictedValues), mean=0,
        sd=ResidualSDFromTheModel)  # in lme4, sigma(interimModel) gives this residual SD
Conditional Imputation
l Where is this useful?
l When many observations have a small amount of missing data, but which column it is varies
l Listwise deletion would wipe out every row with an NA anywhere
Week 11: Missing Data l Rank Deficiency l Linear Combinations l Incomplete Designs l Empirical Logit l Missing Data (NA values) l Intro l Types of Missingness l Non-Ignorable l Ignorable l Summary l Possible Solutions l Casewise Deletion l Listwise Deletion l Unconditional Imputation l Conditional Imputation l Multiple Imputation
Multiple Imputation
l Like doing conditional imputation several times:
l Replace missing data with one possible set of values
l Run the model
l Repeat
l The final result averages these:
  Dataset with missing data → { Dataset with imputation 1 → Model results 1; Dataset with imputation 2 → Model results 2; Dataset with imputation 3 → Model results 3 } → Final Results
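l A sketch of how this might look with the mice package (an assumption—the slides don't name a package; pooling merMod fits also needs broom.mixed installed):
library(mice)
library(lme4)
imp <- mice(stress, m = 5, printFlag = FALSE)    # 5 imputed datasets
fits <- with(imp, lmer(CortisolNMol ~ 1 + TempC + ExcitementSeeking + (1|Subject)))
pool(fits)                                       # combine results via Rubin's rules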