Dealing with missing values – part 2 Applied Multivariate Statistics – Spring 2012

Overview
- More on Single Imputation: Shortcomings
- Multiple Imputation: Accounting for uncertainty

Single Imputation
Methods, from easy / inaccurate to hard / accurate:
- Unconditional Mean
- Unconditional Distribution
- Conditional Mean
- Conditional Distribution

Example: Blood Pressure - Revisited
- 30 participants in January (X) and February (Y)
- MCAR: Delete 23 Y values randomly
- MAR: Keep Y only where X > 140 (follow-up)
- MNAR: Record Y only where Y > 140 (test everybody again but only keep values of critical participants)
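
The three missingness mechanisms in the example can be sketched in a few lines of Python. This is only an illustration: the blood pressure values are simulated, and the mean of 125 and standard deviation of 25 are assumptions made here, not values from the slides. Missing values are marked with `None`.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

n = 30
# Hypothetical data: January (x) and February (y) blood pressure readings.
# Mean 125 and sd 25 are illustrative assumptions.
x = [random.gauss(125, 25) for _ in range(n)]
y = [random.gauss(125, 25) for _ in range(n)]

# MCAR: delete 23 Y values completely at random.
mcar_missing = set(random.sample(range(n), 23))
y_mcar = [None if i in mcar_missing else y[i] for i in range(n)]

# MAR: keep Y only where the (always observed) X exceeds 140.
y_mar = [y[i] if x[i] > 140 else None for i in range(n)]

# MNAR: keep Y only where Y itself exceeds 140.
y_mnar = [y[i] if y[i] > 140 else None for i in range(n)]

def n_observed(values):
    """Count the entries that are not missing."""
    return sum(v is not None for v in values)

print("observed Y values:",
      "MCAR", n_observed(y_mcar),
      "MAR", n_observed(y_mar),
      "MNAR", n_observed(y_mnar))
```

The difference between MAR and MNAR is visible in the code: under MAR the rule only looks at `x`, which is always observed, while under MNAR the rule looks at the value that ends up missing.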

Example: Blood Pressure
[Figure: blood pressure data; black points are missing (MAR)]

Unconditional Mean
+ Mean of Y ok
- Variance of Y wrong

Unconditional Distribution
+ Mean of Y ok, Variance better
- Correlation between X and Y wrong

Conditional Mean
Y = 84 + 0.3*X
+ Conditional Mean of Y ok
+ Correlation ok
- (Conditional) Variance wrong

Conditional Distribution
Y = 84 + 0.3*X + e, e ~ N(0, 23^2)
+ Conditional Mean of Y ok
+ Correlation ok
+ Conditional Variance of Y ok
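
The four single imputation methods can be compared side by side with a small Python sketch. Several things here are assumptions for illustration: the data are simulated from the line Y = 84 + 0.3*X + e quoted on the slides, the sample size is raised to 200 so the fit is stable, and "unconditional distribution" is read as drawing at random from the observed Y values.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical data; n = 200 and the distribution of x are assumptions.
n = 200
x = [random.gauss(125, 25) for _ in range(n)]
y_full = [84 + 0.3 * xi + random.gauss(0, 23) for xi in x]
y = [yi if xi > 140 else None for xi, yi in zip(x, y_full)]  # MAR

obs = [i for i in range(n) if y[i] is not None]
x_obs = [x[i] for i in obs]
y_obs = [y[i] for i in obs]

def fit_line(xs, ys):
    """Least-squares intercept a, slope b and residual standard deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

a, b, sigma = fit_line(x_obs, y_obs)
y_mean = statistics.fmean(y_obs)

def impute(rule):
    """Keep observed Y values; fill each missing one with rule(x)."""
    return [y[i] if y[i] is not None else rule(x[i]) for i in range(n)]

imputed = {
    "unconditional mean": impute(lambda xi: y_mean),
    "unconditional distribution": impute(lambda xi: random.choice(y_obs)),
    "conditional mean": impute(lambda xi: a + b * xi),
    "conditional distribution": impute(
        lambda xi: a + b * xi + random.gauss(0, sigma)),
}

for name, values in imputed.items():
    print(f"{name:27s} sd(Y) = {statistics.stdev(values):5.1f}")
```

Mean imputation necessarily shrinks the standard deviation of Y, because every filled-in value sits exactly at the mean; the conditional distribution method adds the residual noise back.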

Problem: We ignore uncertainty
Conditional Distribution: Y = 84 + 0.3*X + e, e ~ N(0, 23^2)
- 95%-CI for the intercept (84): [-234; 402]
- 95%-CI for the slope (0.3): [-1.7; 2.4]

Problem of Single Imputation
- Too optimistic: the imputation model (e.g. Y = a + bX) is only estimated; it is not the true model
- Thus, imputed values have some uncertainty
- Single Imputation ignores this uncertainty
- Coverage probability of confidence intervals is wrong
Solution: Multiple Imputation
Incorporates both
- residual error
- model uncertainty (excluding model mis-specification)

Multiple Imputation: Idea
Steps, in order:
- Impute several times
- Do standard analysis for each imputed data set; get estimate and std. error
- Aggregate results

Multiple Imputation: Idea
Need special imputation schemes that include both
- uncertainty of residuals
- uncertainty of model (e.g. values of intercept a and slope b)
Rough idea:
- Fill in random values
- Iteratively predict values for each variable until some convergence is reached (as in missForest)
- Sample values for residuals AND for (a,b)
Gibbs sampler is used
Excellent for intuition (by one of the big guys in the field): http://sites.stat.psu.edu/~jls/mifaq.html
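
To make "sample values for residuals AND for (a,b)" concrete, here is a minimal Python sketch for one incomplete variable. It is not the Gibbs sampler and not what mice does internally: as a simple stand-in, the uncertainty of (a, b) is mimicked by refitting the regression on a bootstrap resample of the observed cases. The data are simulated, and `fit_line` and `impute_once` are hypothetical helper names.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical data; the line 84 + 0.3*x and sd 23 are borrowed from the
# slides, everything else (n, distribution of x) is an assumption.
n = 200
x = [random.gauss(125, 25) for _ in range(n)]
y = [84 + 0.3 * xi + random.gauss(0, 23) for xi in x]
y = [yi if xi > 140 else None for xi, yi in zip(x, y)]  # MAR

def fit_line(xs, ys):
    """Least-squares intercept a, slope b and residual standard deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5

def impute_once(x, y):
    """Return one completed copy of y."""
    obs = [i for i, yi in enumerate(y) if yi is not None]
    # Model uncertainty: a fresh (a, b, sigma) from a bootstrap resample.
    boot = [random.choice(obs) for _ in obs]
    a, b, sigma = fit_line([x[i] for i in boot], [y[i] for i in boot])
    # Residual uncertainty: add noise to every predicted value.
    return [yi if yi is not None else a + b * xi + random.gauss(0, sigma)
            for xi, yi in zip(x, y)]

m = 5
completed = [impute_once(x, y) for _ in range(m)]

# Standard analysis on each completed data set: mean of Y and its variance.
estimates = [statistics.fmean(yc) for yc in completed]
variances = [statistics.variance(yc) / n for yc in completed]

for q, u in zip(estimates, variances):
    print(f"mean(Y) = {q:6.2f}, std. error = {u ** 0.5:4.2f}")
```

The m estimates differ from each other; that spread is what carries the imputation uncertainty into the aggregated result.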

Multiple Imputation: Intuition
Predict missing values accounting for
- Uncertainty of residuals
- Uncertainty of parameter estimates

Multiple Imputation: Gibbs sampler (Not for exam)
Iteration t; repeat until convergence:
For each variable i:
- $\theta_i^{*(t)} \sim P(\theta_i \mid Y_i^{obs}, Y_{-i}^{(t)})$ (Intuition: Sample (a,b))
- $Y_i^{*(t)} \sim P(Y_i \mid Y_i^{obs}, Y_{-i}^{(t)}, \theta_i^{*(t)})$ (Intuition: Predict missings using y = a + bx + e)
where $Y_i^{(t)} = (Y_i^{obs}, Y_i^{*(t)})$

R package: MICE
Multivariate Imputation by Chained Equations
- MICE has good default settings; don't worry about the data type
- Defaults for data types of columns:
  - numeric: Predictive Mean Matching (pmm) (like fancy linear regression; faster alternative: linear regression)
  - factor, 2 levels: Logistic Regression (logreg)
  - factor, >2 levels: Multinomial logit model (polyreg)
  - ordered, >2 levels: Ordered logit model (polr)

Aggregation of estimates
- $\hat{Q}_i$: Estimate of imputation i
- $U_i$: Variance of estimate (= square of std. error)
- Assume: $(\hat{Q} - Q) / \sqrt{U} \approx N(0, 1)$
- Average estimate: $\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j$
- Within-imputation variance: $\bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j$
- Between-imputation variance: $B = \frac{1}{m-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2$
- Total variance: $T = \bar{U} + (1 + \frac{1}{m}) B$
- Approximately: $(\bar{Q} - Q) / \sqrt{T} \sim t_\nu$ with $\nu = (m-1) \left(1 + \frac{m \bar{U}}{(1+m) B}\right)^2$
- 95%-CI: $\bar{Q} \pm t_{\nu; 0.975} \sqrt{T}$
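
The aggregation rules translate almost line by line into code. The sketch below is a plain Python version for a scalar estimate; the function name `pool_estimates` and the example numbers are made up for illustration, and it is not the implementation used by the `pool` function in mice.

```python
import statistics

def pool_estimates(estimates, variances):
    """Aggregate m estimates and their variances (squared std. errors)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # average estimate
    u_bar = statistics.fmean(variances)   # within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance, divides by m - 1
    t = u_bar + (1 + 1 / m) * b           # total variance
    # Degrees of freedom of the t approximation (requires b > 0).
    df = (m - 1) * (1 + m * u_bar / ((1 + m) * b)) ** 2
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t, "df": df}

# Made-up example: m = 5 estimates, each with variance 4.
pooled = pool_estimates([10.0, 11.0, 12.0, 13.0, 14.0], [4.0] * 5)
print(pooled)
```

For these numbers the pooled estimate is 12, the within-imputation variance is 4, the between-imputation variance is 2.5, and the total variance is 4 + 1.2 * 2.5 = 7. The 95%-CI additionally needs the 0.975 quantile of the t distribution with `df` degrees of freedom, which the Python standard library does not provide; a statistics library such as SciPy (`scipy.stats.t.ppf`) could supply it.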

Multiple Imputation with MICE
- Do manually, if you have a non-standard analysis

How much uncertainty due to missings?
- Relative increase in variance due to nonresponse: $r = \frac{(1 + \frac{1}{m}) B}{\bar{U}}$
- Fraction (or rate) of missing information fmi (!! Not the same as fraction of missing OBSERVATIONS): $\mathrm{fmi} = \frac{r + \frac{2}{\nu + 3}}{r + 1}$
- Proportion of the total variance that is attributed to the missing data: $\lambda = \frac{(1 + \frac{1}{m}) B}{T}$
Returned by mice
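
A short numeric example may help to keep r, fmi and lambda apart. The Python sketch below evaluates the three formulas for made-up inputs (m = 5, within-imputation variance 4, between-imputation variance 2.5); the function name `missing_info` is hypothetical, and the degrees of freedom use the formula from the aggregation slide.

```python
def missing_info(m, u_bar, b):
    """Return (r, fmi, lam) for m imputations, within-variance u_bar, between-variance b."""
    t = u_bar + (1 + 1 / m) * b                          # total variance
    nu = (m - 1) * (1 + m * u_bar / ((1 + m) * b)) ** 2  # degrees of freedom
    r = (1 + 1 / m) * b / u_bar        # relative increase in variance
    fmi = (r + 2 / (nu + 3)) / (r + 1)  # fraction of missing information
    lam = (1 + 1 / m) * b / t           # share of total variance due to missing data
    return r, fmi, lam

r, fmi, lam = missing_info(m=5, u_bar=4.0, b=2.5)
print(f"r = {r:.3f}, fmi = {fmi:.3f}, lambda = {lam:.3f}")
```

Here r = 0.75 and lambda = 3/7, which is about 0.43; fmi comes out slightly larger than lambda, at about 0.47, because of the correction term 2/(nu + 3) for the finite number of imputations.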

Rule of thumb: How many imputations?
- Preliminary analysis: m = 5
- Paper: m = 20 or even m = 50
Surprisingly few!
Efficiency compared to m = ∞ depends on fmi: $\mathrm{eff} = (1 + \frac{\mathrm{fmi}}{m})^{-1}$
Examples (eff in %):
  m   fmi=0.1  fmi=0.3  fmi=0.5  fmi=0.7  fmi=0.9
  3      97       91       86       81       77
  5      98       94       91       88       85
 10      99       97       95       93       92
 20     100       99       98       97       96
Annotations on the slide: "Oftentimes OK", "Perfect!"
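
The efficiency table can be reproduced directly from the formula eff = 1 / (1 + fmi/m). The following Python sketch does just that; the function name `efficiency` is chosen here for illustration.

```python
def efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many."""
    return 1 / (1 + fmi / m)

fmis = [0.1, 0.3, 0.5, 0.7, 0.9]
table = {m: [round(100 * efficiency(f, m)) for f in fmis]
         for m in [3, 5, 10, 20]}

print("  m  " + "  ".join(f"fmi={f}" for f in fmis))
for m, row in table.items():
    print(f"{m:3d}  " + "  ".join(f"{e:7d}" for e in row))
```

Even with 90% missing information, m = 20 imputations already reach about 96% efficiency, which is why a handful of imputations is often enough.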

Concepts to know
- Idea of mice
- How to aggregate results from imputed data sets?
- How many imputations?

R functions to know
- mice, with, pool

Next time
- Multidimensional Scaling
- Distance metrics