Dealing with missing values, part 2

  1. Dealing with missing values – part 2 Applied Multivariate Statistics – Spring 2012

  2. Overview
     - More on Single Imputation: Shortcomings
     - Multiple Imputation: Accounting for uncertainty
     Appl. Multivariate Statistics - Spring 2012

  3. Single Imputation (ordered from easy / inaccurate to hard / accurate)
     - Unconditional Mean
     - Unconditional Distribution
     - Conditional Mean
     - Conditional Distribution

  4. Example: Blood Pressure - Revisited
     - 30 participants in January (X) and February (Y)
     - MCAR: Delete 23 Y values randomly
     - MAR: Keep Y only where X > 140 (follow-up)
     - MNAR: Record Y only where Y > 140 (test everybody again but only keep values of critical participants)
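
The three mechanisms can be mimicked in a few lines. This is a minimal Python sketch, not the lecture's own code (the course works in R). The function name `make_missing` and the simulated readings, including their mean and spread, are assumptions made purely for illustration.

```python
import random

def make_missing(x, y, mechanism, n_delete=23, threshold=140, seed=0):
    """Return a copy of y in which missing values are replaced by None."""
    if mechanism == "MCAR":
        # Missingness is unrelated to X and Y: delete n_delete values at random.
        drop = set(random.Random(seed).sample(range(len(y)), n_delete))
        return [None if i in drop else yi for i, yi in enumerate(y)]
    if mechanism == "MAR":
        # Missingness depends only on the fully observed X.
        return [yi if xi > threshold else None for xi, yi in zip(x, y)]
    if mechanism == "MNAR":
        # Missingness depends on the value of Y itself.
        return [yi if yi > threshold else None for yi in y]
    raise ValueError(f"unknown mechanism: {mechanism}")

# Hypothetical data: 30 simulated readings, not the values used in the lecture.
rng = random.Random(1)
x = [rng.gauss(125, 25) for _ in range(30)]
y = [rng.gauss(125, 25) for _ in range(30)]

for mechanism in ("MCAR", "MAR", "MNAR"):
    y_mis = make_missing(x, y, mechanism)
    print(mechanism, "observed:", sum(v is not None for v in y_mis))
```

Only the MCAR case fixes the number of deleted values in advance; for MAR and MNAR the count depends on how many readings exceed the threshold.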

  5. Example: Blood Pressure
     [Figure: black points are missing (MAR)]

  6. Unconditional Mean
     + Mean of Y ok
     - Variance of Y wrong

  7. Unconditional Distribution
     + Mean of Y ok, variance better
     - Correlation between X and Y wrong

  8. Conditional Mean: Y = 84 + 0.3*X
     + Conditional mean of Y ok
     + Correlation ok
     - (Conditional) variance wrong

  9. Conditional Distribution: Y = 84 + 0.3*X + e, e ~ N(0, 23^2)
     + Conditional mean of Y ok
     + Correlation ok
     + Conditional variance of Y ok
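
The four single-imputation variants differ only in how a missing Y is filled in. The sketch below is a hedged Python illustration, not the lecture's code. The helper names `fit_line` and `single_impute` and the small data set are hypothetical. The "unconditional distribution" variant is implemented here as a random draw from the observed Y values, which is one possible reading since the slides do not specify the draw.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept a, slope b and residual standard deviation."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (n - 2)) ** 0.5

def single_impute(x, y, method, rng):
    """Fill each None in y once, using the chosen single-imputation method."""
    xs, ys = zip(*[(xi, yi) for xi, yi in zip(x, y) if yi is not None])
    a, b, sd = fit_line(xs, ys)

    def fill(xi):
        if method == "unconditional mean":
            return statistics.fmean(ys)
        if method == "unconditional distribution":
            return rng.choice(ys)               # assumption: draw from observed Y
        if method == "conditional mean":
            return a + b * xi                   # point on the regression line
        if method == "conditional distribution":
            return a + b * xi + rng.gauss(0, sd)  # line plus residual noise
        raise ValueError(f"unknown method: {method}")

    return [yi if yi is not None else fill(xi) for xi, yi in zip(x, y)]

# Hypothetical MAR data: Y is observed only where X > 140.
x = [120, 135, 150, 160, 110, 145, 170, 128]
y = [None, None, 150, 128, None, 139, 142, None]

rng = random.Random(0)
for method in ("unconditional mean", "unconditional distribution",
               "conditional mean", "conditional distribution"):
    print(method, [round(v, 1) for v in single_impute(x, y, method, rng)])
```

All four variants treat the fitted a, b and residual spread as if they were known exactly.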

  10. Conditional Distribution. Problem: We ignore uncertainty
      Y = 84 + 0.3*X + e, e ~ N(0, 23^2)
      - 95%-CI for the intercept (84): [-234; 402]
      - 95%-CI for the slope (0.3): [-1.7; 2.4]

  11. Problem of Single Imputation
      - Too optimistic: the imputation model (e.g. a and b in Y = a + bX) is only estimated; it is not the true model
      - Thus, imputed values have some uncertainty
      - Single Imputation ignores this uncertainty
      - Coverage probability of confidence intervals is wrong
      - Solution: Multiple Imputation. It incorporates both
        - residual error
        - model uncertainty (excluding model mis-specification)

  12. Multiple Imputation: Idea
      - Impute several times
      - Do standard analysis for each imputed data set; get estimate and std. error
      - Aggregate results

  13. Multiple Imputation: Idea
      - Need special imputation schemes that include both
        - uncertainty of residuals
        - uncertainty of model (e.g. values of intercept a and slope b)
      - Rough idea:
        - Fill in random values
        - Iteratively predict values for each variable until some convergence is reached (as in missForest)
        - Sample values for residuals AND for (a,b)
      - Gibbs sampler is used
      - Excellent for intuition (by one of the big guys in the field): http://sites.stat.psu.edu/~jls/mifaq.html
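
Sampling both the residuals and (a,b) can be illustrated for the simplest case of one incomplete variable Y and one complete predictor X. The Python sketch below is an assumption-laden illustration, not the scheme from the lecture: it uses the textbook Bayesian linear-regression draw under a noninformative prior (first the residual variance, then intercept and slope given that variance). The names `draw_parameters` and `multiply_impute` and the data are hypothetical. With only one incomplete variable no iteration over variables is needed.

```python
import random
import statistics

def draw_parameters(xs, ys, rng):
    """Draw (a, b, sigma) instead of plugging in the least-squares estimates."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b_hat = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    rss = sum((y - my - b_hat * (x - mx)) ** 2 for x, y in zip(xs, ys))
    # sigma^2 ~ RSS / chi-square(n - 2); chi-square(k) is Gamma(k / 2, scale 2).
    sigma = (rss / rng.gammavariate((n - 2) / 2, 2)) ** 0.5
    # With X centred, the intercept and slope estimates are uncorrelated,
    # so they can be drawn independently given sigma.
    alpha = rng.gauss(my, sigma / n ** 0.5)
    b = rng.gauss(b_hat, sigma / sxx ** 0.5)
    return alpha - b * mx, b, sigma

def multiply_impute(x, y, m, seed=0):
    """Return m completed copies of y, each based on its own parameter draw."""
    rng = random.Random(seed)
    xs, ys = zip(*[(xi, yi) for xi, yi in zip(x, y) if yi is not None])
    completed = []
    for _ in range(m):
        a, b, sigma = draw_parameters(xs, ys, rng)   # model uncertainty
        completed.append([
            yi if yi is not None else a + b * xi + rng.gauss(0, sigma)  # residual
            for xi, yi in zip(x, y)
        ])
    return completed

# Hypothetical data: Y is observed only where X > 140.
x = [120, 135, 150, 160, 110, 145, 170, 128]
y = [None, None, 150, 128, None, 139, 142, None]
for data in multiply_impute(x, y, m=3):
    print([round(v, 1) for v in data])
```

The observed values are identical across the completed data sets; only the imputed ones vary, and that variation is what later feeds the between-imputation variance.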

  14. Multiple Imputation: Intuition
      Predict missing values accounting for
      - Uncertainty of residuals
      - Uncertainty of parameter estimates

  20. Multiple Imputation: Gibbs sampler (Not for exam)
      Iteration t; repeat until convergence. For each variable i:
      - $\theta_i^{*(t)} \sim P(\theta_i \mid Y_i^{obs}, Y_{-i}^{(t)})$ (intuition: sample (a,b))
      - $Y_i^{*(t)} \sim P(Y_i \mid Y_i^{obs}, Y_{-i}^{(t)}, \theta_i^{*(t)})$ (intuition: predict missings using y = a + bx + e)
      where $Y_i^{(t)} = (Y_i^{obs}, Y_i^{*(t)})$

  21. R package: MICE (Multiple Imputation with Chained Equations)
      - MICE has good default settings; don’t worry about the data type
      - Defaults for data types of columns:
        - numeric: Predictive Mean Matching (pmm) (like fancy linear regression; faster alternative: linear regression)
        - factor, 2 lev: Logistic Regression (logreg)
        - factor, >2 lev: Multinomial logit model (polyreg)
        - ordered, >2 lev: Ordered logit model (polr)

  22. Aggregation of estimates
      - $\hat{Q}_i$: Estimate of imputation i
      - $U_i$: Variance of estimate (= square of std. error)
      - Assume: $(\hat{Q} - Q) / \sqrt{U} \approx N(0, 1)$
      - Average estimate: $\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j$
      - Within-imputation variance: $\bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j$
      - Between-imputation variance: $B = \frac{1}{m-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2$
      - Total variance: $T = \bar{U} + \left(1 + \frac{1}{m}\right) B$
      - Approximately: $(\bar{Q} - Q) / \sqrt{T} \sim t_\nu$ with $\nu = (m-1) \left(1 + \frac{m \bar{U}}{(1+m) B}\right)^2$
      - 95%-CI: $\bar{Q} \pm t_{\nu; 0.975} \sqrt{T}$
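
The aggregation rules translate almost line by line into code. The following is a small Python sketch with made-up inputs; the function name `pool_estimates` and the example numbers are hypothetical and it is not meant as a stand-in for the `pool` function in R. It assumes at least two imputations and a between-imputation variance greater than zero, since the degrees-of-freedom formula is undefined otherwise.

```python
import statistics

def pool_estimates(estimates, variances):
    """Combine m estimates and their variances (squared std. errors)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # average estimate
    u_bar = statistics.fmean(variances)        # within-imputation variance
    b = statistics.variance(estimates)         # between-imputation variance (divides by m - 1)
    t = u_bar + (1 + 1 / m) * b                # total variance
    nu = (m - 1) * (1 + m * u_bar / ((1 + m) * b)) ** 2   # degrees of freedom
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t, "nu": nu}

# Hypothetical results from m = 3 imputed data sets.
pooled = pool_estimates(estimates=[10.0, 12.0, 11.0], variances=[4.0, 4.0, 4.0])
print(pooled)
```

The 95% confidence interval then follows as `q_bar` plus or minus the 0.975 quantile of a t distribution with `nu` degrees of freedom times the square root of `t`; the quantile would come from a table or a statistics library.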

  23. Multiple Imputation with MICE
      Do manually if you have a non-standard analysis

  24. How much uncertainty due to missings?
      - Relative increase in variance due to nonresponse: $r = \left(1 + \frac{1}{m}\right) B / \bar{U}$
      - Fraction (or rate) of missing information fmi (!! not the same as fraction of missing OBSERVATIONS): $fmi = \frac{r + 2/(\nu + 3)}{r + 1}$
      - Proportion of the total variance that is attributed to the missing data: $\lambda = \left(1 + \frac{1}{m}\right) B / T$
      (Returned by mice)
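
As a worked example of these three quantities, the Python sketch below computes them from a within-imputation variance, a between-imputation variance and the number of imputations m. The function name `missing_information` and the input numbers are hypothetical. The degrees of freedom are taken from the aggregation formula, rewritten in terms of r as (m - 1)(1 + 1/r)^2.

```python
def missing_information(u_bar, b, m):
    """Return r, lambda and fmi for within variance u_bar and between variance b."""
    r = (1 + 1 / m) * b / u_bar                 # relative increase in variance
    t = u_bar + (1 + 1 / m) * b                 # total variance
    lam = (1 + 1 / m) * b / t                   # share of variance due to missing data
    nu = (m - 1) * (1 + 1 / r) ** 2             # degrees of freedom
    fmi = (r + 2 / (nu + 3)) / (r + 1)          # fraction of missing information
    return r, lam, fmi

# Hypothetical values: u_bar = 4, b = 1, m = 3.
r, lam, fmi = missing_information(u_bar=4.0, b=1.0, m=3)
print(round(r, 3), round(lam, 3), round(fmi, 3))
```

Note that lambda equals r / (r + 1), so fmi differs from lambda only through the small-sample term 2 / (nu + 3).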

  25. Rule of thumb: How many imputations?
      - Preliminary analysis: m = 5
      - Paper: m = 20 or even m = 50
      - Surprisingly few!
      - Efficiency compared to $m = \infty$ depends on fmi: $eff = \left(1 + \frac{fmi}{m}\right)^{-1}$
      - Examples (eff in %):

          m    fmi=0.1  fmi=0.3  fmi=0.5  fmi=0.7  fmi=0.9
          3       97       91       86       81       77
          5       98       94       91       88       85
          10      99       97       95       93       92
          20     100       99       98       97       96

      Slide annotations on the table: "Oftentimes OK", "Perfect!"
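
The efficiency values follow directly from the formula. A minimal Python sketch (the function name `efficiency` and the variable names are chosen here for illustration) recomputes them for the same m and fmi combinations, rounded to whole percent:

```python
def efficiency(m, fmi):
    """Efficiency of m imputations relative to infinitely many imputations."""
    return 1 / (1 + fmi / m)

fmis = (0.1, 0.3, 0.5, 0.7, 0.9)
table = {m: [round(100 * efficiency(m, f)) for f in fmis] for m in (3, 5, 10, 20)}

for m, row in table.items():
    print(m, row)
```

Even with fmi = 0.9, going from m = 5 to m = 20 gains only about ten percentage points, which is why a handful of imputations is often enough for a first look.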

  26. Concepts to know
      - Idea of mice
      - How to aggregate results from imputed data sets?
      - How many imputations?

  27. R functions to know
      - mice, with, pool

  28. Next time
      - Multidimensional Scaling
      - Distance metrics
