Machine Learning for Healthcare (HST.956, 6.S897)
Lecture 15: Causal Inference Part 2


  1. Machine Learning for Healthcare (HST.956, 6.S897)
     Lecture 15: Causal Inference Part 2
     David Sontag
     Acknowledgement: adapted from slides by Uri Shalit (Technion)

  2. Reminder: Potential Outcomes
  • Each unit (individual) $x_i$ has two potential outcomes:
    – $Y_0(x_i)$ is the potential outcome had the unit not been treated: "control outcome"
    – $Y_1(x_i)$ is the potential outcome had the unit been treated: "treated outcome"
  • Conditional average treatment effect for unit $i$:
    $\mathrm{CATE}(x_i) = \mathbb{E}_{Y_1 \sim p(Y_1 \mid x_i)}[Y_1 \mid x_i] - \mathbb{E}_{Y_0 \sim p(Y_0 \mid x_i)}[Y_0 \mid x_i]$
  • Average Treatment Effect:
    $\mathrm{ATE} = \mathbb{E}_{x \sim p(x)}[\mathrm{CATE}(x)]$

  3. Two common approaches for counterfactual inference: covariate adjustment and propensity scores

  4. Covariate adjustment (reminder)
  • Explicitly model the relationship between treatment, confounders, and outcome:
    [Diagram: covariates (features) $x_1, x_2, \ldots, x_d$ and treatment $T$ feed into a regression model $f(x, T)$, which predicts the outcome $y$]

  5. Covariate adjustment (reminder)
  • Under ignorability, $\mathrm{CATE}(x) = \mathbb{E}[Y_1 \mid T = 1, x] - \mathbb{E}[Y_0 \mid T = 0, x]$
  • Fit a model $f(x, t) \approx \mathbb{E}[Y_t \mid T = t, x]$, then:
    $\widehat{\mathrm{CATE}}(x_i) = f(x_i, 1) - f(x_i, 0)$
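A minimal sketch of this procedure (not from the lecture: it assumes scikit-learn, a synthetic dataset, and a gradient-boosting outcome model, all of which are illustrative choices):

```python
# Covariate adjustment: fit f(x, t) ~ E[Y | T = t, x], then contrast f(x, 1) - f(x, 0).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 1))                                   # covariate
t = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))               # confounded treatment
y = 2.0 * x[:, 0] + 1.5 * t + rng.normal(scale=0.5, size=n)   # true ATE = 1.5

# Fit a single outcome model on the concatenation [x, t].
f = GradientBoostingRegressor().fit(np.column_stack([x, t]), y)

# CATE-hat(x_i) = f(x_i, 1) - f(x_i, 0); averaging gives the ATE estimate.
cate_hat = (f.predict(np.column_stack([x, np.ones(n)])) -
            f.predict(np.column_stack([x, np.zeros(n)])))
print("ATE estimate:", cate_hat.mean())   # close to 1.5 under ignorability
```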

  6. Covariate adjustment with linear models
  • Assume that (blood pressure as outcome, age as covariate, medication as treatment):
    $Y_t(x) = \beta x + \gamma \cdot t + \epsilon_t, \quad \mathbb{E}[\epsilon_t] = 0$
  • Then:
    $\mathrm{CATE}(x) := \mathbb{E}[Y_1(x) - Y_0(x)] = \mathbb{E}[(\beta x + \gamma + \epsilon_1) - (\beta x + \epsilon_0)] = \gamma$

  7. Covariate adjustment with linear models
  • Assume that (blood pressure as outcome, age as covariate, medication as treatment):
    $Y_t(x) = \beta x + \gamma \cdot t + \epsilon_t, \quad \mathbb{E}[\epsilon_t] = 0$
  • Then:
    $\mathrm{CATE}(x) := \mathbb{E}[Y_1(x) - Y_0(x)] = \mathbb{E}[(\beta x + \gamma + \epsilon_1) - (\beta x + \epsilon_0)] = \gamma$
    $\mathrm{ATE} := \mathbb{E}_{p(x)}[\mathrm{CATE}(x)] = \gamma$

  8. Covariate adjustment with linear models
  • Assume that (blood pressure as outcome, age as covariate, medication as treatment):
    $Y_t(x) = \beta x + \gamma \cdot t + \epsilon_t, \quad \mathbb{E}[\epsilon_t] = 0, \quad \mathrm{ATE} := \mathbb{E}_{p(x)}[\mathrm{CATE}(x)] = \gamma$
  • For causal inference we need to estimate $\gamma$ well, not $Y_t(x)$: identification, not prediction
  • This is a major difference between ML and statistics

  9. What happens if the true model is not linear?
  • True data-generating process, $x \in \mathbb{R}$:
    $Y_t = \beta x + \gamma \cdot t + \delta \cdot x^2, \quad \mathrm{ATE} = \mathbb{E}[Y_1 - Y_0] = \gamma$
  • Hypothesized model: $\hat{Y}_t = \hat{\beta} x + \hat{\gamma} \cdot t$
  • The least-squares fit converges to
    $\hat{\gamma} = \gamma + \delta \cdot \dfrac{\mathbb{E}[x^2]\,\mathbb{E}[x^2 t] - \mathbb{E}[x t]\,\mathbb{E}[x^3]}{\mathbb{E}[x^2]\,\mathbb{E}[t^2] - \mathbb{E}[x t]^2}$
  • Depending on $\delta$, this can be made arbitrarily large or small!
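A small simulation of this failure mode (synthetic numbers, not from the lecture) shows the linear model's coefficient on $t$ absorbing part of the quadratic term whenever treatment is correlated with $x$:

```python
# Misspecification bias: true Y = beta*x + gamma*t + delta*x^2, but we fit y ~ (x, t).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta, gamma, delta = 1.0, 2.0, 3.0

x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))   # treatment depends on x
y = beta * x + gamma * t + delta * x**2

# OLS on the misspecified linear model: gamma_hat is biased away from gamma = 2.
gamma_hat = np.linalg.lstsq(np.column_stack([x, t]), y, rcond=None)[0][1]
print("gamma_hat:", gamma_hat)        # far from 2.0; the bias scales with delta

# Including the x^2 term restores the correct coefficient on t.
gamma_ok = np.linalg.lstsq(np.column_stack([x, t, x**2]), y, rcond=None)[0][1]
print("with x^2 term:", gamma_ok)     # ~2.0
```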

  10. Covariate adjustment with non-linear models
  • Random forests and Bayesian trees: Hill (2011), Athey & Imbens (2015), Wager & Athey (2015)
  • Gaussian processes: Hoyer et al. (2009), Zigler et al. (2012)
  • Neural networks: Beck et al. (2000), Johansson et al. (2016), Shalit et al. (2016), Lopez-Paz et al. (2016)

  11. Example: Gaussian processes
    [Figure: two scatter plots of outcome $y$ vs. covariate $x$ showing treated and control units. Left panel ("Independent GP"): separate treated and control models. Right panel ("Grouped GP"): joint treated and control model. Each panel shows fitted curves for $Y_1(x)$ and $Y_0(x)$. Figures: Vincent Dorie & Jennifer Hill]
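A sketch of the two modeling choices in the figure (assuming scikit-learn's GaussianProcessRegressor; here the "grouped" model simply takes $t$ as an extra input, one simple way to couple the two response surfaces):

```python
# Independent vs. grouped GP outcome models for treated and control units.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(10, 60, size=(n, 1))               # e.g., age
t = rng.binomial(1, (x[:, 0] - 10) / 50)           # older units treated more often
y = 100 + 0.1 * x[:, 0] + 5 * t + rng.normal(scale=2, size=n)  # true effect = 5

kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)

# Independent GPs: one model per group, no sharing of statistical strength.
gp1 = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x[t == 1], y[t == 1])
gp0 = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x[t == 0], y[t == 0])

# Grouped GP: one model over (x, t), coupling the two response surfaces.
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(np.column_stack([x, t]), y)

grid = np.linspace(10, 60, 50).reshape(-1, 1)
cate_indep = gp1.predict(grid) - gp0.predict(grid)
cate_group = (gp.predict(np.column_stack([grid, np.ones_like(grid)])) -
              gp.predict(np.column_stack([grid, np.zeros_like(grid)])))
print(cate_indep.mean(), cate_group.mean())        # compare to the true effect of 5
```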

  12. Example: Neural networks
    [Figure: covariates feed into shared representation layers $\Phi$, which split into per-treatment heads producing the predicted potential outcomes; the heads are trained with a joint learning objective]
    Shalit, Johansson, Sontag. Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. ICML, 2017
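A compact sketch of this kind of architecture (assuming PyTorch; layer sizes are arbitrary, and this omits the representation-balancing penalty and other details of the paper):

```python
# TARNet-style network: shared representation Phi(x) feeding two outcome heads.
import torch
import torch.nn as nn

class TARNet(nn.Module):
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        # Shared representation Phi, used by both potential-outcome heads.
        self.phi = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.head0 = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                   nn.Linear(d_hidden, 1))   # predicts Y_0
        self.head1 = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                                   nn.Linear(d_hidden, 1))   # predicts Y_1

    def forward(self, x):
        r = self.phi(x)
        return self.head0(r).squeeze(-1), self.head1(r).squeeze(-1)

def factual_loss(model, x, t, y):
    # Each unit supervises only the head matching its observed treatment.
    y0_hat, y1_hat = model(x)
    y_hat = torch.where(t == 1, y1_hat, y0_hat)
    return ((y_hat - y) ** 2).mean()
```

The full method in the paper additionally penalizes the distance between the treated and control distributions of $\Phi(x)$, encouraging a representation under which the two groups look balanced.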

  13. Matching β€’ Find each unit’s long-lost counterfactual identical twin, check up on his outcome

  14. Matching β€’ Find each unit’s long-lost counterfactual identical twin, check up on his outcome Obama, had he gone to law school Obama, had he gone to business school

  15. Matching β€’ Find each unit’s long-lost counterfactual identical twin, check up on his outcome β€’ Used for estimating both ATE and CATE

  16. Match to nearest neighbor from opposite group
    [Figure: treated and control units plotted by age (x-axis) vs. Charlson comorbidity index (y-axis)]

  17. Match to nearest neighbor from opposite group
    [Figure: same plot, with each unit matched to its nearest neighbor in the opposite group]

  18. 1-NN Matching
  • Let $d(\cdot, \cdot)$ be a metric between $x$'s
  • For each $i$, define $j(i) = \mathrm{argmin}_{j \ \mathrm{s.t.}\ t_j \neq t_i}\, d(x_j, x_i)$;
    $j(i)$ is the nearest counterfactual neighbor of $i$
  • If $t_i = 1$ (unit $i$ is treated): $\widehat{\mathrm{CATE}}(x_i) = y_i - y_{j(i)}$
  • If $t_i = 0$ (unit $i$ is control): $\widehat{\mathrm{CATE}}(x_i) = y_{j(i)} - y_i$

  19. 1-NN Matching
  • Let $d(\cdot, \cdot)$ be a metric between $x$'s
  • For each $i$, define $j(i) = \mathrm{argmin}_{j \ \mathrm{s.t.}\ t_j \neq t_i}\, d(x_j, x_i)$;
    $j(i)$ is the nearest counterfactual neighbor of $i$
  • $\widehat{\mathrm{CATE}}(x_i) = (2 t_i - 1)\,(y_i - y_{j(i)})$
  • $\widehat{\mathrm{ATE}} = \frac{1}{n} \sum_{i=1}^{n} \widehat{\mathrm{CATE}}(x_i)$
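A direct implementation of this estimator (a sketch, not from the lecture: it uses Euclidean distance for the metric $d$ and scikit-learn's NearestNeighbors for the search):

```python
# 1-NN matching: CATE-hat(x_i) = (2*t_i - 1) * (y_i - y_{j(i)}), averaged for the ATE.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def one_nn_matching_ate(x, t, y):
    """x: (n, d) covariates; t: (n,) binary treatment; y: (n,) outcomes."""
    treated, control = np.where(t == 1)[0], np.where(t == 0)[0]

    # For each unit, find its nearest neighbor in the opposite treatment group.
    nn_c = NearestNeighbors(n_neighbors=1).fit(x[control])
    nn_t = NearestNeighbors(n_neighbors=1).fit(x[treated])
    j = np.empty(len(t), dtype=int)
    j[treated] = control[nn_c.kneighbors(x[treated])[1][:, 0]]
    j[control] = treated[nn_t.kneighbors(x[control])[1][:, 0]]

    cate_hat = (2 * t - 1) * (y - y[j])   # sign flips for control units
    return cate_hat.mean()                # ATE-hat = (1/n) * sum of CATE-hats
```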

  20. Matching
  • Interpretable, especially in the small-sample regime
  • Nonparametric
  • Heavily reliant on the underlying metric
  • Could be misled by features which don't affect the outcome

  21. Covariate adjustment and matching
  • Matching is equivalent to covariate adjustment with two 1-nearest-neighbor classifiers:
    $\hat{Y}_1(x) = y_{\mathrm{NN}_1(x)}, \quad \hat{Y}_0(x) = y_{\mathrm{NN}_0(x)}$,
    where $\mathrm{NN}_t(x)$ is the nearest neighbor of $x$ among units with treatment assignment $t \in \{0, 1\}$
  • 1-NN matching is in general inconsistent, though with only small bias (Imbens 2004)
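The equivalence is easy to see in code (a sketch assuming scikit-learn; it also assumes no duplicate covariate vectors, so each unit is its own nearest same-group neighbor):

```python
# Matching viewed as covariate adjustment with two 1-NN outcome models.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def matching_as_adjustment_ate(x, t, y):
    # Y1-hat: 1-NN model fit on treated units; Y0-hat: fit on control units.
    y1_hat = KNeighborsRegressor(n_neighbors=1).fit(x[t == 1], y[t == 1]).predict(x)
    y0_hat = KNeighborsRegressor(n_neighbors=1).fit(x[t == 0], y[t == 0]).predict(x)
    # For a treated unit, y1_hat equals its own outcome, so y1_hat - y0_hat
    # reduces to the 1-NN matching estimate; symmetrically for control units.
    return (y1_hat - y0_hat).mean()
```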

  22. Two common approaches for counterfactual inference: covariate adjustment and propensity scores

  23. Propensity scores
  • A tool for estimating the ATE
  • Basic idea: turn an observational study into a pseudo-randomized trial by re-weighting samples, similar to importance sampling

  24. Inverse propensity score re-weighting
  • $p(x \mid t = 0) \neq p(x \mid t = 1)$: the control and treated covariate distributions differ
    [Figure: treated and control units plotted by $x_1$ = age vs. $x_2$ = Charlson comorbidity index]

  25. Inverse propensity score re-weighting
  • $p(x \mid t = 0) \cdot w_0(x) \approx p(x \mid t = 1) \cdot w_1(x)$: the re-weighted control and re-weighted treated distributions match
    [Figure: same plot, with marker sizes proportional to the weights; $x_1$ = age, $x_2$ = Charlson comorbidity index]

  26. Propensity score
  • Propensity score: $p(T = 1 \mid x)$, estimated using machine learning tools
  • Samples are re-weighted by the inverse propensity score of the treatment they received

  27. Propensity scores – algorithm
  Inverse probability of treatment weighted (IPW) estimator. How to calculate the ATE with propensity scores for a sample $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$:
  1. Use any ML method to estimate $\hat{p}(T = t \mid x)$
  2. $\widehat{\mathrm{ATE}} = \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 1} \frac{y_i}{\hat{p}(t_i = 1 \mid x_i)} - \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 0} \frac{y_i}{\hat{p}(t_i = 0 \mid x_i)}$
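A sketch of the whole pipeline (assuming a logistic-regression propensity model from scikit-learn; the clipping of extreme propensities is a common practical safeguard, not part of the slide):

```python
# Inverse probability of treatment weighted (IPW) estimate of the ATE.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(x, t, y, clip=0.01):
    # Step 1: estimate the propensity score p-hat(T = 1 | x) with any ML method.
    e_hat = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
    e_hat = np.clip(e_hat, clip, 1 - clip)   # guard against extreme weights

    # Step 2: weight each outcome by the inverse probability of the treatment
    # the unit actually received, then take the difference of the two averages.
    n = len(t)
    return (np.sum(y[t == 1] / e_hat[t == 1]) -
            np.sum(y[t == 0] / (1 - e_hat[t == 0]))) / n
```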

  28. Propensity scores – algorithm
  Inverse probability of treatment weighted (IPW) estimator. How to calculate the ATE with propensity scores for a sample $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$:
  1. In a randomized trial, $p(T = t \mid x) = 0.5$
  2. $\widehat{\mathrm{ATE}} = \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 1} \frac{y_i}{\hat{p}(t_i = 1 \mid x_i)} - \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 0} \frac{y_i}{\hat{p}(t_i = 0 \mid x_i)}$

  29. Propensity scores – algorithm
  Inverse probability of treatment weighted (IPW) estimator. How to calculate the ATE with propensity scores for a sample $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$:
  1. In a randomized trial, $p(T = t \mid x) = 0.5$
  2. $\widehat{\mathrm{ATE}} = \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 1} \frac{y_i}{0.5} - \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 0} \frac{y_i}{0.5}$

  30. Propensity scores – algorithm
  Inverse probability of treatment weighted (IPW) estimator. How to calculate the ATE with propensity scores for a sample $(x_1, t_1, y_1), \ldots, (x_n, t_n, y_n)$:
  1. In a randomized trial, $p = 0.5$
  2. $\widehat{\mathrm{ATE}} = \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 1} \frac{y_i}{0.5} - \frac{1}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 0} \frac{y_i}{0.5} = \frac{2}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 1} y_i - \frac{2}{n} \sum_{i \ \mathrm{s.t.}\ t_i = 0} y_i$
