Machine Learning for Healthcare HST.956, 6.S897 Lecture 15: Causal Inference Part 2 David Sontag Acknowledgement: adapted from slides by Uri Shalit (Technion)
Reminder: Potential Outcomes
• Each unit (individual) x_i has two potential outcomes:
  – Y_0(x_i) is the potential outcome had the unit not been treated ("control outcome")
  – Y_1(x_i) is the potential outcome had the unit been treated ("treated outcome")
• Conditional average treatment effect for unit x_i: CATE(x_i) = E_{Y_1 ~ p(Y_1 | x_i)}[Y_1 | x_i] − E_{Y_0 ~ p(Y_0 | x_i)}[Y_0 | x_i]
• Average Treatment Effect: ATE = E_{x ~ p(x)}[CATE(x)]
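To make these definitions concrete, here is a tiny worked example with made-up numbers in which both potential outcomes are known for every unit — something possible only in a simulation, since real data reveals just one of the two per unit:

```python
# Tiny worked example (my own toy numbers) in which BOTH potential outcomes
# are known for every unit; real data reveals only one of Y0(x_i), Y1(x_i).
units = {
    "x1": {"Y0": 60.0, "Y1": 70.0},   # CATE(x1) = 10
    "x2": {"Y0": 80.0, "Y1": 75.0},   # CATE(x2) = -5 (treatment can hurt)
    "x3": {"Y0": 50.0, "Y1": 65.0},   # CATE(x3) = 15
}

# CATE per unit, then ATE as the average of the CATEs over the population:
cate = {name: po["Y1"] - po["Y0"] for name, po in units.items()}
ate = sum(cate.values()) / len(cate)
```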
Two common approaches for counterfactual inference Covariate adjustment Propensity scores
Covariate adjustment (reminder)
Explicitly model the relationship between treatment, confounders, and outcome:
[Diagram: covariates (features) x_1, …, x_d and treatment t feed a regression model f(x, t) ↦ y, the outcome]
Covariate adjustment (reminder)
• Under ignorability, CATE(x) = E[Y_1 | T = 1, x] − E[Y_0 | T = 0, x]
• Fit a model f(x, t) ≈ E[Y_t | T = t, x], then: CATE_hat(x_i) = f(x_i, 1) − f(x_i, 0)
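A minimal sketch of this recipe on simulated data. The data-generating numbers are my own, and fitting one least-squares regression per treatment arm is just one simple way to realize f(x, t) — the slide does not prescribe a model class:

```python
import random

# Toy observational data (my own setup): x is an age-like covariate,
# t a binary treatment, y an outcome whose true treatment effect is 3.
random.seed(0)
n = 2000
data = []
for _ in range(n):
    x = random.uniform(0, 10)
    t = 1 if random.random() < x / 10 else 0   # confounding: larger x -> more treatment
    y = 2.0 * x + 3.0 * t + random.gauss(0, 1)
    data.append((x, t, y))

def ols_1d(pairs):
    """Least-squares fit of y ~ a*x + b; returns (a, b)."""
    m = len(pairs)
    mx = sum(x for x, _ in pairs) / m
    my = sum(y for _, y in pairs) / m
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    a = sxy / sxx
    return a, my - a * mx

# Covariate adjustment: model E[Y | T = t, x] (here, one regression per arm),
# then contrast the two predictions at the same x.
a1, b1 = ols_1d([(x, y) for x, t, y in data if t == 1])
a0, b0 = ols_1d([(x, y) for x, t, y in data if t == 0])

def f(x, t):
    return a1 * x + b1 if t == 1 else a0 * x + b0

def cate_hat(x):
    return f(x, 1) - f(x, 0)

ate_hat = sum(cate_hat(x) for x, _, _ in data) / n
```

Despite the confounded assignment, adjusting for x recovers an estimate close to the true effect of 3.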
Covariate adjustment with linear models
• Assume that: Y_t(x) = βx + γ·t + ε_t, with E[ε_t] = 0
  (running example: Y = blood pressure, x = age, t = medication)
• Then: CATE(x) := E[Y_1(x) − Y_0(x)] = E[(βx + γ + ε_1) − (βx + ε_0)] = γ
• And: ATE := E_{p(x)}[CATE(x)] = γ
Covariate adjustment with linear models
• Assume that: Y_t(x) = βx + γ·t + ε_t, with E[ε_t] = 0; then ATE := E_{p(x)}[CATE(x)] = γ
• For causal inference, we need to estimate γ well, not Y_t(x): identification, not prediction
• Major difference between ML and statistics
What happens if the true model is not linear?
• True data-generating process, x ∈ ℝ: Y_t(x) = βx + γ·t + δ·x² + ε_t, so ATE = E[Y_1 − Y_0] = γ
• Hypothesized (misspecified) model: Ŷ_t(x) = β̂x + γ̂·t
• The least-squares fit then satisfies: γ̂ = γ + δ · (E[xt]E[x³] − E[x²]E[x²t]) / (E[xt]² − E[x²]E[t²])
• Depending on δ, the bias can be made arbitrarily large or small!
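This omitted-x² bias can be checked numerically. In the sketch below (noiseless toy data of my own choosing, so sample moments stand in for population moments and the identity holds exactly), I fit the misspecified model ŷ = β̂x + γ̂t by least squares and compare γ̂ to my reading of the slide's bias expression, γ + δ·(E[xt]E[x³] − E[x²]E[x²t])/(E[xt]² − E[x²]E[t²]):

```python
# Noiseless data from the true DGP y_t(x) = beta*x + gamma*t + delta*x^2,
# with confounded treatment assignment (all numbers are my own toy choices).
beta, gamma, delta = 1.0, 1.0, 2.0
xs = [i / 100 for i in range(1, 201)]              # x in (0, 2]
ts = [1.0 if x > 1 else 0.0 for x in xs]           # treated iff x > 1
ys = [beta * x + gamma * t + delta * x * x for x, t in zip(xs, ts)]

n = len(xs)
mean = lambda vals: sum(vals) / n
Ext  = mean([x * t for x, t in zip(xs, ts)])
Exx  = mean([x * x for x in xs])
Ett  = mean([t * t for t in ts])
Exy  = mean([x * y for x, y in zip(xs, ys)])
Ety  = mean([t * y for t, y in zip(ts, ys)])
Ex3  = mean([x ** 3 for x in xs])
Exxt = mean([x * x * t for x, t in zip(xs, ts)])

# Least-squares fit of the misspecified model y ~ beta_hat*x + gamma_hat*t
# (no intercept), via the 2x2 normal equations:
det = Exx * Ett - Ext * Ext
gamma_hat = (Exx * Ety - Ext * Exy) / det

# Bias predicted by the omitted-x^2 formula, on the same sample moments:
predicted = gamma + delta * (Ext * Ex3 - Exx * Exxt) / (Ext * Ext - Exx * Ett)
```

Here γ̂ lands near 2 even though the true γ is 1 — the misspecified regression attributes part of the x² effect to the treatment.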
Covariate adjustment with non-linear models
• Random forests and Bayesian trees: Hill (2011), Athey & Imbens (2015), Wager & Athey (2015)
• Gaussian processes: Hoyer et al. (2009), Zigler et al. (2012)
• Neural networks: Beck et al. (2000), Johansson et al. (2016), Shalit et al. (2016), Lopez-Paz et al. (2016)
Example: Gaussian processes
[Figure: fits of Y_1(x) and Y_0(x) to treated and control outcome data. Left: separate treated and control models (independent GPs). Right: joint treated and control model (grouped GP). Figures: Vincent Dorie & Jennifer Hill]
Example: Neural networks
[Diagram: covariates feed neural network layers into a shared representation Φ, from which the two potential outcomes are predicted; a learning objective ties the components together]
Shalit, Johansson, Sontag. Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. ICML, 2017
Matching
• Find each unit's long-lost counterfactual identical twin, check up on their outcome
(Illustration: "Obama, had he gone to law school" vs. "Obama, had he gone to business school")
Matching
• Find each unit's long-lost counterfactual identical twin, check up on their outcome
• Used for estimating both ATE and CATE
Match to nearest neighbor from opposite group
[Figure: treated and control units plotted by age (x-axis) and Charlson comorbidity index (y-axis); each unit is matched to its nearest neighbor from the opposite group]
1-NN Matching
• Let d(·, ·) be a metric between x's
• For each i, define j(i) = argmin_{j s.t. t_j ≠ t_i} d(x_j, x_i); j(i) is the nearest counterfactual neighbor of i
• If t_i = 1 (unit i is treated): CATE_hat(x_i) = y_i − y_{j(i)}
• If t_i = 0 (unit i is control): CATE_hat(x_i) = y_{j(i)} − y_i
• Combined: CATE_hat(x_i) = (2t_i − 1)(y_i − y_{j(i)})
• ATE_hat = (1/n) Σ_{i=1}^n CATE_hat(x_i)
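A minimal sketch of the 1-NN matching estimator on hand-made 1-D data where the treatment effect is exactly 3 for every unit. Since each unit has an opposite-group twin at the same x, matching recovers the effect exactly:

```python
# 1-NN matching on a 1-D covariate (toy data I made up; true effect is 3,
# and every unit has an opposite-group unit at the same x value).
data = [
    # (x, t, y) with y = x + 3*t
    (1.0, 1, 4.0), (2.0, 1, 5.0), (3.0, 1, 6.0),
    (1.0, 0, 1.0), (2.0, 0, 2.0), (3.0, 0, 3.0),
]

def match(i, data):
    """Index j(i) of the nearest neighbor with the opposite treatment."""
    xi, ti, _ = data[i]
    candidates = [j for j, (_, tj, _) in enumerate(data) if tj != ti]
    return min(candidates, key=lambda j: abs(data[j][0] - xi))

def cate_hat(i, data):
    _, ti, yi = data[i]
    yj = data[match(i, data)][2]
    return (2 * ti - 1) * (yi - yj)   # y_i - y_j(i) if treated, reversed if control

ate_hat = sum(cate_hat(i, data) for i in range(len(data))) / len(data)
```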
Matching
• Interpretable, especially in the small-sample regime
• Nonparametric
• Heavily reliant on the underlying metric
• Could be misled by features which don't affect the outcome
Covariate adjustment and matching
• Matching is equivalent to covariate adjustment with two 1-nearest-neighbor estimators: f_1(x) = y_{NN_1(x)}, f_0(x) = y_{NN_0(x)}, where NN_t(x) is the nearest neighbor of x among units with treatment assignment t ∈ {0, 1}
• 1-NN matching is in general inconsistent, though with only a small bias (Imbens 2004)
Two common approaches for counterfactual inference Covariate adjustment Propensity scores
Propensity scores
• Tool for estimating the ATE
• Basic idea: turn an observational study into a pseudo-randomized trial by re-weighting samples, similar to importance sampling
Inverse propensity score re-weighting
p(x | t = 0) ≠ p(x | t = 1)
[Figure: treated and control populations plotted by x_1 = age and x_2 = Charlson comorbidity index; the two covariate distributions differ]
Inverse propensity score re-weighting
p(x | t = 0) · w_0(x) ≈ p(x | t = 1) · w_1(x)
[Figure: the same populations after re-weighting; the reweighted control and reweighted treated distributions now overlap]
Propensity score
• Propensity score: p(T = 1 | x), estimated using machine learning tools
• Samples are re-weighted by the inverse of the propensity score of the treatment they actually received
Propensity scores – algorithm
Inverse probability of treatment weighted (IPTW) estimator: how to calculate the ATE with the propensity score, for a sample (x_1, t_1, y_1), …, (x_n, t_n, y_n)
1. Use any ML method to estimate p̂(T = t | x)
2. ATE_hat = (1/n) Σ_{i s.t. t_i = 1} y_i / p̂(T = 1 | x_i) − (1/n) Σ_{i s.t. t_i = 0} y_i / p̂(T = 0 | x_i)
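A sketch of step 2 on simulated data. To isolate the estimator itself, I plug in the true propensity p(T = 1 | x), which is known in the simulation; in practice, step 1 would produce p̂ with any ML method (e.g., logistic regression). All data-generating numbers here are my own:

```python
import random

# Simulated data (my own numbers): x in [0, 1], true propensity
# p(t = 1 | x) = 0.2 + 0.6*x, outcome y = 5*x + 2*t, so the true ATE is 2.
random.seed(1)
n = 20000
rows = []
for _ in range(n):
    x = random.random()
    p = 0.2 + 0.6 * x                       # known propensity (stands in for p_hat)
    t = 1 if random.random() < p else 0
    y = 5.0 * x + 2.0 * t
    rows.append((x, t, y, p))

# Naive difference in means is biased: treated units tend to have larger x.
n1 = sum(1 for _, t, _, _ in rows if t == 1)
n0 = n - n1
naive = (sum(y for _, t, y, _ in rows if t == 1) / n1
         - sum(y for _, t, y, _ in rows if t == 0) / n0)

# Inverse-probability-of-treatment-weighted estimate of the ATE:
ate_ipw = (sum(y / p for _, t, y, p in rows if t == 1) / n
           - sum(y / (1.0 - p) for _, t, y, p in rows if t == 0) / n)
```

The naive contrast overshoots the true effect of 2 (it picks up the confounding through x), while the IPTW estimate lands close to 2.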
Propensity scores – algorithm
Sanity check: in a randomized trial, p(T = t | x) = 0.5, and the IPTW estimator reduces to
ATE_hat = (1/n) Σ_{i s.t. t_i = 1} y_i / 0.5 − (1/n) Σ_{i s.t. t_i = 0} y_i / 0.5 = (2/n) Σ_{i s.t. t_i = 1} y_i − (2/n) Σ_{i s.t. t_i = 0} y_i