

  1. Causal Inference
  An introduction based on S. Wager’s course on Causal Inference (OIT 661)
  Imke Mayer, November 23, 2018, Group Meeting, CMAP

  2. Outline
  1. Treatment effect estimation in randomized experiments
  2. Beyond a single randomized controlled trial
  3. Inverse-propensity weighting
  4. Double robustness property
  5. Cross-fitting and machine learning for ATE estimation
  6. Conclusion

  3. Treatment effect estimation in randomized experiments

  4-5. Definitions and notation
  Given a treatment, define the causal effect via potential outcomes:
  Causal effect: a binary treatment W_i ∈ {0, 1} on the i-th individual, with potential outcomes Y_i(1) and Y_i(0). The individual causal effect of the treatment is ∆_i = Y_i(1) − Y_i(0).
  • Problem: ∆_i is never observed.
  • (Partial) solution: randomized experiments let us learn certain properties of ∆_i.
  • Average treatment effect: τ = E[∆_i] = E[Y_i(1) − Y_i(0)].

  6-7. Average treatment effect (ATE)
  Average treatment effect: τ = E[∆_i] = E[Y_i(1) − Y_i(0)].
  Idea: estimate τ using large randomized experiments.
  Assumptions: random variables (Y, W) taking values in R × {0, 1}; we observe n i.i.d. samples (Y_i, W_i), each satisfying:
  • Y_i = Y_i(W_i) (SUTVA),
  • W_i ⊥⊥ {Y_i(0), Y_i(1)} (random treatment assignment).
  Difference-in-means estimator (a sketch in Python follows this slide):
      τ̂_DM = (1/n_1) ∑_{i: W_i = 1} Y_i − (1/n_0) ∑_{i: W_i = 0} Y_i,
  where n_w = #{i : W_i = w}.
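A minimal sketch of the difference-in-means estimator on simulated data (Python with NumPy; the data-generating process, seed, and variable names are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n, tau_true = 10_000, 2.0

    # Simulate potential outcomes and a Bernoulli(1/2) random assignment.
    y0 = rng.normal(size=n)
    y1 = y0 + tau_true + rng.normal(size=n)
    w = rng.integers(0, 2, size=n)
    y = np.where(w == 1, y1, y0)  # SUTVA: we observe Y_i = Y_i(W_i)

    # Difference-in-means: mean outcome of the treated minus mean of the controls.
    tau_dm = y[w == 1].mean() - y[w == 0].mean()
    print(f"tau_hat_DM = {tau_dm:.3f}  (truth: {tau_true})")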

  8-9. Average treatment effect estimation
  Properties of τ̂_DM:
  • Under the previous assumptions (i.i.d. sampling, SUTVA, random treatment assignment), τ̂_DM is unbiased and √n-consistent:
      √n (τ̂_DM − τ) →_d N(0, V_DM) as n → ∞,
    where V_DM = Var(Y_i(0)) / P(W_i = 0) + Var(Y_i(1)) / P(W_i = 1).
  • Using plug-in estimators we also get confidence intervals:
      lim_{n→∞} P( τ ∈ [ τ̂_DM ± Φ⁻¹(1 − α/2) √(V̂_DM / n) ] ) = 1 − α,
    where Φ is the standard Gaussian cdf (a plug-in computation is sketched after this slide).
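A sketch of the plug-in confidence interval on the same kind of simulated randomized experiment (the variance formula follows the slide; the simulated data and names are illustrative assumptions):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n = 10_000
    w = rng.integers(0, 2, size=n)
    y = np.where(w == 1, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

    n1, n0 = (w == 1).sum(), (w == 0).sum()
    tau_dm = y[w == 1].mean() - y[w == 0].mean()

    # Plug-in variance: sample variances and empirical assignment frequencies
    # stand in for Var(Y_i(w)) and P(W_i = w).
    v_hat = y[w == 0].var(ddof=1) / (n0 / n) + y[w == 1].var(ddof=1) / (n1 / n)

    alpha = 0.05
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(v_hat / n)
    print(f"tau_hat_DM = {tau_dm:.3f}, "
          f"95% CI = [{tau_dm - half_width:.3f}, {tau_dm + half_width:.3f}]")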

  10-11. Average treatment effect estimation with difference-in-means
  The difference-in-means estimator is:
  • conceptually simple and easy to compute,
  • consistent, with asymptotically valid inference,
  • but is it the optimal way to use the data for fixed, finite n?
  Note: the average treatment effect τ is a causal parameter, i.e. a property we wish to know about a population; it is not tied to a particular study design or estimation method.

  12. Randomized trials in the linear model
  Idea: assume the responses Y_i(0) and Y_i(1) are linear in the covariates.
  Assumptions:
  • n i.i.d. samples (X_i, Y_i, W_i),
  • Y_i(w) = c_(w) + X_i β_(w) + ε_i(w), for w ∈ {0, 1},
  • E[ε_i(w) | X_i] = 0 and Var(ε_i(w) | X_i) = σ².
  Without loss of generality we additionally assume:
  • P(W_i = 0) = P(W_i = 1) = 1/2,
  • E[X] = 0.

  13-14. Randomized trials in the linear model
  OLS estimator (a sketch follows this slide):
      τ̂_OLS := (ĉ_(1) − ĉ_(0)) + X̄ (β̂_(1) − β̂_(0))
              = (1/n) ∑_{i=1}^n [ (ĉ_(1) + X_i β̂_(1)) − (ĉ_(0) + X_i β̂_(0)) ],
  where X̄ = (1/n) ∑_{i=1}^n X_i and the coefficients are obtained by OLS on the two linear models, one per treatment arm.
  Properties of τ̂_OLS:
  • Asymptotic independence of ĉ_(w), β̂_(w) and X̄, together with the decomposition
      τ̂_OLS − τ = (ĉ_(1) − c_(1)) − (ĉ_(0) − c_(0)) + X̄ (β_(1) − β_(0)) + X̄ (β̂_(1) − β̂_(0) − β_(1) + β_(0)).
  • Writing V_OLS = 4σ² + (β_(0) − β_(1))ᵀ Var(X) (β_(0) − β_(1)), the central limit theorem gives
      √n (τ̂_OLS − τ) →_d N(0, V_OLS) as n → ∞.
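A sketch of τ̂_OLS via two arm-wise least-squares fits (the simulated linear model matches the slide's assumptions; the data-generating parameters and names are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 5_000, 3
    X = rng.normal(size=(n, d))         # E[X] = 0, as assumed on the slides
    w = rng.integers(0, 2, size=n)      # P(W_i = 1) = 1/2
    beta0, beta1 = rng.normal(size=d), rng.normal(size=d)
    c0, c1 = 0.0, 1.5                   # true ATE = c1 - c0, since E[X] = 0
    y = np.where(w == 1, c1 + X @ beta1, c0 + X @ beta0) + rng.normal(size=n)

    def ols_fit(X_arm, y_arm):
        # Ordinary least squares with an intercept; returns (c_hat, beta_hat).
        Z = np.column_stack([np.ones(len(y_arm)), X_arm])
        coef, *_ = np.linalg.lstsq(Z, y_arm, rcond=None)
        return coef[0], coef[1:]

    c1_hat, b1_hat = ols_fit(X[w == 1], y[w == 1])
    c0_hat, b0_hat = ols_fit(X[w == 0], y[w == 0])

    # tau_hat_OLS = (c1_hat - c0_hat) + X_bar (b1_hat - b0_hat)
    tau_ols = (c1_hat - c0_hat) + X.mean(axis=0) @ (b1_hat - b0_hat)
    print(f"tau_hat_OLS = {tau_ols:.3f}  (truth: {c1 - c0})")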

  15. Randomized trials in the linear model
  Properties of τ̂_OLS:
  • Writing V_OLS = 4σ² + ‖β_(0) − β_(1)‖²_A with A = Var(X), the central limit theorem gives
      √n (τ̂_OLS − τ) →_d N(0, V_OLS) as n → ∞.
  Remark:
  • Under the linearity assumption,
      V_DM = 4σ² + ‖β_(0) − β_(1)‖²_A + ‖β_(0) + β_(1)‖²_A,
    so τ̂_OLS is always at least as good as τ̂_DM in terms of asymptotic variance (a small Monte Carlo check follows this slide).
  • This still holds under model mis-specification (the proof uses Huber-White linear regression analysis).
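A small Monte Carlo check of the variance comparison (illustrative; the coefficients are chosen so that ‖β_(0) + β_(1)‖²_A is large, which should make τ̂_DM visibly noisier than τ̂_OLS):

    import numpy as np

    rng = np.random.default_rng(3)
    n, reps = 2_000, 500
    beta0, beta1 = np.array([1.0, -1.0]), np.array([2.0, 0.5])

    def one_run():
        X = rng.normal(size=(n, 2))
        w = rng.integers(0, 2, size=n)
        y = np.where(w == 1, 1.0 + X @ beta1, X @ beta0) + rng.normal(size=n)
        dm = y[w == 1].mean() - y[w == 0].mean()
        # Arm-wise OLS, as on the previous slide.
        Z = np.column_stack([np.ones(n), X])
        b1, *_ = np.linalg.lstsq(Z[w == 1], y[w == 1], rcond=None)
        b0, *_ = np.linalg.lstsq(Z[w == 0], y[w == 0], rcond=None)
        ols = (b1[0] - b0[0]) + X.mean(axis=0) @ (b1[1:] - b0[1:])
        return dm, ols

    draws = np.array([one_run() for _ in range(reps)])
    # n * empirical variance approximates the asymptotic variances V_DM and V_OLS.
    print("n * Var:  DM =", round(n * draws[:, 0].var(), 2),
          "  OLS =", round(n * draws[:, 1].var(), 2))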

  16. Beyond a single randomized controlled trial

  17-20. How to combine different experiments or data sets
  Study the effect of a cash incentive to discourage teenagers from smoking in two different cities.
  [Figures comparing the two studies and their correct aggregation; not recoverable from the transcript.]

  21. Aggregating several ATE estimators
  How to combine several trials testing the same treatment on different populations?
  Assumptions:
  • n i.i.d. samples (X_i, Y_i, W_i),
  • the covariates X_i take values in a finite discrete space X (i.e. |X| = p),
  • treatment assignment is random conditionally on X_i:
      {Y_i(0), Y_i(1)} ⊥⊥ W_i | X_i = x, for all x ∈ X.
  Bucket-wise ATE: τ(x) = E[Y_i(1) − Y_i(0) | X_i = x].

  22. Results for aggregated difference-in-means estimators
  Aggregated difference-in-means estimator (a sketch follows this slide):
      τ̂ := ∑_{x∈X} (n_x / n) τ̂(x) = ∑_{x∈X} (n_x / n) [ (1/n_{x1}) ∑_{X_i = x, W_i = 1} Y_i − (1/n_{x0}) ∑_{X_i = x, W_i = 0} Y_i ].
  • Writing e(x) = P(W_i = 1 | X_i = x) and adding the simplifying assumption Var(Y(w) | X = x) = σ²(x), we can show that
      √n_x (τ̂(x) − τ(x)) →_d N(0, σ²(x) / (e(x)(1 − e(x)))) as n → ∞.
  • Finally, writing V_BUCKET = Var(τ(X)) + E[σ²(X) / (e(X)(1 − e(X)))],
      √n (τ̂ − τ) →_d N(0, V_BUCKET):
    no dependence on p, the number of buckets!
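A sketch of the aggregated (bucket-wise) difference-in-means estimator on a discrete covariate (the four buckets, propensities, and effects are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 20_000
    x = rng.integers(0, 4, size=n)                 # discrete covariate: 4 buckets
    e = np.array([0.2, 0.4, 0.6, 0.8])[x]          # bucket-wise propensity e(x)
    w = (rng.uniform(size=n) < e).astype(int)
    tau_x = np.array([0.0, 1.0, 2.0, 3.0])[x]      # bucket-wise effect tau(x)
    y = x + w * tau_x + rng.normal(size=n)

    # Bucket-level difference-in-means, weighted by the bucket frequency n_x / n.
    tau_hat = 0.0
    for b in range(4):
        m = x == b
        dm_b = y[m & (w == 1)].mean() - y[m & (w == 0)].mean()
        tau_hat += m.mean() * dm_b

    print(f"tau_hat = {tau_hat:.3f}  (truth: {tau_x.mean():.3f})")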

  23. Inverse-propensity weighting

  24. Continuous X and the propensity score
  Observation from discrete X with a finite number of buckets: the number of buckets p does not affect the accuracy of inference.
  How do we carry the analysis and results over to the continuous case?
  1. Modify the assumptions.
  2. Define an analogue of the "buckets".
  Assumptions:
  • n i.i.d. samples (X_i, Y_i, W_i),
  • the covariates X_i take values in a continuous space X,
  • treatment assignment is random conditionally on X_i:
      {Y_i(0), Y_i(1)} ⊥⊥ W_i | X_i (the unconfoundedness assumption).

  25-26. Unconfoundedness and the propensity score
  Propensity score: e(x) = P(W_i = 1 | X_i = x), for all x ∈ X.
  Key property: e is a balancing score, i.e. under unconfoundedness it satisfies
      {Y_i(0), Y_i(1)} ⊥⊥ W_i | e(X_i).
  As a consequence, it suffices to control for e(X) (rather than X) to remove the biases associated with non-random treatment assignment.
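As a hedged illustration of why this property is useful, here is the standard inverse-propensity-weighted (IPW) ATE estimator, τ̂_IPW = (1/n) ∑ [W_i Y_i / ê(X_i) − (1 − W_i) Y_i / (1 − ê(X_i))], with e(x) fitted by logistic regression. The IPW formula is textbook-standard rather than taken from this excerpt, and the simulated confounded data are an assumption:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(5)
    n = 20_000
    X = rng.normal(size=(n, 2))
    e_true = 1 / (1 + np.exp(-X[:, 0]))              # true propensity e(x)
    w = (rng.uniform(size=n) < e_true).astype(int)   # W depends on X: confounding
    y = X[:, 0] + 2.0 * w + rng.normal(size=n)       # true ATE = 2

    # Naive difference-in-means is biased here, since treatment depends on X.
    print("naive DM:", round(y[w == 1].mean() - y[w == 0].mean(), 3))

    # Fit e(x) = P(W = 1 | X = x), then reweight by the inverse propensity.
    e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
    tau_ipw = np.mean(w * y / e_hat - (1 - w) * y / (1 - e_hat))
    print("tau_hat_IPW:", round(tau_ipw, 3), "(truth: 2.0)")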

  27. Unconfoundedness and the propensity score: finite number of strata
  If the data fall into J strata (S_j)_{1 ≤ j ≤ J}, with J < ∞ and such that e(x) = e_j within each stratum, then we have a consistent estimator of the ATE (a sketch follows this slide):
      τ̂ := ∑_{j=1}^J (n_j / n) τ̂_j = ∑_{j=1}^J (n_j / n) [ (1/n_{j1}) ∑_{X_i ∈ S_j, W_i = 1} Y_i − (1/n_{j0}) ∑_{X_i ∈ S_j, W_i = 0} Y_i ].
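A sketch of the stratified estimator, with strata formed from quantiles of an estimated propensity score; forming strata this way is a common approximation to the finite-strata setting above, not the slide's exact setup, and J = 5, the logistic model, and the data are all illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)
    n, J = 20_000, 5
    X = rng.normal(size=(n, 2))
    e_true = 1 / (1 + np.exp(-X[:, 0]))
    w = (rng.uniform(size=n) < e_true).astype(int)
    y = X[:, 0] + 2.0 * w + rng.normal(size=n)    # true ATE = 2

    # Strata S_j: quantile bins of the estimated propensity score.
    e_hat = LogisticRegression().fit(X, w).predict_proba(X)[:, 1]
    cuts = np.quantile(e_hat, np.linspace(0, 1, J + 1)[1:-1])
    strata = np.digitize(e_hat, cuts)             # stratum index in {0, ..., J-1}

    # Stratum-level difference-in-means, weighted by n_j / n.
    tau_hat = 0.0
    for j in range(J):
        m = strata == j
        dm_j = y[m & (w == 1)].mean() - y[m & (w == 0)].mean()
        tau_hat += m.mean() * dm_j

    print(f"tau_hat (stratified) = {tau_hat:.3f}  (truth: 2.0)")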
