Gov 2002 - Causal Inference III: Regression Discontinuity Designs
Matthew Blackwell and Arthur Spirling
October 16th, 2014

Introduction
◮ Causal inference for us so far: selection on observables, and instrumental variables for when that assumption doesn't hold.


Comparison to the traditional setup
◮ Note that ignorability holds here by design, because conditional on the forcing variable, the treatment is deterministic:
      Y_i(1), Y_i(0) ⊥⊥ A_i | X_i
◮ Again, we can't directly use this because the usual positivity assumption is violated. Remember that positivity is an overlap condition:
      0 < Pr[A_i = 1 | X_i = x] < 1
◮ Here, obviously, the propensity score is only 0 or 1, depending on the value of the forcing variable.
◮ Thus, we need to extrapolate from the treated to the control group and vice versa.

Extrapolation and smoothness
◮ Remember that the quantity of interest here is the effect at the threshold:
      τ_SRD = E[Y_i(1) − Y_i(0) | X_i = c] = E[Y_i(1) | X_i = c] − E[Y_i(0) | X_i = c]
◮ But we never observe E[Y_i(0) | X_i = c] due to the design, so we're going to extrapolate from E[Y_i(0) | X_i = c − ε].
◮ Extrapolation, even at short distances, requires a certain smoothness in the functions we are extrapolating.

Continuity of the CEFs
Assumption 1: Continuity
The functions E[Y_i(0) | X_i = x] and E[Y_i(1) | X_i = x] are continuous in x.
◮ This continuity implies the following:
      E[Y_i(0) | X_i = c] = lim_{x ↑ c} E[Y_i(0) | X_i = x]            (continuity)
                          = lim_{x ↑ c} E[Y_i(0) | A_i = 0, X_i = x]   (SRD)
                          = lim_{x ↑ c} E[Y_i | X_i = x]               (consistency/SRD)
◮ Note that this is the same for the treated group:
      E[Y_i(1) | X_i = c] = lim_{x ↓ c} E[Y_i | X_i = x]

Identification results
◮ Thus, under the ignorability assumption, the sharp RD assumption, and the continuity assumption, we have:
      τ_SRD = E[Y_i(1) − Y_i(0) | X_i = c]
            = E[Y_i(1) | X_i = c] − E[Y_i(0) | X_i = c]
            = lim_{x ↓ c} E[Y_i | X_i = x] − lim_{x ↑ c} E[Y_i | X_i = x]
◮ Note that each of these limits is identified, at least with infinite data, as long as X_i has positive density around the cutpoint.
◮ Why? With arbitrarily high N, we get arbitrarily good approximations to the conditional expectation on either side of the cutpoint.
◮ Estimating these nonparametrically is difficult, as we'll see (endpoints are a big problem).

What can go wrong?
◮ If the potential outcomes change at the discontinuity for reasons other than the treatment, then smoothness will be violated.
◮ For instance, if people sort around the threshold, then you might get jumps other than the one you care about.
◮ If things other than the treatment change at the threshold, then that might cause discontinuities in the potential outcomes.

Estimation in the SRD

Graphical approaches
◮ Simple plot of mean outcomes within bins of the forcing variable (a short computational sketch follows below):
      \bar{Y}_k = \frac{1}{N_k} \sum_{i=1}^{N} Y_i \cdot I(b_k < X_i \le b_{k+1})
  where N_k is the number of units within bin k and the b_k are the bin cutpoints.
◮ Obvious discontinuity at the threshold?
◮ Are there other, unexplained discontinuities?
◮ As Imbens and Lemieux say:
      The formal statistical analyses discussed below are essentially just sophisticated versions of this, and if the basic plot does not show any evidence of a discontinuity, there is relatively little chance that the more sophisticated analyses will lead to robust and credible estimates with statistically and substantially significant magnitudes.
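As a concrete illustration, here is a minimal Python sketch of the binned-means calculation, assuming arrays x (forcing variable) and y (outcome), a cutoff, and a chosen bin width; all names are illustrative and not from the slides.

```python
import numpy as np

def binned_means(x, y, cutoff, bin_width):
    """Mean of y within bins of x, with the cutoff forced to be a bin edge."""
    left_edges = np.arange(cutoff, x.min() - bin_width, -bin_width)[::-1]
    right_edges = np.arange(cutoff + bin_width, x.max() + bin_width, bin_width)
    edges = np.concatenate([left_edges, right_edges])
    mids, means = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x > lo) & (x <= hi)        # indicator I(b_k < X_i <= b_{k+1})
        if in_bin.any():
            mids.append((lo + hi) / 2)
            means.append(y[in_bin].mean())   # \bar{Y}_k = (1/N_k) * sum of Y_i in bin k
    return np.array(mids), np.array(means)

# Usage: plot bin midpoints against bin means and look for a jump at the cutoff.
# mids, means = binned_means(x, y, cutoff=0.0, bin_width=0.5)
```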

Example from an RD on extending unemployment [figure]

Other graphs to include
◮ Next, it's a good idea to plot covariates by the forcing variable to see if these covariates also jump at the discontinuity (see the sketch after this slide).
◮ Same binning strategy:
      \bar{Z}_{km} = \frac{1}{N_k} \sum_{i=1}^{N} Z_{im} \cdot I(b_k < X_i \le b_{k+1})
◮ Intuition: our key assumption is that the potential outcomes are smooth in the forcing variable.
◮ Discontinuities in covariates unaffected by the threshold could be indications of discontinuities in the potential outcomes.
◮ Similar to balance tests in matching.
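The same binning helper can be reused for each covariate column; a minimal sketch, assuming an n-by-M covariate matrix Z and the binned_means function from the earlier sketch.

```python
import numpy as np

def covariate_bin_means(x, Z, cutoff, bin_width):
    """Binned means of each covariate column, using the same bins as the outcome plot.

    Reuses the binned_means helper sketched above; Z is an n-by-M array of covariates.
    """
    return [binned_means(x, Z[:, m], cutoff, bin_width) for m in range(Z.shape[1])]

# Plot each (mids, means) pair and check for jumps at the cutoff, analogous to
# balance checks in matching.
```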

Checking covariates at the discontinuity [figure]

General estimation strategy
◮ The main goal in RD is to estimate the limits of various CEFs, such as:
      lim_{x ↑ c} E[Y_i | X_i = x]
◮ It turns out that this is a hard problem because we want to estimate the regression at a single point, and that point is a boundary point.
◮ As a result, the usual kinds of nonparametric estimators perform poorly.
◮ In general, we are going to have to choose some way of estimating the regression functions around the cutpoint.
◮ Using the entire sample on either side will obviously lead to bias, because values far from the cutpoint are clearly different from those nearer to the cutpoint.
◮ → restrict our estimation to units close to the threshold (an illustrative simulation follows below).
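To see why the full sample misleads, here is a small simulated example (not from the slides): the CEF is smooth but strongly curved, the true jump at the cutoff is 10, and a naive difference in means is badly biased unless we restrict attention to a narrow window.

```python
import numpy as np

rng = np.random.default_rng(0)
n, cutoff, tau = 2000, 0.0, 10.0
x = rng.uniform(-10, 10, n)                      # forcing variable
a = (x >= cutoff).astype(float)                  # sharp treatment assignment
y = 0.3 * x**3 + tau * a + rng.normal(0, 5, n)   # smooth but curved potential-outcome CEFs

def naive_jump(x, y, a, cutoff, h):
    """Difference in mean outcomes within h of the cutoff, treated minus control."""
    keep = np.abs(x - cutoff) <= h
    return y[keep & (a == 1)].mean() - y[keep & (a == 0)].mean()

print(naive_jump(x, y, a, cutoff, h=10))   # whole sample: far from the true jump of 10
print(naive_jump(x, y, a, cutoff, h=0.5))  # narrow window: close to the true 10
```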

Example of misleading trends [figure: outcome y plotted against the forcing variable x]

Nonparametric and semiparametric approaches
◮ Let's define
      µ_R(x) = lim_{z ↓ x} E[Y_i(1) | X_i = z]
      µ_L(x) = lim_{z ↑ x} E[Y_i(0) | X_i = z]
◮ For the SRD, we have τ_SRD = µ_R(c) − µ_L(c).
◮ One nonparametric approach is to estimate µ_L(c) nonparametrically with a uniform kernel:
      \hat{µ}_L(c) = \frac{\sum_{i=1}^{N} Y_i \cdot I\{c − h \le X_i < c\}}{\sum_{i=1}^{N} I\{c − h \le X_i < c\}}
◮ Here, h is a bandwidth parameter, selected by you.
◮ Basically, calculate means among units no more than h away from the threshold (a sketch follows below).
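A minimal sketch of these uniform-kernel estimators, reusing the x and y arrays from the simulation sketch above; names are illustrative.

```python
import numpy as np

def mu_left_uniform(x, y, cutoff, h):
    """Uniform-kernel estimate of mu_L(c): average outcome with c - h <= X_i < c."""
    window = (x >= cutoff - h) & (x < cutoff)
    return y[window].mean()

def mu_right_uniform(x, y, cutoff, h):
    """Analogous estimate of mu_R(c) from just above the cutoff."""
    window = (x >= cutoff) & (x <= cutoff + h)
    return y[window].mean()

# tau_hat = mu_right_uniform(x, y, 0.0, h=1.0) - mu_left_uniform(x, y, 0.0, h=1.0)
```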

Bandwidth equal to 7 [figure: y vs. x]

Bandwidth equal to 5 [figure: y vs. x]

Bandwidth equal to 1 [figure: y vs. x]

Local averages
◮ Estimate the mean of Y_i when X_i ∈ [c, c + h] and when X_i ∈ [c − h, c).
◮ Can do this with the following regression on those units less than h away from c (sketch below):
      (\hat{α}, \hat{τ}) = \arg\min_{α, τ} \sum_{i: X_i ∈ [c − h, c + h]} (Y_i − α − τ A_i)^2
◮ Here, \hat{τ}_SRD = \hat{τ}.
◮ This turns out to have very large bias as we increase the bandwidth.
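A sketch of this regression formulation with plain least squares, which is equivalent to the difference in means within the window; again reusing x, y, and a from the simulation sketch above.

```python
import numpy as np

def local_average_tau(x, y, a, cutoff, h):
    """Regress Y on a constant and the treatment dummy within h of the cutoff."""
    keep = np.abs(x - cutoff) <= h
    X = np.column_stack([np.ones(keep.sum()), a[keep]])      # (1, A_i)
    (alpha_hat, tau_hat), *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return tau_hat   # difference in means across the cutoff within the window

# local_average_tau(x, y, a, 0.0, h=1.0)
```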

Local linear regression
◮ Instead of a local constant, we can use a local linear regression.
◮ Run a linear regression of Y_i on X_i − c in the group with X_i ∈ [c, c + h] to estimate µ_R(c), and the same regression for the group with X_i ∈ [c − h, c):
      (\hat{α}_L, \hat{β}_L) = \arg\min_{α, β} \sum_{i: X_i ∈ [c − h, c)} (Y_i − α − β(X_i − c))^2
      (\hat{α}_R, \hat{β}_R) = \arg\min_{α, β} \sum_{i: X_i ∈ [c, c + h]} (Y_i − α − β(X_i − c))^2
◮ Our estimate is (see the sketch after this slide):
      \hat{τ}_SRD = \hat{µ}_R(c) − \hat{µ}_L(c) = [\hat{α}_R + \hat{β}_R(c − c)] − [\hat{α}_L + \hat{β}_L(c − c)] = \hat{α}_R − \hat{α}_L
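A sketch of the two side-specific fits: the intercepts are the fitted CEF values at the cutoff, and their difference is the estimate (illustrative names, reusing x and y from above).

```python
import numpy as np

def local_linear_tau(x, y, cutoff, h):
    """Separate linear fits of Y on (X - c) on each side of the cutoff, within h."""
    def intercept_at_cutoff(mask):
        X = np.column_stack([np.ones(mask.sum()), x[mask] - cutoff])
        beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        return beta[0]                               # alpha_hat = fitted value at X = c
    left = (x >= cutoff - h) & (x < cutoff)
    right = (x >= cutoff) & (x <= cutoff + h)
    return intercept_at_cutoff(right) - intercept_at_cutoff(left)   # alpha_R - alpha_L

# local_linear_tau(x, y, 0.0, h=1.0)
```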

More practical estimation
◮ We can estimate this local linear regression by dropping observations more than h away from c and then running the following regression:
      Y_i = α + β(X_i − c) + τ A_i + γ(X_i − c)A_i + η_i
◮ Here we just have an interaction term between the treatment status and the centered forcing variable.
◮ Here, \hat{τ}_SRD = \hat{τ}, the coefficient on the treatment.
◮ This yields numerically the same estimate as the separate regressions (a pooled sketch follows below).
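A sketch of the pooled version; with matching windows, the coefficient on the treatment dummy reproduces α̂_R − α̂_L from the separate fits above.

```python
import numpy as np

def pooled_local_linear_tau(x, y, a, cutoff, h):
    """One regression with treatment, centered forcing variable, and their interaction."""
    keep = np.abs(x - cutoff) <= h
    xc = x[keep] - cutoff
    X = np.column_stack([np.ones(keep.sum()), xc, a[keep], xc * a[keep]])
    beta, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return beta[2]   # coefficient on A_i, i.e. tau_hat_SRD

# pooled_local_linear_tau(x, y, a, 0.0, h=1.0)  # matches local_linear_tau(x, y, 0.0, 1.0)
```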

Bandwidth equal to 10 (Global) [figure: y vs. x]

Bandwidth equal to 7 [figure: y vs. x]

Bandwidth equal to 5 [figure: y vs. x]

Bandwidth equal to 1 [figure: y vs. x]

Odds and ends for the SRD
◮ Standard errors: robust standard errors from local OLS are valid (a sketch follows below).
◮ Covariates: shouldn't matter, but can include them for increased precision.
◮ ALWAYS REPORT MODELS WITHOUT COVARIATES FIRST
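As a closing sketch, the pooled local linear regression can be run with heteroskedasticity-robust standard errors via statsmodels (HC2 shown here; HC1 and HC3 are also available); variable names as in the earlier sketches.

```python
import numpy as np
import statsmodels.api as sm

def pooled_fit_robust(x, y, a, cutoff, h):
    """Local linear RD regression with heteroskedasticity-robust (HC2) standard errors."""
    keep = np.abs(x - cutoff) <= h
    xc = x[keep] - cutoff
    X = sm.add_constant(np.column_stack([xc, a[keep], xc * a[keep]]))
    return sm.OLS(y[keep], X).fit(cov_type="HC2")

# fit = pooled_fit_robust(x, y, a, 0.0, h=1.0)
# print(fit.params[2], fit.bse[2])   # tau_hat and its robust standard error
```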
