
Synthetic Difference in Differences

Dmitry Arkhangelsky, Susan Athey, David Hirshberg, Guido Imbens, Stefan Wager. JSM, August 3rd, 2020.


  1. Synthetic Difference in Differences. Dmitry Arkhangelsky, Susan Athey, David Hirshberg, Guido Imbens, Stefan Wager. JSM, August 3rd, 2020.

  2. When Berkeley implemented the first soda tax, we compared to San Francisco. While Berkeley, the first U.S. city to pass a “soda tax,” saw a substantial decline of 0.13 times/day in the consumption of soda in the months following implementation of the tax in March 2015, neighboring San Francisco, where a soda-tax measure was defeated, and Oakland, saw a 0.03 times/day increase, according to a study published today in the American Journal of Public Health.

  3. This is how we did it. [Figure: consumption lines labeled San Francisco, Berkeley, and Hallucinated Parallel Berkeley.]

  4. This is a “Difference-in-Differences” estimate
     • We compare Berkeley’s change in consumption to San Francisco’s.
       $\tau = Y^{(1)}_{\text{BK,post}} - Y^{(0)}_{\text{BK,post}}$
       $\hat\tau = [Y^{(1)}_{\text{BK,post}} - Y^{(0)}_{\text{BK,pre}}] - [Y^{(0)}_{\text{SF,post}} - Y^{(0)}_{\text{SF,pre}}]$
     • Subtracting SF’s change adjusts for a trend in absence of intervention.
     • It works if the cities follow parallel trends in absence of intervention.
       $Y^{(0)}_{\text{city,time}} \approx \alpha_{\text{city}} + \beta \cdot 1\{\text{time} = \text{post}\}$
     • This assumption is strong, but we need it (or more data).
     • We can’t distinguish a treatment effect from a difference in trend.
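The arithmetic behind the 2×2 estimate can be sketched in a few lines. This is an illustration only: the function name `did_2x2` is hypothetical, and because the slides quote changes rather than levels, the pre-period levels below are set to an arbitrary 0.0 baseline (the estimate does not depend on it). The +0.03 change is the figure quoted for the comparison cities.

```python
def did_2x2(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences computed from four cell means."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change

# Changes quoted on the slide, in times/day.
# The 0.0 baselines are placeholders, not reported values.
berkeley_pre, berkeley_post = 0.0, -0.13
comparison_pre, comparison_post = 0.0, 0.03

tau_hat = did_2x2(berkeley_pre, berkeley_post, comparison_pre, comparison_post)
print(round(tau_hat, 2))
```

Under these inputs the estimate is a drop of about 0.16 times/day relative to the comparison cities.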

  5. Difference in Differences
     Things get interesting when we observe many units over many time periods. We focus on simultaneous adoption.

     | units / periods | $1, \dots, T_0$ | $T_0 + 1, \dots, T = T_0 + T_1$ |
     | --- | --- | --- |
     | $1, \dots, N_0$ | no treatment | no treatment |
     | $N_0 + 1, \dots, N = N_0 + N_1$ | no treatment | treatment |

     • We could still use a parallel trends model: $Y_{it}(w) \sim \alpha_i + \beta_t + w \tau_{it}$
     • Least squares in this model is equivalent to 2 × 2 diff-in-diff applied to the averages of our 4 ‘blocks’.
     • But we can see that trends in absence of treatment aren’t parallel.
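The equivalence between least squares and block-averaged diff-in-diff can be checked numerically. The sketch below assumes a single common effect τ and a balanced panel stored as an N × T array with control units in the first `N0` rows and pre-treatment periods in the first `T0` columns; the function names and the random data are illustrative, not part of the talk.

```python
import numpy as np

def block_did(Y, N0, T0):
    """2x2 diff-in-diff on the averages of the four blocks of an N x T panel."""
    control_change = Y[:N0, T0:].mean() - Y[:N0, :T0].mean()
    treated_change = Y[N0:, T0:].mean() - Y[N0:, :T0].mean()
    return treated_change - control_change

def twfe_tau(Y, N0, T0):
    """Least-squares coefficient on W in Y ~ alpha_i + beta_t + W * tau."""
    N, T = Y.shape
    W = np.zeros((N, T))
    W[N0:, T0:] = 1.0  # treated block: late periods of the last N - N0 units
    # Row n * T + t of each dummy matrix corresponds to entry (n, t) of Y.ravel().
    unit_dummies = np.kron(np.eye(N), np.ones((T, 1)))
    time_dummies = np.kron(np.ones((N, 1)), np.eye(T))
    X = np.column_stack([unit_dummies, time_dummies, W.ravel()])
    # The dummies are collinear, but the coefficient on W is still identified.
    coef, *_ = np.linalg.lstsq(X, Y.ravel(), rcond=None)
    return coef[-1]

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 6))  # synthetic data, for illustration only
print(np.isclose(block_did(Y, 5, 4), twfe_tau(Y, 5, 4)))
```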

  6. California’s anti-smoking legislation (Proposition 99). [Figure: Average Control and California series, 1970 to 2000, vertical axis from 60 to 120.] A 25 cents/pack excise tax increase took effect in 1989. $\text{California} \not\approx \frac{1}{49}\,\text{Alaska} + \frac{1}{49}\,\text{Alabama} + \dots$

  7. California’s anti-smoking legislation: Difference-in-Differences. [Figure: Average Control and California series, 1970 to 2000, vertical axis from 60 to 120, with a brace marking the estimated effect.] If we average and hallucinate a line, it obviously doesn’t fit.

  8. Synthetic Controls
     • If California’s pre-treatment trend doesn’t match the average state’s, compare it to something else.
     • For example, a weighted average of states with a trend that does match.
     • This weighted average of units is called a synthetic control [Abadie, Diamond, and Hainmueller, 2010].
     • Construction: weight the control units to match pre-treatment outcomes,
       $\sum_{n \le N_0} \hat\omega_n Y_{nt} \approx \bar{Y}_{\text{treated},t}$ for all $t \le T_0$,
       where the left side is the (weighted) control unit average at time $t$ and the right side is the treated unit average at time $t$.
     • Treatment effects are typically estimated by cross-sectional comparison: the mean post-treatment difference between treated and synthetic control.
       $\hat\tau = \frac{1}{T_1} \sum_{t > T_0} \Big( \bar{Y}_{\text{treated},t} - \sum_{n \le N_0} \hat\omega_n Y_{nt} \Big)$

  9. California’s anti-smoking legislation: Synthetic Control. [Figure: Synthetic Control and California series, 1970 to 2000, vertical axis from 40 to 120.] When comparing to a synthetic control, trends line up better. $\text{California} \approx .3\,\text{Utah} + .2\,\text{Nevada} + .15\,\text{Montana} + \dots$

  10. Improving on Synthetic Control
      Instead of constructing a unit for a cross-sectional comparison, construct a unit and time period for a diff-in-diff comparison.
      This is a double robust version of synthetic control. If the before/after comparison is good, the unit comparison doesn’t have to be.
      And it’s easier to make them good. Constant shifts get differenced out, so constructed parallel trends are as good as overlaid ones.

  11-15. California’s anti-smoking legislation: Constructed Parallel Trends. [Figure, repeated over five slides as an animation build: treated, sdid, and sc series, 1970 to 2000, vertical axis from 40 to 160.]

  16. California’s anti-smoking legislation: Double Robustness. [Figure: one row per control state, Alabama through Wyoming, on an axis running from -80 to 0; points are marked by estimator (sdid, sc) and sized by unit.weight (0.05 to 0.25).]

  17. Implementation
      1. Estimate synthetic control weights $\hat\omega$ by simplex-constrained least squares on the pre-treatment data.
         $\hat\omega = \arg\min_{\omega_0, \omega} \sum_{t \le T_0} \big( \omega_0 + \omega^T Y_{\text{control},t} - \bar{Y}_{\text{treated},t} \big)^2 + \zeta^2 T_0 \|\omega\|^2$
         subject to $\omega_1, \dots, \omega_{N_0} \ge 0$, $\sum_{n=1}^{N_0} \omega_n = 1$.
         Use an intercept. We want parallel lines, not overlaid ones.
         Use a ridge penalty; multicollinearity is typical. Shrinkage helps control variance and own-observation bias.
      2. Estimate time series regression weights, $\hat\lambda$, on the control units.
      3. Estimate $\tau$ by (2 × 2) diff-in-diff on weighted block averages.
      4. Form confidence intervals using the jackknife estimate of standard error.
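Step 1 can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the authors' solver: it uses projected gradient descent with a sort-based simplex projection, the names `project_simplex` and `sc_unit_weights` are hypothetical, and `Y` is assumed to be an N × T array with control units in the first `N0` rows and pre-treatment periods in the first `T0` columns. The intercept is handled by centering over time, since for any fixed ω the optimal ω₀ is just the mean residual.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def sc_unit_weights(Y, N0, T0, zeta=0.0, n_iter=5000):
    """Return (omega0, omega) for the ridge-penalized, simplex-constrained fit."""
    A = Y[:N0, :T0].T               # T0 x N0 control outcomes
    b = Y[N0:, :T0].mean(axis=0)    # treated-unit average at each pre-period
    # Centering over time profiles out the intercept omega0.
    A_c = A - A.mean(axis=0)
    b_c = b - b.mean()
    ridge = zeta ** 2 * T0
    # Step size 1/L, with L the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * (np.linalg.norm(A_c, 2) ** 2 + ridge) + 1e-12)
    omega = np.full(N0, 1.0 / N0)   # start from equal (DID-style) weights
    for _ in range(n_iter):
        grad = 2.0 * A_c.T @ (A_c @ omega - b_c) + 2.0 * ridge * omega
        omega = project_simplex(omega - step * grad)
    omega0 = b.mean() - A.mean(axis=0) @ omega
    return omega0, omega
```

The time weights in step 2 could be obtained the same way by swapping the roles of units and periods, though the penalty used there is not spelled out on this slide.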

  18. Synthetic Difference-in-Differences

      |  | synthetic pre-treatment | average post-treatment |
      | --- | --- | --- |
      | synthetic control | $\sum_{n \le N_0} \sum_{t \le T_0} \hat\omega_n \hat\lambda_t Y_{nt}$ | $\sum_{n \le N_0} \sum_{t > T_0} \hat\omega_n T_1^{-1} Y_{nt}$ |
      | average treated | $\sum_{n > N_0} \sum_{t \le T_0} N_1^{-1} \hat\lambda_t Y_{nt}$ | $\sum_{n > N_0} \sum_{t > T_0} N_1^{-1} T_1^{-1} Y_{nt}$ |

      DID uses equal weights $\omega_n = 1/N_0$, $\lambda_t = 1/T_0$.
      SC only takes one difference (uses zero time weights $\lambda_t = 0$).
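The four weighted block averages and their double difference can be written out directly. The sketch below is illustrative: `weighted_did` is a hypothetical name, `Y` is assumed to be an N × T array with controls in the first `N0` rows and pre-treatment periods in the first `T0` columns, and the data and the unit weights in the SC example are random placeholders rather than fitted values.

```python
import numpy as np

def weighted_did(Y, N0, T0, omega, lam):
    """2x2 diff-in-diff on block averages weighted by omega (units) and lam (periods)."""
    Y = np.asarray(Y, dtype=float)
    control_pre = omega @ Y[:N0, :T0] @ lam
    control_post = omega @ Y[:N0, T0:].mean(axis=1)
    treated_pre = Y[N0:, :T0].mean(axis=0) @ lam
    treated_post = Y[N0:, T0:].mean()
    return (treated_post - treated_pre) - (control_post - control_pre)

rng = np.random.default_rng(0)
N0, N1, T0, T1 = 5, 2, 4, 3
Y = rng.normal(size=(N0 + N1, T0 + T1))  # synthetic data, for illustration only

# DID: equal unit weights and equal time weights.
did = weighted_did(Y, N0, T0, np.full(N0, 1.0 / N0), np.full(T0, 1.0 / T0))

# SC: zero time weights, so only the post-treatment difference remains.
omega = rng.dirichlet(np.ones(N0))       # placeholder weights on the simplex
sc = weighted_did(Y, N0, T0, omega, np.zeros(T0))
```

Passing fitted unit and time weights in place of the placeholders gives the estimate in step 3 of the implementation slide.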

  19. Theory

  20. A General Setting
      $Y_{nt} = L_{nt} + W_{nt} \tau_{nt} + \varepsilon_{nt}$, $E[\varepsilon \mid W] = 0$
      • $L$: Matrix of noiseless control potential outcomes
      • $\tau$: Matrix of treatment effects
      • $\varepsilon$: Noise matrix with iid subgaussian rows
        • We have autocorrelation over time
        • But no correlation between units.
      • $W$ indicates the treated block
      We estimate the ATT
      $\bar\tau = \frac{1}{N_1 T_1} \sum_{nt} W_{nt} \tau_{nt}$
      • Typical sample sizes are small, but the setting is ‘high dimensional’.
      • We see various dimension ratios $T/N$, $T_1/T_0$, $N_1/N_0$.
      • We lose the essence in asymptotics with too many fixed dimensions.
      • The signal $L$ tends to be multicollinear: no restricted eigenvalue condition!
      • For simplicity, we’ll assume $\text{rank}(L) \ll \min(N_0, T_0)$.
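One way to get a feel for this setting is to simulate from it. The sketch below is only an illustration under assumptions that go beyond the slide: the signal is a product of Gaussian factors, the noise is Gaussian AR(1) within each unit and independent across units, and the effect is a constant `tau` on the treated block. The function name and all parameter defaults are hypothetical.

```python
import numpy as np

def simulate_panel(N0, N1, T0, T1, rank=2, tau=1.0, rho=0.5, seed=0):
    """Draw Y = L + W * tau + eps with low-rank L and a treated block W."""
    rng = np.random.default_rng(seed)
    N, T = N0 + N1, T0 + T1
    # Low-rank signal: rank(L) <= rank by construction.
    L = rng.normal(size=(N, rank)) @ rng.normal(size=(rank, T))
    # Noise: autocorrelated over time, independent across units (rows).
    eps = np.zeros((N, T))
    eps[:, 0] = rng.normal(size=N)
    for t in range(1, T):
        eps[:, t] = rho * eps[:, t - 1] + rng.normal(size=N)
    # Treated block: last N1 units in the last T1 periods.
    W = np.zeros((N, T))
    W[N0:, T0:] = 1.0
    tau_mat = np.full((N, T), tau)
    Y = L + W * tau_mat + eps
    att = (W * tau_mat).sum() / (N1 * T1)
    return Y, L, W, att

Y, L, W, att = simulate_panel(N0=30, N1=3, T0=20, T1=5)
```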

  21. What can go wrong?
      • Underfitting: We don’t create parallel trends in pre-treatment outcomes.
      • Overfitting: We do, but by predicting signal from noise.
      • Failed identification: We adjust as intended, but we’re still confounded.

  22. Underfitting
      It happens, but it tends to be something we can see, e.g., California cigarette consumption with southeastern states as controls.
      [Figure: synth. california and california series, 1970 to 2000, vertical axis from 50 to 150.]
      $\text{California} \not\approx .82\,\text{Louisiana} + .10\,\text{Mississippi} + \dots$

  23. Overfitting
      We prove concentration around an oracle estimator to rule out overfitting.
      1. Consider the limits of our unit and time weights: the minimizers of expected (as opposed to empirical) mean squared error.
         $\tilde\omega = \arg\min_{(\omega_0, \omega) \in \mathbb{R} \times \mathcal{S}} \sum_{t \le T_0} \big( \omega_0 + \omega^T L_{\text{control},t} - \bar{L}_{\text{treated},t} \big)^2 + [\text{trace}(\Sigma) + \zeta^2 T_0] \|\omega\|^2$
         $\tilde\lambda = \arg\min_{(\lambda_0, \lambda) \in \mathbb{R} \times \mathcal{S}} \sum_{n \le N_0} \big( \lambda_0 + L_{n,\text{pre}} \lambda - \bar{L}_{n,\text{post}} \big)^2 + N_0 \|\Sigma^{1/2} (\lambda - \psi)\|^2$
         We’re in an error-in-variables model, so implicit ridge penalty terms arise as the expectation of quadratics in the noise matrix $\varepsilon$.
         $\Sigma = E\, \varepsilon_{n,\text{pre}}^T \varepsilon_{n,\text{pre}}$ (pre-treatment autocovariance matrix)
         $\psi = \arg\min_{v \in \mathbb{R}^{T_0}} E (\varepsilon_{n,\text{pre}} v - \bar\varepsilon_{n,\text{post}})^2$ (post-on-pre autoregression vector)
      2. The oracle estimator $\tilde\tau$ uses these in place of the empirical minimizers.
      3. Its error is easy to characterize because these weights are non-random.
