Gov 2000: 13. Panel Data and Clustering
Matthew Blackwell
Fall 2016

Outline
1. Panel Data
2. First Differencing Methods
3. Fixed Effects Methods
4. Clustering
5. What's next for you?

Where are we? Where are we going?
• Up until now: the linear regression model, its assumptions, and violations of those assumptions
• This week: what can we do with panel data?

1/ Panel Data

Motivation
• Relationship between democracy and infant mortality?
• Compare levels of democracy with levels of infant mortality, but…
• Democratic countries are different from non-democracies in ways that we can't measure?
  ▶ they are richer or developed earlier
  ▶ provide benefits more efficiently
  ▶ possess some cultural trait correlated with better health outcomes
• If we have data on countries over time, can we make any progress in spite of these problems?

Ross data

ross <- foreign::read.dta("../data/ross-democracy.dta")
head(ross[, c("cty_name", "year", "democracy", "infmort_unicef")])
##      cty_name year democracy infmort_unicef
## 1 Afghanistan 1965         0            230
## 2 Afghanistan 1966         0             NA
## 3 Afghanistan 1967         0             NA
## 4 Afghanistan 1968         0             NA
## 5 Afghanistan 1969         0             NA
## 6 Afghanistan 1970         0            215

Notation for panel data
• Units, $i = 1, \ldots, n$
• Time, $t = 1, \ldots, T$
• Time is a typical application, but the setup applies to other groupings:
  ▶ counties within states
  ▶ states within countries
  ▶ people within countries, etc.
• Panel data: large $n$, relatively short $T$
• Time series, cross-sectional (TSCS) data: smaller $n$, large $T$ (a political science term, mostly)

Model
  $y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + a_i + u_{it}$
• $\mathbf{x}_{it}$ is a vector of covariates (possibly time-varying)
• $a_i$ is an unobserved time-constant unit effect ("fixed effect")
• $u_{it}$ are the unobserved time-varying "idiosyncratic" errors
• $v_{it} = a_i + u_{it}$ is the combined unobserved error:
  $y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + v_{it}$
• Assume that if we could measure $a_i$, we would have the right model:
  $\mathbb{E}[u_{it} \mid \mathbf{x}_{it}, a_i] = 0$
  ▶ Note that this implies $u_{it}$ is uncorrelated with $\mathbf{x}_{it}$, so that $\mathbb{E}[u_{it} \mid \mathbf{x}_{it}] = 0$.

Pooled OLS
• Pooled OLS: pool all observations into one regression
• Treats all unit-periods (each $it$) as an iid unit.
• Has two problems:
  1. Variance is wrong
  2. Possible violation of zero conditional mean errors
• Both problems arise out of ignoring the unmeasured heterogeneity inherent in $a_i$

Pooled OLS with Ross data

pooled.mod <- lm(log(kidmort_unicef) ~ democracy + log(GDPcur),
                 data = ross)
summary(pooled.mod)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.7640     0.3449    28.3   <2e-16 ***
## democracy    -0.9552     0.0698   -13.7   <2e-16 ***
## log(GDPcur)  -0.2283     0.0155   -14.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.795 on 646 degrees of freedom
##   (5773 observations deleted due to missingness)
## Multiple R-squared: 0.504, Adjusted R-squared: 0.503
## F-statistic: 329 on 2 and 646 DF, p-value: <2e-16

Unmeasured heterogeneity
• If the unit effect $a_i$ is uncorrelated with $\mathbf{x}_{it}$, no problem for consistency!
  ▶ ⇒ $\mathbb{E}[v_{it} \mid \mathbf{x}_{it}] = \mathbb{E}[a_i + u_{it} \mid \mathbf{x}_{it}] = 0$.
  ▶ Just run pooled OLS (but worry about SEs).
• But $a_i$ is often correlated with $\mathbf{x}_{it}$, so that $\mathbb{E}[a_i \mid \mathbf{x}_{it}] \neq 0$.
  ▶ Example: democratic institutions correlated with unmeasured aspects of health outcomes, like quality of the health system or a lack of ethnic conflict.
  ▶ Ignore the heterogeneity ⇒ correlation between the combined error and the independent variables.
  ▶ ⇒ $\mathbb{E}[v_{it} \mid \mathbf{x}_{it}] = \mathbb{E}[a_i + u_{it} \mid \mathbf{x}_{it}] \neq 0$
• Pooled OLS will be biased and inconsistent because zero conditional mean error fails for the combined error.
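
The bias described on this slide can be illustrated with a small simulation. This is only a sketch on made-up data: the panel dimensions, the true coefficient of 1, and the 0.8 loading of the covariate on the unit effect are all assumptions chosen for illustration, not values from the Ross data.

```r
set.seed(2000)

n_units <- 500                     # hypothetical number of units
n_periods <- 5                     # hypothetical number of periods
id <- rep(seq_len(n_units), each = n_periods)

a <- rnorm(n_units)                # unit effect a_i, unobserved in practice
u <- rnorm(n_units * n_periods)    # idiosyncratic error u_it
beta <- 1                          # true coefficient (assumed)

# Case 1: covariate correlated with the unit effect
x <- 0.8 * a[id] + rnorm(n_units * n_periods)
y <- beta * x + a[id] + u
pooled_biased <- lm(y ~ x)

# Case 2: covariate unrelated to the unit effect
x0 <- rnorm(n_units * n_periods)
y0 <- beta * x0 + a[id] + u
pooled_ok <- lm(y0 ~ x0)

# Compare the pooled slopes with the true value of 1
c(correlated = unname(coef(pooled_biased)["x"]),
  uncorrelated = unname(coef(pooled_ok)["x0"]))
```

In the first case the pooled slope should land well above 1 because $a_i$ is absorbed into the error and moves with $x_{it}$; in the second it should sit close to 1, although the usual standard errors would still ignore the within-unit dependence.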

Panel data
• Panel data (sometimes) allows us to estimate coefficients consistently even when zero conditional mean error is violated.
• Two approaches that leverage repeated observations:
  ▶ Differencing: look at changes over time.
  ▶ Fixed effects: look at relationships within units.
• These approaches can help address time-constant unmeasured confounding.

2/ First Differencing Methods
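
First differencing
• One approach: compare changes over time
• Intuitively, changes over time will be free of time-constant unobserved heterogeneity
• Two time periods:
  $y_{i1} = \mathbf{x}_{i1}'\boldsymbol{\beta} + a_i + u_{i1}$
  $y_{i2} = \mathbf{x}_{i2}'\boldsymbol{\beta} + a_i + u_{i2}$
• Look at the change in $y$ over time:
  $$\begin{aligned}
  \Delta y_i &= y_{i2} - y_{i1} \\
             &= (\mathbf{x}_{i2}'\boldsymbol{\beta} + a_i + u_{i2}) - (\mathbf{x}_{i1}'\boldsymbol{\beta} + a_i + u_{i1}) \\
             &= (\mathbf{x}_{i2}' - \mathbf{x}_{i1}')\boldsymbol{\beta} + (a_i - a_i) + (u_{i2} - u_{i1}) \\
             &= \Delta\mathbf{x}_i'\boldsymbol{\beta} + \Delta u_i
  \end{aligned}$$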

First differences model
  $\Delta y_i = \Delta\mathbf{x}_i'\boldsymbol{\beta} + \Delta u_i$
• Coefficient on the levels $\mathbf{x}_{it}$ = the coefficient on the changes $\Delta\mathbf{x}_i$
• Time-constant unobserved heterogeneity $a_i$ drops out
• Zero conditional mean error: if $\mathbb{E}[\Delta u_i \mid \Delta\mathbf{x}_i] = 0$, zero conditional mean error holds for the differenced model.
  ▶ Stronger than $\mathbb{E}[u_{it} \mid \mathbf{x}_{it}, a_i] = 0$ because it requires assumptions about relationships between $u_{i2}$ and $\mathbf{x}_{i1}$.
• No perfect collinearity: $\mathbf{x}_{it}$ has to change over time for some units
• Under these modified assumptions, we can run regular OLS on the differences
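
A minimal sketch of first differencing "by hand" in base R, using simulated two-period data rather than the Ross data. The true coefficient of 1 and the way the covariate depends on the unit effect are assumptions made only for this illustration.

```r
set.seed(2000)

n_units <- 1000                    # hypothetical number of units
a <- rnorm(n_units)                # time-constant unit effect a_i
beta <- 1                          # true coefficient (assumed)

# Covariate in each period is correlated with a_i
x1 <- 0.8 * a + rnorm(n_units)
x2 <- 0.8 * a + rnorm(n_units)

# Outcomes follow y_it = beta * x_it + a_i + u_it
y1 <- beta * x1 + a + rnorm(n_units)
y2 <- beta * x2 + a + rnorm(n_units)

# First differences: a_i cancels because it appears in both periods
dy <- y2 - y1
dx <- x2 - x1
fd_fit <- lm(dy ~ dx)

# Pooled OLS on the stacked levels, for comparison
y_stack <- c(y1, y2)
x_stack <- c(x1, x2)
pooled_fit <- lm(y_stack ~ x_stack)

c(first_difference = unname(coef(fd_fit)["dx"]),
  pooled = unname(coef(pooled_fit)["x_stack"]))
```

The differenced regression should recover a slope near 1, while the pooled slope is pulled away from it by the omitted $a_i$. The intercept in the differenced regression plays the role of a common period effect; it should be near zero here because the simulated model has none.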

First differences in R

library(plm)
fd.mod <- plm(log(kidmort_unicef) ~ democracy + log(GDPcur), data = ross,
              index = c("id", "year"), model = "fd")
summary(fd.mod)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = log(kidmort_unicef) ~ democracy + log(GDPcur),
##     data = ross, model = "fd", index = c("id", "year"))
##
## Unbalanced Panel: n=166, T=1-7, N=649
##
## Residuals :
##    Min. 1st Qu.  Median 3rd Qu.    Max.
## -0.9060 -0.0956  0.0468  0.1410  0.3950
##
## Coefficients :
##             Estimate Std. Error t-value Pr(>|t|)
## (intercept)  -0.1495     0.0113  -13.26   <2e-16 ***
## democracy    -0.0449     0.0242   -1.85    0.064 .
## log(GDPcur)  -0.1718     0.0138  -12.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 23.5
## Residual Sum of Squares: 17.8
## R-Squared : 0.246
## Adj. R-Squared : 0.244
## F-statistic: 78.1367 on 2 and 480 DF, p-value: <2e-16

Differences-in-differences
• Often called "diff-in-diff", it is a special kind of FD model
• Let $x_{it}$ be an indicator of a unit being "treated" at time $t$.
• Focus on two periods where:
  ▶ $x_{i1} = 0$ for all $i$
  ▶ $x_{i2} = 1$ for the "treated group"
• Here is the basic model:
  $y_{it} = \beta_0 + \delta_0 d_t + \beta_1 x_{it} + a_i + u_{it}$
• $d_t$ is a dummy variable for the second time period
  ▶ $d_2 = 1$ and $d_1 = 0$
• $\beta_1$ is the quantity of interest: it's the effect of being treated

Diff-in-diff mechanics
• Let's take differences:
  $(y_{i2} - y_{i1}) = \delta_0 + \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1})$
• $(x_{i2} - x_{i1}) = 1$ only for the treated group
• $(x_{i2} - x_{i1}) = 0$ only for the control group
• $\delta_0$: the difference in the average outcome from period 1 to period 2 in the untreated group
• $\beta_1$ represents the additional change in $y$ over time (on top of $\delta_0$) associated with being in the treatment group.
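
To make the roles of $\delta_0$ and $\beta_1$ explicit, here is a short worked step. It assumes zero conditional mean error for the differenced errors, $\mathbb{E}[\Delta u_i \mid \Delta x_i] = 0$, which is the first-differences assumption applied to this model.

```latex
\begin{aligned}
\mathbb{E}[\Delta y_i \mid \Delta x_i = 1] &= \delta_0 + \beta_1 \cdot 1 + \mathbb{E}[\Delta u_i \mid \Delta x_i = 1] = \delta_0 + \beta_1 \\
\mathbb{E}[\Delta y_i \mid \Delta x_i = 0] &= \delta_0 + \beta_1 \cdot 0 + \mathbb{E}[\Delta u_i \mid \Delta x_i = 0] = \delta_0 \\
\mathbb{E}[\Delta y_i \mid \Delta x_i = 1] - \mathbb{E}[\Delta y_i \mid \Delta x_i = 0] &= \beta_1
\end{aligned}
```

So the average change among the treated, minus the average change among the controls, isolates $\beta_1$; the common time trend $\delta_0$ cancels.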

Diff-in-diff interpretation
• Key idea: comparing the changes over time in the control group to the changes over time in the treated group.
• The difference between these differences is our estimate of the causal effect:
  $\beta_1 = \overline{\Delta y}_{\text{treated}} - \overline{\Delta y}_{\text{control}}$
• Why more credible than simply looking at the treatment/control differences in period 2?
  $y_{i2} = (\beta_0 + \delta_0) + \beta_1 x_{i2} + a_i + u_{i2}$
• $a_i$ might be correlated with the treatment
• Unmeasured reasons why the treated group has higher or lower outcomes than the control group
• ⇒ bias due to violation of zero conditional mean error
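
The comparison between diff-in-diff and the period-2 cross-section can be sketched with simulated data. Everything here is hypothetical: the parameter values ($\beta_0 = 1$, $\delta_0 = 0.5$, $\beta_1 = 2$) and the rule that makes treatment more likely for units with high $a_i$ are assumptions chosen only to mimic the selection problem on the slide.

```r
set.seed(2000)

n_units <- 400                     # hypothetical number of units
a <- rnorm(n_units)                # unit effect a_i

# Treatment in period 2 is more likely when a_i is high (selection)
treated <- as.numeric(a + rnorm(n_units) > 0)

beta0 <- 1; delta0 <- 0.5; beta1 <- 2   # assumed true parameters

# Period 1: nobody is treated; period 2: treated group gets x_i2 = 1
y1 <- beta0 + a + rnorm(n_units)
y2 <- beta0 + delta0 + beta1 * treated + a + rnorm(n_units)

# Diff-in-diff as a regression of the change on the treatment indicator
dy <- y2 - y1
did_fit <- lm(dy ~ treated)

# Diff-in-diff as a difference of mean changes
did_means <- mean(dy[treated == 1]) - mean(dy[treated == 0])

# Naive comparison using period 2 only
naive <- mean(y2[treated == 1]) - mean(y2[treated == 0])

c(did_regression = unname(coef(did_fit)["treated"]),
  did_means = did_means,
  naive_period2 = naive)
```

The regression slope and the difference in mean changes are the same number by construction, and both should be near the assumed effect of 2. The naive period-2 comparison should come out larger because it also picks up the difference in $a_i$ between the groups.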

Example: Lyall (2009)
• Does Russian shelling of villages cause insurgent attacks?
  $\text{attacks}_{it} = \beta_0 + \delta_0 d_t + \beta_1 \text{shelling}_{it} + a_i + u_{it}$
• We might think that artillery shelling by Russians is targeted to places where the insurgency is the strongest
• That is, part of the village fixed effect $a_i$ might be correlated with whether or not shelling occurs, $x_{it}$
• This would cause our pooled estimates to be biased
• Instead Lyall takes a diff-in-diff approach: compare attacks over time for shelled and non-shelled villages:
  $\Delta\text{attacks}_i = \delta_0 + \beta_1 \Delta\text{shelling}_i + \Delta u_i$