"A Course in Applied Econometrics"
Lecture 1: Estimation of Average Treatment Effects Under Unconfoundedness, Part I

Guido Imbens, IRP Lectures, UW Madison, August 2008

Outline
1. Introduction
2. Potential Outcomes
3. Estimands and Identification
4. Estimation and Inference

1. Introduction

We are interested in estimating the average effect of a program or treatment, allowing for heterogeneous effects, assuming that selection can be taken care of by adjusting for differences in observed covariates. This setting is of great applied interest.

There is a long literature, in both statistics and economics. Influential economics/econometrics papers include Ashenfelter and Card (1985), Barnow, Cain and Goldberger (1980), Card and Sullivan (1988), Dehejia and Wahba (1999), Hahn (1998), Heckman and Hotz (1989), Heckman and Robb (1985), and Lalonde (1986). In the statistics literature, key work is by Rubin (1974, 1978) and Rosenbaum and Rubin (1983).

This is an unusual case with many proposed (semi-parametric) estimators (matching, regression, propensity score, or combinations), many of which are actually used in practice. We discuss implementation, and assessment of the critical assumptions (even if they are not testable).

In practice, concern with overlap in the covariate distributions tends to be important. Once overlap issues are addressed, the choice of estimator is less important. Estimators combining matching and regression, or weighting and regression, are recommended for robustness reasons.

There is a key role for analysis of the joint distribution of the treatment indicator and covariates prior to using outcome data.
2. Potential Outcomes (Rubin, 1974)

We observe N units, indexed by i = 1, ..., N, viewed as drawn randomly from a large population. We postulate the existence for each unit of a pair of potential outcomes:

Y_i(0) for the outcome under the control treatment, and
Y_i(1) for the outcome under the active treatment.

Y_i(1) − Y_i(0) is the unit-level causal effect.

Covariates X_i (not affected by treatment).

Each unit is exposed to a single treatment: W_i = 0 if unit i receives the control treatment and W_i = 1 if unit i receives the active treatment. We observe for each unit the triple (W_i, Y_i, X_i), where Y_i is the realized outcome:

Y_i ≡ Y_i(W_i) = Y_i(0) if W_i = 0, and Y_i(1) if W_i = 1.

Several additional pieces of notation. First, the propensity score (Rosenbaum and Rubin, 1983) is defined as the conditional probability of receiving the treatment:

e(x) = Pr(W_i = 1 | X_i = x) = E[W_i | X_i = x].

Also the two conditional regression and variance functions:

μ_w(x) = E[Y_i(w) | X_i = x],   σ_w²(x) = V(Y_i(w) | X_i = x).

3. Estimands and Identification

Population average treatment effects:

τ_P = E[Y_i(1) − Y_i(0)],   τ_{P,T} = E[Y_i(1) − Y_i(0) | W_i = 1].

Most of the discussion in these notes will focus on τ_P, with extensions to τ_{P,T} available in the references. We will also look at the sample average treatment effect (SATE):

τ_S = (1/N) Σ_{i=1}^{N} (Y_i(1) − Y_i(0)).

Whether we target τ_P or τ_S does not matter for estimation, but it matters for the variance.

Assumption 1 (Unconfoundedness, Rosenbaum and Rubin, 1983a)

(Y_i(0), Y_i(1)) ⊥⊥ W_i | X_i.

Also known as the "conditional independence assumption" or "selection on observables"; in the missing data literature, "missing at random."

To see the link with standard exogeneity assumptions, assume a constant effect and a linear regression function:

Y_i(0) = α + X_i'β + ε_i  ⟹  Y_i = α + τ·W_i + X_i'β + ε_i,

with ε_i ⊥⊥ X_i. Given the constant treatment effect assumption, unconfoundedness is equivalent to independence of W_i and ε_i conditional on X_i, which would also capture the idea that W_i is exogenous.
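To fix ideas, here is a minimal simulation sketch in Python (the data-generating process, coefficients, and variable names are all hypothetical, not from the lectures). It constructs both potential outcomes, reveals only Y_i(W_i), and contrasts τ_S with the naive treated-control difference, which is biased when assignment depends on X_i:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000

    X = rng.normal(size=N)                     # single observed covariate
    e = 1.0 / (1.0 + np.exp(-X))               # true propensity score e(x)
    W = rng.binomial(1, e)                     # treatment assignment

    Y0 = X + rng.normal(size=N)                # potential outcome under control
    Y1 = Y0 + 1.0 + 0.5 * X                    # unit-level effect is 1 + 0.5*X_i
    Y = np.where(W == 1, Y1, Y0)               # realized outcome: only one is seen

    tau_S = (Y1 - Y0).mean()                   # SATE, observable only in a simulation
    naive = Y[W == 1].mean() - Y[W == 0].mean()
    print(f"tau_S = {tau_S:.3f}, naive difference = {naive:.3f}")

Because units with higher X_i are both more likely to be treated and have higher outcomes under this design, the naive difference overstates the effect; adjusting for X_i, as the estimators discussed below do, removes this bias under unconfoundedness.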
Motivation for Unconfoundedness Assumption (I)

The first is a statistical, data-descriptive motivation. A natural starting point in the evaluation of any program is a comparison of average outcomes for treated and control units. A logical next step is to adjust any difference in average outcomes for differences in exogenous background characteristics (exogenous in the sense of not being affected by the treatment). Such an analysis may not lead to the final word on the efficacy of the treatment, but the absence of such an analysis would seem difficult to rationalize in a serious attempt to understand the evidence regarding the effect of the treatment.

Motivation for Unconfoundedness Assumption (II)

A second argument is that almost any evaluation of a treatment involves comparisons of units who received the treatment with units who did not. The question is typically not whether such a comparison should be made, but rather which units should be compared, that is, which units best represent the treated units had they not been treated. It is clear that settings where some of the necessary covariates are not observed will require strong assumptions to allow for identification (e.g., instrumental variables settings). Absent those assumptions, typically only bounds can be identified (e.g., Manski, 1990, 1995).

Motivation for Unconfoundedness Assumption (III)

Example of a model that is consistent with unconfoundedness: suppose we are interested in estimating the average effect of a binary input on a firm's output, Y_i = g(W_i, ε_i). Suppose that profits are output minus costs, so the firm chooses

W_i = argmax_w E[π_i(w) | c_i] = argmax_w E[g(w, ε_i) − c_i · w | c_i],

implying

W_i = 1{E[g(1, ε_i) − g(0, ε_i) ≥ c_i | c_i]} = h(c_i).

If unobserved marginal costs c_i differ between firms, and these marginal costs are independent of the errors ε_i in the firms' forecast of output given inputs, then unconfoundedness will hold, as (g(0, ε_i), g(1, ε_i)) ⊥⊥ c_i.

Overlap

The second assumption concerns the joint distribution of treatments and covariates:

Assumption 2 (Overlap)

0 < Pr(W_i = 1 | X_i) < 1.

Rosenbaum and Rubin (1983a) refer to the combination of the two assumptions as "strongly ignorable treatment assignment."
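Overlap can be assessed from the joint distribution of (W_i, X_i) alone, before any outcome data are used. A minimal sketch, assuming a logit specification for the propensity score (the Newton-Raphson fit and all names are illustrative, not part of the lectures):

    import numpy as np

    def fit_logit(D, W, iters=25):
        # Newton-Raphson for the logit MLE; D includes an intercept column
        beta = np.zeros(D.shape[1])
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-D @ beta))
            grad = D.T @ (W - p)                          # score
            info = D.T @ (D * (p * (1.0 - p))[:, None])   # observed information
            beta += np.linalg.solve(info, grad)
        return beta

    # same hypothetical design as the earlier sketch
    rng = np.random.default_rng(0)
    N = 10_000
    X = rng.normal(size=N)
    W = rng.binomial(1, 1.0 / (1.0 + np.exp(-X)))

    D = np.column_stack([np.ones(N), X])
    e_hat = 1.0 / (1.0 + np.exp(-D @ fit_logit(D, W)))

    # Overlap diagnostic: estimated scores should stay away from 0 and 1,
    # and the two arms' score distributions should share common support.
    for w in (0, 1):
        q = np.quantile(e_hat[W == w], [0.0, 0.5, 1.0])
        print(f"W={w}: min/median/max of e_hat = {q.round(3)}")

In practice one would scrutinize, and possibly discard, units with estimated scores near 0 or 1, which is one way the analysis of treatment and covariates prior to using outcome data plays out.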
Alternative Assumptions

Instead of full conditional independence, one could assume only mean independence:

E[Y_i(w) | W_i, X_i] = E[Y_i(w) | X_i], for w = 0, 1.

Although this assumption is unquestionably weaker, in practice it is rare that a convincing case can be made for the weaker assumption without the case being equally strong for the stronger assumption. The reason is that the weaker assumption is intrinsically tied to functional form assumptions, and as a result one cannot identify average effects on transformations of the original outcome (e.g., logarithms) without the stronger assumption.

If we are interested in τ_{P,T}, it is sufficient to assume Y_i(0) ⊥⊥ W_i | X_i.

Identification Given Assumptions

τ(x) ≡ E[Y_i(1) − Y_i(0) | X_i = x]
     = E[Y_i(1) | X_i = x] − E[Y_i(0) | X_i = x]
     = E[Y_i(1) | X_i = x, W_i = 1] − E[Y_i(0) | X_i = x, W_i = 0]
     = E[Y_i | X_i = x, W_i = 1] − E[Y_i | X_i = x, W_i = 0],

where the second equality uses unconfoundedness. To make this feasible, one needs to be able to estimate the expectations E[Y_i | X_i = x, W_i = w] for all values of w and x in the support of these variables. This is where overlap is important. Given identification of τ(x),

τ_P = E[τ(X_i)].

Efficiency Bound

Hahn (1998): for any regular estimator of τ_P, denoted by τ̂, with

√N · (τ̂ − τ_P) →_d N(0, V),

the variance must satisfy

V ≥ E[ σ_1²(X_i)/e(X_i) + σ_0²(X_i)/(1 − e(X_i)) + (τ(X_i) − τ_P)² ].   (1)

Estimators exist that achieve this bound.

Propensity Score

Result 1 Suppose that Assumption 1 holds. Then:

(Y_i(0), Y_i(1)) ⊥⊥ W_i | e(X_i).

One then only needs to condition on a scalar function of the covariates, which would be much easier in practice if X_i is high-dimensional. (The problem is that the propensity score e(x) is almost never known.)
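As a concreteness check, the bound in (1) can be evaluated by Monte Carlo when the data-generating process is known. A sketch under the same hypothetical design as before, where e(x), the conditional variances, and τ(x) are available in closed form:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=1_000_000)         # draws from the covariate distribution

    e = 1.0 / (1.0 + np.exp(-X))           # e(x), known in this simulation
    sigma2_0 = np.ones_like(X)             # V(Y_i(0) | X_i = x) = 1 under this DGP
    sigma2_1 = np.ones_like(X)             # V(Y_i(1) | X_i = x) = 1 under this DGP
    tau_x = 1.0 + 0.5 * X                  # tau(x) = 1 + 0.5 x
    tau_P = 1.0                            # E[tau(X_i)], since E[X_i] = 0

    # plug-in Monte Carlo average of the expression inside E[.] in (1)
    V = np.mean(sigma2_1 / e + sigma2_0 / (1.0 - e) + (tau_x - tau_P) ** 2)
    print(f"semiparametric efficiency bound V = {V:.2f}")

The 1/e(X_i) and 1/(1 − e(X_i)) terms make the bound blow up as the propensity score approaches 0 or 1, which is the formal counterpart of the overlap concerns above.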
4. Estimation and Inference

Estimators

A. Regression Estimators
B. Matching
C. Propensity Score Estimators
D. Mixed Estimators (recommended)

A. Regression Estimators

Estimate μ_w(x) consistently and estimate τ_P or τ_S as

τ̂_reg = (1/N) Σ_{i=1}^{N} (μ̂_1(X_i) − μ̂_0(X_i)).

Simple implementations include μ_w(x) = β'x + τ·w, in which case the average treatment effect is equal to τ. In this case one can estimate τ simply by least squares estimation using the regression function

Y_i = α + β'X_i + τ·W_i + ε_i.

More generally, one can specify separate regression functions for the two regimes, μ_w(x) = β_w'x.

These simple regression estimators can be sensitive to differences in the covariate distributions for treated and control units. The reason is that in that case the regression estimators rely heavily on extrapolation.

Note that μ_0(x) is used to predict the missing outcomes for the treated. Hence on average one wishes to predict the control outcome at X̄_T = Σ_i W_i · X_i / N_T, the average covariate value for the treated. With a linear regression function, the average prediction can be written as Ȳ_C + β̂'(X̄_T − X̄_C).

If X̄_T and X̄_C are close, the precise specification of the regression function will not matter much for the average prediction. With the two averages very different, the prediction based on a linear regression function can be sensitive to changes in the specification.
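A minimal sketch of the more general variant, with separate linear regressions per treatment arm, continuing the hypothetical simulated data from the first sketch (helper names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000
    X = rng.normal(size=N)
    W = rng.binomial(1, 1.0 / (1.0 + np.exp(-X)))
    Y0 = X + rng.normal(size=N)
    Y1 = Y0 + 1.0 + 0.5 * X
    Y = np.where(W == 1, Y1, Y0)

    D = np.column_stack([np.ones(N), X])   # intercept + covariate

    def ols(D, y):
        # least-squares coefficients; mu_w-hat when fit within one arm
        return np.linalg.lstsq(D, y, rcond=None)[0]

    b1 = ols(D[W == 1], Y[W == 1])         # mu_1-hat, fit on treated only
    b0 = ols(D[W == 0], Y[W == 0])         # mu_0-hat, fit on controls only

    # tau_reg-hat: average the imputed contrast over the full sample
    tau_reg = np.mean(D @ b1 - D @ b0)
    print(f"tau_reg = {tau_reg:.3f}")      # close to tau_P = 1 in this design

Because the within-arm regressions are correctly specified here, τ̂_reg recovers τ_P; the extrapolation warning above concerns what happens when μ_w(x) is misspecified and X̄_T is far from X̄_C.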