treatment choice with many covariate values

Treatment choice with many covariate values (Aleksey Tetenov)



  1. Treatment choice with many covariate values
     Aleksey Tetenov (University of Bristol)
     Cemmap masterclass: Statistical decision theory for treatment choice and prediction
     May 30-31, 2017

  2. Stoye (2009), Proposition 4: If covariate \(X\) is continuously distributed, minimax regret is constant and does not decrease with sample size.
     This result changes if
     1. stronger assumptions are imposed on how average treatment response \(E[Y_t \mid X]\) varies with \(X\) (Stoye, 2012), or
     2. the set of feasible treatment rules is restricted.
     Source: Kitagawa and Tetenov (2017), "Who Should be Treated? Empirical Welfare Maximization Methods for Treatment Choice", Cemmap working paper CWP24/17.

  3. Regret is evaluated relative to the best implementable treatment rule.
     Stoye (2009, Proposition 4) assumes that any treatment allocation is feasible, including arbitrarily complex treatment rules. This is an unreasonable benchmark for public policies.
     Constraints frequently restrict the complexity and other characteristics of feasible treatment rules:
     ◮ Treatment rules are often publicly communicated to individuals and need to be understandable and transparent
     ◮ Monotonicity of treatment rules in some covariates is desirable (e.g., cannot treat the rich but not the poor)
     ◮ Some treatments may be capacity-constrained
     ◮ Other aggregate constraints (e.g., aggregate proportion treated cannot vary with race)

  4. Setup
     A Randomized Controlled Trial (RCT) sample:
     ◮ \(X_i \in \mathcal{X}\) - pre-treatment observed covariates
     ◮ \(D_i \in \{0, 1\}\) - randomized treatment
     ◮ \(Y_i \in \mathbb{R}\) - treatment outcome
     ◮ \(Y_{0,i}, Y_{1,i}\) - potential outcomes
     ◮ \(e(x) \in [\kappa, 1 - \kappa]\) - the probability of being randomized to treatment 1 in the experiment
     We consider a restricted set of treatment rules \(\mathcal{G}\). Each \(G \in \mathcal{G}\) specifies which subset of the population will be treated (after analyzing the experimental data):
     ◮ \(X \in G\) will be assigned to treatment 1
     ◮ \(X \notin G\) will be assigned to treatment 0
     (this excludes randomized/fractional treatment rules)
     \(\hat{G} \in \mathcal{G}\) - treatment rule as a function of the sample

  5. Empirical Welfare Maximization
     ◮ Estimate the policy directly by maximizing empirical welfare. The EWM treatment rule is
       \[ \hat{G}_{EWM} \equiv \arg\max_{G \in \mathcal{G}} W_n(G). \]
     ◮ The sample analogue
       \[ W_n(G) \equiv \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{Y_i D_i}{e(X_i)} \cdot 1\{X_i \in G\} + \frac{Y_i (1 - D_i)}{1 - e(X_i)} \cdot 1\{X_i \notin G\} \right] \]
       consistently estimates the population welfare of policy \(G\),
       \[ W(G) = E\left[ Y_1 \cdot 1\{X \in G\} + Y_0 \cdot 1\{X \notin G\} \right]. \]
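The sample-analogue welfare on this slide is an inverse-propensity-weighted average, which is easy to compute for any candidate rule. The following is a minimal sketch, assuming NumPy arrays as inputs; the function name `empirical_welfare`, its argument names, and the toy data are illustrative and not taken from the slides.

```python
import numpy as np

def empirical_welfare(y, d, e, in_g):
    """Sample-analogue welfare W_n(G) of the rule that treats units with in_g == True.

    y: observed outcomes, d: treatment indicators (0/1),
    e: propensity scores e(X_i), in_g: boolean mask 1{X_i in G}.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    in_g = np.asarray(in_g, dtype=bool)
    # Treated observations inside G, reweighted by 1 / e(X_i)
    treated_term = y * d / e * in_g
    # Control observations outside G, reweighted by 1 / (1 - e(X_i))
    control_term = y * (1.0 - d) / (1.0 - e) * (~in_g)
    return float(np.mean(treated_term + control_term))

# Hypothetical toy sample with propensity score 1/2
y = np.array([10.0, 4.0, 6.0, 2.0])
d = np.array([1, 0, 1, 0])
e = np.full(4, 0.5)

w_all = empirical_welfare(y, d, e, np.ones(4, dtype=bool))    # treat everyone
w_none = empirical_welfare(y, d, e, np.zeros(4, dtype=bool))  # treat no one
```

In this toy sample, treating everyone uses only the treated observations and treating no one uses only the controls, so the two values are the reweighted means of the treated and control outcomes respectively.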

  6. Empirical Illustration
     ◮ National Job Training Partnership Act (JTPA) Study (Bloom et al., 1997)
     ◮ Sample: 11,204 adult applicants
     ◮ Propensity score = 2/3 (probability of treatment)
     ◮ Outcome \(Y = D(Y_1 - \text{cost}) + (1 - D)Y_0\), considered in two versions:
       ◮ Total individual earnings in the 30-month period following treatment assignment
       ◮ Total earnings minus $774 (average cost of each treatment assignment, taking into account variable take-up)
     ◮ Covariates \(X\): years of education, pre-program earnings
     ◮ Average treatment effect: $1,157
     ◮ 95% CI: ($513, $1,801)

  7. Parametric plug-in treatment rule: estimate \(E(Y_1 \mid X)\) and \(E(Y_0 \mid X)\) by OLS. Assign treatment 1 if \(X'\beta_1 > X'\beta_0\).
     ◮ No cost: treat everyone, est. gain $1,157
     ◮ With $774 cost: treat 96%, est. gain $466 (per population member)
     [Figure: two panels, "OLS plug-in rule, no cost" and "OLS plug-in rule, $774 cost per assignee". Horizontal axis: years of education (6 to 18). Vertical axis: pre-program annual earnings ($0 to $20K). Shading: population density.]
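The plug-in rule on this slide can be sketched in a few lines: fit one linear regression per treatment arm and treat wherever the fitted treated outcome exceeds the fitted control outcome. This is only an illustration under stated assumptions: the function name `ols_plugin_rule`, the added intercept column, and the toy data are hypothetical, and the treatment cost is subtracted from treated outcomes as in the outcome definition of the empirical illustration.

```python
import numpy as np

def ols_plugin_rule(x, y, d, cost=0.0):
    """Return a boolean mask: True where the OLS plug-in rule assigns treatment 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    design = np.column_stack([np.ones(len(y)), x])  # intercept plus covariates
    treated = d == 1
    # Separate least-squares fits for the treated arm (net of cost) and the control arm
    beta1, *_ = np.linalg.lstsq(design[treated], y[treated] - cost, rcond=None)
    beta0, *_ = np.linalg.lstsq(design[~treated], y[~treated], rcond=None)
    return design @ beta1 > design @ beta0

# Hypothetical toy data: treated outcome equals x, control outcome is constant at 3.5
x = np.arange(8, dtype=float)
d = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y = np.where(d == 1, x, 3.5)

rule_no_cost = ols_plugin_rule(x, y, d)
rule_with_cost = ols_plugin_rule(x, y, d, cost=1.0)
```

Because the rule depends on the data only through the two fitted regressions, it inherits any misspecification of those regressions; adding a cost simply shifts the treated regression down and shrinks the treated set.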

  8. EWM linear rule: maximizes the sample analog of welfare among linear decision rules \(\hat{G} = 1\{X'\beta \ge 0\}\).
     ◮ No cost: treat 90%, est. gain $1,408. 95% CI: ($592, $2,225)
     ◮ $774 cost: treat 90%, est. gain $712. 95% CI: (-$107, $1,532)
     [Figure: one panel, "EWM linear rule, no cost or $774 cost per assignee". Horizontal axis: years of education (6 to 18). Vertical axis: pre-program annual earnings ($0 to $20K). Shading: population density.]

  9. EWM quadrant rule: select the best min or max threshold for each covariate, \(\hat{G} = 1\{x_1 > (<)\, t_1,\ x_2 > (<)\, t_2\}\).
     ◮ No cost: treat 93%, est. gain $1,277. 95% CI: ($519, $2,034)
     ◮ $774 cost: treat 83%, est. gain $687. 95% CI: (-$71, $1,445)
     [Figure: two panels, "EWM quadrant rule, no cost" and "EWM quadrant rule, $774 cost per assignee". Horizontal axis: years of education (6 to 18). Vertical axis: pre-program annual earnings ($0 to $20K). Shading: population density.]
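For two covariates, a quadrant rule can be found by exhaustive search over the inequality direction and the threshold for each covariate. The sketch below is a brute-force illustration on a tiny sample, not necessarily how the results on the slide were computed; the function name `ewm_quadrant_rule` and the toy data are assumptions made for this example.

```python
import itertools
import numpy as np

def ewm_quadrant_rule(x, y, d, e):
    """Brute-force EWM over rules {s1*x1 > s1*t1, s2*x2 > s2*t2}, with s1, s2 in {+1, -1}."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    # Per-observation contribution to W_n(G) if the unit is inside / outside G
    gain_if_treated = y * d / e
    gain_if_untreated = y * (1.0 - d) / (1.0 - e)
    # Candidate thresholds: observed values plus +/- infinity (covers treat-all and treat-none)
    thresholds = [
        np.concatenate(([-np.inf], np.unique(x[:, j]), [np.inf])) for j in range(2)
    ]
    best_welfare, best_rule, best_mask = -np.inf, None, None
    for s1, s2 in itertools.product((1.0, -1.0), repeat=2):
        for t1 in thresholds[0]:
            for t2 in thresholds[1]:
                mask = (s1 * x[:, 0] > s1 * t1) & (s2 * x[:, 1] > s2 * t2)
                welfare = float(np.mean(np.where(mask, gain_if_treated, gain_if_untreated)))
                if welfare > best_welfare:
                    best_welfare, best_rule, best_mask = welfare, (s1, t1, s2, t2), mask
    return best_welfare, best_rule, best_mask

# Hypothetical toy sample: treatment helps only when both covariates equal 1
x = np.array([[0, 0], [0, 0], [0, 1], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1]])
d = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 4.0, 1.0])
e = np.full(8, 0.5)

welfare, rule, mask = ewm_quadrant_rule(x, y, d, e)
```

The search evaluates every direction and threshold pair, so its cost grows with the square of the number of distinct covariate values; that is acceptable for a sketch but would need a smarter search on larger samples.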

  10. Non-parametric plug-in rule: bivariate kernel regression of \(Y_1 \mid X\) and \(Y_0 \mid X\) (rule-of-thumb (ROT) bandwidth).
      ◮ No cost: treat 82%, est. gain $1,867
      ◮ $774 cost: treat 69%, est. gain $1,257

  11. Welfare Criterion
      Object of interest: the policy with the highest utilitarian (additive) welfare.
      The outcome variable \(Y\) should reflect social preferences, so it may need to
      ◮ give different weight to different individuals
      ◮ non-linearly transform outcomes
      ◮ aggregate multi-dimensional outcomes
      ◮ subtract treatment costs from outcomes

  12. ◮ The utilitarian welfare of treatment rule \(G\) is
        \[ W(G) \equiv E\left[ Y_1 \cdot 1\{X \in G\} + Y_0 \cdot 1\{X \notin G\} \right] = E[Y_0] + E\left[ \tau(X) \cdot 1\{X \in G\} \right], \]
        where \(\tau(X) \equiv E(Y_1 - Y_0 \mid X)\) is the conditional treatment effect.
      ◮ We can equivalently work with the welfare gain of treating subset \(G\) relative to treating no one:
        \[ V(G) \equiv W(G) - W(\emptyset) = E\left[ \tau(X) \cdot 1\{X \in G\} \right]. \]
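The second expression for the welfare on this slide follows from replacing the indicator of not being in \(G\) by one minus the indicator of being in \(G\), and then applying the law of iterated expectations. A short derivation using only the definitions on the slide:

```latex
\begin{align*}
W(G) &= E\big[ Y_1 \cdot 1\{X \in G\} + Y_0 \cdot (1 - 1\{X \in G\}) \big] \\
     &= E[Y_0] + E\big[ (Y_1 - Y_0) \cdot 1\{X \in G\} \big] \\
     &= E[Y_0] + E\big[ E(Y_1 - Y_0 \mid X) \cdot 1\{X \in G\} \big] \\
     &= E[Y_0] + E\big[ \tau(X) \cdot 1\{X \in G\} \big].
\end{align*}
```

Setting \(G = \emptyset\) gives \(W(\emptyset) = E[Y_0]\), which is why the welfare gain \(V(G)\) drops the constant term and keeps only the treatment-effect term.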

  13. First Best treatment rule (with no constraints on \(G\)):
      \[ G^*_{FB} \equiv \{x : \tau(x) \ge 0\} \in \arg\max_{G \in \mathcal{B}(\mathcal{X})} W(G) = \arg\max_{G \in \mathcal{B}(\mathcal{X})} V(G). \]
      Second Best treatment rule, maximizing welfare in a constrained class \(\mathcal{G}\):
      \[ G^* \in \arg\max_{G \in \mathcal{G}} W(G) = \arg\max_{G \in \mathcal{G}} V(G). \]
      The maximized feasible welfare:
      \[ W^*_{\mathcal{G}} \equiv \sup_{G \in \mathcal{G}} W(G) \le W(G^*_{FB}). \]

  14. Assumptions: the distribution of \((Y_0, Y_1, D, X)\) is \(P \in \mathcal{P}\).
      The only assumption on the distribution of treatment response:
      ◮ Bounded Outcomes: \(Y_1, Y_0 \in \left[ -\frac{M}{2}, \frac{M}{2} \right]\), \(M < \infty\), implying \(|\tau(x)| \le M\), \(\forall x\).
      Restriction on experimental design (point-identifies \(\tau(x)\)):
      ◮ Strict Overlap: there exists \(\kappa > 0\) such that \(e(x) \in [\kappa, 1 - \kappa]\), \(\forall x\).
      Restriction on \(\mathcal{G}\):
      ◮ Complexity of Decision Sets: \(\mathcal{G}\) is a countable VC-class of subsets with finite VC-dimension \(v\), where \(v\) is the maximal number of points in \(\mathcal{X}\) that can be shattered by \(\mathcal{G}\).

  15. Examples of VC-classes \(\mathcal{G}\)
      Linear eligibility score:
      \[ \mathcal{G} = \left\{ \{x : x'\beta \ge 0\} : \beta \in \mathbb{R}^{d_x} \right\} \]
      has \(v = d_x + 1\).
      Generalized eligibility score:
      \[ \mathcal{G} = \left\{ \left\{ x : \sum_{k=1}^{m} a_k f_k(x) \ge g(x) \right\} : (a_1, \ldots, a_m) \in \mathbb{R}^m \right\} \]
      has \(v \le m + 1\).
      Multiple index rules:
      \[ \mathcal{G} = \left\{ \{x : (f_1(x_1) \le c_1) \cap \cdots \cap (f_K(x_K) \le c_K)\} : (c_1, \ldots, c_K) \in \mathbb{R}^K \right\} \]
      has \(v \le K + 1\).

  16. Upper bound on maximum regret of EWM
      Theorem 2.1: Let \(\mathcal{P}\) be a class of DGPs satisfying the Bounded Outcomes and Strict Overlap assumptions. Let \(\mathcal{G}\) be a VC-class of treatment choice rules. Then
      \[ \sup_{P \in \mathcal{P}} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \right] \le C_1 \frac{M}{\kappa} \sqrt{\frac{v}{n}}, \]
      where \(C_1\) is a universal constant.
      Remarks on rate bounds:
      ◮ This rate bound is valid whether \(G^*_{FB} \in \mathcal{G}\) or not.
      ◮ Parametric plug-in with misspecified regressions does not have such second-best optimality.

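  17. Proof: sketch
      For any \(\tilde{G} \in \mathcal{G}\),
      \[ \begin{aligned}
      W(\tilde{G}) - W(\hat{G}_{EWM}) &\le W_n(\hat{G}_{EWM}) - W_n(\tilde{G}) + W(\tilde{G}) - W(\hat{G}_{EWM}) \\
      &\le \left| W_n(\hat{G}_{EWM}) - W(\hat{G}_{EWM}) \right| + \left| W_n(\tilde{G}) - W(\tilde{G}) \right| \\
      &\le 2 \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right|.
      \end{aligned} \]
      The first inequality holds because \(\hat{G}_{EWM}\) maximizes \(W_n\) over \(\mathcal{G}\), so \(W_n(\hat{G}_{EWM}) - W_n(\tilde{G}) \ge 0\). Taking the supremum over \(\tilde{G} \in \mathcal{G}\) on the left-hand side gives
      \[ W^*_{\mathcal{G}} - W(\hat{G}_{EWM}) \le 2 \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right|. \]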

  18. Proof: sketch
      \(W_n(G) = E_n(f(\cdot; G))\) and \(W(G) = E(f(\cdot; G))\), where
      \[ f(\cdot; G) = \frac{Y_i D_i}{e(X_i)} \cdot 1\{X_i \in G\} + \frac{Y_i (1 - D_i)}{1 - e(X_i)} \cdot 1\{X_i \notin G\}. \]
      Lemma A.1: If \(\mathcal{G}\) is a VC-class of sets with VC-dimension \(v\) and \(g(\cdot)\), \(h(\cdot)\) are two given real-valued functions of observations, then the functions
      \[ \left\{ f(\cdot; G) = g(\cdot) \cdot 1\{x \in G\} + h(\cdot) \cdot 1\{x \notin G\},\ G \in \mathcal{G} \right\} \]
      form a VC-subgraph class with VC-dimension \(\le v\).
      Using this lemma, we can apply a well-known maximal inequality for centered empirical processes to
      \[ \sup_{G \in \mathcal{G}} \left| W_n(G) - W(G) \right| = \sup_{G \in \mathcal{G}} \left| E_n(f) - E(f) \right|. \]
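The identity \(W(G) = E(f(\cdot; G))\) used on this slide rests on the experimental design. A brief derivation follows, under two assumptions that are standard for an RCT but are not written out on the slide: the observed outcome is \(Y = D Y_1 + (1 - D) Y_0\), and \(D\) is independent of \((Y_0, Y_1)\) given \(X\) with \(P(D = 1 \mid X) = e(X)\).

```latex
\begin{align*}
E\left[ \frac{Y D}{e(X)} \,\middle|\, X \right]
  &= \frac{E[Y_1 D \mid X]}{e(X)}
   = \frac{E[Y_1 \mid X] \, E[D \mid X]}{e(X)}
   = E[Y_1 \mid X], \\
E\left[ \frac{Y (1 - D)}{1 - e(X)} \,\middle|\, X \right]
  &= \frac{E[Y_0 (1 - D) \mid X]}{1 - e(X)}
   = E[Y_0 \mid X], \\
E\big[ f(\cdot; G) \big]
  &= E\big[ E[Y_1 \mid X] \cdot 1\{X \in G\} + E[Y_0 \mid X] \cdot 1\{X \notin G\} \big]
   = W(G).
\end{align*}
```

Strict Overlap keeps both denominators bounded away from zero, which, together with Bounded Outcomes, keeps \(f(\cdot; G)\) bounded.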

  19. Lower bound on minimax regret
      Theorem 2.2: Let \(\mathcal{P}\) be a class of DGPs satisfying Bounded Outcomes and Strict Overlap. Let \(\mathcal{G}\) be a VC-class of treatment choice rules. Then, for any treatment choice rule \(\hat{G}\),
      \[ \sup_{P \in \mathcal{P}} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}) \right] \ge \frac{M}{2} e^{-2\sqrt{2}} \sqrt{\frac{v}{n}} \quad \text{for all } n \ge 16v. \]
      Remarks on rate bounds:
      ◮ Both are finite-sample bounds (but not sharp).
      ◮ \(\hat{G}_{EWM}\) is minimax rate optimal: no \(\hat{G}\) has maximum regret converging to zero at a faster rate uniformly over \(\mathcal{P}\).
      ◮ EWM is minimax rate optimal even when \(v\) grows with \(n\).
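To get a feel for the magnitude and the rate of the lower bound, it can be evaluated numerically. The sketch below assumes the bound reads (M/2) · exp(−2√2) · √(v/n) for n ≥ 16v, which is how the theorem on this slide is read here; the function name `minimax_regret_lower_bound` and the example values of M, v and n are hypothetical.

```python
import math

def minimax_regret_lower_bound(m, v, n):
    """Evaluate (M/2) * exp(-2*sqrt(2)) * sqrt(v/n), stated only for n >= 16 * v."""
    if n < 16 * v:
        raise ValueError("the bound is stated only for n >= 16 * v")
    return 0.5 * m * math.exp(-2.0 * math.sqrt(2.0)) * math.sqrt(v / n)

# Illustrative values: outcome range M = 1, VC-dimension v = 3
b_small = minimax_regret_lower_bound(m=1.0, v=3, n=1_000)
b_large = minimax_regret_lower_bound(m=1.0, v=3, n=4_000)

# Quadrupling the sample size halves the bound (root-n rate)
ratio = b_small / b_large
```

The bound scales linearly in M and with the square root of v/n, matching the rate of the upper bound for EWM up to constants and the factor 1/κ.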

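  20. Proof: sketch
      For the lower bound, we adapt the argument in Lugosi (2002):
      \[ \begin{aligned}
      \sup_{P \in \mathcal{P}} E_{P^n}\left[ W^*_{\mathcal{G}} - W(G_n) \right]
      &\ge \sup_{P \in \mathcal{P}^*} E_{P^n}\left[ W^*_{\mathcal{G}} - W(G_n) \right] \\
      &\ge \int_{\mathcal{P}^*} E_{P^n}\left[ W^*_{\mathcal{G}} - W(G_n) \right] d\mu(P) \\
      &\ge \int_{\mathcal{P}^*} E_{P^n}\left[ W^*_{\mathcal{G}} - W(\hat{G}_{bayes}) \right] d\mu(P),
      \end{aligned} \]
      where \(\mathcal{P}^* \subset \mathcal{P}\) is a class of DGPs that has a discrete support of \(X\) with \(v\) points and \(\tau(x) = \gamma\) or \(-\gamma\).
      For a uniform prior \(\mu\), the Bayes risk can be analytically computed as a function of \(\gamma\). Setting \(\gamma = \sqrt{v/n}\) gives the lower bound.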
