
What is to be done? Two attempts using Gaussian process priors

Maximilian Kasy, Department of Economics, Harvard University. Oct 14 2017.


  1. What is to be done? Two attempts using Gaussian process priors. Maximilian Kasy, Department of Economics, Harvard University. Oct 14 2017

  2. What questions should econometricians work on?
     ◮ Incentives of the publication process:
       ◮ Appeal to referees from the same subfield.
       ◮ Danger of self-referentiality, untethering from external relevance.
     ◮ Versus broader usefulness:
       ◮ Tools useful for empirical researchers, policy makers.
       ◮ Anchored in substantive applications, broader methodological considerations.
     ◮ One way to get there: Well defined decision problems.

  3. Decision problems
     ◮ Objects to carefully choose:
       ◮ Objective function.
       ◮ Space of possible decisions / policy alternatives.
       ◮ Identifying assumptions.
       ◮ Prior information.
       ◮ Features the priors should be uninformative about.
     ◮ Once these are specified, coherent and well-behaved solutions can be derived.
     ◮ Useful tool for tractable solutions without functional form restrictions: Gaussian process priors.

  4. Outline of this talk
     ◮ Brief introduction to Gaussian process regression
     ◮ Application 1: Optimal treatment assignment in experiments.
       ◮ Setting: Treatment assignment given baseline covariates
       ◮ General decision theory result: Non-random rules dominate random rules
       ◮ Prior for expectation of potential outcomes given covariates
       ◮ Expression for MSE of estimator for ATE to minimize by treatment assignment
     ◮ Application 2: Optimal insurance and taxation.
       ◮ Economic setting: Co-insurance rate for health insurance
       ◮ Statistical setting: prior for behavioral average response function
       ◮ Expression for posterior expected social welfare to maximize by choice of co-insurance rate

  5. References
     Williams, C. and Rasmussen, C. (2006). Gaussian processes for machine learning. MIT Press, chapter 2.
     Kasy, M. (2016). Why experimenters might not always want to randomize, and what they could do instead. Political Analysis, 24(3):324–338.
     Kasy, M. (2017). Optimal taxation and insurance using machine learning. Working Paper, Harvard University.

  6. Brief introduction to Gaussian process regression
     ◮ Suppose we observe $n$ i.i.d. draws of $(Y_i, X_i)$, where $Y_i$ is real valued and $X_i$ is a $k$ vector.
     ◮ $Y_i = f(X_i) + \varepsilon_i$
     ◮ $\varepsilon_i \mid X, f(\cdot) \sim N(0, \sigma^2)$
     ◮ Prior: $f$ is distributed according to a Gaussian process, $f \mid X \sim GP(0, C)$, where $C$ is a covariance kernel, $\mathrm{Cov}(f(x), f(x') \mid X) = C(x, x')$.
     ◮ We will leave conditioning on $X$ implicit.

  7. Gaussian process regression: Posterior mean
     ◮ The joint distribution of $(f(x), Y)$ is given by
       $$\begin{pmatrix} f(x) \\ Y \end{pmatrix} \sim N\left(0, \begin{pmatrix} C(x, x) & c(x) \\ c(x)' & C + \sigma^2 I_n \end{pmatrix}\right),$$
       where
       ◮ $c(x)$ is the $n$ vector with entries $C(x, X_i)$,
       ◮ and $C$ is the $n \times n$ matrix with entries $C_{i,j} = C(X_i, X_j)$.
     ◮ Therefore
       $$E[f(x) \mid Y] = c(x) \cdot (C + \sigma^2 I_n)^{-1} \cdot Y.$$
     ◮ Read: $\hat f(\cdot) = E[f(\cdot) \mid Y]$
       ◮ is a linear combination of the functions $C(\cdot, X_i)$
       ◮ with weights $(C + \sigma^2 I_n)^{-1} \cdot Y$.
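The posterior mean formula on this slide can be sketched numerically. The slides leave the kernel $C$ unspecified, so the squared-exponential kernel below is an assumption, and the names `sq_exp_kernel`, `gp_posterior_mean`, and `length_scale` are hypothetical. The simulated data are for illustration only.

```python
import numpy as np


def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel C(x, x') = exp(-|x - x'|^2 / (2 l^2)).

    Assumption: the slides do not fix a kernel; this is one common choice.
    A is (m, k), B is (n, k); returns the (m, n) matrix of kernel values.
    """
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior_mean(X, Y, X_new, sigma2, length_scale=1.0):
    """E[f(x) | Y] = c(x) (C + sigma^2 I_n)^{-1} Y for each row x of X_new."""
    n = X.shape[0]
    C = sq_exp_kernel(X, X, length_scale)      # n x n, C_ij = C(X_i, X_j)
    c = sq_exp_kernel(X_new, X, length_scale)  # each row is c(x) for one new x
    # Weights (C + sigma^2 I_n)^{-1} Y, computed with a solve rather than an inverse.
    weights = np.linalg.solve(C + sigma2 * np.eye(n), Y)
    return c @ weights


# Simulated example (illustrative data, not from the talk).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=30)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
f_hat = gp_posterior_mean(X, Y, X_new, sigma2=0.01)
```

Because the weights do not depend on the evaluation point, they are computed once and reused for every row of `X_new`, which mirrors the "linear combination of the functions $C(\cdot, X_i)$" reading on the slide.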

  8. Gaussian process regression: Both applications use Gaussian process priors
     1. Optimal experimental design
        ◮ How to assign treatment to minimize mean squared error for treatment effect estimators?
        ◮ Gaussian process prior for the conditional expectation of potential outcomes given covariates.
     2. Optimal insurance and taxation
        ◮ How to choose a co-insurance rate or tax rate to maximize social welfare, given (quasi-)experimental data?
        ◮ Gaussian process prior for the behavioral response function mapping the co-insurance rate into the tax base.

  9. Experimental design: Application 1, “Why experimenters might not always want to randomize”
     Setup
     1. Sampling: random sample of $n$ units; baseline survey ⇒ vector of covariates $X_i$
     2. Treatment assignment: binary treatment assigned by $D_i = d_i(X, U)$; $X$ matrix of covariates; $U$ randomization device
     3. Realization of outcomes: $Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0$
     4. Estimation: estimator $\hat\beta$ of the (conditional) average treatment effect, $\beta = \frac{1}{n} \sum_i E[Y_i^1 - Y_i^0 \mid X_i, \theta]$

  10. Experimental design: Questions
      ◮ How should we assign treatment?
      ◮ In particular, if $X_i$ has continuous or many discrete components?
      ◮ How should we estimate $\beta$?
      ◮ What is the role of prior information?

  11. Experimental design: Some intuition
      ◮ “Compare apples with apples” ⇒ balance covariate distribution.
      ◮ Not just balance of means!
      ◮ We don’t add random noise to estimators – why add random noise to experimental designs?
      ◮ Identification requires controlled trials (CTs), but not randomized controlled trials (RCTs).

  12. Experimental design: General decision problem allowing for randomization
      ◮ General decision problem:
        ◮ State of the world $\theta$, observed data $X$, randomization device $U \perp X$,
        ◮ decision procedure $\delta(X, U)$, loss $L(\delta(X, U), \theta)$.
      ◮ Conditional expected loss of decision procedure $\delta(X, U)$:
        $$R(\delta, \theta \mid U = u) = E[L(\delta(X, u), \theta) \mid \theta]$$
      ◮ Bayes risk:
        $$R^B(\delta, \pi) = \int \int R(\delta, \theta \mid U = u) \, d\pi(\theta) \, dP(u)$$
      ◮ Minimax risk:
        $$R^{mm}(\delta) = \int \max_\theta R(\delta, \theta \mid U = u) \, dP(u)$$

  13. Experimental design: Theorem (Optimality of deterministic decisions)
      Consider a general decision problem. Let $R^*$ equal $R^B$ or $R^{mm}$. Then:
      1. The optimal risk $R^*(\delta^*)$, when considering only deterministic procedures $\delta(X)$, is no larger than the optimal risk when allowing for randomized procedures $\delta(X, U)$.
      2. If the optimal deterministic procedure $\delta^*$ is unique, then it has strictly lower risk than any non-trivial randomized procedure.

  14. Experimental design: Proof
      ◮ Any probability distribution $P(u)$ satisfies
        ◮ $\sum_u P(u) = 1$, $P(u) \geq 0$ for all $u$.
        ◮ Thus $\sum_u R_u \cdot P(u) \geq \min_u R_u$ for any set of values $R_u$.
      ◮ Let $\delta^u(x) = \delta(x, u)$.
      ◮ Then
        $$R^B(\delta, \pi) = \sum_u \int R(\delta^u, \theta) \, d\pi(\theta) \, P(u) \geq \min_u \int R(\delta^u, \theta) \, d\pi(\theta) = \min_u R^B(\delta^u, \pi).$$
      ◮ Similarly
        $$R^{mm}(\delta) = \sum_u \max_\theta R(\delta^u, \theta) \, P(u) \geq \min_u \max_\theta R(\delta^u, \theta) = \min_u R^{mm}(\delta^u).$$
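The key step of the proof, that a probability-weighted average of risks can never fall below the smallest of them, can be checked on a toy example. The three risk values below are hypothetical numbers chosen for illustration, and `randomized_risk` is a hypothetical helper name.

```python
def randomized_risk(risks, probs):
    """Risk of a randomized procedure: the P(u)-weighted average of the risks
    of the deterministic procedures obtained by fixing U = u."""
    if abs(sum(probs) - 1.0) > 1e-12 or min(probs) < 0:
        raise ValueError("probs must be a probability distribution")
    return sum(r * p for r, p in zip(risks, probs))


# Hypothetical risks R_u of three deterministic procedures.
risks = [0.9, 0.4, 0.7]

# Randomizing uniformly over the three procedures.
mixed = randomized_risk(risks, [1 / 3, 1 / 3, 1 / 3])

# Always using the best deterministic procedure.
best_deterministic = min(risks)

# The average can only match the minimum by putting all mass on a minimizer.
degenerate = randomized_risk(risks, [0.0, 1.0, 0.0])
```

In this toy case the uniform mixture has strictly higher risk than the best deterministic rule, and only the degenerate distribution attains the minimum, which is the content of the second part of the theorem when the minimizer is unique.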

  15. Experimental design: Bayesian setup
      ◮ Back to experimental design setting.
      ◮ Conditional distribution of potential outcomes: for $d = 0, 1$
        $$Y_i^d \mid X_i = x \sim N(f(x, d), \sigma^2).$$
      ◮ Gaussian process prior: $f \sim GP(\mu, C)$,
        $$E[f(x, d)] = \mu(x, d)$$
        $$\mathrm{Cov}(f(x_1, d_1), f(x_2, d_2)) = C((x_1, d_1), (x_2, d_2))$$
      ◮ Conditional average treatment effect (CATE):
        $$\beta = \frac{1}{n} \sum_i E[Y_i^1 - Y_i^0 \mid X_i, \theta] = \frac{1}{n} \sum_i \left( f(X_i, 1) - f(X_i, 0) \right).$$

  16. Experimental design: Notation
      ◮ Covariance matrix $C$, where $C_{i,j} = C((X_i, D_i), (X_j, D_j))$
      ◮ Mean vector $\mu$, components $\mu_i = \mu(X_i, D_i)$
      ◮ Covariance of observations with CATE, written $\bar C$ here to distinguish the vector from the matrix $C$:
        $$\bar C_i = \mathrm{Cov}(Y_i, \beta \mid X, D) = \frac{1}{n} \sum_j \left( C((X_i, D_i), (X_j, 1)) - C((X_i, D_i), (X_j, 0)) \right).$$

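  17. Experimental design: Posterior expectation and risk
      ◮ The posterior expectation $\hat\beta$ of $\beta$ equals
        $$\hat\beta = \mu_\beta + \bar C' \cdot (C + \sigma^2 I)^{-1} \cdot (Y - \mu),$$
        where $\bar C$ is the vector with entries $\bar C_i = \mathrm{Cov}(Y_i, \beta \mid X, D)$ and $C$ is the covariance matrix of the observations.
      ◮ The corresponding risk equals
        $$R^B(d, \hat\beta \mid X) = \mathrm{Var}(\beta \mid X, Y) = \mathrm{Var}(\beta \mid X) - \mathrm{Var}(E[\beta \mid X, Y] \mid X) = \mathrm{Var}(\beta \mid X) - \bar C' \cdot (C + \sigma^2 I)^{-1} \cdot \bar C.$$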
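A minimal sketch of the posterior expectation and risk, under assumptions the slides do not make: the prior mean is set to zero (so the prior mean of $\beta$ is zero too), and the kernel on $(x, d)$ pairs is a hypothetical product of a squared-exponential kernel in $x$ and a factor `rho` that discounts covariance across treatment arms. The function names and the simulated data are illustrative only.

```python
import numpy as np


def kernel(X1, D1, X2, D2, length_scale=1.0, rho=0.5):
    """Hypothetical kernel on (x, d) pairs: squared-exponential in x, times
    1 if both units are in the same treatment arm and rho otherwise."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    same_arm = D1[:, None] == D2[None, :]
    return np.exp(-0.5 * sq_dist / length_scale**2) * np.where(same_arm, 1.0, rho)


def cate_posterior(X, D, Y, sigma2):
    """Posterior mean and Bayes risk for beta, assuming prior mean mu = 0."""
    n = X.shape[0]
    ones, zeros = np.ones(n, dtype=int), np.zeros(n, dtype=int)
    C = kernel(X, D, X, D)  # covariance matrix of the observed (X_i, D_i)
    # Covariance of each observation with beta:
    # (1/n) sum_j [C((X_i, D_i), (X_j, 1)) - C((X_i, D_i), (X_j, 0))]
    C_bar = (kernel(X, D, X, ones) - kernel(X, D, X, zeros)).mean(axis=1)
    # Prior variance of beta = (1/n) sum_i [f(X_i, 1) - f(X_i, 0)].
    var_beta = (kernel(X, ones, X, ones) - kernel(X, ones, X, zeros)
                - kernel(X, zeros, X, ones) + kernel(X, zeros, X, zeros)).mean()
    # Solve (C + sigma^2 I) A = [Y, C_bar] once for both right-hand sides.
    A = np.linalg.solve(C + sigma2 * np.eye(n), np.column_stack([Y, C_bar]))
    beta_hat = C_bar @ A[:, 0]         # posterior expectation of beta
    risk = var_beta - C_bar @ A[:, 1]  # posterior variance of beta
    return beta_hat, risk


# Simulated example (illustrative data, not from the talk).
rng = np.random.default_rng(1)
n = 20
X = rng.normal(size=(n, 2))
D = np.arange(n) % 2
Y = X[:, 0] + 1.0 * D + rng.normal(scale=0.5, size=n)
beta_hat, risk = cate_posterior(X, D, Y, sigma2=0.25)
```

Note that the risk depends on the design `D` and the covariates `X` but not on the outcomes `Y`, which is what makes it possible to choose the treatment assignment before any outcomes are observed.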

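  18. Experimental design: Discrete optimization
      ◮ The optimal design solves
        $$\max_d \; \bar C' \cdot (C + \sigma^2 I)^{-1} \cdot \bar C,$$
        where $\bar C$ is the vector of covariances $\mathrm{Cov}(Y_i, \beta \mid X, D)$ and $C$ is the covariance matrix of the observations; both depend on the assignment $d$.
      ◮ Possible optimization algorithms:
        1. Search over random $d$
        2. Greedy algorithm
        3. Simulated annealing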
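The first two algorithms on this slide might look roughly as follows. This is a sketch under stated assumptions, not the author's implementation: the kernel on $(x, d)$ pairs is the same hypothetical product kernel with cross-arm factor `rho`, the greedy rule (flip one unit at a time, keep the flip if the objective rises) is one plausible reading of "greedy algorithm", and simulated annealing is not shown.

```python
import numpy as np


def kernel(X1, D1, X2, D2, length_scale=1.0, rho=0.5):
    """Hypothetical kernel on (x, d) pairs: squared-exponential in x, times
    1 if both units are in the same treatment arm and rho otherwise."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    same_arm = D1[:, None] == D2[None, :]
    return np.exp(-0.5 * sq_dist / length_scale**2) * np.where(same_arm, 1.0, rho)


def objective(X, D, sigma2):
    """The quadratic form maximized on the slide, evaluated at design D."""
    n = X.shape[0]
    ones, zeros = np.ones(n, dtype=int), np.zeros(n, dtype=int)
    C = kernel(X, D, X, D)
    C_bar = (kernel(X, D, X, ones) - kernel(X, D, X, zeros)).mean(axis=1)
    return C_bar @ np.linalg.solve(C + sigma2 * np.eye(n), C_bar)


def random_search(X, sigma2, n_draws, rng):
    """Algorithm 1: evaluate n_draws random assignments and keep the best."""
    n = X.shape[0]
    candidates = [rng.integers(0, 2, size=n) for _ in range(n_draws)]
    values = [objective(X, d, sigma2) for d in candidates]
    best = int(np.argmax(values))
    return candidates[best], values[best]


def greedy(X, sigma2, d_start):
    """Algorithm 2: flip one unit at a time, keeping flips that improve the objective."""
    d = d_start.copy()
    best = objective(X, d, sigma2)
    improved = True
    while improved:
        improved = False
        for i in range(len(d)):
            d[i] = 1 - d[i]
            value = objective(X, d, sigma2)
            if value > best + 1e-12:
                best, improved = value, True
            else:
                d[i] = 1 - d[i]  # undo the flip
    return d, best


# Simulated covariates (illustrative data, not from the talk).
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 2))
d_random, v_random = random_search(X, sigma2=0.25, n_draws=50, rng=rng)
d_greedy, v_greedy = greedy(X, sigma2=0.25, d_start=d_random)
```

Starting the greedy pass from the best random draw guarantees it does no worse than random search, though it may still stop at a local optimum.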

  19. Experimental design: Special case linear separable model
      ◮ Suppose $f(x, d) = x' \cdot \gamma + d \cdot \beta$, $\gamma \sim N(0, \Sigma)$, and we estimate $\beta$ using comparison of means.
      ◮ Bias of $\hat\beta$ equals $(\bar X^1 - \bar X^0)' \cdot \gamma$, prior expected squared bias $(\bar X^1 - \bar X^0)' \cdot \Sigma \cdot (\bar X^1 - \bar X^0)$.
      ◮ Mean squared error
        $$MSE(d_1, \ldots, d_n) = \sigma^2 \cdot \left[ \frac{1}{n_1} + \frac{1}{n_0} \right] + (\bar X^1 - \bar X^0)' \cdot \Sigma \cdot (\bar X^1 - \bar X^0).$$
      ◮ ⇒ Risk is minimized by
        1. choosing treatment and control arms of equal size,
        2. and optimizing balance as measured by the difference in covariate means $(\bar X^1 - \bar X^0)$.
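The mean squared error formula for this special case is easy to evaluate directly. The sketch below uses a hypothetical function name and a four-unit toy dataset with a single covariate, chosen so that one assignment balances the covariate means exactly and the other does not.

```python
import numpy as np


def mse_difference_in_means(X, d, sigma2, Sigma):
    """sigma^2 (1/n1 + 1/n0) + (Xbar1 - Xbar0)' Sigma (Xbar1 - Xbar0)."""
    d = np.asarray(d, dtype=bool)
    n1, n0 = d.sum(), (~d).sum()
    if n1 == 0 or n0 == 0:
        raise ValueError("both arms need at least one unit")
    # Difference in covariate means between treatment and control arms.
    diff = X[d].mean(axis=0) - X[~d].mean(axis=0)
    return sigma2 * (1.0 / n1 + 1.0 / n0) + diff @ Sigma @ diff


# Toy data: one covariate taking the values 0, 1, 2, 3.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Sigma = np.array([[1.0]])

# Treating units 0 and 3 balances the covariate means (both 1.5).
mse_balanced = mse_difference_in_means(X, [1, 0, 0, 1], sigma2=1.0, Sigma=Sigma)

# Treating units 0 and 1 leaves a mean difference of -2.
mse_unbalanced = mse_difference_in_means(X, [1, 1, 0, 0], sigma2=1.0, Sigma=Sigma)
```

Both assignments have arms of equal size, so the variance term is the same; the whole gap between the two values comes from the squared-bias term, which vanishes under exact balance.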

  20. Optimal insurance: Application 2, “Optimal insurance and taxation using machine learning”
      Economic setting
      ◮ Population of insured individuals $i$.
      ◮ $Y_i$: health care expenditures of individual $i$.
      ◮ $T_i$: share of health care expenditures covered by the insurance; $1 - T_i$: coinsurance rate; $Y_i \cdot (1 - T_i)$: out-of-pocket expenditures
      ◮ Behavioral response to share covered: structural function $Y_i = g(T_i, \varepsilon_i)$.
      ◮ Per capita expenditures under policy $t$: average structural function $m(t) = E[g(t, \varepsilon_i)]$.
