Shrinkage
Econ 2148, Fall 2019
Applications of Gaussian process priors

Maximilian Kasy
Department of Economics, Harvard University
Agenda: applications from my own work

◮ Optimal treatment assignment in experiments.
  ◮ Setting: treatment assignment given baseline covariates.
  ◮ General decision theory result: non-random rules dominate random rules.
  ◮ Prior for the expectation of potential outcomes given covariates.
  ◮ Expression for the MSE of an estimator of the ATE, to be minimized by the choice of treatment assignment.
◮ Optimal insurance and taxation.
  ◮ Review: the envelope theorem.
  ◮ Economic setting: co-insurance rate for health insurance.
  ◮ Statistical setting: prior for the behavioral average response function.
  ◮ Expression for posterior expected social welfare, to be maximized by the choice of co-insurance rate.
Applications use Gaussian process priors

1. Optimal experimental design
   ◮ How to assign treatment to minimize the mean squared error of a treatment effect estimator?
   ◮ Gaussian process prior for the conditional expectation of potential outcomes given covariates.

2. Optimal insurance and taxation
   ◮ How to choose a co-insurance rate or tax rate to maximize social welfare, given (quasi-)experimental data?
   ◮ Gaussian process prior for the behavioral response function mapping the co-insurance rate into the tax base.
Experimental design
Application 1: "Why experimenters might not always want to randomize"

Setup
1. Sampling: random sample of $n$ units; a baseline survey yields a vector of covariates $X_i$.
2. Treatment assignment: binary treatment assigned by $D_i = d_i(X, U)$, where $X$ is the matrix of covariates and $U$ is a randomization device.
3. Realization of outcomes: $Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0$.
4. Estimation: estimator $\widehat{\beta}$ of the (conditional) average treatment effect,
   $$\beta = \frac{1}{n} \sum_i E[Y_i^1 - Y_i^0 \mid X_i, \theta].$$
Questions

◮ How should we assign treatment?
◮ In particular, what if $X_i$ has continuous or many discrete components?
◮ How should we estimate $\beta$?
◮ What is the role of prior information?
Some intuition

◮ "Compare apples with apples" ⇒ balance the covariate distribution.
◮ Not just balance of means!
◮ We don't add random noise to estimators; why add random noise to experimental designs?
◮ Identification requires controlled trials (CTs), but not randomized controlled trials (RCTs).
General decision problem allowing for randomization

◮ General decision problem:
  ◮ state of the world $\theta$, observed data $X$, randomization device $U \perp X$,
  ◮ decision procedure $\delta(X, U)$, loss $L(\delta(X, U), \theta)$.
◮ Conditional expected loss of decision procedure $\delta(X, U)$:
  $$R(\delta, \theta \mid U = u) = E[L(\delta(X, u), \theta) \mid \theta].$$
◮ Bayes risk:
  $$R^B(\delta, \pi) = \iint R(\delta, \theta \mid U = u) \, d\pi(\theta) \, dP(u).$$
◮ Minimax risk:
  $$R^{mm}(\delta) = \int \max_\theta R(\delta, \theta \mid U = u) \, dP(u).$$
Theorem (Optimality of deterministic decisions)
Consider a general decision problem. Let $R^*$ equal $R^B$ or $R^{mm}$. Then:

1. The optimal risk $R^*(\delta^*)$, when considering only deterministic procedures $\delta(X)$, is no larger than the optimal risk when allowing for randomized procedures $\delta(X, U)$.
2. If the optimal deterministic procedure $\delta^*$ is unique, then it has strictly lower risk than any non-trivial randomized procedure.
Practice problem
Prove this. Hints:

◮ Assume for simplicity that $U$ has finite support.
◮ Note that a (weighted) average of numbers is always at least as large as their minimum.
◮ Write the risk (Bayes or minimax) of any randomized assignment rule as a (weighted) average of the risks of deterministic rules.
Solution

◮ Any probability distribution $P(u)$ satisfies $\sum_u P(u) = 1$ and $P(u) \geq 0$ for all $u$.
◮ Thus $\sum_u R_u \cdot P(u) \geq \min_u R_u$ for any set of values $R_u$.
◮ Let $\delta_u(x) = \delta(x, u)$.
◮ Then
  $$R^B(\delta, \pi) = \sum_u \int R(\delta_u, \theta) \, d\pi(\theta) \, P(u) \geq \min_u \int R(\delta_u, \theta) \, d\pi(\theta) = \min_u R^B(\delta_u, \pi).$$
◮ Similarly
  $$R^{mm}(\delta) = \sum_u \max_\theta R(\delta_u, \theta) \, P(u) \geq \min_u \max_\theta R(\delta_u, \theta) = \min_u R^{mm}(\delta_u).$$
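The averaging argument is easy to check numerically. A minimal sketch, with made-up risk values and a made-up randomization distribution:

```python
import numpy as np

# Risks of five hypothetical deterministic rules delta_u, and a distribution
# P(u) for the randomization device. The randomized rule's risk is the
# P(u)-weighted average, which can never fall below the best deterministic risk.
rng = np.random.default_rng(0)
R_u = rng.uniform(size=5)
P_u = rng.dirichlet(np.ones(5))
assert P_u @ R_u >= R_u.min() - 1e-12
```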
Bayesian setup

◮ Back to the experimental design setting.
◮ Conditional distribution of potential outcomes: for $d = 0, 1$,
  $$Y_i^d \mid X_i = x \sim N(f(x, d), \sigma^2).$$
◮ Gaussian process prior: $f \sim GP(\mu, C)$,
  $$E[f(x, d)] = \mu(x, d), \qquad \mathrm{Cov}(f(x_1, d_1), f(x_2, d_2)) = C((x_1, d_1), (x_2, d_2)).$$
◮ Conditional average treatment effect (CATE):
  $$\beta = \frac{1}{n} \sum_i E[Y_i^1 - Y_i^0 \mid X_i, \theta] = \frac{1}{n} \sum_i f(X_i, 1) - f(X_i, 0).$$
Notation:

◮ Covariance matrix $\mathbf{C}$, where $\mathbf{C}_{i,j} = C((X_i, D_i), (X_j, D_j))$.
◮ Mean vector $\boldsymbol{\mu}$, with components $\mu_i = \mu(X_i, D_i)$.
◮ Covariance of the observations with the CATE,
  $$\overline{C}_i = \mathrm{Cov}(Y_i, \beta \mid X, D) = \frac{1}{n} \sum_j \left( C((X_i, D_i), (X_j, 1)) - C((X_i, D_i), (X_j, 0)) \right).$$

Practice problem
◮ Derive the posterior expectation $\widehat{\beta}$ of $\beta$.
◮ Derive the risk of any deterministic treatment assignment vector $d$, assuming
  1. the estimator $\widehat{\beta}$ is used, and
  2. the loss function $(\widehat{\beta} - \beta)^2$ is considered.
Solution

◮ The posterior expectation $\widehat{\beta}$ of $\beta$ equals
  $$\widehat{\beta} = \mu_\beta + \overline{C}' \cdot (\mathbf{C} + \sigma^2 I)^{-1} \cdot (Y - \boldsymbol{\mu}).$$
◮ The corresponding risk equals
  $$R^B(d, \widehat{\beta} \mid X) = \mathrm{Var}(\beta \mid X, Y) = \mathrm{Var}(\beta \mid X) - \mathrm{Var}(E[\beta \mid X, Y] \mid X) = \mathrm{Var}(\beta \mid X) - \overline{C}' \cdot (\mathbf{C} + \sigma^2 I)^{-1} \cdot \overline{C}.$$
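These formulas translate directly into code. Below is a minimal sketch of the Bayes risk of a given assignment vector $d$, assuming a squared-exponential prior covariance; the kernel and its hyperparameters are illustrative choices, not part of the slides.

```python
import numpy as np

def kernel(x1, d1, x2, d2, ls=1.0, var=1.0):
    # Squared-exponential prior covariance C((x1, d1), (x2, d2)).
    # Treating d as an extra coordinate is one simple (illustrative) choice.
    sq = np.sum((np.asarray(x1) - np.asarray(x2)) ** 2) + (d1 - d2) ** 2
    return var * np.exp(-0.5 * sq / ls ** 2)

def bayes_risk(X, d, sigma2=1.0):
    """Posterior variance of the CATE for assignment d (lower is better)."""
    n = len(d)
    # C[i, j] = C((X_i, D_i), (X_j, D_j))
    C = np.array([[kernel(X[i], d[i], X[j], d[j]) for j in range(n)]
                  for i in range(n)])
    # Cbar[i] = Cov(Y_i, beta | X, D)
    Cbar = np.array([np.mean([kernel(X[i], d[i], X[j], 1)
                              - kernel(X[i], d[i], X[j], 0)
                              for j in range(n)])
                     for i in range(n)])
    # Var(beta | X) = (1/n^2) sum_{i,j} Cov(f(X_i,1)-f(X_i,0), f(X_j,1)-f(X_j,0))
    V = np.mean([[kernel(X[i], 1, X[j], 1) - kernel(X[i], 1, X[j], 0)
                  - kernel(X[i], 0, X[j], 1) + kernel(X[i], 0, X[j], 0)
                  for j in range(n)] for i in range(n)])
    return V - Cbar @ np.linalg.solve(C + sigma2 * np.eye(n), Cbar)
```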
Discrete optimization

◮ The optimal design solves
  $$\max_d \overline{C}' \cdot (\mathbf{C} + \sigma^2 I)^{-1} \cdot \overline{C}.$$
◮ Possible optimization algorithms (see the sketch below):
  1. search over random $d$,
  2. greedy algorithm,
  3. simulated annealing.
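A minimal sketch of the greedy option, reusing the `bayes_risk` helper from the previous sketch; the starting assignment, number of passes, and the simulated covariates in the usage example are all illustrative.

```python
import numpy as np

def greedy_design(X, n_passes=3, sigma2=1.0):
    """Greedy local search: flip one unit's treatment whenever the flip
    lowers the Bayes risk. A heuristic; random search and simulated
    annealing are the alternatives listed above."""
    n = len(X)
    d = np.arange(n) % 2                 # arbitrary starting assignment
    best = bayes_risk(X, d, sigma2)
    for _ in range(n_passes):
        for i in range(n):
            d[i] = 1 - d[i]              # tentative flip
            risk = bayes_risk(X, d, sigma2)
            if risk < best:
                best = risk              # keep the improvement
            else:
                d[i] = 1 - d[i]          # revert
    return d, best

# Usage with simulated covariates:
# rng = np.random.default_rng(0)
# X = rng.normal(size=(20, 2))
# d_opt, risk = greedy_design(X)
```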
Variation of the problem

Practice problem
◮ Suppose that the researcher insists on estimating $\beta$ using a simple comparison of means,
  $$\widehat{\beta} = \frac{1}{n_1} \sum_i D_i Y_i - \frac{1}{n_0} \sum_i (1 - D_i) Y_i.$$
◮ Derive again the risk of any deterministic treatment assignment vector $d$, assuming
  1. the estimator $\widehat{\beta}$ is used, and
  2. the loss function $(\widehat{\beta} - \beta)^2$ is considered.
Solution

◮ Notation: let $\mu_i^d = \mu(X_i, d)$ and $C_{i,j}^{d_1, d_2} = C((X_i, d_1), (X_j, d_2))$.
◮ Collect these terms in the vectors $\mu^d$ and matrices $C^{d_1, d_2}$, and let
  $$\tilde{\mu} = (\mu^0, \mu^1), \qquad \tilde{C} = \begin{pmatrix} C^{00} & C^{01} \\ C^{10} & C^{11} \end{pmatrix}.$$
◮ Weights $w = (w^0, w^1)$,
  $$w_i^1 = \frac{d_i}{n_1} - \frac{1}{n}, \qquad w_i^0 = -\frac{1 - d_i}{n_0} + \frac{1}{n}.$$
◮ Risk: sum of variance and prior expected squared bias,
  $$R^B(d, \widehat{\beta} \mid X) = \sigma^2 \cdot \left( \frac{1}{n_1} + \frac{1}{n_0} \right) + (w' \cdot \tilde{\mu})^2 + w' \cdot \tilde{C} \cdot w.$$
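A sketch of this risk formula in code, again reusing the illustrative `kernel` from above; the stacking order of the $(0, 1)$ blocks is an implementation choice that just has to match between $w$, $\tilde{\mu}$, and $\tilde{C}$.

```python
import numpy as np

def comparison_of_means_risk(X, d, sigma2=1.0, mu=lambda x, t: 0.0):
    """Risk of the difference-in-means estimator under assignment d:
    sampling variance plus prior expected squared bias."""
    n = len(d)
    n1 = int(d.sum())
    n0 = n - n1
    # stacked weights w = (w^0, w^1)
    w0 = -(1 - d) / n0 + 1.0 / n
    w1 = d / n1 - 1.0 / n
    w = np.concatenate([w0, w1])
    # stacked prior mean mu_tilde = (mu^0, mu^1)
    mu_t = np.array([mu(X[i], t) for t in (0, 1) for i in range(n)])
    # block covariance C_tilde with blocks C^{d1 d2}
    C_t = np.array([[kernel(X[i], t1, X[j], t2)
                     for t2 in (0, 1) for j in range(n)]
                    for t1 in (0, 1) for i in range(n)])
    return sigma2 * (1 / n1 + 1 / n0) + (w @ mu_t) ** 2 + w @ C_t @ w
```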
Special case: linear separable model

◮ Suppose $f(x, d) = x' \cdot \gamma + d \cdot \beta$, $\gamma \sim N(0, \Sigma)$, and we estimate $\beta$ using the comparison of means.
◮ The bias of $\widehat{\beta}$ equals $(\overline{X}^1 - \overline{X}^0)' \cdot \gamma$, so the prior expected squared bias is $(\overline{X}^1 - \overline{X}^0)' \cdot \Sigma \cdot (\overline{X}^1 - \overline{X}^0)$.
◮ Mean squared error:
  $$MSE(d_1, \ldots, d_n) = \sigma^2 \cdot \left( \frac{1}{n_1} + \frac{1}{n_0} \right) + (\overline{X}^1 - \overline{X}^0)' \cdot \Sigma \cdot (\overline{X}^1 - \overline{X}^0).$$
◮ ⇒ Risk is minimized by
  1. choosing treatment and control arms of equal size, and
  2. optimizing balance as measured by the difference in covariate means, $\overline{X}^1 - \overline{X}^0$ (see the sketch below).
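Under this linear model, the "search over random $d$" option from the optimization list amounts to rerandomization: draw many equal-sized splits and keep the best-balanced one. A minimal sketch, with an illustrative prior covariance $\Sigma$ supplied by the user:

```python
import numpy as np

def linear_model_mse(X, d, Sigma, sigma2=1.0):
    """MSE of the difference in means under the linear separable model."""
    n1 = int(d.sum())
    n0 = len(d) - n1
    diff = X[d == 1].mean(axis=0) - X[d == 0].mean(axis=0)
    return sigma2 * (1 / n1 + 1 / n0) + diff @ Sigma @ diff

def best_of_random_splits(X, Sigma, n_draws=1000, seed=0):
    """Draw many equal-sized treatment/control splits; keep the one with
    the smallest MSE, i.e. the best covariate balance."""
    rng = np.random.default_rng(seed)
    n = len(X)
    best_d, best_mse = None, np.inf
    for _ in range(n_draws):
        d = np.zeros(n, dtype=int)
        d[rng.choice(n, n // 2, replace=False)] = 1
        mse = linear_model_mse(X, d, Sigma)
        if mse < best_mse:
            best_d, best_mse = d, mse
    return best_d, best_mse
```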
Envelope theorem
Review for application 2: the envelope theorem

◮ Policy parameter $t$.
◮ Vector of individual choices $x$.
◮ Choice set $X$.
◮ Individual utility $\upsilon(x, t)$.
◮ Realized choices:
  $$x(t) \in \operatorname*{argmax}_{x \in X} \upsilon(x, t).$$
◮ Realized utility:
  $$V(t) = \max_{x \in X} \upsilon(x, t) = \upsilon(x(t), t).$$
◮ Let $x^* = x(t^*)$ for some fixed $t^*$.
◮ Define
  $$\tilde{V}(t) = V(t) - \upsilon(x^*, t) = \upsilon(x(t), t) - \upsilon(x(t^*), t) = \max_{x \in X} \upsilon(x, t) - \upsilon(x^*, t).$$
◮ The definition of $\tilde{V}$ immediately implies:
  ◮ $\tilde{V}(t) \geq 0$ for all $t$, and $\tilde{V}(t^*) = 0$.
◮ Thus $t^*$ is a global minimizer of $\tilde{V}$.
◮ If $\tilde{V}$ is differentiable at $t^*$, then $\tilde{V}'(t^*) = 0$.
◮ Thus
  $$V'(t^*) = \frac{\partial}{\partial t} \upsilon(x^*, t) \Big|_{t = t^*}.$$
◮ Behavioral responses don't matter for the effect of a policy change on individual utility!
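A quick numerical check of the envelope formula, using a made-up utility function with a closed-form argmax; the functional form is purely illustrative.

```python
import numpy as np

def u(x, t):
    # Illustrative utility; the argmax over x solves -2(x - t) + t = 0, so x(t) = 1.5 t.
    return -(x - t) ** 2 + t * x

def V(t, grid=np.linspace(-10.0, 10.0, 200001)):
    return np.max(u(grid, t))        # realized utility V(t) = max_x u(x, t)

t_star, h = 1.0, 1e-4
x_star = 1.5 * t_star                # x* = x(t*)

dV = (V(t_star + h) - V(t_star - h)) / (2 * h)                  # total derivative V'(t*)
du = (u(x_star, t_star + h) - u(x_star, t_star - h)) / (2 * h)  # partial du(x*, t)/dt at t*
print(dV, du)   # both approximately 2.5: behavioral responses drop out
```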
Optimal insurance
Application 2: "Optimal insurance and taxation using machine learning"

Economic setting
◮ Population of insured individuals $i$.
◮ $Y_i$: health care expenditures of individual $i$.
◮ $T_i$: share of health care expenditures covered by the insurance.
  $1 - T_i$: co-insurance rate; $Y_i \cdot (1 - T_i)$: out-of-pocket expenditures.
◮ Behavioral response to the share covered: structural function
  $$Y_i = g(T_i, \varepsilon_i).$$
◮ Per capita expenditures under policy $t$: average structural function
  $$m(t) = E[g(t, \varepsilon_i)].$$
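As a preview of the statistical part of this application, here is a minimal sketch of estimating $m(t)$ by GP regression from experimental variation in $T_i$; the zero prior mean, the squared-exponential kernel, and all parameter values are illustrative assumptions, as is the structural function in the usage example.

```python
import numpy as np

def gp_posterior_mean(T, Y, t_grid, ls=0.1, var=1.0, sigma2=1.0):
    """Posterior mean of m(t) under a GP prior with zero prior mean and a
    squared-exponential covariance kernel (illustrative hyperparameters)."""
    K = var * np.exp(-0.5 * (T[:, None] - T[None, :]) ** 2 / ls ** 2)
    k_star = var * np.exp(-0.5 * (t_grid[:, None] - T[None, :]) ** 2 / ls ** 2)
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(T)), Y)
    return k_star @ alpha

# Usage with simulated data (hypothetical structural function g):
# rng = np.random.default_rng(1)
# T = rng.uniform(size=100)
# Y = 1.0 + 0.8 * T + rng.normal(scale=0.5, size=100)
# m_hat = gp_posterior_mean(T, Y, np.linspace(0.0, 1.0, 50))
```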