Approximate Cross-Validation and Dynamic Experiments for Policy Choice

Maximilian Kasy
Department of Economics, Harvard University

April 23, 2018
Introduction

◮ Two separate, early-stage projects:

1. Approximate cross-validation
   ◮ First-order approximation to the leave-one-out estimator.
   ◮ Relationship to Stein's unbiased risk estimator.
   ◮ Accelerated tuning.
   ◮ Joint with Lester Mackey, MSR.

2. Dynamic experiments for policy choice
   ◮ Experimental design problem for choosing a discrete treatment.
   ◮ Goal: maximize average outcome.
   ◮ Multiple waves.
   ◮ Joint with Anja Sautmann, J-PAL.

◮ Feedback appreciated!
Project 1: Approximate cross-validation

◮ Different ways of estimating risk (mean squared error):
   ◮ Covariance penalties,
   ◮ Stein's Unbiased Risk Estimate (SURE),
   ◮ Cross-validation (CV).

◮ Result 1:
   ◮ Consider repeated draws of some vector.
   ◮ Then CV for estimating the mean is approximately equal to SURE.
   ◮ Without normality, unknown variance!

◮ Result 2:
   ◮ Consider a penalized M-estimation problem.
   ◮ Then CV for prediction loss is approximately equal to in-sample risk plus a penalty,
   ◮ with a simple penalty based on the gradient and Hessian.

◮ ⇒ Algorithm for accelerated tuning!
The normal means model

◮ $\theta, X \in \mathbb{R}^k$
◮ $X \sim N(\theta, \Sigma)$
◮ Estimator $\hat\theta(X)$ of $\theta$ ("almost differentiable")
◮ Mean squared error:
$$MSE(\hat\theta, \theta) = \tfrac{1}{k}\, E_\theta\big[\|\hat\theta - \theta\|^2\big] = \tfrac{1}{k}\sum_j E_\theta\big[(\hat\theta_j - \theta_j)^2\big].$$
◮ Would like to estimate $MSE(\hat\theta, \theta)$:
   ◮ Choose tuning parameters to minimize estimated MSE.
   ◮ Choose between estimators to minimize estimated MSE.
   ◮ Theoretical tool for proving dominance results.
Covariance penalty

◮ Efron (2004): Adding and subtracting $\theta_j$ gives
$$(\hat\theta_j - X_j)^2 = (\hat\theta_j - \theta_j)^2 + 2\,(\hat\theta_j - \theta_j)(\theta_j - X_j) + (\theta_j - X_j)^2.$$
◮ Thus $MSE(\hat\theta, \theta) = \tfrac{1}{k}\sum_j MSE_j$, where
$$\begin{aligned}
MSE_j &= E_\theta\big[(\hat\theta_j - \theta_j)^2\big] \\
&= E_\theta\big[(\hat\theta_j - X_j)^2\big] + 2\,E_\theta\big[(\hat\theta_j - \theta_j)(X_j - \theta_j)\big] - E_\theta\big[(X_j - \theta_j)^2\big] \\
&= E_\theta\big[(\hat\theta_j - X_j)^2\big] + 2\,\mathrm{Cov}_\theta(\hat\theta_j, X_j) - \mathrm{Var}_\theta(X_j).
\end{aligned}$$
◮ First term: in-sample prediction error (observed).
◮ Second term: covariance penalty (depends on unobserved $\theta$).
◮ Third term: doesn't depend on $\hat\theta$.
Stein's Unbiased Risk Estimate

◮ Using partial integration and the fact that $\varphi'(x) = -x \cdot \varphi(x)$, one can show
$$MSE = \tfrac{1}{k}\, E_\theta\Big[\|\hat\theta - X\|^2 + 2\,\mathrm{trace}\big(\hat\theta' \cdot \Sigma\big) - \mathrm{trace}(\Sigma)\Big],$$
where $\hat\theta'$ denotes the Jacobian of $\hat\theta(X)$ with respect to $X$.
◮ All terms on the right-hand side are observed! Sample version:
$$SURE = \tfrac{1}{k}\Big[\|\hat\theta - X\|^2 + 2\,\mathrm{trace}\big(\hat\theta' \cdot \Sigma\big) - \mathrm{trace}(\Sigma)\Big].$$
◮ Key assumptions that we used:
   ◮ $X$ is normally distributed.
   ◮ $\Sigma$ is known.
   ◮ $\hat\theta$ is almost differentiable.
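As a hedged sanity check (not part of the slides), the following Python sketch evaluates SURE for a simple linear shrinkage estimator $\hat\theta(X) = cX$, for which the Jacobian trace is available in closed form; the shrinkage factor $c$, the dimension $k$, and $\Sigma = \sigma^2 I$ are illustrative assumptions.

```python
import numpy as np

# Illustrative setup (not from the slides): theta_hat(X) = c * X with Sigma = sigma2 * I_k.
rng = np.random.default_rng(0)
k, sigma2, c = 50, 1.0, 0.7
theta = rng.normal(size=k)

def sure(x):
    # SURE = (1/k) [ ||theta_hat - x||^2 + 2 * trace(Jacobian * Sigma) - trace(Sigma) ].
    # For theta_hat(x) = c * x the Jacobian is c * I, so trace(Jacobian * Sigma) = c * sigma2 * k.
    theta_hat = c * x
    return (np.sum((theta_hat - x) ** 2) + 2 * c * sigma2 * k - sigma2 * k) / k

# Monte Carlo check: SURE is unbiased for the true MSE of the shrinkage estimator.
draws = [rng.normal(theta, np.sqrt(sigma2)) for _ in range(2000)]
avg_sure = np.mean([sure(x) for x in draws])
true_mse = np.mean([np.sum((c * x - theta) ** 2) / k for x in draws])
print(avg_sure, true_mse)   # approximately equal
```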
Cross-validation

◮ Assume a panel structure: $X$ is a sample average, $i = 1,\dots,n$ and $j = 1,\dots,k$,
$$X = \tfrac{1}{n}\sum_i Y_i, \qquad Y_i \sim \text{i.i.d.}\,(\theta,\, n \cdot \Sigma).$$
◮ Leave-one-out mean and estimator:
$$X_{-i} = \tfrac{1}{n-1}\sum_{i' \neq i} Y_{i'}, \qquad \hat\theta_{-i} = \hat\theta(X_{-i}).$$
◮ $n$-fold cross-validation:
$$CV_i = \|Y_i - \hat\theta_{-i}\|^2, \qquad CV = \tfrac{1}{n}\sum_i CV_i.$$
Large n: SURE ≈ CV

Proposition
Suppose $\hat\theta(\cdot)$ is continuously differentiable in a neighborhood of $\theta$, and suppose $X^n = \tfrac{1}{n}\sum_i Y_i^n$ with $(Y_i^n - \theta)/\sqrt{n}$ i.i.d. with expectation 0 and variance $\Sigma$. Let $\hat\Sigma^n = \tfrac{1}{n^2}\sum_i (Y_i^n - X^n)(Y_i^n - X^n)'$. Then
$$CV^n = \|X^n - \hat\theta^n\|^2 + 2\,\mathrm{trace}\big(\hat\theta' \cdot \hat\Sigma^n\big) + (n-1)\,\mathrm{trace}(\hat\Sigma^n) + o_p(1)$$
as $n \to \infty$.

◮ New result, I believe.
◮ "For large n, CV is the same as SURE, plus the irreducible forecasting error"
$$n \cdot \mathrm{trace}(\Sigma) = E_\theta\big[\|Y_i - \theta\|^2\big].$$
◮ Does not require
   ◮ normality,
   ◮ known $\Sigma$!
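A hedged numerical illustration of the proposition (not from the slides), again for a linear shrinkage estimator of the mean so that $\mathrm{trace}(\hat\theta' \cdot \hat\Sigma^n)$ is available in closed form; the sample size, dimension, shrinkage factor, and data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, c = 500, 5, 0.8                        # illustrative sizes and shrinkage factor
theta = rng.normal(size=k)
Y = theta + rng.normal(size=(n, k))          # Y_i i.i.d. with mean theta (the proposition does not require normality)
X = Y.mean(axis=0)

def theta_hat(x):                            # illustrative estimator: linear shrinkage toward zero
    return c * x

# Exact n-fold (leave-one-out) cross-validation.
cv = 0.0
for i in range(n):
    X_minus_i = (n * X - Y[i]) / (n - 1)
    cv += np.sum((Y[i] - theta_hat(X_minus_i)) ** 2)
cv /= n

# Right-hand side of the proposition: SURE-like expression with the estimated Sigma_hat.
centered = Y - X
Sigma_hat = centered.T @ centered / n**2
jacobian_trace_term = 2 * c * np.trace(Sigma_hat)        # trace(theta_hat' * Sigma_hat), Jacobian = c * I
rhs = np.sum((X - theta_hat(X)) ** 2) + jacobian_trace_term + (n - 1) * np.trace(Sigma_hat)

print(cv, rhs)   # close for large n
```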
Sketch of proof

◮ Let $s = \sqrt{n-1}$, omit the superscript $n$, and write
$$U_i = \tfrac{1}{s}(Y_i - X), \quad U_i \sim (0, \Sigma), \quad X_{-i} = X - \tfrac{1}{s}U_i, \quad Y_i = X + sU_i,$$
$$\hat\theta(X_{-i}) = \hat\theta(X) - \tfrac{1}{s}\hat\theta'(X) \cdot U_i + \Delta_i, \quad \Delta_i = o\big(\tfrac{1}{s}U_i\big), \quad \hat\Sigma = \tfrac{1}{n}\sum_i U_i U_i'.$$
◮ Then
$$\begin{aligned}
CV_i = \|Y_i - \hat\theta_{-i}\|^2 &= \big\|X + sU_i - \big(\hat\theta(X) - \tfrac{1}{s}\hat\theta'(X) \cdot U_i + \Delta_i\big)\big\|^2 \\
&= \|X - \hat\theta\|^2 + 2\big\langle X - \hat\theta,\, \big(s + \tfrac{1}{s}\hat\theta'\big)U_i\big\rangle + s^2\|U_i\|^2 \\
&\quad + 2\big\langle U_i,\, \hat\theta'(X) \cdot U_i\big\rangle + \tfrac{1}{s^2}\big\|\hat\theta'(X) \cdot U_i\big\|^2 + 2\big\langle \Delta_i,\, Y_i - \hat\theta_{-i}\big\rangle + \dots
\end{aligned}$$
$$CV = \tfrac{1}{n}\sum_i CV_i = \|X - \hat\theta\|^2 + 2\,\mathrm{trace}\big(\hat\theta' \cdot \hat\Sigma\big) + (n-1)\,\mathrm{trace}(\hat\Sigma) + 0 + o_p(1).$$
More general setting: penalized M-estimation

◮ Suppose $\beta = \mathrm{argmin}_b\, E[m(X, b)]$.
◮ Estimate $\beta$ using penalized M-estimation,
$$\hat\beta(\lambda) = \mathrm{argmin}_b \sum_i m(X_i, b) + \pi(b, \lambda).$$
◮ Would like to choose $\lambda$ to minimize the out-of-sample prediction error
$$R(\lambda) = E\big[m\big(X, \hat\beta(\lambda)\big)\big].$$
◮ Leave-one-out estimator, n-fold cross-validation:
$$\hat\beta_{-i}(\lambda) = \mathrm{argmin}_b \sum_{j \neq i} m(X_j, b) + \pi(b, \lambda), \qquad CV(\lambda) = \tfrac{1}{n}\sum_i m\big(X_i, \hat\beta_{-i}(\lambda)\big).$$
◮ Computationally costly to re-estimate $\beta$ for every choice of $i$ and $\lambda$!
◮ Notation for the Hessian and gradients:
$$H = \sum_j m_{bb}\big(X_j, \hat\beta(\lambda)\big) + \pi_{bb}\big(\hat\beta(\lambda), \lambda\big), \qquad g_i = m_b\big(X_i, \hat\beta(\lambda)\big).$$
◮ First-order approximation to the leave-one-out estimator (assuming second derivatives exist):
$$\hat\beta_{-i}(\lambda) - \hat\beta(\lambda) \approx H^{-1} \cdot g_i.$$
◮ In-sample prediction error:
$$\bar R(\lambda) = \tfrac{1}{n}\sum_i m\big(X_i, \hat\beta(\lambda)\big).$$
◮ Another first-order approximation:
$$CV(\lambda) \approx \bar R(\lambda) + \tfrac{1}{n}\sum_i g_i \cdot \big(\hat\beta_{-i}(\lambda) - \hat\beta(\lambda)\big).$$
◮ Combining the two approximations:
$$CV(\lambda) \approx \bar R(\lambda) + \tfrac{1}{n}\sum_i g_i^t \cdot H^{-1} \cdot g_i.$$
◮ $\bar R$, $g_i$, and $H$ are automatically available if Newton-Raphson was used for finding $\hat\beta(\lambda)$!
◮ If not, they could be approximated without bias using a random subsample.
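As an illustration of the accelerated-tuning idea (a sketch under assumptions, not the authors' implementation), the following compares exact leave-one-out CV with the approximation $\bar R(\lambda) + \tfrac{1}{n}\sum_i g_i^t H^{-1} g_i$ for ridge-penalized least squares, taking $m(X_i, b) = (y_i - z_i'b)^2$ and $\pi(b,\lambda) = \lambda\|b\|^2$; the data-generating process and the penalty level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 200, 10, 5.0                    # illustrative sample size, dimension, penalty
Z = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = Z @ beta_true + rng.normal(size=n)

def ridge(Z, y, lam):
    # beta_hat(lam) = argmin_b sum_i (y_i - z_i'b)^2 + lam * ||b||^2, in closed form.
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

beta_hat = ridge(Z, y, lam)
resid = y - Z @ beta_hat

# In-sample prediction error R_bar, per-observation gradients g_i, and full Hessian H.
R_bar = np.mean(resid ** 2)
G = -2 * Z * resid[:, None]                 # row i is g_i = m_b(X_i, beta_hat)
H = 2 * Z.T @ Z + 2 * lam * np.eye(p)       # sum_j m_bb + pi_bb

# Approximate CV: R_bar + (1/n) * sum_i g_i' H^{-1} g_i.
acv = R_bar + np.mean(np.einsum('ij,ij->i', G @ np.linalg.inv(H), G))

# Exact leave-one-out CV for comparison (the expensive computation being approximated).
cv = np.mean([
    (y[i] - Z[i] @ ridge(np.delete(Z, i, axis=0), np.delete(y, i), lam)) ** 2
    for i in range(n)
])
print(acv, cv)   # the approximation tracks exact n-fold CV closely when leverages are small
```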
Open questions

◮ Implementation!
◮ Regularity conditions for the validity of the approximations?
◮ Gains in speed when tuning, e.g., neural nets?
◮ Gains in efficiency relative to wasteful sample-partition methods?
Project 2: Dynamic experiments for policy choice

◮ Setup:
   ◮ Optimal treatment assignment (multiple treatments)
   ◮ in multi-wave experiments.
◮ Goal: After the experiment, choose a policy
   ◮ to maximize welfare (average outcome net of costs).
◮ Dynamic stochastic optimization problem,
   ◮ used normatively (for the experimenter) rather than descriptively (as in structural models).
◮ Solution via exact backward induction.
◮ Outline:
   1. Setup: $\bar d$ treatments, binary outcomes, $T$ waves
   2. Objective function: social welfare, max over treatments
   3. Independent Beta priors for mean potential outcomes
   4. Value functions, backward induction
Setup

◮ Waves $t = 1,\dots,T$, sample sizes $N_t$.
◮ Treatment $D \in \{1,\dots,\bar d\}$, outcomes $Y \in \{0,1\}$, potential outcomes $Y^d$,
$$Y_{it} = \sum_{d=1}^{\bar d} 1(D_{it} = d)\, Y^d_{it}.$$
◮ $(Y^1_{it},\dots,Y^{\bar d}_{it})$ are i.i.d. across both $i$ and $t$.
◮ Denote
$$\theta^d = E\big[Y^d\big], \qquad n^d_t = \sum_i 1(D_{it} = d), \qquad s^d_t = \sum_i 1(D_{it} = d,\, Y_{it} = 1).$$
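To make the notation concrete, here is a minimal simulation sketch of a single wave (not from the slides); the number of treatments, wave size, assignment rule, and success probabilities $\theta^d$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_bar, N_t = 3, 120                          # illustrative number of treatments and wave size
theta = np.array([0.3, 0.5, 0.6])            # illustrative success probabilities theta^d

# Treatment assignment for one wave (here: uniform random assignment, purely for illustration).
D = rng.integers(1, d_bar + 1, size=N_t)

# Realized outcomes: Y_it equals the potential outcome Y^d_it for the assigned d, Y^d_it ~ Bernoulli(theta^d).
Y = rng.binomial(1, theta[D - 1])

# Per-wave statistics: units assigned to each d (n_t^d) and successes under each d (s_t^d).
n_t = np.array([np.sum(D == d) for d in range(1, d_bar + 1)])
s_t = np.array([np.sum((D == d) & (Y == 1)) for d in range(1, d_bar + 1)])
print(n_t, s_t)
```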
Treatment assignment, outcomes, state space

◮ Treatment assignment in wave $t$: $n_t = (n^1_t,\dots,n^{\bar d}_t)$.
◮ Outcomes of wave $t$: $s_t = (s^1_t,\dots,s^{\bar d}_t)$.
◮ Cumulative versions:
$$M_t = \sum_{t' \le t} N_{t'}, \qquad m_t = (m^1_t,\dots,m^{\bar d}_t) = \sum_{t' \le t} n_{t'}, \qquad r_t = (r^1_t,\dots,r^{\bar d}_t) = \sum_{t' \le t} s_{t'}.$$
◮ The relevant information for the experimenter in period $t+1$ is summarized by $m_t$ and $r_t$.
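The outline above mentions independent Beta priors for the $\theta^d$; as a hedged sketch of how the state $(m_t, r_t)$ is used, the following updates such priors from the cumulative counts and picks the treatment with the highest posterior mean, as one would when choosing a policy after the final wave. The prior parameters and state values are illustrative assumptions, and the backward-induction step for choosing assignments is not shown here.

```python
import numpy as np

# Illustrative prior: theta^d ~ Beta(alpha0, beta0), independently across treatments.
alpha0, beta0 = 1.0, 1.0

# Illustrative cumulative state after some wave t: units assigned (m_t) and successes (r_t).
m_t = np.array([40, 40, 40])
r_t = np.array([12, 21, 25])

# With Bernoulli outcomes, the posterior for theta^d depends on the data only through (m_t^d, r_t^d):
# theta^d | data ~ Beta(alpha0 + r_t^d, beta0 + m_t^d - r_t^d).
alpha_post = alpha0 + r_t
beta_post = beta0 + m_t - r_t
posterior_mean = alpha_post / (alpha_post + beta_post)

# Policy choice after the experiment: pick the treatment with the highest posterior mean outcome
# (ignoring treatment costs, which the welfare objective would subtract).
d_star = int(np.argmax(posterior_mean)) + 1
print(posterior_mean, d_star)
```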