Estimating risk

Maximilian Kasy
Department of Economics, Harvard University

May 4, 2018
Introduction

◮ Some of the topics about which I learned from Gary:
  ◮ The normal means model.
  ◮ Finite-sample risk and point estimation.
  ◮ Shrinkage and tuning.
  ◮ Random coefficients and empirical Bayes.
◮ This talk:
  ◮ Brief review of these topics.
  ◮ Building on that, some new results from my own work.
The normal means model

◮ $\theta, X \in \mathbb{R}^k$
◮ $X \sim N(\theta, \Sigma)$
◮ Estimator $\hat\theta(X)$ of $\theta$ ("almost differentiable")
◮ Mean squared error:
$$\mathrm{MSE}(\hat\theta, \theta) = \frac{1}{k}\, E_\theta\big[\|\hat\theta - \theta\|^2\big] = \frac{1}{k} \sum_j E_\theta\big[(\hat\theta_j - \theta_j)^2\big].$$
◮ Would like to estimate $\mathrm{MSE}(\hat\theta, \theta)$, to
  1. choose tuning parameters to minimize estimated MSE,
  2. choose between estimators to minimize estimated MSE,
  3. prove dominance results (as a theoretical tool).
◮ Key ingredient for machine learning!
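As a concrete illustration (my own sketch, not from the talk; all names and the shrinkage factor are illustrative), a minimal Monte Carlo approximation of the MSE in the normal means model with $\Sigma = I$, comparing the unregularized estimator $\hat\theta = X$ with a simple shrinkage rule:

```python
import numpy as np

rng = np.random.default_rng(0)
k, reps = 1000, 200
theta = rng.normal(0.0, 1.0, k)                 # fixed true means

def mse(estimator, theta, reps):
    """Monte Carlo approximation of MSE = (1/k) E ||theta_hat - theta||^2."""
    losses = []
    for _ in range(reps):
        X = theta + rng.normal(0.0, 1.0, k)     # X ~ N(theta, I)
        losses.append(np.mean((estimator(X) - theta) ** 2))
    return np.mean(losses)

print(mse(lambda x: x, theta, reps))            # unregularized: MSE = 1
print(mse(lambda x: 0.5 * x, theta, reps))      # shrinkage toward zero
```

In practice $\theta$ is unknown, so this MSE cannot be computed directly; estimating it is the subject of the rest of the talk.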
Roadmap

◮ Review:
  ◮ covariance penalties,
  ◮ Stein's Unbiased Risk Estimate (SURE),
  ◮ cross-validation (CV).
◮ Panel version of the (normal) means model:
  ◮ $X \in \mathbb{R}^k$ as sample mean of $n$ i.i.d. draws $Y_i$.
  ◮ ⇒ $n$-fold cross-validation.
◮ Two results that are new (I think):
  ◮ Large $n$ ⇒ CV approximates SURE.
  ◮ Large $k$ ⇒ CV and SURE converge to MSE and yield oracle-optimal tuning ("uniform loss consistency").
References

◮ Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.
◮ Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632.
◮ Abadie, A. and Kasy, M. (2018). Choosing among regularized estimators in empirical economics. Working paper.
◮ Fessler, P. and Kasy, M. (2018). How to use economic theory to improve estimators: Shrinking toward theoretical restrictions. Working paper.
◮ Kasy, M. and Mackey, L. (2018). Approximate cross-validation. Work in progress.
Covariance penalty

◮ Efron (2004): Adding and subtracting $\theta_j$ gives
$$(\hat\theta_j - X_j)^2 = (\hat\theta_j - \theta_j)^2 + 2(\hat\theta_j - \theta_j)(\theta_j - X_j) + (\theta_j - X_j)^2.$$
◮ Thus $\mathrm{MSE}(\hat\theta, \theta) = \frac{1}{k} \sum_j \mathrm{MSE}_j$, where
$$\mathrm{MSE}_j = E_\theta\big[(\hat\theta_j - \theta_j)^2\big] = E_\theta\big[(\hat\theta_j - X_j)^2\big] + 2\, E_\theta\big[(\hat\theta_j - \theta_j)(X_j - \theta_j)\big] - E_\theta\big[(X_j - \theta_j)^2\big]$$
$$= E_\theta\big[(\hat\theta_j - X_j)^2\big] + 2\,\mathrm{Cov}_\theta(\hat\theta_j, X_j) - \mathrm{Var}_\theta(X_j).$$
◮ First term: in-sample prediction error (observed).
◮ Second term: covariance penalty (depends on the unobserved $\theta$).
◮ Third term: irreducible prediction error, does not depend on $\hat\theta$.
Stein's Unbiased Risk Estimate

◮ Stein (1981): For the normal pdf $\varphi_\sigma$ with variance $\sigma^2$,
$$\varphi_\sigma'(x - \theta) = -\frac{x - \theta}{\sigma^2}\,\varphi_\sigma(x - \theta).$$
◮ Suppose for a moment that $\Sigma = \sigma^2 I$.
◮ Then, by partial integration,
$$\mathrm{Cov}_\theta(\hat\theta_j, X_j) = \int E_\theta[\hat\theta_j \mid X_j = x_j]\,(x_j - \theta_j)\,\varphi_\sigma(x_j - \theta_j)\,dx_j$$
$$= -\sigma^2 \int E_\theta[\hat\theta_j \mid X_j = x_j]\,\varphi_\sigma'(x_j - \theta_j)\,dx_j$$
$$= \sigma^2 \int \partial_{x_j} E_\theta[\hat\theta_j \mid X_j = x_j]\,\varphi_\sigma(x_j - \theta_j)\,dx_j = \sigma^2\, E_\theta\big[\partial_{X_j} \hat\theta_j\big].$$
◮ Thus
$$\mathrm{MSE} = \frac{1}{k}\sum_j \mathrm{MSE}_j = \frac{1}{k}\sum_j E_\theta\big[(\hat\theta_j - X_j)^2 + 2\sigma^2\,\partial_{X_j}\hat\theta_j - \sigma^2\big].$$
◮ For non-diagonal $\Sigma$, by a change of coordinates we get more generally
$$\mathrm{MSE} = \frac{1}{k}\, E_\theta\big[\|\hat\theta - X\|^2 + 2\,\mathrm{trace}(\hat\theta' \cdot \Sigma) - \mathrm{trace}(\Sigma)\big].$$
◮ All terms inside the expectation are observed! Sample version:
$$\mathrm{SURE} = \frac{1}{k}\big[\|\hat\theta - X\|^2 + 2\,\mathrm{trace}(\hat\theta' \cdot \Sigma) - \mathrm{trace}(\Sigma)\big].$$
◮ Key assumptions that we used:
  ◮ $X$ is normally distributed.
  ◮ $\Sigma$ is known.
  ◮ $\hat\theta$ is almost differentiable.
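A minimal sketch of the sample version (my own illustration, not code from the talk): for a generic almost-differentiable $\hat\theta$ and known $\Sigma$, the Jacobian trace $\mathrm{trace}(\hat\theta' \cdot \Sigma)$ can be approximated by finite differences; the estimator and all names are hypothetical:

```python
import numpy as np

def sure(theta_hat, X, Sigma, eps=1e-5):
    """SURE = (1/k) [ ||theta_hat(X) - X||^2
                      + 2 trace(Jacobian(theta_hat)(X) . Sigma)
                      - trace(Sigma) ],
    with the Jacobian approximated by central finite differences."""
    k = X.size
    J = np.empty((k, k))                 # J[l, j] = d theta_hat_l / d X_j
    for j in range(k):
        e = np.zeros(k)
        e[j] = eps
        J[:, j] = (theta_hat(X + e) - theta_hat(X - e)) / (2 * eps)
    return (np.sum((theta_hat(X) - X) ** 2)
            + 2.0 * np.trace(J @ Sigma)
            - np.trace(Sigma)) / k

rng = np.random.default_rng(1)
k = 200
theta = rng.normal(0.0, 2.0, k)
Sigma = np.eye(k)
X = theta + rng.normal(0.0, 1.0, k)

ridge = lambda x: x / (1.0 + 0.5)        # componentwise shrinkage, lambda = 0.5
print(sure(ridge, X, Sigma))             # estimate of the MSE
print(np.mean((ridge(X) - theta) ** 2))  # realized loss, for comparison
```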
Panel setting and cross-validation

◮ Assume a panel structure: $X$ is a sample average, with $i = 1, \ldots, n$ and $j = 1, \ldots, k$,
$$X = \frac{1}{n}\sum_i Y_i, \qquad Y_i \sim \text{i.i.d. }(\theta,\, n \cdot \Sigma).$$
◮ Leave-one-out mean and estimator:
$$X_{-i} = \frac{1}{n-1}\sum_{i' \neq i} Y_{i'}, \qquad \hat\theta_{-i} = \hat\theta(X_{-i}).$$
◮ $n$-fold cross-validation:
$$\mathrm{CV}_i = \|Y_i - \hat\theta_{-i}\|^2, \qquad \mathrm{CV} = \frac{1}{n}\sum_i \mathrm{CV}_i.$$
Large $n$: SURE ≈ CV

Proposition
Suppose $\hat\theta(\cdot)$ is continuously differentiable in a neighborhood of $\theta$, and suppose $X^n = \frac{1}{n}\sum_i Y_i^n$ with $(Y_i^n - \theta)/\sqrt{n}$ i.i.d. with expectation $0$ and variance $\Sigma$. Let $\hat\Sigma^n = \frac{1}{n^2}\sum_i (Y_i^n - X^n)(Y_i^n - X^n)'$. Then
$$\mathrm{CV}^n = \|X^n - \hat\theta^n\|^2 + 2\,\mathrm{trace}\big(\hat\theta' \cdot \hat\Sigma^n\big) + (n-1)\,\mathrm{trace}(\hat\Sigma^n) + o_p(1)$$
as $n \to \infty$.

◮ New result, I believe.
◮ "For large $n$, CV is the same as SURE, plus the irreducible forecasting error"
$$n \cdot \mathrm{trace}(\Sigma) = E_\theta\big[\|Y_i - \theta\|^2\big].$$
◮ Does not require normality or a known $\Sigma$!
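A numerical check of this proposition (my own sketch with hypothetical names): brute-force $n$-fold CV for componentwise ridge in the panel setting of the previous slide, compared with the right-hand side; for ridge, $\hat\theta' = I/(1+\lambda)$, so $\mathrm{trace}(\hat\theta' \cdot \hat\Sigma) = \mathrm{trace}(\hat\Sigma)/(1+\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, lam = 200, 50, 1.0
theta = rng.normal(0.0, 1.0, k)
Y = theta + rng.laplace(0.0, 1.0, (n, k))    # non-normal draws: normality not required
X = Y.mean(axis=0)
ridge = lambda x: x / (1.0 + lam)

# brute-force n-fold cross-validation
cv = 0.0
for i in range(n):
    X_minus_i = (n * X - Y[i]) / (n - 1)     # leave-one-out mean
    cv += np.sum((Y[i] - ridge(X_minus_i)) ** 2)
cv /= n

# right-hand side of the proposition
trace_Sigma_hat = np.sum((Y - X) ** 2) / n**2    # trace of (1/n^2) sum_i (Y_i - X)(Y_i - X)'
rhs = (np.sum((X - ridge(X)) ** 2)
       + 2.0 * trace_Sigma_hat / (1.0 + lam)     # trace(theta_hat' Sigma_hat) for ridge
       + (n - 1) * trace_Sigma_hat)

print(cv, rhs)   # close for large n
```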
Sketch of proof

◮ Let $s = \sqrt{n-1}$ and omit the superscript $n$. Write
$$U_i = \tfrac{1}{s}(Y_i - X), \qquad U_i \sim (0, \Sigma), \qquad Y_i = X + s U_i, \qquad X_{-i} = X - \tfrac{1}{s} U_i,$$
$$\hat\theta(X_{-i}) = \hat\theta(X) - \tfrac{1}{s}\,\hat\theta'(X) \cdot U_i + \Delta_i, \qquad \Delta_i = o\big(\tfrac{1}{s} U_i\big), \qquad \hat\Sigma = \tfrac{1}{n}\sum_i U_i U_i'.$$
◮ Then
$$\mathrm{CV}_i = \|Y_i - \hat\theta_{-i}\|^2 = \big\|X + s U_i - \big(\hat\theta - \tfrac{1}{s}\,\hat\theta'(X) \cdot U_i + \Delta_i\big)\big\|^2$$
$$= \|X - \hat\theta\|^2 + 2\big\langle U_i,\, \hat\theta'(X) \cdot U_i \big\rangle + s^2 \|U_i\|^2 + 2\big\langle X - \hat\theta,\, \big(s + \tfrac{1}{s}\hat\theta'\big) U_i \big\rangle + \tfrac{1}{s^2}\big\|\hat\theta'(X) \cdot U_i\big\|^2 + 2\big\langle \Delta_i,\, Y_i - \hat\theta_{-i} \big\rangle + \ldots$$
◮ Averaging over $i$ (the cross term vanishes exactly, since $\sum_i U_i = 0$):
$$\mathrm{CV} = \frac{1}{n}\sum_i \mathrm{CV}_i = \|X - \hat\theta\|^2 + 2\,\mathrm{trace}\big(\hat\theta' \cdot \hat\Sigma\big) + (n-1)\,\mathrm{trace}(\hat\Sigma) + 0 + o_p(1).$$
Large $k$: SURE, CV ≈ MSE

◮ Abadie and Kasy (2018): random effects (empirical Bayes) perspective:
$$(X_j, \theta_j) \sim \text{i.i.d. } \pi, \qquad E_\pi[X_j \mid \theta_j] = \theta_j.$$
◮ Unbiasedness of SURE and CV:
$$E_\theta[\mathrm{SURE}] = \mathrm{MSE}, \qquad E_\theta[\mathrm{CV}] = E_\theta[\mathrm{CV}_i] = \mathrm{MSE}^{n-1},$$
where $\mathrm{MSE}^{n-1}$ denotes the MSE of the estimator based on $n-1$ observations.
◮ Law of large numbers: for fixed $\pi$ and $n$,
$$\operatorname*{plim}_{k \to \infty} \big(\mathrm{SURE} - \mathrm{MSE}\big) = 0, \qquad \operatorname*{plim}_{k \to \infty} \big(\mathrm{CV} - \mathrm{MSE}^{n-1}\big) = 0.$$
◮ Questions:
  ◮ Does this hold uniformly over $\pi$?
  ◮ If so, does this yield oracle-optimal tuning parameters?
Componentwise estimators

◮ The answer requires more structure on the estimators. Assume $\hat\theta_j = m(X_j, \lambda)$. Examples (implemented in the sketch below):
  ◮ Ridge: $m_R(x, \lambda) = \frac{1}{1 + \lambda}\, x$.
  ◮ Lasso: $m_L(x, \lambda) = \mathbf{1}(x < -\lambda)(x + \lambda) + \mathbf{1}(x > \lambda)(x - \lambda)$.
◮ Denote
$$\mathrm{SE}(\lambda) = \frac{1}{k}\sum_{j=1}^k \big(m(X_j, \lambda) - \theta_j\big)^2 \quad \text{(squared-error loss)},$$
$$\mathrm{MSE}(\lambda) = E_\theta[\mathrm{SE}(\lambda)] \quad \text{(compound risk)},$$
$$\overline{\mathrm{MSE}}(\lambda) = E_\pi[\mathrm{MSE}(\lambda)] = E_\pi[\mathrm{SE}(\lambda)] \quad \text{(empirical Bayes risk)},$$
◮ and let $\widehat{\mathrm{MSE}}(\lambda)$ be an estimator of $\mathrm{MSE}$, e.g. SURE or CV.
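The following sketch (my own illustration) implements both rules and tunes the lasso by minimizing SURE over a grid; for the lasso, $\partial_x m_L(x, \lambda) = \mathbf{1}(|x| > \lambda)$, and $m_L$ is almost differentiable, so SURE applies:

```python
import numpy as np

def m_ridge(x, lam):
    return x / (1.0 + lam)

def m_lasso(x, lam):
    # soft-thresholding: 1(x < -lam)(x + lam) + 1(x > lam)(x - lam)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

d_ridge = lambda x, lam: np.full_like(x, 1.0 / (1.0 + lam))
d_lasso = lambda x, lam: (np.abs(x) > lam).astype(float)

def sure(x, sigma2, m, dm, lam):
    """SURE for theta_hat_j = m(x_j, lam) with Sigma = sigma2 * I."""
    k = x.size
    return (np.sum((m(x, lam) - x) ** 2)
            + 2.0 * sigma2 * np.sum(dm(x, lam))
            - k * sigma2) / k

rng = np.random.default_rng(3)
k, sigma2 = 5000, 1.0
theta = np.where(rng.uniform(size=k) < 0.1, rng.normal(0.0, 3.0, k), 0.0)  # sparse means
x = theta + rng.normal(0.0, 1.0, k)

grid = np.linspace(0.0, 3.0, 61)
lam_hat = grid[np.argmin([sure(x, sigma2, m_lasso, d_lasso, l) for l in grid])]
se_hat = np.mean((m_lasso(x, lam_hat) - theta) ** 2)               # SE at chosen lambda
se_opt = min(np.mean((m_lasso(x, l) - theta) ** 2) for l in grid)  # oracle SE on the grid
print(lam_hat, se_hat, se_opt)
```

For large $k$, the loss at the SURE-chosen $\hat\lambda$ is close to the oracle loss on the grid, illustrating the uniform loss consistency result on the next slides.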
Theorem (Uniform loss consistency)
Assume that, as $k \to \infty$,
$$\sup_{\pi \in Q} P_\pi\Big(\sup_{\lambda \in [0, \infty]} \big|\mathrm{SE}(\lambda) - \overline{\mathrm{MSE}}(\lambda)\big| > \varepsilon\Big) \to 0 \quad \forall\, \varepsilon > 0,$$
$$\sup_{\pi \in Q} P_\pi\Big(\sup_{\lambda \in [0, \infty]} \big|\widehat{\mathrm{MSE}}(\lambda) - \overline{\mathrm{MSE}}(\lambda) - v_\pi\big| > \varepsilon\Big) \to 0 \quad \forall\, \varepsilon > 0.$$
Then
$$\sup_{\pi \in Q} P_\pi\Big(\mathrm{SE}(\hat\lambda) - \inf_{\lambda \in [0, \infty]} \mathrm{SE}(\lambda) > \varepsilon\Big) \to 0 \quad \forall\, \varepsilon > 0,$$
where $\hat\lambda \in \operatorname{argmin}_{\lambda \in [0, \infty]} \widehat{\mathrm{MSE}}(\lambda)$.
Theorem (Uniform convergence)
Suppose that $\sup_{\pi \in Q} E_\pi[X^4] < \infty$. Under some conditions on $m$ (satisfied for ridge and lasso), the assumptions of the previous theorem are satisfied.

Remarks:
◮ Extension of the Glivenko–Cantelli theorem.
◮ Need conditions on $m$ to get uniformity over $\lambda$.
◮ Only need (and get) uniform convergence of $\widehat{\mathrm{MSE}} - \overline{\mathrm{MSE}} - v_\pi$ to $0$, for some constant $v_\pi$.
◮ For CV, we get uniform loss consistency relative to the estimator using the $\lambda$ that is optimal for $\mathrm{SE}^{n-1}$ (thus shrinking a bit too much for small $n$).
◮ Here $n \approx$ sample size / number of parameters.
Outlook and work in progress

1. Approximate CV using a first-order approximation to the leave-one-out estimator, in penalized M-estimator settings:
$$\hat\beta_{-i}(\lambda) - \hat\beta(\lambda) \approx \Big[\sum_j m_{bb}\big(X_j, \hat\beta(\lambda)\big) + \pi_{bb}\big(\hat\beta(\lambda), \lambda\big)\Big]^{-1} \cdot m_b\big(X_i, \hat\beta(\lambda)\big).$$
  ◮ Fast alternative to CV for tuning of neural nets, etc.
  ◮ Additional acceleration by calculating this only for a subset of $i$, $j$.
  ◮ A sketch for the special case of ridge regression follows below.
2. Risk reductions for shrinkage toward inequality restrictions.
  ◮ Relevant for many restrictions implied by economic theory.
  ◮ Proving uniform dominance using SURE, extending James–Stein.
  ◮ Open question: a smooth choice of "degrees of freedom" that is not too conservative.
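A minimal sketch of the first-order approximation in item 1 for the special case of ridge regression (my own illustration, not the paper's code): with squared-error loss, $m_b(X_i, \beta) = -x_i(y_i - x_i'\beta)$ and the bracketed matrix is $X'X + \lambda I$, so one Newton step from the full-sample fit approximates each leave-one-out fit:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 200, 10, 5.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# full-sample ridge fit: argmin_b 0.5 sum_i (y_i - x_i'b)^2 + 0.5 lam ||b||^2
H = X.T @ X + lam * np.eye(p)       # Hessian: sum_j m_bb + pi_bb
H_inv = np.linalg.inv(H)
beta_hat = H_inv @ (X.T @ y)

cv_exact = cv_approx = 0.0
for i in range(n):
    xi, yi = X[i], y[i]
    # exact leave-one-out fit
    b_exact = np.linalg.solve(H - np.outer(xi, xi), X.T @ y - xi * yi)
    # one Newton step: beta_{-i} - beta_hat ~ H^{-1} m_b(X_i, beta_hat)
    b_approx = beta_hat - (yi - xi @ beta_hat) * (H_inv @ xi)
    cv_exact += (yi - xi @ b_exact) ** 2
    cv_approx += (yi - xi @ b_approx) ** 2

print(cv_exact / n, cv_approx / n)  # approximate CV tracks exact CV closely
```

Precomputing $H^{-1}$ once makes each approximate leave-one-out step a single matrix-vector product, which is what makes this attractive for expensive models.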
Thank you!