Econ 2148, Fall 2019: Shrinkage in the Normal means model




1. Shrinkage: Econ 2148, Fall 2019. Shrinkage in the Normal means model. Maximilian Kasy, Department of Economics, Harvard University.

2. Shrinkage — Agenda
   ◮ Setup: the Normal means model $X \sim N(\theta, I_k)$ and the canonical estimation problem with loss $\|\hat\theta - \theta\|^2$.
   ◮ The James-Stein (JS) shrinkage estimator.
   ◮ Three ways to arrive at the JS estimator (almost):
     1. Reverse regression of $\theta_i$ on $X_i$.
     2. Empirical Bayes: random effects model for $\theta_i$.
     3. Shrinkage factor minimizing Stein's Unbiased Risk Estimate.
   ◮ Proof that JS uniformly dominates $X$ as estimator of $\theta$.
   ◮ The Normal means model as asymptotic approximation.

3. Shrinkage — Takeaways for this part of class
   ◮ Shrinkage estimators trade off variance and bias.
   ◮ In multi-dimensional problems, we can estimate the optimal degree of shrinkage.
   ◮ Three intuitions that lead to the JS estimator:
     1. Predict $\theta_i$ given $X_i$ ⇒ reverse regression.
     2. Estimate the distribution of the $\theta_i$ ⇒ empirical Bayes.
     3. Find the shrinkage factor that minimizes estimated risk.
   ◮ Some calculus allows us to derive the risk of JS shrinkage ⇒ better than the MLE, no matter what the true $\theta$ is.
   ◮ The Normal means model is more general than it seems: it is a large-sample approximation to any parametric estimation problem.

4. Shrinkage — The Normal means model
   Setup
   ◮ $\theta \in \mathbb{R}^k$
   ◮ $\varepsilon \sim N(0, I_k)$
   ◮ $X = \theta + \varepsilon \sim N(\theta, I_k)$
   ◮ Estimator: $\hat\theta = \hat\theta(X)$
   ◮ Loss: squared error
     $L(\hat\theta, \theta) = \sum_i (\hat\theta_i - \theta_i)^2$
   ◮ Risk: mean squared error
     $R(\hat\theta, \theta) = E_\theta\big[L(\hat\theta, \theta)\big] = \sum_i E_\theta\big[(\hat\theta_i - \theta_i)^2\big]$.
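A minimal Monte Carlo sketch of this setup (not from the slides): it simulates $X \sim N(\theta, I_k)$ and approximates the risk of an estimator by averaging the loss over replications. The choices $k = 50$, the seed, and the $N(0, 2^2)$ draw for the true $\theta$ are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50
theta = rng.normal(0.0, 2.0, size=k)  # arbitrary true mean vector, held fixed across replications

def loss(theta_hat, theta):
    """Squared-error loss L(theta_hat, theta) = sum_i (theta_hat_i - theta_i)^2."""
    return np.sum((theta_hat - theta) ** 2)

def risk(estimator, theta, n_rep=10_000):
    """Monte Carlo approximation of the risk E_theta[ L(theta_hat(X), theta) ]."""
    losses = []
    for _ in range(n_rep):
        X = theta + rng.standard_normal(k)  # X = theta + eps, eps ~ N(0, I_k)
        losses.append(loss(estimator(X), theta))
    return np.mean(losses)

print(risk(lambda X: X, theta))  # risk of the MLE theta_hat = X, close to k = 50
```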

5. Shrinkage — The Normal means model
   Two estimators
   ◮ Canonical estimator: maximum likelihood, $\hat\theta^{ML} = X$.
   ◮ Risk function
     $R(\hat\theta^{ML}, \theta) = \sum_i E_\theta[\varepsilon_i^2] = k$.
   ◮ James-Stein shrinkage estimator
     $\hat\theta^{JS} = \left(1 - \frac{(k-2)/k}{\overline{X^2}}\right) \cdot X$, where $\overline{X^2} = \frac{1}{k}\sum_i X_i^2$.
   ◮ Celebrated result: uniform risk dominance; for all $\theta$,
     $R(\hat\theta^{JS}, \theta) < R(\hat\theta^{ML}, \theta) = k$.
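A minimal simulation sketch (not from the slides) comparing the two estimators at one arbitrary true $\theta$; the seed, $k = 50$, and the $N(0,1)$ draw for $\theta$ are illustrative assumptions. Note that $\frac{(k-2)/k}{\overline{X^2}} = \frac{k-2}{\sum_i X_i^2}$, which is the form used in the code.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 50
theta = rng.normal(0.0, 1.0, size=k)  # any fixed true mean vector

def theta_js(X):
    """James-Stein: shrink X toward zero, theta_hat = (1 - (k-2)/sum(X^2)) * X."""
    return (1.0 - (X.size - 2) / np.sum(X ** 2)) * X

n_rep = 20_000
loss_ml = np.zeros(n_rep)
loss_js = np.zeros(n_rep)
for r in range(n_rep):
    X = theta + rng.standard_normal(k)              # X ~ N(theta, I_k)
    loss_ml[r] = np.sum((X - theta) ** 2)           # MLE: theta_hat = X
    loss_js[r] = np.sum((theta_js(X) - theta) ** 2)

print(loss_ml.mean(), loss_js.mean())  # roughly k = 50 vs. a strictly smaller number
```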

6. Shrinkage — Regression perspective
   First motivation of JS: Regression perspective
   ◮ We will discuss three ways to motivate the JS estimator (up to a degrees-of-freedom correction).
   ◮ Consider estimators of the form $\hat\theta_i = c \cdot X_i$ or $\hat\theta_i = a + b \cdot X_i$.
   ◮ How to choose $c$ or $(a, b)$?
   ◮ Two particular possibilities:
     1. Maximum likelihood: $c = 1$.
     2. James-Stein: $c = 1 - \frac{(k-2)/k}{\overline{X^2}}$.

7. Shrinkage — Regression perspective
   Practice problem (Infeasible estimator)
   ◮ Suppose you knew $X_1, \ldots, X_k$ as well as $\theta_1, \ldots, \theta_k$,
   ◮ but were constrained to use an estimator of the form $\hat\theta_i = c \cdot X_i$.
   1. Find the value of $c$ that minimizes loss.
   2. For estimators of the form $\hat\theta_i = a + b \cdot X_i$, find the values of $a$ and $b$ that minimize loss.

8. Shrinkage — Regression perspective
   Solution
   ◮ First problem:
     $c^* = \operatorname{argmin}_c \sum_i (c \cdot X_i - \theta_i)^2$
   ◮ Least squares problem!
   ◮ First order condition:
     $0 = \sum_i (c^* \cdot X_i - \theta_i) \cdot X_i$.
   ◮ Solution:
     $c^* = \frac{\sum_i X_i \theta_i}{\sum_i X_i^2}$.
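A short sketch (not from the slides) of this infeasible "oracle" shrinkage factor on one simulated draw; the seed, $k$, and the distribution of $\theta$ are arbitrary. Since $c = 1$ is in the feasible class, the oracle loss is never larger than the ML loss.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 50
theta = rng.normal(0.0, 1.0, size=k)
X = theta + rng.standard_normal(k)

c_star = np.sum(X * theta) / np.sum(X ** 2)   # infeasible: uses the unknown theta

loss_ml = np.sum((X - theta) ** 2)            # c = 1
loss_oracle = np.sum((c_star * X - theta) ** 2)
print(c_star, loss_ml, loss_oracle)           # c* < 1, oracle loss <= ML loss
```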

9. Shrinkage — Regression perspective
   Solution continued
   ◮ Second problem:
     $(a^*, b^*) = \operatorname{argmin}_{a,b} \sum_i (a + b \cdot X_i - \theta_i)^2$
   ◮ Least squares problem again!
   ◮ First order conditions:
     $0 = \sum_i (a^* + b^* \cdot X_i - \theta_i)$
     $0 = \sum_i (a^* + b^* \cdot X_i - \theta_i) \cdot X_i$.
   ◮ Solution:
     $b^* = \frac{\sum_i (X_i - \bar X)(\theta_i - \bar\theta)}{\sum_i (X_i - \bar X)^2} = \frac{s_{X\theta}}{s_X^2}$, and $a^* + b^* \cdot \bar X = \bar\theta$.
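The infeasible $(a^*, b^*)$ is simply the least-squares fit of $\theta$ on $X$. A small sketch (not from the slides), with the same kind of arbitrary simulated draw as above:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 50
theta = rng.normal(0.0, 1.0, size=k)
X = theta + rng.standard_normal(k)

# OLS of theta on X: slope b*, intercept chosen so that a* + b* * mean(X) = mean(theta)
b_star = np.sum((X - X.mean()) * (theta - theta.mean())) / np.sum((X - X.mean()) ** 2)
a_star = theta.mean() - b_star * X.mean()

loss_affine = np.sum((a_star + b_star * X - theta) ** 2)
print(a_star, b_star, loss_affine)  # b* is roughly s_theta^2 / (s_theta^2 + 1)
```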

10. Shrinkage — Regression perspective
    Regression and reverse regression
    ◮ Recall $X_i = \theta_i + \varepsilon_i$, $E[\varepsilon_i \mid \theta_i] = 0$, $\operatorname{Var}(\varepsilon_i) = 1$.
    ◮ Regression of $X$ on $\theta$: slope
      $\frac{s_{X\theta}}{s_\theta^2} = 1 + \frac{s_{\varepsilon\theta}}{s_\theta^2} \approx 1$.
    ◮ For optimal shrinkage, we want to predict $\theta$ given $X$, not the other way around!
    ◮ Reverse regression of $\theta$ on $X$: slope
      $\frac{s_{X\theta}}{s_X^2} = \frac{s_\theta^2 + s_{\varepsilon\theta}}{s_\theta^2 + 2 s_{\varepsilon\theta} + s_\varepsilon^2} \approx \frac{s_\theta^2}{s_\theta^2 + 1}$.
    ◮ Interpretation: "signal to (signal plus noise) ratio" < 1.
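A quick numerical check of the two slopes (not from the slides): with a large $k$ the sample covariances settle down, the forward slope is close to 1, and the reverse slope is close to $s_\theta^2 / (s_\theta^2 + 1)$. The value $s_\theta^2 = 4$ is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 100_000
s2_theta = 4.0
theta = rng.normal(0.0, np.sqrt(s2_theta), size=k)
X = theta + rng.standard_normal(k)

slope_forward = np.cov(X, theta)[0, 1] / np.var(theta)  # s_{X theta} / s_theta^2
slope_reverse = np.cov(X, theta)[0, 1] / np.var(X)      # s_{X theta} / s_X^2

print(slope_forward)                              # about 1
print(slope_reverse, s2_theta / (s2_theta + 1))   # both about 0.8
```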

11. Shrinkage — Regression perspective
    Illustration
    [Figure reproduced from S. M. Stigler's discussion of Stein estimation as a regression problem: "FIG. 1. Hypothetical plot of $(X_i, \theta_i)$, $i = 1, \ldots, k$." The 45° line $\theta = X$ is the regression of $X$ on $\theta$ and corresponds to the ordinary estimator $\hat\theta_i = X_i$; the regression of $\theta$ on $X$ has a flatter slope, which suggests the ordinary estimator is based on the "wrong" regression and can be improved upon by shrinking.]
