Econ 2148, Fall 2017
Shrinkage in the Normal means model
Maximilian Kasy, Department of Economics, Harvard University
Agenda
◮ Setup: the Normal means model $X \sim N(\theta, I_k)$ and the canonical estimation problem with loss $\|\hat{\theta} - \theta\|^2$.
◮ The James–Stein (JS) shrinkage estimator.
◮ Three ways to arrive at the JS estimator (almost):
  1. Reverse regression of $\theta_i$ on $X_i$.
  2. Empirical Bayes: random effects model for $\theta_i$.
  3. Shrinkage factor minimizing Stein's Unbiased Risk Estimate.
◮ Proof that JS uniformly dominates $X$ as an estimator of $\theta$.
◮ The Normal means model as an asymptotic approximation.
Takeaways for this part of class
◮ Shrinkage estimators trade off variance and bias.
◮ In multi-dimensional problems, we can estimate the optimal degree of shrinkage.
◮ Three intuitions that lead to the JS estimator:
  1. Predict $\theta_i$ given $X_i$ ⇒ reverse regression.
  2. Estimate the distribution of the $\theta_i$ ⇒ empirical Bayes.
  3. Find the shrinkage factor that minimizes estimated risk.
◮ Some calculus allows us to derive the risk of JS shrinkage ⇒ better than the MLE, no matter what the true $\theta$ is.
◮ The Normal means model is more general than it seems: it is a large-sample approximation to any parametric estimation problem.
The Normal means model
Setup
◮ $\theta \in \mathbb{R}^k$
◮ $\varepsilon \sim N(0, I_k)$
◮ $X = \theta + \varepsilon \sim N(\theta, I_k)$
◮ Estimator: $\hat{\theta} = \hat{\theta}(X)$
◮ Loss: squared error
  $L(\hat{\theta}, \theta) = \sum_i (\hat{\theta}_i - \theta_i)^2$
◮ Risk: mean squared error
  $R(\hat{\theta}, \theta) = E_\theta\big[L(\hat{\theta}, \theta)\big] = \sum_i E_\theta\big[(\hat{\theta}_i - \theta_i)^2\big]$.
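To make the setup concrete, here is a minimal simulation sketch (not part of the original slides; the values of $k$ and $\theta$ are arbitrary illustrative choices). It draws one realization of $X = \theta + \varepsilon$ and evaluates the squared-error loss of the candidate estimator $\hat{\theta} = X$.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 50
theta = rng.normal(loc=2.0, scale=1.0, size=k)  # arbitrary true means, for illustration only

# Normal means model: X_i ~ N(theta_i, 1), independent across i
X = theta + rng.standard_normal(k)

def loss(theta_hat, theta):
    """Squared-error loss L(theta_hat, theta) = sum_i (theta_hat_i - theta_i)^2."""
    return np.sum((theta_hat - theta) ** 2)

print("loss of theta_hat = X:", loss(X, theta))  # its expectation (the risk) equals k
```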
Two estimators
◮ Canonical estimator: maximum likelihood,
  $\hat{\theta}^{ML} = X$.
◮ Risk function:
  $R(\hat{\theta}^{ML}, \theta) = \sum_i E_\theta[\varepsilon_i^2] = k$.
◮ James–Stein shrinkage estimator:
  $\hat{\theta}^{JS} = \left(1 - \frac{(k-2)/k}{\overline{X^2}}\right) \cdot X$, where $\overline{X^2} = \frac{1}{k}\sum_i X_i^2$, so the shrinkage factor equals $1 - \frac{k-2}{\sum_i X_i^2}$.
◮ Celebrated result: uniform risk dominance; for all $\theta$,
  $R(\hat{\theta}^{JS}, \theta) < R(\hat{\theta}^{ML}, \theta) = k$.
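A small Monte Carlo sketch (not from the slides; the particular $\theta$ and simulation size are arbitrary) approximates both risk functions at one parameter value. The JS estimator is implemented via the equivalent shrinkage factor $1 - (k-2)/\sum_i X_i^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def james_stein(X):
    """James-Stein estimator: shrink X toward zero by the factor 1 - (k-2)/sum(X_i^2)."""
    k = X.size
    return (1.0 - (k - 2) / np.sum(X ** 2)) * X

k = 50
theta = np.full(k, 1.0)   # illustrative true parameter; dominance holds for any theta
n_sim = 10_000

sum_ml, sum_js = 0.0, 0.0
for _ in range(n_sim):
    X = theta + rng.standard_normal(k)
    sum_ml += np.sum((X - theta) ** 2)
    sum_js += np.sum((james_stein(X) - theta) ** 2)

print("estimated risk of ML:", sum_ml / n_sim)  # close to k = 50
print("estimated risk of JS:", sum_js / n_sim)  # strictly below k
```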
First motivation of JS: Regression perspective
◮ We will discuss three ways to motivate the JS estimator (up to degrees-of-freedom corrections).
◮ Consider estimators of the form
  $\hat{\theta}_i = c \cdot X_i$  or  $\hat{\theta}_i = a + b \cdot X_i$.
◮ How should we choose $c$ or $(a, b)$?
◮ Two particular possibilities:
  1. Maximum likelihood: $c = 1$.
  2. James–Stein: $c = 1 - \frac{(k-2)/k}{\overline{X^2}}$.
Practice problem (Infeasible estimator)
◮ Suppose you knew $X_1, \ldots, X_k$ as well as $\theta_1, \ldots, \theta_k$,
◮ but were constrained to use an estimator of the form $\hat{\theta}_i = c \cdot X_i$.
1. Find the value of $c$ that minimizes loss.
2. For estimators of the form $\hat{\theta}_i = a + b \cdot X_i$, find the values of $a$ and $b$ that minimize loss.
Solution
◮ First problem:
  $c^* = \operatorname{argmin}_c \sum_i (c \cdot X_i - \theta_i)^2$.
◮ Least squares problem!
◮ First-order condition:
  $0 = \sum_i (c^* \cdot X_i - \theta_i) \cdot X_i$.
◮ Solution:
  $c^* = \frac{\sum_i X_i \theta_i}{\sum_i X_i^2}$.
Solution continued
◮ Second problem:
  $(a^*, b^*) = \operatorname{argmin}_{a,b} \sum_i (a + b \cdot X_i - \theta_i)^2$.
◮ Least squares problem again!
◮ First-order conditions:
  $0 = \sum_i (a^* + b^* \cdot X_i - \theta_i)$,
  $0 = \sum_i (a^* + b^* \cdot X_i - \theta_i) \cdot X_i$.
◮ Solution:
  $b^* = \frac{\sum_i (X_i - \bar{X})(\theta_i - \bar{\theta})}{\sum_i (X_i - \bar{X})^2} = \frac{s_{X\theta}}{s_X^2}$,  $a^* + b^* \cdot \bar{X} = \bar{\theta}$.
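These infeasible least-squares solutions can be checked numerically. The sketch below (illustrative, not from the slides; the distribution used to generate $\theta$ is an arbitrary choice) computes the oracle coefficients $c^*$ and $(a^*, b^*)$ for one simulated draw and compares $c^*$ with the feasible James–Stein shrinkage factor.

```python
import numpy as np

rng = np.random.default_rng(2)

k = 1_000
theta = rng.normal(loc=0.0, scale=2.0, size=k)  # illustrative spread of true means
X = theta + rng.standard_normal(k)

# Oracle c* for theta_hat_i = c * X_i: least squares of theta on X through the origin
c_star = np.sum(X * theta) / np.sum(X ** 2)

# Oracle (a*, b*) for theta_hat_i = a + b * X_i: ordinary least squares of theta on X
b_star = np.cov(X, theta, bias=True)[0, 1] / np.var(X)
a_star = theta.mean() - b_star * X.mean()

# Feasible James-Stein shrinkage factor, for comparison
c_js = 1.0 - (k - 2) / np.sum(X ** 2)

print("oracle c*      :", c_star)
print("James-Stein c  :", c_js)    # close to c* here, since theta is centered near zero
print("oracle (a*, b*):", (a_star, b_star))
```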
Regression and reverse regression
◮ Recall $X_i = \theta_i + \varepsilon_i$, $E[\varepsilon_i \mid \theta_i] = 0$, $\operatorname{Var}(\varepsilon_i) = 1$.
◮ Regression of $X$ on $\theta$: slope
  $\frac{s_{X\theta}}{s_\theta^2} = 1 + \frac{s_{\varepsilon\theta}}{s_\theta^2} \approx 1$.
◮ For optimal shrinkage, we want to predict $\theta$ given $X$, not the other way around!
◮ Reverse regression of $\theta$ on $X$: slope
  $\frac{s_{X\theta}}{s_X^2} = \frac{s_\theta^2 + s_{\varepsilon\theta}}{s_\theta^2 + 2 s_{\varepsilon\theta} + s_\varepsilon^2} \approx \frac{s_\theta^2}{s_\theta^2 + 1}$.
◮ Interpretation: a "signal to (signal plus noise)" ratio, which is less than 1.
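The contrast between the two slopes is easy to see in simulation. In the sketch below (illustrative values, not from the slides), the forward regression of $X$ on $\theta$ has slope close to 1, while the reverse regression of $\theta$ on $X$ has slope close to $s_\theta^2 / (s_\theta^2 + 1)$.

```python
import numpy as np

rng = np.random.default_rng(3)

k = 100_000                        # large k, so sample moments are close to their limits
theta = rng.normal(0.0, 1.5, k)    # illustrative: s_theta^2 is about 2.25
X = theta + rng.standard_normal(k)

s_Xtheta = np.cov(X, theta, bias=True)[0, 1]

slope_forward = s_Xtheta / np.var(theta)  # regression of X on theta: about 1
slope_reverse = s_Xtheta / np.var(X)      # regression of theta on X: shrunk toward 0

print("forward slope (X on theta):", slope_forward)
print("reverse slope (theta on X):", slope_reverse)
print("signal/(signal + noise)   :", np.var(theta) / (np.var(theta) + 1.0))
```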
Illustration
[Figure: scanned page 148 of S. M. Stigler, "A Galtonian Perspective on Shrinkage Estimators" (Statistical Science, 1990), including Figure 1, a hypothetical plot of $\theta_i$ versus $X_i$, $i = 1, \ldots, k$, with the points clustering around the 45° line $\theta = X$. Stigler frames Stein estimation as a regression problem: the ordinary estimator $\hat{\theta}_i = X_i$ corresponds to the "wrong" regression line, $E[X \mid \theta]$, whereas predicting $\theta$ calls for the regression of $\theta$ on $X$, which can be markedly different.]