High Dimensional Predictive Inference
Workshop on Current Trends and Challenges in Model Selection and Related Areas
Vienna, Austria, July 2008
Ed George, The Wharton School
(joint work with L. Brown, F. Liang, and X. Xu)
1. Estimating a Normal Mean: A Brief History

• Observe $X \mid \mu \sim N_p(\mu, I)$ and estimate $\mu$ by $\hat\mu$ under quadratic risk
  $$R_Q(\mu, \hat\mu) = E_\mu \|\hat\mu(X) - \mu\|^2$$
• $\hat\mu_{MLE}(X) = X$ is the MLE, best invariant and minimax with constant risk
• Shocking Fact: $\hat\mu_{MLE}$ is inadmissible when $p \ge 3$ (Stein 1956); see the simulation sketch below
• Bayes rules are a good place to look for improvements
• For a prior $\pi(\mu)$, the Bayes rule $\hat\mu_\pi(X) = E_\pi(\mu \mid X)$ minimizes $E_\pi R_Q(\mu, \hat\mu)$
• Remark: The (formal) Bayes rule under $\pi_U(\mu) \equiv 1$ is $\hat\mu_U(X) \equiv \hat\mu_{MLE}(X) = X$
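A minimal Monte Carlo sketch (not part of the original slides; the dimension, the James-Stein shrinkage rule, and the grid of $\|\mu\|$ values are my own choices) illustrating the Shocking Fact: a shrinkage estimator can have uniformly smaller quadratic risk than the MLE when $p \ge 3$.

```python
# Monte Carlo comparison of quadratic risk: MLE vs. James-Stein shrinkage.
import numpy as np

rng = np.random.default_rng(0)
p, n_rep = 10, 20000

for norm_mu in [0.0, 2.0, 5.0, 10.0]:
    mu = np.zeros(p)
    mu[0] = norm_mu                                    # risk depends on mu only through ||mu||
    X = rng.normal(loc=mu, scale=1.0, size=(n_rep, p))
    sq_norm = np.sum(X**2, axis=1)
    mu_js = (1.0 - (p - 2) / sq_norm)[:, None] * X     # James-Stein estimator
    risk_mle = np.mean(np.sum((X - mu)**2, axis=1))    # should be about p
    risk_js = np.mean(np.sum((mu_js - mu)**2, axis=1))
    print(f"||mu|| = {norm_mu:4.1f}   R_Q(MLE) = {risk_mle:5.2f}   R_Q(JS) = {risk_js:5.2f}")
```

The MLE risk stays near $p$ everywhere, while the shrinkage risk is strictly smaller and approaches $p$ only as $\|\mu\|$ grows.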
• $\hat\mu_H(X)$, the Bayes rule under the harmonic prior $\pi_H(\mu) = \|\mu\|^{-(p-2)}$, dominates $\hat\mu_U$ when $p \ge 3$ (Stein 1974)
• $\hat\mu_a(X)$, the Bayes rule under $\pi_a(\mu)$ where
  $$\mu \mid s \sim N_p(0, sI), \qquad s \sim (1+s)^{a-2},$$
  dominates $\hat\mu_U$ and is proper Bayes when $p = 5$ and $a \in [0.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$ (Strawderman 1971)
• A Unifying Phenomenon: These domination results can be attributed to properties of the marginal distribution of $X$ under $\pi_H$ and $\pi_a$
• The Bayes rule under $\pi(\mu)$ can be expressed as
  $$\hat\mu_\pi(X) = E_\pi(\mu \mid X) = X + \nabla \log m_\pi(X),$$
  where
  $$m_\pi(X) \propto \int e^{-\|X - \mu\|^2/2}\, \pi(\mu)\, d\mu$$
  is the marginal of $X$ under $\pi(\mu)$, and $\nabla = (\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_p})'$ (Brown 1971); this representation is checked numerically below
• The risk improvement of $\hat\mu_\pi(X)$ over $\hat\mu_U(X)$ can be expressed as
  $$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = E_\mu\!\left[\|\nabla \log m_\pi(X)\|^2 - 2\,\frac{\nabla^2 m_\pi(X)}{m_\pi(X)}\right] = E_\mu\!\left[-4\,\frac{\nabla^2 \sqrt{m_\pi(X)}}{\sqrt{m_\pi(X)}}\right],$$
  where $\nabla^2 = \sum_i \frac{\partial^2}{\partial x_i^2}$ (Stein 1974, 1981)
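A small numerical check of Brown's representation (my own sketch, not from the talk). It assumes a conjugate $N_p(0, \tau^2 I)$ prior, chosen only because both the marginal $m_\pi$ and the posterior mean are then available in closed form; the point $x$ and the value of $\tau^2$ are arbitrary.

```python
# Check that X + grad log m_pi(X) reproduces the posterior mean under a conjugate prior.
import numpy as np

p, tau2 = 3, 4.0
x = np.array([1.0, -2.0, 0.5])

def log_marginal(z):
    # under the N_p(0, tau2 I) prior, m_pi(z) is the N_p(0, (1 + tau2) I) density
    return -0.5 * np.sum(z**2) / (1.0 + tau2)   # up to an additive constant

# numerical gradient of log m_pi at x
eps = 1e-6
grad = np.array([(log_marginal(x + eps * np.eye(p)[i]) -
                  log_marginal(x - eps * np.eye(p)[i])) / (2 * eps) for i in range(p)])

brown = x + grad                               # Brown's representation X + grad log m_pi(X)
posterior_mean = tau2 / (1.0 + tau2) * x       # exact E(mu | x) under this prior
print(brown)
print(posterior_mean)                          # the two vectors should agree
```

Any prior with a tractable marginal could be substituted; the representation itself does not depend on this particular choice.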
• That $\hat\mu_H(X)$ dominates $\hat\mu_U$ when $p \ge 3$ follows from the fact that the marginal $m_\pi(X)$ under $\pi_H$ is superharmonic, i.e.
  $$\nabla^2 m_\pi(X) \le 0$$
• That $\hat\mu_a(X)$ dominates $\hat\mu_U$ when $p \ge 5$ (and conditions on $a$) follows from the fact that the square root of the marginal under $\pi_a$ is superharmonic, i.e.
  $$\nabla^2 \sqrt{m_\pi(X)} \le 0$$
  (Fourdrinier, Strawderman and Wells 1998)
2. The Prediction Problem

• Observe $X \mid \mu \sim N_p(\mu, v_x I)$ and predict $Y \mid \mu \sim N_p(\mu, v_y I)$
  – Given $\mu$, $Y$ is independent of $X$
  – $v_x$ and $v_y$ are known (for now)
• The Problem: To estimate $p(y \mid \mu)$ by $q(y \mid x)$
• Measure closeness by Kullback-Leibler loss,
  $$L(\mu, q(y \mid x)) = \int p(y \mid \mu) \log \frac{p(y \mid \mu)}{q(y \mid x)}\, dy$$
  (a closed-form evaluation for Gaussian $q$ is sketched below)
• Risk function
  $$R_{KL}(\mu, q) = \int L(\mu, q(y \mid x))\, p(x \mid \mu)\, dx = E_\mu\big[L(\mu, q(y \mid X))\big]$$
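When the candidate $q(y \mid x)$ is itself a spherical normal $N_p(m, sI)$, the KL loss reduces to the familiar closed form for the divergence between two Gaussians. A minimal helper (my own sketch; the specific $\mu$, $m$, $s$ values are arbitrary illustrations):

```python
# Closed-form KL loss L(mu, q) when q is a spherical normal N_p(m, s I).
import numpy as np

def kl_loss(mu, v_y, m, s):
    """L(mu, q) = KL( N_p(mu, v_y I) || N_p(m, s I) )."""
    p = len(mu)
    return 0.5 * (p * np.log(s / v_y) + p * v_y / s
                  + np.sum((np.asarray(m) - np.asarray(mu))**2) / s - p)

mu = np.array([1.0, 2.0, 0.0])
print(kl_loss(mu, v_y=1.0, m=mu, s=1.0))        # 0.0: q matches p(y | mu) exactly
print(kl_loss(mu, v_y=1.0, m=mu + 0.5, s=2.0))  # positive loss otherwise
```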
3. Bayes Rules for the Prediction Problem

• For a prior $\pi(\mu)$, the Bayes rule
  $$p_\pi(y \mid x) = \int p(y \mid \mu)\, \pi(\mu \mid x)\, d\mu = E_\pi[\,p(y \mid \mu) \mid X\,]$$
  minimizes $\int R_{KL}(\mu, q)\, \pi(\mu)\, d\mu$ (Aitchison 1975)
• Let $p_U(y \mid x)$ denote the Bayes rule under $\pi_U(\mu) \equiv 1$
• $p_U(y \mid x)$ dominates $p(y \mid \hat\mu = x)$, the naive "plug-in" predictive distribution (Aitchison 1975); see the numerical comparison below
• $p_U(y \mid x)$ is best invariant and minimax with constant risk (Murray 1977, Ng 1980, Barron and Liang 2003)
• Shocking Fact: $p_U(y \mid x)$ is inadmissible when $p \ge 3$
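A sketch of the Aitchison comparison using standard facts not spelled out on the slide: under $\pi_U$ the predictive is $p_U(y \mid x) = N_p(x, (v_x + v_y)I)$, while the plug-in rule uses $N_p(x, v_y I)$. Both KL risks have closed forms, $\tfrac{p}{2}\log(1 + v_x/v_y)$ and $\tfrac{p\, v_x}{2 v_y}$ respectively, and a short Monte Carlo over $X$ confirms the domination; the dimension and variances below are arbitrary choices.

```python
# KL risk of the plug-in rule vs. the uniform-prior Bayes rule p_U.
import numpy as np

def kl_normal(mu0, v0, mu1, v1, p):
    # KL( N_p(mu0, v0 I) || N_p(mu1, v1 I) )
    return 0.5 * (p * np.log(v1 / v0) + p * v0 / v1
                  + np.sum((mu1 - mu0)**2) / v1 - p)

rng = np.random.default_rng(1)
p, v_x, v_y, n_rep = 5, 1.0, 1.0, 20000
mu = np.zeros(p)
X = rng.normal(mu, np.sqrt(v_x), size=(n_rep, p))

risk_plugin = np.mean([kl_normal(mu, v_y, x, v_y, p) for x in X])    # plug-in N_p(x, v_y I)
risk_pU = np.mean([kl_normal(mu, v_y, x, v_x + v_y, p) for x in X])  # p_U = N_p(x, (v_x+v_y) I)
print("plug-in risk:", risk_plugin, "  exact:", p * v_x / (2 * v_y))
print("p_U risk:    ", risk_pU, "  exact:", 0.5 * p * np.log(1 + v_x / v_y))
```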
• $p_H(y \mid x)$, the Bayes rule under the harmonic prior $\pi_H(\mu) = \|\mu\|^{-(p-2)}$, dominates $p_U(y \mid x)$ when $p \ge 3$ (Komaki 2001)
• $p_a(y \mid x)$, the Bayes rule under $\pi_a(\mu)$ where
  $$\mu \mid s \sim N_p(0, s\, v_0 I), \qquad s \sim (1+s)^{a-2},$$
  dominates $p_U(y \mid x)$ and is proper Bayes when $v_x \le v_0$ and when $p = 5$ and $a \in [0.5, 1)$ or when $p \ge 6$ and $a \in [0, 1)$ (Liang 2002)
• Main Question: Are these domination results attributable to the properties of $m_\pi$?
4. A Key Representation for $p_\pi(y \mid x)$

• Let $m_\pi(x; v_x)$ denote the marginal of $X \mid \mu \sim N_p(\mu, v_x I)$ under $\pi(\mu)$
• Lemma: The Bayes rule $p_\pi(y \mid x)$ can be expressed as
  $$p_\pi(y \mid x) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)}\; p_U(y \mid x),$$
  where
  $$W = \frac{v_y X + v_x Y}{v_x + v_y} \sim N_p(\mu, v_w I), \qquad v_w = \frac{v_x v_y}{v_x + v_y}$$
  (the Lemma is verified numerically below for a conjugate prior)
• Using this, the risk improvement can be expressed as
  $$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_\pi) = \int\!\!\int p_{v_x}(x \mid \mu)\, p_{v_y}(y \mid \mu) \log \frac{p_\pi(y \mid x)}{p_U(y \mid x)}\, dx\, dy = E_{\mu, v_w} \log m_\pi(W; v_w) - E_{\mu, v_x} \log m_\pi(X; v_x)$$
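A numerical check of the Lemma (my own sketch, not from the talk). It assumes a conjugate $N_p(0, \tau^2 I)$ prior so that $m_\pi$, $p_U$, and $p_\pi$ all have closed forms; the specific $x$, $y$, $\tau^2$, $v_x$, $v_y$ values are arbitrary.

```python
# Verify p_pi(y|x) = [ m_pi(w; v_w) / m_pi(x; v_x) ] * p_U(y|x) at one (x, y) pair.
import numpy as np

def sph_normal_pdf(z, mean, v):
    """Density of N_p(mean, v I) at z."""
    p = len(z)
    return np.exp(-np.sum((z - mean)**2) / (2 * v)) / (2 * np.pi * v)**(p / 2)

p, tau2, v_x, v_y = 3, 2.0, 1.0, 0.5
x = np.array([1.0, -1.0, 2.0])
y = np.array([0.5, 0.0, 1.5])

v_w = v_x * v_y / (v_x + v_y)
w = (v_y * x + v_x * y) / (v_x + v_y)

# closed forms under the conjugate N_p(0, tau2 I) prior
m = lambda z, v: sph_normal_pdf(z, np.zeros(p), v + tau2)     # marginal m_pi(z; v)
p_U = sph_normal_pdf(y, x, v_x + v_y)                         # uniform-prior predictive
post_mean = tau2 / (v_x + tau2) * x
p_pi = sph_normal_pdf(y, post_mean, v_y + v_x * tau2 / (v_x + tau2))

print(p_pi)                                                   # left-hand side of the Lemma
print(m(w, v_w) / m(x, v_x) * p_U)                            # right-hand side; should match
```

Both sides evaluate to the same density value, as the Lemma requires.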
5. An Analogue of Stein's Unbiased Estimate of Risk

• Theorem:
  $$\frac{\partial}{\partial v}\, E_{\mu,v} \log m_\pi(Z; v) = E_{\mu,v}\!\left[\frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} - \frac{1}{2}\,\|\nabla \log m_\pi(Z; v)\|^2\right] = E_{\mu,v}\!\left[2\,\frac{\nabla^2 \sqrt{m_\pi(Z; v)}}{\sqrt{m_\pi(Z; v)}}\right]$$
• Proof relies on the heat equation
  $$\frac{\partial}{\partial v}\, m_\pi(z; v) = \frac{1}{2}\, \nabla^2 m_\pi(z; v),$$
  Brown's representation, and Stein's Lemma
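A finite-difference check of the heat equation (my own sketch, not from the talk). It uses the $N_p(0, \tau^2 I)$ prior, for which $m_\pi(z; v)$ is simply the $N_p(0, (v + \tau^2)I)$ density; the evaluation point and step size are arbitrary.

```python
# Check d/dv m_pi(z; v) = (1/2) * Laplacian_z m_pi(z; v) by finite differences.
import numpy as np

def m(z, v, tau2=2.0):
    """Marginal m_pi(z; v) under the N_p(0, tau2 I) prior: the N_p(0, (v+tau2) I) density."""
    p = len(z)
    s = v + tau2
    return np.exp(-np.sum(z**2) / (2 * s)) / (2 * np.pi * s)**(p / 2)

z0 = np.array([0.7, -1.2, 0.3])
v0, h = 1.0, 1e-3

dm_dv = (m(z0, v0 + h) - m(z0, v0 - h)) / (2 * h)             # derivative in v

laplacian = 0.0                                               # sum of second partials in z
for i in range(len(z0)):
    e = np.zeros(len(z0)); e[i] = h
    laplacian += (m(z0 + e, v0) - 2 * m(z0, v0) + m(z0 - e, v0)) / h**2

print(dm_dv, 0.5 * laplacian)                                 # the two values should agree
```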
6. General Conditions for Minimax Prediction

• Let $m_\pi(z; v)$ be the marginal distribution of $Z \mid \mu \sim N_p(\mu, vI)$ under $\pi(\mu)$
• Theorem: If $m_\pi(z; v)$ is finite for all $z$, then $p_\pi(y \mid x)$ will be minimax if either of the following hold:
  (i) $m_\pi(z; v)$ is superharmonic
  (ii) $\sqrt{m_\pi(z; v)}$ is superharmonic
• Corollary: If $m_\pi(z; v)$ is finite for all $z$, then $p_\pi(y \mid x)$ will be minimax if $\pi(\mu)$ is superharmonic (checked below for $\pi_H$)
• $p_\pi(y \mid x)$ will dominate $p_U(y \mid x)$ in the above results if the superharmonicity is strict on some interval
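A quick finite-difference check (my own sketch; the evaluation point is arbitrary) that the harmonic prior $\pi_H(\mu) = \|\mu\|^{-(p-2)}$ has Laplacian zero away from the origin, hence is superharmonic, so the Corollary covers $p_H(y \mid x)$.

```python
# Numerical Laplacian of the harmonic prior at a point away from the origin.
import numpy as np

p = 5
pi_H = lambda mu: np.sum(mu**2) ** (-(p - 2) / 2.0)   # harmonic prior ||mu||^{-(p-2)}

mu0, h = np.array([1.0, -0.5, 2.0, 0.3, -1.1]), 1e-3
laplacian = 0.0
for i in range(p):
    e = np.zeros(p); e[i] = h
    laplacian += (pi_H(mu0 + e) - 2 * pi_H(mu0) + pi_H(mu0 - e)) / h**2

print(laplacian)   # approximately 0: pi_H is harmonic (hence superharmonic) away from 0
```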
7. An Explicit Connection Between the Two Problems

• Comparing Stein's unbiased quadratic risk expression with our unbiased KL risk expression reveals
  $$R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi) = -2\left[\frac{\partial}{\partial v}\, E_{\mu,v} \log m_\pi(Z; v)\right]_{v=1}$$
• Combined with our previous KL risk difference expression, this reveals a fascinating connection:
  $$R_{KL}(\mu, p_U) - R_{KL}(\mu, p_\pi) = \frac{1}{2}\int_{v_w}^{v_x} \frac{1}{v^2}\,\big[R_Q(\mu, \hat\mu_U) - R_Q(\mu, \hat\mu_\pi)\big]_v\, dv,$$
  where the bracketed quadratic risk difference is computed under sampling variance $v$, i.e. $Z \mid \mu \sim N_p(\mu, vI)$
• Ultimately it is this connection that yields the similar conditions for minimaxity and domination in both problems. Can we go further?
8. Sufficient Conditions for Admissibility

• Let $B_{KL}(\pi, q) \equiv E_\pi[R_{KL}(\mu, q)]$ be the average KL risk of $q(y \mid x)$ under $\pi$
• Theorem (Blyth's Method): If there is a sequence of finite non-negative measures satisfying $\pi_n(\{\mu : \|\mu\| \le 1\}) \ge 1$ such that
  $$B_{KL}(\pi_n, q) - B_{KL}(\pi_n, p_{\pi_n}) \to 0,$$
  then $q$ is admissible
• Theorem: For any two Bayes rules $p_\pi$ and $p_{\pi_n}$,
  $$B_{KL}(\pi_n, p_\pi) - B_{KL}(\pi_n, p_{\pi_n}) = \frac{1}{2}\int_{v_w}^{v_x} \frac{1}{v^2}\,\big[B_Q(\pi_n, \hat\mu_\pi) - B_Q(\pi_n, \hat\mu_{\pi_n})\big]_v\, dv,$$
  where $B_Q(\pi, \hat\mu)$ is the average quadratic risk of $\hat\mu$ under $\pi$
• Using the explicit construction of $\pi_n(\mu)$ from Brown and Hwang (1984), we obtain tail behavior conditions that prove admissibility of $p_U(y \mid x)$ when $p \le 2$, and admissibility of $p_H(y \mid x)$ when $p \ge 3$
9. A Complete Class Theorem

• Theorem: In the KL risk problem, all admissible procedures are Bayes or formal Bayes procedures
• Our proof uses the weak* topology induced by the pairing of $L^\infty$ with $L^1$ to define convergence on the action space, which is the set of all proper densities on $\mathbb{R}^p$
• A Sketch of the Proof:
  (i) All admissible procedures are non-randomized.
  (ii) For any admissible procedure $p(\cdot \mid x)$, there exists a sequence of priors $\pi_i(\mu)$ such that $p_{\pi_i}(\cdot \mid x) \to p(\cdot \mid x)$ weak* for a.e. $x$.
  (iii) We can find a subsequence $\{\pi_{i''}\}$ and a limit prior $\pi$ such that $p_{\pi_{i''}}(\cdot \mid x) \to p_\pi(\cdot \mid x)$ weak* for almost every $x$. Therefore, $p(\cdot \mid x) = p_\pi(\cdot \mid x)$ for a.e. $x$, i.e. $p(\cdot \mid x)$ is a Bayes or formal Bayes rule.
10. Predictive Estimation for Linear Regression

• Observe $X_{m \times 1} = A_{m \times p}\, \beta_{p \times 1} + \varepsilon_{m \times 1}$ and predict $Y_{n \times 1} = B_{n \times p}\, \beta_{p \times 1} + \tau_{n \times 1}$
  – $\varepsilon \sim N_m(0, I_m)$ is independent of $\tau \sim N_n(0, I_n)$
  – $\mathrm{rank}(A'A) = p$
• Given a prior $\pi$ on $\beta$, the Bayes procedure $p^L_\pi(y \mid x)$ is
  $$p^L_\pi(y \mid x) = \frac{\int p(x \mid A\beta)\, p(y \mid B\beta)\, \pi(\beta)\, d\beta}{\int p(x \mid A\beta)\, \pi(\beta)\, d\beta}$$
• The Bayes procedure $p^L_U(y \mid x)$ under the uniform prior $\pi_U \equiv 1$ is minimax with constant risk
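To make the setup concrete, a small sketch (my own, not from the talk) that relies on the standard flat-prior facts that $\beta \mid x \sim N_p(\hat\beta_x, (A'A)^{-1})$ under $\pi_U \equiv 1$, so that $p^L_U(y \mid x) = N_n(B\hat\beta_x,\, I_n + B(A'A)^{-1}B')$; the design matrices and $\beta$ below are simulated for illustration.

```python
# Form the uniform-prior predictive p^L_U(y|x) from simulated regression data.
import numpy as np

rng = np.random.default_rng(2)
m_obs, n_pred, p = 20, 5, 3
A = rng.normal(size=(m_obs, p))                # design for the observed X
B = rng.normal(size=(n_pred, p))               # design for the future Y
beta = np.array([1.0, -2.0, 0.5])
x = A @ beta + rng.normal(size=m_obs)

AtA_inv = np.linalg.inv(A.T @ A)
beta_hat_x = AtA_inv @ A.T @ x                 # least-squares estimate (A'A)^{-1} A' x

pred_mean = B @ beta_hat_x                     # center of p^L_U(y | x)
pred_cov = np.eye(n_pred) + B @ AtA_inv @ B.T  # its covariance I_n + B (A'A)^{-1} B'
print(pred_mean)
print(np.round(pred_cov, 3))
```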
11. The Key Marginal Representation

• For any prior $\pi$,
  $$p^L_\pi(y \mid x) = \frac{m_\pi(\hat\beta_{x,y};\, (C'C)^{-1})}{m_\pi(\hat\beta_x;\, (A'A)^{-1})}\; p^L_U(y \mid x),$$
  where $C_{(m+n) \times p} = (A', B')'$ and
  $$\hat\beta_x = (A'A)^{-1} A' x \sim N_p(\beta, (A'A)^{-1}), \qquad \hat\beta_{x,y} = (C'C)^{-1} C' (x', y')' \sim N_p(\beta, (C'C)^{-1})$$
12. Risk Improvement over $p^L_U(y \mid x)$

• Here the difference between the KL risks of $p^L_U(y \mid x)$ and $p^L_\pi(y \mid x)$ can be expressed as
  $$R_{KL}(\beta, p^L_U) - R_{KL}(\beta, p^L_\pi) = E_{\beta, (C'C)^{-1}} \log m_\pi(\hat\beta_{x,y}; (C'C)^{-1}) - E_{\beta, (A'A)^{-1}} \log m_\pi(\hat\beta_x; (A'A)^{-1})$$
• Minimaxity of $p^L_\pi(y \mid x)$ is here obtained when
  $$\frac{\partial}{\partial \omega}\, E_{\mu, V_\omega} \log m_\pi(Z; V_\omega) < 0, \qquad \text{where } V_\omega \equiv \omega (A'A)^{-1} + (1 - \omega)(C'C)^{-1}$$
• This leads to weighted superharmonic conditions on $m_\pi$ and $\pi$ for minimaxity
13. Minimax Shrinkage Towards 0

• Our Lemma representation
  $$p_H(y \mid x) = \frac{m_H(w; v_w)}{m_H(x; v_x)}\; p_U(y \mid x)$$
  shows how $p_H(y \mid x)$ "shrinks $p_U(y \mid x)$ towards 0" by an adaptive multiplicative factor
• The following figure illustrates how this shrinkage occurs for various values of $x$