a new bayesian variable selection criterion based on a g
play

A new Bayesian variable selection criterion based on a g -Prior - PowerPoint PPT Presentation

A new Bayesian variable selection criterion based on a g -Prior extension for p > n Yuzo Maruyama and Edward George CSIS, The University of Tokyo, Japan Department of Stat, University of Pennsylvania Overview: Our recommendable Bayes factor


  1. A new Bayesian variable selection criterion based on a g -Prior extension for p > n Yuzo Maruyama and Edward George CSIS, The University of Tokyo, Japan Department of Stat, University of Pennsylvania

  2. Overview: Our recommendable Bayes factor  � � − n +1 sv[ X γ ] × � ˆ β MP  LSE [ γ ] � if q γ ≥ n − 1     γ ) − n − q γ  d q γ + 3 4 B ( q γ 4 , n − q γ 2 + 1 − 3 q γ (1 − R 2 4 ) 2 2 4 + q γ 1 q γ � ˆ 4 , n − q γ  2 B ( 1 − 3 sv[ X γ ] q γ (1 − R 2 γ + d 2 β LSE [ γ ] � 2 ) 4 )   2    if q γ ≤ n − 2 ◮ A criterion based on full Bayes ◮ but we need no MCMC ◮ An exact closed form by using a special prior ◮ applicable for p > n as well as n > p ◮ model selection consistency and good numerical performance

  3. Introduction Priors Sketch of the calculation of the marginal density The estimation after selection Model selection consistency Numerical experiments Summary and Future work

  4. Full model ◮ Y |{ α, β, σ 2 } ∼ N n ( α 1 n + X β, σ 2 I ) ◮ α : an intercept parameter ◮ 1 n = (1 , 1 , . . . , 1) ′ ◮ X = ( X 1 , . . . , X p ): an n × p standarized design matrix rank X = min( n − 1 , p ) ◮ β : a p × 1 vector of unknown coefficients ◮ σ 2 : an unknown variance Since there is usually a subset of useless regressors in the full model, we would like to choose a good sub-model with only important regressors.

  5. Submodel ◮ submodel M γ Y |{ α, β γ , σ 2 } ∼ N n ( α 1 n + X γ β γ , σ 2 I ) ◮ Assume the intercept is always included ◮ X γ : the n × q γ matrix, rank X γ = min( n − 1 , q γ ) columns = the γ th subset of X 1 , . . . , X p ◮ β γ : a q γ × 1 vector of unknown regression coefficients ◮ q γ : the number of regressors of M γ ◮ The null model: The special case of sub-model M N : Y |{ α, σ 2 } ∼ N n ( α 1 n , σ 2 I )

  6. Variable selection in the Bayesian framework ◮ It entails the specification of prior ◮ on the models Pr( M γ ) ◮ on parameters p ( α, β γ , σ 2 ) of each model ◮ Assumption: equal model space probability Pr( M γ ) = Pr( M γ ′ ) for any γ � = γ ′ ◮ Choose M γ as the best model which maximizes m γ ( y ) posterior prob. Pr( M γ | y ) = � γ m γ ( y ) ◮ m γ ( y ): the marginal density under M γ larger m γ ( y ) is better!

  7. Variable selection in the Bayesian framework ◮ the marginal density ��� p y ( y | α, β γ , σ 2 ) p ( α, β γ , σ 2 ) d α d β γ d σ 2 m γ ( y ) = ◮ Recall that we consider Full Bayes method, which means the joint prior density p ( α, β γ , σ 2 ) does not depend on data unlike Empirical Bayes method. ◮ Bayes factor is often used for expression of Pr( M γ | y ) BF( M γ ; M N ) Pr( M γ | y ) = � γ BF( M γ ; M N ) where BF( M γ ; M N ) = m γ ( y ) m N ( y )

  8. Introduction Priors Sketch of the calculation of the marginal density The estimation after selection Model selection consistency Numerical experiments Summary and Future work

  9. Priors ◮ The form of our joint density p ( α, β γ , σ 2 ) = p ( α ) p ( σ 2 ) p ( β | σ 2 ) � = 1 × σ − 2 × p ( β | g , σ 2 ) p ( g ) dg ◮ 1 × σ − 2 : a popular non-informative prior ◮ improper but justificated because α and σ 2 are included in all submodels ◮ p ( β | g , σ 2 ) and p ( g )

  10. The original Zellner’s g -prior ◮ prior of regression coefficients ◮ Zellner’s (1986) g -prior is popular p β γ ( β γ | σ 2 , g ) = N q γ (0 , g σ 2 ( X ′ γ X γ ) − 1 ) ◮ It is applicable for the traditional situation p + 1 < n ⇒ q γ + 1 < n for any M γ ◮ There are many papers which use g -priors including George and Foster (2000, Biometrika) and Liang et al. (2008, JASA)

  11. The beauty of the g -prior ◮ The marginal density of y given g and σ 2 � � �� g α,β γ log p ( Y | α, β γ , σ 2 ) − q γ g + 1 exp max log( g + 1) g + 1 2 g ◮ Under known σ 2 , g − 1 ( g + 1) log( g + 1) = 2 , or log n leads to AIC by Akaike (1974) and BIC by Schwarz (1978) respectively ◮ several studies: how to choose g based on non-full Bayesian method

  12. Many regressors case ( p > n ) ◮ In modern statistics, treating (very) many regressors case ( p > n ) becomes more and more important ◮ the original Zellner’s g -prior is not available ◮ R 2 is always 1 in the case where q γ ≥ n − 1 ⇒ naive AIC and BIC methods do not work ◮ When we do not use the original g -prior, Bayesian method is available in many regressors case for example β ∼ N (0 , σ 2 λ I ) ◮ inverse-gamma conjugate prior for σ 2 are also available

  13. Many regressors case ( p > n ) ◮ The integral with respect to λ still remains in m γ ( y ) as long as the full Bayes method is considered. ◮ Needless to say, it should be calculated by numerical methods like MCMC or by approximation like Laplace method. ◮ We do not have comparative advantage in numerical methods,,,,, ◮ We like exact analytical results very much.

  14. A variant of Zellner’s g -prior ◮ a special variant of g -prior which enables us to ◮ not only calculate the marginal density analytically (closed form!!) ◮ but also treat many regressors case ◮ [KEY] singular value decomposition of X γ r � X γ = U γ D γ W ′ d i [ γ ] u i [ γ ] w ′ γ = i [ γ ] i =1 ◮ r : rank of X = min( q γ , n − 1) ◮ the n − 1 is from “ X is the centered matrix” ◮ singular values d 1 [ γ ] ≥ · · · ≥ d r [ γ ] > 0

  15. A special variant of g -prior  arbitrary   � �� � � n − 1  i β | g , σ 2 ) ×  i =1 p i ( w ′ p # ( W ′ # β ) p β ( β | g , σ 2 ) = if q ≥ n    � q i β | g , σ 2 ) if q ≤ n − 1  i =1 p i ( w ′ p i ( ·| g , σ 2 ) = N (0 , σ 2 { ν i (1 + g ) − 1 } ) d 2 i W # : a q × ( q − r ) matrix from the orthogonal complement of W q � c.f. original g -prior p β ( β | g , σ 2 ) = i β | g , σ 2 ) if q ≤ n − 1 p i ( w ′ i =1 p i ( ·| g , σ 2 ) = N (0 , g σ 2 ) d 2 i

  16. A special variant of g -prior ◮ ν 1 , . . . , ν r ( ν i ≥ 1) where r = min { n − 1 , q } hyperparameters we have to fix ◮ q ≤ n − 1 ⇒ ( Z ′ Z ) − 1 exists ν 1 = · · · = ν q = 1 ⇒ the original Zellner’s prior ◮ the descending order ν 1 ≥ · · · ≥ ν r like ν i = d 2 i / d 2 (our recommendation) r for 1 ≤ i ≤ r is reasonable for our purpose ◮ numerical experiment and the estimation after selection support the choice

  17. Introduction Priors Sketch of the calculation of the marginal density The estimation after selection Model selection consistency Numerical experiments Summary and Future work

  18. Sketch of the calculation of the marginal density ◮ we have prepared all of priors except for g (we will give a prior of g later) ◮ the marginal density of y given g = the marignal density after the integration w.r.t. α , β , σ 2 � � − ( n − 1) / 2 ( g + 1)(1 − R 2 γ ) + GR 2 m γ ( y | g ) = C ( n , y ) γ × (1 + g ) − r / 2+( n − 1) / 2 � r i =1 ν 1 / 2 i where G R 2 γ means the “generalized” R 2 γ r ( u ′ y 1 n } ) 2 i { y − ¯ � G R 2 γ = y 1 n � 2 ν i � y − ¯ i =1

  19. Many regressors case ◮ rank of X = r = n − 1, R 2 γ = 1 ◮ m γ ( y | g ) does not depend on g � � − ( n − 1) / 2 m γ ( y ) = m γ ( y | g ) = C ( n , y ) � n − 1 i =1 ν − 1 / 2 GR 2 γ i ◮ If ν 1 = · · · = ν n − 1 = 1, GR 2 γ just becomes 1 and hence m γ ( y ) = C ( n , y ) ◮ it does not work for model selection because it always takes the same value in many regressors case ◮ That is why the choice of ν is important.

  20. few regressors case ( q ≤ n − 2) ◮ p g ( g ) = { B ( a + 1 , b + 1) } − 1 g b (1 + g ) − a − b − 2 ◮ it is proper if a > − 1 and b > − 1 ◮ Liang et al (2008, JASA) “hyper- g priors” b = 0 p g ( g ) = ( a + 1) − 1 ( g + 1) − a − 2 ◮ b = ( n − 5 − r ) / 2 − a is for getting a closed simple form of the marginal density ◮ − 1 < a < − 1 / 2 is for well-defining the marginal density of every sub-model ◮ The median a = − 3 / 4 is our recommendation

  21. Sketch of the calculation of the marginal density ◮ When b = ( n − 5) / 2 − r / 2 − a , the beta function takes the integration w.r.t. g � m γ ( y | g ) p ( g ) dg = C ( n , y ) B ( q / 2 + a + 1 , b + 1)(1 − R 2 γ + GR 2 γ ) − ( n − 1) / 2+ b +1 � r i =1 ν 1 / 2 B ( a + 1 , b + 1)(1 − R 2 γ ) b +1 i ◮ When b � = ( n − 5) / 2 − r / 2 − a , there remains an integral with R 2 γ and GR 2 γ in m γ ( y ) ⇒ the need of MCMC or approximation ◮ Liang et al (2008, JASA) b = 0, ν 1 = · · · = ν r = 1 the Laplace approximation

  22. Our recommendable BF ◮ After insertion of our recommendable hyperparameters a = − 3 / 4, b = ( n − 5) / 2 − r / 2 − a and ν i = d 2 i / d 2 r Our criterion BF[ M γ ; M N ]= m γ ( y ) / m N ( y ) becomes  � � − n +1 sv[ X γ ] × � ˆ β MP  LSE [ γ ] � if q γ ≥ n − 1     γ ) − n − q γ d q γ + 3 4 B ( q γ 4 , n − q γ  q γ (1 − R 2 2 + 1 − 3 4 ) 2 2 4 + q γ q γ � ˆ 4 , n − q γ 1 2 B ( 1 − 3 sv[ X γ ] q γ (1 − R 2 γ + d 2 β LSE [ γ ] � 2 )  4 )   2    if q γ ≤ n − 2 ◮ It is exactly proportional to the posterior probability ◮ based on fundamental aggregated information of y and X γ

Recommend


More recommend