Introduction to Bayesian Statistics
Lecture 11: Model Comparison
Rung-Ching Tsai
Department of Mathematics, National Taiwan Normal University
May 20, 2015
Evaluating and Comparing Models
• Measures of predictive accuracy
  ◦ Log predictive density as a measure of fit
  ◦ Out-of-sample predictive accuracy as the gold standard
• Deviance, information criteria, and cross-validation
  ◦ Within-sample predictive accuracy
  ◦ Subtracting an adjustment
  ◦ Cross-validation
• Model comparison based on predictive performance
• Model comparison based on the Bayes factor
Akaike information criterion (AIC)
• $\widehat{\mathrm{elpd}}_{\mathrm{AIC}} = \log p(y \mid \hat\theta_{\mathrm{mle}}) - k$, where $k$ is the number of parameters
• elpd = expected log predictive density
• Based on the fit to the observed data given the maximum likelihood estimate
• Goal: estimate the expected log predictive density
  $\mathrm{elpd} = \mathrm{E}_{\tilde y}\big[\log p(\tilde y \mid \hat\theta_{\mathrm{mle}})\big]$
  ◦ the expectation averages over the predictive distribution of $\tilde y$
  ◦ AIC began life with Akaike's (1973) theorem, which established that AIC is an (asymptotically) unbiased estimator of this predictive accuracy.
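As a minimal sketch of the formula above, the snippet below computes $\widehat{\mathrm{elpd}}_{\mathrm{AIC}}$ for a simple normal model fitted by maximum likelihood. The simulated data and the choice of model are hypothetical illustrations, not part of the lecture.

```python
import numpy as np
from scipy import stats

# Simulated data (hypothetical example)
rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=50)

# MLE for a normal model: sample mean and MLE standard deviation
mu_hat = y.mean()
sigma_hat = y.std()          # np.std uses 1/n by default, i.e. the MLE
k = 2                        # number of fitted parameters (mu, sigma)

log_lik = stats.norm.logpdf(y, loc=mu_hat, scale=sigma_hat).sum()

elpd_hat_aic = log_lik - k   # \widehat{elpd}_AIC from the slide
aic = -2 * elpd_hat_aic      # conventional AIC = -2 log p(y | theta_mle) + 2k
print(elpd_hat_aic, aic)
```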
deviance
What is the 'deviance'?
• For a likelihood $p(y \mid \theta)$, we define the deviance as
  $D(y, \theta) = -2 \log p(y \mid \theta)$
  e.g., for $y_i \sim \mathrm{Binomial}(n_i, \theta_i)$, $i = 1, \ldots, n$, the deviance is
  $-2 \sum_{i=1}^{n} \left[ y_i \log \theta_i + (n_i - y_i) \log(1 - \theta_i) + \log \binom{n_i}{y_i} \right]$
• The deviance can be negative. It is derived from the likelihood and evaluated at a particular point in parameter space; likelihood values greater than 1 give a negative deviance, which is perfectly legitimate.
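A small sketch of the binomial deviance above, using made-up counts and success probabilities; it checks that the built-in binomial log-pmf and the explicit formula on the slide agree.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

# Hypothetical binomial data: y_i successes out of n_i trials
n = np.array([10, 12, 8, 15])
y = np.array([3, 7, 2, 11])
theta = np.array([0.3, 0.55, 0.25, 0.7])   # assumed success probabilities

# D(y, theta) = -2 log p(y | theta), via the built-in log-pmf
deviance = -2 * stats.binom.logpmf(y, n, theta).sum()

# Explicit form: -2 * sum[ y log(theta) + (n - y) log(1 - theta) + log C(n, y) ]
log_binom_coef = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
deviance_explicit = -2 * (y * np.log(theta) + (n - y) * np.log(1 - theta)
                          + log_binom_coef).sum()
print(deviance, deviance_explicit)   # the two agree
```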
mean deviance as a measure of fit
• Dempster (1974) suggested plotting the posterior distribution of the deviance $D = -2 \log p(y \mid \theta)$
• Use the posterior mean deviance $\bar D = \mathrm{E}_{\theta \mid y}[D(y, \theta)]$ as a measure of fit
• Invariant to the parameterization of $\theta$
• Robust; generally converges well
• But more complex models will fit the data better and so will have smaller $\bar D$
• Need some measure of model complexity to trade off against $\bar D$
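A minimal sketch of estimating $\bar D$ by Monte Carlo, assuming a conjugate Beta-Binomial model so posterior draws are available in closed form; the data values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: binomial likelihood with a Beta(1, 1) prior, so the
# posterior for theta is Beta(1 + y, 1 + n - y) and can be sampled directly.
rng = np.random.default_rng(1)
n, y = 20, 14
theta_draws = rng.beta(1 + y, 1 + n - y, size=4000)   # posterior draws

# Deviance at each draw: D(y, theta_l) = -2 log p(y | theta_l)
D = -2 * stats.binom.logpmf(y, n, theta_draws)

D_bar = D.mean()   # posterior mean deviance, the measure of fit on the slide
print(D_bar)
```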
counting parameters and model complexity - $p_D^{(1)}$
• Bayesian measure of model complexity (Spiegelhalter et al., 2002):
  $\mathrm{E}_{\theta \mid y}[-2 \log p(y \mid \theta)] - \big(-2 \log p(y \mid \tilde\theta)\big) = \mathrm{E}_{\theta \mid y}[D(y, \theta)] - D(y, \tilde\theta)$,
  where $\tilde\theta = \mathrm{E}[\theta \mid y]$; that is, the measure is the posterior mean deviance minus the deviance at the posterior mean.
• The effective number of parameters of a Bayesian model:
  $p_D^{(1)} = \mathrm{E}_{\theta \mid y}[D(y, \theta)] - D(y, \tilde\theta)$, estimated by
  $\hat p_D^{(1)} = \hat D_{\mathrm{avg}}(y) - D_{\hat\theta}(y) = \frac{1}{L} \sum_{l=1}^{L} \big( D(y, \theta^l) - D_{\hat\theta}(y) \big)$
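The estimator $\hat p_D^{(1)}$ is easy to compute from posterior draws. Below is a sketch using the same hypothetical Beta-Binomial example as before.

```python
import numpy as np
from scipy import stats

# Hypothetical Beta-Binomial example: posterior draws for theta
rng = np.random.default_rng(1)
n, y = 20, 14
theta_draws = rng.beta(1 + y, 1 + n - y, size=4000)

D = -2 * stats.binom.logpmf(y, n, theta_draws)     # deviance at each draw
D_avg = D.mean()                                   # \hat{D}_avg(y)

theta_tilde = theta_draws.mean()                   # posterior mean of theta
D_at_mean = -2 * stats.binom.logpmf(y, n, theta_tilde)   # D_{\hat{theta}}(y)

p_D1 = D_avg - D_at_mean                           # effective number of parameters
print(p_D1)
```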
counting parameters and model complexity - $p_D^{(2)}$
• A related way to measure model complexity is as half the posterior variance of the model-level deviance; its estimate is known as $p_D^{(2)}$ (Gelman et al., 2004):
  $p_D^{(2)} = \mathrm{var}_{\theta \mid y}[D(y, \theta)] / 2$, estimated by
  $\hat p_D^{(2)} = \frac{1}{2} \cdot \frac{1}{L - 1} \sum_{l=1}^{L} \big( D(y, \theta^l) - \hat D_{\mathrm{avg}}(y) \big)^2$
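The same posterior draws give $\hat p_D^{(2)}$ as half the sample variance of the deviance, as in this sketch (again with the made-up Beta-Binomial example).

```python
import numpy as np
from scipy import stats

# Same hypothetical Beta-Binomial posterior draws as before
rng = np.random.default_rng(1)
n, y = 20, 14
theta_draws = rng.beta(1 + y, 1 + n - y, size=4000)

D = -2 * stats.binom.logpmf(y, n, theta_draws)   # deviance at each posterior draw

# p_D^(2): half the posterior variance of the deviance (sample variance, ddof=1)
p_D2 = D.var(ddof=1) / 2
print(p_D2)
```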
comparison of $p_D^{(1)}$ and $p_D^{(2)}$
• $p_D^{(1)}$ is not invariant to reparameterization (the subject of much criticism).
• In normal linear hierarchical models, $p_D^{(1)} = \mathrm{tr}(H)$, where $H$ is the hat matrix that projects the data onto the fitted values, $Hy = \hat y$. Thus $p_D^{(1)} = \sum_i h_{ii} = \sum$ leverages. In general, the justification of $p_D^{(1)}$ depends on asymptotic normality of the posterior distribution. (A quick numerical check of the hat-matrix identity appears below.)
• $p_D^{(1)}$ or $p_D^{(2)}$ can be thought of as the number of 'unconstrained' parameters in the model, where a parameter counts as: 1 if it is estimated with no constraints or prior information; 0 if it is fully constrained or if all the information about the parameter comes from the prior distribution; or an intermediate value if both the data and the prior are informative.
• $p_D^{(1)}$ and $p_D^{(2)}$ should be positive. A negative $p_D^{(1)}$ indicates one or more problems: the log-likelihood is non-concave, there is a conflict between the prior and the data, or the posterior mean is a poor estimator (as with a bimodal posterior).
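To illustrate the hat-matrix identity in the simplest special case, the sketch below checks that $\mathrm{tr}(H)$ equals the number of regression coefficients in an ordinary (non-hierarchical) linear regression with flat priors; the design matrix is made up for illustration.

```python
import numpy as np

# Hypothetical design matrix: intercept plus two covariates
rng = np.random.default_rng(2)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X'X)^{-1} X', which maps y to fitted values: H y = y_hat
H = X @ np.linalg.solve(X.T @ X, X.T)

leverages = np.diag(H)
print(leverages.sum())   # tr(H) = sum of leverages = p = 3 here
```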
Deviance information criterion (DIC)
• Use a criterion based on a trade-off between the fit of the data to the model and the corresponding complexity of the model.
• Spiegelhalter et al. (2002) proposed a Bayesian model comparison criterion based on this principle:
  Deviance Information Criterion, DIC = goodness of fit + complexity
• $\widehat{\mathrm{elpd}}_{\mathrm{DIC}} = \log p(y \mid \hat\theta_{\mathrm{Bayes}}) - p_{\mathrm{DIC}}$
• Based on the fit to the observed data given the posterior mean
• The effective number of parameters $p_{\mathrm{DIC}}$ is computed from a normal approximation ($\chi^2$ approximation to $-2$ log likelihood): $p_D^{(1)}$ or $p_D^{(2)}$
• Either $p_D^{(1)}$ or $p_D^{(2)}$ is asymptotically fine in expectation
Model comparison - using DIC
• The DIC is then defined analogously to AIC as
  $\mathrm{DIC} = D(\hat\theta_{\mathrm{Bayes}}) + 2 p_D^{(1)} = \bar D + p_D^{(1)}$, or $\mathrm{DIC} = \bar D + p_D^{(2)}$
• DIC may be compared across different models and even different estimation methods, as long as the dependent variable does not change between models; this makes DIC a very flexible model fit statistic.
• Like AIC and BIC, DIC is an asymptotic approximation as the sample size becomes large. DIC is valid only when the joint posterior distribution is approximately multivariate normal.
• Models with smaller DIC should be preferred. Since DIC increases with model complexity ($p_D^{(1)}$ or $p_D^{(2)}$), simpler models are favored, other things being equal.
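A minimal sketch of assembling DIC from posterior draws, using $p_D^{(1)}$ and the hypothetical Beta-Binomial example from earlier; the helper function name is my own, not from the lecture.

```python
import numpy as np
from scipy import stats

def dic_from_draws(log_lik_at_draws, log_lik_at_posterior_mean):
    """DIC = D_bar + p_D, with p_D = p_D^(1) = D_bar - D(theta_hat_Bayes)."""
    D = -2 * log_lik_at_draws
    D_bar = D.mean()
    D_hat = -2 * log_lik_at_posterior_mean
    p_D = D_bar - D_hat
    return D_bar + p_D, p_D

# Hypothetical Beta-Binomial example again
rng = np.random.default_rng(1)
n, y = 20, 14
theta_draws = rng.beta(1 + y, 1 + n - y, size=4000)

log_lik_draws = stats.binom.logpmf(y, n, theta_draws)
log_lik_at_mean = stats.binom.logpmf(y, n, theta_draws.mean())

dic, p_D = dic_from_draws(log_lik_draws, log_lik_at_mean)
print(dic, p_D)   # smaller DIC is preferred when comparing models on the same y
```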
How do I compare different DICs?
• The model with the smallest DIC is estimated to make the best short-term predictions, in the same spirit as Akaike's criterion.
• It is difficult to say what constitutes an important difference in DIC. Very roughly,
  ◦ a difference of more than 10 can fairly safely rule out the model with the higher DIC;
  ◦ a difference between 5 and 10 is substantial;
  ◦ if the difference in DIC is, say, less than 5, and the models make very different inferences, then it could be misleading to report only the model with the lowest DIC.
Watanabe-Akaike information criterion (WAIC)
• $\widehat{\mathrm{elppd}}_{\mathrm{WAIC}} = \left( \sum_{i=1}^{n} \log p_{\mathrm{post}}(y_i) \right) - p_{\mathrm{WAIC}}$
• elppd = expected log pointwise predictive density
• Based on the posterior predictive fit to the observed data
• $p_{\mathrm{WAIC}} = \sum_{i=1}^{n} \mathrm{var}_{\mathrm{post}}\big(\log p(y_i \mid \theta)\big)$
• Compute $p_{\mathrm{post}}$ and $\mathrm{var}_{\mathrm{post}}$ using posterior simulations
• Requires a partition of the data into $n$ pieces $y_1, \ldots, y_n$
• Has a close connection to leave-one-out cross-validation
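A sketch of computing WAIC from an $S \times n$ matrix of pointwise log-likelihoods evaluated at posterior draws. The normal model with known variance and flat prior is a hypothetical choice so that the posterior is available in closed form.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# Hypothetical example: n observations from a normal model with unknown mean,
# known sigma, and a flat prior, so the posterior for mu is N(y_bar, sigma^2/n).
rng = np.random.default_rng(3)
n, sigma = 40, 1.0
y = rng.normal(0.5, sigma, size=n)
mu_draws = rng.normal(y.mean(), sigma / np.sqrt(n), size=4000)   # posterior draws

# S x n matrix of pointwise log-likelihoods, log p(y_i | theta_s)
log_lik = stats.norm.logpdf(y[None, :], loc=mu_draws[:, None], scale=sigma)
S = log_lik.shape[0]

# lppd_i = log( (1/S) sum_s p(y_i | theta_s) ), computed stably with logsumexp
lppd = logsumexp(log_lik, axis=0) - np.log(S)
# p_WAIC = sum_i var_s( log p(y_i | theta_s) )
p_waic = log_lik.var(axis=0, ddof=1).sum()

elppd_waic = lppd.sum() - p_waic
waic = -2 * elppd_waic      # deviance-scale WAIC, comparable to DIC
print(elppd_waic, waic, p_waic)
```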
Model comparison - Bayes factor
Comparing two or more models:
$\dfrac{p(H_2 \mid y)}{p(H_1 \mid y)} = \dfrac{p(H_2)}{p(H_1)} \times \dfrac{p(y \mid H_2)}{p(y \mid H_1)}$
• $\dfrac{p(H_2)}{p(H_1)}$ is the "prior odds"
• $B[H_2 : H_1] = \dfrac{p(y \mid H_2)}{p(y \mid H_1)}$ is the "Bayes factor", with
  $p(y \mid H) = \int p(y \mid \theta, H)\, p(\theta \mid H)\, d\theta$
• Problem with $p(y \mid H)$:
  ◦ the integral depends on irrelevant tail properties of the prior density
  ◦ consider $\bar y \sim \mathrm{N}(\theta, \sigma^2/n)$ with $p(\theta) \propto \mathrm{U}(-A, A)$ for some large $A$
  ◦ the marginal $p(y)$ is proportional to $1/A$
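A quick numerical sketch of the last point: with a $\mathrm{U}(-A, A)$ prior, the marginal likelihood of $\bar y$ scales like $1/A$ once $A$ is large enough to cover the likelihood. The specific numbers are hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical numbers: y_bar ~ N(theta, sigma^2/n), prior theta ~ Uniform(-A, A)
y_bar, sigma, n = 1.2, 1.0, 25
se = sigma / np.sqrt(n)

def marginal_likelihood(A):
    """p(y_bar | H) = integral of N(y_bar | theta, se^2) * Uniform(theta | -A, A) d theta."""
    integrand = lambda theta: stats.norm.pdf(y_bar, loc=theta, scale=se) / (2 * A)
    value, _ = quad(integrand, -A, A)
    return value

for A in [10, 100, 1000]:
    print(A, marginal_likelihood(A))   # shrinks roughly like 1/A as A grows
```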
An example where the Bayes factor is good
• Genetics example with
  $H_1$: the woman is affected, $\theta = 1$
  $H_2$: the woman is unaffected, $\theta = 0$
  ◦ the prior odds are $p(H_2)/p(H_1) = 1$
  ◦ the Bayes factor of the data is $p(y \mid H_2)/p(y \mid H_1) = 1.0/0.25 = 4$
  ◦ the posterior odds are thus $p(H_2 \mid y)/p(H_1 \mid y) = 4$
• Two features make the Bayes factor helpful here:
  ◦ each of the discrete alternatives makes scientific sense, and there are no obvious scientific models in between; i.e., a truly discrete parameter space
  ◦ the model is for probabilities; there are no unbounded parameters
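The discrete comparison on this slide is just arithmetic; the short sketch below spells it out and adds the implied posterior probability of $H_2$.

```python
# Discrete genetics comparison from the slide
p_y_given_H1 = 0.25   # likelihood of the data under H1 (theta = 1)
p_y_given_H2 = 1.0    # likelihood of the data under H2 (theta = 0)

prior_odds = 1.0                              # p(H2) / p(H1)
bayes_factor = p_y_given_H2 / p_y_given_H1    # B[H2 : H1] = 4
posterior_odds = prior_odds * bayes_factor

posterior_prob_H2 = posterior_odds / (1 + posterior_odds)
print(bayes_factor, posterior_odds, posterior_prob_H2)   # 4.0, 4.0, 0.8
```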
An example where the Bayes factor is bad
• 8 schools example: $y_j \sim \mathrm{N}(\theta_j, \sigma_j^2)$, for $j = 1, \ldots, 8$
  $H_1$: no pooling, $p(\theta_1, \ldots, \theta_8) \propto 1$
  $H_2$: complete pooling, $\theta_1 = \ldots = \theta_8 = \theta$, $p(\theta) \propto 1$
  ◦ the Bayes factor is 0/0
  ◦ instead, express the flat priors as $\mathrm{N}(0, A^2)$ and let $A$ get large
  ◦ now the Bayes factor depends strongly on $A$
  ◦ as $A \to \infty$, the complete-pooling model gets 100% of the posterior probability, for any data!
  ◦ there is also a horrible dependence on $J$
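A numerical sketch of the dependence on $A$: with $\mathrm{N}(0, A^2)$ priors, both marginal likelihoods are normal and can be evaluated directly, and $\log B[H_2 : H_1]$ grows without bound as $A$ increases. The data are the standard eight-schools estimates and standard errors (Rubin, 1981), assumed here for illustration.

```python
import numpy as np
from scipy import stats

# Eight schools data (Rubin 1981): estimated effects y_j and standard errors sigma_j
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
J = len(y)

def log_bayes_factor(A):
    """log B[H2 : H1] with the flat priors replaced by N(0, A^2) priors."""
    # H1, no pooling: theta_j ~ N(0, A^2) independently => y_j ~ N(0, A^2 + sigma_j^2)
    log_p_H1 = stats.norm.logpdf(y, 0.0, np.sqrt(A**2 + sigma**2)).sum()
    # H2, complete pooling: common theta ~ N(0, A^2)
    #   => y ~ MVN(0, A^2 * 11' + diag(sigma_j^2))
    cov_H2 = A**2 * np.ones((J, J)) + np.diag(sigma**2)
    log_p_H2 = stats.multivariate_normal.logpdf(y, mean=np.zeros(J), cov=cov_H2)
    return log_p_H2 - log_p_H1

for A in [10, 100, 1000, 10000]:
    print(A, log_bayes_factor(A))   # grows roughly like (J - 1) * log(A)
```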
Interpretation of Bayes Factors
• Jeffreys (1961) and Kass & Raftery (1995):

  2 log(B[H2:H1])   B[H2:H1]     Evidence in favor of H2 over H1
  0 to 2            1 to 3       Not worth more than a bare mention
  2 to 6            3 to 20      Positive
  6 to 10           20 to 150    Strong
  > 10              > 150        Very strong

• $B[H_2 : H_1] = 1 / B[H_1 : H_2]$
• The $2 \log B$ scale is the same scale as the deviance and likelihood ratio statistics