Evaluating predictive loss for models with observation-level latent variables

Russell Millar
University of Auckland
Dec 2015
Motivation

So, you've fitted a Bayesian model... very likely more than one, and you'd like to know which is preferred. How?

Bayes factors?
  ◮ Computationally challenging.
  ◮ Sensitive to priors.
Posterior predictive checks?
  ◮ Poor at detecting model deficiencies.
  ◮ Do not address the question directly.
Predictive loss?
  ◮ DIC
  ◮ WAIC
  ◮ Cross-validation
Notation

$y = (y_1, \ldots, y_n)$, observations with density $p(y)$
$\theta \in \mathbb{R}^d$, parameter vector
$p(y \mid \theta)$, the likelihood
$p(\theta)$, prior
$z$, future realizations from the true distribution of $y$
$D(\theta) = -2 \log p(y \mid \theta)$, deviance function
DIC, the Dirty Information Criterion

Widely used: Spiegelhalter et al. (2002) has more than 6,500 citations.

DIC can be written as
$$\mathrm{DIC} = -2 E_{\theta \mid y}[\log p(y \mid \theta)] + p = \overline{D(\theta)} + p,$$
where $p$ is a penalty term to correct for using the data twice.

A Taylor series expansion of $D(\theta)$ around $\bar{\theta} = E_{\theta \mid y}[\theta]$ "suggests" that $p$ can be estimated as the posterior expected value of $D(\theta) - D(\bar{\theta})$, giving
$$p_D = \overline{D(\theta)} - D(\bar{\theta}).$$

Easy to estimate from a posterior sample.
Not invariant to re-parameterization, due to the use of $\bar{\theta}$.
$p_D$ can be negative if the deviance is not convex in $\theta$ (that is, if the likelihood is not log-concave).
What DIC is trying to estimate was never explicitly stated!
DIC, the Dirty Information Criterion

Since $\overline{D(\theta)} = E_{\theta \mid y}[D(\theta)] = -2 E_{\theta \mid y}[\log p(y \mid \theta)]$, you might suspect that DIC is estimating the expected predictive deviance
$$-2 E_z E_{\theta \mid y}[\log p(z \mid \theta)]. \qquad (1)$$

But it's not: it needs a heavier penalty for using $y$ in place of $z$ [1]. The extra-penalized form
$$\mathrm{DIC}^* = \overline{D(\theta)} + 2p$$
is an asymptotically unbiased estimator of (1).

[1] van der Linde (2005) & Ando (2011).
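As a rough illustration of how $p_D$, DIC and DIC* might be computed from a posterior sample, the sketch below uses a hypothetical conjugate Poisson model with a gamma prior, so that exact posterior draws are available from `rgamma`. The data `y`, the prior values `alpha` and `beta`, and the helper `deviance_fn` are assumptions made for illustration and are not taken from the talk.

```r
set.seed(1)
y <- c(3, 5, 2, 4, 6, 1, 4)      # hypothetical counts
alpha <- 0.001; beta <- 0.001    # assumed Gamma(alpha, beta) prior on the Poisson mean
S <- 10000

# Conjugacy gives the posterior Gamma(sum(y) + alpha, n + beta) directly
theta <- rgamma(S, shape = sum(y) + alpha, rate = length(y) + beta)

# Deviance D(theta) = -2 log p(y | theta)
deviance_fn <- function(th) -2 * sum(dpois(y, th, log = TRUE))
D_draws <- vapply(theta, deviance_fn, numeric(1))

D_bar     <- mean(D_draws)             # posterior mean deviance
D_at_mean <- deviance_fn(mean(theta))  # deviance at the posterior mean of theta
p_D       <- D_bar - D_at_mean         # estimated penalty
DIC       <- D_bar + p_D
DIC_star  <- D_bar + 2 * p_D           # extra-penalized form
```

With a single parameter and a nearly flat prior, `p_D` should come out close to 1, which is a quick sanity check on the sample.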
WAIC, Widely Applicable Information Criteria

Sumio Watanabe (2009) developed a singular learning theory, derived using algebraic geometry results developed by Heisuke Hironaka (who earned a Fields medal in 1970 for his work).

It is assumed that the $y_i$ are independent given $\theta$, with densities $p(y_i \mid \theta)$.

Watanabe defines several WAIC variants. One particular variant has gained popularity due to:
  ◮ Its asymptotic equivalence with Bayesian leave-one-out cross-validation (LOO-CV), Watanabe (2010).
  ◮ Its close approximation to its target loss.
WAIC, Widely Applicable Information Criteria

$$\mathrm{WAIC} = -2 \sum_{i=1}^{n} \log p(y_i \mid y) + 2V = -2 \sum_{i=1}^{n} \log \int p(y_i \mid \theta)\, p(\theta \mid y)\, d\theta + 2V,$$
where
$$V = \sum_{i=1}^{n} \mathrm{Var}_{\theta \mid y}\left(\log p(y_i \mid \theta)\right).$$

Watanabe showed that $E_Y[\mathrm{WAIC}]$ is asymptotically equal to $E_Y(B)$, that is, WAIC is an asymptotically unbiased estimator of $E_Y(B)$, where
$$B = -2 \sum_{i=1}^{n} E_{Z_i}\left[\log p_i(z_i \mid y)\right] = -2 \sum_{i=1}^{n} E_{Z_i}\left[\log \int p(z_i \mid \theta)\, p(\theta \mid y)\, d\theta\right].$$

This holds under very general conditions, including for non-identifiable, singular and unrealizable models.
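The WAIC formula can be evaluated from a matrix of pointwise log-likelihoods over posterior draws. The following is a minimal sketch under assumed inputs: a hypothetical conjugate Poisson model with a gamma prior (so posterior draws come straight from `rgamma`), made-up data `y`, and an illustrative helper `log_mean_exp` that is not part of the talk.

```r
set.seed(1)
y <- c(3, 5, 2, 4, 6, 1, 4)      # hypothetical counts
alpha <- 0.001; beta <- 0.001    # assumed Gamma(alpha, beta) prior
S <- 10000
theta <- rgamma(S, shape = sum(y) + alpha, rate = length(y) + beta)

# S x n matrix with entries log p(y_i | theta^(s))
loglik <- outer(theta, y, function(th, yi) dpois(yi, th, log = TRUE))

# Numerically stable log of the mean of exp(x)
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}

lppd_i <- apply(loglik, 2, log_mean_exp)  # Monte Carlo estimate of log p(y_i | y)
V      <- sum(apply(loglik, 2, var))      # sum over i of Var_{theta|y}(log p(y_i | theta))
WAIC   <- -2 * sum(lppd_i) + 2 * V
```

Working on the log scale avoids underflow when some `p(y_i | theta)` values are very small, which matters more for larger data sets than for this toy example.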
LOO-CVL, Leave-one-out Cross-validation

Letting $y_{-i}$ denote the observations with $y_i$ removed, a natural approximation for $B$ is the LOO-CVL estimator
$$\mathrm{CVL} = \sum_{i=1}^{n} \mathrm{CVL}_i,$$
where
$$\mathrm{CVL}_i = -2 \log p(y_i \mid y_{-i}) = -2 \log \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta. \qquad (2)$$

CVL has asymptotic bias of $O(1/n)$ as an estimator of $B$.

But direct estimation of CVL can be very computationally intensive, since it requires samples from the $n$ posteriors $p(\theta \mid y_{-i})$, $i = 1, \ldots, n$. This direct estimator will be denoted $\widehat{\mathrm{CVL}}$.
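To make the cost of the direct estimator concrete, here is a sketch that draws from each of the $n$ leave-one-out posteriors in turn. It assumes a hypothetical conjugate Poisson model with a gamma prior, so each leave-one-out posterior is a gamma distribution that can be sampled with `rgamma`; in a non-conjugate model each of these would need its own MCMC run. The data and prior values are made up for illustration.

```r
set.seed(1)
y <- c(3, 5, 2, 4, 6, 1, 4)      # hypothetical counts
alpha <- 0.001; beta <- 0.001    # assumed Gamma(alpha, beta) prior
n <- length(y); S <- 10000

CVL_i <- vapply(seq_len(n), function(i) {
  # Draws from p(theta | y_{-i}); one separate posterior per held-out observation
  theta_loo <- rgamma(S, shape = sum(y[-i]) + alpha, rate = (n - 1) + beta)
  # Monte Carlo version of equation (2)
  -2 * log(mean(dpois(y[i], theta_loo)))
}, numeric(1))
CVL_hat <- sum(CVL_i)

# Closed-form check, available only because this toy model is conjugate:
# p(y_i | y_{-i}) is negative binomial
shape_loo <- sum(y) - y + alpha
CVL_exact <- sum(-2 * dnbinom(y, size = shape_loo,
                              mu = shape_loo / (n - 1 + beta), log = TRUE))
```

Comparing `CVL_hat` with `CVL_exact` gives a feel for the Monte Carlo error in the direct estimator.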
Importance sampling approximation to LOO-CVL

$p(y_i \mid y_{-i})$ can be expressed as the harmonic mean of $p(y_i \mid \theta)$ with respect to the full posterior,
$$p(y_i \mid y_{-i}) = \left( \int \frac{1}{p(y_i \mid \theta)}\, p(\theta \mid y)\, d\theta \right)^{-1},$$
and so $p(y_i \mid y_{-i})$ can be estimated as
$$\hat{p}(y_i \mid y_{-i}) = \frac{S}{\sum_{s=1}^{S} \frac{1}{p(y_i \mid \theta^{(s)})}}, \qquad (3)$$
where $\theta^{(s)}$, $s = 1, \ldots, S$, is a sample from $p(\theta \mid y)$.

Thus each $\mathrm{CVL}_i$, $i = 1, \ldots, n$, and hence $\mathrm{CVL} = \sum_{i=1}^{n} \mathrm{CVL}_i$, can be estimated from a single posterior sample.

Note that (3) can also be written as a self-normalizing importance-sampling estimator,
$$\hat{p}(y_i \mid y_{-i}) = \frac{\sum_{s=1}^{S} p(y_i \mid \theta^{(s)})\, w_{si}}{\sum_{s=1}^{S} w_{si}}, \qquad (4)$$
where $w_{si} = p(y_i \mid \theta^{(s)})^{-1}$.

The importance-sampling estimator of CVL will be denoted $\widehat{\mathrm{ISCVL}}$.
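Estimators (3) and (4) can both be computed from one matrix of likelihood values over draws from the full posterior. The sketch below assumes a hypothetical conjugate Poisson model with a gamma prior and made-up data; all variable names are illustrative.

```r
set.seed(1)
y <- c(3, 5, 2, 4, 6, 1, 4)      # hypothetical counts
alpha <- 0.001; beta <- 0.001    # assumed Gamma(alpha, beta) prior
n <- length(y); S <- 10000

# A single sample from the full posterior p(theta | y)
theta <- rgamma(S, shape = sum(y) + alpha, rate = n + beta)

# S x n matrix with entries p(y_i | theta^(s))
lik <- outer(theta, y, function(th, yi) dpois(yi, th))

# Equation (3): harmonic mean of the likelihood values, one per observation
p_loo_hm <- S / colSums(1 / lik)

# Equation (4): self-normalized importance sampling with weights w_si = 1 / p(y_i | theta^(s))
w <- 1 / lik
p_loo_is <- colSums(lik * w) / colSums(w)   # matches (3) because lik * w is 1 for every draw

ISCVL_hat <- sum(-2 * log(p_loo_hm))
```

For large data sets the reciprocal likelihoods can overflow, so a practical version would likely work with log weights; that refinement is left out here to keep the correspondence with (3) and (4) visible.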
Importance sampling approximation to LOO-CVL

Note that $\hat{p}(y_i \mid y_{-i})$ can be highly unstable when $\theta^{(s)}$ is in the tails of $p(y_i \mid \theta^{(s)})$, that is, when $p(y_i \mid \theta^{(s)})$ is very small and the corresponding weight is very large.

It is very useful to quantify the reliability of importance sampling using the notion of effective sample size. The effective sample size is expressed relative to a direct sample from $p(\theta \mid y_{-i})$ used to evaluate $\mathrm{CVL}_i$ via (2).

For observation $i$, $\mathrm{ESS}_i$ can be calculated as
$$\mathrm{ESS}_i = \frac{S\, \bar{w}_i^{\,2}}{\overline{w_i^2}},$$
where $w_{si} = p(y_i \mid \theta^{(s)})^{-1}$, $\bar{w}_i$ is the mean of the weights $w_{si}$, $s = 1, \ldots, S$, and $\overline{w_i^2}$ is the mean of the squared weights $w_{si}^2$, $s = 1, \ldots, S$.
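A per-observation effective sample size can be computed from the same weights. This sketch takes the multiplier to be the number of posterior draws $S$, since the weights are averaged over $s = 1, \ldots, S$; the conjugate Poisson model with a gamma prior, the data, and the variable names are assumptions for illustration.

```r
set.seed(1)
y <- c(3, 5, 2, 4, 6, 1, 4)      # hypothetical counts
alpha <- 0.001; beta <- 0.001    # assumed Gamma(alpha, beta) prior
n <- length(y); S <- 10000
theta <- rgamma(S, shape = sum(y) + alpha, rate = n + beta)

# Importance weights w_si = 1 / p(y_i | theta^(s)), stored as an S x n matrix
w <- 1 / outer(theta, y, function(th, yi) dpois(yi, th))

# ESS_i = S * (mean weight)^2 / (mean squared weight), one value per observation
ESS <- S * colMeans(w)^2 / colMeans(w^2)
round(ESS)
```

An observation with `ESS` far below `S` is one whose leave-one-out posterior differs markedly from the full posterior, so its importance-sampling estimate deserves less trust.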
Evaluation of predictive loss

Recent work has examined the relative performance of WAIC, CVL and IS-CVL in the context of normal models.

I have been examining their performance with regard to:
  ◮ Model focus (i.e., the level of the hierarchy at which the likelihood is specified).
  ◮ Use with non-normal data.

Models for over-dispersed count data incorporate both of these issues. E.g., the negative binomial density can be expressed directly (marginal focus), or as a Poisson density conditional on an underlying gamma latent variable (conditional focus).
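The equivalence of the two specifications can be checked numerically: integrating the Poisson density over a gamma latent variable should reproduce the negative binomial density. The sketch below uses hypothetical values for the mean `mu` and shape `alpha`, and parameterizes the gamma by shape `alpha` and rate `alpha / mu` so that its mean is `mu`.

```r
mu <- 4; alpha <- 2      # hypothetical mean and gamma shape
y_vals <- 0:10

# Marginal focus: negative binomial density evaluated directly
marginal <- dnbinom(y_vals, size = alpha, mu = mu)

# Conditional focus: Poisson density integrated over the gamma latent variable
conditional_integrated <- vapply(y_vals, function(yv) {
  integrand <- function(lam) {
    dpois(yv, lam) * dgamma(lam, shape = alpha, rate = alpha / mu)
  }
  integrate(integrand, lower = 0, upper = Inf)$value
}, numeric(1))

max(abs(marginal - conditional_integrated))   # should be near zero
```

The two forms describe the same distribution for $y_i$, but they define different likelihoods $p(y_i \mid \theta)$, which is why the choice of focus matters for the predictive criteria.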
Evaluation of predictive loss, y ∼ Pois(λ)

[Figure: expected loss, $E_y[B]$ and $E_y[\mathrm{WAIC}]$, plotted against $\lambda_0$ from 0 to 10.]

The WAIC approximation is not so good until the normal approximation (to the Poisson) kicks in at around $\lambda_0 = 5$.
Evaluation of predictive loss, y ∼ Pois(λ)

FYI, the underlying R code to numerically evaluate $B$ for $y \sim \mathrm{Pois}(\lambda_0)$.

    BayesLoss <- function(y, lambda0, alpha = 0.001, beta = 0.001) {
      # Range of future values carrying essentially all of the Pois(lambda0) mass
      yrep_limits <- qpois(c(1e-15, 1 - 1e-15), lambda0)
      yrep_grid <- seq(yrep_limits[1], yrep_limits[2])   # Grid of values for reps
      grid_probs <- dpois(yrep_grid, lambda0)            # Probabilities over the grid
      # Posterior predictive density given a single y: negative binomial
      grid_pd <- dnbinom(yrep_grid, size = y + alpha, mu = (y + alpha) / (beta + 1))
      BLoss <- -2 * sum(grid_probs * log(grid_pd))       # Predictive loss, B, for a given y
      return(BLoss)
    }
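One way the function above might be used is to average $B$ over the sampling distribution of $y$, giving the expected loss $E_y[B]$ at a given $\lambda_0$. The sketch below restates `BayesLoss` compactly so that it runs on its own (using `log = TRUE` in `dnbinom`, a small numerical-stability change), and adds a hypothetical wrapper `ExpectedBayesLoss` that is not part of the talk. The grid of $\lambda_0$ values is also an assumption.

```r
# Compact restatement of BayesLoss: predictive loss B for a single observed y
BayesLoss <- function(y, lambda0, alpha = 0.001, beta = 0.001) {
  lims <- qpois(c(1e-15, 1 - 1e-15), lambda0)
  z <- seq(lims[1], lims[2])
  log_pd <- dnbinom(z, size = y + alpha, mu = (y + alpha) / (beta + 1), log = TRUE)
  -2 * sum(dpois(z, lambda0) * log_pd)
}

# Hypothetical wrapper: average B over y ~ Pois(lambda0) on the same kind of grid
ExpectedBayesLoss <- function(lambda0, ...) {
  lims <- qpois(c(1e-15, 1 - 1e-15), lambda0)
  y_grid <- seq(lims[1], lims[2])
  B_vals <- vapply(y_grid, BayesLoss, numeric(1), lambda0 = lambda0, ...)
  sum(dpois(y_grid, lambda0) * B_vals)
}

lambda0_grid <- c(0.5, 1, 2, 5, 10)   # assumed values of lambda0
EB <- vapply(lambda0_grid, ExpectedBayesLoss, numeric(1))
```

Because both sums run over a finite grid that omits only about 1e-15 of the probability mass in each tail, the truncation error should be negligible for the values of $\lambda_0$ considered here.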
Simulation study with over-dispersed count data

How well can the predictive criteria distinguish the following three models?

Poisson: $y_i \mid \mu \sim \mathrm{Pois}(\mu)$
PGA: $y_i \mid \lambda_i \sim \mathrm{Pois}(\lambda_i)$ where $\lambda_i \sim \Gamma(\alpha, \alpha/\mu)$
PLN: $y_i \mid \lambda_i \sim \mathrm{Pois}(\lambda_i)$ where $\lambda_i \sim \mathrm{LN}(\log(\mu) - 0.5\tau^2, \tau^2)$

These are conditional-level specifications.
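For readers who want to experiment, data from the three models can be simulated as follows. This is only a sketch: the values of `mu`, `alpha` and `tau` and the sample size are hypothetical and are not the settings of the simulation study. The gamma is taken to have shape `alpha` and rate `alpha / mu`, and the lognormal to have log-scale variance `tau^2`, so that the latent $\lambda_i$ has mean `mu` in both cases.

```r
set.seed(1)
n <- 1e5
mu <- 4; alpha <- 2; tau <- 0.7     # hypothetical parameter values

y_pois <- rpois(n, mu)                                            # Poisson
y_pga  <- rpois(n, rgamma(n, shape = alpha, rate = alpha / mu))   # Poisson-gamma (PGA)
y_pln  <- rpois(n, rlnorm(n, meanlog = log(mu) - 0.5 * tau^2,     # Poisson-lognormal (PLN)
                          sdlog = tau))

# All three share the mean mu; PGA and PLN are over-dispersed relative to Poisson
rbind(mean = c(Poisson = mean(y_pois), PGA = mean(y_pga), PLN = mean(y_pln)),
      var  = c(var(y_pois), var(y_pga), var(y_pln)))
```

The shared mean is what makes the comparison a question about dispersion and tail shape rather than location.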