

  1. Bayesian model averaging
     Dr. Jarad Niemi
     Iowa State University
     September 7, 2017

  2. Bayesian model averaging

     Let \{M_\gamma : \gamma \in \Gamma\} indicate a set of models for a
     particular data set y. If \Delta is a quantity of interest, e.g. an
     effect size, a future observable, or the utility of a course of
     action, then its posterior distribution is

       p(\Delta | y) = \sum_{\gamma \in \Gamma} p(\Delta | M_\gamma, y) \, p(M_\gamma | y)

     where

       p(M_\gamma | y) = \frac{p(y | M_\gamma) p(M_\gamma)}{p(y)}
                       = \frac{p(y | M_\gamma) p(M_\gamma)}{\sum_{\lambda \in \Gamma} p(y | M_\lambda) p(M_\lambda)}

     is the posterior model probability and

       p(y | M_\gamma) = \int p(y | \theta_\gamma, M_\gamma) \, p(\theta_\gamma | M_\gamma) \, d\theta_\gamma

     is the marginal likelihood for model M_\gamma, where \theta_\gamma is
     the set of parameters in model M_\gamma.
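
     As a concrete illustration (not from the slides), a minimal R sketch
     that turns made-up log marginal likelihoods and a uniform model prior
     into posterior model probabilities; all numbers are hypothetical:

       # Hypothetical log marginal likelihoods log p(y | M_gamma) for three models
       log_marg <- c(-102.3, -100.1, -104.8)
       prior    <- rep(1/3, 3)  # uniform prior p(M_gamma)

       # Posterior model probabilities p(M_gamma | y), normalized on the
       # log scale to avoid underflow
       log_post <- log_marg + log(prior)
       post     <- exp(log_post - max(log_post))
       post     <- post / sum(post)
       post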

  3. Bayesian model averaged moments

     Since p(\Delta | y) is a discrete mixture, we may be interested in
     simplifying inference concerning \Delta to a couple of moments. Let
     \hat{\Delta}_\gamma = E[\Delta | y, M_\gamma]. Then the expectation is

       E[\Delta | y] = \sum_{\gamma \in \Gamma} \hat{\Delta}_\gamma \, p(M_\gamma | y)

     and the variance is

       Var[\Delta | y] = \left[ \sum_{\gamma \in \Gamma} \left( Var[\Delta | y, M_\gamma] + \hat{\Delta}_\gamma^2 \right) p(M_\gamma | y) \right] - E[\Delta | y]^2.

     The appealing aspect here is that the moments depend only on the
     moments from each individual model and the posterior model
     probabilities.
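
     A short R sketch of these mixture formulas, using made-up per-model
     moments and posterior model probabilities:

       # Hypothetical per-model moments and model probabilities
       delta_hat <- c(1.8, 2.1, 1.5)    # E[Delta | y, M_gamma]
       delta_var <- c(0.20, 0.15, 0.30) # Var[Delta | y, M_gamma]
       post      <- c(0.5, 0.3, 0.2)    # p(M_gamma | y)

       # Model-averaged mean: probability-weighted average of per-model means
       bma_mean <- sum(delta_hat * post)

       # Model-averaged variance: mixture second moment minus squared mean
       bma_var <- sum((delta_var + delta_hat^2) * post) - bma_mean^2

       c(mean = bma_mean, variance = bma_var)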

  4. Difficulties with BMA

     - Evaluating the summation can be difficult since |\Gamma|, the
       cardinality of \Gamma, might be huge.
     - Calculating the marginal likelihoods.
     - Specifying the prior over models.
     - Choosing the class of models to average over.

  5. Reducing cardinality

     If |\Gamma| is small enough, we can enumerate all models and perform
     model averaging exactly. But if |\Gamma| is too large, we will need
     some parsimony. Rather than summing over all of \Gamma, we can
     include only those models whose posterior probability is
     sufficiently large relative to the best model,

       A = \left\{ M_\gamma : \frac{\max_\lambda p(M_\lambda | y)}{p(M_\gamma | y)} = \frac{\max_\lambda p(y | M_\lambda) p(M_\lambda)}{p(y | M_\gamma) p(M_\gamma)} \le C \right\}

     where C is chosen by the researcher. Also, appealing to Occam's
     razor, we should exclude complex models that receive less support
     than their sub-models, i.e.

       B = \left\{ M_\gamma : \exists \, M_\lambda \in A, \, M_\lambda \subset M_\gamma, \, \frac{p(M_\lambda | y)}{p(M_\gamma | y)} > 1 \right\}.

     So we typically sum over the smaller set of models \Gamma' = A \setminus B.
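
     A sketch of this Occam's window filtering in R, assuming the model
     space can be enumerated; the posterior probabilities and the nesting
     structure below are made up for illustration:

       # Made-up posterior probabilities for five enumerated models
       post <- c(0.40, 0.30, 0.20, 0.07, 0.03)
       C    <- 20   # window width chosen by the researcher

       # Set A: models within a factor C of the best model
       in_A <- max(post) / post <= C

       # Set B: models with a sub-model in A that has higher posterior
       # probability; nested[l, g] = TRUE means model l is a sub-model of
       # model g (this nesting is invented for the example)
       nested <- matrix(FALSE, 5, 5)
       nested[2, 4] <- TRUE   # say model 2 is nested in model 4
       in_B <- sapply(1:5, function(g) any(in_A & nested[, g] & post > post[g]))

       # Average over Gamma' = A \ B, renormalizing the probabilities
       keep <- in_A & !in_B
       post[keep] / sum(post[keep])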

  6. Searching through models

     One approach is to search through models and keep a list of the best
     models. To speed up the search, the following criteria can be used
     to decide which models should be kept in \Gamma' (a sketch of the
     non-nested rule follows this list):

     - When comparing two nested models, if the simpler model is
       rejected, then all sub-models of the simpler model are also
       rejected.
     - When comparing two non-nested models, we calculate the ratio of
       posterior model probabilities

         p(M_\gamma | y) / p(M_{\gamma'} | y);

       if this quantity is less than O_L we reject M_\gamma, and if it is
       greater than O_R we reject M_{\gamma'}.
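
     A minimal R sketch of the non-nested comparison rule; the cutoffs
     and probabilities are hypothetical:

       # Hypothetical odds cutoffs for the pairwise rule
       O_L <- 1 / 20
       O_R <- 20

       # Compare two non-nested models via their posterior odds
       compare_models <- function(post_g, post_gp) {
         odds <- post_g / post_gp
         if (odds < O_L)      "reject M_gamma"
         else if (odds > O_R) "reject M_gamma_prime"
         else                 "keep both"
       }

       compare_models(0.01, 0.40)  # odds = 0.025 < O_L, so reject M_gamma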

  7. Using MCMC to search through models

     Construct a neighborhood around M^{(i)} (the current model in the
     chain), call it nbh(M^{(i)}). Now propose a draw M^* from the
     following proposal distribution:

       q(M^* | M^{(i)}) = \begin{cases} 0 & M^* \notin nbh(M^{(i)}) \\ 1 / |nbh(M^{(i)})| & M^* \in nbh(M^{(i)}) \end{cases}

     Set M^{(i+1)} = M^* with probability \min\{1, \rho(M^{(i)}, M^*)\} where

       \rho(M^{(i)}, M^*) = \frac{p(M^* | y)}{p(M^{(i)} | y)} \cdot \frac{|nbh(M^{(i)})|}{|nbh(M^*)|}

     and otherwise set M^{(i+1)} = M^{(i)}. This Markov chain converges to
     draws from p(M_\gamma | y) and can therefore be used to estimate
     posterior model probabilities.
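
     A runnable R sketch of this sampler for models encoded as
     inclusion-indicator vectors. The neighborhood is all models differing
     in one indicator, and log_post is a stand-in for the unnormalized
     log p(M_\gamma | y) (in practice log p(y | M_\gamma) + log p(M_\gamma),
     e.g. via a BIC approximation); the toy target is invented:

       # Neighborhood: all models that differ in exactly one indicator
       nbh <- function(gamma) {
         lapply(seq_along(gamma), function(i) { g <- gamma; g[i] <- 1 - g[i]; g })
       }

       # Toy stand-in for the unnormalized log posterior over models;
       # here it simply penalizes larger models
       log_post <- function(gamma) -0.5 * sum(gamma)

       mc3 <- function(gamma0, n_iter) {
         gamma <- gamma0
         draws <- matrix(NA, n_iter, length(gamma0))
         for (i in seq_len(n_iter)) {
           nb   <- nbh(gamma)
           star <- nb[[sample(length(nb), 1)]]   # uniform draw from nbh
           # Log acceptance ratio; neighborhood sizes are equal here (= p),
           # so the proposal correction cancels, but we keep it for generality
           log_rho <- log_post(star) - log_post(gamma) +
                      log(length(nb)) - log(length(nbh(star)))
           if (log(runif(1)) < log_rho) gamma <- star
           draws[i, ] <- gamma
         }
         draws
       }

       set.seed(1)
       draws <- mc3(c(1, 0, 1), n_iter = 1000)
       colMeans(draws)  # estimated posterior inclusion probabilities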

  8. Evaluating the marginal likelihoods

     Recall that as the sample size n increases, the posterior converges
     to a normal distribution. Let

       g(\theta) = \log(p(y | \theta, M) p(\theta | M)) = \log p(y | \theta, M) + \log p(\theta | M)

     and let \hat{\theta}_{MAP} be the MAP estimate of \theta in model M.
     Taking a Taylor series expansion of g(\theta) around
     \hat{\theta}_{MAP}, we have

       g(\theta) \approx g(\hat{\theta}_{MAP}) - \frac{1}{2} (\theta - \hat{\theta}_{MAP})^\top A (\theta - \hat{\theta}_{MAP})

     where A is the negative Hessian of g(\theta) evaluated at
     \hat{\theta}_{MAP}. Combining this with the definition of g and
     exponentiating, we have

       p(y | \theta, M) p(\theta | M) \approx p(y | \hat{\theta}_{MAP}, M) \, p(\hat{\theta}_{MAP} | M) \exp\left( -\frac{1}{2} (\theta - \hat{\theta}_{MAP})^\top A (\theta - \hat{\theta}_{MAP}) \right).

     Hence, the approximation to p(\theta | y, M) \propto p(y | \theta, M) p(\theta | M) is normal.
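
     A small R sketch of obtaining \hat{\theta}_{MAP} and A numerically.
     The model (normal data with known variance, normal prior) and all
     numbers are assumptions for illustration; optim with hessian = TRUE
     returns the Hessian of the minimized function -g, which is exactly A:

       # Toy model: y_j ~ Normal(theta, 1) with prior theta ~ Normal(0, 10^2)
       set.seed(1)
       y <- rnorm(20, mean = 2)

       # g(theta) = log p(y | theta, M) + log p(theta | M)
       g <- function(theta) {
         sum(dnorm(y, theta, 1, log = TRUE)) + dnorm(theta, 0, 10, log = TRUE)
       }

       # Maximize g by minimizing -g
       fit <- optim(0, function(theta) -g(theta), method = "BFGS", hessian = TRUE)
       theta_map <- fit$par
       A <- fit$hessian   # negative Hessian of g at theta_MAP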

  9. Evaluating the marginal likelihoods (cont.)

     If we integrate both sides over \theta and take the logarithm, we have

       \log p(y | M) \approx \log p(y | \hat{\theta}_{MAP}, M) + \log p(\hat{\theta}_{MAP} | M) + \frac{p}{2} \log(2\pi) - \frac{1}{2} \log |A|

     where p is the dimension of \theta, i.e. the number of parameters.
     We call this the Laplace approximation. Another approximation that
     is more computationally efficient but less accurate is to retain
     only the terms that increase with n:

     - \log p(y | \hat{\theta}, M) increases linearly with n,
     - \log |A| increases as p \log n, and
     - as n gets large, \hat{\theta}_{MAP} \to \hat{\theta}_{MLE}.

     Taking these together, we have

       \log p(y | M) \approx \log p(y | \hat{\theta}_{MLE}, M) - \frac{p}{2} \log n.

     Multiplying by -2, we obtain Schwarz's Bayesian Information Criterion (BIC):

       BIC = -2 \log p(y | \hat{\theta}_{MLE}, M) + p \log n.
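
     Continuing the toy sketch from the previous slide, here is how the
     two approximations to \log p(y | M) compare in R (for this normal
     model with known variance, the MLE is the sample mean):

       # Laplace approximation to log p(y | M)
       p_dim    <- length(theta_map)
       log_marg <- g(theta_map) + (p_dim / 2) * log(2 * pi) -
                   0.5 * log(det(as.matrix(A)))

       # BIC keeps only the terms that grow with n
       theta_mle <- mean(y)
       BIC <- -2 * sum(dnorm(y, theta_mle, 1, log = TRUE)) + p_dim * log(length(y))

       # -BIC/2 approximates log p(y | M), so the two should be close
       c(laplace = log_marg, bic_approx = -BIC / 2)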

  10. Priors over models

     For data-based comparisons of models, you can use Bayes factors
     directly since

       BF(M_\gamma : M_{\gamma'}) = \frac{p(y | M_\gamma)}{p(y | M_{\gamma'})} = \frac{\int p(y | \theta_\gamma) p(\theta_\gamma | M_\gamma) \, d\theta_\gamma}{\int p(y | \theta_{\gamma'}) p(\theta_{\gamma'} | M_{\gamma'}) \, d\theta_{\gamma'}}

     where the last equality is a reminder that priors over parameters
     still matter. For model averaging, you need to calculate posterior
     model probabilities, which require specification of the prior
     probability of each model. One possible prior for regression models is

       p(M_\gamma) = \prod_{i=1}^p w_i^{1 - \gamma_i} (1 - w_i)^{\gamma_i}

     where \gamma_i indicates whether explanatory variable i is included
     in M_\gamma. Setting w_i = 0.5 corresponds to a uniform prior over
     the model space.
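
     A one-function R sketch evaluating this model prior for a given
     inclusion vector; under the formula above, 1 - w_i plays the role of
     the prior inclusion probability for variable i:

       # Prior probability of a model with inclusion indicators gamma
       model_prior <- function(gamma, w) prod(w^(1 - gamma) * (1 - w)^gamma)

       model_prior(c(1, 0, 1), w = rep(0.5, 3))  # w_i = 0.5 gives 2^(-3) = 0.125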

  11. BMA output

     The quantities of interest from BMA are typically:

     - Posterior model probabilities p(M_\gamma | y).
     - Posterior inclusion probabilities (for regression),

         p(\text{including explanatory variable } i | y) = \sum_{\gamma \in \Gamma} p(M_\gamma | y) \, I(\gamma_i = 1),

       which provide an overall assessment of whether explanatory
       variable i is important or not.
     - Posterior distributions, means, and variances for "parameters", e.g.

         E(\theta_i | y) = \sum_{\gamma \in \Gamma} p(M_\gamma | y) \, E[\theta_{\gamma,i} | y].

       But does this make any sense? What happened to \theta_\gamma?
     - Predictions:

         p(\tilde{y} | y) = \sum_{\gamma \in \Gamma} p(M_\gamma | y) \, p(\tilde{y} | M_\gamma, y).

     A sketch of the inclusion-probability calculation follows.
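
     A minimal R sketch computing posterior inclusion probabilities from
     an enumerated model space; the indicator matrix and probabilities
     are made up:

       # Rows are models, columns are inclusion indicators gamma_i
       gamma_mat <- rbind(c(1, 1, 0),
                          c(1, 0, 0),
                          c(1, 1, 1))
       post <- c(0.55, 0.30, 0.15)  # p(M_gamma | y)

       # Inclusion probability: sum p(M_gamma | y) over models with gamma_i = 1
       colSums(gamma_mat * post)  # 1.00, 0.70, 0.15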

  12. R packages for BMA

     There are two main packages for Bayesian model averaging in R:

     - BMA: glm model averaging using BIC
     - BMS: lm model averaging using g-priors and (possibly) MCMC

     Until recently there was another package:

     - BAS: lm model averaging with a variety of priors and (possibly)
       MCMC (additionally performed sampling without replacement)

  13. BMA in R

       library(BMA)
       UScrime <- MASS::UScrime

       # Set up data
       x = UScrime[,-16]
       y = log(UScrime[,16])
       x[,-2] = log(x[,-2])

       # Run BMA using BIC
       lma = bicreg(x, y,
                    strict = TRUE, # remove submodels that are less likely
                    OR = 20)       # maximum BF ratio
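
     Assuming the fit above runs, a few common follow-ups using the BMA
     package's summary, plot, and imageplot.bma functions:

       summary(lma)        # top models, posterior model and inclusion probabilities
       imageplot.bma(lma)  # which variables enter which high-probability models
       plot(lma)           # posterior distributions of the averaged coefficients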
