Prediction and Model Comparison
Applied Bayesian Statistics
Dr. Earvin Balderama
Department of Mathematics & Statistics, Loyola University Chicago
October 31, 2017
Last edited October 25, 2017 by <ebalderama@luc.edu>
MCMC

(Bayesian) Modeling is all about MCMCMC
MCMC → MCMCMC

Steps in (Bayesian) modeling:
1. Model Creation (Choice; Computation)
2. Model Checking (Criticism; Diagnostics)
3. Model Comparison (Choice; Selection; Change)
4. Repeat!
What we're focusing on today

Recall conditional distributions:

    f(A | B) = f(A, B) / f(B)        (conditional = joint / marginal)

This time, we'll give some attention to the marginal distribution:

    f(θ | y) = f(y | θ) f(θ) / f(y) = f(y | θ) f(θ) / ∫ f(y | θ) f(θ) dθ
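The role of the marginal f(y) as the normalizing constant can be seen in a tiny discrete sketch (the two candidate coin biases and the uniform prior here are made-up illustrative values, not from the slides):

```python
# Hypothetical example: posterior over two candidate coin biases after
# observing y = three heads in three flips.
# f(theta | y) = f(y | theta) f(theta) / f(y), with f(y) summing over theta.

likelihood = {0.5: 0.5**3, 0.9: 0.9**3}  # f(y | theta) for three heads
prior = {0.5: 0.5, 0.9: 0.5}             # uniform prior over the two values

marginal = sum(likelihood[t] * prior[t] for t in prior)   # f(y)
posterior = {t: likelihood[t] * prior[t] / marginal for t in prior}

print(posterior)  # mass shifts heavily toward theta = 0.9
```

Dividing by the marginal is exactly what makes the posterior sum to one.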
Outline

1. Some popular Bayesian fit statistics
   - Connection to classical statistics
2. Predictive distributions
   - Prior predictive distribution
   - Posterior predictive distribution
   - Posterior predictive checks
3. Predictive performance
   - Precision
   - Accuracy
   - Extreme values
Classical methods...

1. Standardized Pearson residuals
2. p-values
3. Likelihood ratio
4. MLE

...also apply in a Bayesian analysis:

1. Posterior mean of the standardized residuals
2. Posterior probabilities
3. Bayes factor
4. Posterior mean
Bayes Factor

For determining which model fits the data "better", the Bayes factor is commonly used in a hypothesis test. Given data y and two competing models, M1 and M2, with parameter vectors θ1 and θ2, respectively, the Bayes factor is a measure of how much the data favor Model 1 over Model 2:

    BF(y) = f1(y) / f2(y) = ∫ f(y | θ1) f(θ1) dθ1 / ∫ f(y | θ2) f(θ2) dθ2

Note: The Bayes factor is an odds ratio: the ratio of the posterior odds to the prior odds of favoring Model 1 over Model 2.
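In conjugate settings the two marginal likelihoods have closed forms, so the Bayes factor can be computed directly. A minimal sketch, assuming a beta-binomial marginal for Model 1 (Beta(1,1) prior on the success probability) against a point-null Model 2 with θ fixed at 0.5 — the data y = 7 successes in n = 10 trials are made up for illustration:

```python
import math

def log_beta(a, b):
    # log of the Beta function via log-gamma
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_betabinom(y, n, a, b):
    # log f(y) = log C(n, y) + log B(y + a, n - y + b) - log B(a, b)
    return math.log(math.comb(n, y)) + log_beta(y + a, n - y + b) - log_beta(a, b)

y, n = 7, 10
# M1: theta ~ Beta(1, 1); M2: theta fixed at 0.5 (point null)
log_m1 = log_marginal_betabinom(y, n, 1, 1)
log_m2 = math.log(math.comb(n, y)) + n * math.log(0.5)

bf = math.exp(log_m1 - log_m2)  # BF(y) = f1(y) / f2(y)
print(bf)
```

Here BF(y) is slightly below 1, so these data mildly favor the point null; working on the log scale avoids underflow when n is large.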
Bayes Factor

The good:
- More robust than frequentist hypothesis testing.
- Often used for testing a "full model" vs. a "reduced model", as in classical statistics.
- Neither model needs to be nested within the other.

The bad:
- Difficult to compute, although easy to approximate with software.
- Only defined for proper marginal density functions.
- Computation is conditional on one of the models being true. Because of this, Gelman considers Bayes factors irrelevant and prefers looking at distance measures between data and model. Many distance measures to choose from! One of which is...
DIC

Like many good measures of model fit and comparison, the Deviance Information Criterion (DIC) accounts for
1. how well the model fits the data (goodness of fit), and
2. the complexity of the model (effective number of parameters).

The Deviance Information Criterion (DIC) is given by

    DIC = D̄ + pD

where
1. D̄ = E(D | y) is the posterior mean deviance, and
2. pD = D̄ − D(θ̄) is the mean deviance minus the deviance evaluated at the posterior mean of θ.
Deviance

Define the deviance as

    D = −2 log f(y | θ)

Example: Poisson likelihood

    D = −2 log ∏i [ μi^{yi} e^{−μi} / yi! ]
      = −2 Σi [ −μi + yi log μi − log(yi!) ]
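The Poisson deviance above is straightforward to compute from data and fitted means; the y and μ values below are made-up illustrative numbers:

```python
import math

def poisson_deviance(y, mu):
    # D = -2 * sum_i [ -mu_i + y_i * log(mu_i) - log(y_i!) ]
    # log(y_i!) computed as lgamma(y_i + 1) for numerical stability
    return -2 * sum(-m + yi * math.log(m) - math.lgamma(yi + 1)
                    for yi, m in zip(y, mu))

y = [2, 0, 3, 1]          # observed counts (illustrative)
mu = [1.5, 0.5, 2.5, 1.0]  # fitted Poisson means (illustrative)
print(poisson_deviance(y, mu))
```

Lower deviance means higher likelihood; this is the D that gets averaged over posterior draws of μ in the DIC formula.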
DIC

DIC can then be rewritten as

    DIC = D̄ + pD
        = D(θ̄) + pD + pD        (since D̄ = D(θ̄) + pD)
        = D(θ̄) + 2 pD
        = −2 log f(y | θ̄) + 2 pD,

which is a generalization of

    AIC = −2 log f(y | θ̂MLE) + 2k.

DIC can be used to compare different models as well as different methods. Preferred models have low DIC values.
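Given posterior draws, both D̄ and D(θ̄) are simple averages. A minimal sketch for a one-parameter Poisson rate, where the "MCMC output" is stood in for by exact draws from the conjugate Gamma posterior under an assumed Gamma(1, 1) prior (all data values are illustrative):

```python
import math
import random

random.seed(0)
y = [2, 0, 3, 1, 2]  # observed counts (illustrative)

def deviance(y, mu):
    # Poisson deviance, D = -2 log f(y | mu), for a common rate mu
    return -2 * sum(-mu + yi * math.log(mu) - math.lgamma(yi + 1) for yi in y)

# Stand-in for MCMC draws: conjugate posterior Gamma(1 + sum(y), rate 1 + n),
# i.e. scale = 1 / (1 + n) in Python's parameterization
draws = [random.gammavariate(sum(y) + 1, 1 / (len(y) + 1)) for _ in range(5000)]

d_bar = sum(deviance(y, mu) for mu in draws) / len(draws)  # mean deviance
d_at_mean = deviance(y, sum(draws) / len(draws))           # deviance at posterior mean
p_d = d_bar - d_at_mean                                    # effective number of parameters
dic = d_bar + p_d

print(dic, p_d)  # p_d should be close to 1 for this one-parameter model
```

In practice the same two averages are computed from the saved chains of any MCMC fit.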
DIC

- Requires the joint posterior distribution to be approximately multivariate normal.
- Doesn't work well with:
  - highly non-linear models
  - mixture models with discrete parameters
  - models with missing data
- If pD is negative:
  - the log-likelihood may be non-concave,
  - the prior may be misspecified, or
  - the posterior mean may not be a good estimator.
Predictions

Maybe a better (best?) way to decide between competing models is to rank them based on how "well" each model does in predicting future observations.
The plug-in approach to prediction

Example: Consider the regression model

    Yi ~ind Normal(β0 + Xi1 β1 + · · · + Xip βp, σ²)

Suppose we have a new covariate vector Xnew and we would like to predict the corresponding response Ynew. The "plug-in" approach would be to fix β and σ at their posterior means β̂ and σ̂ to make predictions:

    Ynew | β̂, σ̂ ~ Normal(Xnew β̂, σ̂²).
The plug-in approach to prediction

However, this plug-in approach suppresses uncertainty about the parameters β and σ. Therefore, the prediction intervals will be too narrow, leading to undercoverage. We need to account for all uncertainty when making predictions, including our uncertainty about β and σ.
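The undercoverage is easy to see in the intercept-only case, where under a flat prior the full predictive distribution is a t distribution with scale inflated by √(1 + 1/n). A sketch with made-up data, comparing the two 95% interval widths:

```python
import math
import statistics

# Illustrative data for the model y_i ~ Normal(mu, sigma^2), both unknown
y = [4.1, 5.2, 3.8, 4.9, 5.5, 4.4, 5.0, 4.7]
n = len(y)
s = statistics.stdev(y)  # sample standard deviation

# Plug-in 95% predictive interval width: treats mu and sigma as known
plug_in = 2 * 1.96 * s

# Full predictive (flat prior): t with n-1 df, scale s * sqrt(1 + 1/n)
t_crit = 2.365  # approx. 97.5% quantile of t with 7 df
full = 2 * t_crit * s * math.sqrt(1 + 1 / n)

print(plug_in, full)  # the full interval is wider
```

The gap between the two widths shrinks as n grows, but for small samples the plug-in interval is noticeably too narrow.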
Predictive distributions

In Bayesian analyses, predictive distributions are used to compare models in terms of how "well" each model does in predicting future observations. The idea is that we want to explore the predictive distributions of the unknown observations, which account for the uncertainty in predicting those observations. Having distributions for unknown future observations comes naturally in a Bayesian analysis because of the uncertainty distributions for the unknown model parameters.

First, a question: before any data are observed, what could we use for predictions?
Prior Predictive Distribution

Before any data are observed, what could we use for predictions? We have a likelihood function, but to account for all uncertainty when making predictions, we marginalize over the model parameters. The marginal likelihood is what one would expect the data to look like after averaging over the prior distribution of θ, so it is also called the prior predictive distribution:

    f(y) = ∫ f(y | θ) f(θ) dθ
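Simulating from the prior predictive never requires the integral: draw θ from the prior, then y from the likelihood given that θ. A sketch assuming a Poisson likelihood with an illustrative Gamma(2, 1) prior on the rate (the stdlib has no Poisson sampler, so one is built by CDF inversion):

```python
import math
import random

random.seed(1)

def poisson_draw(theta):
    # Poisson sampler via CDF inversion (random has no built-in Poisson)
    u, p, k = random.random(), math.exp(-theta), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= theta / k
        cdf += p
    return k

def prior_predictive_draw():
    theta = random.gammavariate(2, 1)  # theta ~ Gamma(shape=2, scale=1) prior
    return poisson_draw(theta)          # y ~ f(y | theta)

draws = [prior_predictive_draw() for _ in range(20000)]
print(sum(draws) / len(draws))  # should be near E(theta) = 2
```

The draws are samples from f(y) itself; comparing them to the eventual data is a quick check of whether the prior is plausible.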
Posterior Predictive Distribution

More interestingly, what if a set of data y has already been observed? How can we make predictions for future (or new, or unobserved) observations ynew? We can sample from the marginal posterior likelihood of ynew, called the posterior predictive distribution (PPD):

    f(ynew | y) = ∫ f(ynew | θ) f(θ | y) dθ

This distribution is what one would expect ynew to look like after observing y and averaging over the posterior distribution of θ given y. The concept of the PPD applies generally (e.g., logistic regression).
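Sampling from the PPD mirrors the prior predictive recipe, except θ now comes from the posterior. A sketch for a Normal(μ, 1) likelihood with an assumed Normal(0, 10²) prior on μ, where the known-variance posterior is available in closed form (data values are illustrative):

```python
import random

random.seed(2)

# Model: y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 100)
y = [1.2, 0.8, 1.5, 1.1, 0.9]
n = len(y)
prior_var, lik_var = 100.0, 1.0

# Conjugate normal posterior for mu (prior mean 0)
post_var = 1 / (1 / prior_var + n / lik_var)
post_mean = post_var * (sum(y) / lik_var)

# One PPD draw = one posterior draw of mu, then one draw of y_new | mu
ppd = []
for _ in range(20000):
    mu = random.gauss(post_mean, post_var ** 0.5)  # mu ~ f(mu | y)
    ppd.append(random.gauss(mu, lik_var ** 0.5))   # y_new ~ f(y_new | mu)

print(sum(ppd) / len(ppd))  # near the posterior mean of mu
```

With actual MCMC output, the loop over conjugate draws is simply replaced by a loop over the saved chain of θ.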
Posterior Predictive Distribution

Equivalently, ynew can be considered missing values and treated as additional parameters to be estimated in a Bayesian framework. (More on missing values later.)

Example: For a complete dataset, we may randomly assign NA values to some number m of observations, which creates a test set ymis = {y1, y2, . . . , ym}. After MCMC, the m posterior predictive distributions, P1, P2, . . . , Pm, can be used to determine measures of overall model goodness of fit, as well as predictive performance measures for each yi in the test set.
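The held-out scheme above can be sketched without full MCMC by reusing a conjugate model: mask m observations, fit on the rest, and check whether each held-out value lands inside its 95% posterior predictive interval (all data here are simulated for illustration):

```python
import random

random.seed(3)

# Simulated complete dataset; model y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 100)
data = [random.gauss(2.0, 1.0) for _ in range(50)]
m = 10
y_mis, y_obs = data[:m], data[m:]  # test set and training set

n = len(y_obs)
prior_var, lik_var = 100.0, 1.0
post_var = 1 / (1 / prior_var + n / lik_var)
post_mean = post_var * sum(y_obs) / lik_var

# Under this model the PPD for each missing value is
# Normal(post_mean, post_var + lik_var), the same for every held-out point
ppd_sd = (post_var + lik_var) ** 0.5
lo, hi = post_mean - 1.96 * ppd_sd, post_mean + 1.96 * ppd_sd

coverage = sum(lo <= yi <= hi for yi in y_mis) / m
print(coverage)  # should be close to the nominal 0.95
```

Coverage well below nominal on the test set is a sign of a misspecified model or of suppressed uncertainty, exactly the failure mode of the plug-in approach.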