

  1. Introduction to Bayesian Statistics
     Lecture 4: Multiparameter models (I)
     Rung-Ching Tsai
     Department of Mathematics, National Taiwan Normal University
     March 18, 2015

  2. Noninformative prior distributions
     • Proper and improper prior distributions
     • Unnormalized densities
     • Uniform prior distributions on different scales (numerical illustration below)
     • Some examples
       ◦ Probability parameter θ ∈ (0, 1)
         • One possibility: p(θ) = 1 [proper]
         • Another possibility: p(logit θ) ∝ 1, which corresponds to p(θ) ∝ θ^(−1)(1 − θ)^(−1) [improper]
       ◦ Location parameter θ, unconstrained
         • One possibility: p(θ) ∝ 1 [improper] ⇒ p(θ | y) ≈ normal(θ | ȳ, σ²/n)
       ◦ Scale parameter σ > 0
         • One possibility: p(σ) ∝ 1 [improper]
         • Another possibility: p(log σ²) ∝ 1, which corresponds to p(σ²) ∝ σ^(−2) [improper]
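A minimal sketch in Python, with hypothetical binomial data n = 20, y = 6 (not from the slides), showing that "uniform" depends on the scale: a flat prior on θ and a flat prior on logit θ lead to different Beta posteriors for the same data.

```python
# Flat prior on θ (Beta(1,1)) vs. flat prior on logit θ ("Beta(0,0)"):
# the conjugate posteriors are Beta(y+1, n−y+1) and Beta(y, n−y), respectively.
from scipy import stats

n, y = 20, 6                                   # hypothetical data for illustration
post_uniform = stats.beta(y + 1, n - y + 1)    # flat on θ
post_logit   = stats.beta(y, n - y)            # flat on logit θ (proper only if 0 < y < n)

print(post_uniform.mean(), post_logit.mean())          # 0.318 vs. 0.300
print(post_uniform.interval(0.95), post_logit.interval(0.95))
```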

  3. Noninformative prior distributions: Jeffreys' principle
     • If φ = h(θ), then p(φ) = p(θ) |dθ/dφ| = p(θ) |h′(θ)|^(−1).
     • Jeffreys' principle leads to a noninformative prior density p(θ) ∝ [J(θ)]^(1/2), where J(θ) is the Fisher information for θ:
       J(θ) = E[ (d log p(y | θ) / dθ)² | θ ] = −E[ d² log p(y | θ) / dθ² | θ ].
     • Jeffreys' prior is invariant to parameterization: evaluating J(φ) at θ = h^(−1)(φ),
       J(φ) = −E[ d² log p(y | φ) / dφ² ] = −E[ d² log p(y | θ = h^(−1)(φ)) / dθ² ] · |dθ/dφ|² = J(θ) |dθ/dφ|²,
       and thus J(φ)^(1/2) = J(θ)^(1/2) |dθ/dφ|, which is exactly the change-of-variables rule above (numerical check below).
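A minimal Monte Carlo check (illustration only, anticipating the binomial example of the next slide) that the Fisher information equals the variance of the score, and that J(φ)^(1/2) = J(θ)^(1/2) |dθ/dφ| for φ = logit θ. The values n = 20, θ = 0.3 are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 20, 0.3
y = rng.binomial(n, theta, size=200_000)

score_theta = y / theta - (n - y) / (1 - theta)   # d/dθ log p(y|θ)
score_phi = y - n * theta                         # d/dφ log p(y|φ), with φ = logit θ
dtheta_dphi = theta * (1 - theta)                 # |dθ/dφ| at this θ

J_theta = score_theta.var()   # ≈ n / (θ(1−θ))
J_phi = score_phi.var()       # ≈ n θ(1−θ)

print(np.sqrt(J_phi), np.sqrt(J_theta) * dtheta_dphi)   # the two sides should agree
```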

  4. Examples: Various noninformative prior distributions
     • y | θ ∼ binomial(n, θ), p(y | θ) = (n choose y) θ^y (1 − θ)^(n−y)
     • Jeffreys' prior density p(θ) ∝ [J(θ)]^(1/2):
       log p(y | θ) = constant + y log θ + (n − y) log(1 − θ),
       J(θ) = −E[ d² log p(y | θ) / dθ² | θ ] = n / (θ(1 − θ)),
       so Jeffreys' prior gives p(θ) ∝ θ^(−1/2) (1 − θ)^(−1/2).
     • Three alternative priors (compared numerically below)
       ◦ Jeffreys' prior: θ ∼ Beta(1/2, 1/2)
       ◦ uniform prior: θ ∼ Beta(1, 1), i.e., p(θ) = 1
       ◦ improper prior: θ ∼ Beta(0, 0), i.e., p(logit θ) ∝ 1
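A minimal sketch (same hypothetical data n = 20, y = 6 as above) comparing the posteriors implied by the three priors. All three are in the Beta family, so the posterior is Beta(y + a, n − y + b); the improper Beta(0, 0) prior gives a proper posterior only when 0 < y < n.

```python
from scipy import stats

n, y = 20, 6                                     # made-up data for illustration
priors = {"Jeffreys Beta(1/2,1/2)": (0.5, 0.5),
          "uniform  Beta(1,1)":     (1.0, 1.0),
          "improper Beta(0,0)":     (0.0, 0.0)}

for name, (a, b) in priors.items():
    post = stats.beta(y + a, n - y + b)
    lo, hi = post.interval(0.95)
    print(f"{name}: mean = {post.mean():.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```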

  5. From single-parameter to multiparameter models
     • The reality of applied statistics: there are always several (maybe many) unknown parameters!
     • BUT the interest usually lies in only a few of them (the parameters of interest), while the others are regarded as nuisance parameters: quantities we do not wish to make inferences about, but which are required to construct a realistic model.
     • At this point the simple conceptual framework of the Bayesian approach reveals its principal advantage over other forms of inference.

  6. Bayesian approach to multiparameter models
     • The Bayesian approach is clear: obtain the joint posterior distribution of all unknowns, then integrate over the nuisance parameters to leave the marginal posterior distribution for the parameters of interest.
     • Alternatively, using simulation: draw samples from the entire joint posterior distribution (this may itself be computationally difficult), look at the parameters of interest, and ignore the rest.

  7. Parameter of interest and nuisance parameter
     • Suppose the model parameter θ has two parts, θ = (θ₁, θ₂)
       ◦ Parameter of interest: θ₁
       ◦ Nuisance parameter: θ₂
     • For example, y | µ, σ² ∼ normal(µ, σ²)
       ◦ Unknown: µ and σ²
       ◦ Parameter of interest (usually, not always): µ
       ◦ Nuisance parameter: σ²
     • Approaches to obtain p(θ₁ | y)
       ◦ Averaging over nuisance parameters
       ◦ Factoring the joint posterior
       ◦ A strategy for computation: conditional simulation via the Gibbs sampler

  8. Posterior distribution of θ = (θ₁, θ₂)
     • Prior of θ: p(θ) = p(θ₁, θ₂)
     • Likelihood of θ: p(y | θ) = p(y | θ₁, θ₂)
     • Posterior of θ = (θ₁, θ₂) given y:
       p(θ₁, θ₂ | y) ∝ p(θ₁, θ₂) p(y | θ₁, θ₂).

  9. Approaches to obtain the marginal posterior of θ₁, p(θ₁ | y)
     • Joint posterior of θ₁ and θ₂: p(θ₁, θ₂ | y) ∝ p(θ₁, θ₂) p(y | θ₁, θ₂)
     • Approaches to obtain the marginal posterior density p(θ₁ | y)
       ◦ By averaging or integrating over the nuisance parameter θ₂:
         p(θ₁ | y) = ∫ p(θ₁, θ₂ | y) dθ₂.
       ◦ By factoring the joint posterior:
         p(θ₁ | y) = ∫ p(θ₁, θ₂ | y) dθ₂ = ∫ p(θ₁ | θ₂, y) p(θ₂ | y) dθ₂.   (1)
     • p(θ₁ | y) is a mixture of the conditional posterior distributions given the nuisance parameter θ₂, p(θ₁ | θ₂, y).
     • The weighting function p(θ₂ | y) combines evidence from the data and the prior.
     • θ₂ can be categorical (discrete) and may take only a few possible values, representing, for example, different sub-models (a small discrete illustration follows).
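A minimal sketch (all numbers invented) of equation (1) when θ₂ is discrete: p(θ₁ | y) is a weighted mixture of the conditional posteriors p(θ₁ | θ₂, y), with weights p(θ₂ | y). Here θ₂ indexes two hypothetical sub-models whose conditional posteriors for θ₁ are normal.

```python
import numpy as np
from scipy import stats

theta1 = np.linspace(-4, 6, 501)
cond_posts = [stats.norm(0.0, 1.0), stats.norm(2.0, 0.5)]   # p(θ₁ | θ₂ = k, y), hypothetical
weights = np.array([0.3, 0.7])                              # p(θ₂ = k | y), hypothetical

marginal = sum(w * d.pdf(theta1) for w, d in zip(weights, cond_posts))
dx = theta1[1] - theta1[0]
print(marginal.sum() * dx)   # ≈ 1: the mixture is itself a proper density
```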

  10. A strategy for computation: simulations instead of integration
     We rarely evaluate integral (1) explicitly, but it suggests an important strategy for constructing and computing with multiparameter models, using simulations.
     • Successive conditional simulations
       ◦ Draw θ₂ from its marginal posterior distribution, p(θ₂ | y).
       ◦ Draw θ₁ from its conditional posterior distribution given the drawn value of θ₂, p(θ₁ | θ₂, y).
     • All-others conditional simulations (Gibbs sampler)
       ◦ Draw θ₁^(t+1) from its conditional posterior distribution given the previously drawn value θ₂^(t), i.e., from p(θ₁ | θ₂^(t), y).
       ◦ Draw θ₂^(t+1) from its conditional posterior distribution given the just-drawn value θ₁^(t+1), i.e., from p(θ₂ | θ₁^(t+1), y).
       ◦ Iterating this procedure ultimately generates samples from the joint posterior distribution p(θ₁, θ₂ | y), and hence from the marginal posteriors (a generic sketch of both strategies follows).
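A minimal sketch of the two simulation strategies. The draw functions draw_theta2_marginal, draw_theta1_given_theta2, and draw_theta2_given_theta1 are hypothetical placeholders for whatever model-specific samplers are available; they are not defined on the slides.

```python
import numpy as np

def successive_conditional(draw_theta2_marginal, draw_theta1_given_theta2, n_draws, rng):
    """Exact draws from p(θ1, θ2 | y) when p(θ2 | y) can be sampled directly."""
    draws = []
    for _ in range(n_draws):
        theta2 = draw_theta2_marginal(rng)                 # θ2 ~ p(θ2 | y)
        theta1 = draw_theta1_given_theta2(theta2, rng)     # θ1 ~ p(θ1 | θ2, y)
        draws.append((theta1, theta2))
    return np.array(draws)

def gibbs(draw_theta1_given_theta2, draw_theta2_given_theta1, theta2_init, n_iter, rng):
    """Markov chain whose stationary distribution is p(θ1, θ2 | y)."""
    theta2 = theta2_init
    chain = []
    for _ in range(n_iter):
        theta1 = draw_theta1_given_theta2(theta2, rng)     # θ1^(t+1) | θ2^(t)
        theta2 = draw_theta2_given_theta1(theta1, rng)     # θ2^(t+1) | θ1^(t+1)
        chain.append((theta1, theta2))
    return np.array(chain)
```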

  11. Multiparameter model: the normal model (I)
     • y₁, …, yₙ iid ∼ normal(µ, σ²), both µ and σ² unknown; use the Bayesian approach to estimate µ.
       ◦ Choose a prior for (µ, σ²); take noninformative priors:
         p(µ, σ²) = p(µ) p(σ²) ∝ 1 · (σ²)^(−1) = σ^(−2)
         • prior independence of location and scale
         • p(µ) ∝ 1: noninformative (uniform) but improper prior
         • p(log σ²) ∝ 1 ⇒ p(σ²) ∝ (σ²)^(−1): noninformative, uniform on log σ²
       ◦ Likelihood:
         p(y | µ, σ²) = ∏ᵢ (1 / (√(2π) σ)) exp( −(yᵢ − µ)² / (2σ²) )
                      ∝ σ^(−n) exp( −(1/(2σ²)) Σᵢ (yᵢ − µ)² ),
         with the product and sum over i = 1, …, n.

  12. Joint posterior distribution, p(µ, σ² | y)
     • y₁, …, yₙ iid ∼ normal(µ, σ²)
       ◦ prior of (µ, σ²): p(µ, σ²) = p(µ) p(σ²) ∝ 1 · (σ²)^(−1) = σ^(−2)
       ◦ the joint posterior distribution of (µ, σ²):
         p(µ, σ² | y) ∝ p(µ, σ²) p(y | µ, σ²)
                      ∝ σ^(−n−2) exp( −(1/(2σ²)) Σᵢ (yᵢ − µ)² )
                      = σ^(−n−2) exp( −(1/(2σ²)) [ Σᵢ (yᵢ − ȳ)² + n(ȳ − µ)² ] )
                      = σ^(−n−2) exp( −(1/(2σ²)) [ (n − 1)s² + n(ȳ − µ)² ] ),
         where s² = (1/(n − 1)) Σᵢ (yᵢ − ȳ)² is the sample variance. The sufficient statistics are ȳ and s² (a small grid-evaluation sketch follows).
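A minimal sketch (simulated data with invented parameters) that evaluates the unnormalized log joint posterior log p(µ, σ² | y) on a grid, using only the sufficient statistics ȳ and s².

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=30)            # hypothetical data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

mu_grid = np.linspace(ybar - 3, ybar + 3, 200)
sigma2_grid = np.linspace(0.5, 12.0, 200)
mu, sigma2 = np.meshgrid(mu_grid, sigma2_grid)

# log p(µ, σ² | y) = −(n+2)/2 · log σ² − [(n−1)s² + n(ȳ−µ)²] / (2σ²) + const
log_post = -(n + 2) / 2 * np.log(sigma2) - ((n - 1) * s2 + n * (ybar - mu) ** 2) / (2 * sigma2)

# the joint mode is at µ = ȳ and σ² = (n−1)s²/(n+2) under this posterior
i, j = np.unravel_index(log_post.argmax(), log_post.shape)
print("grid mode:", mu[i, j], sigma2[i, j])
print("analytic mode:", ybar, (n - 1) * s2 / (n + 2))
```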

  13. Conditional posterior distribution, p(µ | σ², y)
     • p(µ, σ² | y) = p(µ | σ², y) p(σ² | y)
     • Using the single-parameter case of µ with known σ² and noninformative prior p(µ) ∝ 1, we have
       µ | σ², y ∼ normal(ȳ, σ²/n).

  14. Marginal posterior distribution, p(σ² | y)
     • p(µ, σ² | y) = p(µ | σ², y) p(σ² | y)
     • p(σ² | y) requires averaging the joint posterior
       p(µ, σ² | y) ∝ σ^(−n−2) exp( −(1/(2σ²)) [ (n − 1)s² + n(ȳ − µ)² ] )
       over µ, that is, evaluating the simple normal integral
       ∫ exp( −(1/(2σ²)) n(ȳ − µ)² ) dµ = √(2πσ²/n),
       thus
       p(σ² | y) ∝ (σ²)^(−(n+1)/2) exp( −(n − 1)s² / (2σ²) ),
       i.e., σ² | y ∼ Inv-χ²(n − 1, s²), a scaled inverse-χ² distribution (sampling sketch below).
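A minimal sketch of how to draw from the scaled inverse-χ² distribution Inv-χ²(ν, s²): if X ∼ χ²_ν then ν s² / X ∼ Inv-χ²(ν, s²), which is also an inverse-gamma distribution with shape ν/2 and scale ν s²/2. The numbers ν = 29, s² = 3.7 are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nu, s2 = 29, 3.7                               # hypothetical degrees of freedom and scale
draws = nu * s2 / rng.chisquare(nu, size=100_000)   # scaled Inv-χ²(ν, s²) draws

# compare with the equivalent inverse-gamma parameterization
print(draws.mean(), stats.invgamma(a=nu / 2, scale=nu * s2 / 2).mean())
```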

  15. Analytic form of the marginal posterior distribution of µ
     • µ is typically the estimand of interest, so the ultimate objective of the Bayesian analysis is the marginal posterior distribution of µ. This can be obtained by integrating σ² out of the joint posterior distribution, or easily done by simulation: first draw σ² from p(σ² | y), then draw µ from p(µ | σ², y) (a sketch of this two-step simulation follows).
     • The posterior distribution of µ, p(µ | y), can be thought of as a mixture of normal distributions, mixed over the scaled inverse-χ² distribution for the variance; this is a rare case where analytic results are available.
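A minimal sketch (simulated data) of the two-step simulation: draw σ² from its scaled Inv-χ²(n − 1, s²) marginal posterior, then µ | σ², y from normal(ȳ, σ²/n).

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=30)                      # hypothetical data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

n_draws = 50_000
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)   # σ² | y
mu = rng.normal(ybar, np.sqrt(sigma2 / n))                   # µ | σ², y

print("posterior mean of µ ≈", mu.mean())
print("95% posterior interval for µ ≈", np.percentile(mu, [2.5, 97.5]))
```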

  16. Performing the integration
     • We start by integrating the joint posterior density over σ²:
       p(µ | y) = ∫₀^∞ p(µ, σ² | y) dσ²
     • With the substitution z = A/(2σ²), where A = (n − 1)s² + n(µ − ȳ)², the result is an unnormalized gamma integral:
       p(µ | y) ∝ A^(−n/2) ∫₀^∞ z^((n−2)/2) exp(−z) dz
               ∝ [ (n − 1)s² + n(µ − ȳ)² ]^(−n/2)
               ∝ [ 1 + n(µ − ȳ)² / ((n − 1)s²) ]^(−n/2)
     • µ | y ∼ t_{n−1}(ȳ, s²/n) (a simulation check follows).
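A minimal check (simulated data) that the two-step draws of µ agree with the analytic result µ | y ∼ t_{n−1}(ȳ, s²/n): after standardizing, (µ − ȳ)/(s/√n) should follow a standard t distribution with n − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(5.0, 2.0, size=30)                      # hypothetical data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=100_000)   # σ² | y
mu = rng.normal(ybar, np.sqrt(sigma2 / n))                   # µ | σ², y
z = (mu - ybar) / np.sqrt(s2 / n)                            # standardized draws

# quantiles of the standardized draws vs. the t_{n−1} reference
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(z, q), stats.t(df=n - 1).ppf(q))
```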
