  1. Introduction to Bayesian Statistics, Lecture 9: Hierarchical Models. Rung-Ching Tsai, Department of Mathematics, National Taiwan Normal University. May 6, 2015.

  2. Example
  • Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).

        Day       8    15    22    29    36
        Rat 1   151   199   246   283   320
        Rat 2   145   199   249   293   354
        ...
        Rat 30  153   200   244   286   324

  • Model: $Y_{ij} = \alpha + \beta x_j + \epsilon_{ij}$, where $Y_{ij}$ is the weight of the $i$-th rat on day $x_j$, and $\epsilon_{ij} \sim \text{Normal}(0, \sigma^2)$.
  • What is the assumption on the growth of the 30 rats in this model?
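To make the shared-growth assumption concrete, here is a minimal sketch (added for illustration, not from the lecture) that fits one common $(\alpha, \beta)$ to all rats by least squares, using only the three rats printed above:

```python
import numpy as np

# Pooled model: a single growth line, weight = alpha + beta * day, for every rat.
days = np.array([8, 15, 22, 29, 36])
weights = np.array([
    [151, 199, 246, 283, 320],   # rat 1
    [145, 199, 249, 293, 354],   # rat 2
    [153, 200, 244, 286, 324],   # rat 30
])

# Stack all (day, weight) pairs and fit by ordinary least squares.
x = np.tile(days, weights.shape[0])
w = weights.ravel()
beta, alpha = np.polyfit(x, w, deg=1)   # polyfit returns highest-degree coefficient first
print(f"alpha = {alpha:.1f}, beta = {beta:.2f} grams/day")
```

The single $(\alpha, \beta)$ answers the question on the slide: the model assumes all 30 rats grow along the same line, differing only by measurement noise.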

  3. Example
  • Data: Number of failures and length of operation time of 10 power plant pumps (George, Makov, & Smith, 1993).

        Pump       1     2     3     4     5     6     7     8    9    10
        time    94.5  15.7  62.9   126  5.24  31.4  1.05  1.05  2.1  10.5
        failures   5     1     5    14     3    19     1     1    4    22

  • Model: $X_i \sim \text{Poisson}(\lambda t_i)$, where $X_i$ is the number of failures for pump $i$, $\lambda$ is the common failure rate, and $t_i$ is the length of operation time of pump $i$ (in 1000s of hours).
  • What is the assumption on the failure rates of the 10 power plant pumps in this model?
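As a small numerical companion (again illustrative, not from the lecture), the common failure rate has maximum-likelihood estimate $\hat{\lambda} = \sum_i x_i / \sum_i t_i$; contrasting it with one unrelated rate per pump previews the problems discussed on the next slide:

```python
import numpy as np

# Operation times (in 1000s of hours) and failure counts from the table above.
time = np.array([94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
failures = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])

lam_pooled = failures.sum() / time.sum()   # one rate shared by all pumps
lam_separate = failures / time             # one unrelated rate per pump
print(f"pooled rate: {lam_pooled:.3f} failures per 1000 hours")
print("per-pump rates:", np.round(lam_separate, 2))
```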

  4. Possible problems with the above approaches
  • A single $(\alpha, \beta)$ may be inadequate to fit all the rats. Likewise, a common failure rate for all the power plant pumps may not be suitable.
  • Separate, unrelated $(\alpha_i, \beta_i)$ for each rat, or $\lambda_i$ for each pump, are likely to overfit the data: some information about the parameters of one rat or one pump can be obtained from the others' data.

  5. Motivation for hierarchical models
  • A natural idea is to assume that the $(\alpha_i, \beta_i)$'s or $\lambda_i$'s are samples from a common population distribution. The observed outcomes are then modeled conditionally on parameters which themselves have a probability specification; this is known as a hierarchical or multilevel model.
  • The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.
  • Thus, we need only estimate the parameters governing the population distribution of the $(\alpha_i, \beta_i)$ rather than each $(\alpha_i, \beta_i)$ separately.

  6. Bayesian approach to hierarchical models
  • Model specification
    ◦ specify the sampling distribution of the data: $p(y \mid \theta)$
    ◦ specify the population distribution of $\theta$: $p(\theta \mid \phi)$, where $\phi$ is the hyperparameter
  • Bayesian estimation
    ◦ specify the prior for the hyperparameter: $p(\phi)$. Many levels are possible; the hyperprior distribution at the highest level is often chosen to be non-informative.
    ◦ combine with the model specification $p(y \mid \theta)$ and $p(\theta \mid \phi)$
    ◦ find the joint posterior distribution of the parameter $\theta$ and hyperparameter $\phi$:
      $p(\theta, \phi \mid y) \propto p(\theta, \phi)\, p(y \mid \theta, \phi) = p(\theta, \phi)\, p(y \mid \theta) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$
    ◦ point and credible-interval estimation for $\phi$ and $\theta$
    ◦ predictive distribution for $\tilde{y}$

  7. Analytical derivation of conditional/marginal distributions
  • Write out the joint posterior distribution: $p(\theta, \phi \mid y) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$
  • Determine analytically the conditional posterior density of $\theta$ given $\phi$: $p(\theta \mid \phi, y)$
  • Obtain the marginal posterior distribution of $\phi$:
    $p(\phi \mid y) = \int p(\theta, \phi \mid y)\, d\theta \quad\text{or}\quad p(\phi \mid y) = \frac{p(\theta, \phi \mid y)}{p(\theta \mid \phi, y)}$

  8. Simulations from the posterior distributions
  1. Two steps to simulate a random draw from the joint posterior distribution $p(\theta, \phi \mid y)$ of $\theta$ and $\phi$:
     ◦ draw $\phi$ from its marginal posterior distribution $p(\phi \mid y)$
     ◦ draw the parameter $\theta$ from its conditional posterior $p(\theta \mid \phi, y)$
  2. If desired, draw predictive values $\tilde{y}$ from the posterior predictive distribution given the drawn $\theta$.
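A minimal generic sketch of this two-step scheme; `draw_phi_marginal` and `draw_theta_conditional` are hypothetical placeholders for model-specific samplers (the rat tumor example below gives concrete versions of both):

```python
def draw_joint_posterior(y, draw_phi_marginal, draw_theta_conditional, n_draws=1000):
    """Simulate draws from p(theta, phi | y) by composition."""
    draws = []
    for _ in range(n_draws):
        phi = draw_phi_marginal(y)               # step 1: phi ~ p(phi | y)
        theta = draw_theta_conditional(phi, y)   # step 2: theta ~ p(theta | phi, y)
        draws.append((phi, theta))
    return draws
```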

  9. Example: Rat tumors
  • Goal: estimating the risk of tumor in a group of rats
  • Data (number of rats that developed some kind of tumor / number of rats in the experiment):
    1. 70 historical experiments:
       0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19 0/19 0/18 0/18 0/17
       1/20 1/20 1/20 1/20 1/19 1/19 1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20
       2/20 2/20 2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20 3/20 2/13
       9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20 4/20 10/48 4/19 4/19 4/19 5/22
       11/46 12/49 5/20 5/20 6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24
    2. Current experiment: 4/14
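For the computational sketches on the later slides, the 71 experiments can be stored as two arrays (the 71st entry is the current experiment, 4/14):

```python
import numpy as np

# Tumor counts y_j and group sizes n_j, in the order listed above.
y = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,
              1,5,2,5,3,2,7,7,3,3,2,9,10,4,4,4,4,4,4,4,10,4,4,4,5,11,12,5,5,
              6,5,6,6,6,6,16,15,15,9,4])
n = np.array([20,20,20,20,20,20,20,19,19,19,19,18,18,17,20,20,20,20,19,19,18,
              18,25,24,23,20,20,20,20,20,20,10,49,19,46,27,17,49,47,20,20,13,
              48,50,20,20,20,20,20,20,20,48,19,19,19,22,46,49,20,20,23,19,22,
              20,20,20,52,47,46,24,14])
assert y.size == n.size == 71
```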

  10. Bayesian approach to hierarchical models
  • Model specification
    ◦ sampling distribution of the data: $y_j \sim \text{Binomial}(n_j, \theta_j)$, $j = 1, 2, \cdots, 71$
    ◦ population distribution of $\theta$: $\theta_j \sim \text{Beta}(\alpha, \beta)$, where $\alpha$ and $\beta$ are the hyperparameters
  • Bayesian estimation
    ◦ non-informative prior for the hyperparameters: $p(\alpha, \beta)$
    ◦ combine with the model specification $p(\theta \mid \alpha, \beta)$
    ◦ find the joint posterior distribution of the parameter $\theta$ and hyperparameters $\alpha$ and $\beta$:
      $p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta)$
      $\propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j - y_j}$

  11. Analytical derivation of conditional/marginal distributions
  • the joint posterior distribution:
    $p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j - y_j}$
  • the conditional posterior density of $\theta$ given $\alpha$ and $\beta$:
    $p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+y_j)\Gamma(\beta+n_j-y_j)}\, \theta_j^{\alpha+y_j-1} (1-\theta_j)^{\beta+n_j-y_j-1}$
  • the marginal posterior distribution of $\alpha$ and $\beta$:
    $p(\alpha, \beta \mid y) = \frac{p(\theta, \alpha, \beta \mid y)}{p(\theta \mid \alpha, \beta, y)} \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha+y_j)\Gamma(\beta+n_j-y_j)}{\Gamma(\alpha+\beta+n_j)}$
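As a sketch (not from the slides), the log of this unnormalized marginal posterior can be evaluated stably with `scipy.special.gammaln`, using the arrays `y` and `n` defined after slide 9 and anticipating the hyperprior $p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2}$ chosen on the next slide:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_post(alpha, beta, y, n):
    """Unnormalized log p(alpha, beta | y) for the beta-binomial hierarchy."""
    log_prior = -2.5 * np.log(alpha + beta)   # p(alpha, beta) ∝ (alpha + beta)^(-5/2)
    # The scalar first line broadcasts against the arrays, so the
    # Gamma(a+b)/(Gamma(a)Gamma(b)) term is correctly summed once per experiment j.
    log_lik = np.sum(
        gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
        + gammaln(alpha + y) + gammaln(beta + n - y)
        - gammaln(alpha + beta + n)
    )
    return log_prior + log_lik
```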

  12. Choice of hyperprior distribution
  • Idea: set up a 'non-informative' hyperprior distribution.
    ◦ $p\big(\log(\alpha/\beta), \log(\alpha+\beta)\big) \propto 1$, i.e. a flat prior on $\big(\text{logit}(\frac{\alpha}{\alpha+\beta}), \log(\alpha+\beta)\big)$:
      NO GOOD, because it leads to an improper posterior.
    ◦ $p\big(\frac{\alpha}{\alpha+\beta}, \alpha+\beta\big) \propto 1$ or $p(\alpha, \beta) \propto 1$:
      NO GOOD, because the posterior density is not integrable in the limit.
    ◦ $p\big(\frac{\alpha}{\alpha+\beta}, (\alpha+\beta)^{-1/2}\big) \propto 1 \iff p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2} \iff p\big(\log(\alpha/\beta), \log(\alpha+\beta)\big) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}$:
      OK, because it leads to a proper posterior.
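The equivalence between the last two forms is a change of variables; as a one-line check (added here, not on the slide), the map from $(\alpha, \beta)$ to $(u, v) = (\log(\alpha/\beta), \log(\alpha+\beta))$ has Jacobian determinant $\alpha\beta$:

$$
\left|\frac{\partial(\alpha, \beta)}{\partial(u, v)}\right|
= \det \begin{pmatrix} \frac{\alpha\beta}{\alpha+\beta} & \alpha \\ -\frac{\alpha\beta}{\alpha+\beta} & \beta \end{pmatrix}
= \alpha\beta,
\qquad\text{so}\qquad
p(u, v) = \alpha\beta\, p(\alpha, \beta) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}.
$$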

  13. Computing the marginal posterior of the hyperparameters
  • Compute the relative (unnormalized) posterior density on a grid of values that covers the effective range of $(\alpha, \beta)$:
    ◦ $\big(\log(\alpha/\beta), \log(\alpha+\beta)\big) \in [-2.5, -1] \times [1.5, 3]$
    ◦ $\big(\log(\alpha/\beta), \log(\alpha+\beta)\big) \in [-2.3, -1.3] \times [1, 5]$
  • Draw the contour plot of the marginal density of $\big(\log(\alpha/\beta), \log(\alpha+\beta)\big)$:
    ◦ contour lines are at $0.05, 0.15, \cdots, 0.95$ times the density at the mode.
  • Normalize by approximating the posterior distribution as a step function over the grid and setting the total probability in the grid to 1.
  • Compute posterior moments based on the grid of $\big(\log(\alpha/\beta), \log(\alpha+\beta)\big)$. For example, $\text{E}(\alpha \mid y)$ is estimated by
    $\hat{\text{E}}(\alpha \mid y) = \sum_{\log(\alpha/\beta),\, \log(\alpha+\beta)} \alpha \; p\big(\log(\alpha/\beta), \log(\alpha+\beta) \mid y\big)$
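Continuing the sketch from slide 11 (the grid bounds are the second pair above; `log_marginal_post`, `y`, and `n` are as defined earlier), the grid computation might look like:

```python
# Grid over (u, v) = (log(alpha/beta), log(alpha+beta)).
u = np.linspace(-2.3, -1.3, 100)
v = np.linspace(1.0, 5.0, 100)
U, V = np.meshgrid(u, v, indexing="ij")
alpha = np.exp(U + V) / (1 + np.exp(U))   # invert the reparameterization
beta = np.exp(V) / (1 + np.exp(U))

# Density on the (u, v) scale = density on (alpha, beta) times the Jacobian alpha*beta.
logp = (np.vectorize(lambda a, b: log_marginal_post(a, b, y, n))(alpha, beta)
        + np.log(alpha) + np.log(beta))

# Normalize as a step function: total probability on the grid is 1.
p = np.exp(logp - logp.max())
p /= p.sum()
E_alpha = np.sum(alpha * p)   # grid estimate of E(alpha | y)
```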

  14. Sampling from the joint posterior
  1. Simulate 1000 draws of $\big(\log(\alpha/\beta), \log(\alpha+\beta)\big)$ from their posterior distribution using the discrete-grid sampling procedure.
  2. For $l = 1, \cdots, 1000$:
     ◦ Transform the $l$-th draw of $\big(\log(\alpha/\beta), \log(\alpha+\beta)\big)$ to the scale of $(\alpha, \beta)$ to yield a draw of the hyperparameters from their marginal posterior distribution.
     ◦ For each $j = 1, \cdots, J$, sample $\theta_j$ from its conditional posterior distribution $\theta_j \mid \alpha, \beta, y \sim \text{Beta}(\alpha + y_j, \beta + n_j - y_j)$.
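A sketch of this procedure, reusing the normalized grid `p` and the `alpha`, `beta` arrays from the previous sketch (the book's version also adds uniform jitter within each grid cell so the draws are continuous; that refinement is omitted here):

```python
rng = np.random.default_rng(0)

# Step 1: draw 1000 grid cells with probability proportional to the posterior,
# then read off the corresponding (alpha, beta) values.
idx = rng.choice(p.size, size=1000, p=p.ravel())
a_draws = alpha.ravel()[idx]
b_draws = beta.ravel()[idx]

# Step 2: for each draw l and each experiment j, sample theta_j from
# Beta(alpha + y_j, beta + n_j - y_j); broadcasting yields a (1000, 71) array.
theta = rng.beta(a_draws[:, None] + y, b_draws[:, None] + n - y)
```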

  15. Displaying the results
  • Plot the posterior means and 95% intervals for the $\theta_j$'s (Figure 5.4 on page 131).
  • The rates $\theta_j$ are shrunk from their sample point estimates, $y_j / n_j$, towards the mean of the population distribution.
  • Experiments with few observations are shrunk more and have higher posterior variances.
  • Note that posterior variability is higher in the full Bayesian analysis, reflecting posterior uncertainty in the hyperparameters.

  16. Hierarchical normal models (I)
  • Model specification
    ◦ sampling distribution of the data: $y_{ij} \mid \theta_j \sim \text{Normal}(\theta_j, \sigma^2)$, $i = 1, \cdots, n_j$, $j = 1, 2, \cdots, J$, with $\sigma^2$ known
    ◦ population distribution of $\theta$: $\theta_j \sim \text{Normal}(\mu, \tau^2)$, where $\mu$ and $\tau$ are the hyperparameters. That is,
      $p(\theta_1, \cdots, \theta_J \mid \mu, \tau) = \prod_{j=1}^{J} \text{N}(\theta_j \mid \mu, \tau^2)$
    ◦ marginally,
      $p(\theta_1, \cdots, \theta_J) = \int \prod_{j=1}^{J} \big[\text{N}(\theta_j \mid \mu, \tau^2)\big]\, p(\mu, \tau)\, d(\mu, \tau)$
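A tiny generative sketch of this two-level model, with illustrative hyperparameter values (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
J, n_j, sigma, mu, tau = 8, 10, 2.0, 5.0, 1.5   # assumed values for illustration

theta = rng.normal(mu, tau, size=J)                   # theta_j ~ N(mu, tau^2)
y = rng.normal(theta[:, None], sigma, size=(J, n_j))  # y_ij ~ N(theta_j, sigma^2)
ybar = y.mean(axis=1)   # group means: the sufficient statistics used on the next slide
```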

  17. Hierarchical normal models (II)
  • Bayesian estimation
    ◦ non-informative prior for the hyperparameters: $p(\mu, \tau) = p(\mu \mid \tau)\, p(\tau) \propto p(\tau)$
    ◦ combine with the model specification $p(\theta \mid \mu, \tau)$
    ◦ find the joint posterior distribution of the parameter $\theta$ and hyperparameters $\mu$ and $\tau$:
      $p(\theta, \mu, \tau \mid y) \propto p(\mu, \tau)\, p(\theta \mid \mu, \tau)\, p(y \mid \theta) \propto p(\mu, \tau) \prod_{j=1}^{J} \text{N}(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} \text{N}(\bar{y}_{.j} \mid \theta_j, \sigma^2/n_j)$

  18. Conditional posterior of $\theta$ given $(\mu, \tau)$: $p(\theta \mid \mu, \tau, y)$
  • prior: $\theta_j \mid \mu, \tau \sim \text{Normal}(\mu, \tau^2)$
  • posterior: $\theta_j \mid \mu, \tau, y \sim \text{Normal}(\hat{\theta}_j, V_j)$, where
    ◦ $\hat{\theta}_j = \dfrac{\frac{n_j}{\sigma^2}\, \bar{y}_{.j} + \frac{1}{\tau^2}\, \mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$
    ◦ $V_j = \dfrac{1}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$
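As a small sketch of these formulas (function and argument names are illustrative), each $\hat{\theta}_j$ is a precision-weighted average of the group mean $\bar{y}_{.j}$ and the population mean $\mu$:

```python
import numpy as np

def conditional_posterior(ybar, n, sigma2, mu, tau2):
    """Mean and variance of theta_j | mu, tau, y for each group j."""
    prec_data = n / sigma2        # precision contributed by the n_j observations
    prec_prior = 1.0 / tau2       # precision contributed by the population prior
    V = 1.0 / (prec_data + prec_prior)
    theta_hat = V * (prec_data * ybar + prec_prior * mu)
    return theta_hat, V
```

The more data a group has (larger $n_j$), the more $\hat{\theta}_j$ leans toward its own mean $\bar{y}_{.j}$; groups with little data are pulled toward $\mu$, which is exactly the shrinkage seen in the rat tumor example.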
