

  1. Introduction to Bayesian Statistics
     Lecture 4: Multiparameter models (I)
     Rung-Ching Tsai
     Department of Mathematics, National Taiwan Normal University
     March 18, 2015

  2. Noninformative prior distributions
     • Proper and improper prior distributions
     • Unnormalized densities
     • Uniform prior distributions on different scales (numerical illustration below)
     • Some examples
       ◦ Probability parameter θ ∈ (0, 1)
         • One possibility: p(θ) = 1 [proper]
         • Another possibility: p(logit θ) ∝ 1, which corresponds to p(θ) ∝ θ^(−1)(1 − θ)^(−1) [improper]
       ◦ Location parameter θ, unconstrained
         • One possibility: p(θ) ∝ 1 [improper] ⇒ p(θ | y) ≈ normal(θ | ȳ, σ²/n)
       ◦ Scale parameter σ > 0
         • One possibility: p(σ) ∝ 1 [improper]
         • Another possibility: p(log σ²) ∝ 1, which corresponds to p(σ²) ∝ σ^(−2) [improper]
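A minimal sketch in Python, with hypothetical binomial data n = 20, y = 6 (not from the slides), showing that "uniform" depends on the scale: a flat prior on θ and a flat prior on logit θ lead to different Beta posteriors for the same data.

```python
# Flat prior on θ (Beta(1,1)) vs. flat prior on logit θ ("Beta(0,0)"):
# the conjugate posteriors are Beta(y+1, n−y+1) and Beta(y, n−y), respectively.
from scipy import stats

n, y = 20, 6                                   # hypothetical data for illustration
post_uniform = stats.beta(y + 1, n - y + 1)    # flat on θ
post_logit   = stats.beta(y, n - y)            # flat on logit θ (proper only if 0 < y < n)

print(post_uniform.mean(), post_logit.mean())          # 0.318 vs. 0.300
print(post_uniform.interval(0.95), post_logit.interval(0.95))
```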

  3. Noninformative prior distributions: Jeffreys' principle
     • If φ = h(θ), then p(φ) = p(θ) |dθ/dφ| = p(θ) |h′(θ)|^(−1).
     • Jeffreys' principle leads to a noninformative prior density p(θ) ∝ [J(θ)]^(1/2), where J(θ) is the Fisher information for θ:
       J(θ) = E[ (d log p(y | θ) / dθ)² | θ ] = −E[ d² log p(y | θ) / dθ² | θ ].
     • Jeffreys' prior is invariant to parameterization: evaluating J(φ) at θ = h^(−1)(φ),
       J(φ) = −E[ d² log p(y | φ) / dφ² ] = −E[ d² log p(y | θ = h^(−1)(φ)) / dθ² ] · |dθ/dφ|² = J(θ) |dθ/dφ|²,
       and thus J(φ)^(1/2) = J(θ)^(1/2) |dθ/dφ|, which is exactly the change-of-variables rule above (numerical check below).
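A minimal Monte Carlo check (illustration only, anticipating the binomial example of the next slide) that the Fisher information equals the variance of the score, and that J(φ)^(1/2) = J(θ)^(1/2) |dθ/dφ| for φ = logit θ. The values n = 20, θ = 0.3 are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 20, 0.3
y = rng.binomial(n, theta, size=200_000)

score_theta = y / theta - (n - y) / (1 - theta)   # d/dθ log p(y|θ)
score_phi = y - n * theta                         # d/dφ log p(y|φ), with φ = logit θ
dtheta_dphi = theta * (1 - theta)                 # |dθ/dφ| at this θ

J_theta = score_theta.var()   # ≈ n / (θ(1−θ))
J_phi = score_phi.var()       # ≈ n θ(1−θ)

print(np.sqrt(J_phi), np.sqrt(J_theta) * dtheta_dphi)   # the two sides should agree
```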

  4. Examples: Various noninformative prior distributions
     • y | θ ∼ binomial(n, θ), p(y | θ) = (n choose y) θ^y (1 − θ)^(n−y)
     • Jeffreys' prior density p(θ) ∝ [J(θ)]^(1/2):
       log p(y | θ) = constant + y log θ + (n − y) log(1 − θ),
       J(θ) = −E[ d² log p(y | θ) / dθ² | θ ] = n / (θ(1 − θ)),
       so Jeffreys' prior gives p(θ) ∝ θ^(−1/2) (1 − θ)^(−1/2).
     • Three alternative priors (compared numerically below)
       ◦ Jeffreys' prior: θ ∼ Beta(1/2, 1/2)
       ◦ uniform prior: θ ∼ Beta(1, 1), i.e., p(θ) = 1
       ◦ improper prior: θ ∼ Beta(0, 0), i.e., p(logit θ) ∝ 1
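A minimal sketch (same hypothetical data n = 20, y = 6 as above) comparing the posteriors implied by the three priors. All three are in the Beta family, so the posterior is Beta(y + a, n − y + b); the improper Beta(0, 0) prior gives a proper posterior only when 0 < y < n.

```python
from scipy import stats

n, y = 20, 6                                     # made-up data for illustration
priors = {"Jeffreys Beta(1/2,1/2)": (0.5, 0.5),
          "uniform  Beta(1,1)":     (1.0, 1.0),
          "improper Beta(0,0)":     (0.0, 0.0)}

for name, (a, b) in priors.items():
    post = stats.beta(y + a, n - y + b)
    lo, hi = post.interval(0.95)
    print(f"{name}: mean = {post.mean():.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```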

  5. From single-parameter to multiparameter models
     • The reality of applied statistics: there are always several (maybe many) unknown parameters!
     • BUT the interest usually lies in only a few of them (the parameters of interest), while the others are regarded as nuisance parameters: quantities we do not wish to make inferences about, but which are required to construct a realistic model.
     • At this point the simple conceptual framework of the Bayesian approach reveals its principal advantage over other forms of inference.

  6. Bayesian approach to multiparameter models
     • The Bayesian approach is clear: obtain the joint posterior distribution of all unknowns, then integrate over the nuisance parameters to leave the marginal posterior distribution for the parameters of interest.
     • Alternatively, using simulation: draw samples from the entire joint posterior distribution (this may itself be computationally difficult), look at the parameters of interest, and ignore the rest.

  7. Parameter of interest and nuisance parameter
     • Suppose the model parameter θ has two parts, θ = (θ₁, θ₂)
       ◦ Parameter of interest: θ₁
       ◦ Nuisance parameter: θ₂
     • For example, y | µ, σ² ∼ normal(µ, σ²)
       ◦ Unknown: µ and σ²
       ◦ Parameter of interest (usually, not always): µ
       ◦ Nuisance parameter: σ²
     • Approaches to obtain p(θ₁ | y)
       ◦ Averaging over nuisance parameters
       ◦ Factoring the joint posterior
       ◦ A strategy for computation: conditional simulation via the Gibbs sampler

  8. Posterior distribution of θ = (θ₁, θ₂)
     • Prior of θ: p(θ) = p(θ₁, θ₂)
     • Likelihood of θ: p(y | θ) = p(y | θ₁, θ₂)
     • Posterior of θ = (θ₁, θ₂) given y:
       p(θ₁, θ₂ | y) ∝ p(θ₁, θ₂) p(y | θ₁, θ₂).

  9. Approaches to obtain the marginal posterior of θ₁, p(θ₁ | y)
     • Joint posterior of θ₁ and θ₂: p(θ₁, θ₂ | y) ∝ p(θ₁, θ₂) p(y | θ₁, θ₂)
     • Approaches to obtain the marginal posterior density p(θ₁ | y)
       ◦ By averaging or integrating over the nuisance parameter θ₂:
         p(θ₁ | y) = ∫ p(θ₁, θ₂ | y) dθ₂.
       ◦ By factoring the joint posterior:
         p(θ₁ | y) = ∫ p(θ₁, θ₂ | y) dθ₂ = ∫ p(θ₁ | θ₂, y) p(θ₂ | y) dθ₂.   (1)
     • p(θ₁ | y) is a mixture of the conditional posterior distributions given the nuisance parameter θ₂, p(θ₁ | θ₂, y).
     • The weighting function p(θ₂ | y) combines evidence from the data and the prior.
     • θ₂ can be categorical (discrete) and may take only a few possible values, representing, for example, different sub-models (a small discrete illustration follows).
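A minimal sketch (all numbers invented) of equation (1) when θ₂ is discrete: p(θ₁ | y) is a weighted mixture of the conditional posteriors p(θ₁ | θ₂, y), with weights p(θ₂ | y). Here θ₂ indexes two hypothetical sub-models whose conditional posteriors for θ₁ are normal.

```python
import numpy as np
from scipy import stats

theta1 = np.linspace(-4, 6, 501)
cond_posts = [stats.norm(0.0, 1.0), stats.norm(2.0, 0.5)]   # p(θ₁ | θ₂ = k, y), hypothetical
weights = np.array([0.3, 0.7])                              # p(θ₂ = k | y), hypothetical

marginal = sum(w * d.pdf(theta1) for w, d in zip(weights, cond_posts))
dx = theta1[1] - theta1[0]
print(marginal.sum() * dx)   # ≈ 1: the mixture is itself a proper density
```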

  10. A strategy for computation: simulations instead of integration
     We rarely evaluate integral (1) explicitly, but it suggests an important strategy for constructing and computing with multiparameter models, using simulations.
     • Successive conditional simulations
       ◦ Draw θ₂ from its marginal posterior distribution, p(θ₂ | y).
       ◦ Draw θ₁ from its conditional posterior distribution given the drawn value of θ₂, p(θ₁ | θ₂, y).
     • All-others conditional simulations (Gibbs sampler)
       ◦ Draw θ₁^(t+1) from its conditional posterior distribution given the previously drawn value θ₂^(t), i.e., from p(θ₁ | θ₂^(t), y).
       ◦ Draw θ₂^(t+1) from its conditional posterior distribution given the just-drawn value θ₁^(t+1), i.e., from p(θ₂ | θ₁^(t+1), y).
       ◦ Iterating this procedure ultimately generates samples from the joint posterior distribution p(θ₁, θ₂ | y), and hence from the marginal posteriors (a generic sketch of both strategies follows).
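A minimal sketch of the two simulation strategies. The draw functions draw_theta2_marginal, draw_theta1_given_theta2, and draw_theta2_given_theta1 are hypothetical placeholders for whatever model-specific samplers are available; they are not defined on the slides.

```python
import numpy as np

def successive_conditional(draw_theta2_marginal, draw_theta1_given_theta2, n_draws, rng):
    """Exact draws from p(θ1, θ2 | y) when p(θ2 | y) can be sampled directly."""
    draws = []
    for _ in range(n_draws):
        theta2 = draw_theta2_marginal(rng)                 # θ2 ~ p(θ2 | y)
        theta1 = draw_theta1_given_theta2(theta2, rng)     # θ1 ~ p(θ1 | θ2, y)
        draws.append((theta1, theta2))
    return np.array(draws)

def gibbs(draw_theta1_given_theta2, draw_theta2_given_theta1, theta2_init, n_iter, rng):
    """Markov chain whose stationary distribution is p(θ1, θ2 | y)."""
    theta2 = theta2_init
    chain = []
    for _ in range(n_iter):
        theta1 = draw_theta1_given_theta2(theta2, rng)     # θ1^(t+1) | θ2^(t)
        theta2 = draw_theta2_given_theta1(theta1, rng)     # θ2^(t+1) | θ1^(t+1)
        chain.append((theta1, theta2))
    return np.array(chain)
```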

  11. Multiparameter model: the normal model (I)
     • y₁, …, yₙ iid ∼ normal(µ, σ²), both µ and σ² unknown; use the Bayesian approach to estimate µ.
       ◦ Choose a prior for (µ, σ²); take noninformative priors:
         p(µ, σ²) = p(µ) p(σ²) ∝ 1 · (σ²)^(−1) = σ^(−2)
         • prior independence of location and scale
         • p(µ) ∝ 1: noninformative (uniform) but improper prior
         • p(log σ²) ∝ 1 ⇒ p(σ²) ∝ (σ²)^(−1): noninformative, uniform on log σ²
       ◦ Likelihood:
         p(y | µ, σ²) = ∏ᵢ (1 / (√(2π) σ)) exp( −(yᵢ − µ)² / (2σ²) )
                      ∝ σ^(−n) exp( −(1/(2σ²)) Σᵢ (yᵢ − µ)² ),
         with the product and sum over i = 1, …, n.

  12. Joint posterior distribution, p(µ, σ² | y)
     • y₁, …, yₙ iid ∼ normal(µ, σ²)
       ◦ prior of (µ, σ²): p(µ, σ²) = p(µ) p(σ²) ∝ 1 · (σ²)^(−1) = σ^(−2)
       ◦ the joint posterior distribution of (µ, σ²):
         p(µ, σ² | y) ∝ p(µ, σ²) p(y | µ, σ²)
                      ∝ σ^(−n−2) exp( −(1/(2σ²)) Σᵢ (yᵢ − µ)² )
                      = σ^(−n−2) exp( −(1/(2σ²)) [ Σᵢ (yᵢ − ȳ)² + n(ȳ − µ)² ] )
                      = σ^(−n−2) exp( −(1/(2σ²)) [ (n − 1)s² + n(ȳ − µ)² ] ),
         where s² = (1/(n − 1)) Σᵢ (yᵢ − ȳ)² is the sample variance. The sufficient statistics are ȳ and s² (a small grid-evaluation sketch follows).
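A minimal sketch (simulated data with invented parameters) that evaluates the unnormalized log joint posterior log p(µ, σ² | y) on a grid, using only the sufficient statistics ȳ and s².

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=30)            # hypothetical data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

mu_grid = np.linspace(ybar - 3, ybar + 3, 200)
sigma2_grid = np.linspace(0.5, 12.0, 200)
mu, sigma2 = np.meshgrid(mu_grid, sigma2_grid)

# log p(µ, σ² | y) = −(n+2)/2 · log σ² − [(n−1)s² + n(ȳ−µ)²] / (2σ²) + const
log_post = -(n + 2) / 2 * np.log(sigma2) - ((n - 1) * s2 + n * (ybar - mu) ** 2) / (2 * sigma2)

# the joint mode is at µ = ȳ and σ² = (n−1)s²/(n+2) under this posterior
i, j = np.unravel_index(log_post.argmax(), log_post.shape)
print("grid mode:", mu[i, j], sigma2[i, j])
print("analytic mode:", ybar, (n - 1) * s2 / (n + 2))
```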

  13. Conditional posterior distribution, p(µ | σ², y)
     • p(µ, σ² | y) = p(µ | σ², y) p(σ² | y)
     • Using the single-parameter case of µ with known σ² and noninformative prior p(µ) ∝ 1, we have
       µ | σ², y ∼ normal(ȳ, σ²/n).

  14. Marginal posterior distribution, p(σ² | y)
     • p(µ, σ² | y) = p(µ | σ², y) p(σ² | y)
     • p(σ² | y) requires averaging the joint posterior
       p(µ, σ² | y) ∝ σ^(−n−2) exp( −(1/(2σ²)) [ (n − 1)s² + n(ȳ − µ)² ] )
       over µ, that is, evaluating the simple normal integral
       ∫ exp( −(1/(2σ²)) n(ȳ − µ)² ) dµ = √(2πσ²/n),
       thus
       p(σ² | y) ∝ (σ²)^(−(n+1)/2) exp( −(n − 1)s² / (2σ²) ),
       i.e., σ² | y ∼ Inv-χ²(n − 1, s²), a scaled inverse-χ² distribution (sampling sketch below).
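A minimal sketch of how to draw from the scaled inverse-χ² distribution Inv-χ²(ν, s²): if X ∼ χ²_ν then ν s² / X ∼ Inv-χ²(ν, s²), which is also an inverse-gamma distribution with shape ν/2 and scale ν s²/2. The numbers ν = 29, s² = 3.7 are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nu, s2 = 29, 3.7                               # hypothetical degrees of freedom and scale
draws = nu * s2 / rng.chisquare(nu, size=100_000)   # scaled Inv-χ²(ν, s²) draws

# compare with the equivalent inverse-gamma parameterization
print(draws.mean(), stats.invgamma(a=nu / 2, scale=nu * s2 / 2).mean())
```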

  15. Analytic form of the marginal posterior distribution of µ
     • µ is typically the estimand of interest, so the ultimate objective of the Bayesian analysis is the marginal posterior distribution of µ. This can be obtained by integrating σ² out of the joint posterior distribution, or easily done by simulation: first draw σ² from p(σ² | y), then draw µ from p(µ | σ², y) (a sketch of this two-step simulation follows).
     • The posterior distribution of µ, p(µ | y), can be thought of as a mixture of normal distributions, mixed over the scaled inverse-χ² distribution for the variance; this is a rare case where analytic results are available.
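A minimal sketch (simulated data) of the two-step simulation: draw σ² from its scaled Inv-χ²(n − 1, s²) marginal posterior, then µ | σ², y from normal(ȳ, σ²/n).

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=30)                      # hypothetical data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

n_draws = 50_000
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)   # σ² | y
mu = rng.normal(ybar, np.sqrt(sigma2 / n))                   # µ | σ², y

print("posterior mean of µ ≈", mu.mean())
print("95% posterior interval for µ ≈", np.percentile(mu, [2.5, 97.5]))
```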

  16. Performing the integration
     • We start by integrating the joint posterior density over σ²:
       p(µ | y) = ∫₀^∞ p(µ, σ² | y) dσ²
     • With the substitution z = A/(2σ²), where A = (n − 1)s² + n(µ − ȳ)², the result is an unnormalized gamma integral:
       p(µ | y) ∝ A^(−n/2) ∫₀^∞ z^((n−2)/2) exp(−z) dz
               ∝ [ (n − 1)s² + n(µ − ȳ)² ]^(−n/2)
               ∝ [ 1 + n(µ − ȳ)² / ((n − 1)s²) ]^(−n/2)
     • µ | y ∼ t_{n−1}(ȳ, s²/n) (a simulation check follows).
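A minimal check (simulated data) that the two-step draws of µ agree with the analytic result µ | y ∼ t_{n−1}(ȳ, s²/n): after standardizing, (µ − ȳ)/(s/√n) should follow a standard t distribution with n − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(5.0, 2.0, size=30)                      # hypothetical data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=100_000)   # σ² | y
mu = rng.normal(ybar, np.sqrt(sigma2 / n))                   # µ | σ², y
z = (mu - ybar) / np.sqrt(s2 / n)                            # standardized draws

# quantiles of the standardized draws vs. the t_{n−1} reference
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(z, q), stats.t(df=n - 1).ppf(q))
```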
