Introduction to MCMC and BUGS ◮ Basic recipes, and a sample of some techniques for getting started. ◮ Two simple worked out examples. ◮ R-code will be provided for those. ◮ No background in MCMC assumed. ◮ Not for experts!
MCMC: Reminder Slide Outline: ◮ Motivation ◮ Monte Carlo integration ◮ Markov chains ◮ MCMC
Bayesian Inference ◮ Data: Y (realisation y ) ◮ Parameters, latent variables: θ = ( θ 1 , θ 2 , . . . , θ k ) Assume: ◮ Likelihood: f ( y | θ ) . Distribution of data given parameters. ◮ Prior: π ( θ ) . Prior distribution, before we observe data. Inference is based on the joint posterior distribution. f ( y | θ ) π ( θ ) π ( θ | y ) = f ( y | θ ) π ( θ ) d θ R f ( y | θ ) π ( θ ) ∝ i . e . Posterior Likelihood × Prior ∝
Example 1: Normal-Cauchy iid ◮ Suppose Y 1 , . . . , Y n ∼ N ( θ, 1 ) . Don’t know θ . ◮ iid=independently and identically distributed. 1 ◮ Assume Cauchy prior distribution: π ( θ ) = π ( 1 + θ 2 ) . Posterior density: � � P n i = 1 ( y i − θ ) 2 π ( θ | y ) 1 ∝ exp − × 1 + θ 2 2 � � − n ( θ − ¯ y ) 2 1 ∝ × exp (can be shown) 1 + θ 2 2 Things of interest to Bayesians: � ∞ ◮ Posterior Mean = E ( θ | y ) = −∞ θ π ( θ | y ) d θ . ◮ Posterior Variance = Var ( θ | y ) = E ( θ 2 | y ) − { E ( θ | y ) } 2 . ◮ Credible interval: { a ( y ) , b ( y ) } for θ such that Pr { a ( y ) < θ < b ( y ) | y } = 0 . 95, for example.
Example 2: One sample normal. iid ◮ Data: Y 1 , . . . , Y n ∼ N ( µ, σ 2 ) . ◮ Let precision = τ = 1 /σ 2 . ◮ Assume a ‘flat’ prior distribution for µ , i.e. π ( µ ) = 1 ◮ and let τ ∼ Gamma ( a , b ) , a , b > 0, has mean a / b . π ( µ, τ ) ∝ τ a − 1 e − b τ . ◮ Joint posterior density: � � ( τ ) n / 2 + a − 1 exp − τ P ( y i − µ ) 2 π ( µ, τ | y ) − b τ ∝ 2 For making inference, we want quantities like: � ∞ � ∞ ◮ Posterior Mean = E ( µ | y ) = µ π ( µ, τ | y ) d µ d τ . −∞ 0 ◮ Posterior Variance = Var ( µ | y ) = Horrible double integrals! Difficult to solve in general. ◮ Credible intervals involve even more of these!
MCMC: Reminder Slide Outline: ◮ Motivation ◮ Monte Carlo integration ◮ Markov chains ◮ MCMC
Monte Carlo (MC) integration ◮ General problem: Evaluate the expectation of any general function h ( X ) of the random variable X : � E π [ h ( X )] = h ( x ) π ( x ) dx � | h ( x ) | π ( x ) dx < ∞ . assuming that it exists, i.e. ◮ This can be very difficult, but if we can draw samples X ( 1 ) , X ( 2 ) , . . . , X ( N ) ∼ π ( x ) then we can estimate E π [ h ( X )] by N � � X ( t ) � h N = 1 h ¯ . N t = 1 ◮ This is the basic idea of statistics: “Estimate unknowns by drawing samples.” ◮ Change notation: Parameters = θ ≡ x ; Posterior = π ( θ | y ) = π ( x ) . Replace y by its value and suppress.
Consistency ◮ For independent samples, by Law of Large Numbers (LLN), N � � X ( t ) � 1 h N ¯ h = N t = 1 E π [ h ( X )] as N → ∞ . → (1) ◮ But independent sampling from π ( x ) is usually difficult. ◮ It turns out that (1), LLN, still applies if we generate samples using a Markov chain. But first, some revision of Markov chains.
MCMC: Reminder Slide Outline: ◮ Motivation ◮ Monte Carlo integration ◮ Markov chains ◮ MCMC
Markov chains ◮ A Markov chain is generated by sampling X ( t + 1 ) ∼ p ( x | x ( t ) ) , t = 0 , 1 , 2 , . . . . � ✒ � � � p is called the transition kernel. ◮ So X ( t + 1 ) depends only on X ( t ) , not on X ( 0 ) , X ( 1 ) , . . . , X ( t − 1 ) . p ( X ( t + 1 ) | x ( t ) , x ( t − 1 ) , . . . ) = p ( X ( t + 1 ) | x ( t ) ) ◮ For example: X ( t + 1 ) | x ( t ) ∼ N ( 0 . 5 x ( t ) , 1 . 0 ) . ◮ This is called a first order auto-regressive process with lag-1 auto-correlation 0.5.
◮ Simulation of the chain: X ( t + 1 ) | x ( t ) ∼ N ( 0 . 5 x ( t ) , 1 . 0 ) with two different starting points x ( 0 ) (one red and one blue). 4 2 0 –2 –4 0 10 20 30 40 50 60 i ◮ After about 5–7 iterations the chains seemed to have forgotten their starting positions. ◮ Caution: Have p ( X ( t + 1 ) | x ( t ) ) = p ( X ( t + 1 ) | x ( t ) , x ( t − 1 ) ) , it does not mean that X ( t + 1 ) and X ( t − 1 ) are independent marginally.
Stationarity As t → ∞ , the Markov chain converges to its stationary distribution. � ✒ ✻ � � in distribution or invariant ◮ In the above example (it can be proved that), this is X ( t ) | x ( 0 ) ∼ N ( 0 . 0 , 1 . 33 ) , as t → ∞ which does not depend on x ( 0 ) . ◮ Does this happen for all Markov chains?
Irreducibility ◮ Assuming that a stationary distribution exists, it is unique if the chain is irreducible . ◮ Irreducible means any set of states (values) can be reached from any other state (value) in a finite number of moves. ◮ An example of a reducible Markov chain: Suppose p ( x | y ) = 0 for x ∈ A and y ∈ B and vice versa. ✗✔ B ✗✔ ✖✕ A ✖✕ ◮ Here the whole set of values is the rectangle.
Aperiodicity A Markov chain taking only a finite number of values is aperiodic if greatest common divisor (g.c.d.) of return times to any particular state i say, is 1. ◮ Think of recording the number of steps taken to return to the state 1. The g.c.d. of those numbers should be 1. ◮ If the g.c.d. is bigger than 1, 2 say, then the chain will return in cycles of 2, 4, 6, ... number of steps. This is not allowed for aperiodicity. ◮ Definition can be extended to general state space cases (where the Markov chain can take any value in a continuous interval).
Ergodicity Suppose that the Markov chain has the stationary distribution π ( x ) and is aperiodic and irreducible. Then it is said to be ergodic . ◮ Ergodic theorem : N � � X ( t ) � 1 h N h ¯ = N t = 1 E π [ h ( X )] as N → ∞ . → h N is called an ergodic average. ◮ ¯ ◮ If h = Var π [ h ( X )] < ∞ , σ 2 then the central limit theorem holds, i.e, ¯ h N converges to the normal distribution, ◮ and this convergence occurs geometrically fast.
Numerical standard errors (nse) � h N is h N ) , and for large N The nse of ¯ Var π (¯ � � � � N − 1 � � � ¯ � � σ 2 h N h ρ ℓ ( h ) nse ≈ 1 + 2 N ℓ = 1 � � where ρ ℓ ( h ) is the lag- ℓ auto-correlation in h ( X ( t ) ) . ◮ In general no simpler expression exist for the nse. ◮ See Geyer (1992), and Besag and Green (1993) for ideas and further references.
Numerical standard errors (nse) ... � � h ( X ( t ) ) ◮ If can be approximated as a first order auto-regressive process then � � ¯ � σ 2 1 + ρ h N h nse ≈ 1 − ρ, N � � h ( X ( t ) ) where ρ is the lag-1 auto-correlation of . ◮ The first factor is the usual term under independent sampling. ◮ The second term is usually > 1. ◮ And thus is the penalty to be paid for using a Markov chain instead of independent sampling.
Numerical standard errors (nse) ... Moreover, ◮ the nse may not be finite in general. ◮ the nse is finite if the chain converges geometrically. ◮ if the nse is finite, then we can make it as small as we like by increasing N . ◮ it can be proved that the following simple minded estimator of nse is not consistent. � � � � N − 1 � � � ¯ � � σ 2 h N h r ℓ ( h ) ˆ nse = 1 + 2 N ℓ = 1 where r ℓ ( h ) is the sample lag- ℓ auto-correlation in � � h ( x ( t ) ) . Will return to the estimation of nse later.
Markov chains – summary ◮ A Markov chain may have a stationary distribution. ◮ The stationary distribution is unique if the chain is irreducible. ◮ We can estimate nse’s if the chain is also geometrically convergent. Where does this all get us?
MCMC: Reminder Slide Outline: ◮ Motivation ◮ Monte Carlo integration ◮ Markov chains ◮ MCMC
MCMC ◮ How do we construct a Markov chain whose stationary distribution is our posterior distribution, π ( x ) ? ◮ Metropolis et al (1953) showed us how. ◮ The method was generalized by Hastings (1970). This is called Markov chain Monte Carlo (MCMC).
Metropolis-Hastings algorithm At each iteration t : � y | x ( t ) � y ∼ q . Step 1 Sample ❅ ■ ❅ � ✒ � “Proposal” distribution � “candidate” point Step 2 Then with probability � � � � q x ( t ) | y 1 , π ( y ) α ( x ( t ) , y ) = min π ( x ( t ) ) q ( y | x ( t ) ) set x ( t + 1 ) = y (acceptance). If not accepted, set x ( t + 1 ) = x ( t ) (rejection).
Metropolis-Hastings algorithm... ◮ The normalising constant in π ( x ) is not required to run the algorithm. It cancels out in the ratio. ◮ If q ( y | x ) = π ( y ) , then we obtain independent samples, α ( x , y ) = 1 always. ◮ Usually q is chosen so that q ( y | x ) is easy to sample from. ◮ Theoretically, any density q ( ·| x ) having the same support (range) should work. However, some q ’s are better than others. See later. ◮ The induced Markov chains have the desirable properties under mild conditions on π ( x ) . ◮ π ( x ) is often called the target distribution . Recall the notation change θ to x .
Implementing MCMC : Reminder Slide ◮ Flavours of Metropolis-Hastings ◮ Gibbs Sampler ◮ Number of Chains ◮ Burn-in and run length ◮ Numerical standard errors
Recommend
More recommend