Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8–9 in Givens & Hoeting, Chapters 25–27 in Lange 2 Updated: April 4, 2016 1 / 42
Outline 1 Introduction 2 Crash course on Markov chains 3 Motivation, revisited 4 Metropolis–Hastings algorithm 5 Gibbs sampler 6 Some MCMC diagnostics 7 Conclusion 2 / 42
Motivation We know how to sample independent random variables from the target distribution f ( x ), at least approximately. Monte Carlo uses these simulated random variables to approximate integrals. But the random variables don’t need to be independent in order to accurately approximate integrals! Markov Chain Monte Carlo (MCMC) constructs a dependent sequence of random variables that can be used to approximate the integrals just like for ordinary Monte Carlo. The advantage of introducing this dependence is that very general “black-box” algorithms (and corresponding theory) are available to perform the required simulations. 3 / 42
Initial remarks In some sense, MCMC is basically a black-box approach that works for almost all problems. The previous remark is dangerous — it’s always a bad idea to use tools without knowing that they will work. Here we will discuss some basics of Markov chains and MCMC but know that there are very important unanswered questions about how and when MCMC works. 3 3 MCMC is an active area of research; despite the many developments in MCMC, according to Diaconis ( ∼ 2008), still “almost nothing” is known! 4 / 42
Outline 1 Introduction 2 Crash course on Markov chains 3 Motivation, revisited 4 Metropolis–Hastings algorithm 5 Gibbs sampler 6 Some MCMC diagnostics 7 Conclusion 5 / 42
Markov chains A Markov chain is just a sequence of random variables { X 1 , X 2 , . . . } with a specific type of dependence structure. In particular, a Markov chain satisfies P( X n +1 ∈ B | X 1 , . . . , X n ) = P( X n +1 ∈ B | X n ) , ( ⋆ ) i.e., the future, given the past and present, depends only on the present. Independence is a trivial Markov chain. From ( ⋆ ) we can argue that the probabilistic properties of the chain are completely determined by: initial distribution for X 0 , and the transition distribution , distribution of X n +1 , given X n . 4 4 Assume the Markov chain is homogeneous , so that the transition distribution does not depend on n . 6 / 42
Example – simple random walk iid Let U 1 , U 2 , . . . ∼ Unif( {− 1 , 1 } ). Set X 0 = 0 and let X n = � n i =1 U i = X n − 1 + U n . The initial distribution is P { X 0 = 0 } = 1. The transition distribution is determined by � X n − 1 − 1 with prob 1 / 2 X n = X n − 1 + 1 with prob 1 / 2 While very simple, the random walk is an important example in probability, having connections to advanced things like Brownian motion. 7 / 42
Keywords 5 A state A is recurrent if a chain starting in A will eventually return to A with probability 1. This state is nonull if the expected time to return is finite. A chain is called recurrent if each state is recurrent. A Markov chain is irreducible if there is positive probability that a chain starting in a state A can reach any other state B . A Markov chain is aperiodic if, for a starting state A , there is no constraint on the times at which the chain can return to A . An irreducible, aperiodic Markov chain with all states being nonnull recurrent is called ergodic . 5 Not mathematically precise! 8 / 42
Limit theory 6 f is a stationary distribution if X 0 ∼ f implies X n ∼ f for all n . An ergodic Markov chain has at most one stationary dist. Furthermore, if the chain is ergodic, then � n →∞ P( X m + n ∈ B | X m ∈ A ) = lim f ( x ) dx , for all A , B , m . B Even further, if ϕ ( x ) is integrable, then n 1 � � ϕ ( X t ) → ϕ ( x ) f ( x ) dx , with prob 1 . n t =1 This is a version of the famous ergodic theorem . There are also central limit theorems for Markov chains, but I won’t say anything about this. 6 Again, not mathematically precise! 9 / 42
Outline 1 Introduction 2 Crash course on Markov chains 3 Motivation, revisited 4 Metropolis–Hastings algorithm 5 Gibbs sampler 6 Some MCMC diagnostics 7 Conclusion 10 / 42
Why MCMC? In Monte Carlo applications, we want to generate random variables with distribution f . This could be difficult or impossible to do exactly. MCMC is designed to construct an ergodic Markov chain with f as its stationary distribution. Asymptotically, the chain will resemble samples from f . In particular, by the ergodic theorem, expectations with respect to f can be approximated by averages. Somewhat surprising is that it is actually quite easy to construct and simulate a suitable Markov chain, explaining why MCMC methods have become so popular. But, of course, there are practical and theoretical challenges... 11 / 42
Outline 1 Introduction 2 Crash course on Markov chains 3 Motivation, revisited 4 Metropolis–Hastings algorithm 5 Gibbs sampler 6 Some MCMC diagnostics 7 Conclusion 12 / 42
Details Let f ( x ) denote the target distribution pdf. Let q ( x | y ) denote a conditional pdf for X , given Y = y ; this pdf should be easy to sample from. Given X 0 , the Metropolis–Hasting algorithm (MH) produces a sequence of random variables as follows: 1 Sample X ⋆ t ∼ q ( x | X t − 1 ). 2 Compute f ( X ⋆ t ) q ( X t − 1 | X ⋆ t ) � � R = min 1 , . f ( X t − 1 ) q ( X ⋆ t | X t − 1 ) 3 Set X t = X ⋆ t with probability R ; otherwise, X t = X t − 1 . General R code to implement MH is on course website. 13 / 42
Details (cont.) The proposal distribution is not easy to choose, and the performance of the algorithm depends on this choice. Two general strategies are: Take the proposal q ( x | y ) = q ( x ); i.e., at each stage of the MH algorithm, X ⋆ t does not depend on X t − 1 . Take q ( x | y ) = q 0 ( x − y ), for a symmetric distribution with pdf q 0 , which amounts to a random walk proposal. This is one aspect of the MCMC implementation that requires a lot of care from the user; deeper understanding is needed to really see how the proposal affects the performance. In my examples, I will just pick a proposal that seems to work reasonably well... 14 / 42
Details (cont.) 7 Assuming the proposal is not too bad, then a number of things can be shown about the sequence { X t : t ≥ 1 } : the chain is ergodic, and the target f is the stationary distribution. Consequently, the sequence converges to the stationary distribution and, for any integrable function ϕ ( x ), we can approximate integrals with sample averages. So, provided that we run the simulation “long enough,” we should be able to get arbitrarily good approximations. This is an interesting scenario where we, as statisticians, are able to control the sample size! 7 Again, not mathematically precise! 15 / 42
Example: “cosine model” Consider the problem from old homework, where the likelihood function is n � L ( θ ) ∝ { 1 − cos( X i − θ ) } , − π ≤ θ ≤ π. i =1 Observed data ( X 1 , . . . , X n ) given in the code. Assume that θ is given a Unif( − π, π ) prior distribution. Use MH to sample from the posterior: Proposal: q ( θ ′ | θ ) = Unif( θ ′ | θ ± 0 . 5). Burn-in: B = 5000. Sample size: M = 10000. 16 / 42
Example: “cosine model” (cont.) Left figure shows histogram of the MCMC sample with posterior density overlaid. Right figure shows a “trace plot” of the chain. 2.0 1.0 1.5 0.5 Density 1.0 θ 0.0 0.5 −0.5 0.0 −0.5 0.0 0.5 1.0 0 2000 4000 6000 8000 10000 θ Iteration 17 / 42
Example: Weibull model Data X 1 , . . . , X n has Weibull likelihood n n L ( α, η ) ∝ α n η n exp � � � � α log X i − η X α . i i =1 i =1 Prior: π ( α, η ) ∝ e − α η b − 1 e − c η , for some ( b , c ). Posterior density is proportional to � n n α n η n + b − 1 exp � � � �� � � X α α log X i − 1 − η c + . i i =1 i =1 Exponential is a special case of Weibull, when α = 1. Goal: informal Bayesian test of H 0 : α = 1. 18 / 42
Example: Weibull model (cont.) Data from Problem 7.11 in Ghosh et al (2006). Use MH to sample from posterior of ( α, η ). Proposal: ( α ′ , η ′ ) | ( α, η ) ∼ Exp( α ) × Exp( η ). b = 2 and c = 1; B = 1000 and M = 10000. Histogram shows marginal posterior of α . Is an exponential model ( α = 1) reasonable? 0.8 0.6 Density 0.4 0.2 0.0 1 2 3 4 α 19 / 42
Example: logistic regression Based on Examples 1.13 and 7.11 in Robert & Casella’s book. In 1986, the Challenger space shuttle exploded during take-off, the result of an “o-ring” failure. Failure may have been due to the cold temperature (31 ◦ F). Goal: Analyze the relationship between temperature and o-ring failure. In particular, fit a logistic regression model. 20 / 42
Example: logistic regression (cont.) Model: Y | x ∼ Ber( p ( x )), x = temperature. Failure probability , p ( x ), is of the form exp( α + β x ) p ( x ) = 1 + exp( α + β x ) . Using available data, fit logistic regression using glm . Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 15.0429 7.3786 2.039 0.0415 * x -0.2322 0.1082 -2.145 0.0320 * --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Note that � p (31) ≈ 0 . 999!!! 21 / 42
Recommend
More recommend