Advanced Machine Learning MCMC Methods Amit Sethi Electrical Engineering, IIT Bombay
Objectives
• We have talked about:
– Exact inference in factor graphs using the Sum-Product algorithm (aka Belief Propagation)
– Limitations of the Sum-Product algorithm and some remedies
• Today we will learn:
– Sampling methods (aka Monte Carlo methods) for when exact inference is intractable
We want to find the expected value of a function, e.g. when calculating messages
• The goal is to evaluate:
E[f] = ∫ f(z) p(z) dz
• It may not be feasible to compute this integral analytically, but
– Computing f(z) for a given z may be easy
– So, we need to draw samples z^(l) from p(z) and use the estimate:
E[f] ≈ (1/L) Σ_l f(z^(l))
Source: “Pattern Recognition and Machine Learning”, book and slides by Christopher Bishop
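The estimate above can be sketched in a few lines of Python. This is a hypothetical example (not from the slides): f(z) = z² and p(z) a standard normal, so the exact expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(z)] with f(z) = z**2 and p(z) = N(0, 1);
# the exact answer is Var[z] = 1.
L = 100_000
samples = rng.standard_normal(L)   # draws z^(l) from p(z)
estimate = np.mean(samples ** 2)   # (1/L) * sum_l f(z^(l))
```

The estimator converges at rate O(1/√L) regardless of the dimensionality of z, which is the main appeal of Monte Carlo methods.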
We will look at the following ways to sample from a distribution
• Rejection sampling
• Importance sampling
• Gibbs sampling
Sampling marginals
• Note that this procedure can be applied to generate samples for marginals as well
• Simply discard the portions of each sample that are not needed
– e.g. for the marginal p(rain), the sample (cloudy = t, sprinkler = f, rain = t, w = t) just becomes (rain = t)
• This is still a fair sampling procedure
• But anything more complex can be a problem
When the partition function is unknown
• Consider the case of an arbitrary, continuous p(z)
• How can we draw samples from it?
• Assume that we can evaluate p(z) up to some normalizing constant, efficiently (e.g. an MRF)
Rejection sampling makes use of an easier “proposal” distribution
• Assume that we have some simpler distribution q(z), called a proposal distribution, from which we can easily draw samples
– e.g. q(z) is a Gaussian
• We can then draw samples from q(z) and use them, if we have a way to convert them into samples from p(z)
Now, we reject samples according to the ratio of p and q at z
• Introduce a constant k such that kq(z) ≥ p(z) for all z
• Rejection sampling procedure:
– Generate z₀ from q(z)
– Generate u₀ uniformly from [0, kq(z₀)]
– If u₀ > p(z₀), reject z₀; otherwise keep it
• The original samples fall under the curve kq(z)
• The kept samples fall under the curve p(z), hence they are samples from p(z)
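The procedure above can be sketched as follows. This is a minimal illustration with an assumed unnormalized target p̃(z) and a Gaussian proposal; the bound k is chosen generously by hand rather than optimized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized target density p~(z).
def p_tilde(z):
    return np.exp(-z**2 / 2) * (1 + np.sin(3 * z) ** 2)

# Proposal q(z) = N(0, 2^2); k chosen so that k*q(z) >= p~(z) for all z.
sigma_q = 2.0
def q(z):
    return np.exp(-z**2 / (2 * sigma_q**2)) / (sigma_q * np.sqrt(2 * np.pi))
k = 12.0

def rejection_sample(n):
    kept = []
    while len(kept) < n:
        z0 = rng.normal(0.0, sigma_q)           # draw from q(z)
        u0 = rng.uniform(0.0, k * q(z0))        # uniform under k*q(z0)
        if u0 <= p_tilde(z0):                   # keep, else reject
            kept.append(z0)
    return np.array(kept)

samples = rejection_sample(5000)
```

Since p̃ here is symmetric about zero, the sample mean should be close to 0; the fraction of draws kept is roughly Z/k, where Z is the unknown normalizer.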
Rejection sampling can end up rejecting a lot of samples from q
• How likely are we to keep samples? The probability that a sample is accepted is:
p(accept) = ∫ [p(z) / (kq(z))] q(z) dz = (1/k) ∫ p(z) dz
• A smaller k is better, subject to kq(z) ≥ p(z) for all z
– If q(z) is similar to p(z), this is easier to achieve
• In high-dimensional spaces, the acceptance ratio falls off exponentially, and finding a suitable k becomes challenging
In importance sampling, we scale the weight of each sample by the ratio of p and q
• Approximate the expectation by drawing points from q(z):
E[f] = ∫ f(z) p(z) dz = ∫ f(z) [p(z)/q(z)] q(z) dz ≈ (1/L) Σ_l f(z^(l)) p(z^(l))/q(z^(l))
• The quantity p(z^(l))/q(z^(l)) is known as the importance weight
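A minimal sketch of importance sampling, with hypothetical choices: target p(z) = N(0, 1), proposal q(z) = N(0, 2²), f(z) = z², so the exact expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from the proposal q(z) = N(0, sigma_q^2) instead of the target.
L = 200_000
sigma_q = 2.0
z = rng.normal(0.0, sigma_q, size=L)

# Importance weights p(z)/q(z), computed in log space for stability.
log_p = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
log_q = -0.5 * (z / sigma_q) ** 2 - np.log(sigma_q) - 0.5 * np.log(2 * np.pi)
w = np.exp(log_p - log_q)

# Weighted Monte Carlo estimate of E_p[f(z)] with f(z) = z**2.
estimate = np.mean(w * z**2)
```

Note that every draw is used (no rejection), but the estimator's variance blows up when q is a poor match for p, i.e. when a few weights dominate.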
MCMC methods generate samples sequentially
• Markov chain Monte Carlo methods use a Markov chain, i.e. a sequence in which each sample depends on the previous one: z^(1), z^(2), …, z^(τ)
• Transitions of the Markov chain form the proposal distribution q(z | z^(τ))
• Asymptotically, these samples are drawn from the desired distribution p(z)
Metropolis algorithm assumes the proposal distribution is symmetric
• Propose z* from a symmetric q(z* | z^(τ)), i.e. q(z_A | z_B) = q(z_B | z_A)
• Accept z* with probability A(z*, z^(τ)) = min(1, p̃(z*) / p̃(z^(τ))), where p̃ is the unnormalized target
• If z* is rejected, the chain stays at the current state
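A minimal sketch of the Metropolis algorithm, assuming a symmetric Gaussian random-walk proposal and a standard-normal target (illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized target: p~(z) proportional to exp(-z**2 / 2).
def p_tilde(z):
    return np.exp(-z**2 / 2)

def metropolis(n_steps, step_size=1.0, z0=0.0):
    z = z0
    chain = []
    for _ in range(n_steps):
        z_star = z + rng.normal(0.0, step_size)   # symmetric proposal
        # Accept with probability min(1, p~(z*) / p~(z)).
        if rng.uniform() < min(1.0, p_tilde(z_star) / p_tilde(z)):
            z = z_star
        chain.append(z)   # on rejection, the current state is repeated
    return np.array(chain)

chain = metropolis(50_000)
```

After discarding an initial burn-in, the chain's mean and variance should approach those of N(0, 1).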
Visualizing Metropolis algorithm
(figure: accepted and rejected random-walk steps over the contours of a 2D Gaussian target)
Metropolis-Hastings algorithm generalizes MA for non-symmetric transitions
• The acceptance probability becomes:
A(z*, z^(τ)) = min(1, [p̃(z*) q(z^(τ) | z*)] / [p̃(z^(τ)) q(z* | z^(τ))])
• For a symmetric q, the q terms cancel and this reduces to the Metropolis algorithm
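A sketch of Metropolis-Hastings with a deliberately non-symmetric proposal (a hypothetical shrink-toward-zero random walk), showing where the Hastings correction enters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: p~(z) proportional to exp(-z**2 / 2). The proposal
# shrinks the current state toward 0 before adding noise, so
# q(z*|z) != q(z|z*) and the Hastings correction is required.
def log_p_tilde(z):
    return -z**2 / 2

def log_q(z_to, z_from, shrink=0.9, s=1.0):
    # log density (up to a constant) of proposing z_to from z_from
    return -0.5 * ((z_to - shrink * z_from) / s) ** 2

def metropolis_hastings(n_steps, z0=0.0):
    z = z0
    chain = []
    for _ in range(n_steps):
        z_star = 0.9 * z + rng.normal()
        # log of [p~(z*) q(z|z*)] / [p~(z) q(z*|z)]
        log_a = (log_p_tilde(z_star) + log_q(z, z_star)
                 - log_p_tilde(z) - log_q(z_star, z))
        if np.log(rng.uniform()) < log_a:
            z = z_star
        chain.append(z)
    return np.array(chain)

chain = metropolis_hastings(50_000)
```

Without the q ratio, this chain would converge to the wrong distribution (the stationary distribution of the shrinking walk); with it, the chain targets N(0, 1).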
Gibbs Sampling is a simple coordinate-wise MCMC method without a separate proposal distribution
• Cycle through the variables z_1, …, z_M, replacing each z_k with a sample from its conditional p(z_k | z_{\k}) given the current values of all the other variables
Markov Blanket of an MRF
• It is simply the set of neighbouring nodes
Example: estimate the probability of one node given the others in an image-denoising MRF
• Potentials (for x_i, y_i ∈ {−1, +1}):
– Observation term: −η x_i y_i
– Spatial coherence term: −β x_i x_j
– Prior term: −h x_i
• We need P(x_i | ~x_i), i.e. P(x_i | X\x_i, Y)
• What is the Markov blanket of x_i?
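A sketch of Gibbs sampling for this denoising model. The parameter values β, η, h below are illustrative assumptions; x_i, y_i ∈ {−1, +1} live on a 2D grid, and the Markov blanket of x_i is its four grid neighbours plus the observed y_i.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# With the potentials above, p(x | y) is proportional to
# exp(sum_i h*x_i + beta*sum_{ij} x_i*x_j + eta*sum_i x_i*y_i),
# so the conditional of one pixel given its Markov blanket is a sigmoid:
# P(x_i = +1 | blanket) = sigmoid(2*(beta*sum_neighbours + eta*y_i + h)).
def gibbs_denoise(y, beta=1.5, eta=1.0, h=0.0, n_sweeps=20):
    x = y.copy()                      # initialize at the noisy image
    H, W = y.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                nb = 0
                if i > 0:     nb += x[i - 1, j]
                if i < H - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < W - 1: nb += x[i, j + 1]
                p_plus = sigmoid(2 * (beta * nb + eta * y[i, j] + h))
                x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x

# Toy demo: an all-(+1) image with 10% of pixels flipped by noise.
y = np.ones((16, 16), dtype=int)
y[rng.uniform(size=y.shape) < 0.1] = -1
x = gibbs_denoise(y)
```

With a reasonably strong coherence weight β, isolated flipped pixels are pulled back to agree with their neighbours, so most of the restored image is +1.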
Gibbs sampling as a special case of MH
• Proposal distribution: q_k(z* | z) = p(z*_k | z_{\k})
• By holding the other dimensions constant: z*_{\k} = z_{\k}
• Also, p(z) = p(z_k | z_{\k}) p(z_{\k})
• So, the acceptance probability is:
A(z*, z) = [p(z*) q_k(z | z*)] / [p(z) q_k(z* | z)] = [p(z*_k | z_{\k}) p(z_{\k}) p(z_k | z_{\k})] / [p(z_k | z_{\k}) p(z_{\k}) p(z*_k | z_{\k})] = 1
• So, the step is always accepted
Issues with Gibbs sampling
• Initialization is random
• Samples are not independent
– The burn-in period should be discarded (a random initialization may start in, and wander through, a low-probability region for some time)
• Time taken is linear in the number of samples
• The number of iterations scales with dimensionality
RBM and its energy function defined
• An RBM is a bipartite graph between its visible and hidden sets of nodes
• For binary units v_i and h_j, its energy function is:
E(v, h) = − Σ_i a_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i w_{ij} h_j
Source: “An Introduction to Restricted Boltzmann Machines”, Asja Fischer and Christian Igel, CIARP 2012
Marginals in an RBM
• The hidden nodes can be used to learn a code explaining the visible nodes
– In a Deep Belief Net (DBN), more layers can be added on top
• Due to its bipartite nature, the Markov blanket of a node in either set is simply the other set of nodes
• This leads to a simple product form for the conditional distribution of the nodes in a set
• This leads towards a formulation called Product of Experts (PoE)
Gibbs Sampling in RBM
• Let us look at the marginal of the visible nodes:
p(v) = (1/Z) Σ_h exp(−E(v, h))
• The conditionals factorize over the units of each layer:
p(h | v) = Π_j p(h_j | v), with p(h_j = 1 | v) = σ(b_j + Σ_i v_i w_{ij})
p(v | h) = Π_i p(v_i | h), with p(v_i = 1 | h) = σ(a_i + Σ_j w_{ij} h_j)
Gibbs Sampling in RBM
• The RBM can be interpreted as a stochastic neural network, for which block Gibbs sampling can be used: sample all of h given v in one step, then all of v given h
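A sketch of block Gibbs sampling in a small binary RBM; the weights, biases, and layer sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Because the graph is bipartite, all hidden units are conditionally
# independent given v (and vice versa), so each layer is sampled as a block:
#   p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W_ij)
#   p(v_i = 1 | h) = sigmoid(a_i + sum_j W_ij * h_j)
n_v, n_h = 6, 4
W = rng.normal(0.0, 0.5, size=(n_v, n_h))
a = np.zeros(n_v)   # visible biases
b = np.zeros(n_h)   # hidden biases

def block_gibbs(v0, n_steps):
    v = v0
    for _ in range(n_steps):
        p_h = sigmoid(b + v @ W)                       # all hiddens at once
        h = (rng.uniform(size=n_h) < p_h).astype(float)
        p_v = sigmoid(a + W @ h)                       # all visibles at once
        v = (rng.uniform(size=n_v) < p_v).astype(float)
    return v, h

v0 = (rng.uniform(size=n_v) < 0.5).astype(float)
v, h = block_gibbs(v0, 1000)
```

Alternating these two block updates is the sampler underlying RBM training procedures such as Contrastive Divergence.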
In summary
• Monte Carlo methods are often preferred over analytical methods for estimating probability distributions and their marginals in complex PGMs
• Rejection sampling and importance sampling do not use samples effectively
– Finding a good proposal distribution can be tricky
• Markov chain Monte Carlo is often preferred over simple Monte Carlo
• The initial few samples of MCMC methods (the burn-in) are discarded
• Metropolis(-Hastings) uses a proposal distribution for each step
• Gibbs sampling is often the preferred MCMC method
– It makes use of Markov blankets to compute single-variable conditionals