Advanced Machine Learning: MCMC Methods
Amit Sethi, Electrical Engineering, IIT Bombay

Objectives
• We have talked about:
– Exact inference in factor graphs using the sum-product algorithm (aka belief propagation)
– Limitations of the sum-product algorithm and some remedies for them
• Today we will learn:
– Sampling methods (aka Monte Carlo methods) for when exact inference is intractable

We want to find the expected value of a function, e.g. when calculating messages
$E[f] = \int f(z)\, p(z)\, dz$
• It may not be feasible to compute this integral, but
– Computing f(z) may be easy, and
– So, now we need to draw samples $z^{(l)}$ from p(z)
$E[f] \approx \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$
Source: “Pattern Recognition and Machine Learning”, book and slides by Christopher Bishop
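
The sample-average approximation can be sketched in a few lines of Python. This is only an illustration: the function name `mc_expectation`, the choice of a standard normal for p(z), and f(z) = z^2 are assumptions made for the example, not part of the slides.

```python
import random

def mc_expectation(f, sampler, num_samples, rng):
    """Estimate E[f] as the average of f over num_samples draws from sampler."""
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler(rng))
    return total / num_samples

rng = random.Random(0)
# Assumed example: p(z) is a standard normal and f(z) = z^2, so the true E[f] is 1.
estimate = mc_expectation(lambda z: z * z, lambda r: r.gauss(0.0, 1.0), 100_000, rng)
print(estimate)
```

The estimate only needs samples from p(z) and evaluations of f(z); no integral is computed.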

We will look at the following ways to sample from a distribution
• Rejection sampling
• Importance sampling
• Gibbs sampling

Sampling marginals
• Note that this procedure can be applied to generate samples for marginals as well
• Simply discard the portions of each sample that are not needed
– e.g. for the marginal p(rain), the sample (cloudy = t, sprinkler = f, rain = t, w = t) just becomes (rain = t)
• This is still a fair sampling procedure
• But anything more complex can be a problem
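
A small sketch of discarding unneeded variables, assuming "this procedure" refers to ancestral sampling (drawing each variable given its parents) in a cloudy/sprinkler/rain/wet-grass network. The conditional probability values below are illustrative assumptions, not taken from the slides.

```python
import random

def sample_joint(rng):
    """Draw one joint sample by sampling each variable given its parents."""
    cloudy = rng.random() < 0.5
    sprinkler = rng.random() < (0.1 if cloudy else 0.5)
    rain = rng.random() < (0.8 if cloudy else 0.2)
    # Assumed table for p(wet = t | sprinkler, rain).
    p_wet = {(True, True): 0.99, (True, False): 0.9,
             (False, True): 0.9, (False, False): 0.0}[(sprinkler, rain)]
    wet = rng.random() < p_wet
    return {"cloudy": cloudy, "sprinkler": sprinkler, "rain": rain, "wet": wet}

rng = random.Random(0)
samples = [sample_joint(rng) for _ in range(50_000)]

# Keep only the "rain" component; the rest of each sample is discarded.
rain_only = [s["rain"] for s in samples]
p_rain = sum(rain_only) / len(rain_only)
print(p_rain)
```

With these assumed tables the exact marginal is p(rain = t) = 0.5 * 0.8 + 0.5 * 0.2 = 0.5, which the sample fraction should approach.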

When the partition function is unknown
• Consider the case of an arbitrary, continuous p(z)
• How can we draw samples from it?
• Assume that we can efficiently evaluate p(z) up to some constant (e.g. in an MRF)

Rejection sampling makes use of an easier “proposal” distribution
• Let’s also assume that we have some simpler distribution q(z), called a proposal distribution, from which we can easily draw samples
– e.g. q(z) is a Gaussian
• We can then draw samples from q(z) and use them, provided we have a way to convert them into samples from p(z)

Now, we reject samples according to the ratio of p and q at z
• Introduce a constant k such that $k q(z) \ge p(z)$ for all z
• Rejection sampling procedure:
– Generate $z_0$ from q(z)
– Generate $u_0$ uniformly from $[0, k q(z_0)]$
– If $u_0 > p(z_0)$, reject the sample $z_0$; otherwise keep it
• The original samples $(z_0, u_0)$ lie under the red curve, kq(z)
• The kept samples lie under the blue curve, p(z), hence they are samples from p(z)
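
The procedure above can be sketched as follows. The target (a standard normal with its normalising constant dropped), the Gaussian proposal with standard deviation 2, and all function names are assumptions chosen for illustration. For this pair, k = 2 * sqrt(2 * pi) gives k q(z) = exp(-z^2 / 8), which is at least exp(-z^2 / 2) everywhere.

```python
import math
import random

Q_STD = 2.0  # standard deviation of the assumed Gaussian proposal q(z)

def p_tilde(z):
    """Unnormalised target: a standard normal without its 1/sqrt(2*pi) factor."""
    return math.exp(-0.5 * z * z)

def q_pdf(z):
    """Density of the proposal q(z) = N(0, Q_STD^2)."""
    return math.exp(-0.5 * (z / Q_STD) ** 2) / (Q_STD * math.sqrt(2.0 * math.pi))

# Chosen so that K * q(z) = exp(-z^2 / 8) >= p_tilde(z) for all z.
K = Q_STD * math.sqrt(2.0 * math.pi)

def rejection_sample(num_proposals, rng):
    """Return the proposals that survive the accept/reject test."""
    kept = []
    for _ in range(num_proposals):
        z0 = rng.gauss(0.0, Q_STD)               # generate z0 from q(z)
        u0 = rng.uniform(0.0, K * q_pdf(z0))     # generate u0 from [0, k q(z0)]
        if u0 <= p_tilde(z0):                    # keep unless u0 > p(z0)
            kept.append(z0)
    return kept

num_proposals = 40_000
kept = rejection_sample(num_proposals, random.Random(0))
acceptance_rate = len(kept) / num_proposals
mean = sum(kept) / len(kept)
variance = sum((z - mean) ** 2 for z in kept) / len(kept)
print(acceptance_rate, mean, variance)
```

For this assumed pair the kept samples should look like draws from a standard normal, and about half of the proposals should be kept.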

Rejection sampling can end up rejecting a lot of samples from q
• How likely are we to keep samples?
• The probability that a sample is accepted is:
$p(\text{accept}) = \int \frac{p(z)}{k q(z)}\, q(z)\, dz = \frac{1}{k} \int p(z)\, dz$
• A smaller k is better, subject to $k q(z) \ge p(z)$ for all z
– If q(z) is similar to p(z), this is easier
• In high-dimensional spaces, the acceptance ratio falls off exponentially, and finding a suitable k is challenging
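
To make the exponential fall-off concrete, consider an assumed special case: a normalised zero-mean isotropic Gaussian target with standard deviation sigma_p and a zero-mean isotropic Gaussian proposal with standard deviation sigma_q >= sigma_p, both in `dim` dimensions. The ratio p(z)/q(z) peaks at z = 0, so the smallest valid k is (sigma_q / sigma_p)^dim and the acceptance probability is 1/k. The function name is hypothetical.

```python
def gaussian_acceptance_rate(sigma_p, sigma_q, dim):
    """Acceptance probability 1/k for isotropic Gaussians, with k = (sigma_q / sigma_p)^dim."""
    if sigma_q < sigma_p:
        raise ValueError("need sigma_q >= sigma_p so that k*q(z) can bound p(z)")
    k = (sigma_q / sigma_p) ** dim
    return 1.0 / k

for dim in (1, 10, 100, 1000):
    print(dim, gaussian_acceptance_rate(1.0, 1.01, dim))
```

Even a proposal that is only 1% wider than the target keeps fewer than one proposal in ten thousand once the dimension reaches 1000.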

In importance sampling, we scale the weight of each sample by the ratio of p and q
• Approximate the expectation by drawing points $z^{(l)}$ from q(z):
$E[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \approx \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})}\, f(z^{(l)})$
• The quantity $p(z^{(l)}) / q(z^{(l)})$ is known as the importance weight
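
A minimal sketch of the weighted average, assuming a normalised standard-normal target p(z), a wider Gaussian proposal q(z) with standard deviation 2, and f(z) = z^2. These choices and the function names are illustrative only.

```python
import math
import random

def normal_pdf(z, std):
    """Density of a zero-mean Gaussian with the given standard deviation."""
    return math.exp(-0.5 * (z / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, num_samples, rng, q_std=2.0):
    """Estimate E_p[f] from samples of q, weighting each by p(z)/q(z)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, q_std)                          # draw from q(z)
        weight = normal_pdf(z, 1.0) / normal_pdf(z, q_std)  # importance weight
        total += weight * f(z)
    return total / num_samples

estimate = importance_estimate(lambda z: z * z, 100_000, random.Random(0))
print(estimate)
```

Unlike rejection sampling, every draw from q(z) contributes to the estimate; the true value for this assumed setup is 1.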

MCMC methods generate samples sequentially
• Markov chain Monte Carlo methods use a Markov chain, i.e. a sequence $z^{(1)}, z^{(2)}, \dots, z^{(\tau)}$ in which each sample depends on the previous one
• The transitions of the Markov chain are governed by the proposal distribution $q(z \mid z^{(\tau)})$
• Asymptotically, these samples are drawn from the desired distribution p(z)

Metropolis algorithm assumes the proposal distribution is symmetric
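
The body of this slide did not survive extraction. As a hedged sketch of the standard Metropolis rule: with a symmetric proposal, a candidate z* is accepted with probability min(1, p~(z*) / p~(z)), where p~ is the target up to a constant. The target (a standard normal), the Gaussian random-walk proposal, and the names below are assumptions for illustration.

```python
import math
import random

def log_p_tilde(z):
    """Log of an unnormalised target; a standard normal is assumed here."""
    return -0.5 * z * z

def metropolis(num_samples, z0, step, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    z = z0
    chain = []
    for _ in range(num_samples):
        z_star = z + rng.gauss(0.0, step)               # symmetric: q(z*|z) = q(z|z*)
        log_ratio = log_p_tilde(z_star) - log_p_tilde(z)
        if rng.random() < math.exp(min(0.0, log_ratio)):  # accept with prob min(1, ratio)
            z = z_star
        chain.append(z)                                  # a rejection repeats the current state
    return chain

chain = metropolis(50_000, z0=5.0, step=2.0, rng=random.Random(0))
after_burn_in = chain[1_000:]
mean = sum(after_burn_in) / len(after_burn_in)
variance = sum((z - mean) ** 2 for z in after_burn_in) / len(after_burn_in)
print(mean, variance)
```

Only the ratio of target values is needed, so the unknown normalising constant cancels.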

Visualizing the Metropolis algorithm [figure]

Metropolis-Hastings algorithm generalizes the Metropolis algorithm to non-symmetric transitions
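
The slide's equations are missing from the extracted text. As a hedged sketch of the standard rule: a candidate z* drawn from q(z*|z) is accepted with probability min(1, p~(z*) q(z|z*) / (p~(z) q(z*|z))). The example below assumes an unnormalised Gamma(3, 1) target on z > 0 and a multiplicative log-normal proposal, which is not symmetric; all names are hypothetical.

```python
import math
import random

STEP = 0.8  # assumed spread of the log-normal proposal

def log_p_tilde(z):
    """Log of an unnormalised Gamma(shape=3, rate=1) density, valid for z > 0."""
    return 2.0 * math.log(z) - z

def propose(z, rng):
    """Multiplicative random walk: z* = z * exp(STEP * noise), always positive."""
    return z * math.exp(rng.gauss(0.0, STEP))

def log_q(z_to, z_from):
    """Log density of proposing z_to from z_from (log-normal), up to a constant."""
    d = math.log(z_to) - math.log(z_from)
    return -math.log(z_to) - 0.5 * (d / STEP) ** 2

def metropolis_hastings(num_samples, z0, rng):
    z = z0
    chain = []
    for _ in range(num_samples):
        z_star = propose(z, rng)
        # log of p~(z*) q(z|z*) / (p~(z) q(z*|z))
        log_a = (log_p_tilde(z_star) + log_q(z, z_star)) - (log_p_tilde(z) + log_q(z_star, z))
        if rng.random() < math.exp(min(0.0, log_a)):
            z = z_star
        chain.append(z)
    return chain

chain = metropolis_hastings(50_000, z0=1.0, rng=random.Random(0))
after_burn_in = chain[1_000:]
mean = sum(after_burn_in) / len(after_burn_in)
print(mean)
```

The extra factor q(z|z*) / q(z*|z) corrects for the asymmetry of the proposal; with a symmetric proposal it equals 1 and the rule reduces to the Metropolis algorithm. The assumed target has mean 3.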

Gibbs sampling is a simple coordinate-wise MCMC method that does not need a separate proposal distribution
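
A minimal sketch of coordinate-wise sampling, assuming a zero-mean, unit-variance bivariate Gaussian with correlation rho as the target. For that target each conditional is itself Gaussian: z1 | z2 ~ N(rho * z2, 1 - rho^2), and symmetrically for z2. The function name and parameter values are illustrative.

```python
import math
import random

def gibbs_bivariate_normal(num_samples, rho, rng, burn_in=1_000):
    """Alternate between sampling z1 | z2 and z2 | z1 for a correlated Gaussian."""
    cond_std = math.sqrt(1.0 - rho * rho)
    z1, z2 = 0.0, 0.0
    samples = []
    for t in range(num_samples + burn_in):
        z1 = rng.gauss(rho * z2, cond_std)   # resample coordinate 1 given coordinate 2
        z2 = rng.gauss(rho * z1, cond_std)   # resample coordinate 2 given the new coordinate 1
        if t >= burn_in:
            samples.append((z1, z2))
    return samples

samples = gibbs_bivariate_normal(50_000, rho=0.8, rng=random.Random(0))
# For zero-mean, unit-variance coordinates, E[z1 * z2] equals the correlation.
mean_product = sum(a * b for a, b in samples) / len(samples)
print(mean_product)
```

Each update draws directly from an exact conditional, so there is no accept/reject step.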

Markov blanket of an MRF
• It is simply the set of neighbouring nodes

Example: Estimate the probability of one node given the others in an image denoising MRF
• Potentials:
– For observing: $-\eta\, x_i y_i$
– For spatial coherence: $-\beta\, x_i x_j$
– For prior: $-h\, x_i$
• $P(x_i \mid \sim x_i)$ or $P(x_i \mid X \setminus x_i, Y)$
• What is the Markov blanket of $x_i$?
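
A hedged sketch of this conditional, under assumptions the slide does not state explicitly: pixels take values in {-1, +1}, the joint is proportional to exp(-energy) with the energy being the sum of the listed terms, and the Markov blanket of x_i is its neighbouring pixels plus its own observation y_i. Collecting the terms that involve x_i gives a local energy of -x_i (h + beta * sum_j x_j + eta * y_i), so P(x_i = +1 | blanket) is a sigmoid of twice that bracket. The default parameter values are illustrative.

```python
import math

def prob_pixel_is_plus_one(neighbour_values, y_i, h=0.0, beta=1.0, eta=2.1):
    """P(x_i = +1 | neighbouring x_j, y_i) for the assumed {-1, +1} denoising MRF."""
    field = h + beta * sum(neighbour_values) + eta * y_i
    # P(+1) = exp(field) / (exp(field) + exp(-field)) = sigmoid(2 * field)
    return 1.0 / (1.0 + math.exp(-2.0 * field))

# Four agreeing neighbours and an agreeing observation: x_i = +1 is almost certain.
print(prob_pixel_is_plus_one([1, 1, 1, 1], y_i=1))
# Split neighbours: the observation y_i decides.
print(prob_pixel_is_plus_one([1, 1, -1, -1], y_i=-1))
```

Only the blanket enters the computation; every other pixel in the image is irrelevant once the neighbours and y_i are fixed.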

Gibbs sampling as a special case of MH
• Proposal distribution:
• By holding other dimensions constant:
• Also,
• So, acceptance probability is:
• So, the step is always accepted
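
The equations for these bullets were lost in extraction. The standard argument, sketched here in notation that is an assumption rather than the slide's own ($z_{\setminus k}$ denotes all components except $z_k$, and $z^\star$ is the proposed state), runs as follows:

```latex
\begin{align}
  q_k(z^\star \mid z) &= p(z_k^\star \mid z_{\setminus k})
    && \text{(proposal: resample component } k \text{ from its conditional)} \\
  z^\star_{\setminus k} &= z_{\setminus k}
    && \text{(other dimensions are held constant)} \\
  p(z) &= p(z_k \mid z_{\setminus k})\, p(z_{\setminus k})
    && \text{(product rule)} \\
  A(z^\star, z)
    &= \frac{p(z^\star)\, q_k(z \mid z^\star)}{p(z)\, q_k(z^\star \mid z)}
     = \frac{p(z_k^\star \mid z^\star_{\setminus k})\, p(z^\star_{\setminus k})\, p(z_k \mid z^\star_{\setminus k})}
            {p(z_k \mid z_{\setminus k})\, p(z_{\setminus k})\, p(z_k^\star \mid z_{\setminus k})}
     = 1
\end{align}
```

Every factor in the numerator has a matching factor in the denominator once $z^\star_{\setminus k} = z_{\setminus k}$ is substituted, which is why the Metropolis-Hastings acceptance probability $\min(1, A)$ equals 1.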

Issues with Gibbs sampling
• Initialization is random
• Samples are not independent
– Burn-in samples should be discarded (a randomly initialized chain may start in, and wander through, a low-probability region for a time)
• Time taken is linear in the number of samples
• The number of iterations needed scales with dimensionality

RBM and its energy function defined
• An RBM is a bipartite graph between its visible and hidden sets of nodes
• Its energy function is:
Source: “An Introduction to Restricted Boltzmann Machines”, Asja Fischer and Christian Igel, CIARP 2012
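
The energy function itself is missing from the extracted slide. For a binary RBM it is commonly written as E(v, h) = - sum_ij w_ij h_i v_j - sum_j b_j v_j - sum_i c_i h_i, and that form is assumed in the sketch below. The layout of `W` (one row per hidden unit) and the function name are choices made for this example.

```python
def rbm_energy(v, h, W, b, c):
    """Energy of a joint configuration (v, h) under the assumed binary-RBM form.

    W[i][j] couples hidden unit i with visible unit j; b and c are the
    visible and hidden biases respectively.
    """
    interaction = sum(W[i][j] * h[i] * v[j]
                      for i in range(len(h)) for j in range(len(v)))
    visible_bias = sum(bj * vj for bj, vj in zip(b, v))
    hidden_bias = sum(ci * hi for ci, hi in zip(c, h))
    return -interaction - visible_bias - hidden_bias

W = [[1.0, -1.0],
     [0.5, 2.0]]
b = [0.1, 0.2]
c = [-0.3, 0.4]
print(rbm_energy(v=[1, 0], h=[1, 1], W=W, b=b, c=c))
```

There are no visible-visible or hidden-hidden terms, which is the bipartite structure described above.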

Marginals in an RBM
• The hidden nodes can be used to learn a code explaining the visible nodes
– In a Deep Belief Net (DBN), more layers can be added on top
• Due to its bipartite nature, the Markov blanket of a node from either set is simply the other set of nodes
• This leads to a simple product form for the conditional distribution of the nodes in one set given the other
• This leads towards a formulation called Product of Experts (PoE)

Gibbs Sampling in RBM
• Let us look at the marginal of the visible nodes
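
The slide's equation is not in the extracted text. As a hedged reconstruction, assume the usual binary-RBM energy $E(v, h) = -\sum_{i,j} w_{ij} h_i v_j - \sum_j b_j v_j - \sum_i c_i h_i$ with $h_i \in \{0, 1\}$ and partition function $Z$. Summing out the hidden units then gives:

```latex
\begin{align}
  p(v) &= \frac{1}{Z} \sum_{h} e^{-E(v, h)} \\
       &= \frac{1}{Z} \prod_{j} e^{b_j v_j}
          \prod_{i} \sum_{h_i \in \{0, 1\}} e^{h_i \left( c_i + \sum_j w_{ij} v_j \right)} \\
       &= \frac{1}{Z} \prod_{j} e^{b_j v_j}
          \prod_{i} \left( 1 + e^{\, c_i + \sum_j w_{ij} v_j} \right)
\end{align}
```

The sum over all hidden configurations factorises because no two hidden units interact directly, so the marginal is a product of one factor per hidden unit.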

Gibbs Sampling in RBM
• The RBM can be interpreted as a stochastic neural network, for which block Gibbs sampling can be used
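
Block Gibbs sampling can be sketched as alternating between the two layers. The sketch assumes binary units and the usual conditionals p(h_i = 1 | v) = sigmoid(sum_j w_ij v_j + c_i) and p(v_j = 1 | h) = sigmoid(sum_i w_ij h_i + b_j); the function names and the layout of `W` (one row per hidden unit) are choices made for this example.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sample_hidden(v, W, c, rng):
    """Sample all hidden units at once; they are independent given v."""
    return [int(rng.random() < sigmoid(sum(W[i][j] * v[j] for j in range(len(v))) + c[i]))
            for i in range(len(c))]

def sample_visible(h, W, b, rng):
    """Sample all visible units at once; they are independent given h."""
    return [int(rng.random() < sigmoid(sum(W[i][j] * h[i] for i in range(len(h))) + b[j]))
            for j in range(len(b))]

def block_gibbs(v0, W, b, c, num_steps, rng):
    """Run num_steps alternations of (v | h) then (h | v), starting from v0."""
    v = list(v0)
    h = sample_hidden(v, W, c, rng)
    for _ in range(num_steps):
        v = sample_visible(h, W, b, rng)
        h = sample_hidden(v, W, c, rng)
    return v, h

rng = random.Random(0)
W = [[0.5, -0.5, 1.0]]          # one hidden unit, three visible units (arbitrary values)
v, h = block_gibbs([1, 0, 1], W, b=[0.0, 0.0, 0.0], c=[0.0], num_steps=10, rng=rng)
print(v, h)
```

Each layer is updated as a single block because, with the other layer fixed, its units are conditionally independent.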

In summary
• Monte Carlo methods are often preferred over analytical methods to estimate probability distributions and their marginals in complex PGMs
• Rejection sampling and importance sampling do not use samples efficiently
– Finding a good proposal distribution can be tricky
• Markov chain Monte Carlo is often preferred over simple Monte Carlo
• The initial few samples of MCMC methods are discarded as burn-in
• Metropolis(-Hastings) uses a proposal distribution for each step
• Gibbs sampling is the most preferred MCMC method
– It makes use of Markov blankets to compute single-variable conditional distributions
