

  1. Probabilistic Graphical Models
     David Sontag, New York University
     Lecture 8, March 22, 2012

  2. Approximate marginal inference
     Given the joint $p(x_1, \ldots, x_n)$ represented as a graphical model, how do we perform marginal inference, e.g. to compute $p(x_1)$?
     We showed in Lecture 5 that doing this exactly is NP-hard.
     Nearly all approximate inference algorithms are either:
       1. Monte Carlo methods (e.g., likelihood reweighting, MCMC)
       2. Variational algorithms (e.g., mean-field, TRW, loopy belief propagation)
     These next two lectures will be on variational methods.

  3. Variational methods
     Goal: approximate a difficult distribution $p(x)$ with a new distribution $q(x)$ such that:
       1. $p(x)$ and $q(x)$ are "close"
       2. Computation on $q(x)$ is easy
     How should we measure distance between distributions?
     The Kullback-Leibler divergence (KL-divergence) between two distributions $p$ and $q$ is defined as
     $$D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
     (it measures the expected number of extra bits required to describe samples from $p(x)$ using a code based on $q$ instead of $p$).
     As you showed in your homework, $D(p \,\|\, q) \ge 0$ for all $p, q$, with equality if and only if $p = q$.
     Notice that the KL-divergence is asymmetric.
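
A quick numerical illustration (not from the slides; the two discrete distributions are made up): computing $D(p\,\|\,q)$ and $D(q\,\|\,p)$ directly from the definition shows both the non-negativity and the asymmetry.

```python
# Minimal sketch: KL divergence for discrete distributions (in nats).
# The distributions p and q below are made-up examples.
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # D(p || q) > 0
print(kl_divergence(q, p))  # D(q || p) differs: KL is asymmetric
print(kl_divergence(p, p))  # 0, since equality holds iff p = q
```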

  4. KL-divergence (see Section 8.5 of K&F)
     $$D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
     Suppose p is the true distribution we wish to do inference with.
     What is the difference between the solution to $\arg\min_q D(p \,\|\, q)$ (called the M-projection of p onto the family Q) and $\arg\min_q D(q \,\|\, p)$ (called the I-projection)?
     These two will differ only when q is minimized over a restricted set of probability distributions $Q = \{q_1, \ldots\}$, and in particular when $p \notin Q$.

  5. KL-divergence – M-projection
     $$q^* = \arg\min_{q \in Q} D(p \,\|\, q) = \arg\min_{q \in Q} \sum_x p(x) \log \frac{p(x)}{q(x)}$$
     For example, suppose that $p(z)$ is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
     [Figure (b): contours over $z_1$, $z_2$; p = green, q = red]

  6. KL-divergence – I-projection
     $$q^* = \arg\min_{q \in Q} D(q \,\|\, p) = \arg\min_{q \in Q} \sum_x q(x) \log \frac{q(x)}{p(x)}$$
     For example, suppose that $p(z)$ is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
     [Figure (a): contours over $z_1$, $z_2$; p = green, q = red]
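
To make the two projections concrete for this Gaussian example, here is a small sketch (the covariance below is invented, not taken from the figure). For $Q$ = diagonal-covariance Gaussians, both projections keep the mean of $p$; a standard closed-form result is that the M-projection keeps the marginal variances $\Sigma_{ii}$, while the I-projection keeps the precision diagonal, giving variances $1/(\Sigma^{-1})_{ii}$, which is why its contours are narrower.

```python
# Sketch: M- vs I-projection of a correlated 2D Gaussian p onto Gaussians
# with diagonal covariance. Parameters are assumed for illustration only.
import numpy as np

mu = np.array([0.5, 0.5])            # mean of p (both projections keep it)
Sigma = np.array([[0.10, 0.08],
                  [0.08, 0.10]])     # correlated covariance of p

# M-projection, arg min_q D(p || q): moment matching keeps Sigma_ii
m_proj_var = np.diag(Sigma)

# I-projection, arg min_q D(q || p): matches the precision diagonal
Lambda = np.linalg.inv(Sigma)
i_proj_var = 1.0 / np.diag(Lambda)

print("M-projection variances:", m_proj_var)  # wider, covers all of p's mass
print("I-projection variances:", i_proj_var)  # narrower, hugs a slice of p
```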

  7. KL-divergence (single Gaussian)
     In this simple example, both the M-projection and I-projection find an approximate $q(x)$ that has the correct mean (i.e. $E_p[z] = E_q[z]$):
     [Figures (a) and (b): contours over $z_1$, $z_2$ for the two projections]
     What if $p(x)$ is multi-modal?

  8. KL-divergence – M-projection (mixture of Gaussians)
     $$q^* = \arg\min_{q \in Q} D(p \,\|\, q) = \arg\min_{q \in Q} \sum_x p(x) \log \frac{p(x)}{q(x)}$$
     Now suppose that $p(x)$ is a mixture of two 2D Gaussians and Q is the set of all 2D Gaussian distributions (with arbitrary covariance matrices):
     [Figure: p = blue, q = red]
     The M-projection yields a distribution $q(x)$ with the correct mean and covariance.
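
The moment matching itself is easy to carry out; below is a small sketch with made-up mixture parameters (two well-separated, equally weighted components), showing that $q^*$ gets the overall mean and a covariance broad enough to span both modes.

```python
# Sketch: M-projection of a Gaussian mixture onto the family of all Gaussians
# is moment matching. The mixture parameters are invented for illustration.
import numpy as np

weights = np.array([0.5, 0.5])
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
covs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 0.0], [0.0, 1.0]]])

mean_p = weights @ means                       # E_p[x]
second_moment = sum(w * (C + np.outer(m, m))   # E_p[x x^T]
                    for w, m, C in zip(weights, means, covs))
cov_p = second_moment - np.outer(mean_p, mean_p)

print("q* mean:", mean_p)          # sits between the two modes
print("q* covariance:\n", cov_p)   # stretched to cover both components
```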

  9. KL-divergence – I-projection (mixture of Gaussians)
     $$q^* = \arg\min_{q \in Q} D(q \,\|\, p) = \arg\min_{q \in Q} \sum_x q(x) \log \frac{q(x)}{p(x)}$$
     [Figure: p = blue, q = red (two equivalently good solutions!)]
     Unlike the M-projection, the I-projection does not necessarily yield the correct moments.

  10. Finding the M-projection is the same as exact inference
      The M-projection is:
      $$q^* = \arg\min_{q \in Q} D(p \,\|\, q) = \arg\min_{q \in Q} \sum_x p(x) \log \frac{p(x)}{q(x)}$$
      Recall the definition of probability distributions in the exponential family:
      $$q(x; \eta) = h(x) \exp\{\eta \cdot f(x) - \ln Z(\eta)\}$$
      $f(x)$ are called the sufficient statistics.
      In the exponential family, there is a one-to-one correspondence between distributions $q(x; \eta)$ and marginal vectors $E_q[f(x)]$.
      Suppose that Q is an exponential family ($p(x)$ can be arbitrary). It can be shown (see Thm 8.6) that the expected sufficient statistics with respect to $q^*(x)$ are exactly the corresponding marginals under $p(x)$:
      $$E_{q^*}[f(x)] = E_p[f(x)]$$
      Thus, solving for the M-projection is just as hard as the original inference problem.
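
A small illustration of this (not from the slides; the table for $p$ is random): if Q is the family of fully factored distributions over three binary variables, the sufficient statistics are the indicators $1[x_i = 1]$, so by the moment-matching condition the M-projection must reproduce $p$'s single-node marginals, and computing it already amounts to exact marginal inference in $p$.

```python
# Sketch of Thm 8.6 for Q = fully factored distributions over binary x1,x2,x3.
# p is an arbitrary random joint table (assumed for illustration).
import itertools
import numpy as np

rng = np.random.default_rng(0)
states = list(itertools.product([0, 1], repeat=3))
p = rng.dirichlet(np.ones(len(states)))            # joint p(x1, x2, x3)

# "Exact inference": the single-node marginals p(x_i = 1)
marg = np.array([sum(pk for pk, x in zip(p, states) if x[i] == 1)
                 for i in range(3)])

# M-projection: q*(x) = prod_i q_i(x_i) with q_i(1) = p(x_i = 1)
q_star = np.array([np.prod([marg[i] if x[i] == 1 else 1 - marg[i]
                            for i in range(3)]) for x in states])

# Moment matching: E_{q*}[f(x)] = E_p[f(x)]
E_q = np.array([sum(qk for qk, x in zip(q_star, states) if x[i] == 1)
                for i in range(3)])
print(np.allclose(E_q, marg))                      # True
```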

  11. Most variational inference algorithms make use of the I-projection

  12. Variational methods
      Suppose that we have an arbitrary graphical model:
      $$p(x; \theta) = \frac{1}{Z(\theta)} \prod_{c \in C} \phi_c(x_c) = \exp\Big\{\sum_{c \in C} \theta_c(x_c) - \ln Z(\theta)\Big\}$$
      All of the approaches begin as follows:
      $$\begin{aligned}
      D(q \,\|\, p) &= \sum_x q(x) \ln \frac{q(x)}{p(x)} \\
      &= -\sum_x q(x) \ln p(x) - \Big(-\sum_x q(x) \ln q(x)\Big) \\
      &= -\sum_x q(x) \Big(\sum_{c \in C} \theta_c(x_c) - \ln Z(\theta)\Big) - H(q(x)) \\
      &= -\sum_x q(x) \sum_{c \in C} \theta_c(x_c) + \sum_x q(x) \ln Z(\theta) - H(q(x)) \\
      &= -\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)).
      \end{aligned}$$

  13. Variational approach
      Since $D(q \,\|\, p) \ge 0$, we have
      $$-\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)) \ge 0,$$
      which implies that
      $$\ln Z(\theta) \ge \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).$$
      Thus, any approximating distribution $q(x)$ gives a lower bound on the log-partition function.
      Recall that $D(q \,\|\, p) = 0$ if and only if $p = q$. Thus, if we allow ourselves to optimize over all distributions, we have:
      $$\ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).$$
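
A brute-force check of this bound on a toy model (the 3-node chain and its potentials below are invented for illustration): any distribution $q$ scores at most $\ln Z(\theta)$, and $q = p$ attains the maximum.

```python
# Sketch: verify  sum_c E_q[theta_c(x_c)] + H(q)  <=  ln Z(theta)
# on a small made-up pairwise model, with equality at q = p.
import itertools
import numpy as np

rng = np.random.default_rng(0)
theta = {(0, 1): rng.normal(size=(2, 2)),   # pairwise log-potentials
         (1, 2): rng.normal(size=(2, 2))}

def score(x):
    return sum(th[x[i], x[j]] for (i, j), th in theta.items())

states = list(itertools.product([0, 1], repeat=3))
log_Z = np.log(sum(np.exp(score(x)) for x in states))
p = np.array([np.exp(score(x) - log_Z) for x in states])

def lower_bound(q):
    """sum_c E_q[theta_c] + H(q) for a full table q over the 8 states."""
    e_term = sum(qk * score(x) for qk, x in zip(q, states))
    h_term = -sum(qk * np.log(qk) for qk in q if qk > 0)
    return e_term + h_term

q_rand = rng.dirichlet(np.ones(len(states)))   # an arbitrary distribution
print(lower_bound(q_rand), "<=", log_Z)        # strictly below ln Z
print(lower_bound(p), "==", log_Z)             # equality at q = p
```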

  14. Mean-field algorithms
      $$\ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).$$
      Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing $q(x)$.
      Mean-field algorithms assume a factored representation of the joint distribution:
      $$q(x) = \prod_{i \in V} q_i(x_i)$$
      The objective function to use for variational inference then becomes:
      $$\max_{\{q_i(x_i) \ge 0,\; \sum_{x_i} q_i(x_i) = 1\}} \;\sum_{c \in C} \sum_{x_c} \theta_c(x_c) \prod_{i \in c} q_i(x_i) + \sum_{i \in V} H(q_i)$$
      Key difficulties: (1) this is a highly non-convex optimization problem, and (2) the fully factored distribution is usually too crude an approximation.
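
Below is a minimal sketch of one standard way to locally optimize this objective: coordinate ascent ("naive mean-field") on a small, made-up pairwise model, where each update sets $q_i(x_i) \propto \exp\{\theta_i(x_i) + \sum_{j \in N(i)} \sum_{x_j} q_j(x_j)\,\theta_{ij}(x_i, x_j)\}$.

```python
# Sketch: naive mean-field coordinate ascent on a 4-node binary cycle with
# random node/edge potentials (all parameters assumed for illustration).
import numpy as np

rng = np.random.default_rng(1)
n, k = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_node = rng.normal(size=(n, k))
theta_edge = {e: rng.normal(size=(k, k)) for e in edges}

q = np.full((n, k), 1.0 / k)                   # start from uniform q_i
for sweep in range(50):
    for i in range(n):
        logits = theta_node[i].copy()
        for (a, b), th in theta_edge.items():
            if a == i:
                logits += th @ q[b]            # sum_{x_b} q_b(x_b) theta_ab(x_i, x_b)
            elif b == i:
                logits += th.T @ q[a]
        logits -= logits.max()                 # numerical stability
        q[i] = np.exp(logits) / np.exp(logits).sum()

print("mean-field marginals q_i(x_i):\n", q)
```

Each coordinate update maximizes the objective over one factor exactly, so the objective is non-decreasing across sweeps, but because the problem is non-convex the fixed point reached generally depends on the initialization.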

  15. Convex relaxation
      $$\ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).$$
      Assume that $p(x)$ is in the exponential family, and let $f(x)$ be its sufficient statistic vector.
      Let Q be the exponential family with sufficient statistics $f(x)$.
      Define $\mu_q = E_q[f(x)]$ to be the marginals of $q(x)$.
      We can re-write the objective as
      $$\ln Z(\theta) = \max_q \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_c^q(x_c) + H(\mu_q),$$
      where we define $H(\mu_q)$ to be the entropy of the maximum-entropy distribution with marginals $\mu_q$.
      Next, instead of optimizing over distributions $q(x)$, optimize over valid marginal vectors $\mu$. We obtain:
      $$\ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_c(x_c) + H(\mu)$$

  16. Marginal polytope (same as from Lecture 7!)
      [Figure (Wainwright & Jordan, '03): the marginal polytope is the set of valid marginal probabilities; each vertex is the marginal vector $\mu$ of a full assignment to $X_1, X_2, X_3$, stacking the node assignment indicators and the edge assignment indicators for $X_1 X_2$, $X_1 X_3$, $X_2 X_3$; convex combinations such as $\frac{1}{2}\mu + \frac{1}{2}\mu'$ lie inside the polytope.]

  17. Convex relaxation
      $$\ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_c(x_c) + H(\mu)$$
      We still haven't achieved anything, because:
        1. The marginal polytope M is complex to describe (in general, exponentially many vertices and facets)
        2. $H(\mu)$ is very difficult to compute or optimize over
      We now make two approximations:
        1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints $M_L$
        2. We replace $H(\mu)$ with a concave function $\tilde{H}(\mu)$ which upper bounds $H(\mu)$, i.e. $H(\mu) \le \tilde{H}(\mu)$
      As a result, we obtain the following upper bound on the log-partition function, which is concave and easy to optimize:
      $$\ln Z(\theta) \le \max_{\mu \in M_L} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_c(x_c) + \tilde{H}(\mu)$$
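
To make the relaxation concrete, here is a rough numerical sketch (not from the slides): a tiny chain, the local consistency constraints defining $M_L$, and the deliberately loose concave surrogate $\tilde{H}(\mu) = \sum_i H(\mu_i)$, which upper-bounds $H(\mu)$ because a joint entropy is at most the sum of its marginal entropies (the TRW entropy is a tighter choice). The resulting optimum should sit above the exact $\ln Z(\theta)$ computed by enumeration.

```python
# Sketch: upper-bound ln Z by maximizing theta . mu + sum_i H(mu_i) over the
# local consistency polytope M_L, on a made-up 3-node binary chain.
import itertools
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, k = 3, 2
edges = [(0, 1), (1, 2)]
theta = {e: rng.normal(size=(k, k)) for e in edges}

# Exact log-partition function by enumeration (the quantity being bounded).
states = list(itertools.product(range(k), repeat=n))
log_Z = np.log(sum(np.exp(sum(theta[(i, j)][x[i], x[j]] for i, j in edges))
                   for x in states))

# mu = [node marginals (n*k), edge marginals (len(edges)*k*k)]
def unpack(mu):
    return mu[:n * k].reshape(n, k), mu[n * k:].reshape(len(edges), k, k)

def neg_objective(mu):
    nodes, pairs = unpack(mu)
    lin = sum((theta[e] * pairs[m]).sum() for m, e in enumerate(edges))
    ent = -(nodes * np.log(nodes + 1e-12)).sum()   # surrogate entropy
    return -(lin + ent)

cons = [{'type': 'eq', 'fun': lambda mu, i=i: unpack(mu)[0][i].sum() - 1.0}
        for i in range(n)]                          # node normalization
for m, (i, j) in enumerate(edges):                  # local consistency
    cons.append({'type': 'eq', 'fun': lambda mu, m=m, i=i:
                 unpack(mu)[1][m].sum(axis=1) - unpack(mu)[0][i]})
    cons.append({'type': 'eq', 'fun': lambda mu, m=m, j=j:
                 unpack(mu)[1][m].sum(axis=0) - unpack(mu)[0][j]})

mu0 = np.concatenate([np.full(n * k, 1.0 / k),
                      np.full(len(edges) * k * k, 1.0 / (k * k))])
res = minimize(neg_objective, mu0, bounds=[(0, 1)] * mu0.size,
               constraints=cons, method='SLSQP')

print("relaxed upper bound:", -res.fun)
print("exact ln Z:         ", log_Z)   # the bound should be >= ln Z
```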
