Probabilistic Graphical Models
David Sontag, New York University
Lecture 8, March 22, 2012
Approximate marginal inference

Given the joint p(x_1, ..., x_n) represented as a graphical model, how do we perform marginal inference, e.g. to compute p(x_1)? We showed in Lecture 5 that doing this exactly is NP-hard.

Nearly all approximate inference algorithms are either:
1. Monte Carlo methods (e.g., likelihood reweighting, MCMC)
2. Variational algorithms (e.g., mean-field, TRW, loopy belief propagation)

These next two lectures will be on variational methods.
Variational methods

Goal: Approximate a difficult distribution p(x) with a new distribution q(x) such that:
1. p(x) and q(x) are "close"
2. Computation on q(x) is easy

How should we measure distance between distributions? The Kullback-Leibler divergence (KL-divergence) between two distributions p and q is defined as

    D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.

(It measures the expected number of extra bits required to describe samples from p(x) using a code based on q instead of p.)

As you showed in your homework, D(p ‖ q) ≥ 0 for all p, q, with equality if and only if p = q. Notice that the KL-divergence is asymmetric.
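As an illustrative aside (not from the slides), here is a minimal Python sketch of the KL-divergence between two discrete distributions given as probability vectors over the same support; evaluating it in both argument orders shows the asymmetry.

```python
# Minimal sketch (not from the slides): D(p || q) for discrete distributions.
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log 0 treated as 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric: the two values differ
print(kl_divergence(p, p))                        # 0 if and only if the distributions match
```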
KL-divergence (see Section 8.5 of K&F)

    D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.

Suppose p is the true distribution we wish to do inference with. What is the difference between the solution to arg min_q D(p ‖ q) (called the M-projection of p onto Q) and arg min_q D(q ‖ p) (called the I-projection)?

These two will differ only when q is minimized over a restricted set of probability distributions Q = {q_1, ...}, and in particular when p ∉ Q.
KL-divergence – M-projection

    q^* = \arg\min_{q \in Q} D(p \| q) = \arg\min_{q \in Q} \sum_x p(x) \log \frac{p(x)}{q(x)}.

For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:

[Figure (b): contours over (z_1, z_2); p = green, q = red]
KL-divergence – I-projection

    q^* = \arg\min_{q \in Q} D(q \| p) = \arg\min_{q \in Q} \sum_x q(x) \log \frac{q(x)}{p(x)}.

For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:

[Figure (a): contours over (z_1, z_2); p = green, q = red]
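As a concrete illustration (not in the slides), both projections onto diagonal-covariance Gaussians have well-known closed forms: both keep the mean of p, but the M-projection matches the marginal variances Σ_ii, while the I-projection uses the inverse of the precision diagonal, 1/(Σ^{-1})_ii. A minimal sketch with a hypothetical correlated p:

```python
# Minimal sketch (not from the slides): projecting a correlated 2D Gaussian
# p = N(mu, Sigma) onto Q = {Gaussians with diagonal covariance}.
# Standard closed forms: both projections keep the mean mu;
#   M-projection: variances match the marginals, var_i = Sigma[i, i]
#   I-projection: variances come from the precision, var_i = 1 / (Sigma^{-1})[i, i]
import numpy as np

mu = np.array([0.5, 0.5])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])                       # strongly correlated p

m_proj_var = np.diag(Sigma)                          # broad: covers p's marginals
i_proj_var = 1.0 / np.diag(np.linalg.inv(Sigma))     # narrow: avoids regions where p is small

print("M-projection: mean", mu, "variances", m_proj_var)
print("I-projection: mean", mu, "variances", i_proj_var)
# The I-projection is more compact than p's true marginals, matching the figures:
# D(q || p) penalizes q for putting mass where p has little.
```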
KL-divergence (single Gaussian)

In this simple example, both the M-projection and the I-projection find an approximate q(x) that has the correct mean (i.e. E_p[z] = E_q[z]):

[Figures (a) and (b): the two projections from the previous slides shown side by side over (z_1, z_2)]

What if p(x) is multi-modal?
KL-divergence – M-projection (mixture of Gaussians)

    q^* = \arg\min_{q \in Q} D(p \| q) = \arg\min_{q \in Q} \sum_x p(x) \log \frac{p(x)}{q(x)}.

Now suppose that p(x) is a mixture of two 2D Gaussians and Q is the set of all 2D Gaussian distributions (with arbitrary covariance matrices):

[Figure: p = blue, q = red]

The M-projection yields a distribution q(x) with the correct mean and covariance.
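A minimal sketch (not from the slides) of this moment-matching computation for a hypothetical two-component mixture:

```python
# Minimal sketch (not from the slides): the M-projection of a Gaussian mixture
# onto the family of single Gaussians is obtained by matching mean and covariance.
import numpy as np

weights = np.array([0.5, 0.5])                        # hypothetical 2-component mixture
means = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2) * 0.5, np.eye(2) * 0.5]

# Mean of the mixture: E_p[x] = sum_k w_k mu_k
mu = sum(w * m for w, m in zip(weights, means))

# Covariance of the mixture: E_p[x x^T] - mu mu^T
second_moment = sum(w * (C + np.outer(m, m)) for w, m, C in zip(weights, means, covs))
Sigma = second_moment - np.outer(mu, mu)

print("M-projection q = N(mu, Sigma) with")
print("mu =", mu)
print("Sigma =\n", Sigma)   # broad along the axis separating the two modes
```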
KL-divergence – I-projection (mixture of Gaussians)

    q^* = \arg\min_{q \in Q} D(q \| p) = \arg\min_{q \in Q} \sum_x q(x) \log \frac{q(x)}{p(x)}.

[Figure: p = blue, q = red (two equivalently good solutions!)]

Unlike the M-projection, the I-projection does not necessarily yield the correct moments.
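The mode-seeking behaviour can be reproduced numerically. The sketch below (my addition, not from the slides) minimizes a fixed-sample Monte Carlo estimate of D(q ‖ p) over diagonal Gaussians q using scipy; the mixture parameters and the optimizer choice are illustrative assumptions.

```python
# Minimal sketch (not from the slides): the I-projection is "mode seeking".
# We approximate q* = argmin_q D(q || p) for q = N(m, diag(s^2)) against a
# two-mode mixture p, using a fixed-sample Monte Carlo estimate of the KL.
# Different initializations are expected to land on different modes.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
eps = rng.standard_normal((2000, 2))            # fixed base samples (reparameterization)

comps = [mvn(mean=[-2.0, 0.0], cov=0.5 * np.eye(2)),
         mvn(mean=[2.0, 0.0], cov=0.5 * np.eye(2))]

def log_p(x):
    return np.logaddexp(np.log(0.5) + comps[0].logpdf(x),
                        np.log(0.5) + comps[1].logpdf(x))

def kl_estimate(params):
    m, log_s = params[:2], params[2:]
    s = np.exp(log_s)
    x = m + eps * s                              # samples from q = N(m, diag(s^2))
    log_q = mvn(mean=m, cov=np.diag(s**2)).logpdf(x)
    return np.mean(log_q - log_p(x))             # Monte Carlo estimate of D(q || p)

for init in ([-1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]):
    res = minimize(kl_estimate, np.array(init), method="Nelder-Mead")
    print("init", init[:2], "-> q mean", np.round(res.x[:2], 2))
# Expected behaviour: one run locks onto the left mode, the other onto the right,
# i.e. two equivalently good local optima, as in the figure.
```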
Finding the M-projection is the same as exact inference

The M-projection is:

    q^* = \arg\min_{q \in Q} D(p \| q) = \arg\min_{q \in Q} \sum_x p(x) \log \frac{p(x)}{q(x)}.

Recall the definition of probability distributions in the exponential family:

    q(x; \eta) = h(x) \exp\{ \eta \cdot f(x) - \ln Z(\eta) \}

f(x) are called the sufficient statistics. In the exponential family, there is a one-to-one correspondence between distributions q(x; η) and marginal vectors E_q[f(x)].

Suppose that Q is an exponential family (p(x) can be arbitrary). It can be shown (see Thm 8.6) that the expected sufficient statistics with respect to q*(x) are exactly the corresponding marginals under p(x):

    E_{q^*}[f(x)] = E_p[f(x)]

Thus, solving for the M-projection is just as hard as the original inference problem.
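To make the moment-matching claim concrete, here is a small brute-force sketch (not from the slides): with Q the fully factorized distributions over two binary variables (an exponential family whose sufficient statistics are the single-variable indicators), the M-projection of an arbitrary p recovers p's single-variable marginals.

```python
# Minimal sketch (not from the slides): M-projection onto factorized
# distributions over {0,1}^2 matches p's single-variable marginals.
import numpy as np
from itertools import product

# A hypothetical joint p(x1, x2) over {0,1}^2
p = {(0, 0): 0.40, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.40}

def kl(p, q):
    return sum(p[x] * np.log(p[x] / q[x]) for x in p if p[x] > 0)

# Brute-force grid search over factorized q(x) = q1(x1) q2(x2)
best = None
for t1, t2 in product(np.linspace(0.01, 0.99, 99), repeat=2):
    q = {(a, b): (t1 if a else 1 - t1) * (t2 if b else 1 - t2) for (a, b) in p}
    d = kl(p, q)
    if best is None or d < best[0]:
        best = (d, t1, t2)

print("p's marginals:   P(x1=1) =", sum(v for (a, _), v in p.items() if a),
      " P(x2=1) =", sum(v for (_, b), v in p.items() if b))
print("M-projection q*: q1(1) =", round(best[1], 2), " q2(1) =", round(best[2], 2))
# The two lines agree: E_{q*}[f(x)] = E_p[f(x)], i.e. moment matching --
# so computing the M-projection requires knowing p's marginals.
```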
Most variational inference algorithms make use of the I-projection.
Variational methods

Suppose that we have an arbitrary graphical model:

    p(x; \theta) = \frac{1}{Z(\theta)} \prod_{c \in C} \phi_c(x_c) = \exp\Big( \sum_{c \in C} \theta_c(x_c) - \ln Z(\theta) \Big)

All of the approaches begin as follows:

    D(q \| p) = \sum_x q(x) \ln \frac{q(x)}{p(x)}
              = -\sum_x q(x) \ln p(x) + \sum_x q(x) \ln q(x)
              = -\sum_x q(x) \Big( \sum_{c \in C} \theta_c(x_c) - \ln Z(\theta) \Big) - H(q(x))
              = -\sum_x q(x) \sum_{c \in C} \theta_c(x_c) + \sum_x q(x) \ln Z(\theta) - H(q(x))
              = -\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)).
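A quick numerical sanity check of this identity (my addition, not from the slides), using a hypothetical pairwise model over three binary variables and an arbitrary q:

```python
# Minimal sketch (not from the slides): verify numerically that
#   D(q || p) = - sum_c E_q[theta_c(x_c)] + ln Z(theta) - H(q)
# on a tiny pairwise model with hypothetical potentials.
import numpy as np
from itertools import product

theta = {(0, 1): np.array([[1.0, -0.5], [-0.5, 1.0]]),   # theta_c(x_c) for two edges
         (1, 2): np.array([[0.5, 0.0], [0.0, 0.5]])}

states = list(product([0, 1], repeat=3))
score = lambda x: sum(t[x[i], x[j]] for (i, j), t in theta.items())

# Exact p(x; theta) by brute-force enumeration
logZ = np.log(sum(np.exp(score(x)) for x in states))
p = {x: np.exp(score(x) - logZ) for x in states}

# An arbitrary (normalized) approximating distribution q
rng = np.random.default_rng(1)
q_vals = rng.random(len(states)); q_vals /= q_vals.sum()
q = dict(zip(states, q_vals))

kl = sum(q[x] * np.log(q[x] / p[x]) for x in states)
expected_theta = sum(q[x] * score(x) for x in states)
H_q = -sum(q[x] * np.log(q[x]) for x in states)

print(kl, -expected_theta + logZ - H_q)   # the two numbers agree
```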
Variational approach

Since D(q ‖ p) ≥ 0, we have

    -\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)) \ge 0,

which implies that

    \ln Z(\theta) \ge \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).

Thus, any approximating distribution q(x) gives a lower bound on the log-partition function. Recall that D(q ‖ p) = 0 if and only if p = q. Thus, if we allow ourselves to optimize over all distributions, we have:

    \ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).
Mean-field algorithms

    \ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).

Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x). Mean-field algorithms assume a factored representation of the joint distribution:

    q(x) = \prod_{i \in V} q_i(x_i)

The objective function to use for variational inference then becomes:

    \max_{\{q_i(x_i) \ge 0,\ \sum_{x_i} q_i(x_i) = 1\}} \sum_{c \in C} \sum_{x_c} \theta_c(x_c) \prod_{i \in c} q_i(x_i) + \sum_{i \in V} H(q_i)

Key difficulties: (1) a highly non-convex optimization problem, and (2) a fully factored distribution is usually too crude an approximation.
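A minimal sketch of naive mean-field coordinate ascent (not from the slides; the update rule is the standard one obtained by optimizing each q_i with the others held fixed, and the potentials are hypothetical):

```python
# Minimal sketch (not from the slides): naive mean-field coordinate ascent on a
# small pairwise MRF with binary variables.
# Update: q_i(x_i) ∝ exp( theta_i(x_i) + sum_{j in N(i)} E_{q_j}[theta_ij(x_i, x_j)] )
import numpy as np

n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                             # a 4-cycle
theta_i = {i: np.array([0.0, 0.2 * i]) for i in range(n)}            # unary potentials
theta_ij = {e: np.array([[0.8, -0.8], [-0.8, 0.8]]) for e in edges}  # attractive pairwise

q = {i: np.full(2, 0.5) for i in range(n)}                           # uniform initialization

def neighbors(i):
    for (a, b), t in theta_ij.items():
        if a == i:
            yield b, t                  # t indexed as t[x_i, x_j]
        elif b == i:
            yield a, t.T                # transpose so rows correspond to x_i

for sweep in range(50):                 # coordinate ascent sweeps
    for i in range(n):
        logits = theta_i[i].copy()
        for j, t in neighbors(i):
            logits += t @ q[j]          # E_{q_j}[theta_ij(x_i, x_j)] for each x_i
        logits -= logits.max()          # numerical stability
        q[i] = np.exp(logits) / np.exp(logits).sum()

for i in range(n):
    print("q_%d(x_i = 1) ≈ %.3f" % (i, q[i][1]))
```

Each update is exact for the single factor q_i, so every sweep increases the (non-convex) objective; the fixed point reached can depend on the initialization.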
Convex relaxation

    \ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).

Assume that p(x) is in the exponential family, and let f(x) be its sufficient statistic vector. Let Q be the exponential family with sufficient statistics f(x). Define \mu_q = E_q[f(x)] to be the marginals of q(x). We can re-write the objective as

    \ln Z(\theta) = \max_q \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu^q_c(x_c) + H(\mu_q),

where we define H(\mu_q) to be the entropy of the maximum entropy distribution with marginals \mu_q.

Next, instead of optimizing over distributions q(x), optimize over valid marginal vectors \mu. We obtain:

    \ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c) \mu_c(x_c) + H(\mu)
Marginal polytope (same as from Lecture 7!)

[Figure (Wainwright & Jordan, '03): the marginal polytope for three binary variables X_1, X_2, X_3. Each vertex is the marginal vector \mu of a single joint assignment, stacking the indicator entries for each variable assignment and for each edge assignment (X_1X_2, X_1X_3, X_2X_3). Points inside the polytope, e.g. the convex combination \frac{1}{2}\mu + \frac{1}{2}\mu', are the valid marginal probabilities.]
Convex relaxation

    \ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c) \mu_c(x_c) + H(\mu)

We still haven't achieved anything, because:
1. The marginal polytope M is complex to describe (in general, exponentially many vertices and facets)
2. H(\mu) is very difficult to compute or optimize over

We now make two approximations:
1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints M_L
2. We replace H(\mu) with a concave function \tilde{H}(\mu) which upper bounds H(\mu), i.e. H(\mu) \le \tilde{H}(\mu)

As a result, we obtain the following upper bound on the log-partition function, which is concave and easy to optimize:

    \ln Z(\theta) \le \max_{\mu \in M_L} \sum_{c \in C} \sum_{x_c} \theta_c(x_c) \mu_c(x_c) + \tilde{H}(\mu)
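To illustrate the direction of the bound (my addition, not from the slides), the sketch below uses the simplest concave surrogate, \tilde{H}(\mu) = \sum_i H(\mu_i), which upper-bounds H(\mu). Plugging the exact marginals of a small model into the relaxed objective already gives a value at least ln Z, and the maximum over M_L (which contains the exact marginals) can only be larger.

```python
# Minimal sketch (not from the slides): the relaxed objective upper-bounds ln Z
# on a 3-variable binary cycle, using H_tilde(mu) = sum_i H(mu_i) >= H(mu).
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2), (0, 2)]
theta = {e: np.array([[1.2, -0.3], [-0.3, 0.7]]) for e in edges}   # hypothetical potentials

states = list(product([0, 1], repeat=3))
score = lambda x: sum(t[x[i], x[j]] for (i, j), t in theta.items())
logZ = np.log(sum(np.exp(score(x)) for x in states))
p = {x: np.exp(score(x) - logZ) for x in states}

# Exact singleton and edge marginals of p (these lie in M, hence also in M_L)
mu_i = {i: np.array([sum(p[x] for x in states if x[i] == v) for v in (0, 1)])
        for i in range(3)}
mu_ij = {(i, j): np.array([[sum(p[x] for x in states if (x[i], x[j]) == (a, b))
                            for b in (0, 1)] for a in (0, 1)])
         for (i, j) in edges}

H = lambda dist: -np.sum(dist * np.log(dist))
objective = (sum(np.sum(theta[e] * mu_ij[e]) for e in edges)     # sum_c theta_c . mu_c
             + sum(H(mu_i[i]) for i in range(3)))                # H_tilde(mu)

print("ln Z                     =", logZ)
print("bound at exact marginals =", objective)   # already >= ln Z; the max over M_L is larger
```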