Inference and Representation
David Sontag
New York University
Lecture 9, Nov. 11, 2014
Variational methods

Suppose that we have an arbitrary graphical model:

    p(x; θ) = (1/Z(θ)) ∏_{c∈C} φ_c(x_c) = exp( Σ_{c∈C} θ_c(x_c) − ln Z(θ) )

All of the approaches begin as follows:

    D(q‖p) = Σ_x q(x) ln [ q(x) / p(x) ]
           = −Σ_x q(x) ln p(x) − H(q(x))
           = −Σ_x q(x) [ Σ_{c∈C} θ_c(x_c) − ln Z(θ) ] − H(q(x))
           = −Σ_x q(x) Σ_{c∈C} θ_c(x_c) + ln Z(θ) Σ_x q(x) − H(q(x))
           = −Σ_{c∈C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)).
The log-partition function

Since D(q‖p) ≥ 0, we have

    −Σ_{c∈C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)) ≥ 0,

which implies that

    ln Z(θ) ≥ Σ_{c∈C} E_q[θ_c(x_c)] + H(q(x)).

Thus, any approximating distribution q(x) gives a lower bound on the log-partition function (for a BN, this is the log probability of the observed variables).

Recall that D(q‖p) = 0 if and only if p = q. Thus, if we allow ourselves to optimize over all distributions, we have:

    ln Z(θ) = max_q Σ_{c∈C} E_q[θ_c(x_c)] + H(q(x)).
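As a minimal numerical check of this derivation (an illustration, not part of the lecture), the Python sketch below builds a toy pairwise MRF on three binary variables with invented potentials, verifies the identity D(q‖p) = −Σ_c E_q[θ_c(x_c)] + ln Z(θ) − H(q) for a random q, and confirms the bound is tight at q = p.

import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise MRF on 3 binary variables joined in a chain (hypothetical example).
edges = [(0, 1), (1, 2)]
theta = {e: rng.normal(size=(2, 2)) for e in edges}  # theta_c(x_c) for each clique

def score(x):
    """Sum of log-potentials: sum_c theta_c(x_c)."""
    return sum(theta[(i, j)][x[i], x[j]] for (i, j) in edges)

assignments = list(itertools.product([0, 1], repeat=3))
logZ = np.log(sum(np.exp(score(x)) for x in assignments))
p = np.array([np.exp(score(x) - logZ) for x in assignments])

def entropy(d):
    return -np.sum(d * np.log(d))

def kl(q, p):
    return np.sum(q * np.log(q / p))

def expected_score(d):
    return sum(d[k] * score(x) for k, x in enumerate(assignments))

# Random strictly positive q over the same 8 assignments.
q = rng.random(len(assignments)); q /= q.sum()

# D(q||p) = -sum_c E_q[theta_c(x_c)] + ln Z(theta) - H(q)
assert np.isclose(kl(q, p), -expected_score(q) + logZ - entropy(q))

# Lower bound: sum_c E_q[theta_c(x_c)] + H(q) <= ln Z, with equality at q = p.
assert expected_score(q) + entropy(q) <= logZ + 1e-12
assert np.isclose(expected_score(p) + entropy(p), logZ)
print("ln Z =", logZ, "  bound at random q =", expected_score(q) + entropy(q))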
Re-writing objective in terms of moments

    ln Z(θ) = max_q Σ_{c∈C} E_q[θ_c(x_c)] + H(q(x))
            = max_q Σ_{c∈C} Σ_x q(x) θ_c(x_c) + H(q(x))
            = max_q Σ_{c∈C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q(x)).

Assume that p(x) is in the exponential family, and let f(x) be its sufficient statistic vector.

Define µ_q = E_q[f(x)] to be the marginals of q(x).

We can re-write the objective as

    ln Z(θ) = max_{µ∈M} max_{q : E_q[f(x)] = µ} Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_c(x_c) + H(q(x)),

where M, the marginal polytope, consists of all valid marginal vectors.
Re-writing objective in terms of moments

Next, push the maximization over q inside to obtain:

    ln Z(θ) = max_{µ∈M} Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_c(x_c) + H(µ),   where
    H(µ) = max_{q : E_q[f(x)] = µ} H(q).

For discrete random variables, the marginal polytope M is given by

    M = { µ ∈ R^d | µ = Σ_{x∈X^m} p(x) f(x) for some p(x) ≥ 0 with Σ_{x∈X^m} p(x) = 1 }
      = conv{ f(x), x ∈ X^m }    (conv denotes the convex hull operation)

For a discrete-variable MRF, the sufficient statistic vector f(x) is simply the concatenation of indicator functions for each clique of variables that appear together in a potential function.

For example, if we have a pairwise MRF on binary variables with m = |V| variables and |E| edges, then d = 2m + 4|E| (two singleton indicators per node and four pairwise indicators per edge), as in the sketch below.
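To make f(x) and M concrete, here is a small illustrative sketch (not from the lecture; the chain graph and random distribution are invented). It builds f(x) for a pairwise binary MRF as the stacked singleton and edge indicators, checks d = 2m + 4|E|, and forms one point of M as µ = Σ_x p(x) f(x) under an arbitrary distribution p.

import itertools
import numpy as np

# Hypothetical pairwise binary MRF: 3 variables joined in a chain.
m, edges = 3, [(0, 1), (1, 2)]
d = 2 * m + 4 * len(edges)  # length of the sufficient statistic vector

def f(x):
    """Concatenated indicator functions: one block per node, one per edge."""
    node_blocks = [[1.0 if x[i] == v else 0.0 for v in (0, 1)] for i in range(m)]
    edge_blocks = [[1.0 if (x[i], x[j]) == (a, b) else 0.0
                    for a in (0, 1) for b in (0, 1)] for (i, j) in edges]
    return np.concatenate([np.array(b) for b in node_blocks + edge_blocks])

assignments = list(itertools.product([0, 1], repeat=m))
assert all(len(f(x)) == d for x in assignments)  # d = 2m + 4|E| = 14 here

# Any distribution p(x) maps to a point mu = sum_x p(x) f(x) in the marginal polytope M.
rng = np.random.default_rng(1)
p = rng.random(len(assignments)); p /= p.sum()
mu = sum(p[k] * f(x) for k, x in enumerate(assignments))
print("d =", d)
print("mu =", np.round(mu, 3))  # stacked singleton and pairwise marginals of p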
Marginal polytope for discrete MRFs

[Figure (Wainwright & Jordan, '03): the marginal polytope is the convex hull of the sufficient statistic vectors µ = f(x) of the joint assignments. Each vertex stacks the singleton indicator blocks (assignments for X1, X2, X3) and the edge indicator blocks (edge assignments for X1X2, X1X3, X2X3); convex combinations such as ½ µ′ + ½ µ′′ lie inside the polytope and correspond to valid marginal probabilities.]
Relaxation

    ln Z(θ) = max_{µ∈M} Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_c(x_c) + H(µ)

We still haven't achieved anything, because:

 1. The marginal polytope M is complex to describe (in general, it has exponentially many vertices and facets)
 2. H(µ) is very difficult to compute or optimize over

We now make two approximations:

 1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints M_L
 2. We replace H(µ) with a function H̃(µ) which approximates H(µ)
Local consistency constraints

Force every "cluster" of variables to choose a local assignment:

    µ_i(x_i) ≥ 0                      ∀ i ∈ V, x_i
    Σ_{x_i} µ_i(x_i) = 1              ∀ i ∈ V
    µ_ij(x_i, x_j) ≥ 0                ∀ ij ∈ E, x_i, x_j
    Σ_{x_i, x_j} µ_ij(x_i, x_j) = 1   ∀ ij ∈ E

Enforce that these local assignments are globally consistent:

    µ_i(x_i) = Σ_{x_j} µ_ij(x_i, x_j)   ∀ ij ∈ E, x_i
    µ_j(x_j) = Σ_{x_i} µ_ij(x_i, x_j)   ∀ ij ∈ E, x_j

The local consistency polytope M_L is defined by these constraints.

Look familiar? These are the same local consistency constraints used in Lecture 6 for the linear programming relaxation of MAP inference!
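As a sketch of what membership in M_L means computationally (illustrative code, not from the slides; the helper name and the binary-variable restriction are my choices), the function below checks the non-negativity, normalization, and marginalization constraints for given singleton and edge pseudomarginals:

import numpy as np

def in_local_polytope(mu_node, mu_edge, tol=1e-9):
    """mu_node: dict i -> length-2 array; mu_edge: dict (i, j) -> 2x2 array with rows indexed by x_i."""
    for mu_i in mu_node.values():
        if np.any(np.asarray(mu_i) < -tol) or not np.isclose(np.sum(mu_i), 1.0):
            return False
    for (i, j), mu_ij in mu_edge.items():
        mu_ij = np.asarray(mu_ij)
        if np.any(mu_ij < -tol) or not np.isclose(mu_ij.sum(), 1.0):
            return False
        # Marginalization constraints tying edge marginals to node marginals.
        if not np.allclose(mu_ij.sum(axis=1), mu_node[i], atol=tol):
            return False
        if not np.allclose(mu_ij.sum(axis=0), mu_node[j], atol=tol):
            return False
    return True

# Example: uniform, perfectly consistent pseudomarginals on a single edge.
nodes = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
edges = {(0, 1): np.full((2, 2), 0.25)}
print(in_local_polytope(nodes, edges))  # True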
Local consistency constraints are exact for trees

The marginal polytope depends on the specific sufficient statistic vector f(x).

Theorem: The local consistency constraints exactly define the marginal polytope for a tree-structured MRF.

Proof: Consider any pseudo-marginal vector µ ∈ M_L. We will specify a distribution p_T(x) for which µ_i(x_i) and µ_ij(x_i, x_j) are the singleton and pairwise marginals of the distribution p_T.

Let X_1 be the root of the tree, and direct edges away from the root. Then,

    p_T(x) = µ_1(x_1) ∏_{i ∈ V \ {1}} µ_{i, pa(i)}(x_i, x_{pa(i)}) / µ_{pa(i)}(x_{pa(i)}).

Because of the local consistency constraints, each term in the product can be interpreted as a conditional probability.
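The proof is constructive, so it can be checked numerically. The sketch below (with randomly generated, hypothetical pseudomarginals, not from the lecture) takes locally consistent marginals on a small chain rooted at X0, builds p_T(x) from the formula above, and verifies by brute force that its singleton and pairwise marginals reproduce the inputs.

import itertools
import numpy as np

rng = np.random.default_rng(2)

# Chain X0 - X1 - X2 rooted at X0; pa(1) = 0, pa(2) = 1 (illustrative tree).
mu = {0: rng.dirichlet(np.ones(2))}
mu_edge = {}
for child, parent in [(1, 0), (2, 1)]:
    cond = rng.dirichlet(np.ones(2), size=2)        # cond[xp] = distribution of the child
    mu_edge[(child, parent)] = cond.T * mu[parent]  # mu_{child,parent}(xc, xp)
    mu[child] = mu_edge[(child, parent)].sum(axis=1)

def p_T(x):
    """p_T(x) = mu_0(x_0) * prod_i mu_{i,pa(i)}(x_i, x_pa(i)) / mu_pa(i)(x_pa(i))."""
    val = mu[0][x[0]]
    for child, parent in [(1, 0), (2, 1)]:
        val *= mu_edge[(child, parent)][x[child], x[parent]] / mu[parent][x[parent]]
    return val

assignments = list(itertools.product([0, 1], repeat=3))
probs = {x: p_T(x) for x in assignments}
assert np.isclose(sum(probs.values()), 1.0)

# Recover the marginals of p_T by enumeration and compare to the inputs.
for i in range(3):
    marg = np.array([sum(pr for x, pr in probs.items() if x[i] == v) for v in (0, 1)])
    assert np.allclose(marg, mu[i])
for (c, par), mu_cp in mu_edge.items():
    marg = np.array([[sum(pr for x, pr in probs.items() if (x[c], x[par]) == (a, b))
                      for b in (0, 1)] for a in (0, 1)])
    assert np.allclose(marg, mu_cp)
print("p_T reproduces the given marginals")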
Example for non-tree models

For non-trees, the local consistency constraints give only an outer bound on the marginal polytope.

Example of a pseudo-marginal µ ∈ M_L \ M for a pairwise MRF on three binary variables X_1, X_2, X_3 connected in a cycle: take uniform singleton marginals µ_i(x_i) = 1/2 and, on every edge ij,

    µ_ij(x_i, x_j) =  0    .5
                      .5   0

i.e., each edge marginal puts probability 1/2 on each disagreeing pair (x_i ≠ x_j) and probability 0 on agreement.

To see that this is not in M, note that it violates the following triangle inequality (valid for marginals of MRFs on binary variables):

    Σ_{x_1 ≠ x_2} µ_{1,2}(x_1, x_2) + Σ_{x_2 ≠ x_3} µ_{2,3}(x_2, x_3) + Σ_{x_1 ≠ x_3} µ_{1,3}(x_1, x_3) ≤ 2.

Here each of the three sums equals 1, so the left-hand side is 3 > 2.
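This can be verified directly. In the illustrative sketch below, the fractional point above (uniform singletons with the anti-diagonal edge marginal on every edge of the 3-cycle) passes every local consistency check, yet each disagreement sum equals 1, so the triangle inequality's left-hand side is 3 > 2.

import numpy as np

# Pseudomarginals on the 3-cycle {12, 23, 13}: uniform nodes, anti-diagonal edges.
mu_node = {i: np.array([0.5, 0.5]) for i in (1, 2, 3)}
anti = np.array([[0.0, 0.5],
                 [0.5, 0.0]])
mu_edge = {(1, 2): anti, (2, 3): anti, (1, 3): anti}

# Locally consistent: rows and columns of each edge marginal sum to the node marginals.
for (i, j), m in mu_edge.items():
    assert np.allclose(m.sum(axis=1), mu_node[i]) and np.allclose(m.sum(axis=0), mu_node[j])

# Triangle inequality valid for true marginals of binary MRFs:
# sum_{x1!=x2} mu_12 + sum_{x2!=x3} mu_23 + sum_{x1!=x3} mu_13 <= 2.
disagree = lambda m: m[0, 1] + m[1, 0]
lhs = disagree(mu_edge[(1, 2)]) + disagree(mu_edge[(2, 3)]) + disagree(mu_edge[(1, 3)])
print("LHS =", lhs, "> 2, so this point lies outside the marginal polytope M")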
Maximum entropy (MaxEnt)

Recall that H(µ) = max_{q : E_q[f(x)] = µ} H(q) is the entropy of the maximum entropy distribution with marginals µ.

This yields the optimization problem:

    max_q   H(q(x)) = −Σ_x q(x) log q(x)
    s.t.    Σ_x q(x) f_i(x) = α_i
            Σ_x q(x) = 1

(The objective is strictly concave w.r.t. q(x).)

E.g., when doing inference in a pairwise MRF, the α_i will correspond to µ_l(x_l) and µ_lk(x_l, x_k) for all (l, k) ∈ E, x_l, x_k.
What does the MaxEnt solution look like?

To solve the MaxEnt problem, we form the Lagrangian:

    L = −Σ_x q(x) log q(x) − Σ_i λ_i ( Σ_x q(x) f_i(x) − α_i ) − λ_sum ( Σ_x q(x) − 1 )

Then, taking the derivative of the Lagrangian,

    ∂L/∂q(x) = −1 − log q(x) − Σ_i λ_i f_i(x) − λ_sum

and setting it to zero, we obtain:

    q*(x) = exp( −1 − λ_sum − Σ_i λ_i f_i(x) ) = e^{−1−λ_sum} e^{−Σ_i λ_i f_i(x)}

From the constraint Σ_x q(x) = 1 we obtain e^{1+λ_sum} = Σ_x e^{−Σ_i λ_i f_i(x)} = Z(λ).

We conclude that the maximum entropy distribution has the form (substituting θ for −λ)

    q*(x) = (1/Z(θ)) exp( θ · f(x) )
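For intuition about how the multipliers are actually found, here is a small sketch, not from the lecture, that fits the exponential-family MaxEnt solution by gradient descent on the dual ln Z(θ) − θ·α (whose gradient is E_θ[f(x)] − α). The feature choice, target moments, step size, and iteration count are all invented for illustration; the target moments are taken from a known distribution so the fit can be checked.

import itertools
import numpy as np

rng = np.random.default_rng(3)

# Binary x in {0,1}^2 with features f(x) = (x1, x2, x1*x2) (hypothetical choice).
assignments = list(itertools.product([0, 1], repeat=2))
F = np.array([[x[0], x[1], x[0] * x[1]] for x in assignments], dtype=float)

# Target moments alpha = E_p[f(x)] under some strictly positive distribution p.
p = rng.random(len(assignments)); p /= p.sum()
alpha = p @ F

# Dual of MaxEnt: minimize ln Z(theta) - theta . alpha by gradient descent.
theta = np.zeros(3)
for _ in range(5000):
    logits = F @ theta
    q = np.exp(logits - logits.max()); q /= q.sum()   # q(x) = exp(theta . f(x)) / Z(theta)
    theta -= 0.5 * (q @ F - alpha)                    # gradient step: E_theta[f] - alpha

print("target moments   :", np.round(alpha, 4))
print("moments of q*(x) :", np.round(q @ F, 4))  # matches alpha at convergence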
Entropy for tree-structured models

Suppose that p is a tree-structured distribution, so that we are optimizing only over marginals µ_ij(x_i, x_j) for ij ∈ T.

We conclude from the previous slide that arg max_{q : E_q[f(x)] = µ} H(q) is a tree-structured MRF.

The entropy of q as a function of its marginals can be shown to be

    H(µ) = Σ_{i∈V} H(µ_i) − Σ_{ij∈T} I(µ_ij),

where

    H(µ_i)  = −Σ_{x_i} µ_i(x_i) log µ_i(x_i)
    I(µ_ij) = Σ_{x_i, x_j} µ_ij(x_i, x_j) log [ µ_ij(x_i, x_j) / ( µ_i(x_i) µ_j(x_j) ) ]

Can we use this for non-tree structured models?
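As a quick numerical check (an illustrative sketch with an invented chain-structured distribution, not from the lecture), the code below compares the brute-force entropy of a tree-structured distribution with the decomposition Σ_i H(µ_i) − Σ_ij I(µ_ij).

import itertools
import numpy as np

rng = np.random.default_rng(4)

# Tree-structured distribution on 3 binary variables: chain X0 - X1 - X2.
edges = [(0, 1), (1, 2)]
theta = {e: rng.normal(size=(2, 2)) for e in edges}
assignments = list(itertools.product([0, 1], repeat=3))
scores = np.array([sum(theta[(i, j)][x[i], x[j]] for (i, j) in edges) for x in assignments])
p = np.exp(scores - scores.max()); p /= p.sum()
probs = dict(zip(assignments, p))

# Exact singleton and pairwise marginals by enumeration.
mu_node = {i: np.array([sum(pr for x, pr in probs.items() if x[i] == v) for v in (0, 1)])
           for i in range(3)}
mu_edge = {(i, j): np.array([[sum(pr for x, pr in probs.items() if (x[i], x[j]) == (a, b))
                              for b in (0, 1)] for a in (0, 1)]) for (i, j) in edges}

# Entropy two ways: brute force vs. sum of node entropies minus edge mutual informations.
H_exact = -np.sum(p * np.log(p))
H_node = sum(-np.sum(m * np.log(m)) for m in mu_node.values())
I_edge = sum(np.sum(m * np.log(m / np.outer(mu_node[i], mu_node[j])))
             for (i, j), m in mu_edge.items())
print(H_exact, H_node - I_edge)  # equal for a tree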
Bethe free energy approximation

The Bethe entropy approximation is (for any graph)

    H_bethe(µ) = Σ_{i∈V} H(µ_i) − Σ_{ij∈E} I(µ_ij)

This gives the following variational approximation:

    max_{µ ∈ M_L} Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_c(x_c) + H_bethe(µ)

For non tree-structured models this objective is not concave, and is hard to maximize.

Loopy belief propagation, if it converges, finds a stationary point of this objective (possibly a saddle point)!
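To make the connection between loopy BP and the Bethe objective concrete, here is a sketch with made-up potentials (not the lecture's code, and convergence is not guaranteed in general; the weak random couplings here are chosen so that BP settles). It runs sum-product loopy BP on a 3-cycle of binary variables, forms node and edge beliefs, and compares the resulting Bethe estimate of ln Z with the exact value obtained by enumeration.

import itertools
import numpy as np

rng = np.random.default_rng(5)

# Pairwise MRF on a 3-cycle of binary variables (hypothetical potentials).
edges = [(0, 1), (1, 2), (0, 2)]
theta = {e: 0.5 * rng.normal(size=(2, 2)) for e in edges}
psi = {e: np.exp(t) for e, t in theta.items()}

# Exact ln Z by enumeration.
assignments = list(itertools.product([0, 1], repeat=3))
logZ = np.log(sum(np.prod([psi[(i, j)][x[i], x[j]] for (i, j) in edges]) for x in assignments))

# Sum-product loopy BP: one message per directed edge, msgs[(i, j)] is the message from i to j.
msgs = {(i, j): np.ones(2) for e in edges for (i, j) in (e, e[::-1])}
neighbors = {i: [j for e in edges for (a, j) in (e, e[::-1]) if a == i] for i in range(3)}
for _ in range(200):
    new = {}
    for (i, j) in msgs:
        # Product of incoming messages to i from everyone except j, times the edge potential.
        incoming = np.prod([msgs[(k, i)] for k in neighbors[i] if k != j], axis=0)
        pot = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T   # pot[x_i, x_j]
        m = (incoming[:, None] * pot).sum(axis=0)
        new[(i, j)] = m / m.sum()
    msgs = new

# Node and edge beliefs (approximate marginals).
b_node = {}
for i in range(3):
    b = np.prod([msgs[(k, i)] for k in neighbors[i]], axis=0)
    b_node[i] = b / b.sum()
b_edge = {}
for (i, j) in edges:
    ext_i = np.prod([msgs[(k, i)] for k in neighbors[i] if k != j], axis=0)
    ext_j = np.prod([msgs[(k, j)] for k in neighbors[j] if k != i], axis=0)
    b = psi[(i, j)] * np.outer(ext_i, ext_j)
    b_edge[(i, j)] = b / b.sum()

# Bethe estimate of ln Z: sum_ij sum_xixj theta_ij * b_ij + sum_i H(b_i) - sum_ij I(b_ij).
H = lambda d: -np.sum(d * np.log(d))
avg_energy = sum(np.sum(theta[e] * b_edge[e]) for e in edges)
H_bethe = sum(H(b) for b in b_node.values()) \
          - sum(np.sum(b * np.log(b / np.outer(b_node[i], b_node[j])))
                for (i, j), b in b_edge.items())
print("exact ln Z :", round(float(logZ), 4))
print("Bethe      :", round(float(avg_energy + H_bethe), 4))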