Probabilistic Graphical Models 10-708
Learning Completely Observed Undirected Graphical Models
Eric Xing
Lecture 12, Oct 19, 2005
Reading: MJ-Chap. 9, 19, 20

Recap: MLE for BNs
• If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
  $$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) = \sum_i \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i)$$
• For tabular CPDs, the MLE is given by the normalized family counts:
  $$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{k'} n_{ijk'}}$$
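As a concrete illustration of the counting formula above, here is a minimal Python sketch (not from the lecture; the data layout and function name are my own) that estimates one tabular CPD from fully observed data by normalizing family counts.

```python
import numpy as np
from collections import defaultdict

# Hypothetical sketch: MLE for one tabular CPD p(x_i | x_pa(i)) from fully
# observed data, i.e. theta_ijk = n_ijk / sum_k' n_ijk'.
# `data` is an (N, num_nodes) integer array; `child` and `parents` index its columns.

def tabular_cpd_mle(data, child, parents):
    counts = defaultdict(lambda: defaultdict(float))  # counts[parent_config][child_value]
    for row in data:
        pa = tuple(row[p] for p in parents)
        counts[pa][row[child]] += 1.0
    cpd = {}
    for pa, child_counts in counts.items():
        total = sum(child_counts.values())             # sum_k' n_ijk'
        cpd[pa] = {k: v / total for k, v in child_counts.items()}
    return cpd

# Toy usage: node 2 has parents {0, 1}
data = np.array([[0, 1, 1], [0, 1, 0], [1, 0, 1], [0, 1, 1]])
print(tabular_cpd_mle(data, child=2, parents=[0, 1]))
```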
MLE for undirected graphical models
• For directed graphical models, the log-likelihood decomposes into a sum of terms, one per family (node plus parents).
• For undirected graphical models, the log-likelihood does not decompose, because the normalization constant Z is a function of all the parameters:
  $$P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(\mathbf{x}_c), \qquad Z = \sum_{x_1, \ldots, x_n} \prod_{c \in C} \psi_c(\mathbf{x}_c)$$
• In general, we will need to do inference (i.e., marginalization) to learn parameters for undirected models, even in the fully observed case.

Log Likelihood for UGMs with tabular clique potentials
• Sufficient statistics: for a UGM (V, E), the number of times that a configuration x (i.e., X_V = x) is observed in a dataset D = {x_1, ..., x_N} can be represented as follows:
  $$m(\mathbf{x}) \stackrel{\text{def}}{=} \sum_n \delta(\mathbf{x}, \mathbf{x}_n) \ \text{(total count)}, \qquad m(\mathbf{x}_c) \stackrel{\text{def}}{=} \sum_{\mathbf{x}_{V \setminus c}} m(\mathbf{x}) \ \text{(clique count)}$$
• In terms of the counts, the log likelihood is given by:
  $$p(D \mid \theta) = \prod_n \prod_{\mathbf{x}} p(\mathbf{x} \mid \theta)^{\delta(\mathbf{x}, \mathbf{x}_n)}$$
  $$\log p(D \mid \theta) = \sum_{\mathbf{x}} \sum_n \delta(\mathbf{x}, \mathbf{x}_n) \log p(\mathbf{x} \mid \theta)$$
  $$\ell = \sum_{\mathbf{x}} m(\mathbf{x}) \log \left( \frac{1}{Z} \prod_c \psi_c(\mathbf{x}_c) \right) = \sum_c \sum_{\mathbf{x}_c} m(\mathbf{x}_c) \log \psi_c(\mathbf{x}_c) - N \log Z$$
• There is a nasty log Z in the likelihood.
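To make the role of log Z concrete, the following sketch (a toy construction of mine, not course code) evaluates this log-likelihood for a tiny binary MRF, computing Z by brute-force enumeration, which is exactly the inference step that makes learning in undirected models expensive.

```python
import itertools
import numpy as np

# A minimal sketch (toy model and data assumed here): evaluate
#   l = sum_c sum_{x_c} m(x_c) log psi_c(x_c) - N log Z
# for a small binary MRF with tabular clique potentials.

def log_likelihood(psis, cliques, data, card=2):
    n_vars = data.shape[1]
    # log Z by summing over all joint configurations (only feasible for tiny models)
    logZ = np.log(sum(
        np.prod([psis[c][tuple(x[i] for i in c)] for c in cliques])
        for x in itertools.product(range(card), repeat=n_vars)
    ))
    ll = -data.shape[0] * logZ                   # the -N log Z term
    for x in data:                               # sum_c m(x_c) log psi_c(x_c), sample by sample
        for c in cliques:
            ll += np.log(psis[c][tuple(x[i] for i in c)])
    return ll

# Toy usage: chain X0 - X1 - X2 with cliques (0,1) and (1,2)
cliques = [(0, 1), (1, 2)]
psis = {c: {xc: 1.0 + xc[0] + xc[1] for xc in itertools.product(range(2), repeat=2)}
        for c in cliques}
data = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1]])
print(log_likelihood(psis, cliques, data))
```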
Derivative of the log-likelihood
• Log-likelihood:
  $$\ell = \sum_c \sum_{\mathbf{x}_c} m(\mathbf{x}_c) \log \psi_c(\mathbf{x}_c) - N \log Z$$
• First term:
  $$\frac{\partial}{\partial \psi_c(\mathbf{x}_c)} \sum_c \sum_{\mathbf{x}_c} m(\mathbf{x}_c) \log \psi_c(\mathbf{x}_c) = \frac{m(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
• Second term:
  $$\frac{\partial \log Z}{\partial \psi_c(\mathbf{x}_c)} = \frac{1}{Z}\, \frac{\partial}{\partial \psi_c(\mathbf{x}_c)} \sum_{\tilde{\mathbf{x}}} \prod_d \psi_d(\tilde{\mathbf{x}}_d)$$
  $$= \frac{1}{Z} \sum_{\tilde{\mathbf{x}}} \delta(\tilde{\mathbf{x}}_c, \mathbf{x}_c)\, \frac{\partial}{\partial \psi_c(\mathbf{x}_c)} \prod_d \psi_d(\tilde{\mathbf{x}}_d) \quad \text{(the delta clamps the clique variables to } \mathbf{x}_c\text{)}$$
  $$= \frac{1}{Z} \sum_{\tilde{\mathbf{x}}} \delta(\tilde{\mathbf{x}}_c, \mathbf{x}_c)\, \frac{1}{\psi_c(\mathbf{x}_c)} \prod_d \psi_d(\tilde{\mathbf{x}}_d)$$
  $$= \sum_{\tilde{\mathbf{x}}} \delta(\tilde{\mathbf{x}}_c, \mathbf{x}_c)\, \frac{p(\tilde{\mathbf{x}})}{\psi_c(\mathbf{x}_c)} = \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$

Conditions on Clique Marginals
• Derivative of the log-likelihood:
  $$\frac{\partial \ell}{\partial \psi_c(\mathbf{x}_c)} = \frac{m(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)} - N\, \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
• Hence, for the maximum likelihood parameters, we know that:
  $$p^{*}_{MLE}(\mathbf{x}_c) = \frac{m(\mathbf{x}_c)}{N} \stackrel{\text{def}}{=} \tilde{p}(\mathbf{x}_c)$$
• In other words, at the maximum likelihood setting of the parameters, for each clique, the model marginals must be equal to the observed marginals (empirical counts).
• This doesn't tell us how to get the ML parameters; it just gives us a condition that must be satisfied when we have them.
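The gradient formula can be checked numerically on a toy model; the sketch below (my own construction, with made-up potentials and data) compares the analytic derivative with respect to a single potential table entry against a finite difference of the log-likelihood.

```python
import itertools
import numpy as np

# Numerical check of  d l / d psi_c(x_c) = m(x_c)/psi_c(x_c) - N p(x_c)/psi_c(x_c)
# on a toy 3-variable binary chain (model and data assumed here, not from the lecture).

def joint(psis, cliques, n_vars, card=2):
    """All configurations and their normalized model probabilities (brute force)."""
    configs = list(itertools.product(range(card), repeat=n_vars))
    w = np.array([np.prod([psis[c][tuple(x[i] for i in c)] for c in cliques]) for x in configs])
    return configs, w / w.sum()

def loglik(psis, cliques, data, card=2):
    configs, p = joint(psis, cliques, data.shape[1], card)
    logp = dict(zip(configs, np.log(p)))
    return sum(logp[tuple(x)] for x in data)

cliques = [(0, 1), (1, 2)]
psis = {c: {xc: 1.0 + 0.5 * xc[0] + xc[1] for xc in itertools.product(range(2), repeat=2)}
        for c in cliques}
data = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]])

c, xc = (0, 1), (1, 1)                                   # perturb one entry psi_{01}(1,1)
m_xc = sum(1 for x in data if tuple(x[i] for i in c) == xc)             # empirical count m(x_c)
configs, p = joint(psis, cliques, data.shape[1])
p_xc = sum(pi for x, pi in zip(configs, p) if tuple(x[i] for i in c) == xc)  # model marginal p(x_c)
analytic = m_xc / psis[c][xc] - len(data) * p_xc / psis[c][xc]

base = loglik(psis, cliques, data)
eps = 1e-6
psis[c][xc] += eps
numeric = (loglik(psis, cliques, data) - base) / eps
print(analytic, numeric)   # the two values should agree closely
```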
MLE for undirected graphical models
• Is the graph decomposable (triangulated)?
• Are all the clique potentials defined on maximal cliques (not sub-cliques)? E.g., ψ_123, ψ_234, not ψ_12, ψ_23, ...
  [Figure: a 4-node graph parameterized with maximal-clique potentials {X1, X2, X3}, {X2, X3, X4} versus pairwise sub-clique potentials.]
• Are the clique potentials full tables (or Gaussians), or parameterized more compactly, e.g.
  $$\psi_c(\mathbf{x}_c) = \exp\left( \sum_k \theta_k f_k(\mathbf{x}_c) \right)?$$

  Decomposable?   Max clique?   Tabular?   Method
       √               √            √      Direct
       -               √            √      IPF
       -               -            -      Gradient
       -               -            -      GIS

MLE for decomposable undirected models
• Decomposable models: G is decomposable ⇔ G is triangulated ⇔ G has a junction tree.
• Potential-based representation:
  $$p(\mathbf{x}) = \frac{\prod_c \psi_c(\mathbf{x}_c)}{\prod_s \varphi_s(\mathbf{x}_s)}$$
• Consider a chain X1 − X2 − X3. The cliques are (X1, X2) and (X2, X3); the separator is X2. The empirical marginals must equal the model marginals.
• Let us guess that
  $$\hat{p}_{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2)\, \tilde{p}(x_2, x_3)}{\tilde{p}(x_2)}$$
• We can verify that such a guess satisfies the conditions:
  $$\hat{p}_{MLE}(x_1, x_2) = \sum_{x_3} \hat{p}_{MLE}(x_1, x_2, x_3) = \sum_{x_3} \tilde{p}(x_1, x_2)\, \tilde{p}(x_3 \mid x_2) = \tilde{p}(x_1, x_2)$$
  and similarly $\hat{p}_{MLE}(x_2, x_3) = \tilde{p}(x_2, x_3)$.
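As a quick numerical sanity check of this guess (toy data of my own choosing, not from the lecture), the sketch below builds p̂_MLE from empirical pairwise counts for the chain and confirms that its marginal over (X1, X2) reproduces the empirical marginal.

```python
import itertools
import numpy as np

# Build p_hat(x1,x2,x3) = p_tilde(x1,x2) * p_tilde(x2,x3) / p_tilde(x2) from counts
# and check that its pairwise marginals match the empirical ones (toy binary data).

data = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]])  # columns X1, X2, X3
N = len(data)

def empirical(cols):
    counts = {}
    for x in data:
        key = tuple(x[i] for i in cols)
        counts[key] = counts.get(key, 0.0) + 1.0 / N
    return counts

p12, p23, p2 = empirical([0, 1]), empirical([1, 2]), empirical([1])

p_hat = {}
for x in itertools.product(range(2), repeat=3):
    if (x[1],) in p2:
        p_hat[x] = p12.get((x[0], x[1]), 0.0) * p23.get((x[1], x[2]), 0.0) / p2[(x[1],)]
    else:
        p_hat[x] = 0.0

# The model marginal over (X1, X2) recovers the empirical marginal
for x1, x2 in itertools.product(range(2), repeat=2):
    marg = sum(p_hat[(x1, x2, x3)] for x3 in range(2))
    print((x1, x2), marg, p12.get((x1, x2), 0.0))
```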
MLE for decomposable undirected models (cont.)
• Let us guess that
  $$\hat{p}_{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2)\, \tilde{p}(x_2, x_3)}{\tilde{p}(x_2)}$$
• To compute the clique potentials, just equate them to the empirical marginals (or conditionals); i.e., the separator must be divided into one of its neighbors. Then Z = 1:
  $$\psi_{12}^{MLE}(x_1, x_2) = \tilde{p}(x_1, x_2), \qquad \psi_{23}^{MLE}(x_2, x_3) = \frac{\tilde{p}(x_2, x_3)}{\tilde{p}(x_2)} = \tilde{p}(x_3 \mid x_2)$$
• One more example:
  [Figure: a decomposable 4-node graph with maximal cliques {X1, X2, X3} and {X2, X3, X4}, separator {X2, X3}.]
  $$\hat{p}_{MLE}(x_1, x_2, x_3, x_4) = \frac{\tilde{p}(x_1, x_2, x_3)\, \tilde{p}(x_2, x_3, x_4)}{\tilde{p}(x_2, x_3)}$$
  $$\psi_{123}^{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2, x_3)}{\tilde{p}(x_2, x_3)} = \tilde{p}(x_1 \mid x_2, x_3), \qquad \psi_{234}^{MLE}(x_2, x_3, x_4) = \tilde{p}(x_2, x_3, x_4)$$

Non-decomposable and/or with non-maximal clique potentials
• If the graph is non-decomposable, and/or the potentials are defined on non-maximal cliques (e.g., ψ_12, ψ_34), we cannot equate empirical marginals (or conditionals) to the MLEs of the clique potentials.
  [Figure: a non-decomposable 4-node cycle over X1, X2, X3, X4, parameterized with pairwise potentials.]
  $$p(x_1, x_2, x_3, x_4) = \frac{1}{Z} \prod_{\{i,j\}} \psi_{ij}(x_i, x_j)$$
  $$\exists\, (i, j) \ \text{s.t.}\ \psi_{ij}^{MLE}(x_i, x_j) \neq \begin{cases} \tilde{p}(x_i, x_j) \\ \tilde{p}(x_i, x_j)/\tilde{p}(x_i) \\ \tilde{p}(x_i, x_j)/\tilde{p}(x_j) \end{cases}$$
  (Homework!)
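The "divide the separator into one of its neighbors" recipe can also be verified numerically. The sketch below (random toy data, my own setup, not course code) sets ψ_123 = p̃(x1 | x2, x3) and ψ_234 = p̃(x2, x3, x4) for the two-clique example above and checks that the product of potentials already sums to 1, so Z = 1.

```python
import itertools
import numpy as np

# Toy check: for maximal cliques {X1,X2,X3}, {X2,X3,X4} with separator {X2,X3},
# dividing the separator marginal into one neighboring clique gives Z = 1.

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 4))          # columns X1, X2, X3, X4 (toy samples)

def emp(cols):
    counts = {}
    for x in data:
        key = tuple(int(x[i]) for i in cols)
        counts[key] = counts.get(key, 0.0) + 1.0 / len(data)
    return counts

p123, p234, p23 = emp([0, 1, 2]), emp([1, 2, 3]), emp([1, 2])

def psi_123(x1, x2, x3):
    # p_tilde(x1 | x2, x3); the default of 1.0 only guards unseen (x2, x3) configs
    return p123.get((x1, x2, x3), 0.0) / p23.get((x2, x3), 1.0)

def psi_234(x2, x3, x4):
    return p234.get((x2, x3, x4), 0.0)             # p_tilde(x2, x3, x4)

Z = sum(psi_123(x1, x2, x3) * psi_234(x2, x3, x4)
        for x1, x2, x3, x4 in itertools.product(range(2), repeat=4))
print(Z)   # ~1.0: the product of potentials is already normalized
```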
Iterative Proportional Fitting (IPF)
• From the derivative of the likelihood:
  $$\frac{\partial \ell}{\partial \psi_c(\mathbf{x}_c)} = \frac{m(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)} - N\, \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
  we can derive another relationship:
  $$\frac{\tilde{p}(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)} = \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
  in which ψ_c appears implicitly in the model marginal p(x_c).
• This is therefore a fixed-point equation for ψ_c. Solving for ψ_c in closed form is hard, because it appears on both sides of this implicit nonlinear equation.
• The idea of IPF is to hold ψ_c fixed on the right-hand side (both in the numerator and denominator) and solve for it on the left-hand side. We cycle through all cliques, then iterate:
  $$\psi_c^{(t+1)}(\mathbf{x}_c) = \psi_c^{(t)}(\mathbf{x}_c)\, \frac{\tilde{p}(\mathbf{x}_c)}{p^{(t)}(\mathbf{x}_c)}$$
  (computing the model marginal p^(t)(x_c) requires inference)

Properties of IPF Updates
• IPF iterates a set of fixed-point equations.
• However, we can prove it is also a coordinate ascent algorithm (coordinates = parameters of clique potentials).
• Hence at each step, it will increase the log-likelihood, and it will converge to a global maximum.
• I-projection: it finds the distribution with the correct marginals that has maximal entropy.
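The following is a minimal IPF sketch (toy model and data of my own; the model marginal is computed by brute-force enumeration, standing in for proper inference). It runs the update above on a non-decomposable 4-cycle and checks that the model clique marginals approach the empirical ones.

```python
import itertools
import numpy as np

# IPF on a non-decomposable 4-cycle X0 - X1 - X3 - X2 - X0 with tabular edge potentials:
#   psi_c^{(t+1)}(x_c) = psi_c^{(t)}(x_c) * p_tilde(x_c) / p^{(t)}(x_c)

cliques = [(0, 1), (1, 3), (2, 3), (0, 2)]
configs = list(itertools.product(range(2), repeat=4))
pair_states = list(itertools.product(range(2), repeat=2))
psis = {c: {xc: 1.0 for xc in pair_states} for c in cliques}      # uniform initialization

data = np.array([[0, 0, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]])
emp = {c: {xc: 0.0 for xc in pair_states} for c in cliques}       # empirical clique marginals
for x in data:
    for c in cliques:
        emp[c][tuple(int(x[i]) for i in c)] += 1.0 / len(data)

def model_marginal(c):
    """p^{(t)}(x_c) by brute-force enumeration -- the inference step inside IPF."""
    w = np.array([np.prod([psis[d][tuple(x[i] for i in d)] for d in cliques]) for x in configs])
    w /= w.sum()
    marg = {xc: 0.0 for xc in pair_states}
    for x, wi in zip(configs, w):
        marg[tuple(x[i] for i in c)] += wi
    return marg

for sweep in range(50):                    # cycle through the cliques, then iterate
    for c in cliques:
        p_c = model_marginal(c)
        for xc in pair_states:
            if p_c[xc] > 0:
                psis[c][xc] *= emp[c][xc] / p_c[xc]

# After enough sweeps, the model clique marginals (approximately) match the empirical ones.
print({xc: round(model_marginal((0, 1))[xc], 3) for xc in pair_states})
print(emp[(0, 1)])
```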
KL Divergence View
• IPF can be seen as coordinate ascent in the likelihood by expressing the likelihood in terms of KL divergences.
• Recall that we have shown maximizing the log-likelihood is equivalent to minimizing the KL divergence (cross entropy) from the observed distribution to the model distribution:
  $$\max_\theta \ell(\theta) \;\Leftrightarrow\; \min_\theta KL\big(\tilde{p}(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \theta)\big) = \min_\theta \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \log \frac{\tilde{p}(\mathbf{x})}{p(\mathbf{x} \mid \theta)}$$
• Using a property of KL divergence based on the conditional chain rule p(x) = p(x_a) p(x_b | x_a):
  $$KL\big(q(x_a, x_b) \,\|\, p(x_a, x_b)\big) = \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_a)\, q(x_b \mid x_a)}{p(x_a)\, p(x_b \mid x_a)}$$
  $$= \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_a)}{p(x_a)} + \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_b \mid x_a)}{p(x_b \mid x_a)}$$
  $$= KL\big(q(x_a) \,\|\, p(x_a)\big) + \sum_{x_a} q(x_a)\, KL\big(q(x_b \mid x_a) \,\|\, p(x_b \mid x_a)\big)$$

IPF minimizes KL divergence
• Putting things together, we have
  $$KL\big(\tilde{p}(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \theta)\big) = KL\big(\tilde{p}(\mathbf{x}_c) \,\|\, p(\mathbf{x}_c \mid \theta)\big) + \sum_{\mathbf{x}_c} \tilde{p}(\mathbf{x}_c)\, KL\big(\tilde{p}(\mathbf{x}_{-c} \mid \mathbf{x}_c) \,\|\, p(\mathbf{x}_{-c} \mid \mathbf{x}_c, \theta)\big)$$
• It can be shown that changing the clique potential ψ_c has no effect on the conditional distribution p(x_{-c} | x_c), so the second term is unaffected.
• To minimize the first term, we set the model marginal to the observed marginal, just as in IPF.
• We can interpret IPF updates as retaining the "old" conditional probabilities p^(t)(x_{-c} | x_c) while replacing the "old" marginal probability p^(t)(x_c) with the observed marginal p̃(x_c).
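The interpretation in the last bullet can be verified numerically. The sketch below (a toy 3-variable chain with made-up potentials and an assumed observed marginal, indexed from 0) performs one IPF update on the clique (X0, X1) and checks that the new joint equals the observed marginal times the old conditional.

```python
import itertools
import numpy as np

# Check: one IPF update on clique c = (0, 1) leaves p(x_{-c} | x_c) unchanged and
# replaces the model marginal p(x_c) by the observed marginal p_tilde(x_c).

cliques = [(0, 1), (1, 2)]
configs = list(itertools.product(range(2), repeat=3))
rng = np.random.default_rng(1)
psis = {c: {xc: float(rng.uniform(0.5, 2.0)) for xc in itertools.product(range(2), repeat=2)}
        for c in cliques}
p_tilde_01 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}   # assumed observed marginal on (X0, X1)

def joint():
    w = np.array([np.prod([psis[d][tuple(x[i] for i in d)] for d in cliques]) for x in configs])
    return dict(zip(configs, w / w.sum()))

p_old = joint()
marg_old = {xc: sum(p for x, p in p_old.items() if (x[0], x[1]) == xc) for xc in p_tilde_01}

# One IPF update on clique (0, 1)
for xc in psis[(0, 1)]:
    psis[(0, 1)][xc] *= p_tilde_01[xc] / marg_old[xc]
p_new = joint()

for x in configs:
    xc = (x[0], x[1])
    # new joint = observed marginal on x_c times the old conditional p_old(x2 | x0, x1)
    print(x, round(p_new[x], 4), round(p_tilde_01[xc] * p_old[x] / marg_old[xc], 4))
```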