Probabilistic Graphical Models 10-708
Learning Completely Observed Undirected Graphical Models
Eric Xing
Lecture 12, Oct 19, 2005
Reading: MJ-Chap. 9, 19, 20

Recap: MLE for BNs
• If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
  $$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) = \sum_i \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i)$$
• For tabular CPDs, the MLE is given by the normalized family counts:
  $$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{k'} n_{ijk'}}$$
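As a concrete illustration of the counting formula above, here is a minimal Python sketch (not from the lecture; the data layout and function name are my own) that estimates one tabular CPD from fully observed data by normalizing family counts.

```python
import numpy as np
from collections import defaultdict

# Hypothetical sketch: MLE for one tabular CPD p(x_i | x_pa(i)) from fully
# observed data, i.e. theta_ijk = n_ijk / sum_k' n_ijk'.
# `data` is an (N, num_nodes) integer array; `child` and `parents` index its columns.

def tabular_cpd_mle(data, child, parents):
    counts = defaultdict(lambda: defaultdict(float))  # counts[parent_config][child_value]
    for row in data:
        pa = tuple(row[p] for p in parents)
        counts[pa][row[child]] += 1.0
    cpd = {}
    for pa, child_counts in counts.items():
        total = sum(child_counts.values())             # sum_k' n_ijk'
        cpd[pa] = {k: v / total for k, v in child_counts.items()}
    return cpd

# Toy usage: node 2 has parents {0, 1}
data = np.array([[0, 1, 1], [0, 1, 0], [1, 0, 1], [0, 1, 1]])
print(tabular_cpd_mle(data, child=2, parents=[0, 1]))
```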
MLE for undirected graphical models
• For directed graphical models, the log-likelihood decomposes into a sum of terms, one per family (node plus parents).
• For undirected graphical models, the log-likelihood does not decompose, because the normalization constant Z is a function of all the parameters:
  $$P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(\mathbf{x}_c), \qquad Z = \sum_{x_1, \ldots, x_n} \prod_{c \in C} \psi_c(\mathbf{x}_c)$$
• In general, we will need to do inference (i.e., marginalization) to learn parameters for undirected models, even in the fully observed case.

Log Likelihood for UGMs with tabular clique potentials
• Sufficient statistics: for a UGM (V, E), the number of times that a configuration x (i.e., X_V = x) is observed in a dataset D = {x_1, ..., x_N} can be represented as follows:
  $$m(\mathbf{x}) \stackrel{\text{def}}{=} \sum_n \delta(\mathbf{x}, \mathbf{x}_n) \ \text{(total count)}, \qquad m(\mathbf{x}_c) \stackrel{\text{def}}{=} \sum_{\mathbf{x}_{V \setminus c}} m(\mathbf{x}) \ \text{(clique count)}$$
• In terms of the counts, the log likelihood is given by:
  $$p(D \mid \theta) = \prod_n \prod_{\mathbf{x}} p(\mathbf{x} \mid \theta)^{\delta(\mathbf{x}, \mathbf{x}_n)}$$
  $$\log p(D \mid \theta) = \sum_{\mathbf{x}} \sum_n \delta(\mathbf{x}, \mathbf{x}_n) \log p(\mathbf{x} \mid \theta)$$
  $$\ell = \sum_{\mathbf{x}} m(\mathbf{x}) \log \left( \frac{1}{Z} \prod_c \psi_c(\mathbf{x}_c) \right) = \sum_c \sum_{\mathbf{x}_c} m(\mathbf{x}_c) \log \psi_c(\mathbf{x}_c) - N \log Z$$
• There is a nasty log Z in the likelihood.
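To make the role of log Z concrete, the following sketch (a toy construction of mine, not course code) evaluates this log-likelihood for a tiny binary MRF, computing Z by brute-force enumeration, which is exactly the inference step that makes learning in undirected models expensive.

```python
import itertools
import numpy as np

# A minimal sketch (toy model and data assumed here): evaluate
#   l = sum_c sum_{x_c} m(x_c) log psi_c(x_c) - N log Z
# for a small binary MRF with tabular clique potentials.

def log_likelihood(psis, cliques, data, card=2):
    n_vars = data.shape[1]
    # log Z by summing over all joint configurations (only feasible for tiny models)
    logZ = np.log(sum(
        np.prod([psis[c][tuple(x[i] for i in c)] for c in cliques])
        for x in itertools.product(range(card), repeat=n_vars)
    ))
    ll = -data.shape[0] * logZ                   # the -N log Z term
    for x in data:                               # sum_c m(x_c) log psi_c(x_c), sample by sample
        for c in cliques:
            ll += np.log(psis[c][tuple(x[i] for i in c)])
    return ll

# Toy usage: chain X0 - X1 - X2 with cliques (0,1) and (1,2)
cliques = [(0, 1), (1, 2)]
psis = {c: {xc: 1.0 + xc[0] + xc[1] for xc in itertools.product(range(2), repeat=2)}
        for c in cliques}
data = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1]])
print(log_likelihood(psis, cliques, data))
```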
Derivative of the log-likelihood
• Log-likelihood:
  $$\ell = \sum_c \sum_{\mathbf{x}_c} m(\mathbf{x}_c) \log \psi_c(\mathbf{x}_c) - N \log Z$$
• First term:
  $$\frac{\partial}{\partial \psi_c(\mathbf{x}_c)} \sum_c \sum_{\mathbf{x}_c} m(\mathbf{x}_c) \log \psi_c(\mathbf{x}_c) = \frac{m(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
• Second term:
  $$\frac{\partial \log Z}{\partial \psi_c(\mathbf{x}_c)} = \frac{1}{Z}\, \frac{\partial}{\partial \psi_c(\mathbf{x}_c)} \sum_{\tilde{\mathbf{x}}} \prod_d \psi_d(\tilde{\mathbf{x}}_d)$$
  $$= \frac{1}{Z} \sum_{\tilde{\mathbf{x}}} \delta(\tilde{\mathbf{x}}_c, \mathbf{x}_c)\, \frac{\partial}{\partial \psi_c(\mathbf{x}_c)} \prod_d \psi_d(\tilde{\mathbf{x}}_d) \quad \text{(the delta clamps the clique variables to } \mathbf{x}_c\text{)}$$
  $$= \frac{1}{Z} \sum_{\tilde{\mathbf{x}}} \delta(\tilde{\mathbf{x}}_c, \mathbf{x}_c)\, \frac{1}{\psi_c(\mathbf{x}_c)} \prod_d \psi_d(\tilde{\mathbf{x}}_d)$$
  $$= \sum_{\tilde{\mathbf{x}}} \delta(\tilde{\mathbf{x}}_c, \mathbf{x}_c)\, \frac{p(\tilde{\mathbf{x}})}{\psi_c(\mathbf{x}_c)} = \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$

Conditions on Clique Marginals
• Derivative of the log-likelihood:
  $$\frac{\partial \ell}{\partial \psi_c(\mathbf{x}_c)} = \frac{m(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)} - N\, \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
• Hence, for the maximum likelihood parameters, we know that:
  $$p^{*}_{MLE}(\mathbf{x}_c) = \frac{m(\mathbf{x}_c)}{N} \stackrel{\text{def}}{=} \tilde{p}(\mathbf{x}_c)$$
• In other words, at the maximum likelihood setting of the parameters, for each clique, the model marginals must be equal to the observed marginals (empirical counts).
• This doesn't tell us how to get the ML parameters; it just gives us a condition that must be satisfied when we have them.
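The gradient formula can be checked numerically on a toy model; the sketch below (my own construction, with made-up potentials and data) compares the analytic derivative with respect to a single potential table entry against a finite difference of the log-likelihood.

```python
import itertools
import numpy as np

# Numerical check of  d l / d psi_c(x_c) = m(x_c)/psi_c(x_c) - N p(x_c)/psi_c(x_c)
# on a toy 3-variable binary chain (model and data assumed here, not from the lecture).

def joint(psis, cliques, n_vars, card=2):
    """All configurations and their normalized model probabilities (brute force)."""
    configs = list(itertools.product(range(card), repeat=n_vars))
    w = np.array([np.prod([psis[c][tuple(x[i] for i in c)] for c in cliques]) for x in configs])
    return configs, w / w.sum()

def loglik(psis, cliques, data, card=2):
    configs, p = joint(psis, cliques, data.shape[1], card)
    logp = dict(zip(configs, np.log(p)))
    return sum(logp[tuple(x)] for x in data)

cliques = [(0, 1), (1, 2)]
psis = {c: {xc: 1.0 + 0.5 * xc[0] + xc[1] for xc in itertools.product(range(2), repeat=2)}
        for c in cliques}
data = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]])

c, xc = (0, 1), (1, 1)                                   # perturb one entry psi_{01}(1,1)
m_xc = sum(1 for x in data if tuple(x[i] for i in c) == xc)             # empirical count m(x_c)
configs, p = joint(psis, cliques, data.shape[1])
p_xc = sum(pi for x, pi in zip(configs, p) if tuple(x[i] for i in c) == xc)  # model marginal p(x_c)
analytic = m_xc / psis[c][xc] - len(data) * p_xc / psis[c][xc]

base = loglik(psis, cliques, data)
eps = 1e-6
psis[c][xc] += eps
numeric = (loglik(psis, cliques, data) - base) / eps
print(analytic, numeric)   # the two values should agree closely
```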
MLE for undirected graphical models
• Is the graph decomposable (triangulated)?
• Are all the clique potentials defined on maximal cliques (not sub-cliques)? E.g., ψ_123, ψ_234, not ψ_12, ψ_23, ...
  [Figure: a 4-node graph parameterized with maximal-clique potentials {X1, X2, X3}, {X2, X3, X4} versus pairwise sub-clique potentials.]
• Are the clique potentials full tables (or Gaussians), or parameterized more compactly, e.g.
  $$\psi_c(\mathbf{x}_c) = \exp\left( \sum_k \theta_k f_k(\mathbf{x}_c) \right)?$$

  Decomposable?   Max clique?   Tabular?   Method
       √               √            √      Direct
       -               √            √      IPF
       -               -            -      Gradient
       -               -            -      GIS

MLE for decomposable undirected models
• Decomposable models: G is decomposable ⇔ G is triangulated ⇔ G has a junction tree.
• Potential-based representation:
  $$p(\mathbf{x}) = \frac{\prod_c \psi_c(\mathbf{x}_c)}{\prod_s \varphi_s(\mathbf{x}_s)}$$
• Consider a chain X1 − X2 − X3. The cliques are (X1, X2) and (X2, X3); the separator is X2. The empirical marginals must equal the model marginals.
• Let us guess that
  $$\hat{p}_{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2)\, \tilde{p}(x_2, x_3)}{\tilde{p}(x_2)}$$
• We can verify that such a guess satisfies the conditions:
  $$\hat{p}_{MLE}(x_1, x_2) = \sum_{x_3} \hat{p}_{MLE}(x_1, x_2, x_3) = \sum_{x_3} \tilde{p}(x_1, x_2)\, \tilde{p}(x_3 \mid x_2) = \tilde{p}(x_1, x_2)$$
  and similarly $\hat{p}_{MLE}(x_2, x_3) = \tilde{p}(x_2, x_3)$.
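As a quick numerical sanity check of this guess (toy data of my own choosing, not from the lecture), the sketch below builds p̂_MLE from empirical pairwise counts for the chain and confirms that its marginal over (X1, X2) reproduces the empirical marginal.

```python
import itertools
import numpy as np

# Build p_hat(x1,x2,x3) = p_tilde(x1,x2) * p_tilde(x2,x3) / p_tilde(x2) from counts
# and check that its pairwise marginals match the empirical ones (toy binary data).

data = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]])  # columns X1, X2, X3
N = len(data)

def empirical(cols):
    counts = {}
    for x in data:
        key = tuple(x[i] for i in cols)
        counts[key] = counts.get(key, 0.0) + 1.0 / N
    return counts

p12, p23, p2 = empirical([0, 1]), empirical([1, 2]), empirical([1])

p_hat = {}
for x in itertools.product(range(2), repeat=3):
    if (x[1],) in p2:
        p_hat[x] = p12.get((x[0], x[1]), 0.0) * p23.get((x[1], x[2]), 0.0) / p2[(x[1],)]
    else:
        p_hat[x] = 0.0

# The model marginal over (X1, X2) recovers the empirical marginal
for x1, x2 in itertools.product(range(2), repeat=2):
    marg = sum(p_hat[(x1, x2, x3)] for x3 in range(2))
    print((x1, x2), marg, p12.get((x1, x2), 0.0))
```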
MLE for decomposable undirected models (cont.)
• Let us guess that
  $$\hat{p}_{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2)\, \tilde{p}(x_2, x_3)}{\tilde{p}(x_2)}$$
• To compute the clique potentials, just equate them to the empirical marginals (or conditionals); i.e., the separator must be divided into one of its neighbors. Then Z = 1:
  $$\psi_{12}^{MLE}(x_1, x_2) = \tilde{p}(x_1, x_2), \qquad \psi_{23}^{MLE}(x_2, x_3) = \frac{\tilde{p}(x_2, x_3)}{\tilde{p}(x_2)} = \tilde{p}(x_3 \mid x_2)$$
• One more example:
  [Figure: a decomposable 4-node graph with maximal cliques {X1, X2, X3} and {X2, X3, X4}, separator {X2, X3}.]
  $$\hat{p}_{MLE}(x_1, x_2, x_3, x_4) = \frac{\tilde{p}(x_1, x_2, x_3)\, \tilde{p}(x_2, x_3, x_4)}{\tilde{p}(x_2, x_3)}$$
  $$\psi_{123}^{MLE}(x_1, x_2, x_3) = \frac{\tilde{p}(x_1, x_2, x_3)}{\tilde{p}(x_2, x_3)} = \tilde{p}(x_1 \mid x_2, x_3), \qquad \psi_{234}^{MLE}(x_2, x_3, x_4) = \tilde{p}(x_2, x_3, x_4)$$

Non-decomposable and/or with non-maximal clique potentials
• If the graph is non-decomposable, and/or the potentials are defined on non-maximal cliques (e.g., ψ_12, ψ_34), we cannot equate empirical marginals (or conditionals) to the MLEs of the clique potentials.
  [Figure: a non-decomposable 4-node cycle over X1, X2, X3, X4, parameterized with pairwise potentials.]
  $$p(x_1, x_2, x_3, x_4) = \frac{1}{Z} \prod_{\{i,j\}} \psi_{ij}(x_i, x_j)$$
  $$\exists\, (i, j) \ \text{s.t.}\ \psi_{ij}^{MLE}(x_i, x_j) \neq \begin{cases} \tilde{p}(x_i, x_j) \\ \tilde{p}(x_i, x_j)/\tilde{p}(x_i) \\ \tilde{p}(x_i, x_j)/\tilde{p}(x_j) \end{cases}$$
  (Homework!)
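The "divide the separator into one of its neighbors" recipe can also be verified numerically. The sketch below (random toy data, my own setup, not course code) sets ψ_123 = p̃(x1 | x2, x3) and ψ_234 = p̃(x2, x3, x4) for the two-clique example above and checks that the product of potentials already sums to 1, so Z = 1.

```python
import itertools
import numpy as np

# Toy check: for maximal cliques {X1,X2,X3}, {X2,X3,X4} with separator {X2,X3},
# dividing the separator marginal into one neighboring clique gives Z = 1.

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 4))          # columns X1, X2, X3, X4 (toy samples)

def emp(cols):
    counts = {}
    for x in data:
        key = tuple(int(x[i]) for i in cols)
        counts[key] = counts.get(key, 0.0) + 1.0 / len(data)
    return counts

p123, p234, p23 = emp([0, 1, 2]), emp([1, 2, 3]), emp([1, 2])

def psi_123(x1, x2, x3):
    # p_tilde(x1 | x2, x3); the default of 1.0 only guards unseen (x2, x3) configs
    return p123.get((x1, x2, x3), 0.0) / p23.get((x2, x3), 1.0)

def psi_234(x2, x3, x4):
    return p234.get((x2, x3, x4), 0.0)             # p_tilde(x2, x3, x4)

Z = sum(psi_123(x1, x2, x3) * psi_234(x2, x3, x4)
        for x1, x2, x3, x4 in itertools.product(range(2), repeat=4))
print(Z)   # ~1.0: the product of potentials is already normalized
```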
Iterative Proportional Fitting (IPF)
• From the derivative of the likelihood:
  $$\frac{\partial \ell}{\partial \psi_c(\mathbf{x}_c)} = \frac{m(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)} - N\, \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
  we can derive another relationship:
  $$\frac{\tilde{p}(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)} = \frac{p(\mathbf{x}_c)}{\psi_c(\mathbf{x}_c)}$$
  in which ψ_c appears implicitly in the model marginal p(x_c).
• This is therefore a fixed-point equation for ψ_c. Solving for ψ_c in closed form is hard, because it appears on both sides of this implicit nonlinear equation.
• The idea of IPF is to hold ψ_c fixed on the right-hand side (both in the numerator and denominator) and solve for it on the left-hand side. We cycle through all cliques, then iterate:
  $$\psi_c^{(t+1)}(\mathbf{x}_c) = \psi_c^{(t)}(\mathbf{x}_c)\, \frac{\tilde{p}(\mathbf{x}_c)}{p^{(t)}(\mathbf{x}_c)}$$
  (computing the model marginal p^(t)(x_c) requires inference)

Properties of IPF Updates
• IPF iterates a set of fixed-point equations.
• However, we can prove it is also a coordinate ascent algorithm (coordinates = parameters of clique potentials).
• Hence at each step, it will increase the log-likelihood, and it will converge to a global maximum.
• I-projection: it finds the distribution with the correct marginals that has maximal entropy.
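The following is a minimal IPF sketch (toy model and data of my own; the model marginal is computed by brute-force enumeration, standing in for proper inference). It runs the update above on a non-decomposable 4-cycle and checks that the model clique marginals approach the empirical ones.

```python
import itertools
import numpy as np

# IPF on a non-decomposable 4-cycle X0 - X1 - X3 - X2 - X0 with tabular edge potentials:
#   psi_c^{(t+1)}(x_c) = psi_c^{(t)}(x_c) * p_tilde(x_c) / p^{(t)}(x_c)

cliques = [(0, 1), (1, 3), (2, 3), (0, 2)]
configs = list(itertools.product(range(2), repeat=4))
pair_states = list(itertools.product(range(2), repeat=2))
psis = {c: {xc: 1.0 for xc in pair_states} for c in cliques}      # uniform initialization

data = np.array([[0, 0, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]])
emp = {c: {xc: 0.0 for xc in pair_states} for c in cliques}       # empirical clique marginals
for x in data:
    for c in cliques:
        emp[c][tuple(int(x[i]) for i in c)] += 1.0 / len(data)

def model_marginal(c):
    """p^{(t)}(x_c) by brute-force enumeration -- the inference step inside IPF."""
    w = np.array([np.prod([psis[d][tuple(x[i] for i in d)] for d in cliques]) for x in configs])
    w /= w.sum()
    marg = {xc: 0.0 for xc in pair_states}
    for x, wi in zip(configs, w):
        marg[tuple(x[i] for i in c)] += wi
    return marg

for sweep in range(50):                    # cycle through the cliques, then iterate
    for c in cliques:
        p_c = model_marginal(c)
        for xc in pair_states:
            if p_c[xc] > 0:
                psis[c][xc] *= emp[c][xc] / p_c[xc]

# After enough sweeps, the model clique marginals (approximately) match the empirical ones.
print({xc: round(model_marginal((0, 1))[xc], 3) for xc in pair_states})
print(emp[(0, 1)])
```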
KL Divergence View
• IPF can be seen as coordinate ascent in the likelihood by expressing the likelihood in terms of KL divergences.
• Recall that we have shown maximizing the log-likelihood is equivalent to minimizing the KL divergence (cross entropy) from the observed distribution to the model distribution:
  $$\max_\theta \ell(\theta) \;\Leftrightarrow\; \min_\theta KL\big(\tilde{p}(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \theta)\big) = \min_\theta \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \log \frac{\tilde{p}(\mathbf{x})}{p(\mathbf{x} \mid \theta)}$$
• Using a property of KL divergence based on the conditional chain rule p(x) = p(x_a) p(x_b | x_a):
  $$KL\big(q(x_a, x_b) \,\|\, p(x_a, x_b)\big) = \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_a)\, q(x_b \mid x_a)}{p(x_a)\, p(x_b \mid x_a)}$$
  $$= \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_a)}{p(x_a)} + \sum_{x_a, x_b} q(x_a)\, q(x_b \mid x_a) \log \frac{q(x_b \mid x_a)}{p(x_b \mid x_a)}$$
  $$= KL\big(q(x_a) \,\|\, p(x_a)\big) + \sum_{x_a} q(x_a)\, KL\big(q(x_b \mid x_a) \,\|\, p(x_b \mid x_a)\big)$$

IPF minimizes KL divergence
• Putting things together, we have
  $$KL\big(\tilde{p}(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \theta)\big) = KL\big(\tilde{p}(\mathbf{x}_c) \,\|\, p(\mathbf{x}_c \mid \theta)\big) + \sum_{\mathbf{x}_c} \tilde{p}(\mathbf{x}_c)\, KL\big(\tilde{p}(\mathbf{x}_{-c} \mid \mathbf{x}_c) \,\|\, p(\mathbf{x}_{-c} \mid \mathbf{x}_c, \theta)\big)$$
• It can be shown that changing the clique potential ψ_c has no effect on the conditional distribution p(x_{-c} | x_c), so the second term is unaffected.
• To minimize the first term, we set the model marginal to the observed marginal, just as in IPF.
• We can interpret IPF updates as retaining the "old" conditional probabilities p^(t)(x_{-c} | x_c) while replacing the "old" marginal probability p^(t)(x_c) with the observed marginal p̃(x_c).
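The interpretation in the last bullet can be verified numerically. The sketch below (a toy 3-variable chain with made-up potentials and an assumed observed marginal, indexed from 0) performs one IPF update on the clique (X0, X1) and checks that the new joint equals the observed marginal times the old conditional.

```python
import itertools
import numpy as np

# Check: one IPF update on clique c = (0, 1) leaves p(x_{-c} | x_c) unchanged and
# replaces the model marginal p(x_c) by the observed marginal p_tilde(x_c).

cliques = [(0, 1), (1, 2)]
configs = list(itertools.product(range(2), repeat=3))
rng = np.random.default_rng(1)
psis = {c: {xc: float(rng.uniform(0.5, 2.0)) for xc in itertools.product(range(2), repeat=2)}
        for c in cliques}
p_tilde_01 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}   # assumed observed marginal on (X0, X1)

def joint():
    w = np.array([np.prod([psis[d][tuple(x[i] for i in d)] for d in cliques]) for x in configs])
    return dict(zip(configs, w / w.sum()))

p_old = joint()
marg_old = {xc: sum(p for x, p in p_old.items() if (x[0], x[1]) == xc) for xc in p_tilde_01}

# One IPF update on clique (0, 1)
for xc in psis[(0, 1)]:
    psis[(0, 1)][xc] *= p_tilde_01[xc] / marg_old[xc]
p_new = joint()

for x in configs:
    xc = (x[0], x[1])
    # new joint = observed marginal on x_c times the old conditional p_old(x2 | x0, x1)
    print(x, round(p_new[x], 4), round(p_tilde_01[xc] * p_old[x] / marg_old[xc], 4))
```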