
Probabilistic Graphical Models - David Sontag, New York University



  1. Probabilistic Graphical Models. David Sontag, New York University. Lecture 13, May 2, 2013.

  2. Today: learning with partially observed data
     - Identifiability
     - Overview of the EM (expectation maximization) algorithm
     - Derivation of the EM algorithm
     - Application to mixture models
     - Variational EM
     - Application to learning the parameters of LDA

  3. Maximum likelihood. Recall from Lecture 10 that the density estimation approach to learning leads to maximizing the empirical log-likelihood
     $$\max_\theta \; \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log p(x; \theta)$$
     Suppose that our joint distribution is $p(X, Z; \theta)$, where our samples of $X$ are observed and the variables $Z$ are never observed in $\mathcal{D}$. That is,
     $$\mathcal{D} = \{(0, 1, 0, ?, ?, ?),\; (1, 1, 1, ?, ?, ?),\; (1, 1, 0, ?, ?, ?),\; \ldots\}$$
     Assume that the hidden variables are missing completely at random (otherwise, we should explicitly model why the values are missing).
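A minimal, hypothetical illustration of such a partially observed dataset in code, with the observed $X$ columns stored as values and the never-observed $Z$ columns marked as missing:

```python
import numpy as np

# Each row is one sample: three observed X variables followed by three hidden
# Z variables, marked with np.nan because they never appear in the data.
D = np.array([
    [0, 1, 0, np.nan, np.nan, np.nan],
    [1, 1, 1, np.nan, np.nan, np.nan],
    [1, 1, 0, np.nan, np.nan, np.nan],
])
observed = D[:, :3]        # the X part, which the likelihood can use directly
hidden = np.isnan(D)       # True wherever a value was never observed
print(observed)
print(hidden.sum(), "missing entries")
```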

  4. Identifiability. Suppose we had infinite training data. Is it even possible to uniquely identify the true parameters?
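Often it is not, because of symmetries such as label switching. As a small illustrative example (not from the slides), the sketch below shows two parameter settings of a two-component Bernoulli mixture that induce exactly the same distribution over the observed variable, so no amount of data can distinguish them:

```python
import numpy as np

def p_x(x, pi, mu):
    """Marginal p(x) = sum_z p(z) p(x | z) for a mixture of Bernoullis."""
    return sum(pi[k] * mu[k] ** x * (1 - mu[k]) ** (1 - x) for k in range(len(pi)))

theta_a = (np.array([0.3, 0.7]), np.array([0.9, 0.2]))  # (mixing weights, Bernoulli means)
theta_b = (np.array([0.7, 0.3]), np.array([0.2, 0.9]))  # same components with labels swapped

for x in (0, 1):
    print(x, p_x(x, *theta_a), p_x(x, *theta_b))  # identical marginals for both settings
```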

  5. Maximum likelihood. We can still use the same maximum likelihood approach. The objective that we are maximizing is
     $$\ell(\theta) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log \sum_z p(x, z; \theta)$$
     Because of the sum over $z$, there is no longer a closed-form solution for $\theta^*$ in the case of Bayesian networks. Furthermore, the objective is no longer convex, and can potentially have a different mode for every possible assignment $z$. One option is to apply (projected) gradient ascent to reach a local maximum of $\ell(\theta)$.
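As a concrete, hypothetical illustration of this objective, the sketch below evaluates $\ell(\theta)$ for the two-component Bernoulli mixture used above by explicitly summing out $z$ for each sample:

```python
import numpy as np

def marginal_log_likelihood(data, pi, mu):
    """ell(theta) = (1/|D|) sum_x log sum_z p(x, z; theta)
    for a two-component Bernoulli mixture with theta = (pi, mu)."""
    total = 0.0
    for x in data:
        # joint p(x, z) = pi[z] * mu[z]^x * (1 - mu[z])^(1 - x), summed over z
        total += np.log(sum(pi[k] * mu[k] ** x * (1 - mu[k]) ** (1 - x)
                            for k in range(len(pi))))
    return total / len(data)

data = np.array([0, 1, 1, 0, 1])
print(marginal_log_likelihood(data, np.array([0.3, 0.7]), np.array([0.9, 0.2])))
```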

  6. Expectation maximization. The expectation maximization (EM) algorithm is an alternative approach to reaching a local maximum of $\ell(\theta)$. It is particularly useful in settings where there exists a closed-form solution for $\theta^{ML}$ if we had fully observed data. For example, in Bayesian networks we have the closed-form ML solution
     $$\theta^{ML}_{x_i \mid x_{pa(i)}} = \frac{N_{x_i, x_{pa(i)}}}{\sum_{\hat{x}_i} N_{\hat{x}_i, x_{pa(i)}}}$$
     where $N_{x_i, x_{pa(i)}}$ is the number of times that the (partial) assignment $x_i, x_{pa(i)}$ is observed in the training data.
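A minimal sketch (illustrative; the function and variable names are mine, not the lecture's) of this count-based estimate for a single CPD with one parent:

```python
import numpy as np

def mle_cpd(child, parent, n_child_vals, n_parent_vals):
    """Estimate p(child | parent) from fully observed data by counting.
    Returns a (n_parent_vals, n_child_vals) table of conditional probabilities."""
    counts = np.zeros((n_parent_vals, n_child_vals))
    for c, p in zip(child, parent):
        counts[p, c] += 1.0
    # theta_{x_i | x_pa} = N_{x_i, x_pa} / sum over x_i of N_{x_i, x_pa}
    return counts / counts.sum(axis=1, keepdims=True)

child = np.array([0, 1, 1, 0, 1, 1])
parent = np.array([0, 0, 1, 1, 1, 1])
print(mle_cpd(child, parent, n_child_vals=2, n_parent_vals=2))
```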

  7. Expectation maximization. The algorithm is as follows:
     1. Write down the complete log-likelihood $\log p(x, z; \theta)$ in such a way that it is linear in $z$.
     2. Initialize $\theta^0$, e.g. at random or using a good first guess.
     3. Repeat until convergence:
        $$\theta^{t+1} = \arg\max_\theta \sum_{m=1}^M E_{p(z^m \mid x^m; \theta^t)}\big[\log p(x^m, Z; \theta)\big]$$
     Notice that $\log p(x^m, Z; \theta)$ is a random function because $Z$ is unknown. By linearity of expectation, the objective decomposes into expectation terms and data terms. The "E" step corresponds to computing the objective (i.e., the expectations); the "M" step corresponds to maximizing the objective.
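A schematic EM loop in code, as a sketch only: the model-specific pieces (the E-step expectations under $p(z \mid x; \theta^t)$ and the M-step maximizer) are hypothetical callables supplied by the user.

```python
import numpy as np

def em(theta0, e_step, m_step, data, n_iters=100, tol=1e-6):
    """Generic EM skeleton (illustrative, not the lecture's notation).
    e_step(theta, data): expected sufficient statistics under p(z | x; theta).
    m_step(stats, data): the theta maximizing the expected complete log-likelihood."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(n_iters):
        stats = e_step(theta, data)        # "E" step: compute the expectations
        new_theta = m_step(stats, data)    # "M" step: maximize the expected objective
        converged = np.max(np.abs(np.asarray(new_theta) - theta)) < tol
        theta = np.asarray(new_theta, dtype=float)
        if converged:
            break
    return theta
```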

  8. Derivation of EM algorithm. [Figure, from a tutorial by Sean Borman: the log-likelihood $L(\theta)$ together with the lower bound $l(\theta \mid \theta^n)$, which touches it at $\theta^n$ so that $L(\theta^n) = l(\theta^n \mid \theta^n)$; maximizing the bound over $\theta$ yields the next iterate $\theta^{n+1}$, at which $L(\theta^{n+1}) \ge l(\theta^{n+1} \mid \theta^n)$.]
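For completeness, here is the standard derivation of the lower bound that the figure depicts (a sketch using Jensen's inequality, written in the figure's notation rather than anything stated on the slide):
     $$\begin{aligned}
     L(\theta) = \log p(x; \theta) &= \log \sum_z p(x, z; \theta) = \log \sum_z q(z) \, \frac{p(x, z; \theta)}{q(z)} \\
     &\ge \sum_z q(z) \log \frac{p(x, z; \theta)}{q(z)} \;=\; l(\theta \mid \theta^n) \quad \text{with } q(z) = p(z \mid x; \theta^n),
     \end{aligned}$$
     with equality at $\theta = \theta^n$, so any $\theta^{n+1}$ that increases $l(\cdot \mid \theta^n)$ cannot decrease $L$.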

  9. Application to mixture models. [Figure: two plate diagrams. Left, a mixture model: prior distribution $\theta$ over topics, topic $z_d$ of document $d$, topic-word distributions $\beta$, and words $w_{id}$, for $i = 1$ to $N$ and $d = 1$ to $D$. Right, latent Dirichlet allocation: Dirichlet hyperparameters $\alpha$, topic distribution $\theta_d$ for each document, topic $z_{id}$ of word $i$ of document $d$, topic-word distributions $\beta$, and words $w_{id}$.] The model on the left is a mixture model: a document is generated from a single topic. The model on the right (latent Dirichlet allocation) is an admixture model: a document is generated from a distribution over topics.

  10. EM for mixture models. [Figure: the mixture model plate diagram from the previous slide.] The complete likelihood is $p(w, Z; \theta, \beta) = \prod_{d=1}^D p(w_d, Z_d; \theta, \beta)$, where
     $$p(w_d, Z_d; \theta, \beta) = \theta_{Z_d} \prod_{i=1}^N \beta_{Z_d, w_{id}}$$
     Trick #1: re-write this as
     $$p(w_d, Z_d; \theta, \beta) = \prod_{k=1}^K \theta_k^{1[Z_d = k]} \prod_{i=1}^N \prod_{k=1}^K \beta_{k, w_{id}}^{1[Z_d = k]}$$
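A quick numerical sanity check (illustrative only; the random parameters are made up) that the indicator re-writing agrees with the direct form of the complete likelihood for one document:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 5, 4                          # topics, vocabulary size, words per document
theta = rng.dirichlet(np.ones(K))          # prior distribution over topics
beta = rng.dirichlet(np.ones(V), size=K)   # beta[k, w] = p(word w | topic k)
w_d = rng.integers(V, size=N)              # observed words of document d
z_d = 1                                    # a hypothetical topic assignment

direct = theta[z_d] * np.prod(beta[z_d, w_d])
indicator = np.prod([theta[k] ** (z_d == k) for k in range(K)]) * \
            np.prod([[beta[k, w] ** (z_d == k) for k in range(K)] for w in w_d])
print(np.isclose(direct, indicator))       # True: the two forms match
```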

  11. EM for mixture models. Thus, the complete log-likelihood is
     $$\log p(w, Z; \theta, \beta) = \sum_{d=1}^D \left[ \sum_{k=1}^K 1[Z_d = k] \log \theta_k + \sum_{i=1}^N \sum_{k=1}^K 1[Z_d = k] \log \beta_{k, w_{id}} \right]$$
     In the "E" step, we take the expectation of the complete log-likelihood with respect to $p(z \mid w; \theta^t, \beta^t)$, applying linearity of expectation, i.e.
     $$E_{p(z \mid w; \theta^t, \beta^t)}[\log p(w, z; \theta, \beta)] = \sum_{d=1}^D \left[ \sum_{k=1}^K p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^N \sum_{k=1}^K p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
     In the "M" step, we maximize this with respect to $\theta$ and $\beta$.

  12. EM for mixture models. Just as with complete data, this maximization can be done in closed form. First, re-write the expected complete log-likelihood from
     $$\sum_{d=1}^D \left[ \sum_{k=1}^K p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^N \sum_{k=1}^K p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
     to
     $$\sum_{k=1}^K \left[ \sum_{d=1}^D p(Z_d = k \mid w_d; \theta^t, \beta^t) \right] \log \theta_k + \sum_{k=1}^K \sum_{w=1}^W \left[ \sum_{d=1}^D N_{dw} \, p(Z_d = k \mid w_d; \theta^t, \beta^t) \right] \log \beta_{k, w}$$
     where $N_{dw}$ is the number of times word $w$ appears in document $d$. We then have that
     $$\theta^{t+1}_k = \frac{\sum_{d=1}^D p(Z_d = k \mid w_d; \theta^t, \beta^t)}{\sum_{\hat{k}=1}^K \sum_{d=1}^D p(Z_d = \hat{k} \mid w_d; \theta^t, \beta^t)}$$
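Putting the "E" and "M" steps together, below is a minimal sketch of EM for this mixture model over word-count data (my own variable names and initialization; the $\beta$ update, $\beta^{t+1}_{k,w} \propto \sum_d N_{dw}\, p(Z_d = k \mid w_d; \theta^t, \beta^t)$, follows from the same expected log-likelihood, and the posteriors are computed in log space for numerical stability):

```python
import numpy as np

def em_mixture_of_unigrams(N_dw, K, n_iters=50, seed=0):
    """EM for the mixture model on these slides (a sketch).
    N_dw: (D, W) matrix of word counts per document.
    Returns theta (K,) mixing weights and beta (K, W) topic-word distributions."""
    rng = np.random.default_rng(seed)
    D, W = N_dw.shape
    theta = np.full(K, 1.0 / K)
    beta = rng.dirichlet(np.ones(W), size=K)           # random initialization

    for _ in range(n_iters):
        # E step: responsibilities r[d, k] = p(Z_d = k | w_d; theta^t, beta^t)
        log_r = np.log(theta) + N_dw @ np.log(beta).T  # (D, K), unnormalized log posterior
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M step: closed-form updates from the expected complete log-likelihood
        theta = r.sum(axis=0) / r.sum()                # theta_k^{t+1}
        beta = r.T @ N_dw                              # beta_{k,w} propto sum_d N_dw * r[d,k]
        beta /= beta.sum(axis=1, keepdims=True)

    return theta, beta

# Tiny synthetic corpus: 4 documents over a 6-word vocabulary.
N_dw = np.array([[5, 3, 0, 0, 0, 1],
                 [4, 4, 1, 0, 0, 0],
                 [0, 0, 0, 5, 4, 2],
                 [0, 1, 0, 4, 5, 3]], dtype=float)
theta, beta = em_mixture_of_unigrams(N_dw, K=2)
print(theta)
```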

  13. Application to latent Dirichlet allocation. [Figure: the LDA plate diagram, with Dirichlet hyperparameters $\alpha$, topic distribution $\theta_d$ for each document, topic-word distributions $\beta$, topic $z_{id}$ of word $i$ of document $d$, and words $w_{id}$, for $i = 1$ to $N$ and $d = 1$ to $D$.] The parameters are $\alpha$ and $\beta$. Both $\theta_d$ and $z_d$ are unobserved. The difficulty here is that inference is intractable. One could use Monte Carlo methods to approximate the expectations.

  14. Variational EM. Mean-field is ideally suited for this type of approximate inference together with learning. Use the variational distribution
     $$q(\theta_d, z_d \mid \gamma_d, \phi_d) = q(\theta_d \mid \gamma_d) \prod_{n=1}^N q(z_{dn} \mid \phi_{dn})$$
     We then lower bound the log-likelihood using Jensen's inequality:
     $$\begin{aligned}
     \log p(w \mid \alpha, \beta) &= \sum_d \log \int \sum_{z_d} p(\theta_d, z_d, w_d \mid \alpha, \beta) \, d\theta_d \\
     &= \sum_d \log \int \sum_{z_d} q(\theta_d, z_d) \, \frac{p(\theta_d, z_d, w_d \mid \alpha, \beta)}{q(\theta_d, z_d)} \, d\theta_d \\
     &\ge \sum_d E_q[\log p(\theta_d, z_d, w_d \mid \alpha, \beta)] - E_q[\log q(\theta_d, z_d)].
     \end{aligned}$$
     Finally, we maximize the lower bound with respect to $\alpha$, $\beta$, and $q$.
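As a sketch of what the per-document variational "E" step then looks like in practice, here are the standard mean-field coordinate-ascent updates for $\gamma_d$ and $\phi_d$ from Blei, Ng, and Jordan's LDA paper (not derived on these slides; the function name and initialization are illustrative):

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(words, alpha, beta, n_iters=50):
    """Mean-field updates for one document (a sketch, following Blei et al. 2003).
    words: array of word indices w_{d1..dN}; alpha: (K,); beta: (K, V).
    Returns gamma (K,) and phi (N, K), the variational parameters for theta_d and z_d."""
    K, N = len(alpha), len(words)
    phi = np.full((N, K), 1.0 / K)
    gamma = alpha + N / K
    for _ in range(n_iters):
        # phi_{nk} propto beta[k, w_n] * exp(E_q[log theta_dk])
        phi = beta[:, words].T * np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha_k + sum_n phi_{nk}
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

# Hypothetical usage with random parameters:
rng = np.random.default_rng(0)
K, V = 3, 10
alpha = np.ones(K)
beta = rng.dirichlet(np.ones(V), size=K)
gamma, phi = variational_e_step(np.array([0, 3, 3, 7, 2]), alpha, beta)
print(gamma)
```

In variational EM for LDA, these per-document updates form the "E" step, and the "M" step then re-estimates $\beta$ (and optionally $\alpha$) from the resulting $\phi$ and $\gamma$.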
