Probabilistic Graphical Models
David Sontag, New York University
Lecture 13, May 2, 2013
Today: learning with partially observed data
- Identifiability
- Overview of the EM (expectation maximization) algorithm
- Derivation of the EM algorithm
- Application to mixture models
- Variational EM
- Application to learning the parameters of LDA
Maximum likelihood
- Recall from Lecture 10 that the density estimation approach to learning leads to maximizing the empirical log-likelihood
  $$\max_\theta \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log p(x; \theta)$$
- Suppose that our joint distribution is $p(X, Z; \theta)$, where the variables $X$ are observed in our samples and the variables $Z$ are never observed in $\mathcal{D}$
- That is, $\mathcal{D} = \{(0,1,0,?,?,?),\ (1,1,1,?,?,?),\ (1,1,0,?,?,?),\ \ldots\}$
- Assume that the hidden variables are missing completely at random (otherwise, we should explicitly model why the values are missing)
Identifiability
- Suppose we had infinite training data. Is it even possible to uniquely identify the true parameters?
Maximum likelihood
- We can still use the same maximum likelihood approach. The objective that we are maximizing is
  $$\ell(\theta) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log \sum_{z} p(x, z; \theta)$$
- Because of the sum over $z$, there is no longer a closed-form solution for $\theta^*$ in the case of Bayesian networks
- Furthermore, the objective is no longer convex, and can potentially have a different mode for every possible assignment to $z$
- One option is to apply (projected) gradient ascent to reach a local maximum of $\ell(\theta)$; evaluating this objective is sketched below
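To make the objective concrete, here is a minimal sketch (not from the slides) of evaluating $\ell(\theta)$ for a model with a discrete hidden variable, using log-sum-exp for numerical stability. The function `joint_log_prob` and the toy mixture-of-coins example are hypothetical placeholders, not part of the lecture.

```python
# A sketch of evaluating l(theta) = (1/|D|) sum_{x in D} log sum_z p(x, z; theta)
# for a model whose hidden variable z takes values in {0, ..., K-1}.
import numpy as np
from scipy.special import logsumexp

def marginal_log_likelihood(data, joint_log_prob, num_hidden_states):
    """Average of log sum_z p(x, z; theta) over the dataset."""
    total = 0.0
    for x in data:
        # log p(x, z; theta) for every value of the hidden variable z
        log_joint = np.array([joint_log_prob(x, z) for z in range(num_hidden_states)])
        total += logsumexp(log_joint)  # log sum_z p(x, z; theta), computed stably
    return total / len(data)

# Hypothetical toy model: a mixture of two biased coins.
pi = np.array([0.3, 0.7])   # mixing weights p(z)
p = np.array([0.2, 0.9])    # heads probability for each component
log_p_xz = lambda x, z: np.log(pi[z]) + x * np.log(p[z]) + (1 - x) * np.log(1 - p[z])
print(marginal_log_likelihood([0, 1, 1, 0, 1], log_p_xz, num_hidden_states=2))
```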
Expectation maximization
- The expectation maximization (EM) algorithm is an alternative approach to reaching a local maximum of $\ell(\theta)$
- It is particularly useful in settings where there would be a closed-form solution for $\theta_{\mathrm{ML}}$ if we had fully observed data
- For example, in Bayesian networks we have the closed-form ML solution
  $$\theta^{\mathrm{ML}}_{x_i \mid x_{\mathrm{pa}(i)}} = \frac{N_{x_i, x_{\mathrm{pa}(i)}}}{\sum_{\hat{x}_i} N_{\hat{x}_i, x_{\mathrm{pa}(i)}}},$$
  where $N_{x_i, x_{\mathrm{pa}(i)}}$ is the number of times that the (partial) assignment $x_i, x_{\mathrm{pa}(i)}$ is observed in the training data
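As a concrete illustration of this closed form, the following is a minimal sketch (not from the lecture) that estimates one CPT of a Bayesian network from fully observed samples by counting. The dictionary-of-samples representation and the variable names are hypothetical.

```python
# Sketch of theta^ML_{x_i | x_pa(i)} = N_{x_i, x_pa(i)} / sum_{x_i'} N_{x_i', x_pa(i)}
# for a single node of a Bayesian network, estimated from fully observed data.
from collections import Counter

def fit_cpt(samples, child, parents):
    """Return {(parent_assignment, child_value): ML probability estimate}."""
    counts = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
    parent_totals = Counter()
    for (pa, _), n in counts.items():
        parent_totals[pa] += n                      # sum over x_i for each x_pa(i)
    return {key: n / parent_totals[key[0]] for key, n in counts.items()}

# Hypothetical toy data: three samples of a two-variable network x1 -> x2.
data = [{"x1": 0, "x2": 1}, {"x1": 0, "x2": 0}, {"x1": 0, "x2": 1}]
print(fit_cpt(data, child="x2", parents=["x1"]))    # {((0,), 1): 0.67, ((0,), 0): 0.33}
```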
Expectation maximization
The algorithm is as follows:
1. Write down the complete log-likelihood $\log p(x, z; \theta)$ in such a way that it is linear in $z$
2. Initialize $\theta^0$, e.g. at random or using a good first guess
3. Repeat until convergence:
   $$\theta^{t+1} = \arg\max_\theta \sum_{m=1}^{M} \mathbb{E}_{p(z^m \mid x^m; \theta^t)}\big[\log p(x^m, Z; \theta)\big]$$
- Notice that $\log p(x^m, Z; \theta)$ is a random function because $Z$ is unknown
- By linearity of expectation, the objective decomposes into expectation terms and data terms
- The "E" step corresponds to computing the objective (i.e., the expectations)
- The "M" step corresponds to maximizing the objective; a generic version of this loop is sketched below
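The update above can be organized as a generic loop. Here is a minimal sketch (not from the lecture), where `e_step` and `m_step` are hypothetical model-specific callbacks supplied by the user, and for simplicity the parameters are represented as a single NumPy array; the mixture-model case is worked out on the later slides.

```python
# Sketch of the generic EM loop:
#   theta^{t+1} = argmax_theta sum_m E_{p(z^m | x^m; theta^t)}[log p(x^m, Z; theta)]
# e_step(x, theta)         -> posterior p(z | x; theta) for one data point
# m_step(data, posteriors) -> theta maximizing the expected complete log-likelihood
import numpy as np

def em(data, theta0, e_step, m_step, max_iters=100, tol=1e-6):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        # "E" step: compute the expectations that define the objective
        posteriors = [e_step(x, theta) for x in data]
        # "M" step: maximize the expected complete log-likelihood (closed form)
        new_theta = np.asarray(m_step(data, posteriors), dtype=float)
        if np.allclose(new_theta, theta, atol=tol):
            break
        theta = new_theta
    return theta
```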
Derivation of EM algorithm
[Figure: the log-likelihood $L(\theta)$ and a lower bound $l(\theta \mid \theta^n)$ that touches it at $\theta^n$, i.e. $L(\theta^n) = l(\theta^n \mid \theta^n)$; the values $L(\theta^{n+1})$ and $l(\theta^{n+1} \mid \theta^n)$ are marked at the bound's maximizer $\theta^{n+1}$. Figure from tutorial by Sean Borman.]
Application to mixture models
[Figures: plate diagrams of two models. Left: a mixture model with a prior distribution $\theta$ over topics, topic-word distributions $\beta$, the topic $z_d$ of document $d$, and words $w_{id}$, for $i = 1$ to $N$ and $d = 1$ to $D$. Right: latent Dirichlet allocation, with Dirichlet hyperparameters $\alpha$, a topic distribution $\theta_d$ for each document, topic-word distributions $\beta$, the topic $z_{id}$ of word $i$ of document $d$, and words $w_{id}$.]
- The model on the left is a mixture model: each document is generated from a single topic
- The model on the right (latent Dirichlet allocation) is an admixture model: each document is generated from a distribution over topics
EM for mixture models
[Figure: plate diagram of the mixture model, with a prior distribution $\theta$ over topics, topic-word distributions $\beta$, the topic $z_d$ of document $d$, and words $w_{id}$, for $i = 1$ to $N$ and $d = 1$ to $D$.]
- The complete likelihood is $p(w, Z; \theta, \beta) = \prod_{d=1}^{D} p(w_d, Z_d; \theta, \beta)$, where
  $$p(w_d, Z_d; \theta, \beta) = \theta_{Z_d} \prod_{i=1}^{N} \beta_{Z_d, w_{id}}$$
- Trick #1: re-write this as
  $$p(w_d, Z_d; \theta, \beta) = \prod_{k=1}^{K} \theta_k^{1[Z_d = k]} \prod_{i=1}^{N} \prod_{k=1}^{K} \beta_{k, w_{id}}^{1[Z_d = k]}$$
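As a sanity check on this rewriting, here is a minimal sketch (not from the lecture) that evaluates $\log p(w_d, Z_d = k; \theta, \beta) = \log \theta_k + \sum_i \log \beta_{k, w_{id}}$, which is exactly what the indicator-function form reduces to when $Z_d = k$. The toy numbers are hypothetical.

```python
# Sketch: complete log-likelihood of one document with its topic fixed to k,
#   log p(w_d, Z_d = k; theta, beta) = log theta_k + sum_i log beta_{k, w_{id}}.
import numpy as np

def complete_log_likelihood(word_ids, k, theta, beta):
    """word_ids: (N_d,) word indices of document d; theta: (K,); beta: (K, W)."""
    return np.log(theta[k]) + np.log(beta[k, word_ids]).sum()

# Hypothetical toy example: K = 2 topics, W = 3 word types, a 4-word document.
theta = np.array([0.4, 0.6])
beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
print(complete_log_likelihood(np.array([0, 0, 2, 1]), k=0, theta=theta, beta=beta))
```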
EM for mixture models
- Thus, the complete log-likelihood is:
  $$\log p(w, Z; \theta, \beta) = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} 1[Z_d = k] \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} 1[Z_d = k] \log \beta_{k, w_{id}} \right]$$
- In the "E" step, we take the expectation of the complete log-likelihood with respect to $p(z \mid w; \theta^t, \beta^t)$, applying linearity of expectation, i.e.
  $$\mathbb{E}_{p(z \mid w; \theta^t, \beta^t)}\big[\log p(w, z; \theta, \beta)\big] = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
- In the "M" step, we maximize this with respect to $\theta$ and $\beta$
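Computing these expectations only requires the per-document posterior $p(Z_d = k \mid w_d; \theta^t, \beta^t) \propto \theta_k \prod_i \beta_{k, w_{id}}$. The following is a minimal sketch (not from the lecture) of that "E" step, computed in log space; the variable names are hypothetical.

```python
# Sketch of the "E" step for the mixture model: the posterior over the topic of
# document d, p(Z_d = k | w_d; theta, beta) ∝ theta_k * prod_i beta_{k, w_{id}},
# computed in log space and then normalized over k.
import numpy as np
from scipy.special import logsumexp

def topic_posterior(word_ids, theta, beta):
    """word_ids: (N_d,) word indices; theta: (K,); beta: (K, W). Returns (K,)."""
    # log theta_k + sum_i log beta_{k, w_{id}}, for every topic k at once
    log_post = np.log(theta) + np.log(beta[:, word_ids]).sum(axis=1)
    return np.exp(log_post - logsumexp(log_post))   # normalize over topics k
```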
EM for mixture models
- Just as with complete data, this maximization can be done in closed form
- First, re-write the expected complete log-likelihood from
  $$\sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
  to
  $$\sum_{k=1}^{K} \sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t) \log \theta_k + \sum_{k=1}^{K} \sum_{w=1}^{W} \sum_{d=1}^{D} N_{dw}\, p(Z_d = k \mid w_d; \theta^t, \beta^t) \log \beta_{k, w},$$
  where $N_{dw}$ is the number of times word $w$ appears in document $d$
- We then have that
  $$\theta_k^{t+1} = \frac{\sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t)}{\sum_{\hat{k}=1}^{K} \sum_{d=1}^{D} p(Z_d = \hat{k} \mid w_d; \theta^t, \beta^t)}$$
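Here is a minimal sketch (not from the lecture) of this "M" step, given the responsibilities $p(Z_d = k \mid w_d; \theta^t, \beta^t)$ from the E step and the count matrix $N_{dw}$. The slide states the $\theta$ update explicitly; the analogous $\beta$ update below (normalize the responsibility-weighted word counts over $w$) follows from the same maximization and is my addition.

```python
# Sketch of the closed-form "M" step for the mixture model.
#   r[d, k] = p(Z_d = k | w_d; theta^t, beta^t)   (responsibilities from the E step)
#   N[d, w] = number of times word w appears in document d
import numpy as np

def m_step(responsibilities, doc_word_counts):
    """responsibilities: (D, K); doc_word_counts: (D, W). Returns (theta, beta)."""
    r, N = responsibilities, doc_word_counts
    theta = r.sum(axis=0) / r.sum()              # theta_k^{t+1}: normalize over topics
    beta = r.T @ N                               # (K, W): sum_d N_{dw} r_{dk}
    beta /= beta.sum(axis=1, keepdims=True)      # beta_{k,w}^{t+1}: normalize over words
    return theta, beta
```

Together with the `topic_posterior` sketch above, this plugs directly into the generic EM loop sketched earlier.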
Application to latent Dirichlet allocation
[Figure: plate diagram of LDA, with Dirichlet hyperparameters $\alpha$, a topic distribution $\theta_d$ for each document, topic-word distributions $\beta$, the topic $z_{id}$ of word $i$ of document $d$, and words $w_{id}$, for $i = 1$ to $N$ and $d = 1$ to $D$.]
- The parameters are $\alpha$ and $\beta$
- Both $\theta_d$ and $z_d$ are unobserved
- The difficulty here is that inference is intractable
- We could use Monte Carlo methods to approximate the expectations
Variational EM
- Mean-field is ideally suited for this type of approximate inference together with learning
- Use the variational distribution
  $$q(\theta_d, z_d \mid \gamma_d, \phi_d) = q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{dn} \mid \phi_{dn})$$
- We then lower bound the log-likelihood using Jensen's inequality:
  $$\log p(w \mid \alpha, \beta) = \sum_d \log \int \sum_{z_d} p(\theta_d, z_d, w_d \mid \alpha, \beta)\, d\theta_d$$
  $$= \sum_d \log \int \sum_{z_d} q(\theta_d, z_d)\, \frac{p(\theta_d, z_d, w_d \mid \alpha, \beta)}{q(\theta_d, z_d)}\, d\theta_d$$
  $$\geq \sum_d \Big( \mathbb{E}_q\big[\log p(\theta_d, z_d, w_d \mid \alpha, \beta)\big] - \mathbb{E}_q\big[\log q(\theta_d, z_d)\big] \Big)$$
- Finally, we maximize the lower bound with respect to $\alpha$, $\beta$, and $q$; the per-document updates are sketched below
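For concreteness, here is a minimal sketch (not from these slides) of the per-document part of the variational "E" step, using the standard mean-field coordinate-ascent updates from Blei, Ng, and Jordan's original LDA paper: $\phi_{nk} \propto \beta_{k, w_n} \exp(\psi(\gamma_k))$ and $\gamma_k = \alpha_k + \sum_n \phi_{nk}$. These updates are not derived in this lecture, and the variable names are hypothetical; the outer "M" step would then re-estimate $\alpha$ and $\beta$ from the resulting $\phi$'s and $\gamma$'s.

```python
# Sketch of mean-field coordinate ascent for one document d of an LDA model,
# fitting q(theta_d | gamma) * prod_n q(z_{dn} | phi_n) with alpha, beta held fixed.
import numpy as np
from scipy.special import digamma

def variational_e_step(word_ids, alpha, beta, num_iters=50):
    """word_ids: (N_d,) word indices; alpha: (K,); beta: (K, W). Returns (gamma, phi)."""
    K, N = beta.shape[0], len(word_ids)
    phi = np.full((N, K), 1.0 / K)            # q(z_{dn} = k) = phi[n, k]
    gamma = alpha + N / K                     # Dirichlet parameters of q(theta_d)
    for _ in range(num_iters):
        # phi_{nk} ∝ beta_{k, w_n} * exp(psi(gamma_k)), normalized over k
        phi = beta[:, word_ids].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha_k + sum_n phi_{nk}
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi
```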