Introduction To Machine Learning
David Sontag, New York University
Lecture 21, April 14, 2016
Expectation maximization

Algorithm is as follows:

1. Write down the complete log-likelihood log p(x, z; θ) in such a way that it is linear in z
2. Initialize θ^0, e.g. at random or using a good first guess
3. Repeat until convergence:
$$\theta^{t+1} = \arg\max_{\theta} \sum_{m=1}^{M} \mathbb{E}_{p(z^m \mid x^m; \theta^t)}\big[\log p(x^m, Z; \theta)\big]$$

Notice that log p(x^m, Z; θ) is a random function because Z is unknown
By linearity of expectation, objective decomposes into expectation terms and data terms
"E" step corresponds to computing the objective (i.e., the expectations)
"M" step corresponds to maximizing the objective
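As a reading aid, here is a minimal sketch of this loop in Python. The functions `e_step` and `m_step` are hypothetical callbacks standing in for the model-specific computations worked out in the following slides; they are not part of the lecture.

```python
def em(x, theta0, e_step, m_step, n_iters=100):
    """Generic EM loop (sketch). `e_step` and `m_step` are hypothetical,
    model-specific callbacks supplied by the user."""
    theta = theta0
    for _ in range(n_iters):
        # "E" step: compute the expectations that define the objective,
        # i.e. expected sufficient statistics under p(z | x; theta).
        expectations = e_step(x, theta)
        # "M" step: maximize the expected complete log-likelihood.
        theta = m_step(x, expectations)
    return theta
```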
Derivation of EM algorithm

[Figure: the log-likelihood L(θ) and its lower bound l(θ | θ^n) plotted against θ. The two curves touch at θ = θ^n, where L(θ^n) = l(θ^n | θ^n); maximizing the bound gives the next iterate θ^{n+1}, with l(θ^{n+1} | θ^n) ≤ L(θ^{n+1}). Figure from tutorial by Sean Borman.]
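The bound in the figure can be written explicitly. In the notation of Borman's tutorial (a standard construction, not spelled out on this slide), with L(θ) = log p(x; θ):

$$l(\theta \mid \theta^n) = L(\theta^n) + \sum_{z} p(z \mid x; \theta^n) \log \frac{p(x, z; \theta)}{p(z \mid x; \theta^n)\, p(x; \theta^n)}$$

Jensen's inequality gives L(θ) ≥ l(θ | θ^n) for all θ, with equality at θ = θ^n, so maximizing the bound at each iteration can never decrease the log-likelihood.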
Application to mixture models

[Graphical model: prior distribution θ over topics; topic of doc d, z_d; words w_id, i = 1 to N; topic-word distributions β; plate over d = 1 to D.]

This model is a type of (discrete) mixture model
Called multinomial naive Bayes (a word can appear multiple times)
Document is generated from a single topic
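A small sketch of this generative process for a single document; the values of K, the vocabulary size, N, the prior θ, and the topic-word distributions β below are made up for illustration and do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20                        # topics, vocabulary size, words per doc (assumed)
theta = np.ones(K) / K                     # prior distribution over topics (assumed uniform)
beta = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions, shape (K, V) (assumed)

z_d = rng.choice(K, p=theta)               # a single topic for the whole document
w_d = rng.choice(V, size=N, p=beta[z_d])   # every word is drawn from that one topic
```

Contrast this with LDA later in the lecture, where a fresh topic is drawn for every word.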
EM for mixture models

The complete likelihood is $p(w, Z; \theta, \beta) = \prod_{d=1}^{D} p(w_d, Z_d; \theta, \beta)$, where
$$p(w_d, Z_d; \theta, \beta) = \theta_{Z_d} \prod_{i=1}^{N} \beta_{Z_d,\, w_{id}}$$
Trick #1: re-write this as
$$p(w_d, Z_d; \theta, \beta) = \prod_{k=1}^{K} \theta_k^{1[Z_d = k]} \prod_{i=1}^{N} \prod_{k=1}^{K} \beta_{k,\, w_{id}}^{1[Z_d = k]}$$
EM for mixture models

Thus, the complete log-likelihood is:
$$\log p(w, Z; \theta, \beta) = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} 1[Z_d = k] \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} 1[Z_d = k] \log \beta_{k,\, w_{id}} \right]$$
In the "E" step, we take the expectation of the complete log-likelihood with respect to p(z | w; θ^t, β^t), applying linearity of expectation, i.e.
$$\mathbb{E}_{p(z \mid w; \theta^t, \beta^t)}[\log p(w, z; \theta, \beta)] = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k,\, w_{id}} \right]$$
In the "M" step, we maximize this with respect to θ and β
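The posterior appearing in these expectations is not written out on the slide, but it follows directly from Bayes' rule applied to the complete likelihood above:

$$p(Z_d = k \mid w_d; \theta^t, \beta^t) = \frac{\theta^t_k \prod_{i=1}^{N} \beta^t_{k,\, w_{id}}}{\sum_{k'=1}^{K} \theta^t_{k'} \prod_{i=1}^{N} \beta^t_{k',\, w_{id}}}$$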
EM for mixture models

Just as with complete data, this maximization can be done in closed form
First, re-write expected complete log-likelihood from
$$\sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k,\, w_{id}} \right]$$
to
$$\sum_{k=1}^{K} \log \theta_k \sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t) + \sum_{k=1}^{K} \sum_{w=1}^{W} \log \beta_{k, w} \sum_{d=1}^{D} N_{dw}\, p(Z_d = k \mid w_d; \theta^t, \beta^t)$$
We then have that
$$\theta^{t+1}_k = \frac{\sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t)}{\sum_{\hat{k}=1}^{K} \sum_{d=1}^{D} p(Z_d = \hat{k} \mid w_d; \theta^t, \beta^t)}$$
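A compact sketch of the full EM loop for this mixture, assuming the data arrive as a document-word count matrix `N_dw`. The θ update matches the slide; the analogous closed-form β update is filled in here for completeness and is my addition, not taken from the slide.

```python
import numpy as np

def em_multinomial_nb(N_dw, K, n_iters=50, seed=0):
    """EM for a multinomial naive Bayes mixture (sketch).
    N_dw: (D, W) array of word counts; returns theta (K,) and beta (K, W)."""
    rng = np.random.default_rng(seed)
    D, W = N_dw.shape
    theta = np.full(K, 1.0 / K)                       # uniform initialization
    beta = rng.dirichlet(np.ones(W), size=K)          # random initialization

    for _ in range(n_iters):
        # E step: posterior p(Z_d = k | w_d; theta^t, beta^t), computed in log space.
        log_post = np.log(theta)[None, :] + N_dw @ np.log(beta).T   # shape (D, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M step: closed-form updates.
        theta = post.sum(axis=0) / D                  # matches the slide's theta update
        beta = post.T @ N_dw + 1e-12                  # expected word counts per topic
                                                      # (small constant avoids log(0); assumed)
        beta /= beta.sum(axis=1, keepdims=True)       # analogous beta update (assumed)

    return theta, beta
```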
Latent Dirichlet allocation (LDA)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents
Many applications in information retrieval, document summarization, and classification

[Figure: a collection of documents grouped by topics such as religion, sports, and politics. For a new document ("What is this document about?"), the words w_1, ..., w_N induce a distribution over topics θ, e.g. weather .50, finance .49, sports .01.]

LDA is one of the simplest and most widely used topic models
Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):
$$\theta \sim \text{Dirichlet}(\alpha_{1:T})$$
where the $\{\alpha_t\}_{t=1}^{T}$ are fixed hyperparameters. Thus θ is a distribution over T topics with mean $\mathbb{E}[\theta_t] = \alpha_t / \sum_{t'} \alpha_{t'}$

2. For i = 1 to N, sample the topic z_i of the i'th word:
$$z_i \mid \theta \sim \theta$$

3. ... and then sample the actual word w_i from the z_i'th topic:
$$w_i \mid z_i \sim \beta_{z_i}$$
where $\{\beta_t\}_{t=1}^{T}$ are the topics (a fixed collection of distributions on words)
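The three steps above map directly onto a few lines of sampling code. The values of T, the vocabulary size, N, α, and β below are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, N = 3, 10, 20                          # topics, vocabulary size, words per doc (assumed)
alpha = np.full(T, 0.5)                      # Dirichlet hyperparameters (assumed)
beta = rng.dirichlet(np.ones(V), size=T)     # topic-word distributions, shape (T, V) (assumed)

theta = rng.dirichlet(alpha)                           # 1. topic distribution for this document
z = rng.choice(T, size=N, p=theta)                     # 2. topic z_i for each word
w = np.array([rng.choice(V, p=beta[zi]) for zi in z])  # 3. word w_i ~ beta_{z_i}
```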
Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):
$$\theta \sim \text{Dirichlet}(\alpha_{1:T})$$
where the $\{\alpha_t\}_{t=1}^{T}$ are hyperparameters. The Dirichlet density, defined over the simplex $\Delta = \{\theta \in \mathbb{R}^T : \theta_t \ge 0\ \forall t,\ \sum_{t=1}^{T} \theta_t = 1\}$, is:
$$p(\theta_1, \ldots, \theta_T) \propto \prod_{t=1}^{T} \theta_t^{\alpha_t - 1}$$

[Figure: two surface plots of log Pr(θ) over (θ_1, θ_2) for T = 3 (θ_3 = 1 − θ_1 − θ_2), for two different settings of α_1 = α_2 = α_3.]
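For concreteness, the unnormalized log-density in the formula above can be evaluated as follows; the example θ and α values are arbitrary and not taken from the figure.

```python
import numpy as np

def dirichlet_log_density_unnorm(theta, alpha):
    """Unnormalized Dirichlet log-density: sum_t (alpha_t - 1) * log(theta_t)."""
    theta, alpha = np.asarray(theta, float), np.asarray(alpha, float)
    assert np.all(theta > 0) and np.isclose(theta.sum(), 1.0)  # theta must lie in the simplex
    return np.sum((alpha - 1.0) * np.log(theta))

# T = 3, with theta_3 = 1 - theta_1 - theta_2 as on the slide (example values assumed)
print(dirichlet_log_density_unnorm([0.2, 0.3, 0.5], alpha=[2.0, 2.0, 2.0]))
```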
Generative model for a document in LDA

3. ... and then sample the actual word w_i from the z_i'th topic:
$$w_i \mid z_i \sim \beta_{z_i}$$
where $\{\beta_t\}_{t=1}^{T}$ are the topics (a fixed collection of distributions on words), i.e. $\beta_t = p(w \mid z = t)$

Example topics as distributions over words:
Topic 1: politics .0100, president .0095, obama .0090, washington .0085, religion .0060, ...
Topic 2: religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016, ...
Topic 3: sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045, ...
Example of using LDA

[Figure: topics β_1, ..., β_T as distributions over words (e.g. gene .04, dna .02, genetic .01; life .02, evolve .01, organism .01; brain .04, neuron .02, nerve .01; data .02, number .02, computer .01), a document with per-word topic assignments z_{1d}, ..., z_{Nd}, and its topic proportions θ_d.]

(Blei, Introduction to Probabilistic Topic Models, 2011)
"Plate" notation for LDA model

[Graphical model: Dirichlet hyperparameters α → topic distribution θ_d for document → topic z_id of word i of doc d → word w_id, with topic-word distributions β; inner plate over i = 1 to N, outer plate over d = 1 to D.]

Variables within a plate are replicated in a conditionally independent manner
Comparison of mixture and admixture models

[Left graphical model: prior distribution θ over topics → topic z_d of doc d → words w_id, with topic-word distributions β. Right graphical model (LDA): Dirichlet hyperparameters α → topic distribution θ_d for document → topic z_id of word i of doc d → word w_id, with topic-word distributions β. Both have plates over i = 1 to N and d = 1 to D.]

Model on left is a mixture model
Called multinomial naive Bayes (a word can appear multiple times)
Document is generated from a single topic

Model on right (LDA) is an admixture model
Document is generated from a distribution over topics