

  1. CS 533: Natural Language Processing
     Latent-Variable Generative Models and the Expectation Maximization (EM) Algorithm
     Karl Stratos, Rutgers University

  2. Motivation: Bag-Of-Words (BOW) Document Model
     ◮ Fixed-length documents x ∈ V^T
     ◮ BOW parameters: a word distribution p_W over V defining
           p_X(x) = \prod_{t=1}^{T} p_W(x_t)
     ◮ Model's generative story: every word in every document is generated independently.
     ◮ What if the true generative story underlying the data is different?
           x^{(1)} = (a, a, a, a, a, a, a, a, a, a)        V = {a, b}
           x^{(2)} = (b, b, b, b, b, b, b, b, b, b)        T = 10
     ◮ MLE: p_X(x^{(1)}) = p_X(x^{(2)}) = (1/2)^{10}
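A quick sketch (not from the slides) of the BOW likelihood on this toy corpus; the function name `bow_prob` and the data layout are illustrative assumptions.

```python
from math import prod

def bow_prob(x, p_W):
    """BOW likelihood: every word of the document is drawn independently from p_W."""
    return prod(p_W[w] for w in x)

# Toy corpus from the slide: V = {a, b}, T = 10.
x1 = ["a"] * 10
x2 = ["b"] * 10

# The MLE word distribution on this corpus is uniform over {a, b}.
p_W = {"a": 0.5, "b": 0.5}

print(bow_prob(x1, p_W))  # (1/2)^10 ≈ 0.000977
print(bow_prob(x2, p_W))  # (1/2)^10 ≈ 0.000977
```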

  3. Latent-Variable BOW (LV-BOW) Document Model
     ◮ LV-BOW parameters:
       ◮ p_Z: "topic" distribution over {1 ... K}
       ◮ p_{W|Z}: conditional word distribution over V defining
             p_{X|Z}(x | z) = \prod_{t=1}^{T} p_{W|Z}(x_t | z)    for all z ∈ {1 ... K}
             p_X(x) = \sum_{z=1}^{K} p_Z(z) × p_{X|Z}(x | z)
     ◮ Model's generative story: for each document, a topic is generated, and conditioned on that topic the words are generated independently.

  4. Back to the Example
           x^{(1)} = (a, a, a, a, a, a, a, a, a, a)        V = {a, b}
           x^{(2)} = (b, b, b, b, b, b, b, b, b, b)        T = 10
     ◮ K = 2 with p_Z(1) = p_Z(2) = 1/2
     ◮ p_{W|Z}(a | 1) = p_{W|Z}(b | 2) = 1
     ◮ p_X(x^{(1)}) = p_X(x^{(2)}) = 1/2 ≫ (1/2)^{10}
     Key idea: introduce a latent variable Z to model the true generative process more faithfully.
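A small sketch, mine rather than the slides', of the LV-BOW marginal likelihood from slide 3, checked on this toy example; the function and variable names are illustrative.

```python
from math import prod

def lv_bow_prob(x, p_Z, p_W_given_Z):
    """Marginal likelihood p_X(x) = sum_z p_Z(z) * prod_t p_{W|Z}(x_t | z)."""
    return sum(p_Z[z] * prod(p_W_given_Z[z][w] for w in x) for z in p_Z)

# Parameters from the slide: K = 2, uniform topics, deterministic word distributions.
p_Z = {1: 0.5, 2: 0.5}
p_W_given_Z = {1: {"a": 1.0, "b": 0.0}, 2: {"a": 0.0, "b": 1.0}}

print(lv_bow_prob(["a"] * 10, p_Z, p_W_given_Z))  # 0.5, much larger than (1/2)^10
print(lv_bow_prob(["b"] * 10, p_Z, p_W_given_Z))  # 0.5
```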

  5. The Latent-Variable Generative Model Paradigm
     Model. Joint distribution over X and Z:
           p_{XZ}(x, z) = p_Z(z) × p_{X|Z}(x | z)
     Learning. We don't observe Z!
           max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
     where the inner sum \sum_{z ∈ Z} p_{XZ}(x, z) is the marginal p_X(x).

  6. The Learning Problem
     ◮ How can we solve
           max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
     ◮ Specifically for LV-BOW: given N documents x^{(1)} ... x^{(N)} ∈ V^T, how can we learn a topic distribution p_Z and a conditional word distribution p_{W|Z} that maximize
           \sum_{i=1}^{N} log \sum_{z ∈ Z} p_Z(z) × \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t | z)
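A small sketch (mine, not the slides') of this objective for LV-BOW; it is also the quantity EM can only increase (slide 22), so it makes a convenient convergence check. The function name is an illustrative assumption.

```python
from math import log, prod

def lv_bow_log_likelihood(docs, p_Z, p_W_given_Z):
    """Objective: sum_i log sum_z p_Z(z) * prod_t p_{W|Z}(x_t | z).

    Assumes all relevant probabilities are nonzero; a real implementation would
    smooth zeros and work in log space to avoid underflow on long documents.
    """
    total = 0.0
    for x in docs:
        total += log(sum(p_Z[z] * prod(p_W_given_Z[z][w] for w in x) for z in p_Z))
    return total
```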

  7. A Proposed Algorithm
     1. Initialize p_Z and p_{W|Z} as random distributions.
     2. Repeat until convergence:
        2.1 For i = 1 ... N, compute the conditional posterior distribution
              p_{Z|X}(z | x^{(i)}) = \frac{ p_Z(z) × \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t | z) }{ \sum_{z'=1}^{K} p_Z(z') × \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t | z') }
        2.2 Update the model parameters by
              p_Z(z) = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) }{ \sum_{z'=1}^{K} \sum_{i=1}^{N} p_{Z|X}(z' | x^{(i)}) }
              p_{W|Z}(w | z) = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) × count(w | x^{(i)}) }{ \sum_{w' ∈ V} \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) × count(w' | x^{(i)}) }
        where count(w | x^{(i)}) is the number of times w ∈ V appears in x^{(i)}.
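The code on slide 8 is not reproduced in this transcript; below is a minimal sketch of my own. It assumes documents are short enough that direct products do not underflow (a real implementation would work in log space), and names such as `em_lv_bow` are illustrative.

```python
import random
from collections import Counter
from math import prod

def em_lv_bow(docs, vocab, K, iters=50, seed=0):
    """EM for LV-BOW: alternate posterior computation (step 2.1) and count-based updates (step 2.2)."""
    rng = random.Random(seed)

    def normalize(weights):
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}

    # 1. Initialize p_Z and p_{W|Z} as random distributions.
    p_Z = normalize({z: rng.random() for z in range(K)})
    p_W = {z: normalize({w: rng.random() for w in vocab}) for z in range(K)}

    counts = [Counter(x) for x in docs]  # count(w | x^(i)) for each document

    for _ in range(iters):
        # 2.1 Conditional posterior p_{Z|X}(z | x^(i)) for every document.
        posteriors = []
        for c in counts:
            joint = {z: p_Z[z] * prod(p_W[z][w] ** n for w, n in c.items()) for z in range(K)}
            posteriors.append(normalize(joint))

        # 2.2 Re-estimate the parameters from posterior-weighted counts.
        p_Z = normalize({z: sum(q[z] for q in posteriors) for z in range(K)})
        p_W = {z: normalize({w: sum(q[z] * c[w] for q, c in zip(posteriors, counts)) for w in vocab})
               for z in range(K)}

    return p_Z, p_W
```

On the toy corpus, `em_lv_bow([["a"] * 10, ["b"] * 10], {"a", "b"}, K=2)` typically converges to the two-topic solution of slide 4, although, as the following slides illustrate, a bad initialization can get stuck in a local optimum.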

  8. Code

  9. Code in Action

  10. Code in Action: Bad Initialization

  11. Another Example (figure: parameter values at initialization and after convergence)

  12. Again Possible to Get Stuck in a Local Optimum (figure: parameter values at initialization and after convergence)

  13. Why Does It Work?
      ◮ A special case of the expectation maximization (EM) algorithm, adapted for LV-BOW
      ◮ EM is an extremely important and general concept
      ◮ Another special case: variational autoencoders (VAEs, next class)

  14. Setting
      ◮ Original problem: difficult to optimize (nonconvex)
            max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      ◮ Alternative problem: easy to optimize (often concave)
            max_{p_{XZ}}  E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ]
        where q_{Z|X} is some arbitrary posterior distribution that is easy to compute

  15. Solving the Alternative Problem
      ◮ Many of the models we have considered (LV-BOW, HMM, PCFG) can be written as
            p_{XZ}(x, z) = \prod_{(τ, a) ∈ E} p_τ(a)^{count_τ(a | x, z)}
      ◮ E is a set of possible event type-value pairs.
      ◮ count_τ(a | x, z) is the number of times the event τ = a happens in (x, z).
      ◮ The model has a distribution p_τ over the possible values of each type τ.
      ◮ Examples:
            p_{XZ}((a, a, a, b, b), 2) = p_Z(2) × p_{W|Z}(a | 2)^3 × p_{W|Z}(b | 2)^2    (LV-BOW)
            p_{XZ}((La, La, La), (N, N, N)) = o(La | N)^3 × t(N | ∗) × t(N | N)^2 × t(STOP | N)    (HMM)
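A small sketch (mine, not the slides') of this event-based factorization for the LV-BOW case; the event-type labels and function names are made up for illustration.

```python
from collections import Counter

def lv_bow_events(x, z):
    """Event counts for LV-BOW: one 'topic' event plus one 'word given topic' event per token."""
    counts = Counter()
    counts[("topic", z)] += 1
    for w in x:
        counts[("word|topic", (w, z))] += 1
    return counts

def joint_prob_from_events(event_counts, params):
    """p_XZ(x, z) = product over event (type, value) pairs of p_tau(value) ** count."""
    prob = 1.0
    for (tau, a), n in event_counts.items():
        prob *= params[(tau, a)] ** n
    return prob

# Slide example: p_XZ((a, a, a, b, b), 2) = p_Z(2) * p_{W|Z}(a|2)^3 * p_{W|Z}(b|2)^2
params = {("topic", 2): 0.5, ("word|topic", ("a", 2)): 0.7, ("word|topic", ("b", 2)): 0.3}
print(joint_prob_from_events(lv_bow_events(["a", "a", "a", "b", "b"], 2), params))
# 0.5 * 0.7**3 * 0.3**2 ≈ 0.0154
```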

  16. Closed-Form Solution
      If x^{(1)} ... x^{(N)} ∼ pop_X are iid samples,
            max_{p_{XZ}}  E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ]
          ≈ max_{p_{XZ}}  \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) log p_{XZ}(x^{(i)}, z)
          = max_{p_τ}  \sum_{i=1}^{N} \sum_{z} \sum_{(τ,a) ∈ E} q_{Z|X}(z | x^{(i)}) count_τ(a | x^{(i)}, z) log p_τ(a)
          = max_{p_τ}  \sum_{(τ,a) ∈ E} [ \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) count_τ(a | x^{(i)}, z) ] log p_τ(a)
      MLE solution!
            p_τ(a) = \frac{ \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) count_τ(a | x^{(i)}, z) }{ \sum_{a'} \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) count_τ(a' | x^{(i)}, z) }
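A generic sketch of this weighted-count solution (mine, not the slides'), parameterized by an event counter such as the hypothetical `lv_bow_events` above; all names are illustrative.

```python
from collections import defaultdict

def weighted_count_mle(docs, posteriors, latent_values, event_counter):
    """Closed-form update: p_tau(a) is proportional to the posterior-weighted count of (tau, a).

    `event_counter(x, z)` must return a dict mapping (event type, value) pairs to counts,
    e.g. the lv_bow_events sketch above; `posteriors[i][z]` plays the role of q_{Z|X}(z | x^(i)).
    """
    weighted = defaultdict(float)
    for x, q in zip(docs, posteriors):
        for z in latent_values:
            for (tau, a), n in event_counter(x, z).items():
                weighted[(tau, a)] += q[z] * n

    # Normalize separately within each event type tau.
    totals = defaultdict(float)
    for (tau, a), w in weighted.items():
        totals[tau] += w
    return {(tau, a): w / totals[tau] for (tau, a), w in weighted.items()}
```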

  17. This is How We Derived the LV-BOW EM Updates
      Using q_{Z|X} = p_{Z|X}:
            p_Z(z) = \frac{ \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z | x^{(i)}, z') }{ \sum_{z''} \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z'' | x^{(i)}, z') }
                   = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) }{ \sum_{z''} \sum_{i=1}^{N} p_{Z|X}(z'' | x^{(i)}) }
            p_{W|Z}(w | z) = \frac{ \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z, w | x^{(i)}, z') }{ \sum_{w' ∈ V} \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z, w' | x^{(i)}, z') }
                           = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) count(w | x^{(i)}) }{ \sum_{w' ∈ V} \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) count(w' | x^{(i)}) }

  18. Game Plan
      ◮ So we have established that it is often easy to solve the alternative problem
            max_{p_{XZ}}  E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ]
        where q_{Z|X} is any posterior distribution that is easy to compute.
      ◮ On the next slide, we relate the original log-likelihood objective to this quantity.

  19. ELBO: Evidence Lower Bound
      For any q_{Z|X} we define
            ELBO(p_{XZ}, q_{Z|X}) = E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ] + H(q_{Z|X})
      where H(q_{Z|X}) = E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ − log q_{Z|X}(z | x) ].
      Claim. For all q_{Z|X},
            ELBO(p_{XZ}, q_{Z|X}) ≤ E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      with equality iff q_{Z|X} = p_{Z|X}. (Proof on board)
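The proof was given on the board; the following is my reconstruction of the standard argument, using only the ELBO definition above and the factorization p_{XZ}(x, z) = p_X(x) p_{Z|X}(z | x).

```latex
\begin{align*}
\mathrm{ELBO}(p_{XZ}, q_{Z|X})
  &= \mathop{\mathbb{E}}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot \mid x)}}
     \left[ \log \frac{p_{XZ}(x, z)}{q_{Z|X}(z \mid x)} \right] \\
  &= \mathop{\mathbb{E}}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot \mid x)}}
     \left[ \log \frac{p_X(x)\, p_{Z|X}(z \mid x)}{q_{Z|X}(z \mid x)} \right] \\
  &= \mathop{\mathbb{E}}_{x \sim \mathrm{pop}_X}\!\left[ \log p_X(x) \right]
     \;-\; \mathop{\mathbb{E}}_{x \sim \mathrm{pop}_X}\!\left[
       D_{\mathrm{KL}}\!\left( q_{Z|X}(\cdot \mid x) \,\middle\|\, p_{Z|X}(\cdot \mid x) \right) \right].
\end{align*}
% The KL term is nonnegative and equals zero iff q_{Z|X} = p_{Z|X},
% so ELBO(p_{XZ}, q_{Z|X}) <= E[log p_X(x)] with equality iff q_{Z|X} = p_{Z|X}.
```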

  20. EM: Coordinate Ascent on ELBO
      Input: sampling access to pop_X, definition of p_{XZ}
      Output: local optimum of
            max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            q_{Z|X} ← arg max_{\bar{q}_{Z|X}} ELBO(p_{XZ}, \bar{q}_{Z|X})    (E-step)
            p_{XZ} ← arg max_{\bar{p}_{XZ}} ELBO(\bar{p}_{XZ}, q_{Z|X})    (M-step)
      3. Return p_{XZ}

  21. Equivalently
      Input: sampling access to pop_X, definition of p_{XZ}
      Output: local optimum of
            max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            p_{XZ} ← arg max_{\bar{p}_{XZ}}  E_{x ∼ pop_X, z ∼ p_{Z|X}(·|x)} [ log \bar{p}_{XZ}(x, z) ]
      3. Return p_{XZ}

  22. EM Can Only Increase the Objective (Or Leave It Unchanged)
      (figure: bars for one EM iteration: ELBO(p_{XZ}, q_{Z|X}) ≤ LL(p_{XZ}); the E-step closes the gap, LL(p_{XZ}) = ELBO(p_{XZ}, p_{Z|X}); the M-step then gives LL(p'_{XZ}) ≥ ELBO(p'_{XZ}, q'_{Z|X}) ≥ ELBO(p_{XZ}, p_{Z|X}) = LL(p_{XZ}))
      where
            LL(p_{XZ}) = E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
            ELBO(p_{XZ}, q_{Z|X}) = E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ] + H(q_{Z|X})
                                  = LL(p_{XZ}) − D_KL(q_{Z|X} || p_{Z|X})

  23. EM Can Only Increase the Objective (Or Leave It Unchanged)
      (figure from https://media.nature.com/full/nature-assets/nbt/journal/v26/n8/extref/nbt1406-S1.pdf)

  24. Sample Version
      Input: N iid samples from pop_X, definition of p_{XZ}
      Output: local optimum of
            max_{p_{XZ}}  (1/N) \sum_{i=1}^{N} log \sum_{z ∈ Z} p_{XZ}(x^{(i)}, z)
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            p_{XZ} ← arg max_{\bar{p}_{XZ}}  \sum_{i=1}^{N} \sum_{z ∈ Z} p_{Z|X}(z | x^{(i)}) log \bar{p}_{XZ}(x^{(i)}, z)
      3. Return p_{XZ}
