

  1. CS 533: Natural Language Processing
     Latent-Variable Generative Models and the Expectation Maximization (EM) Algorithm
     Karl Stratos, Rutgers University

  2. Motivation: Bag-Of-Words (BOW) Document Model
     ◮ Fixed-length documents x ∈ V^T
     ◮ BOW parameters: a word distribution p_W over V defining
           p_X(x) = \prod_{t=1}^{T} p_W(x_t)
     ◮ Model's generative story: every word in every document is generated independently.
     ◮ What if the true generative story underlying the data is different?
           x^{(1)} = (a, a, a, a, a, a, a, a, a, a)        V = {a, b}
           x^{(2)} = (b, b, b, b, b, b, b, b, b, b)        T = 10
     ◮ MLE: p_X(x^{(1)}) = p_X(x^{(2)}) = (1/2)^{10}
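A quick sketch (not from the slides) of the BOW likelihood on this toy corpus; the function name `bow_prob` and the data layout are illustrative assumptions.

```python
from math import prod

def bow_prob(x, p_W):
    """BOW likelihood: every word of the document is drawn independently from p_W."""
    return prod(p_W[w] for w in x)

# Toy corpus from the slide: V = {a, b}, T = 10.
x1 = ["a"] * 10
x2 = ["b"] * 10

# The MLE word distribution on this corpus is uniform over {a, b}.
p_W = {"a": 0.5, "b": 0.5}

print(bow_prob(x1, p_W))  # (1/2)^10 ≈ 0.000977
print(bow_prob(x2, p_W))  # (1/2)^10 ≈ 0.000977
```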

  3. Latent-Variable BOW (LV-BOW) Document Model
     ◮ LV-BOW parameters:
       ◮ p_Z: "topic" distribution over {1 ... K}
       ◮ p_{W|Z}: conditional word distribution over V defining
             p_{X|Z}(x | z) = \prod_{t=1}^{T} p_{W|Z}(x_t | z)    for all z ∈ {1 ... K}
             p_X(x) = \sum_{z=1}^{K} p_Z(z) × p_{X|Z}(x | z)
     ◮ Model's generative story: for each document, a topic is generated, and conditioned on that topic the words are generated independently.

  4. Back to the Example
           x^{(1)} = (a, a, a, a, a, a, a, a, a, a)        V = {a, b}
           x^{(2)} = (b, b, b, b, b, b, b, b, b, b)        T = 10
     ◮ K = 2 with p_Z(1) = p_Z(2) = 1/2
     ◮ p_{W|Z}(a | 1) = p_{W|Z}(b | 2) = 1
     ◮ p_X(x^{(1)}) = p_X(x^{(2)}) = 1/2 ≫ (1/2)^{10}
     Key idea: introduce a latent variable Z to model the true generative process more faithfully.
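A small sketch, mine rather than the slides', of the LV-BOW marginal likelihood from slide 3, checked on this toy example; the function and variable names are illustrative.

```python
from math import prod

def lv_bow_prob(x, p_Z, p_W_given_Z):
    """Marginal likelihood p_X(x) = sum_z p_Z(z) * prod_t p_{W|Z}(x_t | z)."""
    return sum(p_Z[z] * prod(p_W_given_Z[z][w] for w in x) for z in p_Z)

# Parameters from the slide: K = 2, uniform topics, deterministic word distributions.
p_Z = {1: 0.5, 2: 0.5}
p_W_given_Z = {1: {"a": 1.0, "b": 0.0}, 2: {"a": 0.0, "b": 1.0}}

print(lv_bow_prob(["a"] * 10, p_Z, p_W_given_Z))  # 0.5, much larger than (1/2)^10
print(lv_bow_prob(["b"] * 10, p_Z, p_W_given_Z))  # 0.5
```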

  5. The Latent-Variable Generative Model Paradigm
     Model. Joint distribution over X and Z:
           p_{XZ}(x, z) = p_Z(z) × p_{X|Z}(x | z)
     Learning. We don't observe Z!
           max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
     where the inner sum \sum_{z ∈ Z} p_{XZ}(x, z) is the marginal p_X(x).

  6. The Learning Problem
     ◮ How can we solve
           max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
     ◮ Specifically for LV-BOW: given N documents x^{(1)} ... x^{(N)} ∈ V^T, how can we learn a topic distribution p_Z and a conditional word distribution p_{W|Z} that maximize
           \sum_{i=1}^{N} log \sum_{z ∈ Z} p_Z(z) × \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t | z)
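A small sketch (mine, not the slides') of this objective for LV-BOW; it is also the quantity EM can only increase (slide 22), so it makes a convenient convergence check. The function name is an illustrative assumption.

```python
from math import log, prod

def lv_bow_log_likelihood(docs, p_Z, p_W_given_Z):
    """Objective: sum_i log sum_z p_Z(z) * prod_t p_{W|Z}(x_t | z).

    Assumes all relevant probabilities are nonzero; a real implementation would
    smooth zeros and work in log space to avoid underflow on long documents.
    """
    total = 0.0
    for x in docs:
        total += log(sum(p_Z[z] * prod(p_W_given_Z[z][w] for w in x) for z in p_Z))
    return total
```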

  7. A Proposed Algorithm
     1. Initialize p_Z and p_{W|Z} as random distributions.
     2. Repeat until convergence:
        2.1 For i = 1 ... N, compute the conditional posterior distribution
              p_{Z|X}(z | x^{(i)}) = \frac{ p_Z(z) × \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t | z) }{ \sum_{z'=1}^{K} p_Z(z') × \prod_{t=1}^{T} p_{W|Z}(x^{(i)}_t | z') }
        2.2 Update the model parameters by
              p_Z(z) = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) }{ \sum_{z'=1}^{K} \sum_{i=1}^{N} p_{Z|X}(z' | x^{(i)}) }
              p_{W|Z}(w | z) = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) × count(w | x^{(i)}) }{ \sum_{w' ∈ V} \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) × count(w' | x^{(i)}) }
        where count(w | x^{(i)}) is the number of times w ∈ V appears in x^{(i)}.
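The code on slide 8 is not reproduced in this transcript; below is a minimal sketch of my own. It assumes documents are short enough that direct products do not underflow (a real implementation would work in log space), and names such as `em_lv_bow` are illustrative.

```python
import random
from collections import Counter
from math import prod

def em_lv_bow(docs, vocab, K, iters=50, seed=0):
    """EM for LV-BOW: alternate posterior computation (step 2.1) and count-based updates (step 2.2)."""
    rng = random.Random(seed)

    def normalize(weights):
        total = sum(weights.values())
        return {k: v / total for k, v in weights.items()}

    # 1. Initialize p_Z and p_{W|Z} as random distributions.
    p_Z = normalize({z: rng.random() for z in range(K)})
    p_W = {z: normalize({w: rng.random() for w in vocab}) for z in range(K)}

    counts = [Counter(x) for x in docs]  # count(w | x^(i)) for each document

    for _ in range(iters):
        # 2.1 Conditional posterior p_{Z|X}(z | x^(i)) for every document.
        posteriors = []
        for c in counts:
            joint = {z: p_Z[z] * prod(p_W[z][w] ** n for w, n in c.items()) for z in range(K)}
            posteriors.append(normalize(joint))

        # 2.2 Re-estimate the parameters from posterior-weighted counts.
        p_Z = normalize({z: sum(q[z] for q in posteriors) for z in range(K)})
        p_W = {z: normalize({w: sum(q[z] * c[w] for q, c in zip(posteriors, counts)) for w in vocab})
               for z in range(K)}

    return p_Z, p_W
```

On the toy corpus, `em_lv_bow([["a"] * 10, ["b"] * 10], {"a", "b"}, K=2)` typically converges to the two-topic solution of slide 4, although, as the following slides illustrate, a bad initialization can get stuck in a local optimum.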

  8. Code

  9. Code in Action

  10. Code in Action: Bad Initialization

  11. Another Example (figure: parameter values at initialization and after convergence)

  12. Again Possible to Get Stuck in a Local Optimum (figure: parameter values at initialization and after convergence)

  13. Why Does It Work?
      ◮ A special case of the expectation maximization (EM) algorithm, adapted for LV-BOW
      ◮ EM is an extremely important and general concept
      ◮ Another special case: variational autoencoders (VAEs, next class)

  14. Setting
      ◮ Original problem: difficult to optimize (nonconvex)
            max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      ◮ Alternative problem: easy to optimize (often concave)
            max_{p_{XZ}}  E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ]
        where q_{Z|X} is some arbitrary posterior distribution that is easy to compute

  15. Solving the Alternative Problem
      ◮ Many of the models we have considered (LV-BOW, HMM, PCFG) can be written as
            p_{XZ}(x, z) = \prod_{(τ, a) ∈ E} p_τ(a)^{count_τ(a | x, z)}
      ◮ E is a set of possible event type-value pairs.
      ◮ count_τ(a | x, z) is the number of times the event τ = a happens in (x, z).
      ◮ The model has a distribution p_τ over the possible values of each type τ.
      ◮ Examples:
            p_{XZ}((a, a, a, b, b), 2) = p_Z(2) × p_{W|Z}(a | 2)^3 × p_{W|Z}(b | 2)^2    (LV-BOW)
            p_{XZ}((La, La, La), (N, N, N)) = o(La | N)^3 × t(N | ∗) × t(N | N)^2 × t(STOP | N)    (HMM)
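A small sketch (mine, not the slides') of this event-based factorization for the LV-BOW case; the event-type labels and function names are made up for illustration.

```python
from collections import Counter

def lv_bow_events(x, z):
    """Event counts for LV-BOW: one 'topic' event plus one 'word given topic' event per token."""
    counts = Counter()
    counts[("topic", z)] += 1
    for w in x:
        counts[("word|topic", (w, z))] += 1
    return counts

def joint_prob_from_events(event_counts, params):
    """p_XZ(x, z) = product over event (type, value) pairs of p_tau(value) ** count."""
    prob = 1.0
    for (tau, a), n in event_counts.items():
        prob *= params[(tau, a)] ** n
    return prob

# Slide example: p_XZ((a, a, a, b, b), 2) = p_Z(2) * p_{W|Z}(a|2)^3 * p_{W|Z}(b|2)^2
params = {("topic", 2): 0.5, ("word|topic", ("a", 2)): 0.7, ("word|topic", ("b", 2)): 0.3}
print(joint_prob_from_events(lv_bow_events(["a", "a", "a", "b", "b"], 2), params))
# 0.5 * 0.7**3 * 0.3**2 ≈ 0.0154
```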

  16. Closed-Form Solution
      If x^{(1)} ... x^{(N)} ∼ pop_X are iid samples,
            max_{p_{XZ}}  E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ]
          ≈ max_{p_{XZ}}  \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) log p_{XZ}(x^{(i)}, z)
          = max_{p_τ}  \sum_{i=1}^{N} \sum_{z} \sum_{(τ,a) ∈ E} q_{Z|X}(z | x^{(i)}) count_τ(a | x^{(i)}, z) log p_τ(a)
          = max_{p_τ}  \sum_{(τ,a) ∈ E} [ \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) count_τ(a | x^{(i)}, z) ] log p_τ(a)
      MLE solution!
            p_τ(a) = \frac{ \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) count_τ(a | x^{(i)}, z) }{ \sum_{a'} \sum_{i=1}^{N} \sum_{z} q_{Z|X}(z | x^{(i)}) count_τ(a' | x^{(i)}, z) }
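A generic sketch of this weighted-count solution (mine, not the slides'), parameterized by an event counter such as the hypothetical `lv_bow_events` above; all names are illustrative.

```python
from collections import defaultdict

def weighted_count_mle(docs, posteriors, latent_values, event_counter):
    """Closed-form update: p_tau(a) is proportional to the posterior-weighted count of (tau, a).

    `event_counter(x, z)` must return a dict mapping (event type, value) pairs to counts,
    e.g. the lv_bow_events sketch above; `posteriors[i][z]` plays the role of q_{Z|X}(z | x^(i)).
    """
    weighted = defaultdict(float)
    for x, q in zip(docs, posteriors):
        for z in latent_values:
            for (tau, a), n in event_counter(x, z).items():
                weighted[(tau, a)] += q[z] * n

    # Normalize separately within each event type tau.
    totals = defaultdict(float)
    for (tau, a), w in weighted.items():
        totals[tau] += w
    return {(tau, a): w / totals[tau] for (tau, a), w in weighted.items()}
```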

  17. This is How We Derived the LV-BOW EM Updates
      Using q_{Z|X} = p_{Z|X}:
            p_Z(z) = \frac{ \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z | x^{(i)}, z') }{ \sum_{z''} \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z'' | x^{(i)}, z') }
                   = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) }{ \sum_{z''} \sum_{i=1}^{N} p_{Z|X}(z'' | x^{(i)}) }
            p_{W|Z}(w | z) = \frac{ \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z, w | x^{(i)}, z') }{ \sum_{w' ∈ V} \sum_{i=1}^{N} \sum_{z'} p_{Z|X}(z' | x^{(i)}) count_τ(z' = z, w' | x^{(i)}, z') }
                           = \frac{ \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) count(w | x^{(i)}) }{ \sum_{w' ∈ V} \sum_{i=1}^{N} p_{Z|X}(z | x^{(i)}) count(w' | x^{(i)}) }

  18. Game Plan
      ◮ So we have established that it is often easy to solve the alternative problem
            max_{p_{XZ}}  E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ]
        where q_{Z|X} is any posterior distribution that is easy to compute.
      ◮ On the next slide, we relate the original log-likelihood objective to this quantity.

  19. ELBO: Evidence Lower Bound
      For any q_{Z|X} we define
            ELBO(p_{XZ}, q_{Z|X}) = E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ] + H(q_{Z|X})
      where H(q_{Z|X}) = E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ − log q_{Z|X}(z | x) ].
      Claim. For all q_{Z|X},
            ELBO(p_{XZ}, q_{Z|X}) ≤ E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      with equality iff q_{Z|X} = p_{Z|X}. (Proof on board)
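The proof was given on the board; the following is my reconstruction of the standard argument, using only the ELBO definition above and the factorization p_{XZ}(x, z) = p_X(x) p_{Z|X}(z | x).

```latex
\begin{align*}
\mathrm{ELBO}(p_{XZ}, q_{Z|X})
  &= \mathop{\mathbb{E}}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot \mid x)}}
     \left[ \log \frac{p_{XZ}(x, z)}{q_{Z|X}(z \mid x)} \right] \\
  &= \mathop{\mathbb{E}}_{\substack{x \sim \mathrm{pop}_X \\ z \sim q_{Z|X}(\cdot \mid x)}}
     \left[ \log \frac{p_X(x)\, p_{Z|X}(z \mid x)}{q_{Z|X}(z \mid x)} \right] \\
  &= \mathop{\mathbb{E}}_{x \sim \mathrm{pop}_X}\!\left[ \log p_X(x) \right]
     \;-\; \mathop{\mathbb{E}}_{x \sim \mathrm{pop}_X}\!\left[
       D_{\mathrm{KL}}\!\left( q_{Z|X}(\cdot \mid x) \,\middle\|\, p_{Z|X}(\cdot \mid x) \right) \right].
\end{align*}
% The KL term is nonnegative and equals zero iff q_{Z|X} = p_{Z|X},
% so ELBO(p_{XZ}, q_{Z|X}) <= E[log p_X(x)] with equality iff q_{Z|X} = p_{Z|X}.
```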

  20. EM: Coordinate Ascent on ELBO
      Input: sampling access to pop_X, definition of p_{XZ}
      Output: local optimum of
            max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            q_{Z|X} ← arg max_{\bar{q}_{Z|X}} ELBO(p_{XZ}, \bar{q}_{Z|X})    (E-step)
            p_{XZ} ← arg max_{\bar{p}_{XZ}} ELBO(\bar{p}_{XZ}, q_{Z|X})    (M-step)
      3. Return p_{XZ}

  21. Equivalently
      Input: sampling access to pop_X, definition of p_{XZ}
      Output: local optimum of
            max_{p_{XZ}}  E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            p_{XZ} ← arg max_{\bar{p}_{XZ}}  E_{x ∼ pop_X, z ∼ p_{Z|X}(·|x)} [ log \bar{p}_{XZ}(x, z) ]
      3. Return p_{XZ}

  22. EM Can Only Increase the Objective (Or Leave It Unchanged)
      (figure: bars for one EM iteration: ELBO(p_{XZ}, q_{Z|X}) ≤ LL(p_{XZ}); the E-step closes the gap, LL(p_{XZ}) = ELBO(p_{XZ}, p_{Z|X}); the M-step then gives LL(p'_{XZ}) ≥ ELBO(p'_{XZ}, q'_{Z|X}) ≥ ELBO(p_{XZ}, p_{Z|X}) = LL(p_{XZ}))
      where
            LL(p_{XZ}) = E_{x ∼ pop_X} [ log \sum_{z ∈ Z} p_{XZ}(x, z) ]
            ELBO(p_{XZ}, q_{Z|X}) = E_{x ∼ pop_X, z ∼ q_{Z|X}(·|x)} [ log p_{XZ}(x, z) ] + H(q_{Z|X})
                                  = LL(p_{XZ}) − D_KL(q_{Z|X} || p_{Z|X})

  23. EM Can Only Increase the Objective (Or Leave It Unchanged)
      (figure from https://media.nature.com/full/nature-assets/nbt/journal/v26/n8/extref/nbt1406-S1.pdf)

  24. Sample Version
      Input: N iid samples from pop_X, definition of p_{XZ}
      Output: local optimum of
            max_{p_{XZ}}  (1/N) \sum_{i=1}^{N} log \sum_{z ∈ Z} p_{XZ}(x^{(i)}, z)
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            p_{XZ} ← arg max_{\bar{p}_{XZ}}  \sum_{i=1}^{N} \sum_{z ∈ Z} p_{Z|X}(z | x^{(i)}) log \bar{p}_{XZ}(x^{(i)}, z)
      3. Return p_{XZ}
