CS 533: Natural Language Processing
Autoencoders and VAEs
Karl Stratos
Rutgers University
Aside: Protective Measures are Meaningful
Logistics
◮ Set up 1-1 meeting for proposal feedback (March 25-27)
◮ Proposal and A4 due March 24
◮ Exam: discussion
Agenda
◮ EM: loose ends (hard EM)
◮ Autoencoders and VAEs
◮ VAE training techniques
Recap: Latent-Variable Generative Models (LVGMs)
◮ Observed data comes from the population distribution pop_X
◮ LVGM: a model defining a joint distribution over X and Z
    p_{XZ}(x, z) = p_Z(z) \times p_{X|Z}(x \mid z)
◮ Learning: estimate p_{XZ} by maximizing the log-likelihood of data x^{(1)} ... x^{(N)} ∼ pop_X
    \max_{p_{XZ}} \sum_{i=1}^{N} \log \underbrace{\sum_{z \in \mathcal{Z}} p_{XZ}(x^{(i)}, z)}_{p_X(x^{(i)})}
EM: Coordinate Ascent on ELBO
Input: data x^{(1)} ... x^{(N)} ∼ pop_X, definition of p_{XZ}
Output: local optimum of
    \max_{p_{XZ}} \sum_{i=1}^{N} \log \sum_{z \in \mathcal{Z}} p_{XZ}(x^{(i)}, z)
1. Initialize p_{XZ} (e.g., random distribution).
2. Repeat until convergence:
    q_{Z|X}(z \mid x^{(i)}) \leftarrow \frac{p_{XZ}(x^{(i)}, z)}{\sum_{z' \in \mathcal{Z}} p_{XZ}(x^{(i)}, z')} \qquad \forall z \in \mathcal{Z},\; i = 1 \ldots N
    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \sum_{i=1}^{N} \sum_{z \in \mathcal{Z}} q_{Z|X}(z \mid x^{(i)}) \log \bar{p}_{XZ}(x^{(i)}, z)
3. Return p_{XZ}
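To make the loop concrete, here is a minimal numpy sketch of EM for the uniform-mixture, identity-covariance Gaussian model p_{XZ}(x, z) = (1/K) N(x; μ_z, I_d) that reappears on the K-means slide below; the function name and defaults are illustrative, not from the lecture.

    import numpy as np

    def em_gaussian_mixture(X, K, num_iters=50, seed=0):
        """Soft EM for p_XZ(x, z) = (1/K) * N(x; mu_z, I_d); the only learnable
        parameters are the means mu_1 ... mu_K."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        N, d = X.shape
        mu = X[rng.choice(N, size=K, replace=False)]    # initialize means at random data points
        for _ in range(num_iters):
            # E step: q(z | x_i) proportional to p_XZ(x_i, z); the 1/K factor and the
            # Gaussian normalizer are shared across z, so only -||x_i - mu_z||^2 / 2 matters.
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
            log_post = -0.5 * sq
            log_post -= log_post.max(axis=1, keepdims=True)        # numerical stability
            q = np.exp(log_post)
            q /= q.sum(axis=1, keepdims=True)
            # M step: maximize sum_i sum_z q(z|x_i) log p_XZ(x_i, z) over the means,
            # which gives the q-weighted average of the data for each component.
            mu = (q.T @ X) / q.sum(axis=0)[:, None]
        return mu, q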
Hard EM: Coordinate Ascent on a Different Objective
Input: data x^{(1)} ... x^{(N)} ∼ pop_X, definition of p_{XZ}
Output: local optimum of
    \max_{p_{XZ},\, (z_1 \ldots z_N) \in \mathcal{Z}^N} \sum_{i=1}^{N} \log p_{XZ}(x^{(i)}, z_i)
1. Initialize p_{XZ} (e.g., random distribution).
2. Repeat until convergence:
    (z_1 \ldots z_N) \leftarrow \arg\max_{(\bar{z}_1 \ldots \bar{z}_N) \in \mathcal{Z}^N} \sum_{i=1}^{N} \log p_{XZ}(x^{(i)}, \bar{z}_i)
    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \sum_{i=1}^{N} \log \bar{p}_{XZ}(x^{(i)}, z_i)
3. Return p_{XZ}
K-Means: Special Case of Hard EM
◮ x ∈ R^d, z ∈ {1 ... K}
    p_{XZ}(x, z) = \frac{1}{K} \times \mathcal{N}(x;\, \mu_z,\, I_d)
◮ Model parameters to learn: μ_1 ... μ_K ∈ R^d
◮ Negative log joint probability as a function of parameters
    -\log p_{XZ}(x, z) \equiv \lVert x - \mu_z \rVert^2
◮ Observed x^{(1)} ... x^{(N)} ∈ R^d, latents z_1 ... z_N ∈ {1 ... K}
    z_i \leftarrow \arg\min_{z \in \{1 \ldots K\}} \lVert x^{(i)} - \mu_z \rVert^2
    \mu_k \leftarrow \arg\min_{\mu_k \in \mathbb{R}^d} \sum_{i=1}^{N} \lVert x^{(i)} - \mu_{z_i} \rVert^2 = \frac{1}{\mathrm{count}(z = k)} \sum_{i:\, z_i = k} x^{(i)} \qquad k \in \{1 \ldots K\}
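These two updates are exactly Lloyd's K-means algorithm; a minimal numpy sketch (names and defaults are illustrative):

    import numpy as np

    def kmeans_hard_em(X, K, num_iters=50, seed=0):
        """Hard EM for p_XZ(x, z) = (1/K) * N(x; mu_z, I_d), i.e., K-means."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        N, d = X.shape
        mu = X[rng.choice(N, size=K, replace=False)]     # initialize means at random data points
        for _ in range(num_iters):
            # Hard E step: z_i <- argmin_z ||x_i - mu_z||^2
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K)
            z = sq.argmin(axis=1)
            # M step: mu_k <- mean of the points currently assigned to cluster k
            for k in range(K):
                if np.any(z == k):
                    mu[k] = X[z == k].mean(axis=0)
        return mu, z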
Setting
◮ Neural autoencoding: observed X, latent Z
◮ Running example
    ◮ X: sentence
    ◮ Z: m-dimensional real-valued vector
◮ We need to define
    ◮ q_{Z|X}: encoder that transforms a sentence into a distribution over R^m
    ◮ p_{X|Z}: decoder that transforms a vector z ∈ R^m into a distribution over sentences
    ◮ p_Z: prior that defines a distribution over R^m
◮ Distributions parameterized by neural networks
Example Encoder: LSTM + Gaussian
◮ Input. Sentence x ∈ V^T
◮ Parameters. Word embeddings E ∈ R^{|V| × d}, LSTMCell: R^d × R^d → R^d, feedforward FF_1: R^d → R^{2m}
◮ Forward.
    h_1, c_1 \leftarrow \mathrm{LSTMCell}(E_{x_1}, (0_d, 0_d))
    h_2, c_2 \leftarrow \mathrm{LSTMCell}(E_{x_2}, (h_1, c_1))
    \vdots
    h_T, c_T \leftarrow \mathrm{LSTMCell}(E_{x_T}, (h_{T-1}, c_{T-1}))
    \begin{bmatrix} \mu(x) \\ \sigma^2(x) \end{bmatrix} \leftarrow \mathrm{FF}_1(h_T)
◮ Distribution over R^m conditioned on x
    q_{Z|X}(\cdot \mid x) = \mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x)))
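A minimal PyTorch sketch of this encoder. The class name is illustrative, and predicting log σ²(x) rather than σ²(x) directly is an assumption made here so the variance stays positive; otherwise it follows the slide (word embeddings, an LSTMCell pass, and FF₁ producing 2m outputs).

    import torch
    import torch.nn as nn

    class LSTMGaussianEncoder(nn.Module):
        def __init__(self, vocab_size, d, m):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)   # E in R^{|V| x d}
            self.cell = nn.LSTMCell(d, d)              # LSTMCell: R^d x R^d -> R^d
            self.ff1 = nn.Linear(d, 2 * m)             # FF_1: R^d -> R^{2m}

        def forward(self, x):                          # x: LongTensor of token ids, shape (T,)
            h = torch.zeros(1, self.cell.hidden_size)  # (0_d, 0_d) initial state
            c = torch.zeros(1, self.cell.hidden_size)
            for t in range(x.size(0)):
                h, c = self.cell(self.embed(x[t]).unsqueeze(0), (h, c))
            mu, log_var = self.ff1(h).chunk(2, dim=-1) # mu(x) and log sigma^2(x), each in R^m
            return mu.squeeze(0), log_var.squeeze(0)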
Example Decoder: Conditional Language Model
◮ Input. Vector z ∈ R^m
◮ Parameters. Word embeddings E ∈ R^{|V| × d} (often tied with encoder), LSTMCell: R^d × R^d → R^d, feedforward FF_2: R^m → R^d × R^d
◮ Forward. Given a sentence y ∈ V^L, compute its probability conditioned on z by
    h_1, c_1 \leftarrow \mathrm{LSTMCell}(E_{y_1}, \mathrm{FF}_2(z))
    h_2, c_2 \leftarrow \mathrm{LSTMCell}(E_{y_2}, (h_1, c_1))
    \vdots
    h_L, c_L \leftarrow \mathrm{LSTMCell}(E_{y_L}, (h_{L-1}, c_{L-1}))
    p_{X|Z}(y \mid z) = \prod_{l=1}^{L} \underbrace{\mathrm{softmax}_{y_l}(E h_{l-1})}_{p(y_l \mid z,\, y_{<l})}
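A matching PyTorch sketch of the decoder. The class and method names are illustrative; FF₂ is implemented as a single linear map to the initial state (h₀, c₀), and the output layer reuses the embedding matrix E, as the slide's softmax_{y_l}(E h_{l−1}) suggests.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LSTMDecoder(nn.Module):
        def __init__(self, vocab_size, d, m):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)    # E, often tied with the encoder's embeddings
            self.cell = nn.LSTMCell(d, d)
            self.ff2 = nn.Linear(m, 2 * d)              # FF_2: R^m -> R^d x R^d, giving (h_0, c_0)

        def nll(self, y, z):                            # y: token ids (L,), z: latent vector (m,)
            h, c = self.ff2(z).unsqueeze(0).chunk(2, dim=-1)   # initial LSTM state from z
            nll = 0.0
            for l in range(y.size(0)):
                logits = h @ self.embed.weight.t()      # scores for softmax_{y_l}(E h_{l-1})
                nll = nll - F.log_softmax(logits, dim=-1)[0, y[l]]
                h, c = self.cell(self.embed(y[l]).unsqueeze(0), (h, c))
            return nll                                  # -log p_{X|Z}(y | z)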
Example Prior: Isotropic Gaussian
◮ Simplest: fixed standard normal
    p_Z = \mathcal{N}(0_m, I_m)
◮ Parameters. None
◮ Can also make it more expressive, for instance a mixture of K diagonal Gaussians
    p_Z = \sum_{k=1}^{K} \mathrm{softmax}_k(\gamma) \times \mathcal{N}(\mu_k, \mathrm{diag}(\sigma_k^2))
◮ Parameters. γ ∈ R^K and μ_k, σ_k^2 ∈ R^m for k = 1 ... K
◮ Multimodal instead of unimodal
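A small PyTorch sketch of evaluating log p_Z(z) under the mixture prior (function and argument names are illustrative):

    import torch

    def mixture_prior_log_prob(z, gamma, mu, log_var):
        # gamma: (K,) unnormalized mixture logits; mu, log_var: (K, m) per-component parameters
        log_w = torch.log_softmax(gamma, dim=0)                        # log softmax_k(gamma)
        comps = torch.distributions.Normal(mu, (0.5 * log_var).exp())  # K diagonal Gaussians
        log_comp = comps.log_prob(z.unsqueeze(0)).sum(dim=-1)          # (K,): log N(z; mu_k, diag(sigma_k^2))
        return torch.logsumexp(log_w + log_comp, dim=0)                # log p_Z(z)

With a learnable prior, gamma, mu, and log_var would be nn.Parameter tensors trained jointly with the encoder and decoder.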
Summary
◮ Sentence X, m-dimensional vector Z
◮ Learnable parameters
    ◮ Word embeddings E shared by encoder and decoder
    ◮ LSTM and feedforward parameters in q_{Z|X}
    ◮ LSTM and feedforward parameters in p_{X|Z}
    ◮ (Optional) Parameters in the prior p_Z
◮ We will now consider learning all these parameters together in the autoencoding framework
Autoencoders (AEs)
[Diagram: x ∼ pop_X is fed to the encoder q_{Z|X}, which produces z; the decoder p_{X|Z} reconstructs x from z; the prior p_Z is a distribution over z.]
    q_{Z|X}: encoder    p_{X|Z}: decoder    p_Z: prior
Objective.
    \max_{p_Z,\, p_{X|Z},\, q_{Z|X}} \underbrace{\mathbb{E}_{x \sim \mathrm{pop}_X,\; z \sim q_{Z|X}(\cdot \mid x)}\left[\log p_{X|Z}(x \mid z)\right]}_{\text{reconstruction}} + \underbrace{R(\mathrm{pop}_X, p_Z, p_{X|Z}, q_{Z|X})}_{\text{regularization}}
Naive Autoencoders
Objective
    \max_{p_{X|Z},\, \mathrm{LSTM}} \mathbb{E}_{x \sim \mathrm{pop}_X}\left[\log p_{X|Z}(x \mid \mathrm{LSTM}(x))\right]
◮ Deterministic encoding: equivalent to learning a point-mass encoder q_{Z|X}(\mathrm{LSTM}(x) \mid x) = 1
◮ No regularization (hence no role for prior)
Denoising Autoencoders
Objective
    \max_{p_{X|Z},\, \mathrm{LSTM}} \mathbb{E}_{x \sim \mathrm{pop}_X,\; \epsilon \sim p_E}\left[\log p_{X|Z}(x \mid \mathrm{LSTM}(x + \epsilon))\right]
◮ Noise introduced at the input; reconstruct the original input
◮ Equivalent to learning an encoder q_{Z|X}(\mathrm{LSTM}(x + \epsilon) \mid x) = p_E(\epsilon)
◮ Still no regularization, so no prior
◮ Example: masked language modeling (sketch of the masking noise below)
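For text, the corruption x + ε is usually token masking rather than additive noise; a minimal PyTorch sketch of such a noise distribution p_E (the mask id and rate are illustrative):

    import torch

    def mask_noise(x, mask_id, rate=0.15):
        # x: LongTensor of token ids; replace each token by [MASK] independently with probability `rate`
        noisy = x.clone()
        noisy[torch.rand(x.shape) < rate] = mask_id
        return noisy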
BERT as Denoising AE (Devlin et al., 2019)
[Figure: a Transformer (Vaswani et al., 2017) reads a masked input of the form "[CLS] the [MASK] ... [SEP] the cat [MASK] ... [SEP]" and predicts the masked words (e.g., "dog", "ran", "barked", "away") together with the IsNext label.]
Variational Autoencoders (VAEs)
Objective
    \max_{p_Z,\, p_{X|Z},\, q_{Z|X}} \mathbb{E}_{x \sim \mathrm{pop}_X,\; z \sim q_{Z|X}(\cdot \mid x)}\left[\log p_{X|Z}(x \mid z) - D_{\mathrm{KL}}(q_{Z|X} \,\|\, p_Z)\right]
◮ Great deal of flexibility in terms of how to optimize it
◮ Popular approach for the current setting
    ◮ Optimize the reconstruction term by sampling + the reparameterization trick
        z \sim q_{Z|X}(\cdot \mid x) \quad \Leftrightarrow \quad \epsilon \sim \mathcal{N}(0_m, I_m),\; z = \mu(x) + \sigma(x) \odot \epsilon
    ◮ Optimize the KL term in closed form
        D_{\mathrm{KL}}(\mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x))) \,\|\, \mathcal{N}(0_m, I_m)) = \frac{1}{2} \sum_{i=1}^{m} \left[\sigma_i^2(x) + \mu_i^2(x) - 1 - \log \sigma_i^2(x)\right]
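The two ingredients above in PyTorch, assuming the encoder outputs μ(x) and log σ²(x) as in the earlier sketch:

    import torch

    def reparameterize(mu, log_var):
        # z = mu(x) + sigma(x) * eps with eps ~ N(0_m, I_m); gradients flow through mu and sigma
        eps = torch.randn_like(mu)
        return mu + (0.5 * log_var).exp() * eps

    def kl_to_standard_normal(mu, log_var):
        # D_KL( N(mu, diag(sigma^2)) || N(0_m, I_m) ) = 1/2 sum_i (sigma_i^2 + mu_i^2 - 1 - log sigma_i^2)
        return 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=-1)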
VAE Loss: Concrete Steps
Given a sentence x ∼ pop_X (in general a minibatch):
1. Encoding. Run the encoder to calculate the Gaussian parameters μ(x), σ²(x) ∈ R^m
    \mu(x), \sigma^2(x) \leftarrow \mathrm{Encoder}(x)
2. KL. Calculate the KL term
    \kappa \leftarrow \frac{1}{2} \sum_{i=1}^{m} \left[\sigma_i^2(x) + \mu_i^2(x) - 1 - \log \sigma_i^2(x)\right]
3. Reconstruction. Estimate the reconstruction term by sampling + the reparameterization trick
    \rho \leftarrow \mathrm{DecoderNLL}(x,\; \mu(x) + \sigma(x) \odot \epsilon) \qquad \epsilon \sim \mathcal{N}(0_m, I_m)
4. Loss. Take a gradient step (wrt. all parameters) on ρ + βκ, where β is some weight.
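Putting the four steps together as a single PyTorch training step (a sketch, not the lecture's code): `encoder` and `decoder` are assumed to behave like the earlier LSTM sketches (DecoderNLL corresponds to `decoder.nll`), and `optimizer` is any torch.optim optimizer.

    import torch

    def vae_step(x, encoder, decoder, optimizer, beta=1.0):
        mu, log_var = encoder(x)                                     # 1. Encoding: mu(x), log sigma^2(x)
        kl = 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum()   # 2. KL term kappa
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)        # 3. Reparameterized sample of z
        rec_nll = decoder.nll(x, z)                                  #    rho = -log p_{X|Z}(x | z)
        loss = rec_nll + beta * kl                                   # 4. Loss rho + beta * kappa
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), rec_nll.item(), kl.item()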
Uses of VAEs
◮ Representation learning. Run the encoder on a sentence x to obtain its m-dimensional "meaning" vector
◮ Controlled generation. Run the decoder on some seed vector to conditionally generate sentences
◮ Can "interpolate" between two sentences x_1, x_2 (see the sketch below)
    z_1 \sim q_{Z|X}(\cdot \mid x_1) \qquad z_2 \sim q_{Z|X}(\cdot \mid x_2)
    x_\alpha \leftarrow \mathrm{Decode}(\alpha z_1 + (1 - \alpha) z_2) \qquad \alpha \in [0, 1]
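A sketch of the interpolation procedure; `encoder` returns (μ(x), log σ²(x)) as above, and `decode(z)` is an assumed helper that generates a sentence from a latent vector (e.g., greedy decoding with the LSTM decoder).

    import torch

    @torch.no_grad()
    def interpolate(x1, x2, encoder, decode, num_steps=5):
        def sample(x):
            mu, log_var = encoder(x)
            return mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # z ~ q_{Z|X}(.|x)
        z1, z2 = sample(x1), sample(x2)
        outputs = []
        for alpha in torch.linspace(1.0, 0.0, num_steps):              # alpha = 1 starts near x1
            outputs.append(decode(alpha * z1 + (1 - alpha) * z2))      # x_alpha
        return outputs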
Interpolation Examples
(examples from "A Surprisingly Effective Fix for Deep Latent Variable Modeling of Text", Li et al., 2019)
VAEs in Computer Vision
Random (never-before-seen) faces sampled from a VAE decoder!
(from "Generating Diverse High-Fidelity Images with VQ-VAE-2", Razavi et al., 2019)
VAE is EM
VAE Objective
    \mathbb{E}_{x \sim \mathrm{pop}_X,\; z \sim q_{Z|X}(\cdot \mid x)}\left[\log p_{X|Z}(x \mid z) - D_{\mathrm{KL}}(q_{Z|X} \,\|\, p_Z)\right] = \mathrm{ELBO}(p_{XZ}, q_{Z|X})
◮ Thus when you optimize the VAE you are maximizing a lower bound on the marginal log-likelihood defined by your LVGM
◮ Taking gradient steps for decoder/encoder/prior simultaneously is alternating optimization of the ELBO
◮ Difference with classical EM: we no longer insist on solving the E step exactly (i.e., setting q_{Z|X} = p_{Z|X})
    ◮ Train a separate variational model q_{Z|X} alongside p_{XZ}
Practical Issues
◮ Posterior collapse
◮ Quantities to monitor