CS 533: Natural Language Processing
Autoencoders and VAEs
Karl Stratos
Rutgers University
Aside: Protective Measures are Meaningful
Logistics
◮ Set up 1-1 meeting for proposal feedback (March 25-27)
◮ Proposal and A4 due March 24
◮ Exam: discussion
Agenda
◮ EM: loose ends (hard EM)
◮ Autoencoders and VAEs
◮ VAE training techniques
Recap: Latent-Variable Generative Models (LVGMs)
◮ Observed data comes from the population distribution pop_X
◮ LVGM: a model defining a joint distribution over X and Z
    p_{XZ}(x, z) = p_Z(z) \times p_{X|Z}(x \mid z)
◮ Learning: estimate p_{XZ} by maximizing the log-likelihood of data x^{(1)} ... x^{(N)} ∼ pop_X
    \max_{p_{XZ}} \sum_{i=1}^{N} \log \underbrace{\sum_{z \in \mathcal{Z}} p_{XZ}(x^{(i)}, z)}_{p_X(x^{(i)})}
EM: Coordinate Ascent on ELBO
Input: data x^{(1)} ... x^{(N)} ∼ pop_X, definition of p_{XZ}
Output: local optimum of
    \max_{p_{XZ}} \sum_{i=1}^{N} \log \sum_{z \in \mathcal{Z}} p_{XZ}(x^{(i)}, z)
1. Initialize p_{XZ} (e.g., random distribution).
2. Repeat until convergence:
    q_{Z|X}(z \mid x^{(i)}) \leftarrow \frac{p_{XZ}(x^{(i)}, z)}{\sum_{z' \in \mathcal{Z}} p_{XZ}(x^{(i)}, z')} \qquad \forall z \in \mathcal{Z},\; i = 1 \ldots N
    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \sum_{i=1}^{N} \sum_{z \in \mathcal{Z}} q_{Z|X}(z \mid x^{(i)}) \log \bar{p}_{XZ}(x^{(i)}, z)
3. Return p_{XZ}
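To make the loop concrete, here is a minimal numpy sketch of EM for the uniform-mixture, identity-covariance Gaussian model p_{XZ}(x, z) = (1/K) N(x; μ_z, I_d) that reappears on the K-means slide below; the function name and defaults are illustrative, not from the lecture.

    import numpy as np

    def em_gaussian_mixture(X, K, num_iters=50, seed=0):
        """Soft EM for p_XZ(x, z) = (1/K) * N(x; mu_z, I_d); the only learnable
        parameters are the means mu_1 ... mu_K."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        N, d = X.shape
        mu = X[rng.choice(N, size=K, replace=False)]    # initialize means at random data points
        for _ in range(num_iters):
            # E step: q(z | x_i) proportional to p_XZ(x_i, z); the 1/K factor and the
            # Gaussian normalizer are shared across z, so only -||x_i - mu_z||^2 / 2 matters.
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
            log_post = -0.5 * sq
            log_post -= log_post.max(axis=1, keepdims=True)        # numerical stability
            q = np.exp(log_post)
            q /= q.sum(axis=1, keepdims=True)
            # M step: maximize sum_i sum_z q(z|x_i) log p_XZ(x_i, z) over the means,
            # which gives the q-weighted average of the data for each component.
            mu = (q.T @ X) / q.sum(axis=0)[:, None]
        return mu, q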
Hard EM: Coordinate Ascent on a Different Objective
Input: data x^{(1)} ... x^{(N)} ∼ pop_X, definition of p_{XZ}
Output: local optimum of
    \max_{p_{XZ},\, (z_1 \ldots z_N) \in \mathcal{Z}^N} \sum_{i=1}^{N} \log p_{XZ}(x^{(i)}, z_i)
1. Initialize p_{XZ} (e.g., random distribution).
2. Repeat until convergence:
    (z_1 \ldots z_N) \leftarrow \arg\max_{(\bar{z}_1 \ldots \bar{z}_N) \in \mathcal{Z}^N} \sum_{i=1}^{N} \log p_{XZ}(x^{(i)}, \bar{z}_i)
    p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \sum_{i=1}^{N} \log \bar{p}_{XZ}(x^{(i)}, z_i)
3. Return p_{XZ}
K-Means: Special Case of Hard EM
◮ x ∈ R^d, z ∈ {1 ... K}
    p_{XZ}(x, z) = \frac{1}{K} \times \mathcal{N}(x;\, \mu_z,\, I_d)
◮ Model parameters to learn: μ_1 ... μ_K ∈ R^d
◮ Negative log joint probability as a function of parameters
    -\log p_{XZ}(x, z) \equiv \lVert x - \mu_z \rVert^2
◮ Observed x^{(1)} ... x^{(N)} ∈ R^d, latents z_1 ... z_N ∈ {1 ... K}
    z_i \leftarrow \arg\min_{z \in \{1 \ldots K\}} \lVert x^{(i)} - \mu_z \rVert^2
    \mu_k \leftarrow \arg\min_{\mu_k \in \mathbb{R}^d} \sum_{i=1}^{N} \lVert x^{(i)} - \mu_{z_i} \rVert^2 = \frac{1}{\mathrm{count}(z = k)} \sum_{i:\, z_i = k} x^{(i)} \qquad k \in \{1 \ldots K\}
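These two updates are exactly Lloyd's K-means algorithm; a minimal numpy sketch (names and defaults are illustrative):

    import numpy as np

    def kmeans_hard_em(X, K, num_iters=50, seed=0):
        """Hard EM for p_XZ(x, z) = (1/K) * N(x; mu_z, I_d), i.e., K-means."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        N, d = X.shape
        mu = X[rng.choice(N, size=K, replace=False)]     # initialize means at random data points
        for _ in range(num_iters):
            # Hard E step: z_i <- argmin_z ||x_i - mu_z||^2
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K)
            z = sq.argmin(axis=1)
            # M step: mu_k <- mean of the points currently assigned to cluster k
            for k in range(K):
                if np.any(z == k):
                    mu[k] = X[z == k].mean(axis=0)
        return mu, z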
Setting
◮ Neural autoencoding: observed X, latent Z
◮ Running example
    ◮ X: sentence
    ◮ Z: m-dimensional real-valued vector
◮ We need to define
    ◮ q_{Z|X}: encoder that transforms a sentence into a distribution over R^m
    ◮ p_{X|Z}: decoder that transforms a vector z ∈ R^m into a distribution over sentences
    ◮ p_Z: prior that defines a distribution over R^m
◮ Distributions parameterized by neural networks
Example Encoder: LSTM + Gaussian
◮ Input. Sentence x ∈ V^T
◮ Parameters. Word embeddings E ∈ R^{|V| × d}, LSTMCell: R^d × R^d → R^d, feedforward FF_1: R^d → R^{2m}
◮ Forward.
    h_1, c_1 \leftarrow \mathrm{LSTMCell}(E_{x_1}, (0_d, 0_d))
    h_2, c_2 \leftarrow \mathrm{LSTMCell}(E_{x_2}, (h_1, c_1))
    \vdots
    h_T, c_T \leftarrow \mathrm{LSTMCell}(E_{x_T}, (h_{T-1}, c_{T-1}))
    \begin{bmatrix} \mu(x) \\ \sigma^2(x) \end{bmatrix} \leftarrow \mathrm{FF}_1(h_T)
◮ Distribution over R^m conditioned on x
    q_{Z|X}(\cdot \mid x) = \mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x)))
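A minimal PyTorch sketch of this encoder. The class name is illustrative, and predicting log σ²(x) rather than σ²(x) directly is an assumption made here so the variance stays positive; otherwise it follows the slide (word embeddings, an LSTMCell pass, and FF₁ producing 2m outputs).

    import torch
    import torch.nn as nn

    class LSTMGaussianEncoder(nn.Module):
        def __init__(self, vocab_size, d, m):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)   # E in R^{|V| x d}
            self.cell = nn.LSTMCell(d, d)              # LSTMCell: R^d x R^d -> R^d
            self.ff1 = nn.Linear(d, 2 * m)             # FF_1: R^d -> R^{2m}

        def forward(self, x):                          # x: LongTensor of token ids, shape (T,)
            h = torch.zeros(1, self.cell.hidden_size)  # (0_d, 0_d) initial state
            c = torch.zeros(1, self.cell.hidden_size)
            for t in range(x.size(0)):
                h, c = self.cell(self.embed(x[t]).unsqueeze(0), (h, c))
            mu, log_var = self.ff1(h).chunk(2, dim=-1) # mu(x) and log sigma^2(x), each in R^m
            return mu.squeeze(0), log_var.squeeze(0)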
Example Decoder: Conditional Language Model
◮ Input. Vector z ∈ R^m
◮ Parameters. Word embeddings E ∈ R^{|V| × d} (often tied with encoder), LSTMCell: R^d × R^d → R^d, feedforward FF_2: R^m → R^d × R^d
◮ Forward. Given a sentence y ∈ V^L, compute its probability conditioned on z by
    h_1, c_1 \leftarrow \mathrm{LSTMCell}(E_{y_1}, \mathrm{FF}_2(z))
    h_2, c_2 \leftarrow \mathrm{LSTMCell}(E_{y_2}, (h_1, c_1))
    \vdots
    h_L, c_L \leftarrow \mathrm{LSTMCell}(E_{y_L}, (h_{L-1}, c_{L-1}))
    p_{X|Z}(y \mid z) = \prod_{l=1}^{L} \underbrace{\mathrm{softmax}_{y_l}(E h_{l-1})}_{p(y_l \mid z,\, y_{<l})}
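A matching PyTorch sketch of the decoder. The class and method names are illustrative; FF₂ is implemented as a single linear map to the initial state (h₀, c₀), and the output layer reuses the embedding matrix E, as the slide's softmax_{y_l}(E h_{l−1}) suggests.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LSTMDecoder(nn.Module):
        def __init__(self, vocab_size, d, m):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)    # E, often tied with the encoder's embeddings
            self.cell = nn.LSTMCell(d, d)
            self.ff2 = nn.Linear(m, 2 * d)              # FF_2: R^m -> R^d x R^d, giving (h_0, c_0)

        def nll(self, y, z):                            # y: token ids (L,), z: latent vector (m,)
            h, c = self.ff2(z).unsqueeze(0).chunk(2, dim=-1)   # initial LSTM state from z
            nll = 0.0
            for l in range(y.size(0)):
                logits = h @ self.embed.weight.t()      # scores for softmax_{y_l}(E h_{l-1})
                nll = nll - F.log_softmax(logits, dim=-1)[0, y[l]]
                h, c = self.cell(self.embed(y[l]).unsqueeze(0), (h, c))
            return nll                                  # -log p_{X|Z}(y | z)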
Example Prior: Isotropic Gaussian
◮ Simplest: fixed standard normal
    p_Z = \mathcal{N}(0_m, I_m)
◮ Parameters. None
◮ Can also make it more expressive, for instance a mixture of K diagonal Gaussians
    p_Z = \sum_{k=1}^{K} \mathrm{softmax}_k(\gamma) \times \mathcal{N}(\mu_k, \mathrm{diag}(\sigma_k^2))
◮ Parameters. γ ∈ R^K and μ_k, σ_k^2 ∈ R^m for k = 1 ... K
◮ Multimodal instead of unimodal
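A small PyTorch sketch of evaluating log p_Z(z) under the mixture prior (function and argument names are illustrative):

    import torch

    def mixture_prior_log_prob(z, gamma, mu, log_var):
        # gamma: (K,) unnormalized mixture logits; mu, log_var: (K, m) per-component parameters
        log_w = torch.log_softmax(gamma, dim=0)                        # log softmax_k(gamma)
        comps = torch.distributions.Normal(mu, (0.5 * log_var).exp())  # K diagonal Gaussians
        log_comp = comps.log_prob(z.unsqueeze(0)).sum(dim=-1)          # (K,): log N(z; mu_k, diag(sigma_k^2))
        return torch.logsumexp(log_w + log_comp, dim=0)                # log p_Z(z)

With a learnable prior, gamma, mu, and log_var would be nn.Parameter tensors trained jointly with the encoder and decoder.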
Summary
◮ Sentence X, m-dimensional vector Z
◮ Learnable parameters
    ◮ Word embeddings E shared by encoder and decoder
    ◮ LSTM and feedforward parameters in q_{Z|X}
    ◮ LSTM and feedforward parameters in p_{X|Z}
    ◮ (Optional) Parameters in the prior p_Z
◮ We will now consider learning all these parameters together in the autoencoding framework
Autoencoders (AEs)
[Diagram: x ∼ pop_X is fed to the encoder q_{Z|X}, which produces z; the decoder p_{X|Z} reconstructs x from z; the prior p_Z is a distribution over z.]
    q_{Z|X}: encoder    p_{X|Z}: decoder    p_Z: prior
Objective.
    \max_{p_Z,\, p_{X|Z},\, q_{Z|X}} \underbrace{\mathbb{E}_{x \sim \mathrm{pop}_X,\; z \sim q_{Z|X}(\cdot \mid x)}\left[\log p_{X|Z}(x \mid z)\right]}_{\text{reconstruction}} + \underbrace{R(\mathrm{pop}_X, p_Z, p_{X|Z}, q_{Z|X})}_{\text{regularization}}
Naive Autoencoders
Objective
    \max_{p_{X|Z},\, \mathrm{LSTM}} \mathbb{E}_{x \sim \mathrm{pop}_X}\left[\log p_{X|Z}(x \mid \mathrm{LSTM}(x))\right]
◮ Deterministic encoding: equivalent to learning a point-mass encoder q_{Z|X}(\mathrm{LSTM}(x) \mid x) = 1
◮ No regularization (hence no role for prior)
Denoising Autoencoders
Objective
    \max_{p_{X|Z},\, \mathrm{LSTM}} \mathbb{E}_{x \sim \mathrm{pop}_X,\; \epsilon \sim p_E}\left[\log p_{X|Z}(x \mid \mathrm{LSTM}(x + \epsilon))\right]
◮ Noise introduced at the input; reconstruct the original input
◮ Equivalent to learning an encoder q_{Z|X}(\mathrm{LSTM}(x + \epsilon) \mid x) = p_E(\epsilon)
◮ Still no regularization, so no prior
◮ Example: masked language modeling (sketch of the masking noise below)
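For text, the corruption x + ε is usually token masking rather than additive noise; a minimal PyTorch sketch of such a noise distribution p_E (the mask id and rate are illustrative):

    import torch

    def mask_noise(x, mask_id, rate=0.15):
        # x: LongTensor of token ids; replace each token by [MASK] independently with probability `rate`
        noisy = x.clone()
        noisy[torch.rand(x.shape) < rate] = mask_id
        return noisy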
BERT as Denoising AE (Devlin et al., 2019)
[Figure: a Transformer (Vaswani et al., 2017) reads a masked input of the form "[CLS] the [MASK] ... [SEP] the cat [MASK] ... [SEP]" and predicts the masked words (e.g., "dog", "ran", "barked", "away") together with the IsNext label.]
Variational Autoencoders (VAEs)
Objective
    \max_{p_Z,\, p_{X|Z},\, q_{Z|X}} \mathbb{E}_{x \sim \mathrm{pop}_X,\; z \sim q_{Z|X}(\cdot \mid x)}\left[\log p_{X|Z}(x \mid z) - D_{\mathrm{KL}}(q_{Z|X} \,\|\, p_Z)\right]
◮ Great deal of flexibility in terms of how to optimize it
◮ Popular approach for the current setting
    ◮ Optimize the reconstruction term by sampling + the reparameterization trick
        z \sim q_{Z|X}(\cdot \mid x) \quad \Leftrightarrow \quad \epsilon \sim \mathcal{N}(0_m, I_m),\; z = \mu(x) + \sigma(x) \odot \epsilon
    ◮ Optimize the KL term in closed form
        D_{\mathrm{KL}}(\mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x))) \,\|\, \mathcal{N}(0_m, I_m)) = \frac{1}{2} \sum_{i=1}^{m} \left[\sigma_i^2(x) + \mu_i^2(x) - 1 - \log \sigma_i^2(x)\right]
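The two ingredients above in PyTorch, assuming the encoder outputs μ(x) and log σ²(x) as in the earlier sketch:

    import torch

    def reparameterize(mu, log_var):
        # z = mu(x) + sigma(x) * eps with eps ~ N(0_m, I_m); gradients flow through mu and sigma
        eps = torch.randn_like(mu)
        return mu + (0.5 * log_var).exp() * eps

    def kl_to_standard_normal(mu, log_var):
        # D_KL( N(mu, diag(sigma^2)) || N(0_m, I_m) ) = 1/2 sum_i (sigma_i^2 + mu_i^2 - 1 - log sigma_i^2)
        return 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum(dim=-1)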
VAE Loss: Concrete Steps
Given a sentence x ∼ pop_X (in general a minibatch):
1. Encoding. Run the encoder to calculate the Gaussian parameters μ(x), σ²(x) ∈ R^m
    \mu(x), \sigma^2(x) \leftarrow \mathrm{Encoder}(x)
2. KL. Calculate the KL term
    \kappa \leftarrow \frac{1}{2} \sum_{i=1}^{m} \left[\sigma_i^2(x) + \mu_i^2(x) - 1 - \log \sigma_i^2(x)\right]
3. Reconstruction. Estimate the reconstruction term by sampling + the reparameterization trick
    \rho \leftarrow \mathrm{DecoderNLL}(x,\; \mu(x) + \sigma(x) \odot \epsilon) \qquad \epsilon \sim \mathcal{N}(0_m, I_m)
4. Loss. Take a gradient step (wrt. all parameters) on ρ + βκ, where β is some weight.
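Putting the four steps together as a single PyTorch training step (a sketch, not the lecture's code): `encoder` and `decoder` are assumed to behave like the earlier LSTM sketches (DecoderNLL corresponds to `decoder.nll`), and `optimizer` is any torch.optim optimizer.

    import torch

    def vae_step(x, encoder, decoder, optimizer, beta=1.0):
        mu, log_var = encoder(x)                                     # 1. Encoding: mu(x), log sigma^2(x)
        kl = 0.5 * (log_var.exp() + mu ** 2 - 1.0 - log_var).sum()   # 2. KL term kappa
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)        # 3. Reparameterized sample of z
        rec_nll = decoder.nll(x, z)                                  #    rho = -log p_{X|Z}(x | z)
        loss = rec_nll + beta * kl                                   # 4. Loss rho + beta * kappa
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), rec_nll.item(), kl.item()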
Uses of VAEs
◮ Representation learning. Run the encoder on a sentence x to obtain its m-dimensional "meaning" vector
◮ Controlled generation. Run the decoder on some seed vector to conditionally generate sentences
◮ Can "interpolate" between two sentences x_1, x_2 (see the sketch below)
    z_1 \sim q_{Z|X}(\cdot \mid x_1) \qquad z_2 \sim q_{Z|X}(\cdot \mid x_2)
    x_\alpha \leftarrow \mathrm{Decode}(\alpha z_1 + (1 - \alpha) z_2) \qquad \alpha \in [0, 1]
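A sketch of the interpolation procedure; `encoder` returns (μ(x), log σ²(x)) as above, and `decode(z)` is an assumed helper that generates a sentence from a latent vector (e.g., greedy decoding with the LSTM decoder).

    import torch

    @torch.no_grad()
    def interpolate(x1, x2, encoder, decode, num_steps=5):
        def sample(x):
            mu, log_var = encoder(x)
            return mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # z ~ q_{Z|X}(.|x)
        z1, z2 = sample(x1), sample(x2)
        outputs = []
        for alpha in torch.linspace(1.0, 0.0, num_steps):              # alpha = 1 starts near x1
            outputs.append(decode(alpha * z1 + (1 - alpha) * z2))      # x_alpha
        return outputs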
Interpolation Examples
(examples from "A Surprisingly Effective Fix for Deep Latent Variable Modeling of Text", Li et al., 2019)
VAEs in Computer Vision
Random (never-before-seen) faces sampled from a VAE decoder!
(from "Generating Diverse High-Fidelity Images with VQ-VAE-2", Razavi et al., 2019)
VAE is EM
VAE Objective
    \mathbb{E}_{x \sim \mathrm{pop}_X,\; z \sim q_{Z|X}(\cdot \mid x)}\left[\log p_{X|Z}(x \mid z) - D_{\mathrm{KL}}(q_{Z|X} \,\|\, p_Z)\right] = \mathrm{ELBO}(p_{XZ}, q_{Z|X})
◮ Thus when you optimize the VAE you are maximizing a lower bound on the marginal log-likelihood defined by your LVGM
◮ Taking gradient steps for decoder/encoder/prior simultaneously is alternating optimization of the ELBO
◮ Difference with classical EM: we no longer insist on solving the E step exactly (i.e., setting q_{Z|X} = p_{Z|X})
    ◮ Train a separate variational model q_{Z|X} alongside p_{XZ}
Practical Issues
◮ Posterior collapse
◮ Quantities to monitor