Variational Autoencoders
Tom Fletcher
March 25, 2019
Talking about this paper: Diederik Kingma and Max Welling, "Auto-Encoding Variational Bayes," International Conference on Learning Representations (ICLR), 2014.
Autoencoders
[Diagram: input $x \in \mathbb{R}^D$ $\to$ latent space $z \in \mathbb{R}^d$ $\to$ output $x' \in \mathbb{R}^D$, with $d \ll D$]
Autoencoders
◮ Linear activation functions give you PCA
◮ Training (a minimal sketch follows below):
  1. Given data $x$, feedforward to the output $x'$
  2. Compute a loss, e.g., $L(x, x') = \|x - x'\|^2$
  3. Backpropagate the loss gradient to update the weights
◮ Not a generative model!
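A minimal PyTorch sketch of this training loop; the layer sizes, the Adam optimizer, and the random placeholder data are illustrative assumptions, not from the slides:

```python
# Minimal autoencoder sketch; D, d, layer sizes, and the random data are placeholders.
import torch
import torch.nn as nn

D, d = 784, 32  # input dimension and latent dimension, d << D

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, d))
decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

data_loader = [torch.randn(64, D) for _ in range(10)]   # placeholder batches of data

for x in data_loader:
    x_prime = decoder(encoder(x))                  # 1. feedforward to x'
    loss = ((x - x_prime) ** 2).sum(dim=1).mean()  # 2. L(x, x') = ||x - x'||^2
    opt.zero_grad()
    loss.backward()                                # 3. backpropagate to update weights
    opt.step()
```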
Variational Autoencoders
[Diagram: input $x \in \mathbb{R}^D$ $\to$ encoder outputs $\mu$, $\sigma^2$ $\to$ latent $z \sim \mathcal{N}(\mu, \sigma^2)$ $\to$ output $x' \in \mathbb{R}^D$]
Generative Models
[Graphical model: parameters $\theta$ and latent variable $z$ generate the observation $x$]
Sample a new $x$ in two steps:
◮ Prior: $p(z)$
◮ Generator: $p_\theta(x \mid z)$
Now the analogy to the "encoder" is the posterior: $p(z \mid x)$
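A small sketch of this two-step sampling, assuming the standard normal prior that appears later in the slides and a placeholder decoder network standing in for the generator:

```python
# Sketch: sample a new x in two steps; the decoder is a placeholder for p_theta(x|z).
import torch
import torch.nn as nn

d, D = 32, 784
decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))

z = torch.randn(d)       # step 1: draw z from the prior p(z) = N(0, I) (assumed here)
x_new = decoder(z)       # step 2: push z through the generator p_theta(x|z);
                         # here we simply take the decoder output as the sample's mean
```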
Posterior Inference
Posterior via Bayes' rule:
$$p(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{\int p_\theta(x \mid z)\, p(z)\, dz}$$
The integral in the denominator is (usually) intractable!
Could use Monte Carlo to approximate it, but that is expensive (see the sketch below).
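To make the cost concrete, here is a toy sketch (not from the paper) that estimates the denominator $p(x) = \int p_\theta(x \mid z)\, p(z)\, dz$ by averaging the likelihood over prior samples; the linear `decode` function, the isotropic Gaussian likelihood, and the sample count are all assumptions for illustration:

```python
# Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p_theta(x|z)] for a toy model.
import math
import torch

d, D = 2, 5
decode = torch.nn.Linear(d, D)   # stand-in "generator" mean for p_theta(x|z) (assumption)

def log_p_x_given_z(x, z, sigma=1.0):
    """Log of an isotropic Gaussian likelihood p_theta(x|z) with mean decode(z)."""
    mu = decode(z)
    return (-0.5 * ((x - mu) / sigma) ** 2 - 0.5 * math.log(2 * math.pi * sigma**2)).sum(-1)

x = torch.randn(D)
S = 100_000                      # many prior samples are needed for a decent estimate
z = torch.randn(S, d)            # z^(s) ~ p(z) = N(0, I)
# log p(x) ~= log (1/S) sum_s p_theta(x | z^(s)), computed stably with logsumexp
log_px = torch.logsumexp(log_p_x_given_z(x, z), dim=0) - math.log(S)
print(log_px)
```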
Kullback-Leibler Divergence
$$D_{KL}(q \,\|\, p) = -\int q(z) \log\!\left(\frac{p(z)}{q(z)}\right) dz = \mathbb{E}_q\!\left[-\log\frac{p}{q}\right]$$
The average information gained by moving from $q$ to $p$
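A quick numerical sketch of this definition for two one-dimensional Gaussians (the particular means and scales are arbitrary choices), comparing a Monte Carlo estimate of $\mathbb{E}_q[\log q - \log p]$ against PyTorch's closed-form KL:

```python
# Sketch: D_KL(q || p) = E_q[-log(p/q)], estimated by sampling from q.
import torch
from torch.distributions import Normal, kl_divergence

q = Normal(loc=1.0, scale=0.5)      # example q(z); parameters are arbitrary
p = Normal(loc=0.0, scale=1.0)      # example p(z)

z = q.sample((100_000,))                         # z ~ q
mc_kl = (q.log_prob(z) - p.log_prob(z)).mean()   # Monte Carlo E_q[log q - log p]
exact_kl = kl_divergence(q, p)                   # closed form for two Gaussians
print(mc_kl.item(), exact_kl.item())             # the two values should agree closely
```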
Variational Inference
Approximate the intractable posterior $p(z \mid x)$ with a manageable distribution $q(z)$.
Minimize the KL divergence: $D_{KL}(q(z) \,\|\, p(z \mid x))$
Evidence Lower Bound (ELBO)
$$
\begin{aligned}
D_{KL}(q(z) \,\|\, p(z \mid x)) &= \mathbb{E}_q\!\left[-\log\frac{p(z \mid x)}{q(z)}\right] \\
&= \mathbb{E}_q\!\left[-\log\frac{p(z, x)}{q(z)\, p(x)}\right] \\
&= \mathbb{E}_q\left[-\log p(z, x) + \log q(z) + \log p(x)\right] \\
&= -\mathbb{E}_q[\log p(z, x)] + \mathbb{E}_q[\log q(z)] + \log p(x)
\end{aligned}
$$
Rearranging: $\log p(x) = D_{KL}(q(z) \,\|\, p(z \mid x)) + \mathcal{L}[q(z)]$
ELBO: $\mathcal{L}[q(z)] = \mathbb{E}_q[\log p(z, x)] - \mathbb{E}_q[\log q(z)]$
Variational Autoencoder
◮ Encoder network: $q_\phi(z \mid x)$
◮ Decoder network: $p_\theta(x \mid z)$
Maximize the ELBO: $\mathcal{L}(\theta, \phi, x) = \mathbb{E}_{q_\phi}[\log p_\theta(x, z) - \log q_\phi(z \mid x)]$
VAE ELBO
$$
\begin{aligned}
\mathcal{L}(\theta, \phi, x) &= \mathbb{E}_{q_\phi}[\log p_\theta(x, z) - \log q_\phi(z \mid x)] \\
&= \mathbb{E}_{q_\phi}[\log p_\theta(z) + \log p_\theta(x \mid z) - \log q_\phi(z \mid x)] \\
&= \mathbb{E}_{q_\phi}\!\left[\log\frac{p_\theta(z)}{q_\phi(z \mid x)} + \log p_\theta(x \mid z)\right] \\
&= -D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]
\end{aligned}
$$
Problem: the gradient $\nabla_\phi \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]$ is intractable!
Use a Monte Carlo approximation, sampling $z^{(s)} \sim q_\phi(z \mid x)$:
$$\nabla_\phi \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] \approx \frac{1}{S} \sum_{s=1}^{S} \log p_\theta\!\left(x \mid z^{(s)}\right) \nabla_\phi \log q_\phi\!\left(z^{(s)} \mid x\right)$$
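A sketch of this score-function Monte Carlo estimator on a toy $q_\phi$ whose parameters are just a Gaussian mean and log-scale; the placeholder log-likelihood stands in for a real decoder network:

```python
# Sketch of the score-function estimator from the slide:
# grad_phi E_{q_phi}[log p_theta(x|z)]
#   ~= (1/S) sum_s log p_theta(x|z^(s)) * grad_phi log q_phi(z^(s)|x)
import torch
from torch.distributions import Normal

d = 2
mu = torch.zeros(d, requires_grad=True)          # phi = (mu, log_sigma), toy encoder output
log_sigma = torch.zeros(d, requires_grad=True)

def log_p_x_given_z(z):
    """Placeholder for log p_theta(x|z); a real decoder network would go here."""
    return -0.5 * (z ** 2).sum(-1)

S = 1000
q = Normal(mu, log_sigma.exp())
z = q.sample((S,))                               # plain samples, no gradient path through z
weights = log_p_x_given_z(z).detach()            # log p_theta(x | z^(s)), treated as constants
surrogate = (weights * q.log_prob(z).sum(-1)).mean()
surrogate.backward()                             # mu.grad, log_sigma.grad hold the estimator
print(mu.grad, log_sigma.grad)
```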
Reparameterization Trick
What about the other term, $-D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z))$?
It says the encoder, $q_\phi(z \mid x)$, should make the code $z$ look like the prior distribution.
Instead of encoding $z$ directly, encode the parameters of a normal distribution, $\mathcal{N}(\mu, \sigma^2)$.
Reparameterization Trick
$$q_\phi\!\left(z_j \mid x^{(i)}\right) = \mathcal{N}\!\left(\mu_j^{(i)}, \sigma_j^{2\,(i)}\right), \qquad p_\theta(z) = \mathcal{N}(0, I)$$
The KL divergence between these two is:
$$D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\right) = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log\!\left(\sigma_j^{2\,(i)}\right) - \left(\mu_j^{(i)}\right)^2 - \sigma_j^{2\,(i)}\right)$$
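Putting the pieces together, here is a minimal sketch of a per-batch VAE loss that uses this closed-form KL together with a single reparameterized sample $z = \mu + \sigma \epsilon$ and a Bernoulli decoder; the network sizes, architecture, and random placeholder batch are illustrative assumptions rather than the exact setup from Kingma & Welling:

```python
# Sketch: per-example loss = -ELBO = D_KL(q_phi(z|x) || N(0,I)) - E_q[log p_theta(x|z)],
# with the closed-form KL from the slide and one reparameterized sample z = mu + sigma * eps.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, d, H = 784, 20, 400                     # sizes are illustrative assumptions

enc = nn.Sequential(nn.Linear(D, H), nn.ReLU())
enc_mu, enc_logvar = nn.Linear(H, d), nn.Linear(H, d)   # encoder outputs mu and log sigma^2
dec = nn.Sequential(nn.Linear(d, H), nn.ReLU(), nn.Linear(H, D))  # decoder logits

def vae_loss(x):
    h = enc(x)
    mu, logvar = enc_mu(h), enc_logvar(h)
    eps = torch.randn_like(mu)
    z = mu + (0.5 * logvar).exp() * eps    # reparameterization: z = mu + sigma * eps
    # Closed-form KL from the slide: -1/2 sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    # One-sample Monte Carlo estimate of E_q[log p_theta(x|z)], Bernoulli decoder
    recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction="none").sum(dim=1)
    return (kl + recon).mean()             # minimize -ELBO averaged over the batch

x = torch.rand(64, D)                      # placeholder batch of data in [0, 1]
print(vae_loss(x))
```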
Results from Kingma & Welling