Variants and Combinations of Basic Models
Stefano Ermon, Aditya Grover
Stanford University
Lecture 12
Summary

Story so far:
- Representation: latent variable vs. fully observed
- Objective function and optimization algorithm: many divergences and distances, optimized via likelihood-free (two-sample test) or likelihood-based methods
- Each has its pros and cons

Plan for today: combining models
Variational Autoencoder

A mixture of an infinite number of Gaussians:
1. $z \sim \mathcal{N}(0, I)$
2. $p(x \mid z) = \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$, where $\mu_\theta, \Sigma_\theta$ are neural networks
3. $p(x \mid z)$ and $p(z)$ are usually simple, e.g., Gaussians or conditionally independent Bernoulli variables (i.e., pixel values chosen independently given $z$)
4. Idea: increase complexity using an autoregressive model
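To make the "infinite mixture of Gaussians" view concrete, here is a minimal sketch (PyTorch assumed; the single-hidden-layer MLP and all sizes are illustrative choices, not from the lecture) of a decoder $p(x \mid z) = \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$ with diagonal covariance, together with ancestral sampling:

```python
# Minimal sketch (assumption: PyTorch; network sizes are illustrative).
# Decoder p(x|z) = N(mu_theta(z), diag(sigma_theta(z)^2)), with mu_theta, sigma_theta as one MLP.
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    def __init__(self, z_dim=16, x_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, x_dim)
        self.log_sigma = nn.Linear(hidden, x_dim)

    def forward(self, z):
        h = self.net(z)
        return torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())

# Sampling from the "infinite mixture": draw z ~ N(0, I), then x ~ p(x|z).
decoder = GaussianDecoder()
z = torch.randn(8, 16)       # z ~ N(0, I)
x = decoder(z).sample()      # x | z ~ N(mu_theta(z), Sigma_theta(z))
```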
PixelVAE (Gulrajani et al., 2017)

- $z$ is a feature map with the same resolution as the image $x$
- Autoregressive structure: $p(x \mid z) = \prod_i p(x_i \mid x_1, \dots, x_{i-1}, z)$; $p(x \mid z)$ is a PixelCNN
- Prior $p(z)$ can also be autoregressive
- Can be hierarchical: $p(x \mid z_1)\, p(z_1 \mid z_2)$
- State-of-the-art log-likelihood on some datasets; learns features (unlike PixelCNN); computationally cheaper than PixelCNN (shallower)
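The sketch below (PyTorch assumed; a toy conditional masked convolution, not the actual Gulrajani et al. architecture) shows the structural ingredient of a PixelVAE decoder: a causally masked convolution so that pixel $i$ only sees $x_{<i}$, plus a $1\times 1$ projection injecting the feature-map latent $z$ at every spatial location:

```python
# Illustrative sketch (assumption: PyTorch; a toy conditional PixelCNN layer).
# The decoder factorizes p(x|z) = prod_i p(x_i | x_<i, z): a masked convolution reads only
# previous pixels, and z enters through a 1x1 convolution.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2:] = 0   # block the current pixel and everything after it in its row
        mask[kH // 2 + 1:, :] = 0     # block all rows below
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding)

class PixelVAEDecoder(nn.Module):
    def __init__(self, z_channels=4):
        super().__init__()
        self.causal = MaskedConv2d(1, 64, kernel_size=5, padding=2)    # reads x_<i
        self.cond = nn.Conv2d(z_channels, 64, kernel_size=1)           # injects the feature-map z
        self.out = nn.Conv2d(64, 256, kernel_size=1)                   # 256-way logits per pixel

    def forward(self, x, z):
        # z has the same spatial resolution as x, as in PixelVAE.
        return self.out(torch.relu(self.causal(x) + self.cond(z)))     # logits for p(x_i | x_<i, z)

decoder = PixelVAEDecoder()
x = torch.rand(2, 1, 28, 28)
z = torch.randn(2, 4, 28, 28)
print(decoder(x, z).shape)   # torch.Size([2, 256, 28, 28])
```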
Autoregressive flow

Flow model $x = f_\theta(z)$: the marginal likelihood $p(x)$ is given by
$$
p_X(x; \theta) = p_Z\!\left(f_\theta^{-1}(x)\right) \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right|
$$
where $p_Z(z)$ is typically simple (e.g., a Gaussian). More complex prior?
- The prior $p_Z(z)$ can be autoregressive: $p_Z(z) = \prod_i p(z_i \mid z_1, \dots, z_{i-1})$.
- Autoregressive models are flows. Just another MAF layer.
- See also neural autoregressive flows (Huang et al., ICML 2018)
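A small numerical check (PyTorch assumed; a 3-dimensional toy model with linear, strictly lower-triangular conditioners) of the claim that an autoregressive Gaussian model is "just another MAF layer": evaluating $\prod_i \mathcal{N}(z_i; \mu_i(z_{<i}), \sigma_i(z_{<i})^2)$ directly and evaluating the same point through the flow's change-of-variables formula give identical log-densities.

```python
# Two views of the same density (assumption: PyTorch; toy linear conditioners).
import torch

torch.manual_seed(0)
D = 3
W_mu = torch.tril(torch.randn(D, D), diagonal=-1)        # mu_i depends only on z_<i
W_ls = torch.tril(torch.randn(D, D), diagonal=-1) * 0.1  # log sigma_i depends only on z_<i

z = torch.randn(D)
mu, log_sigma = W_mu @ z, W_ls @ z

# 1) Autoregressive-model view: sum of conditional Gaussian log-densities.
log_p_ar = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z).sum()

# 2) Flow view: eps = f^{-1}(z) = (z - mu)/sigma, with log|det d eps / d z| = -sum_i log sigma_i.
eps = (z - mu) / log_sigma.exp()
log_p_flow = torch.distributions.Normal(0.0, 1.0).log_prob(eps).sum() - log_sigma.sum()

print(log_p_ar.item(), log_p_flow.item())   # identical up to floating-point error
```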
VAE + Flow Model

$$
\log p(x; \theta) \;\ge\; \sum_z q(z \mid x; \phi) \log p(z, x; \theta) + H(q(z \mid x; \phi)) \;=\; \underbrace{\mathcal{L}(x; \theta, \phi)}_{\text{ELBO}}
$$
$$
\log p(x; \theta) = \mathcal{L}(x; \theta, \phi) + \underbrace{D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))}_{\text{gap between true log-likelihood and ELBO}}
$$
- $q(z \mid x; \phi)$ is often too simple (Gaussian) compared to the true posterior $p(z \mid x; \theta)$, hence the ELBO bound is loose
- Idea: make the posterior more flexible: $z' \sim q(z' \mid x; \phi)$, $z = f_{\phi'}(z')$ for an invertible $f_{\phi'}$ (Rezende and Mohamed, 2015; Kingma et al., 2016)
- Still easy to sample from, and we can still evaluate the density.
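The following sketch (PyTorch assumed; a single planar flow step in the spirit of Rezende and Mohamed, 2015, with toy stand-ins for the encoder outputs) shows the mechanics: sample $z'$ from the simple posterior by reparameterization, push it through an invertible map, and correct the log-density with the log-determinant so it can be plugged into the ELBO.

```python
# Flow-based posterior sketch (assumption: PyTorch; one planar flow step, toy encoder outputs).
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        # f(z) = z + u * tanh(w^T z + b)
        # (for guaranteed invertibility, u must be constrained so that w^T u >= -1; omitted here)
        lin = z @ self.w + self.b                          # (batch,)
        f_z = z + self.u * torch.tanh(lin)[:, None]
        # log |det df/dz| = log |1 + u^T psi(z)|, with psi(z) = (1 - tanh^2(w^T z + b)) w
        psi = (1 - torch.tanh(lin) ** 2)[:, None] * self.w
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)
        return f_z, log_det

# Reparameterized base sample z0 ~ q(z0|x) = N(mu(x), sigma(x)^2), then one flow step.
mu, log_sigma = torch.zeros(4, 8), torch.zeros(4, 8)       # stand-ins for encoder outputs
z0 = mu + log_sigma.exp() * torch.randn_like(mu)
log_q0 = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z0).sum(-1)

flow = PlanarFlow(8)
z, log_det = flow(z0)
log_q = log_q0 - log_det   # density of the transformed posterior, used inside the ELBO
```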
VAE + Flow Model

The posterior approximation is more flexible, hence we can get a tighter ELBO (closer to the true log-likelihood).
Multimodal variants

- Goal: learn a joint distribution over two domains, $p(x_1, x_2)$, e.g., color and gray-scale images
- Can use a VAE-style model with a shared latent variable $z$ generating both $x_1$ and $x_2$
- Learn $p_\theta(x_1, x_2)$; use inference nets $q_\phi(z \mid x_1)$, $q_\phi(z \mid x_2)$, $q_\phi(z \mid x_1, x_2)$.
- Conceptually similar to the semi-supervised VAE in HW2.
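A structural sketch of such a model (PyTorch assumed; all layer sizes and the single-linear-layer encoders/decoders are placeholders) with one shared latent $z$, two decoders, and the three inference networks listed above:

```python
# Multimodal VAE skeleton (assumption: PyTorch; sizes illustrative).
import torch
import torch.nn as nn

def gaussian_head(in_dim, z_dim):
    # Outputs (mu, log_sigma) for a diagonal Gaussian over z.
    return nn.Linear(in_dim, 2 * z_dim)

class MultimodalVAE(nn.Module):
    def __init__(self, x1_dim=784, x2_dim=784, z_dim=32):
        super().__init__()
        self.enc_x1 = gaussian_head(x1_dim, z_dim)              # q(z | x1)
        self.enc_x2 = gaussian_head(x2_dim, z_dim)              # q(z | x2)
        self.enc_joint = gaussian_head(x1_dim + x2_dim, z_dim)  # q(z | x1, x2)
        self.dec_x1 = nn.Linear(z_dim, x1_dim)                  # mean of p(x1 | z)
        self.dec_x2 = nn.Linear(z_dim, x2_dim)                  # mean of p(x2 | z)

    def encode(self, x1=None, x2=None):
        if x1 is not None and x2 is not None:
            params = self.enc_joint(torch.cat([x1, x2], dim=-1))
        elif x1 is not None:
            params = self.enc_x1(x1)
        else:
            params = self.enc_x2(x2)
        mu, log_sigma = params.chunk(2, dim=-1)
        return mu + log_sigma.exp() * torch.randn_like(mu)      # reparameterized z ~ q(z | ...)

    def decode(self, z):
        return self.dec_x1(z), self.dec_x2(z)                   # means of p(x1 | z) and p(x2 | z)
```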
Variational RNN

- Goal: learn a joint distribution over a sequence, $p(x_1, \dots, x_T)$
- A VAE for sequential data, using latent variables $z_1, \dots, z_T$. Instead of training separate VAEs $z_i \to x_i$, train a joint model:
$$
p(x_{\le T}, z_{\le T}) = \prod_{t=1}^{T} p(x_t \mid z_{\le t}, x_{< t})\, p(z_t \mid z_{< t}, x_{< t})
$$
[Figure: VRNN computation graphs for (a) prior, (b) generation, (c) recurrence, (d) inference; Chung et al., 2016]
- Use RNNs to model the conditionals (similar to PixelRNN)
- Use RNNs for inference: $q(z_{\le T} \mid x_{\le T}) = \prod_{t=1}^{T} q(z_t \mid z_{< t}, x_{\le t})$
- Train like a VAE to maximize the ELBO. Conceptually similar to PixelVAE.
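A one-step generation sketch (PyTorch assumed; the GRU cell and linear heads are illustrative stand-ins for the networks in the paper): the recurrent state $h_{t-1}$ summarizes $(x_{<t}, z_{<t})$, and the prior over $z_t$, the decoder over $x_t$, and the state update all condition on it.

```python
# VRNN-style generation step (assumption: PyTorch; toy network choices).
import torch
import torch.nn as nn

class VRNNCell(nn.Module):
    def __init__(self, x_dim=28, z_dim=16, h_dim=64):
        super().__init__()
        self.prior = nn.Linear(h_dim, 2 * z_dim)           # parameters of p(z_t | h_{t-1})
        self.decoder = nn.Linear(h_dim + z_dim, x_dim)     # mean of p(x_t | z_t, h_{t-1})
        self.rnn = nn.GRUCell(x_dim + z_dim, h_dim)        # h_t = f(h_{t-1}, x_t, z_t)

    def generate_step(self, h):
        mu, log_sigma = self.prior(h).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)    # z_t ~ p(z_t | z_<t, x_<t)
        x = self.decoder(torch.cat([h, z], dim=-1))        # x_t | z_<=t, x_<t (mean shown)
        h_next = self.rnn(torch.cat([x, z], dim=-1), h)    # recurrence
        return x, z, h_next

cell = VRNNCell()
h = torch.zeros(1, 64)
xs = []
for _ in range(10):                                        # unroll 10 generation steps
    x, z, h = cell.generate_step(h)
    xs.append(x)
```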
Combining losses

Flow model $x = f_\theta(z)$: the marginal likelihood $p(x)$ is given by
$$
p_X(x; \theta) = p_Z\!\left(f_\theta^{-1}(x)\right) \left| \det \frac{\partial f_\theta^{-1}(x)}{\partial x} \right|
$$
- The flow can also be thought of as the generator of a GAN
- Should we train by $\min_\theta D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta)$ or $\min_\theta \mathrm{JSD}(p_{\mathrm{data}}, p_\theta)$?
FlowGAN

- Although $D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta) = 0$ if and only if $\mathrm{JSD}(p_{\mathrm{data}}, p_\theta) = 0$, optimizing one does not necessarily optimize the other.
- If $z$ and $x$ have the same dimension, we can optimize
$$
\min_\theta \; D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta) + \lambda\, \mathrm{JSD}(p_{\mathrm{data}}, p_\theta)
$$
- Interpolates between a GAN and a flow model
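A sketch of the combined objective (PyTorch assumed; `flow` exposing `log_prob`/`sample` and the discriminator `disc` are hypothetical stand-ins): the flow's exact likelihood supplies the KL term, the discriminator supplies the adversarial term, and $\lambda$ trades them off.

```python
# Hybrid MLE + adversarial generator loss (assumption: PyTorch; `flow` and `disc` are
# hypothetical stand-ins with the interfaces noted below).
import torch
import torch.nn.functional as F

def flowgan_generator_loss(flow, disc, x_real, lambda_adv=1.0):
    nll = -flow.log_prob(x_real).mean()                  # exact NLL: minimizing KL(p_data || p_theta)
    x_fake = flow.sample(x_real.shape[0])                # z ~ p_Z, x = f_theta(z)
    adv = F.binary_cross_entropy_with_logits(            # non-saturating GAN loss on generated samples
        disc(x_fake), torch.ones(x_real.shape[0], 1))
    return nll + lambda_adv * adv                        # lambda_adv interpolates flow <-> GAN training
```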
Adversarial Autoencoder (VAE + GAN)

$$
\log p(x; \theta) = \underbrace{\mathcal{L}(x; \theta, \phi)}_{\text{ELBO}} + D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))
$$
$$
\underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}[\mathcal{L}(x; \theta, \phi)]}_{\approx\ \text{training obj. up to const.}} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log p(x; \theta) - D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))\big]
$$
$$
\equiv \underbrace{-D_{\mathrm{KL}}(p_{\mathrm{data}}(x) \,\|\, p(x; \theta))}_{\text{equiv. to MLE}} - \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))\big]
$$
- Note: this is regularized maximum likelihood estimation (Shu et al., Amortized Inference Regularization)
- Can add a GAN objective $-\mathrm{JSD}(p_{\mathrm{data}}, p(x; \theta))$ to get sharper samples, i.e., a discriminator attempting to distinguish VAE samples from real ones.
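A sketch of such a hybrid loss (PyTorch assumed; `vae` with `encode`/`decode` methods and an image discriminator `disc` are hypothetical stand-ins): the usual negative ELBO plus an adversarial term that asks the discriminator to accept decoded samples as real.

```python
# VAE loss + sample-space GAN term (assumption: PyTorch; Gaussian decoder so the
# reconstruction term reduces to a squared error up to constants).
import torch
import torch.nn.functional as F

def vae_gan_loss(vae, disc, x_real, lambda_adv=0.1):
    mu, log_sigma = vae.encode(x_real)
    z = mu + log_sigma.exp() * torch.randn_like(mu)            # reparameterized posterior sample
    x_recon = vae.decode(z)

    recon = F.mse_loss(x_recon, x_real)                        # -E_q[log p(x|z)] up to constants
    kl = (-0.5 * (1 + 2 * log_sigma - mu.pow(2)
                  - (2 * log_sigma).exp())).sum(-1).mean()     # KL(q(z|x) || N(0, I))
    neg_elbo = recon + kl

    adv = F.binary_cross_entropy_with_logits(                  # fool the sample discriminator
        disc(x_recon), torch.ones(x_real.shape[0], 1))
    return neg_elbo + lambda_adv * adv
```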
An alternative interpretation

$$
\underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}[\mathcal{L}(x; \theta, \phi)]}_{\approx\ \text{training obj. up to const.}} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log p(x; \theta) - D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))\big]
$$
$$
\begin{aligned}
&\equiv -D_{\mathrm{KL}}(p_{\mathrm{data}}(x) \,\|\, p(x; \theta)) - \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))\big] \\
&= -\sum_x p_{\mathrm{data}}(x) \left[ \log \frac{p_{\mathrm{data}}(x)}{p(x; \theta)} + \sum_z q(z \mid x; \phi) \log \frac{q(z \mid x; \phi)}{p(z \mid x; \theta)} \right] \\
&= -\sum_x p_{\mathrm{data}}(x) \sum_z q(z \mid x; \phi) \log \frac{q(z \mid x; \phi)\, p_{\mathrm{data}}(x)}{p(z \mid x; \theta)\, p(x; \theta)} \\
&= -\sum_{x, z} p_{\mathrm{data}}(x)\, q(z \mid x; \phi) \log \frac{p_{\mathrm{data}}(x)\, q(z \mid x; \phi)}{p(x; \theta)\, p(z \mid x; \theta)} \\
&= -D_{\mathrm{KL}}\big(\underbrace{p_{\mathrm{data}}(x)\, q(z \mid x; \phi)}_{q(z, x; \phi)} \,\big\|\, \underbrace{p(x; \theta)\, p(z \mid x; \theta)}_{p(z, x; \theta)}\big)
\end{aligned}
$$
An alternative interpretation

$$
\underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}[\mathcal{L}(x; \theta, \phi)]}_{\text{ELBO}} \equiv -D_{\mathrm{KL}}\big(\underbrace{p_{\mathrm{data}}(x)\, q(z \mid x; \phi)}_{q(z, x; \phi)} \,\big\|\, \underbrace{p(x; \theta)\, p(z \mid x; \theta)}_{p(z, x; \theta)}\big)
$$
- Optimizing the ELBO is the same as matching the inference distribution $q(z, x; \phi)$ to the generative distribution $p(z, x; \theta) = p(z)\, p(x \mid z; \theta)$.
- Intuition: $p(x; \theta)\, p(z \mid x; \theta) = p_{\mathrm{data}}(x)\, q(z \mid x; \phi)$ if
  1. $p_{\mathrm{data}}(x) = p(x; \theta)$
  2. $q(z \mid x; \phi) = p(z \mid x; \theta)$ for all $x$
- Hence we get the VAE objective:
$$
-D_{\mathrm{KL}}(p_{\mathrm{data}}(x) \,\|\, p(x; \theta)) - \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))\big]
$$
- Many other variants are possible! VAE + GAN:
$$
-\mathrm{JSD}(p_{\mathrm{data}}(x) \,\|\, p(x; \theta)) - D_{\mathrm{KL}}(p_{\mathrm{data}}(x) \,\|\, p(x; \theta)) - \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D_{\mathrm{KL}}(q(z \mid x; \phi) \,\|\, p(z \mid x; \theta))\big]
$$
Adversarial Autoencoder (VAE + GAN)

$$
\underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}[\mathcal{L}(x; \theta, \phi)]}_{\text{ELBO}} \equiv -D_{\mathrm{KL}}\big(\underbrace{p_{\mathrm{data}}(x)\, q(z \mid x; \phi)}_{q(z, x; \phi)} \,\big\|\, \underbrace{p(x; \theta)\, p(z \mid x; \theta)}_{p(z, x; \theta)}\big)
$$
- Optimizing the ELBO is the same as matching the inference distribution $q(z, x; \phi)$ to the generative distribution $p(z, x; \theta)$.
- Symmetry: using the alternative factorization, $p(z)\, p(x \mid z; \theta) = q(z; \phi)\, q(x \mid z; \phi)$ if
  1. $q(z; \phi) = p(z)$
  2. $q(x \mid z; \phi) = p(x \mid z; \theta)$ for all $z$
- We get an equivalent form of the VAE objective:
$$
-D_{\mathrm{KL}}(q(z; \phi) \,\|\, p(z)) - \mathbb{E}_{z \sim q(z; \phi)}\big[D_{\mathrm{KL}}(q(x \mid z; \phi) \,\|\, p(x \mid z; \theta))\big]
$$
- Other variants are possible. For example, we can add $-\mathrm{JSD}(q(z; \phi) \,\|\, p(z))$ to match features in latent space (Zhao et al., 2017; Makhzani et al., 2018)
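A sketch of the latent-space adversarial term (PyTorch assumed; `encoder` and `latent_disc` are hypothetical stand-ins), in the spirit of adversarial autoencoders: a discriminator separates encodings $z \sim q(z; \phi)$ from prior samples $z \sim p(z)$, and the encoder is trained to fool it, approximately matching the aggregate posterior to the prior.

```python
# Latent-space adversarial regularization (assumption: PyTorch; networks are stand-ins).
import torch
import torch.nn.functional as F

def latent_adversarial_losses(encoder, latent_disc, x_real, z_dim=32):
    z_q = encoder(x_real)                           # z ~ q(z; phi): encodings of real data
    z_p = torch.randn(x_real.shape[0], z_dim)       # z ~ p(z) = N(0, I)

    ones = torch.ones(x_real.shape[0], 1)
    zeros = torch.zeros(x_real.shape[0], 1)

    # Discriminator: prior samples are "real", encodings are "fake".
    d_loss = (F.binary_cross_entropy_with_logits(latent_disc(z_p), ones) +
              F.binary_cross_entropy_with_logits(latent_disc(z_q.detach()), zeros))

    # Encoder: make its encodings indistinguishable from prior samples.
    e_loss = F.binary_cross_entropy_with_logits(latent_disc(z_q), ones)
    return d_loss, e_loss
```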
Information Preference

$$
\underbrace{\mathbb{E}_{x \sim p_{\mathrm{data}}}[\mathcal{L}(x; \theta, \phi)]}_{\text{ELBO}} \equiv -D_{\mathrm{KL}}\big(\underbrace{p_{\mathrm{data}}(x)\, q(z \mid x; \phi)}_{q(z, x; \phi)} \,\big\|\, \underbrace{p(x; \theta)\, p(z \mid x; \theta)}_{p(z, x; \theta)}\big)
$$
- The ELBO is optimized as long as $q(z, x; \phi) = p(z, x; \theta)$. Many solutions are possible! For example:
  1. $p(z, x; \theta) = p(z)\, p(x \mid z; \theta) = p(z)\, p_{\mathrm{data}}(x)$
  2. $q(z, x; \phi) = p_{\mathrm{data}}(x)\, q(z \mid x; \phi) = p_{\mathrm{data}}(x)\, p(z)$
  3. Note that $x$ and $z$ are independent: $z$ carries no information about $x$. This happens in practice when $p(x \mid z; \theta)$ is too flexible, like a PixelCNN.
- Issue: many more variables than constraints
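To spell out why this independent solution is a global optimum of the ELBO (making the constant hidden in the "$\equiv$" explicit; the identity follows from the derivation two slides back):
$$
\mathbb{E}_{p_{\mathrm{data}}}[\mathcal{L}(x; \theta, \phi)]
= \mathbb{E}_{p_{\mathrm{data}}}[\log p_{\mathrm{data}}(x)]
- D_{\mathrm{KL}}\big(p_{\mathrm{data}}(x)\, q(z \mid x; \phi) \,\|\, p(x; \theta)\, p(z \mid x; \theta)\big),
$$
and at the solution above the KL term becomes $D_{\mathrm{KL}}(p_{\mathrm{data}}(x)\, p(z) \,\|\, p_{\mathrm{data}}(x)\, p(z)) = 0$, so the expected ELBO attains its largest possible value $\mathbb{E}_{p_{\mathrm{data}}}[\log p_{\mathrm{data}}(x)]$ even though $z$ is completely uninformative about $x$.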