  1. Advanced Machine Learning Variational Auto-encoders Amit Sethi, EE, IITB

  2. Objectives • Learn how VAEs help in sampling from a data distribution • Write the objective function of a VAE • Derive how VAE objective is adapted for SGD

  3. VAE setup • We are interested in maximizing the data likelihood P(X) = ∫ P(X|z; θ) P(z) dz • Let P(X|z; θ) be modeled by f(z; θ) • Further, let us assume that P(X|z; θ) = N(X | f(z; θ), σ²I) Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch
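
To make the generative side concrete, here is a minimal sketch of drawing X from P(X|z; θ) = N(X | f(z; θ), σ²I); the small MLP decoder and the values of `latent_dim`, `data_dim`, and `sigma` are illustrative assumptions, not part of the slides.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, sigma = 2, 784, 0.1       # illustrative sizes

decoder = nn.Sequential(                        # f(z; theta): a small MLP
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, data_dim),
)

z = torch.randn(16, latent_dim)                 # z ~ P(z) = N(0, I)
mean = decoder(z)                               # f(z; theta)
X = mean + sigma * torch.randn_like(mean)       # X ~ N(f(z; theta), sigma^2 I)
```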

  4. We do not care about the distribution of z • Latent variable z is drawn from a standard normal, z ∼ N(0, I) • (Plate diagram: z generates X through parameters θ, repeated over N samples) • It may represent many different variations of the data Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  5. Example of a variable transformation • X = g(z) = z/10 + z/‖z‖ (figure: scatter of samples of z and of X = g(z)) Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch
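
A tiny numerical check of this toy transformation (sample count and seed are arbitrary choices): 2-D standard-normal samples z are mapped onto a noisy ring of radius roughly 1, showing how a simple deterministic g turns N(0, I) samples into a very different distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))                         # z ~ N(0, I) in 2-D
X = z / 10 + z / np.linalg.norm(z, axis=1, keepdims=True)  # X = g(z) = z/10 + z/||z||
print(np.linalg.norm(X, axis=1).mean())                    # radii concentrate near 1
```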

  6. Because of the Gaussian assumption, the most obvious variation may not be the most likely • Although the ‘2’ on the right is a better choice as a variation of the one on the left, the one in the middle is more likely under the Gaussian assumption Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  7. Sampling z from a standard normal is problematic • It may give samples of z that are unlikely to have produced X • Can we sample z itself intelligently? • Enter Q(z|X), which lets us compute, e.g., E z∼Q P(X|z) • All we need to do is relate E z∼Q P(X|z) to P(X), which is done through the KL divergence D[Q(z) ‖ P(z|X)] • Hence, a variational method Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  8. VAE Objective Setup D[Q(z) ‖ P(z|X)] = E z∼Q [log Q(z) − log P(z|X)] = E z∼Q [log Q(z) − log P(X|z) − log P(z)] + log P(X) Rearranging some terms: log P(X) − D[Q(z) ‖ P(z|X)] = E z∼Q [log P(X|z)] − D[Q(z) ‖ P(z)] Introducing dependency of Q on X: log P(X) − D[Q(z|X) ‖ P(z|X)] = E z∼Q [log P(X|z)] − D[Q(z|X) ‖ P(z)] Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  9. Optimizing the RHS • Q is encoding X into z; P(X|z) is decoding z • Assume Q(z|X) on the LHS is a high-capacity NN • For: E z∼Q [log P(X|z)] − D[Q(z|X) ‖ P(z)] • Assume: Q(z|X) = N(z | μ(X; θ), Σ(X; θ)) • Then the KL divergence is: D[N(μ(X), Σ(X)) ‖ N(0, I)] = 1/2 [tr(Σ(X)) + μ(X)ᵀμ(X) − k − log det(Σ(X))] • In SGD, the objective becomes maximizing: E X∼D [log P(X) − D[Q(z|X) ‖ P(z|X)]] = E X∼D [E z∼Q [log P(X|z)] − D[Q(z|X) ‖ P(z)]] Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch
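
The closed-form KL term above is straightforward to implement once Σ(X) is taken to be diagonal and parameterized by its log-variance; that parameterization is a common implementation choice assumed here, not something stated on the slide.

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """D[N(mu, Sigma) || N(0, I)] = 1/2 [tr(Sigma) + mu^T mu - k - log det(Sigma)]
    for a diagonal Sigma = diag(exp(logvar)), summed over the k latent dimensions."""
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

mu = torch.zeros(4, 2)        # batch of 4, latent dimension k = 2
logvar = torch.zeros(4, 2)    # Sigma = I, so the KL term should be exactly 0
print(kl_to_standard_normal(mu, logvar))   # tensor([0., 0., 0., 0.])
```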

  10. Moving the gradient inside the expectation • We need to compute the gradient of: log P(X|z) − D[Q(z|X) ‖ P(z)] • The first term by itself does not depend on the parameters of Q, but its expectation E z∼Q [log P(X|z)] does! • So, we need to generate z that are plausible, i.e. decodable Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  11. The actual model that resists backpropagation • Cannot backpropagate through a stochastic unit Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  12. The actual model that resists backpropagation • Reparameterization trick: e ∼ N(0, I) and z = μ(X) + Σ^{1/2}(X) ∗ e • This works if Q(z|X) and P(z) are continuous • E X∼D [E e∼N(0,I) [log P(X | z = μ(X) + Σ^{1/2}(X) ∗ e)] − D[Q(z|X) ‖ P(z)]] • Now we can backpropagate end-to-end, because the expectations are no longer with respect to distributions that depend on the model parameters Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch
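
A minimal sketch of the reparameterization trick, assuming single-layer encoder/decoder networks and a diagonal Σ(X) parameterized by its log-variance (all illustrative choices): because e is drawn outside the network, gradients flow through z into μ and Σ.

```python
import torch
import torch.nn as nn

data_dim, latent_dim = 784, 2                      # illustrative sizes
encoder = nn.Linear(data_dim, 2 * latent_dim)      # Q(z|X): outputs [mu(X), log-variance of Sigma(X)]
decoder = nn.Linear(latent_dim, data_dim)          # P(X|z): mean f(z; theta)

X = torch.rand(8, data_dim)
mu, logvar = encoder(X).chunk(2, dim=-1)

eps = torch.randn_like(mu)                         # e ~ N(0, I): involves no model parameters
z = mu + (0.5 * logvar).exp() * eps                # z = mu(X) + Sigma^{1/2}(X) * e

recon = decoder(z)
recon_loss = ((recon - X) ** 2).sum(dim=-1)        # -log P(X|z) up to constants (Gaussian likelihood)
kl = 0.5 * (logvar.exp() + mu ** 2 - 1 - logvar).sum(dim=-1)
loss = (recon_loss + kl).mean()                    # negative of the objective, averaged over the batch
loss.backward()                                    # gradients reach both encoder and decoder
```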

  13. Test-time sampling is straightforward • The encoder pathway, including the multiplication and addition, is discarded • To estimate the likelihood of a test sample, generate z and then compute P(X|z) Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  14. Conditional VAE Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  15. Sample results for an MNIST VAE Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  16. Sample results for an MNIST CVAE Source: VAEs by Kingma, Welling, et al.; “Tutorial on Variational Autoencoders” by Carl Doersch

  17. Advanced Machine Learning Generative Adversarial Networks Amit Sethi, EE, IITB

  18. Objectives • Articulate how using a discriminator helps a generator • Write the objective function of a GAN • Write the training algorithm for a GAN

  19. GAN trains two networks together • (Block diagram: noise z → generator G → fake sample x′; x′ and real x → discriminator D → decision y) • GAN objective: min_G max_D V(D, G) = E x∼p_x(x) [log D(x)] + E z∼p_z(z) [log(1 − D(G(z)))] Source: “Generative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014
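
A minimal sketch of the value function V(D, G) with stand-in linear networks; the shapes and the placeholder “real” batch are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 64                                # illustrative sizes
G = nn.Linear(noise_dim, data_dim)                          # generator: x' = G(z)
D = nn.Sequential(nn.Linear(data_dim, 1), nn.Sigmoid())     # discriminator: y = D(x) in (0, 1)

x = torch.randn(32, data_dim)                               # stand-in for a real data batch
z = torch.randn(32, noise_dim)                              # z ~ p_z(z)

v = torch.log(D(x)).mean() + torch.log(1 - D(G(z))).mean()  # V(D, G)
# D is trained to ascend v; G is trained to descend the second term.
```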

  20. At the solution, the transformed distribution from z will emulate p_x(x) • (Figure: four snapshots over training steps showing the data distribution p_x(x), the generator’s distribution, and the discriminator D) • As training progresses, the distributions of the transformed noise and the data will become indistinguishable Source: “Generative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014

  21. The trick is to allow D to catch up before improving G in each iteration
      o For each training iteration:
        o For k steps: update the discriminator by ascending its stochastic gradient ∇_θ_D (1/n) Σ_{j=1..n} [ log D(x_j) + log(1 − D(G(z_j))) ]
        o Then update the generator by descending its stochastic gradient ∇_θ_G (1/n) Σ_{j=1..n} log(1 − D(G(z_j)))
      Source: “Generative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014
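
A compact PyTorch sketch of this alternating procedure; the optimizers, the value of k, the layer sizes, and the random stand-in for real data are all illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

noise_dim, data_dim, k = 16, 64, 1                # k discriminator steps per generator step
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=1e-3)
opt_g = torch.optim.SGD(G.parameters(), lr=1e-3)

def real_batch(n=32):
    return torch.randn(n, data_dim)               # placeholder for minibatches of real data

for it in range(100):                             # "for training iterations"
    for _ in range(k):                            # "for k steps": update D by gradient ascent
        x, z = real_batch(), torch.randn(32, noise_dim)
        d_loss = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    z = torch.randn(32, noise_dim)                # then one generator step by gradient descent
    g_loss = torch.log(1 - D(G(z))).mean()        # descend log(1 - D(G(z)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```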

  22. An optimum exists • For a fixed generator, D*(x) = p_x(x) / (p_x(x) + p_G(x)) • Because E x∼p_x(x) [log D(x)] + E z∼p_z(z) [log(1 − D(G(z)))] = ∫ [ p_x(x) log D(x) + p_G(x) log(1 − D(x)) ] dx • And the optimum of a log y + b log(1 − y) is y = a / (a + b) Source: “Generative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014
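
Filling in the single derivative step behind the stated optimum:

```latex
\frac{d}{dy}\bigl[a\log y + b\log(1-y)\bigr] = \frac{a}{y} - \frac{b}{1-y} = 0
\quad\Longrightarrow\quad y^{*} = \frac{a}{a+b},
\qquad \text{with } a = p_x(x),\; b = p_G(x),\; y = D(x).
```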

  23. Generator’s optimization reduces as follows… • E x∼p_x(x) [log D*(x)] + E z∼p_z(z) [log(1 − D*(G(z)))] = E x∼p_x(x) [log ( p_x(x) / (p_x(x) + p_G(x)) )] + E x∼p_G(x) [log ( p_G(x) / (p_x(x) + p_G(x)) )] = − log 4 + KL( p_x ‖ (p_x + p_G)/2 ) + KL( p_G ‖ (p_x + p_G)/2 ), i.e., − log 4 plus twice the Jensen–Shannon divergence between p_x and p_G, which vanishes exactly when p_G = p_x • This assumes that the generator and the discriminator have high enough capacity to model the desired distributions arbitrarily well. Source: “Generative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014

  24. Some sample generations and interpolations of the latent vector Source: “Generative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014

  25. DC-GAN was designed to generate better images • No pooling – strided convolutions (stride > 1) for downsampling and fractionally-strided convolutions for upsampling • No fully connected layers • Heavy use of batchnorm • Use ReLU in G and leaky ReLU in D in all but the final layers • Use tanh in the last layer of G Source: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford et al. in ICLR 2016
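
A sketch of a generator that follows these guidelines; the specific layer sizes and the 64×64 RGB output are illustrative assumptions loosely patterned on the paper, not a verbatim reproduction of its architecture.

```python
import torch
import torch.nn as nn

# Fractionally-strided (transposed) convolutions do the upsampling; there is no pooling
# and no fully connected layer; batchnorm + ReLU everywhere except the tanh output.
G = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0, bias=False), nn.BatchNorm2d(512), nn.ReLU(),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False), nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False), nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),  nn.BatchNorm2d(64),  nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),    nn.Tanh(),
)

z = torch.randn(1, 100, 1, 1)   # latent code reshaped to a 1x1 spatial map
img = G(z)                      # -> shape (1, 3, 64, 64)
```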

  26. While mode collapse isn’t evident, there is some underfitting Source: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford et al. in ICLR 2016

  27. GAN features can directly be used for classification Source: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford et al. in ICLR 2016

  28. GANs allow latent vector “arithmetic” Source: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford et al. in ICLR 2016

  29. Advantages and disadvantages of GANs • Advantages: no Markov chains needed; only backprop is used; no inference is needed during training; models a wide range of functions • Disadvantages: no explicit representation of the generator’s distribution; D must be kept in sync with G; mode collapse Source: “Generative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014

  30. Conditional GAN introduces another variable (e.g. class) • Instead of the GAN objective: E x∼p_x(x) [log D(x)] + E z∼p_z(z) [log(1 − D(G(z)))] • CGAN uses a modified objective: E x∼p_x(x) [log D(x|y)] + E z∼p_z(z) [log(1 − D(G(z|y)))], where y is the conditioning variable Source: “Conditional Generative Adversarial Nets” by Mirza and Osindero, arXiv 2014
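
One common way to realize the conditioning, sketched under the assumption that y is a one-hot class label concatenated to the inputs of both networks; this is an implementation choice in the spirit of the paper's MNIST setup, not its exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, data_dim, n_classes = 16, 64, 10                          # illustrative sizes
G = nn.Linear(noise_dim + n_classes, data_dim)                       # G(z | y)
D = nn.Sequential(nn.Linear(data_dim + n_classes, 1), nn.Sigmoid())  # D(x | y)

z = torch.randn(32, noise_dim)
y = F.one_hot(torch.randint(0, n_classes, (32,)), n_classes).float() # conditioning variable (class)

x_fake = G(torch.cat([z, y], dim=1))                                 # generator sees z and y
score = D(torch.cat([x_fake, y], dim=1))                             # discriminator sees x and y
```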

  31. Conditional GAN introduces another variable (e.g. class) Source: “Conditional Generative Adversarial Nets” by Mirza and Osindero, arXiv 2014

  32. Each row of CGAN samples is conditioned on one digit label Source: “Conditional Generative Adversarial Nets” by Mirza and Osindero, arXiv 2014
