  1. Neural Discrete Representation Learning (VQ-VAE). Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu (Google DeepMind, NIPS 2017)

  2. Neural Discrete Representation Learning. Outline:
     1. What is the task?
     2. Comparison & Contribution
     3. VQ-VAE Model
     4. Results: (a) density estimation & reconstruction, (b) sampling, (c) speech
     5. Discussion & Conclusion

  3. What is the task?
     • Task 1: Density estimation: learn p(x)
     • Task 2: Extract a meaningful latent variable (unsupervised)
     • Task 3: Reconstruct the input
     [Figure: autoencoder diagram, input x → latent z → output x']

  4. Comparison & Contribution
     1. Bounds p(x), but does not require a variational approximation
     2. Trained using maximum likelihood (stable training)
     3. First to use discrete latent variables successfully
     4. Uses the whole latent space (avoids 'posterior collapse')
     [Figure: image captioned "A little girl sitting on a bed with a teddy bear"]
     After discussion: why is discrete nice? It is a more natural representation for humans; it avoids posterior collapse (the latent space is easier to manage through the dictionary); it is compressible; and it is easier to learn a prior over a discrete latent space (more tractable than a continuous one).

  5. Autoencoder. How to discretize?
     For the running example, we take the latent to be a 4 x 4 image with 2 channels.
     [Figure: input → latent variable → output (reconstruction)]
     We can train this system end-to-end using MSE (reconstruction loss), as sketched below.
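A minimal PyTorch sketch of this plain autoencoder, assuming a 4 x 4 x 2 latent as on the slide. Input size, layer shapes, and optimizer settings are illustrative, not from the talk:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative autoencoder: the encoder maps a 16x16 RGB image to a
# 4 x 4 spatial latent with 2 channels; the decoder maps it back.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1),  # 16x16 -> 8x8
    nn.ReLU(),
    nn.Conv2d(16, 2, kernel_size=4, stride=2, padding=1),  # 8x8 -> 4x4
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(2, 16, kernel_size=4, stride=2, padding=1),   # 4x4 -> 8x8
    nn.ReLU(),
    nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),   # 8x8 -> 16x16
)

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
x = torch.rand(8, 3, 16, 16)        # dummy batch of 16x16 RGB images
z = encoder(x)                      # continuous latent, shape (8, 2, 4, 4)
loss = F.mse_loss(decoder(z), x)    # end-to-end reconstruction loss (MSE)
opt.zero_grad()
loss.backward()
opt.step()
```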

  6. How to discretize?
     4 x 4 latent image with 2 channels. We plot all 16 latent pixel values as points in 2D (since we have 2 channels).
     [Figure: scatter plot of the 16 latent values, axes Channel 1 vs. Channel 2]

  7. How to discretize?
     Make a dictionary of vectors e_1, ..., e_L. Each e_j has 2 dimensions (matching the 2 latent channels).

  8. How to discretize?
     Make a dictionary of vectors e_1, ..., e_L, each with 2 dimensions. For each latent pixel, look up the nearest dictionary element e_j.
     [Figure: the 2D scatter plot with dictionary elements e_1, e_2, e_3 marked]

  9. How to discretize?
     4 x 4 latent image with 2 channels; each e_j has 2 dimensions. Each latent pixel is assigned its nearest dictionary element (sketch below).
     [Figure: the 2D scatter plot with each point snapped to its nearest e_j]
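A NumPy sketch of the nearest-neighbour lookup on slides 6-9, assuming the 4 x 4 x 2 running example and a dictionary of L = 3 vectors. All array names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 3                                     # dictionary size: e_1, ..., e_L
dictionary = rng.standard_normal((L, 2))  # each e_j has 2 dimensions

z = rng.standard_normal((4, 4, 2))        # continuous latent: 4x4, 2 channels
flat = z.reshape(-1, 2)                   # the 16 latent pixels as 2-D points

# Squared Euclidean distance from every latent pixel to every e_j;
# each pixel is assigned the id of its nearest dictionary element.
dists = ((flat[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=-1)  # (16, L)
ids = dists.argmin(axis=1).reshape(4, 4)  # discrete latent: one id per pixel
z_q = dictionary[ids]                     # quantized latent, shape (4, 4, 2)
```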

  10. Proposed Model
      [Figure: input → latent variable → output (reconstruction)]
      The latent is a 1-channel image containing, for each pixel, the id of its nearest dictionary element e (discrete).

  11. How to train?
      • No time to discuss here; see slides 18-19.
      • Let's talk about results.

  12. R1: Density Estimation & Reconstructions
      • Comparable with a VAE on CIFAR-10 in terms of density estimation.
      • Reconstructions on ImageNet are very good.
      • ImageNet original: 128 x 128 x 3 values x 8 bits = 393,216 bits = 48 KB. Latent: 32 x 32 codes x 9 bits = 9,216 bits ≈ 1 KB.

  13. R2: Sampling / Generation with a PixelCNN prior
      [Figure: PixelCNN-prior samples for the class 'pickup']
      • Samples lack global structure and are unsharp.
      • A single PixelCNN is not powerful enough; a hierarchical representation is necessary (sampling pipeline sketched below).
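A sketch of the generation pipeline this slide refers to: sample a grid of discrete latent ids from a prior, look up the dictionary vectors, and decode. Here prior_sample is a stand-in for a trained (class-conditional) PixelCNN over the ids; the uniform sampling, sizes, and names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 512, 64                            # codebook size and embedding dim (illustrative)
dictionary = rng.standard_normal((L, D))

def prior_sample(shape, num_codes):
    # Stand-in for a trained PixelCNN prior over the latent ids;
    # i.i.d. uniform sampling here, purely for illustration.
    return rng.integers(num_codes, size=shape)

ids = prior_sample((32, 32), L)           # sample a 32x32 grid of discrete latent ids
z_q = dictionary[ids]                     # look up dictionary vectors: (32, 32, D)
# x = decoder(z_q)                        # a trained decoder maps z_q back to pixels
```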

  14. R3: Stacking VQ-VAE
      • No time to discuss here; see slides 20-22.
      • Let's go to R4: Speech.

  15. R4: Speech
      • Decoder: WaveNet (state-of-the-art speech generation)
      • Excellent speech reconstruction
      • Sampling results
      • Unsupervised learning: voice style transfer; learns phonemes (mapping latents to phonemes gives 49.3% accuracy vs. 7.2% for a random baseline)
      • Samples: https://avdnoord.github.io/homepage/vqvae/

  16. Discussion and Conclusion
      • Impressive results & a good idea.
      • Paper: glosses over many details; supplement & implementation missing. Are the learned latents useful? This should be addressed quantitatively.
      • Image generation can be greatly improved: using a hierarchical model as in Lampert (previous coffee talk) should greatly improve speed and quality.

  17. Thanks!
      • Slides (author): https://drive.google.com/file/d/1t8W2L1H2RtUge-IQYqGXa9ihKNVQpqNI/view
      • Talk (author): https://www.youtube.com/watch?v=HqaIkq3qH40

  18. How to train? (1/2)
      • How to backpropagate through the discretization?
      • Say a gradient arrives at a dictionary vector e_j.
      • We do not update the dictionary vector here (it is fixed).
      • Instead we copy the gradient at e_j onto the non-discretized encoder output (sketch below).
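A minimal PyTorch sketch of this gradient copy, the straight-through estimator. Shapes follow the 4 x 4 x 2 running example; the random codebook and encoder output are stand-ins:

```python
import torch

codebook = torch.randn(3, 2)                        # e_1..e_3, 2-D each (illustrative)
z_e = torch.randn(8, 4, 4, 2, requires_grad=True)   # encoder output (stand-in)

# Nearest-neighbour lookup; no gradient flows through the argmin.
dists = torch.cdist(z_e.reshape(-1, 2), codebook)   # (8*16, 3)
z_q = codebook[dists.argmin(dim=1)].reshape(z_e.shape)

# Straight-through estimator: the forward pass uses the quantized z_q,
# while the backward pass copies the gradient at z_q unchanged onto z_e.
z_q_st = z_e + (z_q - z_e).detach()
```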

  19. How to train? (2/2)
      • Loss part 1: reconstruction error (dictionary fixed)
      • Loss part 2: a term to update the dictionary (see below)
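For reference, the full objective in the paper is L = log p(x | z_q(x)) + ||sg[z_e(x)] - e||² + β ||z_e(x) - sg[e]||², where sg[·] is the stop-gradient operator; the paper's third term is a commitment loss the slide does not mention, with β = 0.25 in the paper's experiments. A PyTorch sketch of the two dictionary-related terms, with stand-in tensors for z_e and z_q:

```python
import torch
import torch.nn.functional as F

z_e = torch.randn(8, 4, 4, 2, requires_grad=True)  # encoder output (stand-in)
z_q = torch.randn(8, 4, 4, 2, requires_grad=True)  # its quantization (stand-in)
beta = 0.25                                        # commitment weight from the paper

# Loss part 2: moves dictionary vectors toward the (frozen) encoder outputs.
codebook_loss = F.mse_loss(z_q, z_e.detach())
# Commitment term (paper): keeps encoder outputs close to the (frozen) dictionary.
commitment_loss = F.mse_loss(z_e, z_q.detach())
# total = reconstruction_loss + codebook_loss + beta * commitment_loss
```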

  20. R3: Stacking VQ-VAE (1/2)
      • VQ-VAEs are stacked to get higher-level latents.
      • Uses DeepMind Lab (artificial images).
      • Errors: sharpness and global mismatch.
      • The latents seem 'useful': the model can generate coherent video from the latent space (input: first 6 frames; output: video).
      • No quantitative experiment.
      [Figure: original frame (84 x 84 x 3 values x 8 bits = 169,344 bits ≈ 21 KB) vs. reconstruction from 27 bits]

  21. R3: Stacking VQ-VAE (2/2)
      [Figure: generated video frames]

  22. Multistage VQ-VAE
      [Figure: pipeline 84 x 84 x 3 image in [0,256) → VQ → 21 x 21 x 1 latents in [0,512) → VQ → 3 latents in [0,512)]
      • Before: 84 x 84 x 3 values x 8 bits = 169,344 bits ≈ 21 KB.
      • After: 3 latents x 9 bits = 27 bits.
      • The reconstruction is not very accurate, but it is a powerful representation.

  23. Comparison
                                                 GAN   Variational    PixelCNN   VQ-VAE
                                                       Autoencoder               (this talk)
      Compute exact likelihood p(x)              ✗     ✗              ✓          ✗ (bounds it)
      Has latent variable z                      ✓     ✓              ✗          ✓
      Compute latent variable z (inference)      ✗     ✓              ✗          ✓
      Discrete latent variable                   ✗     ✗              ✗          ✓
      Stable training?                           ✗     ✓              ✓          ✓
      Sharp images?                              ✓     ✗              ?          ✓
