Pixel Recurrent Neural Networks
Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu (Google DeepMind), ICML 2016, 188 citations
Outline
1. What is the task?
2. Other models: GAN, VAE, …
3. PixelCNN model
4. Results
5. Discussion & Conclusion
6. Extensions (preview of the next coffee talk)
Goal: learning the distribution of natural images
• Task:
  • Input: training set of images
  • Output: model that estimates p(x) for any image x
  • Evaluation: measure p(x) on the test set; higher p(x) is better
  • Note: p(x) should be normalized
• Why learn p(x)?
  • Image reconstruction / inpainting / denoising: input a corrupted image, output the fixed image
  • Image colorization: input a greyscale image, output a color image
  • Semi-supervised learning (low-density separation)
  • Representation learning (find the manifold of natural images)
  • Dimensionality reduction / finding variations in the data
  • Clustering
  • …
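A minimal sketch of the evaluation protocol, assuming a hypothetical `log_p` function produced by a trained density model (the model itself is not shown): models are compared by their average log-density on held-out images, which is only meaningful if p(x) is properly normalized.

```python
import numpy as np

def average_test_log_likelihood(test_images, log_p):
    """Evaluate a density model: mean log p(x) over a held-out test set.

    test_images : iterable of images (e.g. arrays of pixel values)
    log_p       : hypothetical function x -> log p(x) from the trained model;
                  the comparison is only fair if p integrates to 1.
    """
    return np.mean([log_p(x) for x in test_images])  # higher is better
```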
Other approaches
• GAN, Variational Autoencoder (VAE), PixelCNN (this talk), and invertible models (Real NVP), compared on:
  • Compute exact likelihood p(x)?
  • Has latent variable z?
  • Compute latent variable z (inference)?
  • Stable training (no mode collapse)?
  • Sharp images?
Pixel CNN (1/2)
• Why is computing q(y) so difficult?
  • This is the reason why GANs avoid it and VAEs only approximate it
  • Answer: the normalization of q(y): we would need to integrate the model output over all possible images, which is intractable
• PixelCNN computes q(y) using the chain rule of probability (shown here for a 4-pixel image):
  q(y) = q(y_4 | y_3, y_2, y_1) · q(y_3 | y_2, y_1) · q(y_2 | y_1) · q(y_1)
• Each factor q(y_j | y_{j-1}, …, y_1) is modeled using a CNN
• Each factor is a distribution over a single pixel value (a 1D function), so it is easy to keep normalized
• If every conditional density is normalized, q(y) is properly normalized as well!
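A minimal sketch of how this factorization is used, with a hypothetical `cond_prob` function standing in for the masked CNN (it returns a normalized histogram over the 256 possible values of the next pixel). The exact log-likelihood is just a sum of per-pixel log-probabilities; no intractable normalizer is ever needed.

```python
import numpy as np

def log_likelihood(pixels, cond_prob):
    """Chain rule: log q(y) = sum_j log q(y_j | y_{j-1}, ..., y_1).

    pixels    : 1D array of pixel values in {0, ..., 255}, in raster-scan order
    cond_prob : hypothetical stand-in for the CNN; given the pixels before position j,
                it returns a normalized 256-bin histogram for pixel j
    """
    total = 0.0
    for j in range(len(pixels)):
        hist = cond_prob(pixels[:j])       # q(y_j | y_{j-1}, ..., y_1), sums to 1
        total += np.log(hist[pixels[j]])   # log-probability of the observed value
    return total                           # exact log q(y)
```

Maximizing this quantity over the training set trains the CNN (next slide); sampling runs the same loop but draws each pixel from its histogram instead of looking it up.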
Pixel CNN (2/2)
1. Order the pixels (row by row, as in the figure below)
2. Imagine pixels 1–6 have already been generated and we want to predict pixel 7
3. Mask pixels 7–16 (set them to 0)
4. The CNN outputs a normalized histogram for pixel 7 given pixel values 1–6 (the masked input)
• Maximize the log-likelihood w.r.t. the CNN parameters
[Figure: a 4×4 image from the training set with pixels numbered 1–16; the masked image (pixels 7–16 zeroed) is the CNN's INPUT, and the OUTPUT is the distribution for pixel 7]
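In the paper the masking is implemented on the convolution kernels rather than by zeroing the image separately for every pixel; below is a sketch in PyTorch (the class name and the single-channel, first-layer setup are illustrative assumptions, and the distinction between first-layer and later-layer masks is glossed over). The kernel is zeroed at the current pixel and at every position that comes later in the raster-scan order, so the prediction for pixel 7 can only depend on pixels 1–6, for all pixels in parallel during training.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at the current pixel and at everything after it
    in raster-scan order (a 'type A' mask, as used for the first layer)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0   # centre pixel and the pixels to its right
        mask[kh // 2 + 1:, :] = 0     # every row below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask   # re-apply the mask so "future" pixels are never seen
        return super().forward(x)

# Hypothetical usage: first layer of a PixelCNN-style stack on a single-channel image.
layer = MaskedConv2d(in_channels=1, out_channels=64, kernel_size=7, padding=3)
```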
Results (1/2)
Results of generating ‘new’ images
Results & Discussion
• Sampled images
  • Good local coherence
  • Incoherent global structure
  • Sharp images!
• SOTA likelihood on CIFAR-10
  [Table: CIFAR-10 results; NLL = negative log-likelihood in bits per dimension (lower is better)]
• Discussion
  • Slow generation (sequential)
  • No latent representation
  • (Teacher forcing)
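The CIFAR-10 numbers are negative log-likelihoods in bits per dimension. Assuming the loss is computed in nats and summed over all pixels and colour channels of an image, the conversion is just a change of base and a division by the number of dimensions (32 × 32 × 3 = 3072 for CIFAR-10):

```python
import numpy as np

def bits_per_dim(nll_nats, num_dims=32 * 32 * 3):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * np.log(2.0))  # log(2) converts nats -> bits
```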
Preview of the next coffee talk
• PixelCNN++ (faster), Conditional PixelCNN, PixelVAE, …
• Use a pyramid of PixelCNN models
  • Go from low resolution to high resolution
  • Improves global coherence of generated images
  • Model becomes much faster
  • Decomposition of the likelihood (high-level details vs. low-level details); see the sketch below
• Next coffee talk: "PixelCNN with Auxiliary Variables for Natural Image Modeling", C. H. Lampert
• Want to know more? https://www.cs.toronto.edu/~duvenaud/courses/csc2541/index.html is a good course on deep generative models (GAN, VAE, PixelCNN, Real NVP, …)
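A rough, hypothetical sketch of what such a likelihood decomposition can look like (all names below are placeholders; this is the generic two-level idea, not necessarily the exact scheme of the paper covered next time): the image is first modeled at low resolution, then at full resolution conditioned on the low-resolution version, so the log-likelihood splits into a global-structure term and a local-detail term.

```python
def pyramid_log_likelihood(x, downsample, log_p_low, log_p_high_given_low):
    """Two-level pyramid factorization: log p(x) = log p(x_low) + log p(x | x_low).

    downsample           : e.g. average-pooling 32x32 -> 8x8 (placeholder)
    log_p_low            : density model for the low-resolution image (placeholder)
    log_p_high_given_low : conditional model for the full image given x_low (placeholder)
    """
    x_low = downsample(x)
    return log_p_low(x_low) + log_p_high_given_low(x, x_low)
```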