Improving PixelCNN

  1. Improving PixelCNN: stacking layers of masked convolution creates a blind spot in the receptive field. Solution: use two stacks of convolution, a vertical stack and a horizontal stack.

  2. Improving PixelCNN I: there is a problem with this form of masked convolution. [Figure: binary convolution mask illustrating the blind spot.] Stacking layers of masked convolution creates a blind spot.
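
A minimal sketch of the masked convolution discussed here, assuming a PyTorch-style layer (the channel counts and kernel size below are illustrative): a type-A mask hides the centre pixel and everything after it in raster order, a type-B mask allows the centre pixel, and stacking layers of this kind is what leaves part of the context unseen, i.e. the blind spot.

    import torch
    import torch.nn as nn

    class MaskedConv2d(nn.Conv2d):
        """2-D convolution with a raster-scan mask (PixelCNN-style sketch)."""
        def __init__(self, mask_type, *args, **kwargs):
            super().__init__(*args, **kwargs)
            assert mask_type in ("A", "B")
            kH, kW = self.kernel_size
            mask = torch.ones(kH, kW)
            # Hide everything right of the centre (and the centre itself for type A) ...
            mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0
            # ... and every row below the centre.
            mask[kH // 2 + 1:, :] = 0
            self.register_buffer("mask", mask[None, None])

        def forward(self, x):
            # Zero the masked weights before every convolution.
            return nn.functional.conv2d(
                x, self.weight * self.mask, self.bias,
                self.stride, self.padding, self.dilation, self.groups)

    # Stacking such layers leaves a growing region above and to the right of each
    # pixel outside its receptive field: the blind spot described on this slide.
    layer = MaskedConv2d("A", in_channels=1, out_channels=8, kernel_size=5, padding=2)
    y = layer(torch.zeros(1, 1, 28, 28))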

  3. Improving PixelCNN II: use a more expressive nonlinearity, $h_{k+1} = \tanh(W_{k,f} \ast h_k) \odot \sigma(W_{k,g} \ast h_k)$. This information flow (between the vertical and horizontal stacks) preserves the correct pixel dependencies.
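
A sketch of this gated nonlinearity in isolation, assuming a PyTorch implementation with illustrative channel counts; the convolutions here are unmasked and the exchange between the vertical and horizontal stacks is omitted for brevity.

    import torch
    import torch.nn as nn

    class GatedActivation(nn.Module):
        """h_{k+1} = tanh(conv_f(h_k)) * sigmoid(conv_g(h_k)), elementwise (sketch)."""
        def __init__(self, channels=64, kernel_size=3):
            super().__init__()
            pad = kernel_size // 2
            self.conv_f = nn.Conv2d(channels, channels, kernel_size, padding=pad)  # "content"
            self.conv_g = nn.Conv2d(channels, channels, kernel_size, padding=pad)  # "gate"

        def forward(self, h):
            return torch.tanh(self.conv_f(h)) * torch.sigmoid(self.conv_g(h))

    h = torch.randn(1, 64, 32, 32)
    out = GatedActivation()(h)   # same spatial shape and channel count as h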

  4. Samples from PixelCNN (CIFAR-10): samples from a class-conditioned PixelCNN, class "Coral Reef".

  5. Samples from PixelCNN (CIFAR-10): samples from a class-conditioned PixelCNN.

  6. Samples from PixelCNN (CIFAR-10): samples from a class-conditioned PixelCNN, class "Sandbar".

  7. Neural Image Model: PixelRNN. A convolutional long short-term memory (Row LSTM) models $p(x_i \mid x_1, \dots, x_{i-1})$ over the image. Stollenga et al., 2015; van den Oord, Kalchbrenner, Kavukcuoglu, 2016.

  8. Neural Image Model: PixelRNN. Multiple layers of convolutional LSTM model $p(x_i \mid x_1, \dots, x_{i-1})$.
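
A rough sketch of the Row LSTM idea, assuming a PyTorch implementation (the hidden size and kernel width are illustrative, and the causal masking of the input-to-state convolution is omitted): each image row is processed in one step, with 1-D convolutions supplying the input-to-state and state-to-state transforms.

    import torch
    import torch.nn as nn

    class RowLSTM(nn.Module):
        """Row-by-row convolutional LSTM sketch (masking omitted for brevity)."""
        def __init__(self, in_channels=1, hidden=32, k=3):
            super().__init__()
            self.hidden = hidden
            # 4 * hidden channels: input, forget, output gates and the cell candidate.
            self.input_to_state = nn.Conv1d(in_channels, 4 * hidden, k, padding=k // 2)
            self.state_to_state = nn.Conv1d(hidden, 4 * hidden, k, padding=k // 2)

        def forward(self, x):                       # x: (B, C, H, W)
            B, _, H, W = x.shape
            h = x.new_zeros(B, self.hidden, W)
            c = x.new_zeros(B, self.hidden, W)
            rows = []
            for r in range(H):
                gates = self.input_to_state(x[:, :, r]) + self.state_to_state(h)
                i, f, o, g = gates.chunk(4, dim=1)
                c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
                h = torch.sigmoid(o) * torch.tanh(c)
                rows.append(h)
            return torch.stack(rows, dim=2)         # (B, hidden, H, W)

    features = RowLSTM()(torch.randn(1, 1, 28, 28))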

  9. Samples from PixelRNN

  10. Samples from PixelRNN

  11. Samples from PixelRNN

  12. Architecture for 1D sequences (ByteNet / WaveNet): a stack of dilated, masked 1-D convolutions in the decoder; the architecture is parallelizable along the time dimension (during training or scoring); easy access to many states from the past.
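
A sketch of such a decoder stack, assuming a PyTorch implementation (the channel count, kernel size and number of layers are illustrative): each layer is a causal (masked) 1-D convolution, and doubling the dilation at every layer gives easy access to states far in the past while all positions can be computed in parallel.

    import torch
    import torch.nn as nn

    class CausalConv1d(nn.Module):
        """1-D convolution that only sees the current and past time steps."""
        def __init__(self, channels, kernel_size, dilation):
            super().__init__()
            self.left_pad = (kernel_size - 1) * dilation   # pad only on the left
            self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

        def forward(self, x):                              # x: (B, C, T)
            return self.conv(nn.functional.pad(x, (self.left_pad, 0)))

    # Dilations 1, 2, 4, 8, ...: the receptive field grows exponentially with depth.
    channels, kernel_size = 32, 2
    decoder = nn.Sequential(*[
        nn.Sequential(CausalConv1d(channels, kernel_size, 2 ** i), nn.ReLU())
        for i in range(6)
    ])
    y = decoder(torch.randn(1, channels, 100))             # (1, 32, 100)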

  13. Video Pixel Net: masked convolution.

  14. Video Pixel Net

  15. VPN Samples for Moving MNIST: no frame dependencies vs. VPN. Videos on nal.ai/vpn

  16. VPN Samples for Robotic Pushing: no frame dependencies vs. VPN. Videos on nal.ai/vpn

  17. VPN Samples for Robotic Pushing

  18. Variational Autoencoders

  19. Variational Auto-Encoders in General: $z \sim q(z \mid x)$, $x \sim p(x \mid z)$. The variational auto-encoder (VAE) performs amortised variational inference for latent variable models, maximising $\mathcal{F}(q) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}[q_\phi(z \mid x) \,\|\, p(z)]$. Design choices: • prior on the latent variable (continuous, discrete, Gaussian, Bernoulli, mixture) • likelihood function $p(x \mid z)$ (iid/static, sequential, temporal, spatial) • approximating posterior $q(z \mid x)$ (distribution family, sequential, spatial). For scalability and ease of implementation: stochastic gradient descent (and variants) and stochastic gradient estimation.
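
For the common special case of a Gaussian approximate posterior with a standard normal prior, the bound $\mathcal{F}(q)$ splits into a reconstruction term and an analytic KL term. A sketch, assuming PyTorch and a Bernoulli likelihood (so $x$ is expected to lie in $[0, 1]$); the function name and arguments are illustrative:

    import torch
    import torch.nn.functional as F

    def elbo(x, x_logits, mu, log_var):
        """F(q) = E_q[log p(x|z)] - KL[q(z|x) || p(z)]  (single-sample sketch).

        x_logits: decoder output for an assumed Bernoulli likelihood.
        mu, log_var: mean and log-variance of a diagonal Gaussian q(z|x).
        """
        # Reconstruction term, estimated with the sample that produced x_logits.
        log_px_given_z = -F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        # Analytic KL between N(mu, diag(exp(log_var))) and the prior N(0, I).
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return log_px_given_z - kl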

  20. Variational Autoencoders (VAEs): the VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers. [Figure: generative process $h_3 \to h_2 \to h_1 \to v$ with weights $W_3, W_2, W_1$; $v$ is the input data.]

  21. Variational Autoencoders (VAEs): the VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers. [Figure: generative process $h_3 \to h_2 \to h_1 \to v$.]

  22. Variational Autoencoders (VAEs): the VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers. • $\theta$ denotes the parameters of the VAE. • $L$ is the number of stochastic layers. • Sampling and probability evaluation are tractable for each conditional.

  23. Variational Autoencoders (VAEs): the VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers; each term may denote a complicated nonlinear relationship. • $\theta$ denotes the parameters of the VAE. • $L$ is the number of stochastic layers. • Sampling and probability evaluation are tractable for each conditional.

  24. Variational Autoencoders (VAEs): the VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers; each conditional term denotes a one-layer neural net. [Figure: alternating stochastic and deterministic layers.] • $\theta$ denotes the parameters of the VAE. • $L$ is the number of stochastic layers. • Sampling and probability evaluation are tractable for each conditional.
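
A toy sketch of ancestral sampling through such a cascade, using NumPy; the layer widths, the Gaussian form of each conditional and the tanh nonlinearities are illustrative assumptions, with random weights standing in for the learned parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [8, 16, 32, 784]                              # h3, h2, h1, v (illustrative)
    weights = [rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i]))
               for i in range(len(sizes) - 1)]            # stand-ins for W3, W2, W1

    # Ancestral sampling: h3 ~ p(h3), then h2 ~ p(h2|h3), h1 ~ p(h1|h2), v ~ p(v|h1).
    h = rng.normal(size=sizes[0])                         # top layer from its prior
    for W in weights:
        mean = np.tanh(W @ h)                             # nonlinear conditional mean
        h = mean + 0.1 * rng.normal(size=mean.shape)      # assumed Gaussian conditional
    v = h                                                 # one sample of the visible layer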

  25. Variational Bound: the VAE is trained to maximize the variational lower bound. [Figure: the cascade $h_3 \to h_2 \to h_1 \to v$ with weights $W_3, W_2, W_1$; $v$ is the input data.]

  26. Variational Bound: the VAE is trained to maximize the variational lower bound, trading off the data log-likelihood and the KL divergence from the true posterior.

  27. Variational Bound: the VAE is trained to maximize the variational lower bound, trading off the data log-likelihood and the KL divergence from the true posterior. • It is hard to optimize the variational bound with respect to the recognition network (high-variance gradients). • The key idea of Kingma and Welling is to use the reparameterization trick.

  28. Reparameterization Trick: assume that the recognition distribution is Gaussian, with mean and covariance computed from the state of the hidden units at the previous layer.

  29. Reparameterization Trick: assume that the recognition distribution is Gaussian, with mean and covariance computed from the state of the hidden units at the previous layer. Alternatively, we can express this in terms of auxiliary variables.

  30. Reparameterization Trick: assume that the recognition distribution is Gaussian. The recognition distribution can then be expressed in terms of a deterministic mapping (a deterministic encoder), where the distribution of the auxiliary noise variable does not depend on the variational parameters.
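
The deterministic mapping on this slide is the standard reparameterization $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; a minimal sketch, assuming PyTorch and a diagonal covariance parameterized by its log-variance:

    import torch

    def reparameterize(mu, log_var):
        """Draw z ~ N(mu, diag(exp(log_var))) as a deterministic map of noise.

        The auxiliary noise eps is drawn from N(0, I), which does not depend on
        the variational parameters, so gradients can flow through mu and log_var.
        """
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std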

  31. Reparameterization Trick. [Figure: encoder/decoder computation graph without and with the reparameterization trick; with the trick, the sampling node is moved outside the path from the encoder parameters to the loss. Image: Carl Doersch.]

  32. Computing the Gradients: the gradient with respect to the parameters, both recognition and generative, can be computed by backpropagation, because the mapping $h$ is a deterministic neural net for a fixed draw of the auxiliary noise.
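
Because the sample is a deterministic function of the recognition parameters once the noise is fixed, ordinary backpropagation yields the gradient of a Monte Carlo estimate of the bound; a small illustrative sketch using PyTorch autograd, where the quadratic objective is only a stand-in for the real bound:

    import torch

    mu = torch.zeros(4, requires_grad=True)
    log_var = torch.zeros(4, requires_grad=True)

    eps = torch.randn(4)                        # fixed draw of the auxiliary noise
    z = mu + torch.exp(0.5 * log_var) * eps     # deterministic in (mu, log_var)

    objective = -(z ** 2).sum()                 # stand-in for the sampled bound
    objective.backward()                        # backprop through the mapping
    print(mu.grad, log_var.grad)                # gradients w.r.t. recognition params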

  33. Implementing a Variational Algorithm: variational inference turns integration into optimization. [Figure: forward and backward passes through the prior $p(z)$, inference network $q(z \mid x)$, model $p(x \mid z)$, and data $x$, with gradients $\nabla_\phi$ and $\nabla_\theta$.] Automated tools: • differentiation: Theano, Torch7, TensorFlow, Stan • message passing: Infer.NET. • Stochastic gradient descent and other preconditioned optimization. • The same code can run on GPUs or on distributed clusters. • Probabilistic models are modular and can easily be combined. Ideally we want probabilistic programming using variational inference.

  34. Latent Gaussian VAE (deep latent Gaussian model): prior $p(z) = \mathcal{N}(0, I)$; likelihood $p_\theta(x \mid z) = \mathcal{N}(\mu_\theta(z), \Sigma_\theta(z))$; approximate posterior $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$; objective $\mathcal{F}(x, q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - \mathrm{KL}[q(z) \,\|\, p(z)]$. All functions are deep networks.
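
Putting the pieces together, a compact sketch of a latent Gaussian VAE matching this factorisation, assuming PyTorch; the MNIST-sized dimensions, MLP encoder/decoder and Bernoulli output likelihood (in place of the Gaussian likelihood above) are illustrative choices:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GaussianVAE(nn.Module):
        def __init__(self, x_dim=784, h_dim=400, z_dim=20):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.enc_mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
            self.enc_log_var = nn.Linear(h_dim, z_dim)   # log of diagonal Sigma_phi(x)
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, log_var = self.enc_mu(h), self.enc_log_var(h)
            z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterized
            return self.dec(z), mu, log_var

    def negative_elbo(x, model):
        """-F(x, q) for a minibatch x with entries in [0, 1]."""
        x_logits, mu, log_var = model(x)
        recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl        # minimising this maximises the bound

Training then amounts to minimising this loss with stochastic gradient descent over minibatches, as noted on the earlier slide about scalability.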

  35. Latent Gaussian VAE: a 3-dimensional latent variable model of MNIST. [Figure: latent factor embedding, factor 1 vs. factor 2; example direction "oxygen/swimmers moving left".] The latent space disentangles the input data; the latent space and the likelihood bound give a visualisation of importance.

  36. VAE Representations: representations are useful for strategies such as episodic control. (Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. "Model-Free Episodic Control." 2016.)

  37. Latent Gaussian VAE: we require flexible approximations for the types of posteriors we are likely to see.

  38. Latent Binary VAE: deep auto-regressive networks (Gregor, Karol, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep Autoregressive Networks." 2013). Prior: $p(z_i \mid z_{<i}) = \mathrm{Bern}(z_i \mid f(z_{<i}))$, $p(z) = \prod_i p(z_i \mid z_{<i})$. Model: $p(x \mid z) = \prod_i p(x_i \mid x_{<i}, z) = \prod_i \mathrm{Bern}(x_i \mid f^{p}_{\theta}(x_{<i}, z))$. Inference: $q_\phi(z) = \prod_i q_\phi(z_i \mid z_{<i}) = \prod_i \mathrm{Bern}(z_i \mid f^{q}_{\phi}(z_{<i}))$.
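
A toy sketch of sampling from the autoregressive Bernoulli prior $p(z) = \prod_i \mathrm{Bern}(z_i \mid f(z_{<i}))$ above, assuming PyTorch; the linear conditioning function and the random weights are illustrative stand-ins for DARN's actual network:

    import torch

    def sample_autoregressive_bernoulli(weights, bias):
        """Sample z_i ~ Bern(sigmoid(bias_i + weights_i . z_{<i})), one bit at a time."""
        n = bias.shape[0]
        z = torch.zeros(n)
        for i in range(n):
            # Only the previously sampled bits z_{<i} influence the i-th probability.
            context = weights[i, :i] @ z[:i] if i > 0 else torch.tensor(0.0)
            z[i] = torch.bernoulli(torch.sigmoid(bias[i] + context))
        return z

    z = sample_autoregressive_bernoulli(0.5 * torch.randn(10, 10), torch.zeros(10))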
