Improving PixelCNN
• There is a problem with this form of masked convolution: stacking layers of masked convolutions creates a blind spot.
• Solution: use two stacks of convolutions, a vertical stack and a horizontal stack.
(Figure: receptive field with the blind spot highlighted; vertical and horizontal stacks.)
Improving PixelCNN I
• There is a problem with this form of masked convolution.
• Stacking layers of masked convolutions creates a blind spot.
(Figure: 5x5 convolution mask, with 1s for the pixels above and to the left of the current pixel and 0s elsewhere; the blind-spot region is highlighted.)
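As a concrete illustration (my own sketch, not code from the slides), a PixelCNN-style masked convolution can be built by zeroing the kernel weights that would see the current pixel or any pixel after it in raster order; stacking such layers is exactly what produces the blind spot described above.

```python
# Hedged sketch of a PixelCNN-style masked convolution (illustrative only).
# Mask type 'A' hides the centre pixel (first layer); type 'B' allows it.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        # zero the centre pixel (type A only) and everything to its right
        mask[kH // 2, kW // 2 + (mask_type == 'B'):] = 0
        mask[kH // 2 + 1:, :] = 0          # zero all rows below the centre
        self.register_buffer('mask', mask[None, None])

    def forward(self, x):
        self.weight.data *= self.mask      # hide "future" pixels
        return super().forward(x)

# Example: first layer of a PixelCNN over single-channel images.
layer = MaskedConv2d('A', in_channels=1, out_channels=16,
                     kernel_size=5, padding=2)
out = layer(torch.zeros(1, 1, 28, 28))     # shape (1, 16, 28, 28)
```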
Improving PixelCNN II
• Use a more expressive, gated nonlinearity:
  h_{k+1} = tanh(W_{k,f} * h_k) ⊙ σ(W_{k,g} * h_k)
• This information flow (between the vertical and horizontal stacks) preserves the correct pixel dependencies.
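A minimal sketch of this gated activation unit (my own illustration, following the formula above: one set of feature maps goes through tanh, the other through a sigmoid gate, combined elementwise):

```python
# Gated activation: h_{k+1} = tanh(W_f * h_k) ⊙ sigmoid(W_g * h_k)
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_f = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.conv_g = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, h):
        return torch.tanh(self.conv_f(h)) * torch.sigmoid(self.conv_g(h))
```

In the actual gated PixelCNN the two convolutions are themselves masked and split between the vertical and horizontal stacks; this sketch shows only the nonlinearity.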
Samples from PixelCNN: CIFAR-10
• Samples from a class-conditioned PixelCNN (class: Coral Reef)

Samples from PixelCNN: CIFAR-10
• Samples from a class-conditioned PixelCNN

Samples from PixelCNN: CIFAR-10
• Samples from a class-conditioned PixelCNN (class: Sandbar)
Neural Image Model: PixelRNN
• Convolutional Long Short-Term Memory (Row LSTM)
(Figure: pixel grid x_1 ... x_{n^2}; P(x_i) is produced by an LSTM scanning the image.)
Stollenga et al., 2015; van den Oord, Kalchbrenner, Kavukcuoglu, 2016
Neural Image Model: PixelRNN
• PixelRNN: multiple layers of convolutional LSTM (see the sketch below)
(Figure: P(x_i) over the pixel grid x_1 ... x_{n^2}, computed by the LSTM.)
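To make the "convolutional LSTM" idea concrete, here is a rough sketch (my own, not code from the papers) of a Row-LSTM-style step: the gates for a whole row are computed with 1-D convolutions of the current input row and of the previous row's hidden state.

```python
# Hedged sketch of a 1-D convolutional LSTM step applied row by row.
import torch
import torch.nn as nn

class RowConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, k=3):
        super().__init__()
        pad = k // 2
        # input-to-state and state-to-state 1-D convolutions along the row,
        # producing all four gates (i, f, o, g) at once
        self.x2s = nn.Conv1d(in_ch, 4 * hidden_ch, k, padding=pad)
        self.h2s = nn.Conv1d(hidden_ch, 4 * hidden_ch, k, padding=pad)

    def forward(self, x_row, h_prev, c_prev):
        gates = self.x2s(x_row) + self.h2s(h_prev)
        i, f, o, g = gates.chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c   # hidden and cell state for this row

# Scanning an image top to bottom means calling the cell once per row with
# the previous row's (h, c). In the real PixelRNN the input convolution is
# additionally masked so each pixel only sees pixels above and to its left.
```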
Samples from PixelRNN

Samples from PixelRNN

Samples from PixelRNN
Architecture for 1D sequences (ByteNet / WaveNet)
- Stack of dilated, masked 1-D convolutions in the decoder (see the sketch below)
- The architecture is parallelizable along the time dimension (during training or scoring)
- Easy access to many states from the past
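A rough sketch of such a decoder stack (illustrative, not the exact ByteNet/WaveNet architecture): each layer is a 1-D convolution made causal by left-padding, and the dilation doubles per layer so the receptive field grows exponentially.

```python
# Hedged sketch of a stack of dilated, causal (masked) 1-D convolutions.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # left padding only
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                                 # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))           # no peeking ahead
        return self.conv(x)

# Dilations 1, 2, 4, 8, ... give an exponentially growing receptive field,
# and all time steps are computed in parallel during training/scoring.
stack = nn.Sequential(*[nn.Sequential(CausalConv1d(64, 2, 2 ** i), nn.ReLU())
                        for i in range(6)])
y = stack(torch.zeros(1, 64, 100))                        # shape (1, 64, 100)
```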
Video Pixel Network
• Masked convolution

Video Pixel Network
VPN Samples for Moving MNIST
(Comparison: a baseline with no frame dependencies vs. the VPN. Videos at nal.ai/vpn.)
VPN Samples for Robotic Pushing
(Comparison: a baseline with no frame dependencies vs. the VPN. Videos at nal.ai/vpn.)

VPN Samples for Robotic Pushing
Variational Autoencoders
Variational Auto-Encoders in General
• Variational auto-encoder (VAE): amortised variational inference for latent variable models.
  F(q) = E_{q_φ(z|x)}[log p_θ(x|z)] - KL[q_φ(z|x) ‖ p(z)]
• Design choices:
  - Prior on the latent variable p(z): continuous, discrete, Gaussian, Bernoulli, mixture
  - Likelihood function p(x|z): iid (static), sequential, temporal, spatial
  - Approximating posterior q(z|x): distribution, sequential, spatial
• For scalability and ease of implementation:
  - Stochastic gradient descent (and variants)
  - Stochastic gradient estimation
(Figure: inference network q(z|x) maps data x to z ~ q(z|x); model network p(x|z) maps z back to x ~ p(x|z).)
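As a reminder of where the bound F(q) comes from (a standard derivation, not spelled out on this slide), apply Jensen's inequality to the marginal likelihood:

\[
\log p_\theta(x) \;=\; \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}\right]
\;\ge\; \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \;-\; \mathrm{KL}\big[q_\phi(z|x)\,\|\,p(z)\big] \;=\; F(q).
\]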
Variational Autoencoders (VAEs)
• The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers:
  h^3 → h^2 → h^1 → v (input data), with weights W_3, W_2, W_1.
• Each term may denote a complicated nonlinear relationship.
• θ denotes the parameters of the VAE.
• L is the number of stochastic layers.
• Sampling and probability evaluation is tractable for each conditional.
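The joint distribution implied by this cascade (reconstructed here from the diagram; the slide's own equation did not survive extraction) factorizes as

\[
p_\theta(v, h^1, h^2, h^3) \;=\; p(h^3)\, p_\theta(h^2 \mid h^3)\, p_\theta(h^1 \mid h^2)\, p_\theta(v \mid h^1),
\qquad
p_\theta(v) \;=\; \sum_{h^1, h^2, h^3} p_\theta(v, h^1, h^2, h^3),
\]

with the sum replaced by an integral when the latent layers are continuous.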
Variational Autoencoders (VAEs)
• The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers.
• Each conditional term denotes a one-layer neural net: a stochastic layer feeding a deterministic layer feeding the next stochastic layer.
• θ denotes the parameters of the VAE.
• L is the number of stochastic layers.
• Sampling and probability evaluation is tractable for each conditional.
Variational Bound
• The VAE is trained to maximize the variational lower bound.
• The bound trades off the data log-likelihood against the KL divergence from the true posterior.
• It is hard to optimize the variational bound with respect to the recognition network: naive gradient estimators have high variance.
• The key idea of Kingma and Welling is to use the reparameterization trick.
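For completeness (this identity is implicit rather than written out on the slide), the exact relationship behind the trade-off is

\[
\log p_\theta(x) \;=\; F(q) \;+\; \mathrm{KL}\big[q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big],
\]

so maximizing the bound F(q) both raises the log-likelihood and pulls the recognition distribution toward the true posterior.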
Reparameterization Trick
• Assume that the recognition distribution is Gaussian, with mean and covariance computed from the state of the hidden units at the previous layer.
• Alternatively, we can express this in terms of an auxiliary variable ε ~ N(0, I).
• The recognition distribution can then be expressed in terms of a deterministic mapping (a deterministic encoder); the distribution of ε does not depend on φ.
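A minimal sketch of this mapping for a diagonal Gaussian (my own illustration, not code from the lecture): the sample is a deterministic function of the encoder outputs and of noise drawn from a fixed distribution.

```python
# Reparameterization trick for a diagonal Gaussian q(z|x): a hedged sketch.
import torch

def reparameterize(mu, log_var):
    """Return z ~ N(mu, diag(exp(log_var))) as a deterministic function of
    (mu, log_var) and auxiliary noise eps ~ N(0, I)."""
    eps = torch.randn_like(mu)                 # eps does not depend on phi
    return mu + torch.exp(0.5 * log_var) * eps
```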
Reparameterization Trick
(Figure: two computation graphs, without and with the reparameterization trick. Without it, z is sampled directly from q(z|x) inside the network, so gradients cannot flow through the sampling node; with it, z = μ(x) + σ(x) ⊙ ε with ε sampled externally, so gradients flow through the encoder. Image: Carl Doersch.)
Computing the Gradients
• The gradient w.r.t. the parameters, both recognition and generative, can be computed by backprop: for a fixed sample of ε, the mapping to h is a deterministic neural net.
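A hedged sketch of how this looks in practice: a single-sample Monte Carlo estimate of the negative bound, with autograd backpropagating through the reparameterized sample to both the recognition and generative parameters (the `encoder` and `decoder` modules here are placeholders, not the lecture's model).

```python
# Hedged sketch: single-sample Monte Carlo estimate of the negative
# variational bound for a Bernoulli decoder; autograd differentiates
# through the reparameterized sample.
import torch
import torch.nn.functional as F

def neg_elbo(x, encoder, decoder):
    mu, log_var = encoder(x)                      # recognition network q(z|x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
    logits = decoder(z)                           # generative network p(x|z)
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl                             # minimizing maximizes the bound

# neg_elbo(x, encoder, decoder).backward() yields gradients w.r.t. both the
# recognition (phi) and generative (theta) parameters.
```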
Implementing a Variational Algorithm
• Variational inference turns integration into optimization.
• Automated tools:
  - Differentiation: Theano, Torch7, TensorFlow, Stan
  - Message passing: Infer.NET
  - Stochastic gradient descent and other preconditioned optimization
• The same code can run on GPUs or on distributed clusters.
• Probabilistic models are modular and can easily be combined.
• Ideally we want probabilistic programming using variational inference.
(Figure: forward pass through the prior p(z), inference network q(z|x), model p(x|z) and data x; the backward pass accumulates log p(x|z), log p(z) and the entropy H[q(z)] into the gradients ∇_θ and ∇_φ.)
Latent Gaussian VAE (Deep Latent Gaussian Model)
• Prior: p(z) = N(0, I)
• Likelihood: p_θ(x|z) = N(μ_θ(z), Σ_θ(z)), i.e. p(x | f_θ(z))
• Approximate posterior: q_φ(z|x) = N(μ_φ(x), Σ_φ(x))
• Bound: F(x, q) = E_{q(z)}[log p(x|z)] - KL[q(z) ‖ p(z)]
• All functions are deep networks.
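One detail left implicit here: for this choice of Gaussians the KL term has a closed form. With q_φ(z|x) = N(μ, diag(σ²)) and p(z) = N(0, I),

\[
\mathrm{KL}\big[q_\phi(z|x)\,\|\,p(z)\big] \;=\; \tfrac{1}{2}\sum_{d}\big(\mu_d^2 + \sigma_d^2 - \log \sigma_d^2 - 1\big).
\]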
Latent Gaussian VAE
• The latent space disentangles the input data.
• The latent space and the likelihood bound give a visualisation of importance.
(Figures: a 3-dimensional latent variable fitted to MNIST; a latent factor embedding plotted against Factor 1 and Factor 2, with points coloured by their contribution to p(Y|θ); annotations include "Oxygen/Swimmers moving left".)
VAE Representations
• Representations are useful for strategies such as episodic control.
(Figure: a learned representation used to select actions a_1, a_2, a_3.)
Blundell, Charles, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. "Model-Free Episodic Control." 2016.
Latent Gaussian VAE
• We require flexible approximations for the types of posteriors we are likely to see.
Latent Binary VAE: Deep Auto-Regressive Networks (DARN)
• Prior: p(z) = ∏_i p(z_i | z_<i), with p(z_i | z_<i) = Bern(z_i | f(z_<i))
• Likelihood: p(x|z) = ∏_i p(x_i | x_<i, z) = ∏_i Bern(x_i | f_θ(x_<i, z))
• Approximate posterior: q_φ(z) = ∏_i q_φ(z_i | z_<i) = ∏_i Bern(z_i | f_φ(z_<i))
Gregor, Karol, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. "Deep AutoRegressive Networks." 2013.
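As a small illustration of what such an autoregressive Bernoulli prior looks like operationally (my own sketch, not the DARN implementation: each conditional here is just a masked linear function of the prefix, whereas DARN uses a deeper network), sampling proceeds one latent bit at a time, each conditioned on the bits already drawn.

```python
# Hedged sketch of p(z) = prod_i Bern(z_i | f(z_<i)) with a masked linear f.
import torch
import torch.nn as nn

class AutoregressiveBernoulliPrior(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.W = nn.Parameter(torch.zeros(dim, dim))
        self.b = nn.Parameter(torch.zeros(dim))

    def logits(self, z):
        # strictly lower-triangular mask: z_i depends only on z_<i
        W = torch.tril(self.W, diagonal=-1)
        return z @ W.t() + self.b

    def sample(self, batch=1):
        z = torch.zeros(batch, self.dim)
        for i in range(self.dim):              # ancestral sampling, bit by bit
            p_i = torch.sigmoid(self.logits(z)[:, i])
            z[:, i] = torch.bernoulli(p_i)
        return z

    def log_prob(self, z):
        logp = -nn.functional.binary_cross_entropy_with_logits(
            self.logits(z), z, reduction='none')
        return logp.sum(dim=-1)                # sum of log Bernoulli terms
```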