Discriminative vs Generative Models

Recall Bayes' Rule: p(x|y) = p(y|x) p(x) / p(y)

Discriminative Model: learn a probability distribution p(y|x)
Generative Model: learn a probability distribution p(x)
Conditional Generative Model: learn p(x|y)

Bayes' Rule shows we can build a conditional generative model from other components: a discriminative model p(y|x), a prior over labels p(y), and an (unconditional) generative model p(x).
What can we do with each kind of model?

Discriminative Model p(y|x): assign labels to data; feature learning (supervised).

Generative Model p(x): detect outliers; feature learning (unsupervised); sample to generate new data.

Conditional Generative Model p(x|y): assign labels while rejecting outliers; generate new data conditioned on input labels.
Taxonomy of Generative Models

Explicit density (model can compute p(x)):
- Tractable density (can compute p(x) exactly): Autoregressive, NADE / MADE, NICE / RealNVP, Glow, Ffjord
- Approximate density (can compute an approximation to p(x)):
  - Variational: Variational Autoencoder
  - Markov Chain: Boltzmann Machine

Implicit density (model does not explicitly compute p(x), but can sample from p(x)):
- Direct: Generative Adversarial Networks (GANs)
- Markov Chain: GSN

We will talk about autoregressive models, variational autoencoders, and GANs.

Figure adapted from Ian Goodfellow, Tutorial on Generative Adversarial Networks, 2017.
Autoregressive models
Explicit Density Estimation

Goal: write down an explicit function for p(x) = f(x, W).

Given a dataset x^(1), x^(2), ..., x^(N), train the model by solving

  W* = arg max_W ∏_i p(x^(i))         Maximize probability of training data (maximum likelihood estimation)
     = arg max_W Σ_i log p(x^(i))      Log trick to exchange product for sum
     = arg max_W Σ_i log f(x^(i), W)   This will be our loss function! Train with gradient descent.
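To make maximum likelihood training concrete, here is a minimal, hypothetical sketch in PyTorch (not from the lecture): the "explicit density" f(x, W) is just a softmax over 256 discrete values parameterized by a logit vector W, and we minimize the negative log-likelihood of a fake dataset with gradient descent.

```python
import torch

# Toy maximum likelihood estimation: model p(x) over x in {0, ..., 255}
# with a single vector of logits W, so p(x) = softmax(W)[x].
torch.manual_seed(0)
data = torch.randint(0, 256, (1000,))        # fake "training set" of discrete values

W = torch.zeros(256, requires_grad=True)     # parameters of the explicit density f(x, W)
optimizer = torch.optim.SGD([W], lr=1.0)

for step in range(100):
    log_p = torch.log_softmax(W, dim=0)      # log p(x) for every possible value of x
    nll = -log_p[data].mean()                # -(1/N) * sum_i log f(x^(i), W), our loss
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()

print("final NLL:", nll.item())
```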
Explicit Density: Autoregressive Models

Goal: write down an explicit function for p(x) = f(x, W).

Assume x consists of multiple subparts: x = (x_1, x_2, x_3, ..., x_T).

Break down the probability using the chain rule:

  p(x) = p(x_1, x_2, x_3, ..., x_T)
       = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ...
       = ∏_{t=1}^T p(x_t | x_1, ..., x_{t-1})

i.e. the probability of the next subpart given all the previous subparts.

We've already seen this: language modeling with an RNN, where the network emits p(x_1), p(x_2 | x_1), p(x_3 | x_1, x_2), ... one step at a time.
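Below is a minimal sketch of this factorization with an RNN, in the spirit of the RNN language models seen earlier. All sizes and the random "sequence" are made-up illustrations; the point is that log p(x) is the sum of per-step log-probabilities p(x_t | x_1, ..., x_{t-1}).

```python
import torch
import torch.nn as nn

V, H = 256, 64                                   # values each subpart can take, hidden size (illustrative)
embed = nn.Embedding(V, H)
rnn = nn.RNN(H, H, batch_first=True)
head = nn.Linear(H, V)                           # logits for p(x_t | x_1, ..., x_{t-1})

x = torch.randint(0, V, (1, 10))                 # one fake sequence x_1 ... x_10
start = torch.zeros(1, 1, H)                     # "start" input so step t only sees x_1 ... x_{t-1}
inputs = torch.cat([start, embed(x[:, :-1])], dim=1)

h, _ = rnn(inputs)                               # hidden state at step t depends only on earlier subparts
log_probs = torch.log_softmax(head(h), dim=-1)   # log p(x_t | x_1, ..., x_{t-1}) for every t
log_px = log_probs.gather(-1, x.unsqueeze(-1)).sum()   # log p(x) = sum_t log p(x_t | x_<t)
print("log p(x) =", log_px.item())
```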
PixelRNN

Generate image pixels one at a time, starting at the upper-left corner.

Compute a hidden state for each pixel that depends on the hidden states and RGB values from the left and from above (LSTM recurrence):

  h_{x,y} = f(h_{x-1,y}, h_{x,y-1}, W)

At each pixel, predict red, then green, then blue: a softmax over [0, 1, ..., 255].

Each pixel depends implicitly on all pixels above and to the left.

Problem: very slow during both training and testing; an N x N image requires 2N - 1 sequential steps.

Van den Oord et al, "Pixel Recurrent Neural Networks", ICML 2016
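To see why generation is slow, here is a minimal, hypothetical sampling loop for an N x N image, grayscale for brevity (the real model predicts R, G, B per pixel). The `model` below is a stand-in linear layer, not the Row/Diagonal LSTM from the paper; what matters is that pixels must be produced one after another in raster order.

```python
import torch
import torch.nn as nn

N = 8                                            # small image for illustration
model = nn.Linear(N * N, 256)                    # stand-in for the real recurrent model

img = torch.zeros(N, N, dtype=torch.long)        # pixels generated so far (zeros elsewhere)
with torch.no_grad():
    for y in range(N):                           # pixels must be produced one at a time:
        for x in range(N):                       # each depends on everything above / to the left
            context = img.flatten().float() / 255.0
            logits = model(context)              # p(pixel | context), softmax over [0, ..., 255]
            probs = torch.softmax(logits, dim=0)
            img[y, x] = torch.multinomial(probs, 1).item()
print(img)
```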
PixelCNN

Still generate image pixels starting from the corner, but the dependency on previous pixels is now modeled with a CNN over the context region instead of an RNN.

Training: maximize the likelihood of the training images, with a softmax loss at each pixel.

Training is faster than PixelRNN: the convolutions can be parallelized, since the context-region values are known from the training images. Generation must still proceed sequentially, so it is still slow.

Van den Oord et al, "Conditional Image Generation with PixelCNN Decoders", NeurIPS 2016
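The core PixelCNN building block is a masked convolution, so each output position only sees pixels above it and to its left. Below is a minimal sketch of a type-'A' mask; the per-color-channel masking from the paper is omitted, and a real PixelCNN would use a type-'B' mask (which keeps the center pixel) after the first layer. This is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at the center pixel and everything after it
    (in raster-scan order), so outputs depend only on already-generated pixels."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2:] = 0              # center pixel and pixels to its right
        mask[kH // 2 + 1:, :] = 0                # all rows below the center
        self.register_buffer("mask", mask[None, None])   # shape (1, 1, kH, kW)

    def forward(self, x):
        return nn.functional.conv2d(x, self.weight * self.mask, self.bias,
                                    self.stride, self.padding, self.dilation, self.groups)

# A tiny PixelCNN-ish stack: logits over 256 values at every pixel of a 1-channel image.
net = nn.Sequential(
    MaskedConv2d(1, 32, kernel_size=7, padding=3),
    nn.ReLU(),
    MaskedConv2d(32, 256, kernel_size=7, padding=3),
)
x = torch.rand(1, 1, 8, 8)
print(net(x).shape)   # (1, 256, 8, 8): the softmax loss can be applied at every pixel in parallel
```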
PixelRNN: Generated Samples

[Figure: generated samples on 32x32 CIFAR-10 and 32x32 ImageNet]

Van den Oord et al, "Pixel Recurrent Neural Networks", ICML 2016
Autoregressive Models: PixelRNN and PixelCNN

Pros:
- Can explicitly compute the likelihood p(x)
- Explicit likelihood of training data gives a good evaluation metric
- Good samples

Con:
- Sequential generation => slow

Improving PixelCNN performance:
- Gated convolutional layers
- Short-cut connections
- Discretized logistic loss
- Multi-scale
- Training tricks
- Etc.

See Van den Oord et al., NeurIPS 2016 and Salimans et al. 2017 (PixelCNN++).
Variational Autoencoders
Variational Autoencoders

PixelRNN / PixelCNN explicitly parameterize the density function with a neural network, so we can train to maximize the likelihood of the training data.

Variational Autoencoders (VAEs) define an intractable density that we cannot explicitly compute or optimize. But we will be able to directly optimize a lower bound on the density.
(Regular, non-variational) Autoencoders

Unsupervised method for learning feature vectors from raw data x, without any labels. The features should extract useful information (object identities, properties, scene type, etc.) that we can use for downstream tasks.

Architectures: originally linear + nonlinearity (sigmoid); later deep, fully-connected; later ReLU CNNs.

Problem: how can we learn this feature transform from raw data? We can't observe the features!

Idea: use the features to reconstruct the input data with a decoder. "Autoencoding" = encoding itself.

  Input data x -> Encoder -> Features -> Decoder -> Reconstructed input data x̂

Loss: L2 distance between the input and the reconstructed data, || x̂ − x ||². Does not use any labels, just raw data!

Example: an encoder of 4 conv layers and a decoder of 4 transpose-conv layers.

The features need to be lower dimensional than the data.
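A minimal sketch of such an autoencoder, assuming 32x32 RGB inputs; the channel counts and the 128-dimensional feature size are illustrative, not the exact architecture from the slide.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                         # 3 x 32 x 32  ->  low-dimensional features
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128),                  # features are lower-dimensional than the input
)
decoder = nn.Sequential(                         # features  ->  reconstructed 3 x 32 x 32 image
    nn.Linear(128, 64 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 8, 8)),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)

x = torch.rand(16, 3, 32, 32)                    # a fake batch of raw images (no labels needed)
x_hat = decoder(encoder(x))
loss = ((x_hat - x) ** 2).sum(dim=(1, 2, 3)).mean()   # L2 reconstruction loss ||x_hat - x||^2
loss.backward()
print(loss.item())
```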
After training, throw away the decoder and use the encoder for a downstream task. The encoder can be used to initialize a supervised model: attach a classifier (softmax loss, etc.) that predicts labels (bird, plane, truck, dog, deer, ...), then fine-tune the encoder jointly with the classifier for the final task (sometimes with small data).

Summary: autoencoders learn latent features for data without any labels, and those features can initialize a supervised model. But they are not probabilistic: there is no way to sample new data from the learned model.
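Continuing the sketch above (so `encoder` is assumed to be the pretrained 128-dimensional feature extractor from the previous block), a hypothetical fine-tuning setup with a 5-way classifier head might look like this.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(128, 5)                   # e.g. bird / plane / truck / dog / deer
model = nn.Sequential(encoder, classifier)       # reuse the pretrained encoder, drop the decoder

x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 5, (8,))                    # a (small) labeled dataset
logits = model(x)
loss = nn.functional.cross_entropy(logits, y)    # softmax loss for the final task
loss.backward()                                  # fine-tunes encoder and classifier jointly
```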
Variational Autoencoders

Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR 2014
Variational Autoencoders

Probabilistic spin on autoencoders:
1. Learn latent features z from raw data
2. Sample from the model to generate new data

Assume the training data is generated from an unobserved (latent) representation z. Intuition: x is an image, z is the latent factors used to generate x: attributes, orientation, etc.

After training, sample new data like this: sample z from the prior, then sample x from the conditional p(x|z).

Assume a simple prior p(z), e.g. Gaussian. Represent p(x|z) with a neural network (similar to the decoder from an autoencoder).
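A minimal, hypothetical sketch of this sampling procedure, anticipating the next step where p(x|z) is a Gaussian whose parameters come from the decoder. The two-layer MLP decoder and the 784-dimensional x (a flattened 28x28 image) are illustrative assumptions.

```python
import torch
import torch.nn as nn

Z, X = 20, 784                                   # latent and data dimensionality (illustrative)
decoder = nn.Sequential(nn.Linear(Z, 400), nn.ReLU(), nn.Linear(400, 2 * X))

with torch.no_grad():
    z = torch.randn(1, Z)                        # 1. sample z from the prior p(z) = N(0, I)
    mu, log_var = decoder(z).chunk(2, dim=1)     # 2. decoder gives the parameters of p(x|z)
    x = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # 3. sample x ~ N(mu, diag(sigma^2))
print(x.shape)
```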
The decoder must be probabilistic: the decoder inputs z and outputs a mean μ_{x|z} and a (diagonal) covariance Σ_{x|z}; we then sample x from a Gaussian with that mean and covariance.

How to train this model? Basic idea: maximize the likelihood of the data. If we could observe the z for each x, we could train a conditional generative model p(x|z). We don't observe z, so we need to marginalize:

  p_θ(x) = ∫ p_θ(x, z) dz = ∫ p_θ(x|z) p_θ(z) dz

We can compute p_θ(x|z) with the decoder network, and we assumed a Gaussian prior for p_θ(z). Problem: it is impossible to integrate over all z!
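To get a feel for why the integral is a problem, here is a naive Monte Carlo sketch, p_θ(x) ≈ (1/K) Σ_k p_θ(x | z_k) with z_k drawn from the prior. With a high-dimensional z almost no prior samples explain a given x, so this estimator is useless in practice; it is shown only to illustrate the difficulty and is not part of VAE training. The Gaussian decoder is a made-up stand-in.

```python
import math
import torch
import torch.nn as nn

Z, X, K = 20, 784, 1000
decoder = nn.Sequential(nn.Linear(Z, 400), nn.ReLU(), nn.Linear(400, 2 * X))

x = torch.rand(X)                                # one data point whose likelihood we want
with torch.no_grad():
    z = torch.randn(K, Z)                        # z_k ~ p(z) = N(0, I)
    mu, log_var = decoder(z).chunk(2, dim=1)
    # log N(x; mu_k, diag(sigma_k^2)) for each sample, summed over the X dimensions
    log2pi = math.log(2 * math.pi)
    log_px_given_z = (-0.5 * ((x - mu) ** 2 / log_var.exp() + log_var + log2pi)).sum(dim=1)
    # log (1/K) sum_k p(x | z_k), computed stably with logsumexp
    log_px = torch.logsumexp(log_px_given_z, dim=0) - math.log(K)
print("naive Monte Carlo estimate of log p(x):", log_px.item())
```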
Recall p(x, z) = p(x|z) p(z) = p(z|x) p(x).

Another idea: try Bayes' Rule:

  p_θ(x) = p_θ(x|z) p_θ(z) / p_θ(z|x)

We can compute p_θ(x|z) with the decoder network, and we assumed a Gaussian prior p_θ(z). Problem: there is no way to compute p_θ(z|x)!

Solution: train another network (an encoder) that learns q_φ(z|x) ≈ p_θ(z|x). Then

  p_θ(x) = p_θ(x|z) p_θ(z) / p_θ(z|x) ≈ p_θ(x|z) p_θ(z) / q_φ(z|x)
Variational Autoencoders

Decoder network: inputs a latent code z, gives a distribution over data x: p_θ(x|z) = N(μ_{x|z}, Σ_{x|z}).
Encoder network: inputs data x, gives a distribution over latent codes z: q_φ(z|x) = N(μ_{z|x}, Σ_{z|x}).

If we can ensure that q_φ(z|x) ≈ p_θ(z|x), then we can approximate p_θ(x) ≈ p_θ(x|z) p(z) / q_φ(z|x).

Idea: jointly train both encoder and decoder.
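A minimal sketch of the two networks, assuming flattened 784-dimensional inputs and a 20-dimensional latent code (sizes are illustrative): the encoder maps x to the mean and diagonal covariance of q_φ(z|x), and the decoder maps z to the mean and diagonal covariance of p_θ(x|z). Predicting log-variances instead of covariances directly is a common implementation choice, not something stated on the slide.

```python
import torch
import torch.nn as nn

X, H, Z = 784, 400, 20

class Encoder(nn.Module):                        # q_phi(z|x) = N(mu_{z|x}, diag(sigma_{z|x}^2))
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(X, H), nn.ReLU())
        self.mu = nn.Linear(H, Z)
        self.log_var = nn.Linear(H, Z)           # predict log-variance for numerical stability
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):                        # p_theta(x|z) = N(mu_{x|z}, diag(sigma_{x|z}^2))
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Z, H), nn.ReLU())
        self.mu = nn.Linear(H, X)
        self.log_var = nn.Linear(H, X)
    def forward(self, z):
        h = self.net(z)
        return self.mu(h), self.log_var(h)

enc, dec = Encoder(), Decoder()
mu_z, log_var_z = enc(torch.rand(8, X))          # distribution over latent codes for a batch of x
print(mu_z.shape, log_var_z.shape)               # both (8, 20)
```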
Variational Autoencoders

log p_θ(x) = log [ p_θ(x|z) p(z) / p_θ(z|x) ]                                    (Bayes' Rule)

           = log [ p_θ(x|z) p(z) q_φ(z|x) / (p_θ(z|x) q_φ(z|x)) ]                 (multiply top and bottom by q_φ(z|x))

           = log p_θ(x|z) − log [ q_φ(z|x) / p(z) ] + log [ q_φ(z|x) / p_θ(z|x) ]  (split up using rules for logarithms)

Since log p_θ(x) does not depend on z, we can wrap it in an expectation over z ~ q_φ(z|x):

  log p_θ(x) = E_{z~q_φ(z|x)} [ log p_θ(x) ]
             = E_z [ log p_θ(x|z) ] − E_z [ log (q_φ(z|x) / p(z)) ] + E_z [ log (q_φ(z|x) / p_θ(z|x)) ]
             = E_{z~q_φ(z|x)} [ log p_θ(x|z) ] − D_KL( q_φ(z|x), p(z) ) + D_KL( q_φ(z|x), p_θ(z|x) )

The three terms are: data reconstruction; the KL divergence between the prior and samples from the encoder network; and the KL divergence between the encoder and the posterior of the decoder.

KL divergence is ≥ 0, so dropping the last term gives a lower bound on the data likelihood:

  log p_θ(x) ≥ E_{z~q_φ(z|x)} [ log p_θ(x|z) ] − D_KL( q_φ(z|x), p(z) )
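A minimal sketch of training on this lower bound: the reconstruction term is estimated with a single sample z ~ q_φ(z|x) using the reparameterization trick from Kingma and Welling, and the KL term between the diagonal-Gaussian encoder output and the unit-Gaussian prior uses its standard closed form. For brevity the decoder here outputs only a mean (a fixed-variance Gaussian), so the reconstruction term reduces to an L2 term; all architectures and sizes are illustrative.

```python
import torch
import torch.nn as nn

X, H, Z = 784, 400, 20
enc = nn.Sequential(nn.Linear(X, H), nn.ReLU(), nn.Linear(H, 2 * Z))   # outputs (mu_{z|x}, log sigma^2_{z|x})
dec = nn.Sequential(nn.Linear(Z, H), nn.ReLU(), nn.Linear(H, X))       # outputs mu_{x|z} (fixed unit variance)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, X)                                       # a fake minibatch of data

mu, log_var = enc(x).chunk(2, dim=1)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # z ~ q_phi(z|x), reparameterized
x_hat = dec(z)

recon = -0.5 * ((x - x_hat) ** 2).sum(dim=1)                # E_z[log p_theta(x|z)] up to a constant
kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1)   # KL(q_phi(z|x) || N(0, I)), closed form
loss = -(recon - kl).mean()                                 # negative of the lower bound, averaged over the batch

opt.zero_grad()
loss.backward()
opt.step()
print("negative ELBO:", loss.item())
```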