CS7015 (Deep Learning): Lecture 21, Variational Autoencoders. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.


  1. Let {X = x_i}_{i=1}^N be the training data. We can think of X as a random variable in R^n; for example, X could be an image and the dimensions of X correspond to the pixels of the image. We are interested in learning an abstraction (i.e., given an X, find the hidden representation z). We are also interested in generation (i.e., given a hidden representation, generate an X). In probabilistic terms we are interested in P(z|X) and P(X|z) (to be consistent with the literature on VAEs, we will use z instead of H and X instead of V).
[Figures: Abstraction (X → z); Generation (z → X)]

  2. Earlier we saw RBMs, where we learnt P(z|X) and P(X|z). Below we list certain characteristics of RBMs.
Structural assumptions: we assume certain independencies in the Markov network.
Computational: when training with Gibbs sampling, we have to run the Markov chain for many time steps, which is expensive.
Approximation: when using contrastive divergence, we approximate the expectation by a point estimate.
(Nothing is wrong with the above; we mention these points only to make the reader aware of these characteristics.)
[Figure: RBM with visible units V ∈ {0,1}^m (v_1, ..., v_m with biases b_1, ..., b_m), hidden units H ∈ {0,1}^n (h_1, ..., h_n with biases c_1, ..., c_n), and weights W ∈ R^{m×n}]

  3. We now return to our goals.
Goal 1: Learn a distribution over the latent variables, Q(z|X).
Goal 2: Learn a distribution over the visible variables, P(X|z).
VAEs use a neural network based encoder for Goal 1 and a neural network based decoder for Goal 2. We will look at the encoder first.
[Figure: the encoder Q_θ(z|X) maps the data X to z, and the decoder P_φ(X|z) maps z to the reconstruction X̂; θ and φ are the parameters of the encoder and decoder neural networks, respectively]

  4. Encoder: What do we mean when we say we want to learn a distribution? We mean that we want to learn the parameters of the distribution. But what are the parameters of Q(z|X)? Well, it depends on our modeling assumption! In VAEs we assume that the latent variables come from a standard normal distribution N(0, I), and the job of the encoder is then to predict the parameters of this distribution: given X ∈ R^n, it predicts µ ∈ R^m and Σ ∈ R^{m×m}.
[Figure: the encoder Q_θ(z|X) takes X as input and outputs µ and Σ]
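As a concrete (hypothetical) illustration, not taken from the lecture, a minimal PyTorch-style sketch of such an encoder is given below. It assumes a diagonal Σ, so the network only outputs µ and the log of the diagonal variances; the layer sizes and names are made up for the example.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Q_theta(z|X): maps X to the parameters (mu, diagonal Sigma) of a Gaussian over z."""
        def __init__(self, n_in=784, n_hidden=400, m_latent=20):
            super().__init__()
            self.hidden = nn.Linear(n_in, n_hidden)
            self.mu = nn.Linear(n_hidden, m_latent)       # predicts mu in R^m
            self.log_var = nn.Linear(n_hidden, m_latent)  # predicts log of the diagonal of Sigma

        def forward(self, x):
            h = torch.relu(self.hidden(x))
            return self.mu(h), self.log_var(h)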

  5. Now what about the decoder? The job of the decoder is to predict a probability distribution over X: P(X|z). Once again we will assume a certain form for this distribution. For example, if we want to predict 28 x 28 pixels and each pixel belongs to R (i.e., X ∈ R^784), then what would be a suitable family for P(X|z)? We could assume that P(X|z) is a Gaussian distribution with unit variance. The job of the decoder f would then be to predict the mean of this distribution as f_φ(z).
[Figure: x_i → encoder Q_θ(z|X) → (µ, Σ) → sample z → decoder P_φ(X|z) → reconstruction X̂_i]
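A matching decoder sketch, again hypothetical and assuming a unit-variance Gaussian P(X|z), only needs to output the mean f_φ(z):

    import torch
    import torch.nn as nn

    class Decoder(nn.Module):
        """P_phi(X|z): with a unit-variance Gaussian, the network only outputs the mean f_phi(z)."""
        def __init__(self, m_latent=20, n_hidden=400, n_out=784):
            super().__init__()
            self.hidden = nn.Linear(m_latent, n_hidden)
            self.mean = nn.Linear(n_hidden, n_out)

        def forward(self, z):
            h = torch.relu(self.hidden(z))
            return self.mean(h)  # f_phi(z), the mean of the Gaussian over X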

  6. What would be the objective function of the decoder? For any given training sample x_i it should maximize P(x_i), given by
P(x_i) = ∫ P(z) P(x_i|z) dz
This integral cannot be evaluated directly, so instead we draw z from Q_θ(z|x_i) and maximize log P_φ(x_i|z) in expectation; equivalently, the decoder minimizes
−E_{z∼Q_θ(z|x_i)} [log P_φ(x_i|z)]
(As usual, we take the log for numerical stability.)

  7. This is the loss function for one data point, l_i(θ, φ), and we just sum over all the data points to get the total loss
L(θ, φ) = Σ_{i=1}^{m} l_i(θ, φ)
In addition, we also want a constraint on the distribution over the latent variables. Specifically, we had assumed P(z) to be N(0, I), and we want Q(z|X) to be as close to P(z) as possible. Thus, we modify the loss function so that
l_i(θ, φ) = −E_{z∼Q_θ(z|x_i)} [log P_φ(x_i|z)] + KL(Q_θ(z|x_i) || P(z))
(KL divergence captures the difference, or distance, between two distributions.)
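For a diagonal Gaussian Q_θ(z|x_i) = N(µ, diag(σ²)) and P(z) = N(0, I), the KL term has the closed form ½ Σ_j (σ_j² + µ_j² − 1 − log σ_j²), and with a unit-variance Gaussian P_φ(x_i|z) the first term reduces to a squared reconstruction error up to an additive constant. A minimal single-sample sketch of l_i(θ, φ) under these assumptions (not the lecture's code):

    import torch

    def vae_loss(x, x_mean, mu, log_var):
        """Single-sample estimate of l_i = -log P_phi(x|z) + KL(Q_theta(z|x) || N(0, I)).
        Assumes P_phi(x|z) is Gaussian with unit variance and Q_theta(z|x) = N(mu, diag(sigma^2))."""
        recon = 0.5 * torch.sum((x - x_mean) ** 2)                     # -log P_phi(x|z) + const
        kl = 0.5 * torch.sum(mu ** 2 + log_var.exp() - 1.0 - log_var)  # closed-form KL to N(0, I)
        return recon + kl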

  8. The second term in the loss function can actually be thought of as a regularizer. It ensures that the encoder does not cheat by mapping each x_i to a different point (a normal distribution with very low variance) in the Euclidean space. In other words, in the absence of the regularizer the encoder could learn a unique mapping for each x_i, and the decoder could then decode from this unique mapping. Even with high variance in the samples from the distribution, we want the decoder to be able to reconstruct the original data very well (the motivation is similar to that of adding noise). To summarize, for each data point we predict a distribution such that, with high probability, a sample from this distribution should be able to reconstruct the original data point. But why do we choose a normal distribution? Isn't it too simplistic to assume that z follows a normal distribution?
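Putting the pieces together, a hypothetical per-example forward pass that wires up the encoder, sampling step, decoder, and loss sketches above might look like this (sampling z as µ + σ ⊙ ε with ε ∼ N(0, I)):

    import torch

    # Hypothetical wiring of the Encoder, Decoder, and vae_loss sketches above
    # for a single training example x (shape [1, 784]).
    encoder, decoder = Encoder(), Decoder()
    x = torch.rand(1, 784)

    mu, log_var = encoder(x)                 # parameters of Q_theta(z|x)
    eps = torch.randn_like(mu)               # eps ~ N(0, I)
    z = mu + (0.5 * log_var).exp() * eps     # one sample z ~ N(mu, diag(sigma^2))
    x_mean = decoder(z)                      # mean of P_phi(x|z)
    loss = vae_loss(x, x_mean, mu, log_var)  # l_i(theta, phi) for this example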

  9. Isn't it a very strong assumption that P(z) ∼ N(0, I)? For example, in the 2-dimensional case, how can we be sure that P(z) is a normal distribution and not some other distribution? The key insight here is that any distribution in d dimensions can be generated by the following two steps.
Step 1: Start with a set of d variables that are normally distributed (that's exactly what we are assuming for P(z)).
Step 2: Map these variables through a sufficiently complex function (that's exactly what the first few layers of the decoder can do).

  10. In particular, note that in the adjoining example, if z is 2-D and normally distributed then
f(z) = z/10 + z/||z||
is roughly ring shaped (giving us the distribution in the bottom figure). A non-linear neural network, such as the one we use for the decoder, could learn a complex mapping from z to f_φ(z) using its parameters φ; the initial layers of a non-linear decoder could learn their weights such that their output is f_φ(z). This argument suggests that even if we start with normally distributed variables, the initial layers of the decoder can learn a complex transformation of these variables, say f_φ(z), if required. The objective function of the decoder will ensure that an appropriate transformation of z is learnt to reconstruct X.
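The ring-shaped example is easy to reproduce numerically; the short NumPy sketch below simply mirrors f(z) = z/10 + z/||z|| from the slide.

    import numpy as np

    # Sample 2-D standard normal latents and push them through f(z) = z/10 + z/||z||.
    z = np.random.randn(10000, 2)                     # z ~ N(0, I)
    norms = np.linalg.norm(z, axis=1, keepdims=True)  # ||z|| for each sample
    f_z = z / 10.0 + z / norms
    # f_z concentrates near the unit circle, i.e., it is roughly ring shaped.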

  11. Module 21.3: Variational Autoencoders (the graphical model perspective)

  12. Here we can think of z and X as random variables. We are then interested in the joint probability distribution P(X, z), which factorizes as
P(X, z) = P(z) P(X|z)
This factorization is natural because we can imagine that the latent variables are fixed first and then the visible variables are drawn based on the latent variables. For example, if we want to draw a digit, we could first fix the latent variables (the digit, size, angle, thickness, position, and so on) and then draw a digit which corresponds to these latent variables. And of course, unlike RBMs, this is a directed graphical model.
[Figure: directed graphical model z → X, with a plate over the N observations]
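This factorization also spells out how generation works: first sample z from P(z), then sample X from P(X|z). A small sketch of this ancestral sampling, reusing the hypothetical decoder from earlier:

    import torch

    # Ancestral sampling from P(X, z) = P(z) P(X|z), using the hypothetical
    # Decoder sketched earlier (m_latent=20, n_out=784): first draw z, then X given z.
    decoder = Decoder()
    z = torch.randn(1, 20)                        # z ~ P(z) = N(0, I)
    x_mean = decoder(z)                           # mean of P(X|z)
    x_sample = x_mean + torch.randn_like(x_mean)  # one draw from the unit-variance Gaussian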

  13. Now, at inference time, we are given an X (observed variable) and we are interested in finding the most likely assignments of the latent variables z which would have resulted in this observation. Mathematically, we want to find
P(z|X) = P(X|z) P(z) / P(X)
This is hard to compute because the right-hand side contains P(X), which is intractable:
P(X) = ∫ P(X|z) P(z) dz = ∫∫⋯∫ P(X|z_1, z_2, ..., z_n) P(z_1, z_2, ..., z_n) dz_1 ... dz_n
In RBMs we had a similar integral, which we approximated using Gibbs sampling. VAEs, on the other hand, cast this into an optimization problem and learn the parameters of that optimization problem.

  14. Specifically, in VAEs, since P(z|X) is intractable, we assume that the posterior distribution is instead given by Q_θ(z|X). Further, we assume that Q_θ(z|X) is a Gaussian whose parameters are determined by a neural network: µ, Σ = g_θ(X). The parameters of the distribution are thus determined by the parameters θ of a neural network, and our job then is to learn the parameters of this neural network.

  15. But what is the objective function for this neural network? Well, we want the proposed distribution Q_θ(z|X) to be as close to the true distribution P(z|X) as possible. We can capture this using the following objective function:
minimize KL(Q_θ(z|X) || P(z|X))
What are the parameters of the objective function?
