CS7015 (Deep Learning), Lecture 22: Autoregressive Models (NADE, MADE)


  1. CS7015 (Deep Learning) : Lecture 22, Autoregressive Models (NADE, MADE). Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

  2. Module 22.1: Neural Autoregressive Density Estimator (NADE)

  3-6. [Figure: an RBM with visible units V ∈ {0,1}^m (biases b_1..b_m), hidden units H ∈ {0,1}^n (biases c_1..c_n) and weights W ∈ R^{m×n}; and a VAE with encoder Q_θ(z|x) producing μ and Σ, the reparameterized latent z = μ + Σ ∗ ε, and decoder P_φ(x|z) producing x̂]

  So far we have seen a few latent variable generation models, such as RBMs and VAEs. Latent variable models make certain independence assumptions, which reduce the number of factors and, in turn, the number of parameters in the model. For example, in RBMs we assumed that the visible variables were independent given the hidden variables, which allowed us to do Block Gibbs Sampling. Similarly, in VAEs we assumed P(x|z) to be a Gaussian with covariance Σ = I, which effectively means that, given the latent variables, the x's are independent of each other.

  7-11. We will now look at Autoregressive (AR) models, which do not contain any latent variables. The aim, of course, is to learn a joint distribution over x. As usual, for ease of illustration, we will assume x ∈ {0,1}^n.

  [Figure: the variables x_1, x_2, x_3, x_4]

  AR models do not make any independence assumptions but use the default factorization of p(x) given by the chain rule:

  p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

  The above factorization contains n factors, and some of these factors contain many parameters (O(2^n) in total), as the small count below illustrates.
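  To see where the O(2^n) comes from, here is a small back-of-the-envelope count (our own illustration, not from the slides): the factor p(x_i | x_{<i}) needs one probability for each of the 2^(i-1) configurations of x_{<i}, so a full tabular parameterization of all n factors needs 2^n - 1 numbers in total.

      # Counting the parameters of a full tabular parameterization of the chain rule
      # (illustrative sketch; n is arbitrary)
      n = 30  # number of binary variables

      # p(x_i | x_{<i}) needs one probability per configuration of x_{<i}
      params_per_factor = [2 ** (i - 1) for i in range(1, n + 1)]
      total = sum(params_per_factor)
      print(total, 2 ** n - 1)  # both print 1073741823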

  12-14. Obviously, it is infeasible to learn such an exponential number of parameters. AR models work around this by using a neural network to parameterize these factors and then learning the parameters of this neural network. What does this mean? Let us see!

  15-18. [Figure: a network with inputs x_1, x_2, x_3, x_4 at the bottom and outputs p(x_1), p(x_2 | x_1), p(x_3 | x_2, x_1), p(x_4 | x_3, x_2, x_1) at the top]

  At the output layer we want to predict n conditional probability distributions (each corresponding to one of the factors in our joint distribution). At the input layer we are given the n input variables. Now the catch is that the n-th output should only be connected to the previous n-1 inputs. In particular, when we are computing p(x_3 | x_2, x_1), the only inputs that we should consider are x_1 and x_2, because these are the only variables given to us while computing this conditional (see the connectivity sketch below).
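  As a concrete illustration of this constraint (our own sketch, not from the lecture), the allowed input-to-output connectivity can be written as a strictly lower-triangular binary mask: the k-th output, which models p(x_k | x_{<k}), may only see the inputs x_1, ..., x_{k-1}.

      import numpy as np

      n = 4
      # mask[k, j] = 1 iff input x_{j+1} may influence the output modelling p(x_{k+1} | x_{<k+1})
      mask = np.tril(np.ones((n, n), dtype=int), k=-1)
      print(mask)
      # [[0 0 0 0]    p(x1)              sees no inputs
      #  [1 0 0 0]    p(x2 | x1)         sees x1
      #  [1 1 0 0]    p(x3 | x1, x2)     sees x1, x2
      #  [1 1 1 0]]   p(x4 | x1, x2, x3) sees x1, x2, x3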

  19-22. [Figure: the NADE network, with inputs x_1..x_4, a hidden state h_k for every output, input weights W_{.,<k} and output weights V]

  The Neural Autoregressive Density Estimator (NADE) proposes a simple solution for this. First, for every output unit, we compute a hidden representation using only the relevant input units. For example, for the k-th output unit, the hidden representation will be computed using:

  h_k = σ(W_{.,<k} x_{<k} + b)

  where h_k ∈ R^d, W ∈ R^{d×n}, and W_{.,<k} denotes the first k-1 columns of W (the columns with index < k). We now compute the output p(x_k | x_{<k}) as:

  y_k = p(x_k | x_1, ..., x_{k-1}) = σ(V_k h_k + c_k)

  where V_k is the k-th row of V ∈ R^{n×d} and c_k is a scalar bias.
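  Note that the same matrix W is shared across all the conditionals (the k-th hidden state simply uses its first k-1 columns), so the parameters of the model are just W ∈ R^{d×n}, V ∈ R^{n×d}, b ∈ R^d and c ∈ R^n, i.e. O(nd) numbers instead of the O(2^n) of the tabular factorization. A tiny illustrative count (the sizes d = 500, n = 784 are only an example, roughly binarized-MNIST-sized inputs, not values from the lecture):

      d, n = 500, 784
      nade_params = d * n + n * d + d + n   # W, V, b, c
      print(nade_params)                    # 785284, versus 2**784 - 1 for a full table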

  23. Let us look at the equations carefully:

  h_k = σ(W_{.,<k} x_{<k} + b)
  y_k = p(x_k | x_{<k}) = σ(V_k h_k + c_k)
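  Putting the two equations together, here is a minimal numpy sketch of the NADE forward pass for binary inputs (all names are our own and the parameters are random; this is an illustration of the equations above, not the lecture's code):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def nade_forward(x, W, V, b, c):
          """Return y_k = p(x_k = 1 | x_{<k}) for k = 1, ..., n.

          x : (n,) binary vector; W : (d, n); V : (n, d); b : (d,); c : (n,)
          """
          n = x.shape[0]
          y = np.zeros(n)
          for k in range(n):
              # the (k+1)-th output (1-indexed) only sees x_1, ..., x_k,
              # i.e. the first k entries of x and the first k columns of W
              h_k = sigmoid(W[:, :k] @ x[:k] + b)
              y[k] = sigmoid(V[k] @ h_k + c[k])
          return y

      # toy usage
      d, n = 5, 4
      rng = np.random.default_rng(0)
      W, V = rng.normal(size=(d, n)), rng.normal(size=(n, d))
      b, c = np.zeros(d), np.zeros(n)
      x = np.array([1.0, 0.0, 1.0, 1.0])
      y = nade_forward(x, W, V, b, c)
      log_px = np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

  Because each y_k is exactly one factor p(x_k | x_{<k}) of the chain-rule factorization, summing their Bernoulli log-likelihoods gives log p(x), which is what makes NADE a density estimator.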
