
CS7015 (Deep Learning): Lecture 23 - Generative Adversarial Networks (GANs)
Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras


What should be the objective function of the overall network? Let's look at the objective function of the generator first.

(Figure: the GAN setup - a Generator maps noise $z \sim \mathcal{N}(0, I)$ to an image, and a Discriminator receives real images and generated images and outputs a Real-or-Fake score.)

Given an image generated by the generator as $G_\phi(z)$, the discriminator assigns a score $D_\theta(G_\phi(z))$ to it. This score will be between 0 and 1 and will tell us the probability of the image being real or fake. For a given $z$, the generator would want to maximize $\log D_\theta(G_\phi(z))$ (log likelihood) or minimize $\log(1 - D_\theta(G_\phi(z)))$.

This is just for a single $z$ and the generator would like to do this for all possible values of $z$. For example, if $z$ were discrete and drawn from a uniform distribution (i.e., $p(z) = \frac{1}{N}\ \forall z$), then the generator's objective function would be

$$\min_\phi \; \frac{1}{N} \sum_{i=1}^{N} \log\big(1 - D_\theta(G_\phi(z^{(i)}))\big)$$

However, in our case $z$ is continuous and not uniform ($z \sim \mathcal{N}(0, I)$), so the equivalent objective function is

$$\min_\phi \; \int p(z) \log\big(1 - D_\theta(G_\phi(z))\big)\, dz \;=\; \min_\phi \; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\theta(G_\phi(z))\big)\big]$$
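To make this expectation concrete, here is a minimal Python sketch that estimates the generator loss $\mathbb{E}_{z \sim \mathcal{N}(0,I)}[\log(1 - D_\theta(G_\phi(z)))]$ by Monte Carlo sampling. The tiny networks are hypothetical stand-ins for $G_\phi$ and $D_\theta$, not the lecture's models.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for G_phi and D_theta, only to make the expectation concrete.
G = torch.nn.Linear(100, 784)                                          # generator: z -> "image"
D = torch.nn.Sequential(torch.nn.Linear(784, 1), torch.nn.Sigmoid())   # discriminator: image -> (0, 1)

z = torch.randn(1024, 100)                  # 1024 samples of z ~ N(0, I)
loss_G = torch.log(1 - D(G(z))).mean()      # sample average approximates E_z[log(1 - D(G(z)))]
print(loss_G.item())
```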

Now let's look at the discriminator. The task of the discriminator is to assign a high score to real images and a low score to fake images, and it should do this for all possible real images and all possible fake images. In other words, it should try to maximize the following objective function:

$$\max_\theta \; \mathbb{E}_{x \sim p_{data}}\big[\log D_\theta(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\theta(G_\phi(z))\big)\big]$$
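In practice this maximization is typically implemented as minimizing a binary cross-entropy loss with target 1 for real images and target 0 for generated ones; the two are the same objective up to sign. A minimal sketch, with hypothetical networks and random tensors standing in for real data:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for G_phi and D_theta.
G = torch.nn.Linear(100, 784)
D = torch.nn.Sequential(torch.nn.Linear(784, 1), torch.nn.Sigmoid())

x_real = torch.randn(64, 784)               # stand-in for a minibatch from p_data(x)
x_fake = G(torch.randn(64, 100)).detach()   # detached: this loss only updates the discriminator

# Maximizing E[log D(x)] + E[log(1 - D(G(z)))] is the same as minimizing
# binary cross-entropy with targets 1 (real) and 0 (fake).
loss_D = F.binary_cross_entropy(D(x_real), torch.ones(64, 1)) + \
         F.binary_cross_entropy(D(x_fake), torch.zeros(64, 1))
```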

If we put the objectives of the generator and discriminator together we get a minimax game:

$$\min_\phi \max_\theta \; \big[\mathbb{E}_{x \sim p_{data}} \log D_\theta(x) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D_\theta(G_\phi(z))\big)\big]$$

The first term in the objective depends only on the parameters of the discriminator ($\theta$). The second term depends on the parameters of the generator ($\phi$) as well as the discriminator ($\theta$). The discriminator wants to maximize the second term whereas the generator wants to minimize it (hence it is a two-player game).

So the overall training proceeds by alternating between these two steps.

Step 1: Gradient ascent on the discriminator

$$\max_\theta \; \big[\mathbb{E}_{x \sim p_{data}} \log D_\theta(x) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D_\theta(G_\phi(z))\big)\big]$$

Step 2: Gradient descent on the generator

$$\min_\phi \; \mathbb{E}_{z \sim p(z)} \log\big(1 - D_\theta(G_\phi(z))\big)$$

In practice, the above generator objective does not work well and we use a slightly modified objective. Let us see why.

When the sample is likely fake, we want to give feedback to the generator (using gradients). However, in the region where $D(G(z))$ is close to 0, the curve of the loss $\log(1 - D(G(z)))$ is very flat and the gradient would be close to 0.

(Figure: the two candidate generator losses $\log(1 - D(G(z)))$ and $-\log(D(G(z)))$ plotted against $D(G(z))$; the first is flat near $D(G(z)) = 0$, the second is very steep there.)

Trick: Instead of minimizing the likelihood of the discriminator being correct, maximize the likelihood of the discriminator being wrong, i.e., maximize $\log D_\theta(G_\phi(z))$. In effect, the objective remains the same but the gradient signal becomes better.
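To see the difference, compare the derivatives of the two losses with respect to the discriminator output $D = D_\theta(G_\phi(z))$ (a quick check, not on the slide):

$$\frac{d}{dD}\,\log(1 - D) = \frac{-1}{1 - D} \;\xrightarrow{\;D \to 0\;}\; -1,
\qquad
\frac{d}{dD}\,\big(-\log D\big) = \frac{-1}{D} \;\xrightarrow{\;D \to 0\;}\; -\infty$$

So precisely in the regime where the generator is doing badly (the discriminator confidently labels the sample as fake, $D \approx 0$), the original loss provides only a small, bounded gradient while the modified loss provides a very large one.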

With that we are now ready to see the full algorithm for training GANs.

 1: procedure GAN-Training
 2:   for number of training iterations do
 3:     for k steps do
 4:       • Sample minibatch of m noise samples $\{z^{(1)}, \ldots, z^{(m)}\}$ from noise prior $p_g(z)$
 5:       • Sample minibatch of m examples $\{x^{(1)}, \ldots, x^{(m)}\}$ from the data generating distribution $p_{data}(x)$
 6:       • Update the discriminator by ascending its stochastic gradient:
            $$\nabla_\theta \; \frac{1}{m} \sum_{i=1}^{m} \Big[\log D_\theta\big(x^{(i)}\big) + \log\Big(1 - D_\theta\big(G_\phi\big(z^{(i)}\big)\big)\Big)\Big]$$
 7:     end for
 8:     • Sample minibatch of m noise samples $\{z^{(1)}, \ldots, z^{(m)}\}$ from noise prior $p_g(z)$
 9:     • Update the generator by ascending its stochastic gradient:
            $$\nabla_\phi \; \frac{1}{m} \sum_{i=1}^{m} \log D_\theta\big(G_\phi\big(z^{(i)}\big)\big)$$
10:   end for
11: end procedure
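Below is a minimal PyTorch sketch of this procedure (not from the lecture): small fully connected stand-ins for $G_\phi$ and $D_\theta$, a random-data placeholder for $p_{data}(x)$, and the non-saturating generator update from the previous slide. All names and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim, m, k = 100, 784, 64, 1   # illustrative sizes; k inner discriminator steps

# Hypothetical small networks standing in for G_phi and D_theta.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

def sample_real(m):
    # Stand-in for sampling from p_data(x); replace with a real data loader in practice.
    return torch.rand(m, data_dim) * 2 - 1

for iteration in range(1000):
    # Step 1: k steps of gradient ascent on the discriminator.
    for _ in range(k):
        z = torch.randn(m, latent_dim)                 # z ~ N(0, I)
        x_real, x_fake = sample_real(m), G(z).detach()
        # Ascend log D(x) + log(1 - D(G(z))) by descending its negative.
        loss_D = -(torch.log(D(x_real)) + torch.log(1 - D(x_fake))).mean()
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: update the generator with the non-saturating objective.
    z = torch.randn(m, latent_dim)
    # Ascend log D(G(z)) by descending -log D(G(z)).
    loss_G = -torch.log(D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```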

Module 23.2: Generative Adversarial Networks - Architecture

We will now look at one of the popular neural network architectures used for the generator and discriminator: Deep Convolutional GANs (DCGANs). For the discriminator, any CNN-based classifier with 1 class (real) at the output can be used (e.g., VGG, ResNet, etc.).

(Figure: Generator (Radford et al., 2015) (left) and discriminator (Yeh et al., 2016) (right) used in DCGAN.)

Architecture guidelines for stable Deep Convolutional GANs:
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
• Use batchnorm in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in the generator for all layers except the output, which uses tanh.
• Use LeakyReLU activation in the discriminator for all layers.
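As an illustration of these guidelines, here is a minimal PyTorch sketch of a DCGAN-style generator and discriminator for 64x64 RGB images. It follows the common DCGAN layout; the exact channel widths are assumptions, not taken from the lecture slides.

```python
import torch.nn as nn

nz, ngf, ndf, nc = 100, 64, 64, 3   # latent size, feature widths, image channels (illustrative)

# Generator: fractional-strided (transposed) convolutions, batchnorm, ReLU, tanh output.
# Expects z reshaped to (N, nz, 1, 1); spatial size grows 1 -> 4 -> 8 -> 16 -> 32 -> 64.
generator = nn.Sequential(
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf), nn.ReLU(True),
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False), nn.Tanh(),          # 3 x 64 x 64 output
)

# Discriminator: strided convolutions instead of pooling, batchnorm, LeakyReLU, no FC hidden layers.
discriminator = nn.Sequential(
    nn.Conv2d(nc, ndf, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False), nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False), nn.Sigmoid(),             # scalar real/fake score
)
```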

Module 23.3: Generative Adversarial Networks - The Math Behind it

We will now delve a bit deeper into the objective function used by GANs and see what it implies. Suppose we denote the true data distribution by $p_{data}(x)$ and the distribution of the data generated by the model by $p_G(x)$. What do we wish should happen at the end of training? We want $p_G(x) = p_{data}(x)$. Can we prove this formally, even though the model is not explicitly computing this density? We will try to prove this over the next few slides.

Theorem. The global minimum of the virtual training criterion $C(G) = \max_D V(G, D)$ is achieved if and only if $p_G = p_{data}$.

This is equivalent to:

Theorem.
1. If $p_G = p_{data}$, then the global minimum of the virtual training criterion $C(G) = \max_D V(G, D)$ is achieved, and
2. The global minimum of the virtual training criterion $C(G) = \max_D V(G, D)$ is achieved only if $p_G = p_{data}$.

Outline of the Proof

The 'if' part: the global minimum of the virtual training criterion $C(G) = \max_D V(G, D)$ is achieved if $p_G = p_{data}$.
(a) Find the value of $V(D, G)$ when the generator is optimal, i.e., when $p_G = p_{data}$.
(b) Find the value of $V(D, G)$ for other values of the generator, i.e., for any $p_G$ such that $p_G \neq p_{data}$.
(c) Show that $(a) < (b)\ \forall\ p_G \neq p_{data}$ (and hence the minimum of $V(D, G)$ is achieved when $p_G = p_{data}$).

The 'only if' part: the global minimum of the virtual training criterion $C(G) = \max_D V(G, D)$ is achieved only if $p_G = p_{data}$.
Show that when $V(D, G)$ is minimum, then $p_G = p_{data}$.

First let us look at the objective function again:

$$\min_\phi \max_\theta \; \big[\mathbb{E}_{x \sim p_{data}} \log D_\theta(x) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D_\theta(G_\phi(z))\big)\big]$$

We will expand it to its integral form:

$$\min_\phi \max_\theta \; \int_x p_{data}(x) \log D_\theta(x)\, dx + \int_z p(z) \log\big(1 - D_\theta(G_\phi(z))\big)\, dz$$

Let $p_G(x)$ denote the distribution of the $x$'s generated by the generator. Since $x$ is a function of $z$, we can replace the second integral as shown below:

$$\min_\phi \max_\theta \; \int_x p_{data}(x) \log D_\theta(x)\, dx + \int_x p_G(x) \log\big(1 - D_\theta(x)\big)\, dx$$

The above replacement follows from the law of the unconscious statistician.

Okay, so our revised objective is given by

$$\min_\phi \max_\theta \; \int_x \big( p_{data}(x) \log D_\theta(x) + p_G(x) \log(1 - D_\theta(x)) \big)\, dx$$

Given a generator $G$, we are interested in finding the optimum discriminator $D$ which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized $\forall x$. To find the optimum, we take the derivative of the term inside the integral w.r.t. $D$ and set it to zero:

$$\frac{d}{d(D_\theta(x))} \big( p_{data}(x) \log D_\theta(x) + p_G(x) \log(1 - D_\theta(x)) \big) = 0$$

$$p_{data}(x)\,\frac{1}{D_\theta(x)} + p_G(x)\,\frac{1}{1 - D_\theta(x)}\,(-1) = 0$$

$$\frac{p_{data}(x)}{D_\theta(x)} = \frac{p_G(x)}{1 - D_\theta(x)}$$

$$p_{data}(x)\,\big(1 - D_\theta(x)\big) = p_G(x)\, D_\theta(x)$$

$$D_\theta(x) = \frac{p_{data}(x)}{p_G(x) + p_{data}(x)}$$
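A quick numerical sanity check of this result (plain Python, with made-up density values at a single fixed $x$): the maximizer of $a \log D + b \log(1 - D)$ over $D \in (0, 1)$ should be $a / (a + b)$.

```python
import numpy as np

a, b = 0.7, 0.2                         # made-up values of p_data(x) and p_G(x) at some fixed x
D = np.linspace(1e-4, 1 - 1e-4, 100000)
objective = a * np.log(D) + b * np.log(1 - D)

print(D[np.argmax(objective)])          # ~0.7778, the grid point maximizing the integrand
print(a / (a + b))                      # 0.7777..., matching the derived optimum
```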

This means that, for any given generator,

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$$

Now the 'if' part of the theorem says "if $p_G = p_{data}$ ...". So let us substitute $p_G = p_{data}$ into $D^*_G(x)$ and see what happens to the loss function.
