  1. Wasserstein GAN. Martin Arjovsky, Soumith Chintala, Léon Bottou, ICML 2017. Presented by Yaochen Xie, 12-22-2017

  2. Contents ❖ GAN and its applications [1] ❖ GAN vs. Variational Auto-Encoder [2] ❖ What’s wrong with GAN [3], [4] ❖ JS Divergence and KL Divergence [3], [4] ❖ Wasserstein Distance [4], [5] ❖ WGAN and its Implementation [4]

  3. Take A Look Back at GAN D and G play the following two-player minimax game with the value function V(D, G):
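The value function appeared on the slide as an image; for reference, the standard objective from [1] is:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
\]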

  4. Applications of GAN - Image Translation (Conditional GAN, Triangle GAN)

  5. Applications of GAN - Super-Resolution

  6. Applications of GAN - Image Inpainting (figure panels: Real, Input, TV, LR, GAN)

  7. GAN vs. VAE - AutoEncoder

  8. GAN vs. VAE - Variational AutoEncoder • add a constraint on the encoding network that forces it to generate latent vectors that roughly follow a unit Gaussian distribution • generative loss: mean squared error (reconstruction) • latent loss: KL divergence (both written out below)
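The two loss terms were shown only by name; for reference, a standard formulation (assuming an encoder that outputs a mean μ(x) and standard deviation σ(x) per latent dimension) is:

\[
\mathcal{L}(x) = \underbrace{\|x - \hat{x}\|^2}_{\text{generative loss}}
+ \underbrace{D_{\mathrm{KL}}\big(\mathcal{N}(\mu(x), \operatorname{diag}\sigma^2(x)) \,\|\, \mathcal{N}(0, I)\big)}_{\text{latent loss}},
\qquad
D_{\mathrm{KL}} = \tfrac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)
\]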

  9. GAN vs. VAE • VAE - explicit density model; uses MSE (reconstruction error) to judge generation quality • GAN - implicit density model; uses a discriminator to judge generation quality

  10. Drawbacks of GAN: Unstable, not converging; Gradient Vanishing; Mode Collapse

  11. Kullback–Leibler divergence (Relative Entropy). A measure of how one probability distribution diverges from another, defined for both continuous and discrete distributions (see below). Notice that D_KL(P||Q) is not equal to D_KL(Q||P); rigorously, the KL divergence cannot be considered a distance, since it is not symmetric.
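The definitions were shown as images; the standard forms are:

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx \quad \text{(continuous)},
\qquad
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} \quad \text{(discrete)}
\]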

  12. Jensen-Shannon divergence. A symmetrized and smoothed version of the KL divergence. When two distributions are far from each other (e.g., their supports barely overlap), the JS divergence saturates at the constant log 2.
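For reference, the definition (shown on the slide as an image):

\[
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q)
\]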

  13. Where is the loss from? The GAN loss is based on cross entropy. What if p and q belong to continuous distributions? The sums become expectations (see below).
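Written out (the formulas on the slide were images): the cross entropy is H(p, q) = -\sum_x p(x) \log q(x), and with continuous distributions the sums become expectations, so the discriminator's cross-entropy loss reads

\[
L(D) = -\,\mathbb{E}_{x \sim P_r}[\log D(x)] \;-\; \mathbb{E}_{x \sim P_g}[\log(1 - D(x))]
\]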

  14. What’s going wrong? Now we fix G and let D be optimal: the value function becomes 2 times the Jensen-Shannon divergence (minus a constant). So far, optimizing the loss is equivalent to minimizing the JS divergence between Pr and Pg; when Pr and Pg barely overlap, the JS divergence sits at the constant log 2 and provides no useful gradient to the generator: Gradient Vanishing!
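The intermediate steps were images; following [1] and [3], with D fixed at its optimum:

\[
D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)},
\qquad
V(D^*, G) = 2\,\mathrm{JSD}(P_r \,\|\, P_g) - 2\log 2
\]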

  15. What’s going wrong? When G is fixed and D is optimal, the commonly used alternative generator loss amounts to minimizing KL(Pg || Pr) while maximizing JSD(Pr || Pg) at the same time, which pulls in opposite directions: Unstable. Moreover, KL(Pg || Pr) is asymmetric: where the generator produces implausible samples (Pg > 0, Pr → 0) the penalty KL → ∞, but where the generator drops modes of the real data (Pg → 0, Pr > 0) the penalty KL → 0, so missing modes go almost unpunished: Mode Collapse.
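Following the analysis in [3], with the alternative generator loss E[-log D(x)] and the optimal discriminator D*:

\[
\mathbb{E}_{x \sim P_g}\!\big[-\log D^*(x)\big]
= D_{\mathrm{KL}}(P_g \,\|\, P_r) - 2\,\mathrm{JSD}(P_r \,\|\, P_g) + \text{const},
\]

where the constant collects terms that do not depend on the generator.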

  16. What’s going wrong?

  17. We need a weaker distance. WHY? 1. We wish the loss, a distance ρ(Pr, Pθ), to be continuous. The most fundamental difference between such distances is their impact on the convergence of sequences of probability distributions: a sequence of distributions (Pt) converges to P if and only if ρ(Pt, P) → 0. A weaker distance makes it easier for such sequences to converge.

  18. We need a weaker distance. WHY? 2. We wish the mapping θ ↦ Pθ to be continuous. Continuity means that when a sequence of parameters θt converges to θ, the distributions Pθt also converge to Pθ. The weaker this distance, the easier it is to define a continuous mapping from θ-space to Pθ-space, since it's easier for the distributions to converge (see the example below).
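Example 1 from [4] illustrates this: let Z ~ U[0, 1], let P_0 be the distribution of (0, Z) (a vertical line segment at x = 0) and P_θ the distribution of (θ, Z). Then

\[
W(P_0, P_\theta) = |\theta|,
\qquad
\mathrm{JSD}(P_0 \,\|\, P_\theta) = \begin{cases} \log 2 & \theta \neq 0 \\ 0 & \theta = 0 \end{cases},
\qquad
D_{\mathrm{KL}}(P_\theta \,\|\, P_0) = \begin{cases} +\infty & \theta \neq 0 \\ 0 & \theta = 0 \end{cases}
\]

so θ ↦ Pθ is continuous under the Wasserstein distance but not under the JS or KL divergence.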

  19. Wasserstein (Earth-Mover) Distance. “If each distribution is viewed as a unit amount of "dirt" piled on a given space, the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of dirt that needs to be moved times the distance it has to be moved.”
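Formally, as defined in [4]:

\[
W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big],
\]

where Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are P_r and P_g.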

  20. Wasserstein Distance • KL divergence and JS divergence are too strong for the loss function to be continuous. • The Wasserstein distance is a weaker measurement of distance such that: 1. W(Pr, Pθ) is continuous if the generator gθ is continuous. 2. W(Pr, Pθ) is continuous and differentiable almost everywhere if gθ is locally Lipschitz with a finite expectation of the local Lipschitz constants.

  21. Wasserstein Distance

  22. Optimal Transportation View of GAN Brenier potential

  23. Convex Geometry: Minkowski theorem; Alexandrov theorem; Geometric Interpretation of the Optimal Transport Map

  24. Wasserstein distance in WGAN: Kantorovich-Rubinstein Duality (see https://vincentherrmann.github.io/blog/wasserstein/). The dual form below holds when μ and ν have bounded support, where Lip(f) denotes the minimal Lipschitz constant for f.
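The duality, shown on the slide as an image, reads:

\[
W(\mu, \nu) = \sup_{\mathrm{Lip}(f) \le 1} \Big( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \Big)
\]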

  25. Implementation. Compared with the original GAN, WGAN makes four changes (a training-loop sketch follows below):
  - The Discriminator (with sigmoid activation) becomes a Critic f (without sigmoid).
  - The losses drop the log: L_D = E_{x~Pg}[f(x)] - E_{x~Pr}[f(x)] for the critic, L_G = -E_{x~Pg}[f(x)] for the generator.
  - Truncate (clip) the parameters of the Critic (Discriminator) to a fixed range after each update.
  - Do not use momentum when doing gradient descent (e.g., RMSProp instead of Adam).
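A minimal sketch of the resulting training loop, assuming PyTorch; the tiny MLP critic, generator, and random stand-in data are illustrative placeholders, not the setup from the slides or the paper:

```python
import torch
import torch.nn as nn

z_dim, img_dim = 100, 784          # e.g. flattened 28x28 images (placeholder sizes)
clip_value = 0.01                  # critic weights truncated to [-c, c]
n_critic = 5                       # critic updates per generator update

generator = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, img_dim))
critic = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
# no sigmoid on the critic output: it is an unbounded score, not a probability

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)   # no momentum
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

data_loader = [torch.randn(64, img_dim) for _ in range(100)]  # stand-in for real data

for real in data_loader:
    # --- critic updates: minimize L_D = E[f(fake)] - E[f(real)] ---
    # (for brevity the same real batch is reused for all n_critic steps)
    for _ in range(n_critic):
        z = torch.randn(real.size(0), z_dim)
        fake = generator(z).detach()
        loss_c = critic(fake).mean() - critic(real).mean()
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():             # truncation of critic parameters
            p.data.clamp_(-clip_value, clip_value)

    # --- generator update: minimize L_G = -E[f(fake)] ---
    z = torch.randn(real.size(0), z_dim)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```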

  26. Experiments

  27. References
[1] Ian J. Goodfellow et al. Generative Adversarial Nets.
[2] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes.
[3] Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks.
[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN.
[5] Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and David Xianfeng Gu. A Geometric View of Optimal Transportation and Generative Model.
