  1. Improving GANs using Game Theory and Statistics
Constantinos Daskalakis, CSAIL and EECS, MIT

  2. Min-Max Optimization
Solve: inf_θ sup_w f(θ, w), where θ and w are high-dimensional.
• Applications: Mathematics, Optimization, Game Theory, ... [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, ...]
• Best-Case Scenario: f is convex in θ and concave in w.
• Modern Applications: GANs, adversarial examples, ...
  – these exacerbate the importance of first-order methods and of non convex-concave objectives

  3. GAN Outputs
Sample images from BEGAN [Berthelot et al. 2017] and LSGAN [Mao et al. 2017].

  4. GAN Uses
Text-to-image synthesis [Reed et al. 2017], Pix2pix [Isola et al. 2017] (many examples at https://phillipi.github.io/pix2pix/), CycleGAN [Zhu et al. 2017].
Many applications:
• Domain adaptation
• Super-resolution
• Image synthesis
• Image completion
• Compressed sensing
• ...

  5. Min-Max Optimization
Solve: inf_θ sup_w f(θ, w), where θ and w are high-dimensional.
• Applications: Mathematics, Optimization, Game Theory, ... [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, ...]
• Best-Case Scenario: f is convex in θ and concave in w (a small worked example follows below).
• Modern Applications: GANs, adversarial examples, ...
  – these exacerbate the importance of first-order methods and of non convex-concave objectives
• Personal Perspective: applications of min-max optimization will multiply going forward, as ML develops more complex and harder-to-interpret algorithms
  – sup players will be introduced to check the behavior of the inf players
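To make the best-case scenario concrete, here is a small worked example (mine, not from the slides) of a convex-concave min-max problem where von Neumann's theorem applies:

```latex
% Illustrative example: f(\theta, w) = \theta w on [-1,1] x [-1,1] is
% convex (linear) in \theta and concave (linear) in w, so min-max = max-min:
\[
\inf_{\theta \in [-1,1]} \sup_{w \in [-1,1]} \theta w
  = \inf_{\theta \in [-1,1]} |\theta| = 0,
\qquad
\sup_{w \in [-1,1]} \inf_{\theta \in [-1,1]} \theta w
  = \sup_{w \in [-1,1]} \bigl(-|w|\bigr) = 0,
\]
% with the common value attained at the saddle point (\theta^*, w^*) = (0, 0),
% exactly as guaranteed by von Neumann's minimax theorem.
```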

  6. Generative Adversarial Networks (GANs) [Goodfellow et al. NeurIPS'14]
Solve inf_θ sup_w f(θ, w), where:
• Generator: DNN G_θ with parameters θ, mapping simple randomness Z ∼ N(0, I) to hallucinated images.
• Discriminator: DNN D_w with parameters w; f(θ, w) expresses how well the discriminator distinguishes real images (from the training set) from hallucinated images (from the generator).
• E.g. Wasserstein-GANs: f(θ, w) = 𝔼_{X∼p_real}[D_w(X)] − 𝔼_{Z∼N(0,I)}[D_w(G_θ(Z))]
• θ, w high-dimensional ⇝ solve the game by having the min (resp. max) player run online gradient descent (resp. ascent); see the sketch below.
• Major challenges:
  – training oscillations
  – generated & real distributions are high-dimensional ⇝ no rigorous statistical guarantees
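Here is a minimal sketch of that alternating descent/ascent loop on the W-GAN objective, assuming PyTorch; the toy 1-D data distribution, network sizes, step sizes, and iteration counts are illustrative choices of mine, not the talk's (and the usual Lipschitz constraint on the discriminator, via weight clipping or a gradient penalty, is omitted for brevity):

```python
# Minimal sketch (not the talk's code): alternating gradient descent/ascent on
# f(theta, w) = E_X[D_w(X)] - E_Z[D_w(G_theta(Z))].
import torch
import torch.nn as nn

torch.manual_seed(0)

G = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))   # generator G_theta
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # discriminator D_w

opt_G = torch.optim.SGD(G.parameters(), lr=1e-3)   # min player: gradient descent on theta
opt_D = torch.optim.SGD(D.parameters(), lr=1e-3)   # max player: gradient ascent on w

def sample_real(n):
    return 3.0 + torch.randn(n, 1)   # toy "real" data: N(3, 1)

for step in range(2000):
    # max player: ascend f in w (equivalently, descend -f)
    x_real = sample_real(128)
    z = torch.randn(128, 2)                      # simple randomness Z ~ N(0, I)
    f = D(x_real).mean() - D(G(z)).mean()        # W-GAN objective
    opt_D.zero_grad(); (-f).backward(); opt_D.step()

    # min player: descend f in theta (recompute f, since w just changed)
    z = torch.randn(128, 2)
    f = D(sample_real(128)).mean() - D(G(z)).mean()
    opt_G.zero_grad(); f.backward(); opt_G.step()

    if step % 500 == 0:
        print(step, float(f))
```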

  7. Menu
• Min-Max Optimization and Adversarial Training
• Training Challenges:
  – reducing training oscillations
• Statistical Challenges:
  – reducing sample requirements
  – attaining statistical guarantees

  8. Menu
• Min-Max Optimization and Adversarial Training
• Training Challenges:
  – reducing training oscillations
• Statistical Challenges:
  – reducing sample requirements
  – attaining statistical guarantees

  9. Training Oscillations: Gaussian Mixture
True distribution: mixture of 8 Gaussians on a circle.
Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through the modes at different steps of training. From [Metz et al. ICLR'17].

  10. Training Oscillations: Handwritten Digits
True distribution: MNIST.
Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: it cycles through "proto-digits" at different steps of training. From [Metz et al. ICLR'17].

  11. Training Oscillations: even for bilinear objectives!
• True distribution: isotropic Normal, namely X ∼ 𝒩((3,4), I_{2×2})
• Generator architecture: G_θ(Z) = θ + Z (adds the input Z to the internal parameters); Z, θ, w are 2-dimensional
• Discriminator architecture: D_w(·) = ⟨w, ·⟩ (linear projection)
• W-GAN objective: min_θ max_w 𝔼_X[D_w(X)] − 𝔼_Z[D_w(G_θ(Z))] = min_θ max_w wᵀ((3,4) − θ), a convex-concave function
• Gradient descent dynamics, from [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]; a simulation sketch follows below.
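As a check, here is a small numpy simulation of gradient descent/ascent on this bilinear objective (my own sketch; the step size, starting point, and iteration count are arbitrary choices):

```python
# Minimal sketch (not the talk's code): gradient descent/ascent (GDA) on the toy
# W-GAN objective f(theta, w) = w^T ((3,4) - theta).
import numpy as np

v = np.array([3.0, 4.0])          # mean of the true distribution N((3,4), I)
theta = np.zeros(2)               # generator parameters
w = np.array([1.0, 0.0])          # discriminator parameters
eta = 0.1

for t in range(200):
    grad_theta = -w               # d f / d theta
    grad_w = v - theta            # d f / d w
    theta, w = theta - eta * grad_theta, w + eta * grad_w   # simultaneous descent/ascent
    if t % 40 == 0:
        print(t, np.linalg.norm(theta - v), np.linalg.norm(w))

# The distance to the equilibrium (theta, w) = ((3,4), 0) grows: the iterates spiral
# outward instead of converging, which is the oscillation shown on the slide.
```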

  12. Training Oscillations: persistence under many variants of Gradient Descent. From [Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18].

  13. Training Oscillations: Online Learning Perspective
• Best-Case Scenario: Given convex-concave f(x, y), solve: min_{x∈X} max_{y∈Y} f(x, y)
• [von Neumann '28]: min-max = max-min; solvable via convex programming
• Online Learning: if the min and max players run any no-regret learning procedure, they converge to a minimax equilibrium
• E.g. follow-the-regularized-leader (FTRL), follow-the-perturbed-leader, MWU
• Follow-the-regularized-leader with ℓ₂-regularization ≡ gradient descent
• "Convergence": the sequence (x_t, y_t)_t converges to minimax equilibrium in the average sense, i.e. f((1/t)·Σ_{τ≤t} x_τ, (1/t)·Σ_{τ≤t} y_τ) → min_{x∈X} max_{y∈Y} f(x, y) (see the sketch below)
• Can we show point-wise convergence of no-regret learning methods?
• [Mertikopoulos-Papadimitriou-Piliouras SODA'18]: No, for any FTRL
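The following small numpy sketch (mine, not from the slides) illustrates average-iterate versus last-iterate behavior when both players run projected online gradient descent/ascent, i.e. FTRL with ℓ₂ regularization, on the convex-concave toy objective f(x, y) = x·y over [−1, 1] × [−1, 1]; the step-size schedule and horizon are arbitrary choices:

```python
# Minimal sketch: the no-regret guarantee controls the time-averaged iterates,
# while the last iterates keep circling the equilibrium (0, 0).
import numpy as np

x, y = 1.0, 1.0
xs, ys = [], []
for t in range(1, 20001):
    eta = 1.0 / np.sqrt(t)                 # standard no-regret step-size schedule
    gx, gy = y, x                          # grad_x f = y, grad_y f = x
    x = np.clip(x - eta * gx, -1.0, 1.0)   # min player: projected descent
    y = np.clip(y + eta * gy, -1.0, 1.0)   # max player: projected ascent
    xs.append(x); ys.append(y)

x_avg, y_avg = np.mean(xs), np.mean(ys)
print("last iterate:", (x, y))             # still far from (0, 0)
print("average iterate:", (x_avg, y_avg))  # approaches the equilibrium (0, 0) as T grows
print("f(avg):", x_avg * y_avg)            # approaches the min-max value 0
```

The guarantee from no-regret learning applies only to the averaged iterates; the last iterates keep circling, which is exactly the point-wise convergence question raised on the slide.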

  14. Negative Momentum
• Variant of gradient descent: ∀t: x_{t+1} = x_t − η·∇f(x_t) + (η/2)·∇f(x_{t−1})
• Interpretation: undo today some of yesterday's gradient, i.e. negative momentum
• Gradient descent with negative momentum
  = Optimistic FTRL with ℓ₂-regularization [Rakhlin-Sridharan COLT'13, Syrgkanis et al. NeurIPS'15]
  ≈ extra-gradient method [Korpelevich '76, Chiang et al. COLT'12, Gidel et al. '18, Mertikopoulos et al. '18]
• Does it help in min-max optimization?

  15. Negative Momentum: why it could help
• E.g. f(x, y) = (x − 0.5)·(y − 0.5)
• Gradient descent/ascent (GDA):
  x_{t+1} = x_t − η·∇_x f(x_t, y_t)
  y_{t+1} = y_t + η·∇_y f(x_t, y_t)
• Optimistic gradient descent/ascent (negative momentum):
  x_{t+1} = x_t − η·∇_x f(x_t, y_t) + (η/2)·∇_x f(x_{t−1}, y_{t−1})
  y_{t+1} = y_t + η·∇_y f(x_t, y_t) − (η/2)·∇_y f(x_{t−1}, y_{t−1})
• (Figure: trajectories from the start point relative to the min-max equilibrium; a simulation sketch follows below.)
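A quick numerical comparison of the two dynamics on this example (my own sketch; step size, initialization, and horizon are arbitrary):

```python
# Minimal sketch (not the talk's code): plain GDA vs. the negative-momentum /
# optimistic variant (OGDA) on f(x, y) = (x - 0.5) * (y - 0.5).
import numpy as np

def grad(x, y):
    return np.array([y - 0.5, x - 0.5])     # (df/dx, df/dy)

eta, T = 0.1, 2000
eq = np.array([0.5, 0.5])                   # min-max equilibrium

# Plain GDA: x descends, y ascends.
x, y = 1.0, 1.0
for t in range(T):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y + eta * gy
print("GDA  distance to equilibrium:", np.linalg.norm([x, y] - eq))   # grows (spirals out)

# OGDA: same steps, plus "undo half of yesterday's gradient".
x, y = 1.0, 1.0
gx_prev, gy_prev = 0.0, 0.0                 # no previous gradient at t = 0
for t in range(T):
    gx, gy = grad(x, y)
    x = x - eta * gx + (eta / 2) * gx_prev
    y = y + eta * gy - (eta / 2) * gy_prev
    gx_prev, gy_prev = gx, gy
print("OGDA distance to equilibrium:", np.linalg.norm([x, y] - eq))   # shrinks toward 0
```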

  16. Negative Momentum: convergence
• Optimistic gradient descent-ascent (OGDA) dynamics:
  ∀t: x_{t+1} = x_t − η·∇_x f(x_t, y_t) + (η/2)·∇_x f(x_{t−1}, y_{t−1})
      y_{t+1} = y_t + η·∇_y f(x_t, y_t) − (η/2)·∇_y f(x_{t−1}, y_{t−1})
• [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: min_{x∈ℝⁿ} max_{y∈ℝᵐ} f(x, y) = xᵀAy + bᵀx + cᵀy
• [Liang-Stokes '18]: ...the convergence rate is geometric if A is well-conditioned; extends to strongly convex-concave functions f(x, y)
• E.g. in the previous isotropic Gaussian case: X ∼ 𝒩((3,4), I_{2×2}), G_θ(Z) = θ + Z, D_w(·) = ⟨w, ·⟩

  17. Negative Momentum: convergence
• Optimistic gradient descent-ascent (OGDA) dynamics:
  ∀t: x_{t+1} = x_t − η·∇_x f(x_t, y_t) + (η/2)·∇_x f(x_{t−1}, y_{t−1})
      y_{t+1} = y_t + η·∇_y f(x_t, y_t) − (η/2)·∇_y f(x_{t−1}, y_{t−1})
• [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: min_{x∈ℝⁿ} max_{y∈ℝᵐ} f(x, y) = xᵀAy + bᵀx + cᵀy (a numerical sketch follows below)
• [Liang-Stokes '18]: ...the convergence rate is geometric if A is well-conditioned; extends to strongly convex-concave functions f(x, y)
• [Daskalakis-Panageas ITCS'18]: Projected OGDA exhibits last-iterate convergence even for constrained bilinear games: min_{x∈Δⁿ} max_{y∈Δᵐ} xᵀAy (= all of linear programming)
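The unconstrained bilinear claim can be checked numerically; in the sketch below (mine), the coupling matrix, step size, and horizon are arbitrary choices, and the saddle point is computed in closed form from ∇_x f = Ay + b = 0 and ∇_y f = Aᵀx + c = 0:

```python
# Minimal sketch: last-iterate behavior of OGDA on an unconstrained bilinear game
# f(x, y) = x^T A y + b^T x + c^T y with a well-conditioned A.
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned coupling matrix
b, c = rng.standard_normal(n), rng.standard_normal(n)

x_star = -np.linalg.solve(A.T, c)    # from grad_y f = A^T x + c = 0
y_star = -np.linalg.solve(A, b)      # from grad_x f = A y + b = 0

eta = 0.1
x, y = np.zeros(n), np.zeros(n)
gx_prev, gy_prev = np.zeros(n), np.zeros(n)
for t in range(1, 3001):
    gx, gy = A @ y + b, A.T @ x + c                    # grad_x f, grad_y f
    x = x - eta * gx + (eta / 2) * gx_prev             # optimistic descent step
    y = y + eta * gy - (eta / 2) * gy_prev             # optimistic ascent step
    gx_prev, gy_prev = gx, gy
    if t % 500 == 0:
        dist = np.hypot(np.linalg.norm(x - x_star), np.linalg.norm(y - y_star))
        print(t, dist)   # distance to the saddle decays roughly geometrically in t
```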

  18. Negative Momentum: in the Wild
• Can try optimism for non convex-concave min-max objectives f(x, y)
• Issue [Daskalakis, Panageas NeurIPS'18]: No hope that the stable points of OGDA or GDA are only the local min-max points
• e.g. f(x, y) = −(1/8)·x² − (1/2)·y² + (6/10)·x·y (gradient descent-ascent field shown on the slide; a stability check is sketched below)
• Nested-ness: Local Min-Max ⊆ Stable Points of GDA ⊆ Stable Points of OGDA
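To see how such a "bad" stable point arises, here is a small check (my own, using the objective as reconstructed above): at (0, 0) the gradient vanishes and the Jacobian of the continuous-time GDA field has only eigenvalues with negative real part, so GDA is locally attracted there; yet f is strictly concave in x at that point, so it is not a local min-max point.

```python
# Stability check (my own sketch) for f(x, y) = -(1/8)x^2 - (1/2)y^2 + (6/10)xy,
# with x the min player and y the max player.
import numpy as np

# Gradients of f:
#   df/dx = -(1/4) x + (3/5) y
#   df/dy = -y + (3/5) x
# Continuous-time GDA field: (dx/dt, dy/dt) = (-df/dx, +df/dy).
J = np.array([[1/4, -3/5],       # Jacobian of the GDA field at (0, 0)
              [3/5, -1.0]])
eigvals = np.linalg.eigvals(J)
print("eigenvalues of the GDA field Jacobian:", eigvals)   # both have negative real part
print("stable point of GDA:", np.all(eigvals.real < 0))    # True

# But d^2 f / dx^2 = -1/4 < 0: f is strictly concave in x at (0, 0), so x = 0 is a
# local max (not a local min) for the min player, i.e. (0, 0) is NOT a local
# min-max point, despite attracting the GDA dynamics.
print("d2f/dx2 at (0,0):", -1/4)
```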

  19. Negative Momentum: in the Wild
• Can try optimism for non convex-concave min-max objectives f(x, y)
• Issue [Daskalakis, Panageas NeurIPS'18]: No hope that the stable points of OGDA or GDA are only the local min-max points
• Local Min-Max ⊆ Stable Points of GDA ⊆ Stable Points of OGDA
• also [Adolphs et al. '18]: left inclusion
• Question: identify a first-order method converging to local min-max with probability 1
• While this is pending, evaluate optimism in practice...
• [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: propose Optimistic Adam (a sketch of the idea follows below)
• Adam, a variant of gradient descent proposed by [Kingma-Ba ICLR'15], has found wide adoption in deep learning, although it doesn't always converge [Reddi-Kale-Kumar ICLR'18]
• Optimistic Adam is the right adaptation of Adam to "undo some of the past gradients"
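The following is a sketch of the optimistic-Adam idea only (my own rendering, not necessarily the exact algorithm from [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]): take a standard Adam step on the current preconditioned gradient, then undo half of the previous preconditioned gradient, mirroring the negative-momentum rule above. The function and hyper-parameters in the toy usage are arbitrary.

```python
# Sketch of the optimistic-Adam idea (not necessarily the paper's exact algorithm):
# Adam's preconditioned gradient, combined with the "undo some of the past gradient" rule.
import numpy as np

def optimistic_adam_minimize(grad_fn, x0, eta=1e-2, beta1=0.9, beta2=0.999,
                             eps=1e-8, steps=2000):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)          # first-moment estimate
    v = np.zeros_like(x)          # second-moment estimate
    step_prev = np.zeros_like(x)  # previous preconditioned gradient
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction, as in Adam
        v_hat = v / (1 - beta2 ** t)
        step = m_hat / (np.sqrt(v_hat) + eps) # Adam-preconditioned gradient
        x = x - eta * step + (eta / 2) * step_prev   # optimistic / negative-momentum update
        step_prev = step
    return x

# Toy usage: minimize ||x - 1||^2 (in a GAN, the max player runs the mirror-image ascent).
print(optimistic_adam_minimize(lambda x: 2 * (x - 1.0), np.zeros(3)))
```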

  20. Optimistic Adam on CIFAR10
• Compare Adam and Optimistic Adam, trained on CIFAR10, in terms of Inception Score
• No fine-tuning for Optimistic Adam: the same hyper-parameters were used for both algorithms, as suggested in Gulrajani et al. (2017)
