Improving GANs using Game Theory and Statistics
Constantinos Daskalakis
CSAIL and EECS, MIT
Min-Max Optimization
Solve: inf_θ sup_w f(θ, w), where θ, w are high-dimensional
• Applications: Mathematics, Optimization, Game Theory, … [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
• Best-Case Scenario: f is convex in θ, concave in w
• Modern Applications: GANs, adversarial examples, … ⇒ raise the importance of first-order methods and of non convex-concave objectives
GAN Outputs
[Image samples] BEGAN: Berthelot et al. 2017. LSGAN: Mao et al. 2017.
GAN Uses
Text-to-Image Synthesis: Reed et al. 2017. Pix2pix: Isola et al. 2017 (many examples at https://phillipi.github.io/pix2pix/). CycleGAN: Zhu et al. 2017.
Many applications:
• Domain adaptation
• Super-resolution
• Image synthesis
• Image completion
• Compressed sensing
• …
Min-Max Optimization
Solve: inf_θ sup_w f(θ, w), where θ, w are high-dimensional
• Applications: Mathematics, Optimization, Game Theory, … [von Neumann 1928, Dantzig '47, Brown '50, Robinson '51, Blackwell '56, …]
• Best-Case Scenario: f is convex in θ, concave in w
• Modern Applications: GANs, adversarial examples, … ⇒ raise the importance of first-order methods and of non convex-concave objectives
• Personal Perspective: applications of min-max optimization will multiply, going forward, as ML develops more complex and harder-to-interpret algorithms ⇒ sup players will be introduced to check the behavior of the inf players
Generative Adversarial Networks (GANs) [Goodfellow et al. NeurIPS'14]
inf_θ sup_w f(θ, w)
• Generator: DNN with parameters θ; maps simple randomness z ∼ N(0, I) to hallucinated images G_θ(z)
• Discriminator: DNN with parameters w; answers "real or hallucinated?"; f(θ, w) expresses how well the discriminator distinguishes true images (from the training set) vs. generated images (from the generator)
• e.g. Wasserstein-GANs: f(θ, w) = E_{x∼p_real}[D_w(x)] − E_{z∼N(0,I)}[D_w(G_θ(z))] (see the numerical sketch below)
• θ, w high-dimensional ⇒ solve the game by having the min (resp. max) player run online gradient descent (resp. ascent)
• Major challenges:
  - training oscillations
  - generated & real distributions are high-dimensional ⇒ no rigorous statistical guarantees
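A minimal sketch, not from the talk, of how the W-GAN objective above can be estimated by Monte Carlo with the simplest possible architectures (G_θ(z) = θ + z and D_w(x) = ⟨w, x⟩, the same toy setup used a few slides later); the "real" distribution N((3, 4), I) and all constants are my own choices for illustration.

```python
# Monte-Carlo estimate of f(theta, w) = E_{x~p_real}[D_w(x)] - E_{z~N(0,I)}[D_w(G_theta(z))]
import numpy as np

rng = np.random.default_rng(0)
d = 2
real_mean = np.array([3.0, 4.0])                  # toy "real" data: N(real_mean, I)

def G(theta, z):
    return theta + z                              # generator: shift the noise by theta

def D(w, x):
    return x @ w                                  # discriminator: linear projection <w, x>

def wgan_objective(theta, w, n=10_000):
    x = real_mean + rng.standard_normal((n, d))   # samples from the real distribution
    z = rng.standard_normal((n, d))               # generator's input noise
    return D(w, x).mean() - D(w, G(theta, z)).mean()

theta = np.zeros(d)
w = np.ones(d)
# For this linear setup the exact value is <w, real_mean - theta> = 7
print("f(theta, w) estimate:", wgan_objective(theta, w))
```

In training, the min player would take gradient steps on θ to decrease this quantity while the max player takes gradient steps on w to increase it.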
Menu
• Min-Max Optimization and Adversarial Training
• Training Challenges:
  - reducing training oscillations
• Statistical Challenges:
  - reducing sample requirements
  - attaining statistical guarantees
Training Oscillations: Gaussian Mixture
True distribution: mixture of 8 Gaussians on a circle
Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: cycles through the modes at different steps of training
[Figures from Metz et al. ICLR'17]
Training Oscillations: Handwritten Digits
True distribution: MNIST
Output distribution of a standard GAN, trained via gradient descent/ascent dynamics: cycles through "proto-digits" at different steps of training
[Figures from Metz et al. ICLR'17]
Training Oscillations: even for bilinear objectives!
• True distribution: isotropic Normal, namely x ∼ N((3, 4)ᵀ, I_{2×2})
• Generator architecture: G_θ(z) = θ + z (adds its internal parameters θ to the input z); z, θ, w are 2-dimensional
• Discriminator architecture: D_w(x) = ⟨w, x⟩ (linear projection)
• W-GAN objective: min_θ max_w E_x[D_w(x)] − E_z[D_w(G_θ(z))] = min_θ max_w wᵀ((3, 4)ᵀ − θ), a convex-concave function (see the simulation below)
[Figure: Gradient Descent dynamics; from Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]
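A small simulation, written by me rather than taken from the paper, of plain gradient descent-ascent (GDA) on this slide's bilinear objective f(θ, w) = wᵀ(v − θ) with v = (3, 4); the step size, initialization, and horizon are arbitrary. The unique equilibrium is θ* = v, w* = 0, yet the distance to it grows.

```python
import numpy as np

v = np.array([3.0, 4.0])
theta = np.zeros(2)            # generator parameters
w = np.array([1.0, 0.0])       # discriminator parameters
eta = 0.1

for t in range(201):
    g_theta = -w               # grad_theta f(theta, w)
    g_w = v - theta            # grad_w   f(theta, w)
    # simultaneous descent (min player) / ascent (max player)
    theta, w = theta - eta * g_theta, w + eta * g_w
    if t % 50 == 0:
        dist = np.sqrt(np.sum((theta - v) ** 2) + np.sum(w ** 2))
        print(f"step {t:4d}  distance to equilibrium = {dist:.3f}")   # keeps growing
```

The printed distance increases roughly geometrically: the iterates spiral around the equilibrium instead of settling, which is exactly the oscillation phenomenon shown on the slide.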
Training Oscillations: persistence under many variants of Gradient Descent
[Figures from Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18]
Training Oscillations: Online Learning Perspective
• Best-Case Scenario: given convex-concave f(x, y), solve min_{x∈X} max_{y∈Y} f(x, y)
• [von Neumann '28]: min-max = max-min; solvable via convex programming
• Online learning: if the min and max players run any no-regret learning procedure, they converge to a minimax equilibrium
  - e.g. follow-the-regularized-leader (FTRL), follow-the-perturbed-leader, MWU
  - follow-the-regularized-leader with ℓ2-regularization ≡ gradient descent
• "Convergence": the sequence (x_t, y_t)_t converges to minimax equilibrium in the average sense, i.e. f((1/t)·Σ_{τ≤t} x_τ, (1/t)·Σ_{τ≤t} y_τ) → min_{x∈X} max_{y∈Y} f(x, y) (see the MWU sketch below)
• Can we show point-wise convergence of no-regret learning methods?
  - [Mertikopoulos-Papadimitriou-Piliouras SODA'18]: No, for any FTRL
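A minimal sketch of the average-vs-pointwise distinction, using multiplicative weights (entropy-regularized FTRL) on rock-paper-scissors; the game, step size, and horizon are my own choices, not the talk's. The duality gap of the time-averaged strategies shrinks, while the last iterates keep cycling away from the uniform equilibrium.

```python
import numpy as np

A = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])            # payoff to the max player; value of the game is 0
T = 20000
eta = np.sqrt(np.log(3) / T)             # standard MWU step size
cx = np.zeros(3)                         # cumulative losses of the min player's actions
cy = np.zeros(3)                         # cumulative payoffs of the max player's actions
x_sum = np.zeros(3); y_sum = np.zeros(3)

for t in range(T):
    x = np.exp(-eta * (cx - cx.min())); x /= x.sum()   # MWU iterate, min player
    y = np.exp(eta * (cy - cy.max()));  y /= y.sum()   # MWU iterate, max player
    x_sum += x; y_sum += y
    cx += A @ y                                        # loss vector seen by the min player
    cy += A.T @ x                                      # payoff vector seen by the max player

x_bar, y_bar = x_sum / T, y_sum / T
gap = (A.T @ x_bar).max() - (A @ y_bar).min()          # duality gap of the averages
print("avg-iterate duality gap:", gap)                 # small, shrinks as T grows
print("last iterate x:", np.round(x, 3))               # typically far from uniform: it cycles
```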
Negative Momentum
• Variant of gradient descent: ∀t: x_{t+1} = x_t − η·∇f(x_t) + (η/2)·∇f(x_{t−1})
• Interpretation: undo, today, some of yesterday's gradient; i.e. negative momentum
• Gradient descent with negative momentum = Optimistic FTRL with ℓ2-regularization [Rakhlin-Sridharan COLT'13, Syrgkanis et al. NeurIPS'15] ≈ extra-gradient method [Korpelevich '76, Chiang et al. COLT'12, Gidel et al. '18, Mertikopoulos et al. '18]
• Does it help in min-max optimization?
Negative Momentum: why it could help
• E.g. f(x, y) = (x − 0.5)·(y − 0.5)
• Gradient Descent-Ascent:
  x_{t+1} = x_t − η·∇_x f(x_t, y_t)
  y_{t+1} = y_t + η·∇_y f(x_t, y_t)
• Optimistic Gradient Descent-Ascent:
  x_{t+1} = x_t − η·∇_x f(x_t, y_t) + (η/2)·∇_x f(x_{t−1}, y_{t−1})
  y_{t+1} = y_t + η·∇_y f(x_t, y_t) − (η/2)·∇_y f(x_{t−1}, y_{t−1})
[Figure: trajectories of the two dynamics; markers: start point, min-max equilibrium; compare the numerical sketch below]
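A numerical sketch of my own, not the speaker's code, comparing the two update rules above on this slide's toy objective f(x, y) = (x − 0.5)(y − 0.5); the step size, starting point, and number of steps are arbitrary. GDA drifts away from the min-max equilibrium (0.5, 0.5), while the optimistic variant converges to it.

```python
import numpy as np

def grad(x, y):
    return y - 0.5, x - 0.5              # (df/dx, df/dy) for f(x, y) = (x - 0.5)(y - 0.5)

eta, T = 0.3, 500

# Plain gradient descent-ascent
x, y = 1.5, 1.5
for _ in range(T):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y + eta * gy
print("GDA  last iterate:", (round(x, 3), round(y, 3)))    # far from (0.5, 0.5): it blows up

# Optimistic GDA: take today's gradient step, undo half of yesterday's
x, y = 1.5, 1.5
gx_prev, gy_prev = grad(x, y)
for _ in range(T):
    gx, gy = grad(x, y)
    x = x - eta * gx + 0.5 * eta * gx_prev
    y = y + eta * gy - 0.5 * eta * gy_prev
    gx_prev, gy_prev = gx, gy
print("OGDA last iterate:", (round(x, 3), round(y, 3)))    # close to (0.5, 0.5)
```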
Negative Momentum: convergence
• Optimistic gradient descent-ascent (OGDA) dynamics:
  ∀t: x_{t+1} = x_t − η·∇_x f(x_t, y_t) + (η/2)·∇_x f(x_{t−1}, y_{t−1})
      y_{t+1} = y_t + η·∇_y f(x_t, y_t) − (η/2)·∇_y f(x_{t−1}, y_{t−1})
• [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: min_{x∈ℝⁿ} max_{y∈ℝⁿ} f(x, y) = xᵀAy + bᵀx + cᵀy
• [Liang-Stokes '18]: … the convergence rate is geometric if A is well-conditioned; extends to strongly convex-concave functions f(x, y)
• E.g. in the previous isotropic Gaussian case: x ∼ N((3, 4)ᵀ, I_{2×2}), G_θ(z) = z + θ, D_w(x) = ⟨w, x⟩
Negative Momentum: convergence
• Optimistic gradient descent-ascent (OGDA) dynamics:
  ∀t: x_{t+1} = x_t − η·∇_x f(x_t, y_t) + (η/2)·∇_x f(x_{t−1}, y_{t−1})
      y_{t+1} = y_t + η·∇_y f(x_t, y_t) − (η/2)·∇_y f(x_{t−1}, y_{t−1})
• [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: OGDA exhibits last-iterate convergence for unconstrained bilinear games: min_{x∈ℝⁿ} max_{y∈ℝⁿ} f(x, y) = xᵀAy + bᵀx + cᵀy
• [Liang-Stokes '18]: … the convergence rate is geometric if A is well-conditioned; extends to strongly convex-concave functions f(x, y)
• [Daskalakis-Panageas ITCS'18]: Projected OGDA exhibits last-iterate convergence even for constrained bilinear games: min_{x∈Δⁿ} max_{y∈Δᵐ} xᵀAy (= all of linear programming)
Negative Momentum: in the Wild
• Can try optimism for non convex-concave min-max objectives f(x, y)
• Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only the local min-max points
• e.g. f(x, y) = −(1/8)·x² − (1/2)·y² + (6/10)·x·y (see the stability check below)
[Figure: Gradient Descent-Ascent vector field]
• Nested-ness: Local Min-Max ⊆ Stable Points of GDA ⊆ Stable Points of OGDA
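A quick check of my own, using the example function as reconstructed above (the exact constants are my reading of the slide): the origin is a stable point of the gradient descent-ascent dynamics, yet it is not a local min-max point, since f(x, 0) = −(1/8)x² is locally maximized, not minimized, at x = 0.

```python
import numpy as np

# Continuous-time GDA on f(x, y) = -(1/8) x^2 - (1/2) y^2 + (6/10) x y:
#   x' = -df/dx = (1/4) x - 0.6 y,   y' = +df/dy = 0.6 x - y
J = np.array([[0.25, -0.6],
              [0.6, -1.0]])                       # Jacobian of the GDA vector field at (0, 0)
eigs = np.linalg.eigvals(J)
print("GDA Jacobian eigenvalues:", eigs)          # both have negative real part
print("stable for GDA?", bool(np.all(eigs.real < 0)))
print("d2f/dx2 at the origin:", -0.25, "-> (0, 0) is not a local min in x, so not local min-max")
```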
Negative Momentum: in the Wild
• Can try optimism for non convex-concave min-max objectives f(x, y)
• Issue [Daskalakis, Panageas NeurIPS'18]: no hope that the stable points of OGDA or GDA are only the local min-max points
  - Local Min-Max ⊆ Stable Points of GDA ⊆ Stable Points of OGDA
  - also [Adolphs et al. '18]: the left inclusion
• Question: identify a first-order method converging to local min-max with probability 1
• While this is pending, evaluate optimism in practice…
  - [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]: propose Optimistic Adam
  - Adam, a variant of gradient descent proposed by [Kingma-Ba ICLR'15], has found wide adoption in deep learning, although it doesn't always converge [Reddi-Kale-Kumar ICLR'18]
  - Optimistic Adam is the right adaptation of Adam to "undo some of the past gradients" (a sketch follows below)
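A hedged sketch of how I read the "optimistic Adam" recipe on this slide: keep Adam's moment estimates, but make the parameter update optimistic by taking the current Adam step twice and undoing the previous one. The constants, bias correction, and other details may differ from the algorithm in [Daskalakis-Ilyas-Syrgkanis-Zeng ICLR'18]; the function name and the toy quadratic usage are mine.

```python
import numpy as np

def optimistic_adam(grad_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x); v = np.zeros_like(x)
    prev_step = np.zeros_like(x)                  # previous Adam step, to be partially undone
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g           # first-moment estimate, as in Adam
        v = beta2 * v + (1 - beta2) * g * g       # second-moment estimate, as in Adam
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        step = lr * m_hat / (np.sqrt(v_hat) + eps)
        x = x - 2.0 * step + prev_step            # optimistic update: 2x current step, undo previous
        prev_step = step
    return x

# Toy usage: minimize ||x||^2 (gradient 2x); the iterate should end up near the origin.
print(optimistic_adam(lambda x: 2 * x, x0=[1.0, -2.0], steps=2000))
```

In a GAN, the same update would be applied to the discriminator with the opposite sign (ascent), mirroring the OGDA structure from the earlier slides.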
Optimistic Adam on CIFAR10
• Compare Adam and Optimistic Adam, trained on CIFAR10, in terms of Inception Score
• No fine-tuning for Optimistic Adam: used the same hyper-parameters for both algorithms, as suggested in Gulrajani et al. (2017)