Generative Adversarial Networks
Aaron Mishkin, UBC MLRG 2018W2
Generative Adversarial Networks
“Two imaginary celebrities that were dreamed up by a random number generator.”
https://research.nvidia.com/publication/2017-10 Progressive-Growing-of
Why care about GANs?
Why spend your limited time learning about GANs:
• GANs are achieving state-of-the-art results in a large variety of image generation tasks.
• There has been a veritable explosion in GAN publications over the last few years – many people are very excited!
• GANs are stimulating new theoretical interest in min-max optimization problems and “smooth games”.
Why care about GANs: Hyper-realistic Image Generation
StyleGAN: image generation with hierarchical style transfer [3].
https://arxiv.org/abs/1812.04948
Why care about GANs: Conditional Generative Models
Conditional GANs: high-resolution image synthesis via semantic labeling [8].
Input: segmentation map. Output: synthesized image.
https://research.nvidia.com/publication/2017-12 High-Resolution-Image-Synthesis
Why care about GANs: Image Super-Resolution
SRGAN: photo-realistic super-resolution [4].
(Figure: bicubic interpolation vs. SRGAN vs. the original image.)
https://arxiv.org/abs/1609.04802
Why care about GANs: Publications
Approximately 500 GAN papers as of September 2018!
See https://github.com/hindupuravinash/the-gan-zoo for the exhaustive list of papers.
Image credit: https://github.com/bgavran
Generative Models
Generative Modeling
Generative models estimate the probabilistic process that generated a set of observations $\mathcal{D}$.
• $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$: supervised generative models learn the joint distribution $p(x_i, y_i)$, often to compute $p(y_i \mid x_i)$.
• $\mathcal{D} = \{x_i\}_{i=1}^{n}$: unsupervised generative models learn the distribution of $\mathcal{D}$ for clustering, sampling, etc.
We can:
• directly estimate $p(x_i)$, or
• introduce latent variables $z_i$ and estimate $p(x_i, z_i)$.
Generative Modeling: Unsupervised Parametric Approaches
• Direct Estimation: choose a parameterized family $p(x \mid \theta)$ and learn $\theta$ by maximizing the log-likelihood
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta).$$
• Latent Variable Models: define a joint distribution $p(x, z \mid \theta)$ and learn $\theta$ by maximizing the log-marginal likelihood
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{n} \log \int p(x_i, z_i \mid \theta) \, dz_i.$$
Both approaches require that $p(x \mid \theta)$ is easy to evaluate.
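To make direct estimation concrete, here is a minimal sketch that fits a univariate Gaussian family $p(x \mid \theta) = \mathcal{N}(x; \mu, \sigma^2)$ by gradient-based maximization of the log-likelihood. The data, parameter names, and optimizer settings are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of direct maximum-likelihood estimation for a
# univariate Gaussian family p(x | theta) = N(x; mu, sigma^2).
# The data and optimizer settings below are made up for illustration.
import torch

torch.manual_seed(0)
data = 2.0 + 0.5 * torch.randn(1000)             # "observations" D

mu = torch.zeros(1, requires_grad=True)          # theta = (mu, log_sigma)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    sigma = log_sigma.exp()
    # sum_i log p(x_i | theta): the log-likelihood of the data set
    log_lik = torch.distributions.Normal(mu, sigma).log_prob(data).sum()
    loss = -log_lik                              # minimize the negative
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())         # approaches ~2.0 and ~0.5
```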
Generative Modeling: Models for (Very) Complex Data
How can we learn such models for very complex data?
https://www.researchgate.net/figure/Heterogeneousness-and-diversity-of-the-CIFAR-10-entries-in-their-10-
Generative Modeling: Normalizing Flows and VAEs
Design parameterized densities with huge capacity!
• Normalizing flows: apply a sequence of invertible non-linear transformations to a simple distribution $p_z(z)$:
$$p(x \mid \theta_{0:k}) = p_z(z) \left| \det \frac{\partial z}{\partial x} \right|, \quad \text{where } z = f_{\theta_k}^{-1} \circ \cdots \circ f_{\theta_1}^{-1} \circ f_{\theta_0}^{-1}(x).$$
Each $f_{\theta_j}$ must be invertible with a tractable log-determinant Jacobian.
• VAEs: latent-variable models where inference networks specify the parameters:
$$p(x, z \mid \theta) = p(x \mid f_\theta(z)) \, p_z(z).$$
The marginal likelihood is maximized via the ELBO.
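As a toy illustration of the flow idea, the sketch below implements a single affine transformation and evaluates $p(x \mid \theta)$ with the change-of-variables formula. The parameters and base distribution are arbitrary assumptions; a real normalizing flow stacks many invertible layers.

```python
# Sketch of the change-of-variables density for a single affine "flow":
# x = f_theta(z) = a * z + b, so z = f_theta^{-1}(x) = (x - b) / a and
# log p(x) = log p_z(z) - log|a|.  (Illustrative only.)
import torch

base = torch.distributions.Normal(0.0, 1.0)     # simple base density p_z
a, b = torch.tensor(2.0), torch.tensor(1.0)     # flow parameters theta

def log_prob(x):
    z = (x - b) / a                             # f_theta^{-1}(x)
    log_det = torch.log(torch.abs(a))           # log |d f_theta / d z|
    return base.log_prob(z) - log_det           # change of variables

x = torch.tensor([0.0, 1.0, 3.0])
print(log_prob(x))
# Matches the density of N(1, 2^2) evaluated at the same points:
print(torch.distributions.Normal(1.0, 2.0).log_prob(x))
```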
GANs
GANs: Density-Free Models
Generative Adversarial Networks (GANs) instead use an unrestricted generator $G_{\theta_g}(z)$ such that
$$p(x \mid \theta_g) = p_z(\{z\}), \quad \text{where } \{z\} = G_{\theta_g}^{-1}(x).$$
• Problem: the inverse image of $G_{\theta_g}(z)$ may be huge!
• Problem: it is likely intractable to preserve volume through $G_{\theta_g}(z)$.
So, we can't evaluate $p(x \mid \theta_g)$ and we can't learn $\theta_g$ by maximum likelihood.
GANs: Discriminators
GANs learn by comparing model samples with examples from $\mathcal{D}$.
• Sampling from the generator is easy:
$$\hat{x} = G_{\theta_g}(\hat{z}), \quad \text{where } \hat{z} \sim p_z(z).$$
• Given a sample $\hat{x}$, a discriminator tries to distinguish it from true examples:
$$D(x) = \Pr(x \sim p_{\text{data}}).$$
• The discriminator “supervises” the generator network.
GANs: Generator + Discriminator
https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
GANs: Goodfellow et al. (2014)
• Let $z \in \mathbb{R}^m$ and $p_z(z)$ be a simple base distribution.
• The generator $G_{\theta_g}(z) : \mathbb{R}^m \to \tilde{\mathcal{D}}$ is a deep neural network.
• $\tilde{\mathcal{D}}$ is the manifold of generated examples.
• The discriminator $D_{\theta_d}(x) : \mathcal{D} \cup \tilde{\mathcal{D}} \to (0, 1)$ is also a deep neural network.
https://arxiv.org/abs/1511.06434
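A minimal PyTorch sketch of the two players is below. The layer sizes and dimensions are arbitrary assumptions for a toy setting, not the architectures used in the referenced papers.

```python
# Hypothetical sketch of the two players: a generator
# G_theta_g : R^m -> data space and a discriminator
# D_theta_d : data space -> (0, 1).  Sizes are arbitrary.
import torch
import torch.nn as nn

m, data_dim = 16, 2                             # latent and data dimensions

G = nn.Sequential(                              # G_theta_g(z)
    nn.Linear(m, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)

D = nn.Sequential(                              # D_theta_d(x)
    nn.Linear(data_dim, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),             # output in (0, 1)
)

z = torch.randn(5, m)                           # z ~ p_z(z) = N(0, I)
x_fake = G(z)                                   # generated samples
print(D(x_fake).shape)                          # (5, 1) probabilities
```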
GANs: Saddle-Point Optimization
Saddle-point optimization: learn $G_{\theta_g}(z)$ and $D_{\theta_d}(x)$ jointly via the objective $V(\theta_d, \theta_g)$:
$$\min_{\theta_g} \max_{\theta_d} \; \underbrace{\mathbb{E}_{p_{\text{data}}}\bigl[\log D_{\theta_d}(x)\bigr]}_{\text{likelihood of true data}} + \underbrace{\mathbb{E}_{p_z(z)}\bigl[\log\bigl(1 - D_{\theta_d}(G_{\theta_g}(z))\bigr)\bigr]}_{\text{likelihood of generated data}}$$
GANs: Optimal Discriminators
Claim: Given $G_{\theta_g}$ defining an implicit distribution $p_g = p(x \mid \theta_g)$, the optimal discriminator is
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}.$$
Proof sketch:
$$V(\theta_d, \theta_g) = \int_{\mathcal{D}} p_{\text{data}}(x) \log D(x) \, dx + \int p_z(z) \log\bigl(1 - D(G_{\theta_g}(z))\bigr) \, dz$$
$$= \int_{\mathcal{D} \cup \tilde{\mathcal{D}}} p_{\text{data}}(x) \log D(x) + p_g(x) \log\bigl(1 - D(x)\bigr) \, dx.$$
Maximizing the integrand pointwise for every $x$ is sufficient and gives the result (see bonus slides).
Previous slide: https://commons.wikimedia.org/wiki/File:Saddle_point.svg
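The claim is easy to check numerically. The sketch below picks two known 1-D densities as stand-ins for $p_{\text{data}}$ and $p_g$ (purely illustrative choices) and confirms that, at a fixed $x$, the pointwise integrand $p_{\text{data}}(x)\log D + p_g(x)\log(1-D)$ is maximized at $D^*(x)$.

```python
# Numerical check of the optimal-discriminator claim with two made-up
# 1-D densities standing in for p_data and p_g.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 0.3                                    # any fixed point
p_data, p_g = normal_pdf(x, 0.0, 1.0), normal_pdf(x, 1.0, 0.5)

# Maximize the pointwise integrand over a fine grid of D values in (0, 1).
d_grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
integrand = p_data * np.log(d_grid) + p_g * np.log(1 - d_grid)

d_best = d_grid[np.argmax(integrand)]
d_star = p_data / (p_data + p_g)
print(d_best, d_star)                      # agree up to the grid resolution
```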
GANs: Jensen-Shannon Divergence and Optimal Generators
Given an optimal discriminator $D^*(x)$, the generator objective is
$$C(\theta_g) = \mathbb{E}_{p_{\text{data}}}\bigl[\log D^*_{\theta_d}(x)\bigr] + \mathbb{E}_{p_g(x)}\bigl[\log\bigl(1 - D^*_{\theta_d}(x)\bigr)\bigr]$$
$$= \mathbb{E}_{p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{p_g(x)}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]$$
$$\propto \underbrace{\frac{1}{2}\,\mathrm{KL}\!\left(p_{\text{data}} \,\Big\|\, \tfrac{1}{2}(p_{\text{data}} + p_g)\right) + \frac{1}{2}\,\mathrm{KL}\!\left(p_g \,\Big\|\, \tfrac{1}{2}(p_{\text{data}} + p_g)\right)}_{\text{Jensen-Shannon divergence}}$$
$C(\theta_g)$ achieves its global minimum at $p_g = p_{\text{data}}$ given an optimal discriminator!
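The constant hidden by the $\propto$ is $-\log 4$: with an optimal discriminator, $C(\theta_g) = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$, so $C$ is minimized exactly when $p_g = p_{\text{data}}$. The sketch below checks this identity on two made-up discrete distributions.

```python
# Verify C(theta_g) = -log(4) + 2 * JSD(p_data || p_g) on small
# made-up discrete distributions.
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.2, 0.6])
mix    = 0.5 * (p_data + p_g)

def kl(p, q):
    return np.sum(p * np.log(p / q))

jsd = 0.5 * kl(p_data, mix) + 0.5 * kl(p_g, mix)

# C(theta_g) with the optimal discriminator D*(x) = p_data / (p_data + p_g)
C = np.sum(p_data * np.log(p_data / (p_data + p_g))) \
  + np.sum(p_g    * np.log(p_g    / (p_data + p_g)))

print(C, -np.log(4) + 2 * jsd)             # the two values agree
```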
GANs: Learning Generators and Discriminators
Putting these results to use in practice:
• High-capacity discriminators $D_{\theta_d}$ approximate the Jensen-Shannon divergence when trained close to their global maximum.
• $D_{\theta_d}$ is a “differentiable program”.
• We can use $D_{\theta_d}$ to learn $G_{\theta_g}$ with our favourite gradient descent method.
https://arxiv.org/abs/1511.06434
GANs: Training Procedure
for t = 1 ... N do
    for k = 1 ... K do
        • Sample noise samples $\{z^{(1)}, \ldots, z^{(m)}\} \sim p_z(z)$.
        • Sample examples $\{x^{(1)}, \ldots, x^{(m)}\}$ from $p_{\text{data}}(x)$.
        • Update the discriminator $D_{\theta_d}$ by gradient ascent:
        $$\theta_d = \theta_d + \alpha_d \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \Bigl[\log D\bigl(x^{(i)}\bigr) + \log\Bigl(1 - D\bigl(G\bigl(z^{(i)}\bigr)\bigr)\Bigr)\Bigr].$$
    end for
    • Sample noise samples $\{z^{(1)}, \ldots, z^{(m)}\} \sim p_z(z)$.
    • Update the generator $G_{\theta_g}$ by gradient descent:
    $$\theta_g = \theta_g - \alpha_g \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\Bigl(1 - D\bigl(G\bigl(z^{(i)}\bigr)\bigr)\Bigr).$$
end for
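A runnable sketch of this loop on a toy 1-D problem is below. The toy data distribution, network sizes, learning rates, $K$, and batch size $m$ are all arbitrary assumptions for illustration, not a recommended configuration.

```python
# Minimal sketch of the training loop on toy data with p_data = N(4, 0.5^2).
import torch
import torch.nn as nn

torch.manual_seed(0)
m, latent_dim, K, eps = 64, 8, 1, 1e-7

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)

def sample_data(n):
    return 4.0 + 0.5 * torch.randn(n, 1)         # x ~ p_data(x)

def sample_noise(n):
    return torch.randn(n, latent_dim)            # z ~ p_z(z)

for step in range(2000):
    for _ in range(K):
        # Discriminator ascent step on E[log D(x)] + E[log(1 - D(G(z)))],
        # written as descent on the negated objective.
        x_real, x_fake = sample_data(m), G(sample_noise(m)).detach()
        d_loss = -(torch.log(D(x_real) + eps).mean()
                   + torch.log(1 - D(x_fake) + eps).mean())
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator descent step on E[log(1 - D(G(z)))].
    g_loss = torch.log(1 - D(G(sample_noise(m))) + eps).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

samples = G(sample_noise(1000))
# Sample mean/std: the mean should drift toward ~4, although the
# saturating loss can make progress slow or unstable (see the
# vanishing-gradient and mode-collapse discussion later).
print(samples.mean().item(), samples.std().item())
```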
Problems (c. 2016)
Problems with GANs
• Vanishing Gradients: the discriminator becomes “too good” and the generator gradient vanishes.
• Non-Convergence: the generator and discriminator oscillate without reaching an equilibrium.
• Mode Collapse: the generator distribution collapses to a small set of examples.
• Mode Dropping: the generator distribution doesn't fully cover the data distribution.
Problems: Vanishing Gradients
• The minimax objective saturates when $D_{\theta_d}$ is close to perfect:
$$V(\theta_d, \theta_g) = \mathbb{E}_{p_{\text{data}}}\bigl[\log D_{\theta_d}(x)\bigr] + \mathbb{E}_{p_z(z)}\bigl[\log\bigl(1 - D_{\theta_d}(G_{\theta_g}(z))\bigr)\bigr].$$
• A non-saturating heuristic objective for the generator is
$$J(G_{\theta_g}) = -\mathbb{E}_{p_z(z)}\bigl[\log D_{\theta_d}(G_{\theta_g}(z))\bigr].$$
https://arxiv.org/abs/1701.00160
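To see the saturation concretely, write $D = \sigma(a)$ for the discriminator's logit $a$. The gradient of the saturating generator loss $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a)$, which vanishes when the discriminator confidently rejects a sample ($a \ll 0$), while the non-saturating loss $-\log \sigma(a)$ has gradient $-(1 - \sigma(a)) \approx -1$ there. The short autograd check below (with made-up logit values) prints both.

```python
# Gradients of the two generator losses w.r.t. the discriminator logit a,
# where D(G(z)) = sigmoid(a).  Logit values are chosen for illustration.
import torch

for logit in [-8.0, -4.0, 0.0, 2.0]:
    a = torch.tensor(logit, requires_grad=True)
    (g_sat,) = torch.autograd.grad(torch.log(1 - torch.sigmoid(a)), a)

    a = torch.tensor(logit, requires_grad=True)
    (g_ns,) = torch.autograd.grad(-torch.log(torch.sigmoid(a)), a)

    print(f"logit {logit:+.1f}: saturating grad {g_sat.item():+.4f}, "
          f"non-saturating grad {g_ns.item():+.4f}")
```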
Problems: Addressing Vanishing Gradients
Solutions:
• Change Objectives: use the non-saturating heuristic objective, the maximum-likelihood cost, etc.
• Limit the Discriminator: restrict the capacity of the discriminator.
• Schedule Learning: try to balance training $D_{\theta_d}$ and $G_{\theta_g}$.
Problems: Non-Convergence
Simultaneous gradient descent is not guaranteed to converge for minimax objectives.
• Goodfellow et al. only showed convergence when updates are made in the function space [2].
• The parameterization of $D_{\theta_d}$ and $G_{\theta_g}$ results in a highly non-convex objective.
• In practice, training tends to oscillate – updates “undo” each other.
Problems: Addressing Non-Convergence
Solutions: Lots and lots of hacks!
https://github.com/soumith/ganhacks
Problems: Mode Collapse and Mode Dropping
One Explanation: SGD may optimize the max-min objective
$$\max_{\theta_d} \min_{\theta_g} \; \mathbb{E}_{p_{\text{data}}}\bigl[\log D_{\theta_d}(x)\bigr] + \mathbb{E}_{p_z(z)}\bigl[\log\bigl(1 - D_{\theta_d}(G_{\theta_g}(z))\bigr)\bigr]$$
instead of the intended min-max objective.
Intuition: the generator maps all $z$ values to the single $\hat{x}$ that is most likely to fool the discriminator.
https://arxiv.org/abs/1701.00160
A Possible Solution