Generative networks part 2: GANs
Recap on generative networks

Generative networks provide a way to sample from any distribution.
1. Sample z ∼ µ, where µ denotes an efficiently sampleable distribution (e.g., uniform or Gaussian).
2. Output g(z), where g : R^d → R^m is a deep network.

Notation: let g_#µ (the pushforward of µ through g) denote this distribution.

Brief remarks:
◮ Can this model any target distribution ν? Yes, (roughly) for the same reason that g can approximate any f : R^d → R^m.
◮ Graphical models let us sample and estimate probabilities; what about here? Nope.
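To make the sampling recipe concrete, here is a minimal sketch (assuming PyTorch; the slides do not fix a framework, and the architecture below is an arbitrary choice): an untrained MLP g pushes Gaussian noise z ∼ µ forward, so the rows of the output are samples from g_#µ.

```python
import torch
from torch import nn

d, m = 2, 3           # latent dimension d, output dimension m (arbitrary choices)
n_samples = 1000

# A small MLP g : R^d -> R^m; in practice g would be trained (VAE/GAN, below).
g = nn.Sequential(
    nn.Linear(d, 64),
    nn.ReLU(),
    nn.Linear(64, m),
)

# Step 1: sample z ~ mu, here a standard Gaussian (a uniform would also do).
z = torch.randn(n_samples, d)

# Step 2: output g(z); each row is a sample from the pushforward g_# mu.
with torch.no_grad():
    x = g(z)

print(x.shape)  # torch.Size([1000, 3])
```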
Univariate examples

g(x) = x, the identity function, mapping Uniform([0, 1]) to itself.

[Plot: the density of Uniform([0, 1]) pushed forward through the identity.]
Univariate examples

g(x) = x²/2, mapping Uniform([0, 1]) to a density ∝ 1/√x.

[Plot: the resulting pushforward density.]
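A quick Monte Carlo check of this change-of-variables claim (a sketch assuming numpy, and reading the slide's map as g(x) = x²/2, whose pushforward density is 1/√(2x) on (0, 1/2]):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=200_000)
y = u**2 / 2   # pushforward of Uniform([0, 1]) through g(x) = x^2 / 2

# Histogram-based density estimate on (0, 1/2].
hist, edges = np.histogram(y, bins=20, range=(0.0, 0.5), density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Analytic pushforward density: p(y) = 1 / sqrt(2 y), i.e. proportional to 1/sqrt(y).
analytic = 1.0 / np.sqrt(2 * centers)

for c, h, a in zip(centers[:5], hist[:5], analytic[:5]):
    print(f"y = {c:.3f}   empirical = {h:.2f}   analytic = {a:.2f}")
```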
Univariate examples

g is the inverse CDF of the Gaussian; the input distribution is Uniform([0, 1]) and the output is Gaussian.

[Plot: the resulting Gaussian density.]
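The same idea in code, via inverse transform sampling (a sketch assuming scipy; norm.ppf is the Gaussian inverse CDF):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)   # z ~ Uniform([0, 1])
x = norm.ppf(u)                 # g = inverse CDF of N(0, 1), so x ~ N(0, 1)

print(x.mean(), x.std())        # approximately 0 and 1
```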
Another way to visualize generative networks

Given a sample from a distribution (even g_#µ), here's the "kernel density" / "Parzen window" estimate of its density:
1. Start with a random draw (x_i)_{i=1}^n.
2. "Place bumps at every x_i": define

   p̂(x) := (1/n) ∑_{i=1}^n k((x − x_i)/h),

   where k is a kernel function (not the SVM one!) and h is the "bandwidth"; for example:
   ◮ Gaussian: k(z) ∝ exp(−‖z‖²/2);
   ◮ Epanechnikov: k(z) ∝ max{0, 1 − ‖z‖²}.
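A minimal univariate KDE sketch (numpy; the bandwidth h and the two-bump sample are made-up choices). It uses a normalized Gaussian kernel and the conventional 1/(nh) scaling so that p̂ integrates to 1; the slide only specifies kernels up to proportionality, so the constants are a matter of convention.

```python
import numpy as np

def gaussian_kernel(z):
    """Normalized Gaussian kernel: k(z) = exp(-z^2 / 2) / sqrt(2 pi)."""
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def kde(query, sample, h):
    """Parzen-window estimate: p_hat(x) = (1 / (n h)) * sum_i k((x - x_i) / h)."""
    query = np.asarray(query)[:, None]     # shape (q, 1)
    sample = np.asarray(sample)[None, :]   # shape (1, n)
    return gaussian_kernel((query - sample) / h).mean(axis=1) / h

# Example: estimate the density of a two-bump sample.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 0.5, 500)])
print(kde(np.linspace(-3, 6, 7), sample, h=0.3))
```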
Examples — univariate sampling.

Univariate sample, kernel density estimate (kde), GMM fit via E-M.

[Plot: the kde and gmm density curves.]
Examples — univariate sampling.

Univariate sample, kernel density estimate (kde), GAN kde.

[Plot: the kde of the data and a kde of GAN samples.]

This is admittedly very indirect! As mentioned, there aren't great ways to get GAN/VAE density information.
Examples — bivariate sampling.

Bivariate sample, GMM fit via E-M.

[Plot: the bivariate sample with the GMM fit.]
Examples — bivariate sampling.

Bivariate sample, kernel density estimate (kde).

[Plot: the bivariate sample with the kde.]
Examples — bivariate sampling.

Bivariate sample, GAN kde.

[Plot: the bivariate sample with a kde of GAN samples.]

Question: how will this plot change with network capacity?
Approaches we've seen for modeling distributions.

Let's survey our approaches to density estimation.
◮ Graphical models: can be interpretable, can encode domain knowledge.
◮ Kernel density estimation: easy to implement, converges to the right thing, suffers a curse of dimension.
◮ Training: easy for KDE, messy for graphical models. Interpretability: fine for both. Sampling: easy for both. Probability measurements: easy for KDE, sometimes easy for graphical models.

Deep networks.
◮ Either we have easy sampling, or we can estimate densities. Doing both seems to have major computational or data costs.
Brief VAE Recap
(Variational) Autoencoders

Autoencoder:
◮ x_i is mapped by f to a latent z_i = f(x_i), which is mapped by g to x̂_i = g(z_i).
Objective: (1/n) ∑_{i=1}^n ℓ(x_i, x̂_i).

Variational Autoencoder:
◮ x_i is mapped by f to a latent distribution µ_i = f(x_i), whose pushforward through g gives x̂_i ∼ g_#µ_i.
Objective: (1/n) ∑_{i=1}^n [ ℓ(x_i, x̂_i) + λ · KL(µ, µ_i) ].
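As a concrete sketch of this objective (assuming PyTorch, squared error for ℓ, Gaussian latents µ_i = N(m_i, diag(s_i²)) with a standard Gaussian prior µ, and the usual closed-form Gaussian KL; the linear encoder/decoder and λ = 1 are placeholder choices):

```python
import torch
from torch import nn

d, m, lam = 2, 8, 1.0            # latent dim, data dim, KL weight lambda (assumed)

encoder = nn.Linear(m, 2 * d)    # outputs mean and log-variance of mu_i
decoder = nn.Linear(d, m)        # the map g

def vae_loss(x):
    mean, logvar = encoder(x).chunk(2, dim=-1)
    std = torch.exp(0.5 * logvar)

    # Reparameterized sample z ~ mu_i = N(mean, diag(std^2)), then x_hat ~ g_# mu_i.
    z = mean + std * torch.randn_like(std)
    x_hat = decoder(z)

    # Reconstruction term: squared error standing in for ell(x_i, x_hat_i).
    recon = ((x - x_hat) ** 2).sum(dim=-1)

    # Closed-form KL between N(mean, diag(std^2)) and the prior N(0, I).
    kl = 0.5 * (mean**2 + std**2 - logvar - 1).sum(dim=-1)

    return (recon + lam * kl).mean()

x = torch.randn(32, m)           # a toy batch standing in for (x_i)
print(vae_loss(x).item())
```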
[Plot: samples x̂_i ∼ g_#µ_i.]
[Plot: samples x̂_i ∼ g_#µ with small λ.]
Generative Adversarial Networks (GANs)
Generative network setup and training.

◮ We are given (x_i)_{i=1}^n ∼ ν.
◮ We want to find g so that (g(z_i))_{i=1}^n ≈ (x_i)_{i=1}^n, where (z_i)_{i=1}^n ∼ µ.

Problem: this isn't as simple as fitting g(z_i) ≈ x_i.

Solutions:
◮ VAE: for each x_i, construct a distribution µ_i so that x̂_i ∼ g_#µ_i and x_i are close, as are µ_i and µ. To generate fresh samples, draw z ∼ µ and output g(z).
◮ GAN: pick a notion of distance between distributions (or between the samples (g(z_i))_{i=1}^n and (x_i)_{i=1}^n) and pick g to minimize that!
GAN overview

GAN approach: we minimize D(ν, g_#µ) directly, where "D" is some notion of distance/divergence:
◮ Jensen-Shannon divergence (original GAN paper).
◮ Wasserstein distance (influential follow-up).

Each distance is computed with an alternating/adversarial scheme:
1. We have some current choice g_t, and use it to produce a sample (x̂_i)_{i=1}^n with x̂_i = g_t(z_i).
2. We train a discriminator/critic f_t to find differences between (x̂_i)_{i=1}^n and (x_i)_{i=1}^n.
3. We then pick a new generator g_{t+1}, trained to fool f_t!
Jensen-Shannon divergence (original GAN)
Original GAN formulation

Let p, p_g denote the densities of the data and the generator, and let p̃ = (p + p_g)/2. The original GAN minimizes the Jensen-Shannon divergence:

2 · JS(p, p_g) = KL(p, p̃) + KL(p_g, p̃)
              = ∫ p(x) ln(p(x)/p̃(x)) dx + ∫ p_g(x) ln(p_g(x)/p̃(x)) dx
              = E_p ln(p(x)/p̃(x)) + E_{p_g} ln(p_g(x)/p̃(x)).

But we've been saying we can't write down p_g?

The original GAN approach applies alternating minimization to

inf_{g ∈ G}  sup_{f ∈ F, f : X → (0,1)}  [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].
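For intuition, here is the definition evaluated on two made-up discrete distributions (a numpy sketch): 2 · JS(p, p_g) = KL(p, p̃) + KL(p_g, p̃) with p̃ = (p + p_g)/2, and JS always lies between 0 and ln 2.

```python
import numpy as np

def kl(p, q):
    """KL(p, q) = sum_x p(x) ln(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, pg):
    p_tilde = (np.asarray(p, float) + np.asarray(pg, float)) / 2
    return 0.5 * (kl(p, p_tilde) + kl(pg, p_tilde))

p  = [0.5, 0.5, 0.0]
pg = [0.1, 0.4, 0.5]
print(js(p, pg))   # strictly between 0 and ln(2) ~= 0.693
print(js(p, p))    # 0 for identical distributions
```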
Original GAN formulation and algorithm.

Original GAN objective:

inf_{g ∈ G}  sup_{f ∈ F, f : X → (0,1)}  [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].

The algorithm alternates these two steps:
1. Hold g fixed and optimize f. Specifically, generate a sample (x̂_j)_{j=1}^m = (g(z_j))_{j=1}^m, and approximately optimize

   sup_{f ∈ F, f : X → (0,1)}  [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(x̂_j)) ].

2. Hold f fixed and optimize g. Specifically, generate (z_j)_{j=1}^m and approximately optimize

   inf_{g ∈ G}  [ (1/n) ∑_{i=1}^n ln f(x_i) + (1/m) ∑_{j=1}^m ln(1 − f(g(z_j))) ].
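A minimal sketch of this alternation (PyTorch; the toy 1-D target ν, network sizes, learning rates, and the ε inside the logs are all assumptions). Step 1 ascends the objective over f with g frozen; step 2 descends over g with f frozen, dropping the first term since it does not depend on g.

```python
import torch
from torch import nn

d, n, m, eps = 2, 128, 128, 1e-6
g = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
f = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1),
                  nn.Sigmoid())                                     # f : X -> (0, 1)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)

def sample_data(n):
    # Stand-in for (x_i) ~ nu: a 1-D Gaussian centered at 3.
    return 3 + torch.randn(n, 1)

for t in range(2000):
    # Step 1: hold g fixed, approximately maximize over f (minimize the negation).
    x, z = sample_data(n), torch.randn(m, d)
    with torch.no_grad():
        x_hat = g(z)
    loss_f = -(torch.log(f(x) + eps).mean() + torch.log(1 - f(x_hat) + eps).mean())
    opt_f.zero_grad()
    loss_f.backward()
    opt_f.step()

    # Step 2: hold f fixed, approximately minimize over g.
    z = torch.randn(m, d)
    loss_g = torch.log(1 - f(g(z)) + eps).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

with torch.no_grad():
    print(g(torch.randn(1000, d)).mean())   # should drift toward 3
```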