AMMI – Introduction to Deep Learning
10.1. Generative Adversarial Networks
François Fleuret
https://fleuret.org/ammi-2018/
Thu Sep 6 16:09:56 CAT 2018
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
A different approach to learning high-dimensional generative models is the Generative Adversarial Network (GAN), proposed by Goodfellow et al. (2014).

The idea behind GANs is to train two networks jointly:

• a discriminator D to classify samples as “real” or “fake”,
• a generator G to map a [simple] fixed distribution to samples that fool D.

[Diagram: D maps a real sample to “real”; G maps a latent code Z to a synthetic sample that D should classify as “fake”.]

The approach is adversarial since the two networks have antagonistic objectives.
A bit more formally, let 𝒳 be the signal space and D the latent space dimension.

• The generator G : ℝ^D → 𝒳 is trained so that [ideally], given a random normally distributed Z as input, it produces a sample following the data distribution as output.

• The discriminator D : 𝒳 → [0, 1] is trained so that, given a sample as input, it predicts whether it is genuine.
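As a concrete illustration, the two networks can be sketched as small fully connected PyTorch modules. This is a minimal sketch, not the architecture used in the lecture: the layer widths, latent_dim, and signal_dim are arbitrary placeholder values.

import torch
from torch import nn

latent_dim, signal_dim = 32, 784   # assumed dimensions for the sketch

# Generator G : R^D -> X, maps a latent code to a point in the signal space
G = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, signal_dim),
)

# Discriminator D : X -> [0, 1], outputs the probability that its input is real
D = nn.Sequential(
    nn.Linear(signal_dim, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)    # Z ~ N(0, I)
fake = G(z)                        # samples distributed according to mu_G
p_real = D(fake)                   # D's estimate that these samples are genuine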
If G is fixed, to train D given a set of “real points” x_n ∼ µ, n = 1, …, N, we can generate z_n ∼ 𝒩(0, I), n = 1, …, N, build a two-class data set

\mathscr{D} = \big\{ (x_1, 1), \dots, (x_N, 1), (G(z_1), 0), \dots, (G(z_N), 0) \big\},

whose first N pairs are real samples ∼ µ and last N pairs are fake samples ∼ µ_G, and minimize the binary cross-entropy

\mathscr{L}(D) = -\frac{1}{2N} \left( \sum_{n=1}^{N} \log D(x_n) + \sum_{n=1}^{N} \log\big(1 - D(G(z_n))\big) \right)
              = -\frac{1}{2} \left( \hat{\mathbb{E}}_{X \sim \mu}\big[\log D(X)\big] + \hat{\mathbb{E}}_{X \sim \mu_G}\big[\log(1 - D(X))\big] \right),

where µ is the true distribution of the data, and µ_G is the distribution of G(Z) with Z ∼ 𝒩(0, I).
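A minimal sketch of this discriminator update in PyTorch, assuming G and D are the placeholder modules from the previous sketch; real_batch, the latent dimension, and the use of nn.BCELoss are choices made for the example, not taken from the lecture.

bce = nn.BCELoss()   # binary cross-entropy

def discriminator_loss(D, G, real_batch, latent_dim=32):
    N = real_batch.size(0)
    z = torch.randn(N, latent_dim)             # z_n ~ N(0, I)
    fake_batch = G(z).detach()                 # G is fixed: no gradient flows into it
    # targets: 1 for real samples, 0 for generated ones
    loss_real = bce(D(real_batch), torch.ones(N, 1))
    loss_fake = bce(D(fake_batch), torch.zeros(N, 1))
    # matches -1/(2N) (sum log D(x_n) + sum log(1 - D(G(z_n))))
    return 0.5 * (loss_real + loss_fake)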
The situation is slightly more complicated, since we also want to optimize G to maximize D’s loss.

Goodfellow et al. (2014) provide an analysis of the resulting equilibrium of that strategy.
Let’s define

V(D, G) = \mathbb{E}_{X \sim \mu}\big[\log D(X)\big] + \mathbb{E}_{X \sim \mu_G}\big[\log(1 - D(X))\big],

which is high if D is doing a good job (low cross-entropy), and low if G fools D.

Our ultimate goal is a G* that fools any D, so

G^* = \operatorname{argmin}_G \max_D V(D, G).

If we define the optimal discriminator for a given generator

D_G^* = \operatorname{argmax}_D V(D, G),

our objective becomes

G^* = \operatorname{argmin}_G V(D_G^*, G).
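In code, the generator’s objective is the second term of V, minimized over G’s parameters. A minimal sketch, reusing the placeholder G, D, and latent dimension from the previous sketches, and keeping the “saturating” log(1 − D(G(z))) form of this slide rather than the −log D(G(z)) variant often used in practice:

def generator_loss(D, G, N=64, latent_dim=32):
    z = torch.randn(N, latent_dim)             # Z ~ N(0, I)
    # G tries to make D(G(z)) large, i.e. to minimize E[log(1 - D(G(z)))]
    return torch.log(1 - D(G(z))).mean()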
We have

V(D, G) = \mathbb{E}_{X \sim \mu}\big[\log D(X)\big] + \mathbb{E}_{X \sim \mu_G}\big[\log(1 - D(X))\big]
        = \int_x \mu(x) \log D(x) + \mu_G(x) \log\big(1 - D(x)\big) \, dx.

Since

\operatorname{argmax}_d \; \mu(x) \log d + \mu_G(x) \log(1 - d) = \frac{\mu(x)}{\mu(x) + \mu_G(x)},

and

D_G^* = \operatorname{argmax}_D V(D, G),

if there is no regularization on D, we get

\forall x, \quad D_G^*(x) = \frac{\mu(x)}{\mu(x) + \mu_G(x)}.
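The pointwise maximization can be checked directly: writing a = µ(x) and b = µ_G(x), setting the derivative with respect to d to zero gives

\frac{d}{dd}\Big( a \log d + b \log(1 - d) \Big) = \frac{a}{d} - \frac{b}{1 - d} = 0
\;\Longleftrightarrow\; a(1 - d) = b d
\;\Longleftrightarrow\; d = \frac{a}{a + b} = \frac{\mu(x)}{\mu(x) + \mu_G(x)}.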
So, since

\forall x, \quad D_G^*(x) = \frac{\mu(x)}{\mu(x) + \mu_G(x)},

we get

V(D_G^*, G) = \mathbb{E}_{X \sim \mu}\big[\log D_G^*(X)\big] + \mathbb{E}_{X \sim \mu_G}\big[\log(1 - D_G^*(X))\big]
            = \mathbb{E}_{X \sim \mu}\left[\log \frac{\mu(X)}{\mu(X) + \mu_G(X)}\right] + \mathbb{E}_{X \sim \mu_G}\left[\log \frac{\mu_G(X)}{\mu(X) + \mu_G(X)}\right]
            = D_{KL}\left(\mu \,\middle\|\, \frac{\mu + \mu_G}{2}\right) + D_{KL}\left(\mu_G \,\middle\|\, \frac{\mu + \mu_G}{2}\right) - \log 4
            = 2\, D_{JS}(\mu, \mu_G) - \log 4,

where D_{JS} is the Jensen-Shannon divergence, a standard dissimilarity measure between distributions.
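The last step uses the definition of the Jensen-Shannon divergence with respect to the mixture m = (µ + µ_G)/2:

D_{JS}(\mu, \mu_G) = \frac{1}{2} D_{KL}\left(\mu \,\middle\|\, m\right) + \frac{1}{2} D_{KL}\left(\mu_G \,\middle\|\, m\right), \qquad m = \frac{\mu + \mu_G}{2}.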
To recap: if there is no capacity limitation for D, and if we define

V(D, G) = \mathbb{E}_{X \sim \mu}\big[\log D(X)\big] + \mathbb{E}_{X \sim \mu_G}\big[\log(1 - D(X))\big],

computing

G^* = \operatorname{argmin}_G \max_D V(D, G)

amounts to computing

G^* = \operatorname{argmin}_G D_{JS}(\mu, \mu_G),

where D_{JS} is a reasonable dissimilarity measure between distributions.

Although this derivation provides a nice formal framework, in practice D is not “fully” optimized to [come close to] D_G^* when optimizing G. In our minimal example, we alternate gradient steps to improve G and D, as sketched below.
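A minimal sketch of such an alternating training loop, reusing the placeholder G, D, discriminator_loss, and generator_loss from the sketches above; real_loader, the optimizer settings, and the number of epochs are arbitrary assumptions, not the lecture’s actual configuration.

optim_d = torch.optim.Adam(D.parameters(), lr=1e-4)
optim_g = torch.optim.Adam(G.parameters(), lr=1e-4)

for epoch in range(10):
    for real_batch in real_loader:             # mini-batches of real samples x_n ~ mu
        # one gradient step on D, with G held fixed
        loss_d = discriminator_loss(D, G, real_batch)
        optim_d.zero_grad()
        loss_d.backward()
        optim_d.step()

        # one gradient step on G, trying to fool the current D
        loss_g = generator_loss(D, G, N=real_batch.size(0))
        optim_g.zero_grad()
        loss_g.backward()
        optim_g.step()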