MMD as an integral probability metric

Maximum mean discrepancy: smooth witness function for P vs Q,

$$ \mathrm{MMD}(P, Q; F) := \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ E_P f(X) - E_Q f(Y) \right] \qquad (F = \text{unit ball in RKHS } \mathcal{F}) $$

For a characteristic RKHS $\mathcal{F}$, $\mathrm{MMD}(P, Q; F) = 0$ iff $P = Q$.

Other choices for the witness function class:
- Bounded continuous [Dudley, 2002]
- Bounded variation 1 (Kolmogorov metric) [Müller, 1997]
- Bounded Lipschitz (Wasserstein distances) [Dudley, 2002]
Integral prob. metric vs feature difference

The MMD:

$$ \mathrm{MMD}(P, Q; F) = \sup_{f \in F} \left[ E_P f(X) - E_Q f(Y) \right] $$

[Figure: witness function $f$ for Gaussian and Laplace densities; the witness is largest where the probability masses differ most.]
Integral prob. metric vs feature difference

The MMD: use $E_P f(X) = \langle \mu_P, f \rangle_{\mathcal{F}}$,

$$ \begin{aligned} \mathrm{MMD}(P, Q; F) &= \sup_{f \in F} \left[ E_P f(X) - E_Q f(Y) \right] \\ &= \sup_{f \in F} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}} \\ &= \| \mu_P - \mu_Q \|_{\mathcal{F}} \end{aligned} $$

(The supremum is attained at the unit-norm witness $f^*$ aligned with $\mu_P - \mu_Q$.)

Function view and feature view are equivalent.
Construction of MMD witness

Construction of the empirical witness function (proof: next slide!)

Observe $X = \{x_1, \ldots, x_n\} \sim P$
Observe $Y = \{y_1, \ldots, y_n\} \sim Q$

[Figure: the empirical witness evaluated at a point $v$, giving $\mathrm{witness}(v)$.]
Derivation of empirical witness function

Recall the witness function expression:

$$ f^* \propto \mu_P - \mu_Q $$

The empirical feature mean for P:

$$ \hat{\mu}_P := \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i) $$

The empirical witness function at $v$:

$$ \begin{aligned} f^*(v) &= \langle f^*, \varphi(v) \rangle_{\mathcal{F}} \\ &\propto \langle \hat{\mu}_P - \hat{\mu}_Q, \varphi(v) \rangle_{\mathcal{F}} \\ &= \frac{1}{n} \sum_{i=1}^{n} k(x_i, v) - \frac{1}{n} \sum_{i=1}^{n} k(y_i, v) \end{aligned} $$

Don't need explicit feature coefficients $f^* := [\, f_1^* \;\; f_2^* \;\; \ldots \,]$
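The two kernel averages above are all that is needed to evaluate the witness. A minimal NumPy sketch (not from the slides; the Gaussian kernel and the bandwidth `sigma=1.0` are illustrative choices):

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    # Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)),
    # evaluated for all pairs of rows of a and b.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def witness(v, X, Y, sigma=1.0):
    # Empirical witness (up to scale):
    # f*(v) = (1/n) sum_i k(x_i, v) - (1/n) sum_i k(y_i, v)
    return gauss_kernel(v, X, sigma).mean(axis=1) - gauss_kernel(v, Y, sigma).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(-1.0, 1.0, size=(500, 1))  # samples from P
Y = rng.normal(+1.0, 1.0, size=(500, 1))  # samples from Q
w = witness(np.array([[-1.0], [1.0]]), X, Y)
# w[0] > 0 (P has more mass near -1); w[1] < 0 (Q has more mass near +1)
```

The witness is positive where P overweights Q and negative where Q overweights P, exactly as in the Gauss-vs-Laplace picture earlier.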
Interlude: divergence measures
Divergences

[Figure: relations between families of divergence measures.]

Sriperumbudur, Fukumizu, G, Schoelkopf, Lanckriet (2012)
Two-Sample Testing with MMD
A statistical test using MMD

The empirical MMD:

$$ \widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \ne j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \ne j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j) $$

How does this help decide whether $P = Q$?

Perspective from statistical hypothesis testing:
- Null hypothesis $H_0$: $P = Q$ — should see $\widehat{\mathrm{MMD}}^2$ "close to zero".
- Alternative hypothesis $H_1$: $P \ne Q$ — should see $\widehat{\mathrm{MMD}}^2$ "far from zero".

Want a threshold $c_\alpha$ for $\widehat{\mathrm{MMD}}^2$ to get false positive rate $\alpha$.
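The estimator above is straightforward to compute from the three kernel matrices. A minimal NumPy sketch (not the authors' code; the Gaussian kernel and bandwidth are illustrative):

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased estimate of MMD^2: off-diagonal means of the k(x,x) and
    # k(y,y) blocks, minus twice the full mean of the cross block k(x,y).
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 1)), rng.normal(size=(200, 1)))
diff = mmd2_unbiased(rng.normal(size=(200, 1)), rng.normal(2.0, 1.0, size=(200, 1)))
# "same" is near zero (P = Q); "diff" is clearly positive (P != Q)
```

This already shows the testing intuition: the statistic concentrates near zero under $H_0$ and away from zero under $H_1$.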
Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \ne Q$

Draw $n = 200$ i.i.d. samples from P and Q: Laplace distributions with different y-variance.

$$ \sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.2 $$

[Figure: scatter plot of the two samples; the value is added to a histogram of the statistic.]
Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \ne Q$

Draw $n = 200$ new samples from P and Q: Laplace distributions with different y-variance.

$$ \sqrt{n} \times \widehat{\mathrm{MMD}}^2 = 1.5 $$

[Figure: new samples and updated histogram.]
Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P \ne Q$

Repeat this 150 times ... 300 times ... 3000 times ...

[Figure: histogram of $\sqrt{n} \times \widehat{\mathrm{MMD}}^2$ filling out as repetitions accumulate.]
Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P \ne Q$

When $P \ne Q$, the statistic is asymptotically normal:

$$ \frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} \xrightarrow{D} \mathcal{N}(0, 1), $$

where the variance $V_n(P, Q) = O\!\left(n^{-1}\right)$.

[Figure: two Laplace distributions with different variances (left); empirical PDF of the statistic with a Gaussian fit (right).]
Behaviour of $\widehat{\mathrm{MMD}}^2$ when $P = Q$

What happens when P and Q are the same?

Case of $P = Q = \mathcal{N}(0, 1)$:

[Figure: histogram of the statistic under the null, built up over repeated draws.]
Asymptotics of $\widehat{\mathrm{MMD}}^2$ when $P = Q$

When $P = Q$, the statistic has asymptotic distribution

$$ n \widehat{\mathrm{MMD}}^2 \sim \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right] $$

where

$$ \lambda_i \psi_i(x') = \int \underbrace{\tilde{k}(x, x')}_{\text{centred}} \psi_i(x) \, dP(x) $$

and $z_l \sim \mathcal{N}(0, 2)$ i.i.d.
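The shape of this null distribution can be seen by simulation. A sketch (an assumption on my part: it uses the biased V-statistic version of MMD², which equals $\|\hat{\mu}_P - \hat{\mu}_Q\|^2$ and so is nonnegative, unlike the unbiased statistic above):

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    # Biased (V-statistic) MMD^2 = ||mu_P_hat - mu_Q_hat||^2 >= 0.
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
n = 100
# Draw both samples from the same distribution: this simulates H0.
stats = np.array([
    n * mmd2_biased(rng.normal(size=(n, 1)), rng.normal(size=(n, 1)))
    for _ in range(300)
])
# The simulated null is supported on [0, inf) and right-skewed,
# as expected for a weighted sum of chi-squared-like variables.
```

The resulting histogram matches the picture on this slide: a skewed distribution hugging zero, very different from the Gaussian seen under $H_1$.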
A statistical test

A summary of the asymptotics:

[Figure: null and alternative distributions of the statistic, with the test threshold $c_\alpha$.]

Test construction: (G., Borgwardt, Rasch, Schoelkopf, and Smola, JMLR 2012)
How do we get test threshold $c_\alpha$?

Original empirical MMD for dogs and fish:

$$ \widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \ne j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \ne j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j) $$

[Figure: kernel matrices $k(x_i, x_j)$, $k(y_i, y_j)$, and $k(x_i, y_j)$ for the dog and fish samples.]
How do we get test threshold $c_\alpha$?

Permuted dog and fish samples (merdogs):

$$ \widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \ne j} k(\tilde{x}_i, \tilde{x}_j) + \frac{1}{n(n-1)} \sum_{i \ne j} k(\tilde{y}_i, \tilde{y}_j) - \frac{2}{n^2} \sum_{i,j} k(\tilde{x}_i, \tilde{y}_j) $$

[Figure: kernel matrices $k(\tilde{x}_i, \tilde{x}_j)$, $k(\tilde{x}_i, \tilde{y}_j)$, and $k(\tilde{y}_i, \tilde{y}_j)$ for the permuted samples.]

Permutation simulates $P = Q$.
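The permutation procedure can be sketched in a few lines of NumPy (not the authors' code; kernel, bandwidth, permutation count, and the `alpha` level are illustrative):

```python
import numpy as np

def mmd2_biased(Z, n, sigma=1.0):
    # Biased MMD^2 between the first n rows of pooled Z and the rest.
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2 * sigma ** 2))
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

def permutation_threshold(X, Y, alpha=0.05, n_perm=200, seed=0):
    # Simulate P = Q by shuffling the pooled sample; c_alpha is the
    # (1 - alpha) quantile of the permuted statistics.
    rng = np.random.default_rng(seed)
    Z, n = np.vstack([X, Y]), len(X)
    null = [mmd2_biased(Z[rng.permutation(len(Z))], n) for _ in range(n_perm)]
    return np.quantile(null, 1.0 - alpha)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
Y = rng.normal(1.5, 1.0, size=(100, 1))
c_alpha = permutation_threshold(X, Y)
stat = mmd2_biased(np.vstack([X, Y]), len(X))
# stat exceeds c_alpha: the test rejects H0, correctly detecting P != Q
```

Shuffling the pooled sample destroys any difference between the two groups, which is exactly why the permuted statistics trace out the null distribution.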
How to choose the best kernel: optimising the kernel parameters
Graphical illustration

Maximizing test power is the same as minimizing false negatives.

[Figure: null and alternative distributions with threshold $c_\alpha$; the power is the alternative's mass beyond $c_\alpha$.]
Optimizing kernel for test power

The power of our test ($\mathrm{Pr}_1$ denotes probability under $P \ne Q$):

$$ \mathrm{Pr}_1\!\left( n \widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \to \Phi\!\left( \underbrace{\frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}}}_{O(n^{1/2})} - \underbrace{\frac{\hat{c}_\alpha}{n \sqrt{V_n(P, Q)}}}_{O(n^{-1/2})} \right) $$

where $\Phi$ is the CDF of the standard normal distribution, and $\hat{c}_\alpha$ is an estimate of the test threshold $c_\alpha$.

Variance under $H_1$ decreases as $\sqrt{V_n(P, Q)} \sim O(n^{-1/2})$, so for large $n$ the second term is negligible!

To maximize test power, maximize

$$ \frac{\mathrm{MMD}^2(P, Q)}{\sqrt{V_n(P, Q)}} $$

(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., ICLR 2017)
Code: github.com/dougalsutherland/opt-mmd
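One way to see this criterion in action is to compare $\mathrm{MMD}^2 / \sqrt{V_n}$ across candidate bandwidths. A crude sketch (an assumption on my part: it uses the linear-time MMD estimator, whose variance is easy to estimate from the per-pair terms, rather than the quadratic-time statistic analysed in the paper):

```python
import numpy as np

def linear_mmd_ratio(X, Y, sigma):
    # Linear-time MMD: average of independent terms over disjoint pairs,
    # h = k(x, x') + k(y, y') - k(x, y') - k(x', y).
    def k(a, b):
        return np.exp(-np.sum((a - b) ** 2, axis=1) / (2 * sigma ** 2))
    x1, x2, y1, y2 = X[0::2], X[1::2], Y[0::2], Y[1::2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    # Power proxy: estimated MMD^2 over its standard deviation.
    return h.mean() / np.sqrt(h.var() + 1e-8)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
Y = rng.normal(0.5, 1.0, size=(2000, 1))
scores = {s: linear_mmd_ratio(X, Y, s) for s in (0.01, 1.0, 100.0)}
# A far-too-small bandwidth barely sees the difference between P and Q,
# so its score is much lower than a moderate bandwidth's.
```

The numerator and denominator both depend on the kernel, so the ratio, not the raw MMD, is the right objective for choosing kernel parameters.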
Troubleshooting for generative adversarial networks

[Figure: samples from a GAN vs MNIST samples, with the learned ARD map.]

Power for optimized ARD kernel: 1.00 at $\alpha = 0.01$
Power for optimized RBF kernel: 0.57 at $\alpha = 0.01$
Troubleshooting generative adversarial networks
Training GANs with MMD
What is a Generative Adversarial Network (GAN)?
Why is classification not enough?
MMD for GAN critic

Can you use MMD as a critic to train GANs?

From ICML 2015: [paper screenshot]  From UAI 2015: [paper screenshot]

Need better image features.
How to improve the critic witness

Add convolutional features! The critic (teacher) also needs to be trained. How to regularise?

- MMD GAN, Li et al. [NIPS 2017]
- Coulomb GAN, Unterthiner et al. [ICLR 2018]
WGAN-GP

Wasserstein GAN, Arjovsky et al. [ICML 2017]; WGAN-GP, Gulrajani et al. [NIPS 2017]

- Given a generator $G_\theta$ with parameters $\theta$ to be trained. Samples $Y \sim G_\theta(Z)$ where $Z \sim \mathcal{R}$ (the latent noise distribution).
- Given critic features $h_\psi$ with parameters $\psi$ to be trained; $f_\psi$ is a linear function of $h_\psi$.

WGAN-GP gradient penalty:

$$ \max_\psi \; E_{X \sim P}\, f_\psi(X) - E_{Z \sim \mathcal{R}}\, f_\psi(G_\theta(Z)) + \lambda\, E_{\tilde{X}} \left( \left\| \nabla_{\tilde{X}} f_\psi(\tilde{X}) \right\| - 1 \right)^2 $$

where

$$ \tilde{X} = \gamma x_i + (1 - \gamma)\, G_\theta(z_j), \qquad x_i \in \{x_\ell\}_{\ell=1}^{m}, \;\; z_j \in \{z_\ell\}_{\ell=1}^{n}, \;\; \gamma \sim U([0, 1]) $$
The (W)MMD

Train MMD critic features with the witness function gradient penalty. Binkowski, Sutherland, Arbel, G. [ICLR 2018]; Bellemare et al. [2017] for the energy distance:

$$ \max_\psi \; \mathrm{MMD}^2\!\left( h_\psi(X), h_\psi(G_\theta(Z)) \right) + \lambda\, E_{\tilde{X}} \left( \left\| \nabla_{\tilde{X}} f_\psi(\tilde{X}) \right\| - 1 \right)^2 $$

where $\tilde{X} = \gamma x_i + (1 - \gamma)\, G_\theta(z_j)$, $x_i \in \{x_\ell\}_{\ell=1}^{m}$, $z_j \in \{z_\ell\}_{\ell=1}^{n}$, $\gamma \sim U([0, 1])$.

Remark by Bottou et al. (2017): the gradient penalty modifies the function class, so the critic is not an MMD in the RKHS $\mathcal{F}$.
MMD for GAN critic: revisited

From ICLR 2018: [sample images]

Samples are better! Can we do better still?