Relative Goodness-of-Fit Tests for Models with Latent Variables
Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London
June 15, 2019
Model Criticism
Data: robbery events in Chicago in 2016.
Is this a good model?
Goal: test whether a (complicated) model fits the data.
Model Criticism "All models are wrong." G. Box (1976) 3/37
Relative model comparison
Have: two candidate models P and Q, and samples $\{x_i\}_{i=1}^n$ from a reference distribution R.
Goal: which of P and Q is better?
[Figure: samples from a GAN (Goodfellow et al., 2014) and from an LSGAN (Mao et al., 2017). Which model is better?]
Most interesting models have latent structure
[Figure: graphical model representation of hierarchical LDA with a nested CRP prior, Blei et al. (2003).]
Outline
Relative goodness-of-fit tests for models with latent variables:
- The kernel Stein discrepancy
  - Comparing two models via samples: MMD and the witness function.
  - Comparing a sample and a model: Stein modification of the witness class.
- Constructing a relative hypothesis test using the KSD
- Relative hypothesis tests with latent variables (new, unpublished)
Kernel Stein Discrepancy
Model P, data $\{x_i\}_{i=1}^n \sim Q$. "All models are wrong" ($P \neq Q$).
Integral probability metrics
Integral probability metric: find a "well-behaved" function $f(x)$ to maximize
$\mathbf{E}_Q f(Y) - \mathbf{E}_P f(X)$.
[Figure: a smooth witness function $f(x)$.]
All of kernel methods
Functions are linear combinations of features, with norm
$\|f\|_{\mathcal{F}}^2 := \sum_{\ell=1}^{\infty} f_\ell^2$.
All of kernel methods
"The kernel trick":
$f(x) = \sum_{\ell=1}^{\infty} f_\ell \varphi_\ell(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x)$,
where $f_\ell := \sum_{i=1}^{m} \alpha_i \varphi_\ell(x_i)$.
A function of infinitely many features expressed using m coefficients.
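As a minimal numerical sketch (not part of the original slides), the kernel trick in numpy: evaluating $f(x) = \sum_i \alpha_i k(x_i, x)$ with a Gaussian kernel. The centres xs, coefficients alpha, and the bandwidth are arbitrary illustrative assumptions.

import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2))."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def f(x, xs, alpha, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x): a function of infinitely
    many features, represented by only m coefficients."""
    return sum(a * gauss_kernel(xi, x, sigma) for a, xi in zip(alpha, xs))

xs = np.array([-2.0, 0.0, 3.0])      # m = 3 expansion points (illustrative)
alpha = np.array([0.5, -1.0, 0.8])   # m coefficients (illustrative)
print(f(1.0, xs, alpha))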
MMD as an integral probability metric
Maximum mean discrepancy: smooth witness function for P vs Q,
$\mathrm{MMD}(P, Q; \mathcal{F}) := \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ \mathbf{E}_P f(X) - \mathbf{E}_Q f(Y) \right]$.
[Figure: densities $p(x)$, $q(x)$ and the witness $f^*(x)$.]
For a characteristic RKHS $\mathcal{F}$, $\mathrm{MMD}(P, Q; \mathcal{F}) = 0$ iff $P = Q$.
Other choices for the witness function class:
- bounded continuous [Dudley, 2002]
- bounded variation 1 (Kolmogorov metric) [Müller, 1997]
- 1-Lipschitz (Wasserstein distances) [Dudley, 2002]
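A hedged sketch (my addition, not from the slides): the standard unbiased U-statistic estimate of $\mathrm{MMD}^2$ from samples of P and Q, using a Gaussian kernel; the bandwidth and sample distributions below are illustrative assumptions.

import numpy as np

def gauss_gram(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-(X[i] - Y[j])^2 / (2 sigma^2)), 1-D inputs."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2(P, Q) from X ~ P and Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = gauss_gram(X, X, sigma)
    Kyy = gauss_gram(Y, Y, sigma)
    Kxy = gauss_gram(X, Y, sigma)
    # Drop diagonal terms so each within-sample average is a U-statistic.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=500)   # samples from P
Y = rng.normal(0.5, 1.0, size=500)   # samples from Q
print(mmd2_unbiased(X, Y))           # clearly positive, since P != Q here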
Statistical model criticism: toy example
$\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{F}} \le 1} \left[ \mathbf{E}_q f - \mathbf{E}_p f \right]$
[Figure: densities $p(x)$, $q(x)$ and the witness $f^*(x)$.]
Can we compute the MMD with samples from Q and a model P?
Problem: we usually can't compute $\mathbf{E}_p f$ in closed form.
Stein idea
To get rid of $\mathbf{E}_p f$ in $\sup_{\|f\|_{\mathcal{F}} \le 1} [\mathbf{E}_q f - \mathbf{E}_p f]$, we define the (1-D) Stein operator
$[\mathcal{A}_p f](x) = \frac{1}{p(x)} \frac{d}{dx}\bigl(f(x)\, p(x)\bigr)$.
Then $\mathbf{E}_p \mathcal{A}_p f = 0$, subject to appropriate boundary conditions.
Gorham and Mackey (NeurIPS 2015), Oates, Girolami, Chopin (JRSS B 2016)
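Why the identity holds (a standard integration-by-parts step, added here for completeness; it assumes the boundary condition $f(x)\,p(x) \to 0$ as $x \to \pm\infty$):
$\mathbf{E}_p \left[ \mathcal{A}_p f \right] = \int \frac{1}{p(x)} \frac{d}{dx}\bigl(f(x)\,p(x)\bigr)\, p(x)\, dx = \int \frac{d}{dx}\bigl(f(x)\,p(x)\bigr)\, dx = \Bigl[ f(x)\,p(x) \Bigr]_{-\infty}^{\infty} = 0.$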
Kernel Stein Discrepancy
Stein operator:
$\mathcal{A}_p f = \frac{1}{p(x)} \frac{d}{dx}\bigl(f(x)\, p(x)\bigr)$
Kernel Stein discrepancy (KSD):
$\mathrm{KSD}_p(Q) = \sup_{\|g\|_{\mathcal{F}} \le 1} \left[ \mathbf{E}_q \mathcal{A}_p g - \mathbf{E}_p \mathcal{A}_p g \right] = \sup_{\|g\|_{\mathcal{F}} \le 1} \mathbf{E}_q \mathcal{A}_p g$,
since the second term vanishes: $\mathbf{E}_p \mathcal{A}_p g = 0$.
[Figures: densities $p(x)$, $q(x)$ and the Stein witness $g^*(x)$.]
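A hedged sketch (my addition): evaluating an unnormalised empirical Stein witness, $g^*(v) \propto \mathbf{E}_{x \sim q}\bigl[\bigl(\tfrac{d}{dx}\log p(x)\bigr) k(x, v) + \tfrac{d}{dx} k(x, v)\bigr]$, for a standard normal model p, estimated from samples of Q. The Gaussian kernel, bandwidth, and sample distribution are illustrative assumptions.

import numpy as np

def stein_witness(vs, xs, score_p, sigma=1.0):
    """Unnormalised Stein witness: g*(v) = mean over samples x of
    score_p(x) * k(x, v) + d/dx k(x, v), with a Gaussian kernel."""
    out = np.zeros_like(vs)
    for j, v in enumerate(vs):
        k = np.exp(-(xs - v) ** 2 / (2 * sigma ** 2))
        dk = -(xs - v) / sigma ** 2 * k        # derivative of k in x
        out[j] = np.mean(score_p(xs) * k + dk)
    return out

score_normal = lambda x: -x                    # d/dx log p for p = N(0, 1)
rng = np.random.default_rng(0)
xs = rng.normal(0.5, 1.0, size=1000)           # samples from a shifted q
vs = np.linspace(-4, 4, 9)
print(stein_witness(vs, xs, score_normal))     # large magnitude where p, q differ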
Simple expression using kernels
Rewrite the Stein operator as:
$[\mathcal{A}_p f](x) = \frac{1}{p(x)} \frac{d}{dx}\bigl(f(x)\, p(x)\bigr) = f(x) \frac{d}{dx} \log p(x) + \frac{d}{dx} f(x)$
Can we define "Stein features"?
$[\mathcal{A}_p f](x) = \Bigl(\frac{d}{dx} \log p(x)\Bigr) f(x) + \frac{d}{dx} f(x) =: \langle f, \xi(x) \rangle_{\mathcal{F}}$,
where $\xi(x)$ are the Stein features and $\mathbf{E}_{x \sim p}\, \xi(x) = 0$.
The kernel trick for derivatives
Reproducing property for the derivative: for differentiable $k(x, x')$,
$\frac{d}{dx} f(x) = \Bigl\langle f, \frac{d}{dx} \varphi(x) \Bigr\rangle_{\mathcal{F}}$.
Using the kernel derivative trick,
$[\mathcal{A}_p f](x) = \Bigl(\frac{d}{dx} \log p(x)\Bigr) f(x) + \frac{d}{dx} f(x) = \Bigl\langle f,\; \Bigl(\frac{d}{dx} \log p(x)\Bigr) \varphi(x) + \frac{d}{dx} \varphi(x) \Bigr\rangle_{\mathcal{F}} =: \langle f, \xi(x) \rangle_{\mathcal{F}}$.
Kernel Stein discrepancy: derivation
Closed-form expression for the KSD: given independent $x, x' \sim Q$,
$\mathrm{KSD}_p(Q) = \sup_{\|g\|_{\mathcal{F}} \le 1} \mathbf{E}_{x \sim q} [\mathcal{A}_p g](x) = \sup_{\|g\|_{\mathcal{F}} \le 1} \mathbf{E}_{x \sim q} \langle g, \xi_x \rangle_{\mathcal{F}} = \sup_{\|g\|_{\mathcal{F}} \le 1} \langle g, \mathbf{E}_{x \sim q}\, \xi_x \rangle_{\mathcal{F}} \overset{(a)}{=} \bigl\| \mathbf{E}_{x \sim q}\, \xi_x \bigr\|_{\mathcal{F}}$.
Caution: $(a)$ requires a condition for the Riesz theorem to hold,
$\mathbf{E}_{x \sim q} \Bigl[ \Bigl( \frac{d}{dx} \log p(x) \Bigr)^2 \Bigr] < \infty$.
Chwialkowski, Strathmann, G. (ICML 2016); Liu, Lee, Jordan (ICML 2016)
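A hedged sketch (my addition): the squared KSD can be estimated from samples of Q alone via the Stein kernel $h_p(x, x') = \langle \xi_x, \xi_{x'} \rangle_{\mathcal{F}}$, written out in terms of the score $s_p = \frac{d}{dx}\log p$ and the kernel derivatives, as in the U-statistic form of Chwialkowski et al. and Liu et al. The Gaussian kernel, bandwidth, and standard normal model p are illustrative assumptions.

import numpy as np

def ksd2_u(xs, score_p, sigma=1.0):
    """Unbiased U-statistic estimate of KSD^2, given model score s_p = d/dx log p.
    Stein kernel: h_p(x, y) = s_p(x) s_p(y) k + s_p(x) dk/dy + s_p(y) dk/dx
                              + d^2 k / (dx dy), with a Gaussian kernel k."""
    d = xs[:, None] - xs[None, :]                          # d[i, j] = x_i - x_j
    K = np.exp(-d ** 2 / (2 * sigma ** 2))
    dKx = -d / sigma ** 2 * K                              # d/dx_i of K
    dKy = d / sigma ** 2 * K                               # d/dx_j of K
    dKxy = (1.0 / sigma ** 2 - d ** 2 / sigma ** 4) * K    # d^2/(dx_i dx_j) of K
    s = score_p(xs)
    H = (s[:, None] * s[None, :] * K + s[:, None] * dKy
         + s[None, :] * dKx + dKxy)
    n = len(xs)
    return (H.sum() - np.trace(H)) / (n * (n - 1))         # drop diagonal: U-statistic

score_normal = lambda x: -x                                # model p = N(0, 1)
rng = np.random.default_rng(0)
print(ksd2_u(rng.normal(0.0, 1.0, 2000), score_normal))    # near 0: q = p
print(ksd2_u(rng.normal(1.0, 1.0, 2000), score_normal))    # positive: q != p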
The witness function: Chicago crime
Model p = a 10-component Gaussian mixture.
[Figure: the Stein witness function g shows where the model and data mismatch.]
Does the Riesz condition matter?
Consider the standard normal,
$p(x) = \frac{1}{\sqrt{2\pi}} \exp\bigl(-x^2/2\bigr)$.
Then $\frac{d}{dx} \log p(x) = -x$. If q is a Cauchy distribution, then the integral
$\mathbf{E}_{x \sim q} \Bigl[ \Bigl( \frac{d}{dx} \log p(x) \Bigr)^2 \Bigr] = \int_{-\infty}^{\infty} x^2\, q(x)\, dx$
is undefined.
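A hedged numerical illustration (my addition): under Cauchy samples, the empirical estimate of $\mathbf{E}_q[x^2]$, i.e. of $\mathbf{E}_q[(\frac{d}{dx}\log p(x))^2]$ for the standard normal model, does not stabilise as n grows, unlike under a Gaussian q.

import numpy as np

rng = np.random.default_rng(0)
for n in [10**3, 10**5, 10**7]:
    x_gauss = rng.normal(size=n)
    x_cauchy = rng.standard_cauchy(n)
    # Empirical second moment: settles near 1 for the Gaussian,
    # jumps around without converging for the Cauchy (integral undefined).
    print(n, np.mean(x_gauss ** 2), np.mean(x_cauchy ** 2))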