Comparing distributions: ℓ1 geometry improves kernel two-sample testing

M. Scetbon (1, 2), G. Varoquaux (1)
(1) Inria, Université Paris-Saclay; (2) CREST, ENSAE

December 12, 2019
Two collections of samples X, Y from unknown distributions P and Q.

[Figure: example samples, McDonald's vs. KFC]

Problem: are the two sets of observations X and Y drawn from the same distribution?
Two-Sample Test

Test the null hypothesis $H_0: P = Q$ against $H_1: P \neq Q$.

Samples: $X = \{x_i\}_{i=1}^{n} \sim P$ and $Y = \{y_i\}_{i=1}^{n} \sim Q$.
Gaussian kernel: $k_\sigma(x, y) = \exp\!\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$

Empirical mean embeddings of P and Q:
$$\hat{\mu}_P(T) = \frac{1}{n}\sum_{i=1}^{n} k(x_i, T), \qquad \hat{\mu}_Q(T) = \frac{1}{n}\sum_{j=1}^{n} k(y_j, T)$$

[Figure: the two embeddings $\hat{\mu}_P$ and $\hat{\mu}_Q$]
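A minimal numeric sketch of the above (not the authors' code; the bandwidth σ, the sample distributions, and all names are illustrative assumptions):

```python
# Empirical mean embeddings under a Gaussian kernel, evaluated at a
# location T: mu_hat(T) = (1/n) sum_i k_sigma(x_i, T).
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k_sigma(x, y) = exp(-||x - y||_2^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mean_embedding(sample, T, sigma=1.0):
    # Average the kernel between every sample point and the location T.
    return np.mean([gaussian_kernel(x, T, sigma) for x in sample])

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))  # sample from P (illustrative)
Y = rng.normal(0.5, 1.0, size=(100, 2))  # sample from Q (illustrative)
T = rng.normal(size=2)                   # one evaluation location
print(mean_embedding(X, T), mean_embedding(Y, T))
```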
Absolute difference of the mean embeddings:
$$\hat{S}(T) = |\hat{\mu}_P(T) - \hat{\mu}_Q(T)|$$

Test locations: $(T_j)_{j=1}^{J} \sim \Gamma$

[Figure: the witness $|\hat{\mu}_P - \hat{\mu}_Q|$ evaluated at the test locations $T_1, \dots, T_J$]
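Continuing the sketch above (here Γ is taken to be a standard normal purely for illustration; the choice of Γ is a design parameter of the test):

```python
# Draw J test locations T_1..T_J from Gamma and evaluate the witness
# S_hat(T_j) = |mu_P_hat(T_j) - mu_Q_hat(T_j)| at each of them.
J = 5
locations = rng.normal(size=(J, 2))      # (T_j)_{j=1..J} ~ Gamma
S = np.array([abs(mean_embedding(X, t) - mean_embedding(Y, t))
              for t in locations])
print(S)  # large entries flag locations where P and Q seem to differ
```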
Test statistic (1) with $p \geq 1$:
$$\hat{d}_{\ell_p,\mu,J}(X, Y)^p := n^{p/2} \sum_{j=1}^{J} |\hat{\mu}_P(T_j) - \hat{\mu}_Q(T_j)|^p$$

These statistics are derived from metrics which metrize weak convergence:
$$d_{L^p,\mu}(P, Q) := \left(\int_{t \in \mathbb{R}^d} |\mu_P(t) - \mu_Q(t)|^p \, d\Gamma(t)\right)^{1/p}$$

Theorem (weak convergence): $\alpha_n \xrightarrow{D} \alpha \iff d_{L^p,\mu}(\alpha_n, \alpha) \to 0$

(1) The case p = 2 has been studied by [1, 2].
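A self-contained sketch of the statistic as written above (a sketch, not the authors' implementation; the bandwidth σ, the location distribution Γ, and J = 5 are illustrative choices):

```python
# d_hat_{ell_p,mu,J}(X, Y)^p = n^{p/2} * sum_j |mu_P_hat(T_j) - mu_Q_hat(T_j)|^p
import numpy as np

def statistic(X, Y, T, p=1, sigma=1.0):
    def emb(sample):
        # Vectorized mean embedding evaluated at every location in T.
        d2 = np.sum((sample[:, None, :] - T[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * sigma ** 2)).mean(axis=0)
    n = X.shape[0]
    return n ** (p / 2) * np.sum(np.abs(emb(X) - emb(Y)) ** p)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
T = rng.normal(size=(5, 2))              # J = 5 locations from Gamma
print(statistic(X, Y, T, p=1), statistic(X, Y, T, p=2))
```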
[Figure: a sequence $(\alpha_n)$ converging weakly to $\alpha$: $d_{L^p,\mu}(\alpha_n, \alpha) \to 0$ even though $TV(\alpha_n, \alpha) = 2$]
Test of level $\alpha$: compute $\hat{d}_{\ell_p,\mu,J}(X, Y)^p$ and reject $H_0$ if $\hat{d}_{\ell_p,\mu,J}(X, Y)^p > T_{\alpha,p}$, the $1 - \alpha$ quantile of the asymptotic null distribution.

Proposition (ℓ1 geometry improves power): let $\delta > 0$. Under the alternative hypothesis $H_1$, almost surely there exists $N \geq 1$ such that for all $n \geq N$, with probability $1 - \delta$:
$$\hat{d}_{\ell_2,\mu,J}(X, Y)^2 > T_{\alpha,2} \;\Rightarrow\; \hat{d}_{\ell_1,\mu,J}(X, Y) > T_{\alpha,1}$$
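A sketch of the decision rule, reusing statistic, X, Y, and T from the previous sketch. The slides use the 1 - α quantile of the asymptotic null distribution for $T_{\alpha,p}$; here a permutation estimate of the null quantile stands in as a common practical substitute (an assumption, not the slides' exact recipe):

```python
# Reject H0 iff the observed statistic exceeds an estimated T_{alpha,p}.
def two_sample_test(X, Y, T, p=1, alpha=0.05, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    pooled, n = np.vstack([X, Y]), X.shape[0]
    null = []
    for _ in range(n_perm):
        # Shuffle the pooled sample to simulate the null P = Q.
        perm = pooled[rng.permutation(len(pooled))]
        null.append(statistic(perm[:n], perm[n:], T, p))
    threshold = np.quantile(null, 1 - alpha)  # stand-in for T_{alpha,p}
    return statistic(X, Y, T, p) > threshold

print(two_sample_test(X, Y, T, p=1))  # True -> reject H0
```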
Conclusion

Under the alternative hypothesis, an analytic kernel (e.g. the Gaussian kernel) guarantees dense differences between $\hat{\mu}_P$ and $\hat{\mu}_Q$.

The ℓ1 geometry better captures these dense differences.

We have also considered statistics based on smooth characteristic functions and obtained similar results.

Finally, we have normalized the tests to obtain a simple null distribution and to learn the locations where the distributions differ the most.

Poster: East Exhibition Hall B + C, #6
References

[1] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pages 1981–1989, 2015.

[2] W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pages 181–189, 2016.