 
              Measuring Dependence and Conditional Dependence with Kernels Kenji Fukumizu The Institute of Statistical Mathematics, Japan June 25, 2014. ICML 2014, Causality Workshop 1
2 Introduction
Dependence Measures  Dependence measures and causality – Constraint methods for causal structure learning are based on measuring or testing (conditional) dependence. e.g. PC Algorithm (Spirtes et al. 1991, 2001) (Conditional) independence tests with � � -tests. 1 � � � � � � � � � � ∣ � � 2 3 � � � � � ∣ � � 4 etc. 3
 Problems – Tests for structure learning may involve many variables. – (Conditional) independence test for continuous, high-dimensional domains are not easy. • Discretization causes many bins, requiring a large data size. • Nonparametric methods are often weak for high-dimensionality. KDE, smoothing kernel, ... 9 – Linear correlations may not be sufficient for 8 7 complex relations. 6 5 4 3 2 1 0 -1 -3 -2 -1 0 1 2 3 4
 This talk – As building blocks of causal learning, kernel methods for measuring (in)dependence and conditional (in)dependence are discussed. 5
Outline 1. Introduction 2. Kernel measures for independence 3. Relations with distance covariance 4. How to choose a kernel 5. Conditional independence 6. Conclusions 6
7 Kernel measures for independence
Kernel methods  Feature map and kernel methods H Φ � �  Φ � � x i feature map  x j Space of original data Feature space (RKHS) Do linear analysis in the feature space. – Feature map Φ: Ω → �, � ↦ Φ��� Feature vectors � � … , � � ↦ Φ � � , … , Φ�� � � 8
 Do kernel methods work well for high dimensional data? – Empirical comparison: pos. def. kernel and smoothing kernel Nonparametric regression � � 1/ 1.5 � | �| � � �, � ~ � 0, � � , �~� 0, 0.1 � 0.015 • Kernel ridge regression Kernel method Local linear (Gaussian kernel) • Local linear regression Mean square errors 0.01 (Epanechnikov kernel, ‘locfit’ in R is used) � = 100, 500 runs 0.005 Bandwidth parameters are chosen by CV. 0 0 2 4 6 8 10 Dimension of X – Theory? 9
Representing probabilities � : random variable taking values on Ω . � : pos. def. kernel on Ω . Feature map defines a RKHS-valued random variable Φ��� . The kernel mean ��Φ � � represents the probability distribution of � . � � ≔ � Φ � � � � ⋅, � ����� – Kernel mean can express higher-order moments of � . Suppose � �, � � � � � � � �� � � � �� � � ⋯ e.g., � �� � � � 0 , � � � � � � � � � � � � � � � � � � � � � ⋯ c.f . moment generating function 10
Comparing two probabilities  MMD (Maximum Mean Discrepancy, Gretton et al 2005 ) � ∼ � , � ∼ � (two probabilities on Ω ). � : pos. def. kernel on Ω . MMD � �, � ≔ � � � � � � � � � � � � , � � � � sup � ��,�∈� � � sup � � � � � � � � ��,�∈� Comparing the moments through various functions – Characteristic kernels are defined so that MMD �, � � 0 if and only if � � � . e.g. Gaussian and Laplace kernels Kernel mean � � determines the distribution of � uniquely. MMD is a metric on the probabilities. 11
HSIC: Independence measure  Hilbert-Schmidt Independence Criterion (HSIC) ( X , Y ) : random vector taking values on  X �  Y . ( H X , k X ), ( H Y , k Y ): RKHS on  X and  Y , resp. Compare the joint probability � �� and the product of the marginal � � � � HSIC( � , � ) ≔ MMD � � �� , � � � Def. � � � � �� � � � ⊗ � � � � ⊗� � Theorem Assume: product kernel � � � � is characteristic on Ω � � Ω � . HSIC( � , � ) = 0 if and only if � � � 12
Covariance operator Operator expression: � �� � � � ⊗ � � , � ⊗ � � � ⊗� � � � � � � � � � � � ��� � � Def. covariance operators Σ �� : � � → � � , Σ �� : � � → � � �, Σ �� � � � � � � � � � � � � � ��� � � �∀� ∈ � � , � ∈ � � � �∀�, � ∈ � � � �, Σ �� � � � � � � � � � � � � � ��� � � Simply, extension of covariance matrix (linear map) �� � � �� � � � � ��� � � , � � � �� � � � � � � ⋅ � � � � � � � ����� � � �  X  X ( X )  Y ( Y )  Y   X  Y YX X Y 13 H X H Y
Expressions of HSIC � – HSIC �, � � Σ �� Hilbert-Schmidt norm �� (same as Frobenius norm) � ≔ ∑ ∑ � � , �� � � � �� � � �: � → �. � � � , � � � : ONB of � and � , (resp). HSIC��, �� � � � � �, � � � � �, � � � 2� � � �, � � � � �, � �� – ��� � �, � � � �� � � �, � � � � , � � , � �� , � �� : independent copies of �, � . – Empirical estimator (Gram matrix expression) HSIC ��� �, � � 1 � � ���� � � � � � � � �  Test statistic � � (centering) � �,�� � � � � � , � � , � �,�� � � � � � , � � , � � ≔ � � � � � � � � Given � � , � � , , … , � � , � � ~ � �� , i.i.d., 14
Independence test with HSIC Theorem: null distribution (Gretton, Fukumizu, etc. NIPS2007) If X and Y are independent, then law � � � HSIC ��� �, � ⟹ ∑ � � � � � → ∞ . ��� where Z i : i.i.d. ~ N (0,1), � � � ��� is the eigenvalues of an integral operator. Theorem: consistency of test (Gretton, Fukumizu, etc. NIPS2007) If ���� �, � � 0 , then law � HSIC ��� ��, �� � HSIC��, �� ⇒ ��0, � � � � → ∞ . where        2 2 16 E E [ h ( U , U , U , U ] M a b , c , d a b c d YX 15
 Independent test with HSIC: – How to compute the critical region given significance level. • Simulation of the null distribution ( Gretton, Fukumizu et al NIPS2009 ). The eigenvalues can be estimated with the Gram matrices. • Approximation with two-parameter Gamma by moment matching ( Gretton, Fukumizu et al NIPS2007 ). • Permutation test / Bootstrap Always possible, but time consuming. 16
Experiments: independence test X, Y: 1 dim + noise components HSIC (Gamma approx.) Power divergence ( � � 3/2 ) with discretization (equi-probable) 17 Type II errors
 Power divergence Each dimension is partitioned into � parts. Partitions � � �∈� . ( � � � � ) � � �̂ � 2� ��� ≔ � � � � 2 � �̂ � � � 1 � ��� �̂ � � �∈� ��� �̂ � : frequency in � � ��� : marginal frequency in k-th dimension �̂ � � ��� ⇒ � � � ������� � � � � � 0 : Mutual information � � 2 : � � -divergence (mean square contingency) 18
19 Relation to distance covariance
Distance covariance – Distance covariance (distance correlation) is a recent measure of independence for continuous variables (Székely, Rizzo, Bakirov, AoS 2007) . It is very popular among statistical community. – HSIC is closely related to (more general than, in fact) dCov.  Distance covariance Def. �, � : random vectors (on Euclidean spaces) dCov � �, � ≔ � � � � � � � � � � � � � � � � �� � 2� � � � � � � � � �� � . � � , � � , � �� , � �� : independent copies of �, � . Note: � � � � is NOT positive definite. 20
For be a semi-metric � on Ω , ( � �, � � � � � � , � , and � �, � � � 0 with equality � � � � ), define generalized distance covariance by �, � ≔ � � � �, � � � � ��, � � � � 2� � � �, � � � � ��, � �� � � dCov � � ,� � �� � � ��, � � � � � � ��, � � � . Theorem (Sejdinovic et al. AoS 2013). Assume � is of negative type, i.e., � � ∑ for any �� � � with ∑ � � � � � , � � � 0 � � � 0 . ��� ��� Then, � �, � � ≔ � �� �, � � � � � � , � � � � �, � � � is positive � definite, and with � � and � � induced by � � and � � , resp., HSIC �, � � dCov � � ,� � ��, �� Example: � �, � � � � � � � � ( 0 � � � 2 ), � � �, � � � � � � � � � � � � � � � � � � � � �, � � � � � � � � � � � � � HSIC �, � � dCov � � � � � � � � � �� � � � � � � � � � � � � � � . �2� 21
Experiments � �, � � � � � � � � (B) (A) � �, � ∝ 1 � sin ℓ� sin �ℓ�� independent harder dependent easier 22
23 How to choose a kernel
Recommend
More recommend