Measuring Dependence and Conditional Dependence with Kernels
Kenji Fukumizu, The Institute of Statistical Mathematics, Japan
June 25, 2014. ICML 2014, Causality Workshop
Introduction
Dependence Measures
Dependence measures and causality
– Constraint-based methods for causal structure learning rely on measuring or testing (conditional) dependence, e.g., the PC Algorithm (Spirtes et al. 1991, 2001).
– These use (conditional) independence tests, e.g., χ²-tests of statements such as X_i ⊥ X_j and X_i ⊥ X_j ∣ X_k, etc.
[diagram: small causal graph over variables 1–4 illustrating the tested (conditional) independences]
Problems
– Tests for structure learning may involve many variables.
– (Conditional) independence tests for continuous, high-dimensional domains are not easy.
  • Discretization produces many bins, requiring a large data size.
  • Nonparametric methods (KDE, smoothing kernels, ...) are often weak in high dimensions.
– Linear correlation may not be sufficient for complex relations.
[scatter plot: a nonlinearly dependent (X, Y) sample whose linear correlation is close to zero]
This talk
– As building blocks of causal learning, kernel methods for measuring (in)dependence and conditional (in)dependence are discussed.
Outline
1. Introduction
2. Kernel measures for independence
3. Relation to distance covariance
4. How to choose a kernel
5. Conditional independence
6. Conclusions
Kernel measures for independence
Kernel methods
Feature map and kernel methods
[diagram: feature map Φ sending data points x_i, x_j in the space of original data to Φ(x_i), Φ(x_j) in the feature space H (RKHS)]
– Feature map Φ : Ω → H, x ↦ Φ(x).
– Data X_1, …, X_n ↦ feature vectors Φ(X_1), …, Φ(X_n).
– Do linear analysis in the feature space.
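A minimal Python/NumPy sketch (not from the slides) of the basic object behind this picture: the Gram matrix of a Gaussian kernel, whose entries are the inner products of the implicit feature vectors Φ(X_i). The function name and bandwidth value are illustrative assumptions.

import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = k(X[i], X[j]) = exp(-||X[i] - X[j]||^2 / (2 sigma^2)).

    These are the inner products <Phi(X_i), Phi(X_j)> of the implicit RKHS feature vectors."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma**2))

# Example: 5 points in R^3; K is 5 x 5, symmetric, positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = gaussian_gram(X)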
Do kernel methods work well for high-dimensional data?
– Empirical comparison: positive definite kernel vs. smoothing kernel.
  Nonparametric regression: Y = 1 / (1.5 + |X|) + Z, X ~ N(0, I_d), Z ~ N(0, 0.1²).
  • Kernel ridge regression (Gaussian kernel)
  • Local linear regression (Epanechnikov kernel, 'locfit' in R)
  n = 100, 500 runs; bandwidth parameters chosen by CV.
[plot: mean square errors vs. dimension of X for the kernel method and local linear regression]
– Theory?
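A hedged sketch of the kernel side of this comparison: Gaussian-kernel ridge regression on a toy version of the experiment. The exact target function on the slide is only partly recoverable, and the fixed bandwidth and regularization parameter (instead of CV) are assumptions made for illustration.

import numpy as np

def kernel_ridge_fit_predict(Xtr, ytr, Xte, sigma=1.0, lam=1e-3):
    """Kernel ridge regression with a Gaussian kernel: f(x) = k(x, Xtr) (K + n lam I)^{-1} y."""
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    n = len(Xtr)
    alpha = np.linalg.solve(gram(Xtr, Xtr) + n * lam * np.eye(n), ytr)
    return gram(Xte, Xtr) @ alpha

# Toy version of the experiment: y = 1 / (1.5 + |x|) + noise, d-dimensional X.
rng = np.random.default_rng(1)
d, n = 5, 100
Xtr = rng.normal(size=(n, d))
Xte = rng.normal(size=(500, d))
f = lambda X: 1.0 / (1.5 + np.linalg.norm(X, axis=1))
ytr = f(Xtr) + 0.1 * rng.normal(size=n)
mse = np.mean((kernel_ridge_fit_predict(Xtr, ytr, Xte) - f(Xte))**2)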
Representing probabilities
X: random variable taking values in Ω. k: positive definite kernel on Ω.
The feature map defines an RKHS-valued random variable Φ(X). The kernel mean E[Φ(X)] represents the probability distribution of X:
  m_X := E[Φ(X)] = ∫ k(·, x) dP_X(x)
– The kernel mean can express higher-order moments of X. Suppose
  k(u, x) = c_0 + c_1 ux + c_2 (ux)² + c_3 (ux)³ + ⋯   (c_i > 0), e.g., exp(ux).
Then
  m_X(u) = c_0 + c_1 E[X] u + c_2 E[X²] u² + c_3 E[X³] u³ + ⋯
c.f. moment generating function.
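A small sketch (illustrative names, Gaussian kernel assumed) of the empirical kernel mean m̂_X = (1/n) Σ_i k(·, X_i), evaluated at a point u:

import numpy as np

def empirical_kernel_mean(X, sigma=1.0):
    """Return the function u -> m_hat(u) = (1/n) sum_i k(u, X_i)  (Gaussian kernel)."""
    def m_hat(u):
        d2 = np.sum((X - u)**2, axis=1)
        return np.mean(np.exp(-d2 / (2 * sigma**2)))
    return m_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))          # sample from P_X
m_hat = empirical_kernel_mean(X)
value = m_hat(np.array([0.5]))         # estimate of m_X(0.5) = E[k(0.5, X)]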
Comparing two probabilities
MMD (Maximum Mean Discrepancy, Gretton et al. 2005)
X ~ P, Y ~ Q (two probabilities on Ω). k: positive definite kernel on Ω.
  MMD(P, Q) := ‖m_X − m_Y‖_H = sup_{‖f‖_H ≤ 1} ( E[f(X)] − E[f(Y)] )
Comparing the moments through various functions f.
– Characteristic kernels are defined so that MMD(P, Q) = 0 if and only if P = Q, e.g., Gaussian and Laplace kernels.
  The kernel mean m_X then determines the distribution of X uniquely, and MMD is a metric on the probabilities.
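A plug-in estimator of MMD² obtained by replacing the kernel means with their empirical versions. This is a sketch under assumed names and a fixed Gaussian bandwidth; it is the biased (V-statistic) form, not the unbiased estimator of the paper.

import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Plug-in estimate of MMD^2(P, Q) = E k(X, X') + E k(Y, Y') - 2 E k(X, Y)."""
    return (gaussian_gram(X, X, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean()
            - 2 * gaussian_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 1))
Y = rng.normal(0.5, 1.0, size=(300, 1))   # shifted distribution -> MMD^2 clearly positive
print(mmd2(X, Y))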
HSIC: Independence measure
Hilbert-Schmidt Independence Criterion (HSIC)
(X, Y): random vector taking values in 𝒳 × 𝒴. (H_X, k_X), (H_Y, k_Y): RKHSs on 𝒳 and 𝒴, resp.
Compare the joint probability P_XY and the product of the marginals P_X ⊗ P_Y.
Def.  HSIC(X, Y) := MMD²(P_XY, P_X ⊗ P_Y) = ‖m_XY − m_X ⊗ m_Y‖²_{H_X ⊗ H_Y}
Theorem. Assume the product kernel k_X k_Y is characteristic on Ω_X × Ω_Y. Then
  HSIC(X, Y) = 0 if and only if X ⊥ Y.
Covariance operator
Operator expression: ⟨m_XY, f ⊗ g⟩_{H_X ⊗ H_Y} = E[f(X) g(Y)], so m_XY can be identified with an operator between the RKHSs.
Def. Covariance operators Σ_YX : H_X → H_Y and Σ_XY : H_Y → H_X:
  ⟨g, Σ_YX f⟩_{H_Y} = E[f(X) g(Y)] − E[f(X)] E[g(Y)] = Cov[f(X), g(Y)]   (∀ f ∈ H_X, ∀ g ∈ H_Y),
and analogously for Σ_XY. Equivalently, Σ_YX f = E[ k_Y(·, Y) f(X) ] − E[ k_Y(·, Y) ] E[ f(X) ].
Simply an extension of the covariance matrix (linear map) V_YX = E[Y Xᵀ] − E[Y] E[X]ᵀ.
[diagram: Σ_YX mapping the feature space H_X of Φ_X(X) to the feature space H_Y of Φ_Y(Y)]
Expressions of HSIC
– HSIC(X, Y) = ‖Σ_YX‖²_HS
  Hilbert-Schmidt norm (same as Frobenius norm): ‖A‖²_HS := Σ_i Σ_j ⟨ψ_j, A φ_i⟩², A : H_1 → H_2, {φ_i}, {ψ_j}: ONBs of H_1 and H_2 (resp.).
– Population expression:
  HSIC(X, Y) = E[k_X(X, X') k_Y(Y, Y')] − 2 E[k_X(X, X') k_Y(Y, Y'')] + E[k_X(X, X')] E[k_Y(Y, Y')],
  where (X', Y'), (X'', Y'') are independent copies of (X, Y).
– Empirical estimator (Gram matrix expression): given (X_1, Y_1), …, (X_n, Y_n) ~ P_XY, i.i.d.,
  HSIC_emp(X, Y) = (1/n²) Tr[ G_X G_Y ]   (used as the test statistic),
  with centered Gram matrices G_X = H K_X H, G_Y = H K_Y H, where (K_X)_{ij} = k_X(X_i, X_j), (K_Y)_{ij} = k_Y(Y_i, Y_j), and H = I_n − (1/n) 1 1ᵀ (centering).
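The Gram-matrix expression translates directly into code. A sketch with assumed function names and Gaussian kernels with fixed bandwidths:

import numpy as np

def gaussian_gram(A, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_emp(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC: (1/n^2) Tr[G_X G_Y] with G = H K H, H = I - (1/n) 1 1^T."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    GX = H @ gaussian_gram(X, sigma_x) @ H
    GY = H @ gaussian_gram(Y, sigma_y) @ H
    return np.trace(GX @ GY) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y = X**2 + 0.1 * rng.normal(size=(200, 1))   # nonlinearly dependent, nearly uncorrelated
print(hsic_emp(X, Y))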
Independence test with HSIC
Theorem: null distribution (Gretton, Fukumizu, et al., NIPS 2007)
If X and Y are independent, then, in law,
  n · HSIC_emp(X, Y) ⟹ Σ_{i=1}^∞ λ_i Z_i²   (n → ∞),
where Z_i: i.i.d. ~ N(0, 1) and {λ_i}_{i=1}^∞ are the eigenvalues of an integral operator.
Theorem: consistency of the test (Gretton, Fukumizu, et al., NIPS 2007)
If HSIC(X, Y) > 0, then, in law,
  √n ( HSIC_emp(X, Y) − HSIC(X, Y) ) ⟹ N(0, σ²)   (n → ∞),
where σ² = 16 ( E_a[ ( E_{b,c,d}[ h(U_a, U_b, U_c, U_d) ] )² ] − HSIC(X, Y)² ), with U_i = (X_i, Y_i) and h the kernel of the corresponding U-statistic.
Independence test with HSIC
– How to compute the critical region for a given significance level:
  • Simulation of the null distribution (Gretton, Fukumizu, et al., NIPS 2009): the eigenvalues can be estimated from the Gram matrices.
  • Approximation by a two-parameter Gamma distribution via moment matching (Gretton, Fukumizu, et al., NIPS 2007).
  • Permutation test / bootstrap: always possible, but time consuming (sketched below).
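A sketch of the third option, a permutation test. It accepts any dependence statistic, e.g., the hsic_emp function from the previous sketch; the number of permutations and the rejection rule are illustrative choices.

import numpy as np

def permutation_test(X, Y, statistic, n_perm=500, alpha=0.05, seed=0):
    """Permutation test of independence: permuting Y simulates the null H0: X independent of Y.

    Rejects when the observed statistic exceeds the (1 - alpha) quantile of the
    permutation distribution."""
    rng = np.random.default_rng(seed)
    stat = statistic(X, Y)
    null = np.array([statistic(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= stat)) / (1 + n_perm)
    return stat, p_value, p_value < alpha

# Example (with hsic_emp from the previous sketch):
# stat, p, reject = permutation_test(X, Y, hsic_emp, n_perm=500, alpha=0.05)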
Experiments: independence test
X, Y: 1 dim + noise components.
– HSIC (Gamma approximation)
– Power divergence (λ = 3/2) with discretization (equi-probable bins)
[plots: Type II errors of the two tests]
Power divergence
Each dimension is partitioned into q parts, giving a partition {A_j}_{j∈J} of the domain (|J| = q^m for m variables).
  T_λ := (2n / (λ(λ + 1))) Σ_{j∈J} p̂_j [ ( p̂_j / Π_k p̂_j^{(k)} )^λ − 1 ],
where p̂_j is the relative frequency in cell A_j and p̂_j^{(k)} is the corresponding marginal relative frequency in the k-th dimension.
Under independence, T_λ converges to a χ² distribution (asymptotically).
  λ → 0: mutual information;   λ = 1: χ²-divergence (mean square contingency).
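A hedged sketch of the discretized statistic for two one-dimensional variables, using equi-probable (empirical-quantile) bins. The binning details and the default λ are assumptions for illustration, not necessarily the exact setting of the slide.

import numpy as np

def power_divergence_stat(X, Y, q=4, lam=1.0):
    """Discretized power-divergence statistic for independence of two 1-d variables.

    Each variable is cut into q equi-probable bins (empirical quantiles); p_hat is the
    joint cell frequency and the product of the marginal frequencies plays the role of
    the null model:  T = 2n / (lam (lam + 1)) * sum p_hat [ (p_hat / prod marginals)^lam - 1 ]."""
    n = len(X)
    bx = np.searchsorted(np.quantile(X, np.linspace(0, 1, q + 1)[1:-1]), X)
    by = np.searchsorted(np.quantile(Y, np.linspace(0, 1, q + 1)[1:-1]), Y)
    joint = np.zeros((q, q))
    np.add.at(joint, (bx, by), 1.0)
    joint /= n
    expect = joint.sum(1, keepdims=True) * joint.sum(0, keepdims=True)  # product of marginals
    mask = joint > 0
    return 2 * n / (lam * (lam + 1)) * np.sum(joint[mask] * ((joint[mask] / expect[mask])**lam - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=500)
Y = X + 0.5 * rng.normal(size=500)
print(power_divergence_stat(X, Y, q=4, lam=1.0))   # large value indicates dependence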
Relation to distance covariance
Distance covariance
– Distance covariance (distance correlation) is a recent measure of independence for continuous variables (Székely, Rizzo, Bakirov, AoS 2007). It is very popular in the statistics community.
– HSIC is closely related to (in fact, more general than) dCov.
Def. Distance covariance: for random vectors X, Y on Euclidean spaces,
  dCov²(X, Y) := E[ |X − X'| |Y − Y'| ] + E[ |X − X'| ] E[ |Y − Y'| ] − 2 E[ |X − X'| |Y − Y''| ],
where (X', Y'), (X'', Y'') are independent copies of (X, Y).
Note: |x − y| is NOT positive definite (as a kernel).
For a semi-metric ρ on Ω (ρ(x, y) = ρ(y, x), and ρ(x, y) ≥ 0 with equality iff x = y), define the generalized distance covariance by
  dCov²_{ρ_X, ρ_Y}(X, Y) := E[ ρ_X(X, X') ρ_Y(Y, Y') ] + E[ ρ_X(X, X') ] E[ ρ_Y(Y, Y') ] − 2 E[ ρ_X(X, X') ρ_Y(Y, Y'') ].
Theorem (Sejdinovic et al., AoS 2013). Assume ρ is of negative type, i.e., Σ_{i,j} c_i c_j ρ(x_i, x_j) ≤ 0 for any (x_i) and (c_i) with Σ_i c_i = 0. Then
  k(x, y) := ½ ( ρ(x, x_0) + ρ(y, x_0) − ρ(x, y) )
is positive definite, and with k_X and k_Y induced by ρ_X and ρ_Y, resp.,
  HSIC(X, Y) = ¼ dCov²_{ρ_X, ρ_Y}(X, Y).
Example: ρ(x, y) = |x − y|^α (0 < α ≤ 2), k_α(x, y) = ½ ( |x|^α + |y|^α − |x − y|^α ); HSIC with (k_α, k_α) recovers dCov²_α(X, Y) up to this constant factor.
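A small numerical check of this equivalence for one-dimensional variables (α = 1): with the distance-induced kernel k(x, y) = ½(|x| + |y| − |x − y|) (base point x_0 = 0), the centered Gram-matrix HSIC and the double-centered distance covariance agree up to the factor 4. This is an illustrative sketch, not code from the talk.

import numpy as np

def double_centered(D):
    """A_ij = D_ij - row mean - column mean + grand mean (i.e., H D H)."""
    return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()

def dcov2_emp(X, Y):
    """Empirical distance covariance (squared): (1/n^2) sum_ij A_ij B_ij."""
    DX = np.abs(X[:, None] - X[None, :])     # |X_i - X_j| for 1-d X
    DY = np.abs(Y[:, None] - Y[None, :])
    n = len(X)
    return np.sum(double_centered(DX) * double_centered(DY)) / n**2

def hsic_distance_kernel(X, Y):
    """HSIC with the distance-induced kernel k(x, y) = (|x| + |y| - |x - y|) / 2 (x_0 = 0)."""
    def K(Z):
        return (np.abs(Z)[:, None] + np.abs(Z)[None, :] - np.abs(Z[:, None] - Z[None, :])) / 2
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K(X) @ H @ K(Y) @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=200)
Y = np.sin(X) + 0.2 * rng.normal(size=200)
print(dcov2_emp(X, Y), 4 * hsic_distance_kernel(X, Y))   # the two values should agree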
Experiments
ρ(x, y) = |x − y|^α; data with density p(x, y) ∝ 1 + sin(ℓx) sin(ℓy).
(A) small ℓ: clearly dependent, easier to detect; (B) large ℓ: close to independent, harder to detect.
[plots: comparison of test performance in the two settings]
How to choose a kernel