Total independence test
• Total independence test: H_0 : P_XYZ = P_X P_Y P_Z vs. H_1 : P_XYZ ≠ P_X P_Y P_Z
• For (X_1, ..., X_D) ∼ P_X, and κ = ⊗_{i=1}^D k^(i):
∆_tot(P̂) = ‖ μ_κ(P̂_X) − ⊗_{i=1}^D μ_{k^(i)}(P̂_{X_i}) ‖²_{H_κ}
= (1/n²) Σ_{a=1}^n Σ_{b=1}^n Π_{i=1}^D K^(i)_{ab} − (2/n^{D+1}) Σ_{a=1}^n Π_{i=1}^D Σ_{b=1}^n K^(i)_{ab} + (1/n^{2D}) Π_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K^(i)_{ab}
• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: similar relationship to that between dCov and HSIC (DS et al, 2013)
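Below is a minimal numpy sketch of the empirical statistic above, assuming a Gaussian kernel on each coordinate; the function names, kernel choice, and bandwidth are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def gaussian_kernel_1d(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel on a 1-D sample x of length n."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def total_independence_stat(X, sigma=1.0):
    """Empirical Delta_tot for X of shape (n, D), one kernel per coordinate.

    Delta_tot = (1/n^2)      sum_{a,b} prod_i K^(i)_ab
              - (2/n^{D+1})  sum_a prod_i sum_b K^(i)_ab
              + (1/n^{2D})   prod_i sum_{a,b} K^(i)_ab
    """
    n, D = X.shape
    Ks = [gaussian_kernel_1d(X[:, i], sigma) for i in range(D)]

    term1 = np.mean(np.prod(np.stack(Ks), axis=0))        # (1/n^2) sum_ab prod_i K^(i)_ab
    row_means = np.stack([K.mean(axis=1) for K in Ks])    # row_means[i, a] = (1/n) sum_b K^(i)_ab
    term2 = 2 * np.mean(np.prod(row_means, axis=0))       # (2/n^{D+1}) sum_a prod_i sum_b K^(i)_ab
    term3 = np.prod([K.mean() for K in Ks])               # (1/n^{2D}) prod_i sum_ab K^(i)_ab
    return term1 - term2 + term3
```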
Example B: total independence tests
Total independence test: Dataset B
[Figure 4: Total independence, ∆_tot(P̂) vs. ∆_L(P̂), n = 500 — null acceptance rate (Type II error) as a function of dimension]
Kernel dependence measures - in detail
MMD for independence: HSIC
"Their noses guide them through life, and they're never happier than when following an interesting scent. They need plenty of exercise, about an hour a day if possible. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose. They need a significant amount of exercise and mental stimulation."
"Known for their curiosity, intelligence, and excellent communication skills, the Javanese breed is perfect if you want a responsive, interactive pet, one that will blow in your ear and follow you everywhere."
Text from dogtime.com and petfinder.com
Empirical HSIC(P_XY, P_X P_Y): (1/n²) (HKH ∘ HLH)_{++}   (sum of all entries of the entrywise product of the centred kernel matrices)
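A small numpy sketch of this empirical (biased, V-statistic) HSIC, assuming Gaussian kernels with a median-heuristic bandwidth; all names and defaults here are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, sigma=None):
    """Gram matrix k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) for X of shape (n, d)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if sigma is None:                          # median heuristic: a common default, assumed here
        sigma = np.sqrt(0.5 * np.median(sq[sq > 0]))
    return np.exp(-sq / (2 * sigma ** 2))

def hsic_biased(X, Y):
    """Biased (V-statistic) HSIC estimate: (1/n^2) * sum of entries of (HKH) * (HLH)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    K, L = gaussian_gram(X), gaussian_gram(Y)
    return np.sum((H @ K @ H) * (H @ L @ H)) / n ** 2
```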
Covariance to reveal dependence
A more intuitive idea: maximize covariance of smooth mappings:
COCO(P; F, G) := sup_{‖f‖_F = 1, ‖g‖_G = 1} ( E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)] )
[Figure: samples (X, Y) with correlation −0.00; dependence witnesses f(x) and g(y); the mapped variables (f(X), g(Y)) have correlation −0.90, COCO: 0.14]
How do we define covariance in (infinite) feature spaces?
Covariance to reveal dependence
Covariance in RKHS: let's first look at the finite linear case. We have two random vectors x ∈ R^d, y ∈ R^{d'}. Are they linearly dependent?
Compute their covariance matrix (ignore centering):
C_xy = E[x y⊤]
How to get a single "summary" number? Solve for vectors f ∈ R^d, g ∈ R^{d'}:
argmax_{‖f‖=1, ‖g‖=1} f⊤ C_xy g = argmax_{‖f‖=1, ‖g‖=1} E_{x,y}[(f⊤x)(g⊤y)] = argmax_{‖f‖=1, ‖g‖=1} E_{x,y}[f(x) g(y)] = argmax_{‖f‖=1, ‖g‖=1} cov(f(x), g(y))
(maximum singular value of C_xy)
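A quick numpy illustration of this finite-dimensional case: form the empirical cross-covariance matrix (centred here, for concreteness) and take its largest singular value. The toy data-generating process is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.normal(size=(n, d))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=n),   # coordinate linearly dependent on x_1
                     rng.normal(size=n)])                   # independent coordinate

# Empirical cross-covariance matrix C_xy of shape (d, d')
Cxy = (X - X.mean(0)).T @ (Y - Y.mean(0)) / n

# Largest singular value = max_{||f||=||g||=1} cov(f^T x, g^T y)
coco_linear = np.linalg.svd(Cxy, compute_uv=False)[0]
print(coco_linear)
```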
Challenges in defining feature space covariance
Given features φ(x) ∈ F and ψ(y) ∈ G:
Challenge 1: Can we define a feature space analog to x y⊤? YES:
• Given f ∈ R^d, g ∈ R^{d'}, h ∈ R^{d'}, define the matrix f g⊤ such that (f g⊤) h = f (g⊤ h).
• Given f ∈ F, g ∈ G, h ∈ G, define the tensor product operator f ⊗ g such that (f ⊗ g) h = f ⟨g, h⟩_G.
• Now just set f := φ(x), g := ψ(y), to get x y⊤ → φ(x) ⊗ ψ(y)
Challenges in defining feature space covariance
Given features φ(x) ∈ F and ψ(y) ∈ G:
Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e. is there some C_XY : G → F such that
⟨f, C_XY g⟩_F = E_{x,y}[f(x) g(y)] = cov(f(x), g(y))
YES: via a Bochner integrability argument (as with the mean embedding). Under the condition E_{x,y}[ √( k(x,x) l(y,y) ) ] < ∞, we can define
C_XY := E_{x,y}[φ(x) ⊗ ψ(y)]
which is a Hilbert-Schmidt operator (sum of squared singular values is finite).
REMINDER: functions revealing dependence
COCO(P; F, G) := sup_{‖f‖_F = 1, ‖g‖_G = 1} ( E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)] )
[Figure: samples (X, Y) with correlation −0.00; dependence witnesses f(x), g(y); mapped variables (f(X), g(Y)) with correlation −0.90, COCO: 0.14]
How do we compute this from finite data?
Empirical covariance operator
The empirical covariance given z := (x_i, y_i)_{i=1}^n (now include centering):
Ĉ_XY := (1/n) Σ_{i=1}^n φ(x_i) ⊗ ψ(y_i) − μ̂_x ⊗ μ̂_y,   where μ̂_x := (1/n) Σ_{i=1}^n φ(x_i).
More concisely, Ĉ_XY = (1/n) X H Y⊤, where H = I_n − n^{-1} 1_n, 1_n is an n × n matrix of ones, and
X = [φ(x_1) ... φ(x_n)],   Y = [ψ(y_1) ... ψ(y_n)].
Define the kernel matrices K_ij = (X⊤X)_ij = k(x_i, x_j) and L_ij = l(y_i, y_j).
Functions revealing dependence
Optimization problem:
COCO(z; F, G) := max ⟨f, Ĉ_XY g⟩_F   subject to ‖f‖_F ≤ 1, ‖g‖_G ≤ 1
Assume
f = Σ_{i=1}^n α_i [φ(x_i) − μ̂_x] = XHα,   g = Σ_{i=1}^n β_i [ψ(y_i) − μ̂_y] = YHβ.
The associated Lagrangian is
L(f, g, λ, γ) = ⟨f, Ĉ_XY g⟩_F − (λ/2)(‖f‖²_F − 1) − (γ/2)(‖g‖²_G − 1).
Covariance to reveal dependence
• Empirical COCO(z; F, G): the largest eigenvalue γ of the generalized eigenvalue problem
(1/n) [[ 0, K̃L̃ ], [ L̃K̃, 0 ]] [α; β] = γ [[ K̃, 0 ], [ 0, L̃ ]] [α; β].
• K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:
K̃ = HKH,   where H = I − (1/n) 1 1⊤
• Mapping function for x:
f(x) = Σ_{i=1}^n α_i ( k(x_i, x) − (1/n) Σ_{j=1}^n k(x_j, x) )
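A sketch of the empirical COCO computation via this generalized eigenvalue problem, taking precomputed Gram matrices K and L as input; the small ridge added to the right-hand side is a numerical-stability assumption, not part of the slides.

```python
import numpy as np
from scipy.linalg import eigh

def coco_empirical(K, L, reg=1e-8):
    """Largest gamma solving (1/n)[[0, KcLc],[LcKc, 0]] v = gamma [[Kc,0],[0,Lc]] v."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H                        # centred Gram matrices
    A = np.block([[np.zeros((n, n)), Kc @ Lc],
                  [Lc @ Kc, np.zeros((n, n))]]) / n      # symmetric, since (KcLc)^T = LcKc
    B = np.block([[Kc, np.zeros((n, n))],
                  [np.zeros((n, n)), Lc]]) + reg * np.eye(2 * n)
    gammas = eigh(A, B, eigvals_only=True)               # generalized symmetric eigenproblem
    return gammas[-1]                                    # empirical COCO(z; F, G)
```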
Hard-to-detect dependence
Density takes the form: P_{x,y} ∝ 1 + sin(ωx) sin(ωy)
[Figure: a smooth density (low ω) and a rough density (high ω), each shown with 500 samples]
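A small rejection sampler for a density of this form, assuming the support is [−π, π]² (an assumption for the sketch; the unnormalised density is bounded by 2, which makes rejection sampling trivial).

```python
import numpy as np

def sample_sin_density(n, omega, seed=0):
    """Rejection-sample (x, y) with p(x, y) proportional to 1 + sin(omega*x)*sin(omega*y) on [-pi, pi]^2."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        u = rng.uniform(0, 2)                        # unnormalised density is bounded above by 2
        if u < 1 + np.sin(omega * x) * np.sin(omega * y):
            out.append((x, y))
    return np.array(out)

# At high omega the marginals stay (near) uniform while the dependence lives at high frequency,
# which is what makes it hard to detect with smooth witness functions.
```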
Hard-to-detect dependence
• Example: sinusoids of increasing frequency
[Figure: densities for ω = 1, ..., 6; COCO (empirical average, 1500 samples) decreases as the frequency of the non-constant density component increases from 1 to 7]
Hard-to-detect dependence COCO vs frequency of perturbation from independence.
Hard-to-detect dependence — Case of ω = 1. [Figure: dependence witnesses f(x), g(y) and sample scatter; correlation of (X, Y): 0.27; correlation of (f(X), g(Y)): −0.50; COCO: 0.09]
Hard-to-detect dependence — Case of ω = 2. [Figure: correlation of (X, Y): 0.04; correlation of (f(X), g(Y)): 0.51; COCO: 0.07]
Hard-to-detect dependence — Case of ω = 3. [Figure: correlation of (X, Y): 0.03; correlation of (f(X), g(Y)): −0.45; COCO: 0.03]
Hard-to-detect dependence — Case of ω = 4. [Figure: correlation of (X, Y): 0.03; correlation of (f(X), g(Y)): 0.21; COCO: 0.02]
Hard-to-detect dependence — Case of ω = ?? [Figure: correlation of (X, Y): 0.00; correlation of (f(X), g(Y)): −0.13; COCO: 0.02]
Hard-to-detect dependence — Case of uniform noise! (The previous figure was in fact independent uniform noise.) This bias will decrease with increasing sample size.
Hard-to-detect dependence COCO vs frequency of perturbation from independence. • As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear covariance. • Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by f, g (bias) • This bias will decrease with increasing sample size.
More functions revealing dependence
• Can we do better than COCO?
• A second example with zero correlation
[Figure: first pair of dependence witnesses f, g; correlation of (X, Y): 0; correlation of (f(X), g(Y)): −0.80; COCO: 0.11]
More functions revealing dependence
• The second pair of dependence witnesses f_2, g_2:
[Figure: correlation of (f_2(X), g_2(Y)): −0.37; COCO_2: 0.06]
Hilbert-Schmidt Independence Criterion
• Given γ_i := COCO_i(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
HSIC(z; F, G) := Σ_{i=1}^n γ_i²
• In the limit of infinite samples:
HSIC(P; F, G) := ‖C_xy‖²_HS = ⟨C_xy, C_xy⟩_HS
= E_{x,x',y,y'}[k(x, x') l(y, y')] + E_{x,x'}[k(x, x')] E_{y,y'}[l(y, y')] − 2 E_{x,y}[ E_{x'}[k(x, x')] E_{y'}[l(y, y')] ]
(x' an independent copy of x, y' a copy of y)
– HSIC is identical to MMD²(P_XY, P_X P_Y), computed with the product kernel k × l
When does HSIC determine independence? Theorem: When kernels k and l are each characteristic, then HSIC = 0 iff P x , y = P x P y [Gretton, 2015] . Weaker than MMD condition (which requires a kernel characteristic on X × Y to distinguish P x , y from Q x , y ).
Intuition: why characteristic needed on both X and Y
Question: Wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:
f* = arg min_{f ∈ F} [ E_XY ( Y − ⟨f, φ(X)⟩_F )² + λ ‖f‖²_F ]
Counterexample: a density symmetric about the x-axis, s.t. p(x, y) = p(x, −y)
[Figure: samples from such a density; correlation of (X, Y): −0.00]
Energy Distance and the MMD
Energy distance and MMD
Distance between probability distributions:
Energy distance: [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]
D_E(P, Q) = 2 E_{P,Q} ‖X − Y‖^q − E_P ‖X − X'‖^q − E_Q ‖Y − Y'‖^q,   0 < q ≤ 2
Maximum mean discrepancy: [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]
MMD²(P, Q; F) = E_P k(X, X') + E_Q k(Y, Y') − 2 E_{P,Q} k(X, Y)
Energy distance is MMD with a particular kernel! [Sejdinovic et al., 2013b]
Distance covariance and HSIC
Distance covariance (0 < q, r ≤ 2): [Feuerverger, 1993, Székely et al., 2007]
V²(X, Y) = E_XY E_X'Y'[ ‖X − X'‖^q ‖Y − Y'‖^r ] + E_X E_X' ‖X − X'‖^q · E_Y E_Y' ‖Y − Y'‖^r − 2 E_XY[ E_X' ‖X − X'‖^q · E_Y' ‖Y − Y'‖^r ]
Hilbert-Schmidt Independence Criterion: [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]
Define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then
HSIC(P_XY, P_X P_Y) = E_XY E_X'Y'[ k(X, X') l(Y, Y') ] + E_X E_X'[ k(X, X') ] E_Y E_Y'[ l(Y, Y') ] − 2 E_X'Y'[ E_X k(X, X') E_Y l(Y, Y') ]
Distance covariance is HSIC with particular kernels! [Sejdinovic et al., 2013b]
Semimetrics and Hilbert spaces
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : X × X → R be a semimetric (no triangle inequality) on X. Let z_0 ∈ X, and denote
k_ρ(z, z') = ρ(z, z_0) + ρ(z', z_0) − ρ(z, z').
Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call k_ρ a distance-induced kernel.
Negative type: The semimetric space (Z, ρ) is said to have negative type if for all n ≥ 2, z_1, ..., z_n ∈ Z, and α_1, ..., α_n ∈ R with Σ_{i=1}^n α_i = 0,
Σ_{i=1}^n Σ_{j=1}^n α_i α_j ρ(z_i, z_j) ≤ 0.
Semimetrics and Hilbert spaces
Special case: Z ⊆ R^d and ρ_q(z, z') = ‖z − z'‖^q. Then ρ_q is a valid semimetric of negative type for 0 < q ≤ 2.
Energy distance is MMD with a distance-induced kernel.
Distance covariance is HSIC with distance-induced kernels.
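A numerical check of these correspondences for the energy-distance case, as a sketch: Euclidean distance (q = 1), the distance-induced kernel exactly as written above (note it omits the conventional factor 1/2, so the two V-statistics agree exactly rather than up to a constant), and an arbitrary base point z_0 = 0.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distance matrix between rows of A and rows of B."""
    return np.sqrt(np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))

def dist_kernel(A, B, z0):
    """Distance-induced kernel k_rho(z, z') = rho(z, z0) + rho(z', z0) - rho(z, z')."""
    return pairwise_dist(A, z0[None, :]) + pairwise_dist(B, z0[None, :]).T - pairwise_dist(A, B)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Y = rng.normal(loc=0.5, size=(300, 2))
z0 = np.zeros(2)                                        # arbitrary base point

# Energy distance (V-statistic): 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||
energy = (2 * pairwise_dist(X, Y).mean()
          - pairwise_dist(X, X).mean()
          - pairwise_dist(Y, Y).mean())

# MMD^2 (V-statistic) with the distance-induced kernel
mmd2 = (dist_kernel(X, X, z0).mean()
        + dist_kernel(Y, Y, z0).mean()
        - 2 * dist_kernel(X, Y, z0).mean())

print(energy, mmd2)   # the two agree up to floating point
```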
Two-sample testing benchmark
Two-sample testing example in 1-D:
[Figure: a reference density P(X) compared against three candidate densities Q(X) on the same support]
Two-sample test, MMD with distance kernel
Obtain more powerful tests on this problem when q ≠ 1 (exponent of distance).
Key: • Gaussian kernel • q = 1 • Best: q = 1/3 • Worst: q = 2
Nonparametric Bayesian inference using distribution embeddings
Motivating Example: Bayesian inference without a model
• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^1200)
• 1800 training frames, remaining for test.
• Gaussian noise added to Y_t.
Challenges:
• No parametric model of camera dynamics (only samples)
• No parametric model of map from camera angle to image (only samples)
• Want to do filtering: Bayesian inference
ABC: an approach to Bayesian inference without a model
Bayes rule:
P(y | x) = P(x | y) π(y) / ∫ P(x | y) π(y) dy
• P(x | y) is the likelihood
• π(y) is the prior
One approach: Approximate Bayesian Computation (ABC)
ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure: ABC demonstration — prior samples y ∼ π(Y) on one axis, likelihood samples x ∼ P(X | y) on the other; samples whose x falls near the observation x* are accepted, yielding an approximate posterior sample P̂(Y | x*)]
Needed: distance measure D, tolerance parameter τ.
ABC: an approach to Bayesian inference without a model
Bayes rule:
P(y | x) = P(x | y) π(y) / ∫ P(x | y) π(y) dy
• P(x | y) is the likelihood
• π(y) is the prior
ABC generates a sample from P(Y | x*) as follows:
(i) generate a sample y_t from the prior π,
(ii) generate a sample x_t from P(X | y_t),
(iii) if D(x*, x_t) < τ, accept y = y_t; otherwise reject,
(iv) go to (i).
In step (iii), D is a distance measure, and τ is a tolerance parameter.
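A minimal sketch of this rejection loop on a toy problem; the prior, likelihood, distance D, and tolerance τ below are illustrative choices, not from the slides.

```python
import numpy as np

def abc_rejection(x_star, prior_sample, likelihood_sample, dist, tau, n_accept):
    """Rejection ABC: n_accept approximate posterior samples of y given observation x_star."""
    accepted = []
    while len(accepted) < n_accept:
        y = prior_sample()                 # (i)   draw y_t from the prior pi
        x = likelihood_sample(y)           # (ii)  draw x_t from P(X | y_t)
        if dist(x_star, x) < tau:          # (iii) accept if x_t is close to the observation
            accepted.append(y)
    return np.array(accepted)

# Toy example: y ~ N(0, 2^2), x | y ~ N(y, 1); the true posterior mean of y | x* is known.
rng = np.random.default_rng(0)
posterior = abc_rejection(
    x_star=1.5,
    prior_sample=lambda: rng.normal(0.0, 2.0),
    likelihood_sample=lambda y: rng.normal(y, 1.0),
    dist=lambda a, b: abs(a - b),
    tau=0.1,
    n_accept=500,
)
print(posterior.mean())   # roughly 1.5 * 4 / (4 + 1) = 1.2 for small tau
```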
Motivating example 2: simple Gaussian case
• p(x, y) is N((0, 1_d⊤)⊤, V) with V a randomly generated covariance
• Posterior mean on x: ABC vs kernel approach
[Figure: CPU time (sec) vs mean square error in 6 dimensions, for KBI, COND, and ABC, across training-set sizes from 200 to 6000]
Bayes again
Bayes rule:
P(y | x) = P(x | y) π(y) / ∫ P(x | y) π(y) dy
• P(x | y) is the likelihood
• π is the prior
How would this look with kernel embeddings?
Define RKHS G on Y with feature map ψ_y and kernel l(y, ·).
We need a conditional mean embedding: for all g ∈ G,
E_{Y | x*} g(Y) = ⟨g, μ_{P(y | x*)}⟩_G
This will be obtained by RKHS-valued ridge regression.
Ridge regression and the conditional feature mean
Ridge regression from X := R^d to a finite vector output Y := R^{d'} (these could be d' nonlinear features of y):
Define training data
X = [x_1 ... x_m] ∈ R^{d×m},   Y = [y_1 ... y_m] ∈ R^{d'×m}
Solve
Ă = arg min_{A ∈ R^{d'×d}} [ ‖Y − AX‖² + λ ‖A‖²_HS ],   where ‖A‖²_HS = tr(A⊤A) = Σ_{i=1}^{min{d,d'}} γ²_{A,i}
Solution: Ă = C_YX (C_XX + mλI)^{-1}
Ridge regression and the conditional feature mean
Prediction at a new point x:
y* = Ăx = C_YX (C_XX + mλI)^{-1} x = Σ_{i=1}^m β_i(x) y_i
where
β(x) = (K + λmI)^{-1} [k(x_1, x) ... k(x_m, x)]⊤,   K := X⊤X,   k(x_1, x) = x_1⊤ x
What if we do everything in kernel space?
Ridge regression and the conditional feature mean
Recall our setup:
• Given training pairs: (x_i, y_i) ∼ P_XY
• F on X with feature map ϕ_x and kernel k(x, ·)
• G on Y with feature map ψ_y and kernel l(y, ·)
We define the covariance between feature maps:
C_XX = E_X(ϕ_X ⊗ ϕ_X),   C_XY = E_XY(ϕ_X ⊗ ψ_Y)
and matrices of feature-mapped training data
X = [ϕ_{x_1} ... ϕ_{x_m}],   Y := [ψ_{y_1} ... ψ_{y_m}]
Ridge regression and the conditional feature mean
Objective: [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]
Ă = arg min_{A ∈ HS(F, G)} [ E_XY ‖ψ_Y − Aϕ_X‖²_G + λ ‖A‖²_HS ],   ‖A‖²_HS = Σ_{i=1}^∞ γ²_{A,i}
Solution same as the vector case: Ă = C_YX (C_XX + mλI)^{-1}
Prediction at a new x using kernels:
Ăϕ_x = [ψ_{y_1} ... ψ_{y_m}] (K + λmI)^{-1} [k(x_1, x) ... k(x_m, x)]⊤ = Σ_{i=1}^m β_i(x) ψ_{y_i}
where K_ij = k(x_i, x_j)
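A sketch of the prediction step above: the weights β(x) = (K + λmI)^{-1} k_x define the conditional mean embedding Σ_i β_i(x) ψ_{y_i}, so E[g(Y) | x] is estimated by Σ_i β_i(x) g(y_i). The Gaussian kernel, bandwidth, and regularisation below are illustrative assumptions.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def cme_weights(X_train, x_new, sigma=0.5, lam=1e-3):
    """beta(x) = (K + lambda*m*I)^{-1} [k(x_1, x), ..., k(x_m, x)]^T."""
    m = X_train.shape[0]
    K = gauss_gram(X_train, X_train, sigma)
    kx = gauss_gram(X_train, x_new[None, :], sigma)[:, 0]
    return np.linalg.solve(K + lam * m * np.eye(m), kx)

# Toy check: y = sin(x) + noise; for g(y) = y, the estimate of E[Y | x] should track sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

beta = cme_weights(X, np.array([1.0]))
print(beta @ Y, np.sin(1.0))   # estimated E[Y | x = 1] vs ground truth
```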
Ridge regression and the conditional feature mean
How is the loss E_XY ‖ψ_Y − Aϕ_X‖²_G relevant to the conditional expectation E_{Y|x} g(Y)?
Define: [Song et al. (2009), Grunewalder et al. (2013)]
μ_{Y|x} := Aϕ_x
We need A to have the property
E_{Y|x} g(Y) ≈ ⟨g, μ_{Y|x}⟩_G = ⟨g, Aϕ_x⟩_G = ⟨A*g, ϕ_x⟩_F = (A*g)(x)
Natural risk function for the conditional mean:
L(A, P_XY) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A*g)(X) ]²   (target minus estimator)
Ridge regression and the conditional feature mean
The squared-loss risk provides an upper bound on the natural risk:
L(A, P_XY) ≤ E_XY ‖ψ_Y − Aϕ_X‖²_G
Proof: Jensen and Cauchy-Schwarz:
L(A, P_XY) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A*g)(X) ]²
≤ E_XY sup_{‖g‖≤1} [ g(Y) − (A*g)(X) ]²
= E_XY sup_{‖g‖≤1} ⟨g, ψ_Y − Aϕ_X⟩²_G = E_XY ‖ψ_Y − Aϕ_X‖²_G