Total independence test
• Total independence test: H_0 : P_XYZ = P_X P_Y P_Z vs. H_1 : P_XYZ ≠ P_X P_Y P_Z
• For (X_1, ..., X_D) ∼ P_X, and κ = ⊗_{i=1}^D k^(i):
∆_tot(P̂) = ‖ μ_κ(P̂_X) − ⊗_{i=1}^D μ_{k^(i)}(P̂_{X_i}) ‖²_{H_κ}
= (1/n²) Σ_{a=1}^n Σ_{b=1}^n Π_{i=1}^D K^(i)_{ab} − (2/n^{D+1}) Σ_{a=1}^n Π_{i=1}^D Σ_{b=1}^n K^(i)_{ab} + (1/n^{2D}) Π_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K^(i)_{ab}
• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: similar relationship to that between dCov and HSIC (DS et al, 2013)
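Below is a minimal numpy sketch of the empirical statistic above, assuming a Gaussian kernel on each coordinate; the function names, kernel choice, and bandwidth are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def gaussian_kernel_1d(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel on a 1-D sample x of length n."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def total_independence_stat(X, sigma=1.0):
    """Empirical Delta_tot for X of shape (n, D), one kernel per coordinate.

    Delta_tot = (1/n^2)      sum_{a,b} prod_i K^(i)_ab
              - (2/n^{D+1})  sum_a prod_i sum_b K^(i)_ab
              + (1/n^{2D})   prod_i sum_{a,b} K^(i)_ab
    """
    n, D = X.shape
    Ks = [gaussian_kernel_1d(X[:, i], sigma) for i in range(D)]

    term1 = np.mean(np.prod(np.stack(Ks), axis=0))        # (1/n^2) sum_ab prod_i K^(i)_ab
    row_means = np.stack([K.mean(axis=1) for K in Ks])    # row_means[i, a] = (1/n) sum_b K^(i)_ab
    term2 = 2 * np.mean(np.prod(row_means, axis=0))       # (2/n^{D+1}) sum_a prod_i sum_b K^(i)_ab
    term3 = np.prod([K.mean() for K in Ks])               # (1/n^{2D}) prod_i sum_ab K^(i)_ab
    return term1 - term2 + term3
```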
Example B: total independence tests
Total independence test: Dataset B
[Figure 4: Total independence, ∆_tot(P̂) vs. ∆_L(P̂), n = 500 — null acceptance rate (Type II error) as a function of dimension]
Kernel dependence measures - in detail
MMD for independence: HSIC
"Their noses guide them through life, and they're never happier than when following an interesting scent. They need plenty of exercise, about an hour a day if possible. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose. They need a significant amount of exercise and mental stimulation."
"Known for their curiosity, intelligence, and excellent communication skills, the Javanese breed is perfect if you want a responsive, interactive pet, one that will blow in your ear and follow you everywhere."
Text from dogtime.com and petfinder.com
Empirical HSIC(P_XY, P_X P_Y): (1/n²) (HKH ∘ HLH)_{++}   (sum of all entries of the entrywise product of the centred kernel matrices)
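A small numpy sketch of this empirical (biased, V-statistic) HSIC, assuming Gaussian kernels with a median-heuristic bandwidth; all names and defaults here are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, sigma=None):
    """Gram matrix k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) for X of shape (n, d)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    if sigma is None:                          # median heuristic: a common default, assumed here
        sigma = np.sqrt(0.5 * np.median(sq[sq > 0]))
    return np.exp(-sq / (2 * sigma ** 2))

def hsic_biased(X, Y):
    """Biased (V-statistic) HSIC estimate: (1/n^2) * sum of entries of (HKH) * (HLH)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    K, L = gaussian_gram(X), gaussian_gram(Y)
    return np.sum((H @ K @ H) * (H @ L @ H)) / n ** 2
```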
Covariance to reveal dependence
A more intuitive idea: maximize covariance of smooth mappings:
COCO(P; F, G) := sup_{‖f‖_F = 1, ‖g‖_G = 1} ( E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)] )
[Figure: samples (X, Y) with correlation −0.00; dependence witnesses f(x) and g(y); the mapped variables (f(X), g(Y)) have correlation −0.90, COCO: 0.14]
How do we define covariance in (infinite) feature spaces?
Covariance to reveal dependence
Covariance in RKHS: let's first look at the finite linear case. We have two random vectors x ∈ R^d, y ∈ R^{d'}. Are they linearly dependent?
Compute their covariance matrix (ignore centering):
C_xy = E[x y⊤]
How to get a single "summary" number? Solve for vectors f ∈ R^d, g ∈ R^{d'}:
argmax_{‖f‖=1, ‖g‖=1} f⊤ C_xy g = argmax_{‖f‖=1, ‖g‖=1} E_{x,y}[(f⊤x)(g⊤y)] = argmax_{‖f‖=1, ‖g‖=1} E_{x,y}[f(x) g(y)] = argmax_{‖f‖=1, ‖g‖=1} cov(f(x), g(y))
(maximum singular value of C_xy)
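A quick numpy illustration of this finite-dimensional case: form the empirical cross-covariance matrix (centred here, for concreteness) and take its largest singular value. The toy data-generating process is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.normal(size=(n, d))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=n),   # coordinate linearly dependent on x_1
                     rng.normal(size=n)])                   # independent coordinate

# Empirical cross-covariance matrix C_xy of shape (d, d')
Cxy = (X - X.mean(0)).T @ (Y - Y.mean(0)) / n

# Largest singular value = max_{||f||=||g||=1} cov(f^T x, g^T y)
coco_linear = np.linalg.svd(Cxy, compute_uv=False)[0]
print(coco_linear)
```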
Challenges in defining feature space covariance
Given features φ(x) ∈ F and ψ(y) ∈ G:
Challenge 1: Can we define a feature space analog to x y⊤? YES:
• Given f ∈ R^d, g ∈ R^{d'}, h ∈ R^{d'}, define the matrix f g⊤ such that (f g⊤) h = f (g⊤ h).
• Given f ∈ F, g ∈ G, h ∈ G, define the tensor product operator f ⊗ g such that (f ⊗ g) h = f ⟨g, h⟩_G.
• Now just set f := φ(x), g := ψ(y), to get x y⊤ → φ(x) ⊗ ψ(y)
Challenges in defining feature space covariance
Given features φ(x) ∈ F and ψ(y) ∈ G:
Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e. is there some C_XY : G → F such that
⟨f, C_XY g⟩_F = E_{x,y}[f(x) g(y)] = cov(f(x), g(y))
YES: via a Bochner integrability argument (as with the mean embedding). Under the condition E_{x,y}[ √( k(x,x) l(y,y) ) ] < ∞, we can define
C_XY := E_{x,y}[φ(x) ⊗ ψ(y)]
which is a Hilbert-Schmidt operator (sum of squared singular values is finite).
REMINDER: functions revealing dependence
COCO(P; F, G) := sup_{‖f‖_F = 1, ‖g‖_G = 1} ( E_{x,y}[f(x) g(y)] − E_x[f(x)] E_y[g(y)] )
[Figure: samples (X, Y) with correlation −0.00; dependence witnesses f(x), g(y); mapped variables (f(X), g(Y)) with correlation −0.90, COCO: 0.14]
How do we compute this from finite data?
Empirical covariance operator
The empirical covariance given z := (x_i, y_i)_{i=1}^n (now include centering):
Ĉ_XY := (1/n) Σ_{i=1}^n φ(x_i) ⊗ ψ(y_i) − μ̂_x ⊗ μ̂_y,   where μ̂_x := (1/n) Σ_{i=1}^n φ(x_i).
More concisely, Ĉ_XY = (1/n) X H Y⊤, where H = I_n − n^{-1} 1_n, 1_n is an n × n matrix of ones, and
X = [φ(x_1) ... φ(x_n)],   Y = [ψ(y_1) ... ψ(y_n)].
Define the kernel matrices K_ij = (X⊤X)_ij = k(x_i, x_j) and L_ij = l(y_i, y_j).
Functions revealing dependence
Optimization problem:
COCO(z; F, G) := max ⟨f, Ĉ_XY g⟩_F   subject to ‖f‖_F ≤ 1, ‖g‖_G ≤ 1
Assume
f = Σ_{i=1}^n α_i [φ(x_i) − μ̂_x] = XHα,   g = Σ_{i=1}^n β_i [ψ(y_i) − μ̂_y] = YHβ.
The associated Lagrangian is
L(f, g, λ, γ) = ⟨f, Ĉ_XY g⟩_F − (λ/2)(‖f‖²_F − 1) − (γ/2)(‖g‖²_G − 1).
Covariance to reveal dependence
• Empirical COCO(z; F, G): the largest eigenvalue γ of the generalized eigenvalue problem
(1/n) [[ 0, K̃L̃ ], [ L̃K̃, 0 ]] [α; β] = γ [[ K̃, 0 ], [ 0, L̃ ]] [α; β].
• K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:
K̃ = HKH,   where H = I − (1/n) 1 1⊤
• Mapping function for x:
f(x) = Σ_{i=1}^n α_i ( k(x_i, x) − (1/n) Σ_{j=1}^n k(x_j, x) )
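A sketch of the empirical COCO computation via this generalized eigenvalue problem, taking precomputed Gram matrices K and L as input; the small ridge added to the right-hand side is a numerical-stability assumption, not part of the slides.

```python
import numpy as np
from scipy.linalg import eigh

def coco_empirical(K, L, reg=1e-8):
    """Largest gamma solving (1/n)[[0, KcLc],[LcKc, 0]] v = gamma [[Kc,0],[0,Lc]] v."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H                        # centred Gram matrices
    A = np.block([[np.zeros((n, n)), Kc @ Lc],
                  [Lc @ Kc, np.zeros((n, n))]]) / n      # symmetric, since (KcLc)^T = LcKc
    B = np.block([[Kc, np.zeros((n, n))],
                  [np.zeros((n, n)), Lc]]) + reg * np.eye(2 * n)
    gammas = eigh(A, B, eigvals_only=True)               # generalized symmetric eigenproblem
    return gammas[-1]                                    # empirical COCO(z; F, G)
```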
Hard-to-detect dependence
Density takes the form: P_{x,y} ∝ 1 + sin(ωx) sin(ωy)
[Figure: a smooth density (low ω) and a rough density (high ω), each shown with 500 samples]
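A small rejection sampler for a density of this form, assuming the support is [−π, π]² (an assumption for the sketch; the unnormalised density is bounded by 2, which makes rejection sampling trivial).

```python
import numpy as np

def sample_sin_density(n, omega, seed=0):
    """Rejection-sample (x, y) with p(x, y) proportional to 1 + sin(omega*x)*sin(omega*y) on [-pi, pi]^2."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        u = rng.uniform(0, 2)                        # unnormalised density is bounded above by 2
        if u < 1 + np.sin(omega * x) * np.sin(omega * y):
            out.append((x, y))
    return np.array(out)

# At high omega the marginals stay (near) uniform while the dependence lives at high frequency,
# which is what makes it hard to detect with smooth witness functions.
```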
Hard-to-detect dependence
• Example: sinusoids of increasing frequency
[Figure: densities for ω = 1, ..., 6; COCO (empirical average, 1500 samples) decreases as the frequency of the non-constant density component increases from 1 to 7]
Hard-to-detect dependence COCO vs frequency of perturbation from independence.
Hard-to-detect dependence — Case of ω = 1. [Figure: dependence witnesses f(x), g(y) and sample scatter; correlation of (X, Y): 0.27; correlation of (f(X), g(Y)): −0.50; COCO: 0.09]
Hard-to-detect dependence — Case of ω = 2. [Figure: correlation of (X, Y): 0.04; correlation of (f(X), g(Y)): 0.51; COCO: 0.07]
Hard-to-detect dependence — Case of ω = 3. [Figure: correlation of (X, Y): 0.03; correlation of (f(X), g(Y)): −0.45; COCO: 0.03]
Hard-to-detect dependence — Case of ω = 4. [Figure: correlation of (X, Y): 0.03; correlation of (f(X), g(Y)): 0.21; COCO: 0.02]
Hard-to-detect dependence — Case of ω = ?? [Figure: correlation of (X, Y): 0.00; correlation of (f(X), g(Y)): −0.13; COCO: 0.02]
Hard-to-detect dependence — Case of uniform noise! (The previous figure was in fact independent uniform noise.) This bias will decrease with increasing sample size.
Hard-to-detect dependence COCO vs frequency of perturbation from independence. • As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear covariance. • Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by f, g (bias) • This bias will decrease with increasing sample size.
More functions revealing dependence
• Can we do better than COCO?
• A second example with zero correlation
[Figure: first pair of dependence witnesses f, g; correlation of (X, Y): 0; correlation of (f(X), g(Y)): −0.80; COCO: 0.11]
More functions revealing dependence
• The second pair of dependence witnesses f_2, g_2:
[Figure: correlation of (f_2(X), g_2(Y)): −0.37; COCO_2: 0.06]
Hilbert-Schmidt Independence Criterion
• Given γ_i := COCO_i(z; F, G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
HSIC(z; F, G) := Σ_{i=1}^n γ_i²
• In the limit of infinite samples:
HSIC(P; F, G) := ‖C_xy‖²_HS = ⟨C_xy, C_xy⟩_HS
= E_{x,x',y,y'}[k(x, x') l(y, y')] + E_{x,x'}[k(x, x')] E_{y,y'}[l(y, y')] − 2 E_{x,y}[ E_{x'}[k(x, x')] E_{y'}[l(y, y')] ]
(x' an independent copy of x, y' a copy of y)
– HSIC is identical to MMD²(P_XY, P_X P_Y), computed with the product kernel k × l
When does HSIC determine independence? Theorem: When kernels k and l are each characteristic, then HSIC = 0 iff P x , y = P x P y [Gretton, 2015] . Weaker than MMD condition (which requires a kernel characteristic on X × Y to distinguish P x , y from Q x , y ).
Intuition: why characteristic needed on both X and Y
Question: Wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:
f* = arg min_{f ∈ F} [ E_XY ( Y − ⟨f, φ(X)⟩_F )² + λ ‖f‖²_F ]
Counterexample: a density symmetric about the x-axis, s.t. p(x, y) = p(x, −y)
[Figure: samples from such a density; correlation of (X, Y): −0.00]
Energy Distance and the MMD
Energy distance and MMD
Distance between probability distributions:
Energy distance: [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]
D_E(P, Q) = 2 E_{P,Q} ‖X − Y‖^q − E_P ‖X − X'‖^q − E_Q ‖Y − Y'‖^q,   0 < q ≤ 2
Maximum mean discrepancy: [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]
MMD²(P, Q; F) = E_P k(X, X') + E_Q k(Y, Y') − 2 E_{P,Q} k(X, Y)
Energy distance is MMD with a particular kernel! [Sejdinovic et al., 2013b]
Distance covariance and HSIC
Distance covariance (0 < q, r ≤ 2): [Feuerverger, 1993, Székely et al., 2007]
V²(X, Y) = E_XY E_X'Y'[ ‖X − X'‖^q ‖Y − Y'‖^r ] + E_X E_X' ‖X − X'‖^q · E_Y E_Y' ‖Y − Y'‖^r − 2 E_XY[ E_X' ‖X − X'‖^q · E_Y' ‖Y − Y'‖^r ]
Hilbert-Schmidt Independence Criterion: [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]
Define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then
HSIC(P_XY, P_X P_Y) = E_XY E_X'Y'[ k(X, X') l(Y, Y') ] + E_X E_X'[ k(X, X') ] E_Y E_Y'[ l(Y, Y') ] − 2 E_X'Y'[ E_X k(X, X') E_Y l(Y, Y') ]
Distance covariance is HSIC with particular kernels! [Sejdinovic et al., 2013b]
Semimetrics and Hilbert spaces
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : X × X → R be a semimetric (no triangle inequality) on X. Let z_0 ∈ X, and denote
k_ρ(z, z') = ρ(z, z_0) + ρ(z', z_0) − ρ(z, z').
Then k_ρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type. Call k_ρ a distance-induced kernel.
Negative type: The semimetric space (Z, ρ) is said to have negative type if for all n ≥ 2, z_1, ..., z_n ∈ Z, and α_1, ..., α_n ∈ R with Σ_{i=1}^n α_i = 0,
Σ_{i=1}^n Σ_{j=1}^n α_i α_j ρ(z_i, z_j) ≤ 0.
Semimetrics and Hilbert spaces
Special case: Z ⊆ R^d and ρ_q(z, z') = ‖z − z'‖^q. Then ρ_q is a valid semimetric of negative type for 0 < q ≤ 2.
Energy distance is MMD with a distance-induced kernel.
Distance covariance is HSIC with distance-induced kernels.
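A numerical check of these correspondences for the energy-distance case, as a sketch: Euclidean distance (q = 1), the distance-induced kernel exactly as written above (note it omits the conventional factor 1/2, so the two V-statistics agree exactly rather than up to a constant), and an arbitrary base point z_0 = 0.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distance matrix between rows of A and rows of B."""
    return np.sqrt(np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))

def dist_kernel(A, B, z0):
    """Distance-induced kernel k_rho(z, z') = rho(z, z0) + rho(z', z0) - rho(z, z')."""
    return pairwise_dist(A, z0[None, :]) + pairwise_dist(B, z0[None, :]).T - pairwise_dist(A, B)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Y = rng.normal(loc=0.5, size=(300, 2))
z0 = np.zeros(2)                                        # arbitrary base point

# Energy distance (V-statistic): 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||
energy = (2 * pairwise_dist(X, Y).mean()
          - pairwise_dist(X, X).mean()
          - pairwise_dist(Y, Y).mean())

# MMD^2 (V-statistic) with the distance-induced kernel
mmd2 = (dist_kernel(X, X, z0).mean()
        + dist_kernel(Y, Y, z0).mean()
        - 2 * dist_kernel(X, Y, z0).mean())

print(energy, mmd2)   # the two agree up to floating point
```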
Two-sample testing benchmark
Two-sample testing example in 1-D:
[Figure: a reference density P(X) compared against three candidate densities Q(X) on the same support]
Two-sample test, MMD with distance kernel
Obtain more powerful tests on this problem when q ≠ 1 (exponent of distance).
Key: • Gaussian kernel • q = 1 • Best: q = 1/3 • Worst: q = 2
Nonparametric Bayesian inference using distribution embeddings
Motivating Example: Bayesian inference without a model
• 3600 downsampled frames of 20 × 20 RGB pixels (Y_t ∈ [0, 1]^1200)
• 1800 training frames, remaining for test.
• Gaussian noise added to Y_t.
Challenges:
• No parametric model of camera dynamics (only samples)
• No parametric model of map from camera angle to image (only samples)
• Want to do filtering: Bayesian inference
ABC: an approach to Bayesian inference without a model
Bayes rule:
P(y | x) = P(x | y) π(y) / ∫ P(x | y) π(y) dy
• P(x | y) is the likelihood
• π(y) is the prior
One approach: Approximate Bayesian Computation (ABC)
ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure: ABC demonstration — prior samples y ∼ π(Y) on one axis, likelihood samples x ∼ P(X | y) on the other; samples whose x falls near the observation x* are accepted, yielding an approximate posterior sample P̂(Y | x*)]
Needed: distance measure D, tolerance parameter τ.
ABC: an approach to Bayesian inference without a model
Bayes rule:
P(y | x) = P(x | y) π(y) / ∫ P(x | y) π(y) dy
• P(x | y) is the likelihood
• π(y) is the prior
ABC generates a sample from P(Y | x*) as follows:
(i) generate a sample y_t from the prior π,
(ii) generate a sample x_t from P(X | y_t),
(iii) if D(x*, x_t) < τ, accept y = y_t; otherwise reject,
(iv) go to (i).
In step (iii), D is a distance measure, and τ is a tolerance parameter.
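A minimal sketch of this rejection loop on a toy problem; the prior, likelihood, distance D, and tolerance τ below are illustrative choices, not from the slides.

```python
import numpy as np

def abc_rejection(x_star, prior_sample, likelihood_sample, dist, tau, n_accept):
    """Rejection ABC: n_accept approximate posterior samples of y given observation x_star."""
    accepted = []
    while len(accepted) < n_accept:
        y = prior_sample()                 # (i)   draw y_t from the prior pi
        x = likelihood_sample(y)           # (ii)  draw x_t from P(X | y_t)
        if dist(x_star, x) < tau:          # (iii) accept if x_t is close to the observation
            accepted.append(y)
    return np.array(accepted)

# Toy example: y ~ N(0, 2^2), x | y ~ N(y, 1); the true posterior mean of y | x* is known.
rng = np.random.default_rng(0)
posterior = abc_rejection(
    x_star=1.5,
    prior_sample=lambda: rng.normal(0.0, 2.0),
    likelihood_sample=lambda y: rng.normal(y, 1.0),
    dist=lambda a, b: abs(a - b),
    tau=0.1,
    n_accept=500,
)
print(posterior.mean())   # roughly 1.5 * 4 / (4 + 1) = 1.2 for small tau
```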
Motivating example 2: simple Gaussian case
• p(x, y) is N((0, 1_d⊤)⊤, V) with V a randomly generated covariance
• Posterior mean on x: ABC vs kernel approach
[Figure: CPU time (sec) vs mean square error in 6 dimensions, for KBI, COND, and ABC, across training-set sizes from 200 to 6000]
Bayes again
Bayes rule:
P(y | x) = P(x | y) π(y) / ∫ P(x | y) π(y) dy
• P(x | y) is the likelihood
• π is the prior
How would this look with kernel embeddings?
Define RKHS G on Y with feature map ψ_y and kernel l(y, ·).
We need a conditional mean embedding: for all g ∈ G,
E_{Y | x*} g(Y) = ⟨g, μ_{P(y | x*)}⟩_G
This will be obtained by RKHS-valued ridge regression.
Ridge regression and the conditional feature mean
Ridge regression from X := R^d to a finite vector output Y := R^{d'} (these could be d' nonlinear features of y):
Define training data
X = [x_1 ... x_m] ∈ R^{d×m},   Y = [y_1 ... y_m] ∈ R^{d'×m}
Solve
Ă = arg min_{A ∈ R^{d'×d}} [ ‖Y − AX‖² + λ ‖A‖²_HS ],   where ‖A‖²_HS = tr(A⊤A) = Σ_{i=1}^{min{d,d'}} γ²_{A,i}
Solution: Ă = C_YX (C_XX + mλI)^{-1}
Ridge regression and the conditional feature mean
Prediction at a new point x:
y* = Ăx = C_YX (C_XX + mλI)^{-1} x = Σ_{i=1}^m β_i(x) y_i
where
β(x) = (K + λmI)^{-1} [k(x_1, x) ... k(x_m, x)]⊤,   K := X⊤X,   k(x_1, x) = x_1⊤ x
What if we do everything in kernel space?
Ridge regression and the conditional feature mean
Recall our setup:
• Given training pairs: (x_i, y_i) ∼ P_XY
• F on X with feature map ϕ_x and kernel k(x, ·)
• G on Y with feature map ψ_y and kernel l(y, ·)
We define the covariance between feature maps:
C_XX = E_X(ϕ_X ⊗ ϕ_X),   C_XY = E_XY(ϕ_X ⊗ ψ_Y)
and matrices of feature-mapped training data
X = [ϕ_{x_1} ... ϕ_{x_m}],   Y := [ψ_{y_1} ... ψ_{y_m}]
Ridge regression and the conditional feature mean
Objective: [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]
Ă = arg min_{A ∈ HS(F, G)} [ E_XY ‖ψ_Y − Aϕ_X‖²_G + λ ‖A‖²_HS ],   ‖A‖²_HS = Σ_{i=1}^∞ γ²_{A,i}
Solution same as the vector case: Ă = C_YX (C_XX + mλI)^{-1}
Prediction at a new x using kernels:
Ăϕ_x = [ψ_{y_1} ... ψ_{y_m}] (K + λmI)^{-1} [k(x_1, x) ... k(x_m, x)]⊤ = Σ_{i=1}^m β_i(x) ψ_{y_i}
where K_ij = k(x_i, x_j)
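A sketch of the prediction step above: the weights β(x) = (K + λmI)^{-1} k_x define the conditional mean embedding Σ_i β_i(x) ψ_{y_i}, so E[g(Y) | x] is estimated by Σ_i β_i(x) g(y_i). The Gaussian kernel, bandwidth, and regularisation below are illustrative assumptions.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def cme_weights(X_train, x_new, sigma=0.5, lam=1e-3):
    """beta(x) = (K + lambda*m*I)^{-1} [k(x_1, x), ..., k(x_m, x)]^T."""
    m = X_train.shape[0]
    K = gauss_gram(X_train, X_train, sigma)
    kx = gauss_gram(X_train, x_new[None, :], sigma)[:, 0]
    return np.linalg.solve(K + lam * m * np.eye(m), kx)

# Toy check: y = sin(x) + noise; for g(y) = y, the estimate of E[Y | x] should track sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

beta = cme_weights(X, np.array([1.0]))
print(beta @ Y, np.sin(1.0))   # estimated E[Y | x = 1] vs ground truth
```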
Ridge regression and the conditional feature mean
How is the loss E_XY ‖ψ_Y − Aϕ_X‖²_G relevant to the conditional expectation E_{Y|x} g(Y)?
Define: [Song et al. (2009), Grunewalder et al. (2013)]
μ_{Y|x} := Aϕ_x
We need A to have the property
E_{Y|x} g(Y) ≈ ⟨g, μ_{Y|x}⟩_G = ⟨g, Aϕ_x⟩_G = ⟨A*g, ϕ_x⟩_F = (A*g)(x)
Natural risk function for the conditional mean:
L(A, P_XY) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A*g)(X) ]²   (target minus estimator)
Ridge regression and the conditional feature mean
The squared-loss risk provides an upper bound on the natural risk:
L(A, P_XY) ≤ E_XY ‖ψ_Y − Aϕ_X‖²_G
Proof: Jensen and Cauchy-Schwarz:
L(A, P_XY) := sup_{‖g‖≤1} E_X [ (E_{Y|X} g(Y))(X) − (A*g)(X) ]²
≤ E_XY sup_{‖g‖≤1} [ g(Y) − (A*g)(X) ]²
= E_XY sup_{‖g‖≤1} ⟨g, ψ_Y − Aϕ_X⟩²_G = E_XY ‖ψ_Y − Aϕ_X‖²_G