
Lecture 2: Mappings of Probabilities to RKHS and Applications
MLSS Cádiz, 2016
Arthur Gretton, Gatsby Unit, CSML, UCL

Outline:
• Kernel metric on the space of probability measures
• Function revealing differences in distributions


  1–2. Function Showing Difference in Distributions
  • What if the function is not smooth?
    MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)]
  [Figure: a bounded continuous function f(x) on [0, 1]]

  3. Function Showing Difference in Distributions
  • Maximum mean discrepancy: smooth function for P vs Q
    MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)]
  • Gauss P vs Laplace Q
  [Figure: witness function f for the Gauss and Laplace densities, plotted alongside the two probability densities]

  4–6. Function Showing Difference in Distributions
  • Maximum mean discrepancy: smooth function for P vs Q
    MMD(P, Q; F) := sup_{f ∈ F} [E_P f(x) − E_Q f(y)]
  • Classical results: MMD(P, Q; F) = 0 iff P = Q, when
    – F = bounded continuous [Dudley, 2002]
    – F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
    – F = bounded Lipschitz (Earth mover's distance) [Dudley, 2002]
  • MMD(P, Q; F) = 0 iff P = Q when F = the unit ball in a characteristic RKHS F (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]
  How do smooth functions relate to feature maps?

  7–10. Function view vs feature mean view
  • The (kernel) MMD: [ISMB06, NIPS06a] using E_P f(x) =: ⟨µ_P, f⟩_F,
    MMD(P, Q; F) = sup_{f ∈ F} [E_P f(x) − E_Q f(y)]
                 = sup_{f ∈ F} ⟨f, µ_P − µ_Q⟩_F
                 = ‖µ_P − µ_Q‖_F
    since sup_{f ∈ F} ⟨f, θ⟩_F = ‖θ‖_F for F := {f ∈ F : ‖f‖ ≤ 1}.
  Function view and feature view are equivalent.
  [Figure: witness function f for the Gauss and Laplace densities]
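
  The following is a minimal numpy sketch (mine, not from the slides) of the feature-mean view: the biased MMD estimate is the RKHS distance between the empirical mean embeddings, and the empirical witness function is a difference of mean kernel evaluations. The Gaussian kernel, bandwidth, and Gauss/Laplace example are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd_biased(X, Y, sigma=1.0):
    # ||mu_P - mu_Q||_F via the plug-in (biased) estimate:
    # mean Kxx + mean Kyy - 2 * mean Kxy, then square root.
    Kxx = gauss_kernel(X, X, sigma)
    Kyy = gauss_kernel(Y, Y, sigma)
    Kxy = gauss_kernel(X, Y, sigma)
    return np.sqrt(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean())

def witness(t, X, Y, sigma=1.0):
    # Empirical witness f(t) proportional to <k(t, .), mu_P - mu_Q>:
    # mean kernel to the X sample minus mean kernel to the Y sample.
    return (gauss_kernel(np.atleast_2d(t), X, sigma).mean(1)
            - gauss_kernel(np.atleast_2d(t), Y, sigma).mean(1))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))                 # P: Gaussian
Y = rng.laplace(0.0, 1.0 / np.sqrt(2), size=(500, 1))   # Q: Laplace, same variance
print(mmd_biased(X, Y))                   # > 0: the distributions differ
print(witness(np.array([[0.0]]), X, Y))   # negative near 0: Laplace is peakier there
```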

  11–13. MMD for independence: HSIC
  • Dependence measure: the Hilbert–Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
    Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]
    HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²
  [Figure: the product kernel κ((x, y), (x′, y′)) = k(x, x′) × l(y, y′), illustrated on pairs of images]
  • HSIC using expectations of kernels: define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
    HSIC(P_XY, P_X P_Y) = E_{XY} E_{X′Y′} k(x, x′) l(y, y′) + E_X E_{X′} k(x, x′) E_Y E_{Y′} l(y, y′) − 2 E_{X′Y′}[E_X k(x, x′) E_Y l(y, y′)].

  14–16. HSIC: empirical estimate and intuition
  [Figure: paired samples of pet images and text descriptions; on the slides the text is rendered in a symbol font. Decoded, it reads:]
  “Their noses guide them through life, and they're never happier than when following an interesting scent. They need plenty of exercise, about an hour a day if possible. A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose. They need a significant amount of exercise and mental stimulation. Known for their curiosity, intelligence, and excellent communication skills, the Javanese breed is perfect if you want a responsive, interactive pet, one that will blow in your ear and follow you everywhere.”
  Text from dogtime.com and petfinder.com
  Empirical HSIC(P_XY, P_X P_Y): (1/n²) (HKH ∘ HLH)_{++}
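
  Below is a small numpy sketch (mine, not from the slides) of the biased empirical HSIC; (HKH ∘ HLH)_{++}, the sum over all entries of the elementwise product of the centred Gram matrices, equals trace(KHLH). The Gaussian kernels and the toy dependent pair are assumed for illustration.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(X, Y, sigma=1.0):
    # Biased (v-statistic) estimate: (1/n^2) trace(K H L H), with H the
    # centering matrix; equals (1/n^2) * sum of entries of (HKH) o (HLH).
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf(X, X, sigma), rbf(Y, Y, sigma)
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
Y = X + 0.5 * rng.normal(size=(300, 1))       # Y depends on X
print(hsic_biased(X, Y))                      # clearly positive
print(hsic_biased(X, rng.permutation(Y)))     # near zero: dependence broken
```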

  17. Characteristic kernels (via Fourier, on the torus T)

  18. Characteristic Kernels (via Fourier)
  Reminder: characteristic means the MMD is a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]
  In the next slides:
  1. Characteristic property on [−π, π] with periodic boundary
  2. Characteristic property on R^d

  19. Characteristic Kernels (via Fourier)
  Reminder: Fourier series
  • For a function on [−π, π] with periodic boundary,
    f(x) = Σ_{ℓ=−∞}^{∞} f̂_ℓ exp(iℓx) = Σ_{ℓ=−∞}^{∞} f̂_ℓ (cos(ℓx) + i sin(ℓx)).
  [Figure: a top-hat function f(x) and its Fourier series coefficients f̂_ℓ]

  20. Characteristic Kernels (via Fourier)
  Reminder: Fourier series of a kernel
    k(x, y) = k(x − y) = k(z),   k(z) = Σ_{ℓ=−∞}^{∞} k̂_ℓ exp(iℓz).
  E.g., k(x) = (1/2π) ϑ(x/2π, iσ²/2π),   k̂_ℓ = (1/2π) exp(−σ²ℓ²/2).
  ϑ is the Jacobi theta function, close to a Gaussian when σ² is sufficiently narrower than [−π, π].
  [Figure: the kernel k(x) and its Fourier series coefficients k̂_ℓ]

  21–22. Characteristic Kernels (via Fourier)
  Maximum mean embedding via Fourier series:
  • The Fourier series of P is its characteristic function φ̄_P
  • The Fourier series of the mean embedding is the product of Fourier series (convolution theorem):
    µ_P(x) = E_{t∼P} k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),   so   µ̂_{P,ℓ} = k̂_ℓ × φ̄_{P,ℓ}
  • The MMD can be written in terms of Fourier series:
    MMD(P, Q; F) = ‖ Σ_{ℓ=−∞}^{∞} [(φ̄_{P,ℓ} − φ̄_{Q,ℓ}) k̂_ℓ] exp(iℓx) ‖_F

  23. A simpler Fourier expression for MMD
  • From the previous slide,
    MMD(P, Q; F) = ‖ Σ_{ℓ=−∞}^{∞} [(φ̄_{P,ℓ} − φ̄_{Q,ℓ}) k̂_ℓ] exp(iℓx) ‖_F
  • The squared norm of a function f in F is
    ‖f‖²_F = ⟨f, f⟩_F = Σ_{ℓ=−∞}^{∞} |f̂_ℓ|² / k̂_ℓ.
  • This gives a simple, interpretable expression for the squared MMD:
    MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} [|φ_{P,ℓ} − φ_{Q,ℓ}| k̂_ℓ]² / k̂_ℓ = Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
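
  As a numerical illustration (my sketch, not from the slides), the squared MMD on the torus can be computed directly from Fourier coefficients: estimate φ_{P,ℓ} and φ_{Q,ℓ} by quadrature and weight the squared differences by the Gaussian-spectrum coefficients k̂_ℓ = (1/2π) exp(−σ²ℓ²/2). The densities, bandwidth, and truncation of the series are assumptions.

```python
import numpy as np

# Grid on [-pi, pi) and two densities: Q perturbs P at (roughly) frequency 3.
x = np.linspace(-np.pi, np.pi, 2048, endpoint=False)
dx = x[1] - x[0]
p = np.exp(np.cos(x)); p /= p.sum() * dx
q = np.exp(np.cos(x)) * (1 + 0.3 * np.cos(3 * x)); q /= q.sum() * dx

ls = np.arange(-20, 21)          # truncate the Fourier series
def phi(dens):
    # Fourier coefficients of the density (its characteristic function on Z).
    return np.array([(dens * np.exp(1j * l * x)).sum() * dx for l in ls])

sigma = 0.5
k_hat = np.exp(-sigma**2 * ls**2 / 2) / (2 * np.pi)   # Gaussian-spectrum kernel
mmd2 = (np.abs(phi(p) - phi(q))**2 * k_hat).sum()
print(mmd2)   # > 0: the Gaussian-spectrum kernel sees the difference
```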

  24. Example
  • Example: P differs from Q at one frequency
  [Figure: the densities P(x) and Q(x)]

  25–26. Characteristic Kernels (2)
  • Example: P differs from Q at (roughly) one frequency
  [Figure: P(x) and Q(x) with their Fourier coefficients φ_{P,ℓ} and φ_{Q,ℓ}; the difference φ_{P,ℓ} − φ_{Q,ℓ} is concentrated at the perturbed frequency]

  27–28. Example
  Is the Gaussian-spectrum kernel characteristic? YES
  [Figure: the kernel k(x) and its Fourier series coefficients, positive at every frequency]
    MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ

  29–30. Example
  Is the triangle kernel characteristic? NO
  (A difference between P and Q confined to a frequency where k̂_ℓ = 0 contributes nothing to the sum, so the MMD can be zero with P ≠ Q.)
  [Figure: the triangle kernel and its Fourier series coefficients, which vanish at some frequencies]
    MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
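
  To make the failure concrete, here is a small numpy sketch (my construction, with assumed parameters: a triangle kernel of half-width π/2 on the torus). Its Fourier coefficients vanish at frequencies ℓ = ±4, ±8, ..., so a perturbation at frequency 4 is invisible to the MMD.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
dx = x[1] - x[0]
a = np.pi / 2
tri = np.maximum(0.0, 1 - np.abs(x) / a)      # triangle kernel profile k(z)

ls = np.arange(-16, 17)
# Fourier coefficients of the kernel; zero (up to rounding) at l = +-4, +-8, ...
k_hat = np.array([(tri * np.exp(-1j * l * x)).sum() * dx
                  for l in ls]).real / (2 * np.pi)

p = np.ones_like(x) / (2 * np.pi)             # P: uniform on the torus
q = (1 + 0.3 * np.cos(4 * x)) / (2 * np.pi)   # Q: perturbed exactly at frequency 4

def phi(dens):
    return np.array([(dens * np.exp(1j * l * x)).sum() * dx for l in ls])

mmd2 = (np.abs(phi(p) - phi(q))**2 * k_hat).sum()
print(mmd2)   # ~0 although P != Q: the triangle kernel is not characteristic
```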

  31. Characteristic kernels (via Fourier, on R^d)

  32–35. Characteristic Kernels (via Fourier)
  • Can we prove the characteristic property on R^d?
  • Characteristic function of P via the Fourier transform:
    φ_P(ω) = ∫_{R^d} e^{i x^⊤ ω} dP(x)
  • Translation-invariant kernels: k(x, y) = k(x − y) = k(z)
  • Bochner's theorem:
    k(z) = ∫_{R^d} e^{−i z^⊤ ω} dΛ(ω)
    – Λ a finite non-negative Borel measure

  36. Characteristic Kernels (via Fourier)
  Fourier representation of MMD:
    MMD²(P, Q; F) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω),   with φ_P the characteristic function of P.
  Proof: using Bochner's theorem in (a) and Fubini's theorem in (b),
    MMD²(P, Q) := E_P k(x − x′) + E_Q k(y − y′) − 2 E_{P,Q} k(x, y)
                = ∫∫ k(s − t) d(P − Q)(s) d(P − Q)(t)
                =_(a) ∫∫ [ ∫_{R^d} e^{−i(s−t)^⊤ω} dΛ(ω) ] d(P − Q)(s) d(P − Q)(t)
                =_(b) ∫ [ ∫_{R^d} e^{−is^⊤ω} d(P − Q)(s) ] [ ∫_{R^d} e^{it^⊤ω} d(P − Q)(t) ] dΛ(ω)
                = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω)
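
  A quick numerical check of this identity in one dimension (my sketch, not from the slides): for P = N(0, 1), Q = N(1, 1) and the Gaussian kernel k(z) = exp(−z²/2σ²), whose Bochner measure has density (σ/√(2π)) exp(−σ²ω²/2), the kernel-expectation form and the frequency-domain integral agree. The distributions and bandwidth are assumed for illustration.

```python
import numpy as np

sigma = 1.0   # kernel bandwidth

def Ek(m, v):
    # E exp(-u^2 / (2 sigma^2)) for u ~ N(m, v), in closed form.
    return sigma / np.sqrt(sigma**2 + v) * np.exp(-m**2 / (2 * (sigma**2 + v)))

mu_p, mu_q, s2 = 0.0, 1.0, 1.0
# Kernel side: x - x' ~ N(0, 2 s2), x - y ~ N(mu_p - mu_q, 2 s2).
mmd2_kernel = 2 * Ek(0.0, 2 * s2) - 2 * Ek(mu_p - mu_q, 2 * s2)

# Fourier side: phi_{N(mu, s2)}(w) = exp(i mu w - s2 w^2 / 2).
w = np.linspace(-20, 20, 200001)
dw = w[1] - w[0]
phi_p = np.exp(1j * mu_p * w - s2 * w**2 / 2)
phi_q = np.exp(1j * mu_q * w - s2 * w**2 / 2)
lam = sigma / np.sqrt(2 * np.pi) * np.exp(-sigma**2 * w**2 / 2)
mmd2_fourier = (np.abs(phi_p - phi_q)**2 * lam).sum() * dw

print(mmd2_kernel, mmd2_fourier)   # both ~0.177: the two forms agree
```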

  37–39. Example
  • Example: P differs from Q at (roughly) one frequency
  [Figure: the densities P(X) and Q(X); their characteristic function magnitudes |φ_P| and |φ_Q|; and the difference |φ_P − φ_Q|, concentrated near the perturbed frequency]

  40–41. Example
  • Example: P differs from Q at (roughly) one frequency
  • Exponentiated quadratic kernel: Characteristic
  [Figure: the difference |φ_P − φ_Q| against frequency ω, weighted by a spectrum that is positive at every frequency]

  42–43. Example
  • Example: P differs from Q at (roughly) one frequency
  • Sinc kernel: NOT characteristic
  [Figure: the difference |φ_P − φ_Q| against frequency ω; the sinc kernel's band-limited spectrum misses it]

  44–46. Example
  • Example: P differs from Q at (roughly) one frequency
  • Triangle (B-spline) kernel: ??? ... Characteristic
    (Its spectrum has isolated zeros, but supp(Λ) is still all of R, so no difference can hide: see the theorem on the next slide.)
  [Figure: the difference |φ_P − φ_Q| against frequency ω]

  47. Summary: Characteristic Kernels
  Characteristic kernel: MMD = 0 iff P = Q [NIPS07b, COLT08]
  Main theorem: a translation-invariant k is characteristic for probability measures on R^d if and only if supp(Λ) = R^d (i.e., the spectrum is zero on at most a countable set) [COLT08, JMLR10]
  Corollary: a continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval).
  1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10, p. 1535]

  48. k characteristic iff supp(Λ) = R^d
  Proof (⇐): supp{Λ} = R^d ⟹ k characteristic.
  Recall the Fourier definition of MMD:
    MMD²(P, Q) = ∫_{R^d} |φ_P(ω) − φ_Q(ω)|² dΛ(ω).
  The characteristic functions φ_P(ω) and φ_Q(ω) are uniformly continuous, hence their difference cannot be non-zero on only a countable set.
  (A map φ_P is uniformly continuous if ∀ε > 0 ∃δ > 0 such that for all (ω₁, ω₂) ∈ Ω with d(ω₁, ω₂) < δ, we have d(φ_P(ω₁), φ_P(ω₂)) < ε. Uniform: δ depends only on ε, not on ω₁, ω₂.)

  49. k characteristic iff supp(Λ) = R^d
  Proof (⇒): k characteristic ⟹ supp{Λ} = R^d. By contrapositive:
  given supp{Λ} ⊊ R^d, there exists an open interval U on which Λ vanishes. Construct densities p(x), q(x) such that φ_P, φ_Q differ only inside U; then MMD(P, Q) = 0 although P ≠ Q, so k is not characteristic.

  50. Further extensions
  • Similar reasoning applies wherever extensions of Bochner's theorem exist [Fukumizu et al., 2009]:
    – locally compact Abelian groups (periodic domains, as we saw)
    – compact, non-Abelian groups (orthogonal matrices)
    – the semigroup R₊ⁿ (histograms)
  • Related kernel statistics: the Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels); other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q); energy distances

  51. Statistical hypothesis testing

  52–54. Motivating question: differences in brain signals
  The problem: do local field potential (LFP) signals change when measured near a spike burst?
  [Figure: LFP amplitude over time, recorded near a spike burst and without a spike burst]

  55–56. Statistical test using MMD (1)
  • Two hypotheses:
    – H₀: null hypothesis (P = Q)
    – H₁: alternative hypothesis (P ≠ Q)
  • Observe samples x := {x₁, ..., xₙ} from P and y from Q
  • If the empirical MMD(x, y; F) is
    – “far from zero”: reject H₀
    – “close to zero”: accept H₀

  57–59. Statistical test using MMD (2)
  • Where is the threshold between “far from zero” and “close to zero”?
  • One answer: the asymptotic distribution of MMD̂²
  • An unbiased empirical estimate (quadratic cost; see the code sketch below):
    MMD̂² = (1/(n(n−1))) Σ_{i≠j} [ k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j) ],
    the bracketed term being h((x_i, y_i), (x_j, y_j)).
  • When P ≠ Q, asymptotically normal:
    √n (MMD̂² − MMD²) ∼ N(0, σ_u²) [Hoeffding, 1948, Serfling, 1980]
  • Expression for the variance, with z_i := (x_i, y_i):
    σ_u² = 4 [ E_z (E_{z′} h(z, z′))² − (E_{z,z′} h(z, z′))² ]
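
  A minimal numpy sketch (mine, not from the slides) of the quadratic-time unbiased estimator: build the matrix of h values and average the off-diagonal entries. The Gaussian kernel and the Laplace example are assumed choices.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # U-statistic: average over i != j of
    # h = k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j).
    n = len(X)
    Hm = rbf(X, X, sigma) - rbf(X, Y, sigma) - rbf(Y, X, sigma) + rbf(Y, Y, sigma)
    np.fill_diagonal(Hm, 0.0)        # drop the i == j terms
    return Hm.sum() / (n * (n - 1))

rng = np.random.default_rng(2)
X = rng.laplace(0.0, 1.0, size=(500, 1))   # P: Laplace
Y = rng.laplace(0.0, 2.0, size=(500, 1))   # Q: Laplace, different variance
print(mmd2_unbiased(X, Y))                                   # > 0 in expectation
print(mmd2_unbiased(X, rng.laplace(0.0, 1.0, size=(500, 1))))  # ~0: same distribution
```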

  60. Statistical test using MMD (3)
  • Example: two Laplace distributions with different variances
  [Figure: the two Laplace densities P_X and Q_X; and the empirical MMD distribution under H₁, with a Gaussian fit]

  61–62. Statistical test using MMD (4)
  • When P = Q, the U-statistic is degenerate: E_{z′}[h(z, z′)] = 0 [Anderson et al., 1994]
  • The distribution is
    n MMD̂²(x, y; F) ∼ Σ_{l=1}^{∞} λ_l (z_l² − 2)
  where
    – z_l ∼ N(0, 2) i.i.d.
    – ∫_X k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′), with k̃ the centred kernel
  [Figure: the empirical null density of n × MMD² with the χ²-type sum overlaid]

  63. Statistical test using MMD (5)
  • Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05
    MMD̂² = K_{P,P} + K_{Q,Q} − 2 K_{P,Q}
  [Figure: MMD densities under H₀ and H₁, with the 1 − α null quantile and the Type II error region marked]

  64–66. Statistical test using MMD (5)
  • Given P = Q, want a threshold T such that P(MMD > T) ≤ 0.05
  • Permutation for the empirical CDF [Arcones and Giné, 1992, Alba Fernández et al., 2008]
  • Pearson curves by matching the first four moments [Johnson et al., 1994]
  • Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
  • Consistent test using the kernel eigenspectrum [NIPS09b]
  [Figure: CDF of the MMD and the Pearson fit]

  67. Approximate null distribution of MMD̂ via permutation
  Empirical MMD, with sign vector w = (1, ..., 1, −1, ..., −1)^⊤ (n entries +1, then n entries −1):
    MMD̂² ≈ (1/n²) Σ [ ww^⊤ ⊙ ( [K_{P,P}, K_{P,Q}; K_{Q,P}, K_{Q,Q}] ) ]
  (⊙ is the elementwise product; Σ sums all entries of the 2n × 2n matrix.)

  68–69. Approximate null distribution of MMD̂ via permutation
  Permuted case [Alba Fernández et al., 2008]: w = (1, −1, 1, ..., 1, −1, ..., 1, −1, −1)^⊤, with an equal number of +1 and −1 assigned at random:
    (1/n²) Σ [ ww^⊤ ⊙ ( [K_{P,P}, K_{P,Q}; K_{Q,P}, K_{Q,Q}] ) ] ≟ null distribution
  [Figure: the sign pattern ww^⊤ overlaid on the pooled Gram matrix. Figure thanks to Kacper Chwialkowski.]

  70. Approximate null distribution of MMD̂² via permutation
    MMD̂²_p ≈ (1/n²) Σ [ ww^⊤ ⊙ ( [K_{P,P}, K_{P,Q}; K_{Q,P}, K_{Q,Q}] ) ]
  [Figure: the MMD density under H₀ and the null PDF from permutation coincide]
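
  Here is a short numpy sketch (my own, assuming a precomputed pooled Gram matrix) of the permutation null: with the pooled 2n × 2n Gram matrix K, the statistic is (1/n²) w^⊤ K w, which equals the elementwise sum of ww^⊤ ⊙ K, and permuting the signs in w simulates draws from the null.

```python
import numpy as np

def mmd2_permutation_test(K, n, n_perm=500, alpha=0.05, rng=None):
    # K: Gram matrix of the pooled sample [x_1..x_n, y_1..y_n], shape (2n, 2n).
    rng = rng or np.random.default_rng()
    w = np.concatenate([np.ones(n), -np.ones(n)])
    stat = w @ K @ w / n**2                 # (1/n^2) * sum(ww^T element-wise K)
    null = np.empty(n_perm)
    for b in range(n_perm):
        wp = rng.permutation(w)             # random reassignment of +1 / -1
        null[b] = wp @ K @ wp / n**2
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold   # reject H0 if True
```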

  71. Detecting differences in brain signals
  Do local field potential (LFP) signals change when measured near a spike burst?
  [Figure: LFP amplitude over time, recorded near a spike burst and without a spike burst]

  72. Neuro data: consistent test without permutation
  • Maximum mean discrepancy (MMD): distance between P and Q
    MMD(P, Q; F) := ‖µ_P − µ_Q‖_F
  • Is MMD̂ significantly > 0?
  • Under P = Q, the null distribution of MMD̂ is
    n MMD̂ →_D Σ_{l=1}^{∞} λ_l (z_l² − 2),
    where λ_l is the l-th eigenvalue of the centred kernel k̃(x_i, x_j).
  • Use the Gram matrix spectrum for λ̂_l: a consistent test without permutation.
  [Figure: Type II error vs sample size m on the neuro data (P ≠ Q), spectral vs permutation test]
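
  A rough numpy sketch of the spectral approach (my own; the eigenvalue scaling follows the standard spectral approximation of the null, and the tolerance cutoff is an assumption): estimate λ_l from the centred pooled Gram matrix, then sample the limiting law directly.

```python
import numpy as np

def spectral_null_samples(K_pooled, n_draws=5000, rng=None):
    # Estimate eigenvalues lambda_l of the centred kernel from the pooled Gram
    # matrix, then sample n*MMD^2 ~ sum_l lambda_l (z_l^2 - 2), z_l ~ N(0, 2).
    rng = rng or np.random.default_rng()
    m = len(K_pooled)
    H = np.eye(m) - np.ones((m, m)) / m
    lam = np.linalg.eigvalsh(H @ K_pooled @ H) / m    # empirical spectrum
    lam = lam[lam > 1e-12]                            # keep numerically nonzero
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_draws, len(lam)))
    return (z**2 - 2) @ lam   # draws from the approximate null of n * MMD^2

# Threshold: np.quantile(spectral_null_samples(K), 0.95), no permutations needed.
```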

  73. Hypothesis testing with HSIC

  74. Distribution of HSIC at independence
  • The (biased) empirical HSIC is a v-statistic:
    HSIC_b = (1/n²) trace(KHLH)
  – Statistical testing: how do we find when this is large enough that the null hypothesis P = P_x P_y is unlikely?
  – Formally: given P = P_x P_y, what is the threshold T such that P(HSIC > T) < α for small α?
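
  One standard answer, sketched in numpy below (my own; kernel choices and permutation count are assumptions): permute the Y sample to simulate the null P = P_x P_y and take an empirical quantile of the permuted HSIC values as the threshold T.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic_permutation_test(X, Y, sigma=1.0, n_perm=500, alpha=0.05, rng=None):
    # Biased HSIC statistic with a permutation threshold: shuffling Y breaks
    # any dependence, so the permuted statistics are draws from the null.
    rng = rng or np.random.default_rng()
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf(X, X, sigma), rbf(Y, Y, sigma)
    stat = np.trace(K @ H @ L @ H) / n**2
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(n)
        null[b] = np.trace(K @ H @ L[np.ix_(idx, idx)] @ H) / n**2
    T = np.quantile(null, 1 - alpha)
    return stat, T, stat > T        # reject independence if True
```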
