 
Function Showing Difference in Distributions
• What if the function is not smooth?
$$\mathrm{MMD}(P, Q; F) := \sup_{f \in F} \left[ E_P f(x) - E_Q f(y) \right]$$
[Figure: a bounded continuous function; x-axis x, y-axis f(x)]

Function Showing Difference in Distributions
• Maximum mean discrepancy: smooth function for P vs Q
$$\mathrm{MMD}(P, Q; F) := \sup_{f \in F} \left[ E_P f(x) - E_Q f(y) \right]$$
• Gauss P vs Laplace Q
[Figure: witness f for Gauss and Laplace densities; x-axis X, y-axis probability density and f]

Function Showing Difference in Distributions
• Maximum mean discrepancy: smooth function for P vs Q
$$\mathrm{MMD}(P, Q; F) := \sup_{f \in F} \left[ E_P f(x) - E_Q f(y) \right]$$
• Classical results: $\mathrm{MMD}(P, Q; F) = 0$ iff $P = Q$, when
– F = bounded continuous [Dudley, 2002]
– F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
– F = bounded Lipschitz (Earth mover's distances) [Dudley, 2002]
• $\mathrm{MMD}(P, Q; F) = 0$ iff $P = Q$ when F = the unit ball in a characteristic RKHS $\mathcal{F}$ (coming soon!) [ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]
How do smooth functions relate to feature maps?

Function view vs feature mean view
• The (kernel) MMD: [ISMB06, NIPS06a]
$$\mathrm{MMD}(P, Q; F) = \sup_{f \in F} \left[ E_P f(x) - E_Q f(y) \right]$$
[Figure: witness f for Gauss and Laplace densities; x-axis X, y-axis probability density and f]

Function view vs feature mean view
• The (kernel) MMD: [ISMB06, NIPS06a]
$$\begin{aligned}
\mathrm{MMD}(P, Q; F) &= \sup_{f \in F} \left[ E_P f(x) - E_Q f(y) \right] \\
&= \sup_{f \in F} \langle f, \mu_P - \mu_Q \rangle_{\mathcal{F}} && \text{use } E_P(f(x)) =: \langle \mu_P, f \rangle_{\mathcal{F}} \\
&= \| \mu_P - \mu_Q \|_{\mathcal{F}} && \text{use } \|\theta\|_{\mathcal{F}} = \sup_{f \in F} \langle f, \theta \rangle_{\mathcal{F}}
\end{aligned}$$
since $F := \{ f \in \mathcal{F} : \|f\|_{\mathcal{F}} \le 1 \}$.
Function view and feature view equivalent

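The equivalence of the two views can be checked numerically on samples. The sketch below is illustrative only: it assumes a Gaussian kernel with bandwidth `sigma`, plug-in (biased) sample means, and hypothetical helper names (`gauss_kernel`, `witness`, `mmd_feature_view`, `mmd_function_view`) that do not come from the slides. It evaluates the feature view as the norm of the difference of empirical mean embeddings, and the function view as the mean difference of the unit-norm witness $f^* = (\mu_P - \mu_Q)/\|\mu_P - \mu_Q\|_{\mathcal{F}}$.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) for 1-D sample arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def mmd_feature_view(x, y, sigma=1.0):
    """Plug-in estimate of ||mu_P - mu_Q||_F computed from kernel means."""
    sq = (gauss_kernel(x, x, sigma).mean()
          + gauss_kernel(y, y, sigma).mean()
          - 2 * gauss_kernel(x, y, sigma).mean())
    return np.sqrt(max(sq, 0.0))

def witness(t, x, y, sigma=1.0):
    """Unnormalised empirical witness (mu_P - mu_Q)(t) at the points t."""
    return (gauss_kernel(t, x, sigma).mean(axis=1)
            - gauss_kernel(t, y, sigma).mean(axis=1))

def mmd_function_view(x, y, sigma=1.0):
    """Mean of f* over x minus mean of f* over y, for the unit-norm witness f*."""
    norm = mmd_feature_view(x, y, sigma)
    gap = witness(x, x, y, sigma).mean() - witness(y, x, y, sigma).mean()
    return gap / norm

# Assumed toy data: Gauss P vs Laplace Q with matched variance.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = rng.laplace(scale=1 / np.sqrt(2), size=300)

feature_value = mmd_feature_view(x, y)
function_value = mmd_function_view(x, y)
print(feature_value, function_value)
```

Both numbers agree up to floating-point error, because the witness evaluated on the samples reproduces exactly the three kernel means that appear in the squared norm.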
MMD for independence: HSIC
• Dependence measure: the Hilbert Schmidt Independence Criterion [ALT05, NIPS07a, ALT07, ALT08, JMLR10]
Related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) := \| \mu_{P_{XY}} - \mu_{P_X P_Y} \|^2$$
Product kernel on pairs (illustrated on the slide with image/text pairs):
$$\kappa\big((x, y), (x', y')\big) = k(x, x') \times l(y, y')$$
HSIC using expectations of kernels:
Define RKHS $\mathcal{F}$ on $\mathcal{X}$ with kernel k, RKHS $\mathcal{G}$ on $\mathcal{Y}$ with kernel l. Then
$$\begin{aligned}
\mathrm{HSIC}(P_{XY}, P_X P_Y) ={}& E_{XY} E_{X'Y'} \, k(x, x') \, l(y, y') + E_X E_{X'} k(x, x') \, E_Y E_{Y'} l(y, y') \\
& - 2 E_{X'Y'} \left[ E_X k(x, x') \, E_Y l(y, y') \right].
\end{aligned}$$

HSIC: empirical estimate and intuition
[Figure: images (X) paired with the text descriptions (Y) below]
"Their noses guide them through life, and they're never happier than when following an interesting scent. They need plenty of exercise, about an hour a day if possible."
"A large animal who slings slobber, exudes a distinctive houndy odor, and wants nothing more than to follow his nose. They need a significant amount of exercise and mental stimulation."
"Known for their curiosity, intelligence, and excellent communication skills, the Javanese breed is perfect if you want a responsive, interactive pet, one that will blow in your ear and follow you everywhere."
Text from dogtime.com and petfinder.com
Empirical $\mathrm{HSIC}(P_{XY}, P_X P_Y)$:
$$\frac{1}{n^2} \left( HKH \circ HLH \right)_{++}$$

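A minimal sketch of the empirical HSIC follows. It assumes Gaussian kernels on both variables, takes $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$ as the centring matrix, and reads $(\cdot)_{++}$ as the sum over all matrix entries; the function names and the toy data are hypothetical. It also checks that the entrywise form agrees with the trace form $\frac{1}{n^2} \mathrm{trace}(KHLH)$.

```python
import numpy as np

def gauss_gram(a, sigma=1.0):
    """Gaussian Gram matrix for samples a of shape (n, d)."""
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma**2))

def hsic_biased(K, L):
    """Return the biased empirical HSIC in entrywise form and in trace form."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    entrywise = np.sum((H @ K @ H) * (H @ L @ H)) / n**2
    trace_form = np.trace(K @ H @ L @ H) / n**2
    return entrywise, trace_form

# Assumed toy data: y_dep depends on x nonlinearly, y_ind is independent of x.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y_dep = x**2 + 0.1 * rng.normal(size=(n, 1))
y_ind = rng.normal(size=(n, 1))

K = gauss_gram(x)
hsic_dep, hsic_dep_trace = hsic_biased(K, gauss_gram(y_dep))
hsic_ind, _ = hsic_biased(K, gauss_gram(y_ind))
print(hsic_dep, hsic_ind)
```

The two forms coincide because $H$ is symmetric and idempotent, so $\sum_{ij} (HKH)_{ij} (HLH)_{ij} = \mathrm{trace}(HKHHLH) = \mathrm{trace}(KHLH)$.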
Characteristic kernels (via Fourier, on the torus $\mathbb{T}$)
Characteristic Kernels (via Fourier)
Reminder: Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]
In the next slides:
1. Characteristic property on $[-\pi, \pi]$ with periodic boundary
2. Characteristic property on $\mathbb{R}^d$

Characteristic Kernels (via Fourier)
Reminder: Fourier series
• Function on $[-\pi, \pi]$ with periodic boundary.
$$f(x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \exp(\imath \ell x) = \sum_{\ell=-\infty}^{\infty} \hat{f}_\ell \left( \cos(\ell x) + \imath \sin(\ell x) \right).$$
[Figure: top hat function f(x) and its Fourier series coefficients $\hat{f}_\ell$]

Characteristic Kernels (via Fourier)
Reminder: Fourier series of kernel
$$k(x, y) = k(x - y) = k(z), \qquad k(z) = \sum_{\ell=-\infty}^{\infty} \hat{k}_\ell \exp(\imath \ell z),$$
E.g.,
$$k(x) = \frac{1}{2\pi} \vartheta\!\left( \frac{x}{2\pi}, \frac{\imath \sigma^2}{2\pi} \right), \qquad \hat{k}_\ell = \frac{1}{2\pi} \exp\!\left( \frac{-\sigma^2 \ell^2}{2} \right).$$
$\vartheta$ is the Jacobi theta function, close to Gaussian when $\sigma^2$ is sufficiently narrower than $[-\pi, \pi]$.
[Figure: kernel k(x) and its Fourier series coefficients $\hat{k}_\ell$]

Characteristic Kernels (via Fourier)
Maximum mean embedding via Fourier series:
• Fourier series for P is the characteristic function $\bar{\phi}_P$
• Fourier series for the mean embedding is a product of Fourier series (convolution theorem):
$$\mu_P(x) = E_{t \sim P} \, k(x - t) = \int_{-\pi}^{\pi} k(x - t) \, dP(t), \qquad \hat{\mu}_{P,\ell} = \hat{k}_\ell \times \bar{\phi}_{P,\ell}$$
• MMD can be written in terms of Fourier series:
$$\mathrm{MMD}(P, Q; F) := \left\| \sum_{\ell=-\infty}^{\infty} \left[ \left( \bar{\phi}_{P,\ell} - \bar{\phi}_{Q,\ell} \right) \hat{k}_\ell \right] \exp(\imath \ell x) \right\|_{\mathcal{F}}$$

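A simpler Fourier expression for MMD
• From previous slide,
$$\mathrm{MMD}(P, Q; F) := \left\| \sum_{\ell=-\infty}^{\infty} \left[ \left( \bar{\phi}_{P,\ell} - \bar{\phi}_{Q,\ell} \right) \hat{k}_\ell \right] \exp(\imath \ell x) \right\|_{\mathcal{F}}$$
• The squared norm of a function f in $\mathcal{F}$ is:
$$\|f\|_{\mathcal{F}}^2 = \langle f, f \rangle_{\mathcal{F}} = \sum_{\ell=-\infty}^{\infty} \frac{|\hat{f}_\ell|^2}{\hat{k}_\ell}.$$
• Simple, interpretable expression for squared MMD:
$$\mathrm{MMD}^2(P, Q; F) = \sum_{\ell=-\infty}^{\infty} \frac{\left[ |\phi_{P,\ell} - \phi_{Q,\ell}| \, \hat{k}_\ell \right]^2}{\hat{k}_\ell} = \sum_{\ell=-\infty}^{\infty} |\phi_{P,\ell} - \phi_{Q,\ell}|^2 \, \hat{k}_\ell$$
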
Example
• Example: P differs from Q at one frequency
[Figure: densities P(x) and Q(x) on the periodic domain; x-axis x]

Characteristic Kernels (2)
• Example: P differs from Q at (roughly) one frequency
[Figure: densities P(x) and Q(x), their Fourier coefficients $\phi_{P,\ell}$ and $\phi_{Q,\ell}$ (via the Fourier transform F), and the characteristic function difference $\phi_{P,\ell} - \phi_{Q,\ell}$ against $\ell$]

Example
Is the Gaussian-spectrum kernel characteristic? YES: every coefficient $\hat{k}_\ell$ is strictly positive, so any nonzero difference $\phi_{P,\ell} - \phi_{Q,\ell}$ contributes to the sum.
[Figure: kernel k(x) and its Fourier series coefficients $\hat{k}_\ell$]
$$\mathrm{MMD}^2(P, Q; F) := \sum_{\ell=-\infty}^{\infty} |\phi_{P,\ell} - \phi_{Q,\ell}|^2 \, \hat{k}_\ell$$

Example
Is the triangle kernel characteristic? NO: where coefficients $\hat{k}_\ell$ are exactly zero, a difference between P and Q at those frequencies does not contribute to the sum.
[Figure: triangle kernel f(x) and its Fourier series coefficients $\hat{f}_\ell$]
$$\mathrm{MMD}^2(P, Q; F) := \sum_{\ell=-\infty}^{\infty} |\phi_{P,\ell} - \phi_{Q,\ell}|^2 \, \hat{k}_\ell$$

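The contrast between the two kernels can be reproduced with the Fourier expression for squared MMD. The sketch below rests on assumptions that are not in the slides: a triangle kernel $\max(0, 1 - |x|/w)$ with half-width $w = \pi/2$ (whose coefficients vanish at $\ell = \pm 4, \pm 8, \dots$), a Gaussian spectrum with $\sigma = 0.5$, a truncation to $|\ell| \le 20$, and a toy pair where P is uniform and Q has density $(1 + \cos 4x)/2\pi$. Characteristic functions use the convention $\phi_{P,\ell} = E_P e^{\imath \ell x}$.

```python
import numpy as np

ells = np.arange(-20, 21)  # truncated range of frequencies (assumption)

def gauss_coeffs(ells, sigma=0.5):
    """Gaussian spectrum: k_hat_l = exp(-sigma^2 l^2 / 2) / (2 pi)."""
    return np.exp(-0.5 * sigma**2 * ells**2) / (2 * np.pi)

def triangle_coeffs(ells, w=np.pi / 2):
    """Coefficients of max(0, 1 - |x|/w): (w / 2 pi) * sinc^2(l w / 2 pi)."""
    return w / (2 * np.pi) * np.sinc(ells * w / (2 * np.pi)) ** 2

def char_fn(density, ells, n_grid=4096):
    """phi_l = integral of density(x) exp(i l x) over [-pi, pi), uniform grid."""
    x = -np.pi + 2 * np.pi * np.arange(n_grid) / n_grid
    dx = 2 * np.pi / n_grid
    return (np.exp(1j * np.outer(ells, x)) @ density(x)) * dx

def mmd2(p, q, ells, coeffs):
    """Squared MMD as the sum over l of |phi_P,l - phi_Q,l|^2 * k_hat_l."""
    diff = char_fn(p, ells) - char_fn(q, ells)
    return float(np.sum(np.abs(diff) ** 2 * coeffs))

def p(x):
    return np.full_like(x, 1 / (2 * np.pi))    # uniform density

def q(x):
    return (1 + np.cos(4 * x)) / (2 * np.pi)   # differs from p only at l = +-4

mmd2_gauss = mmd2(p, q, ells, gauss_coeffs(ells))
mmd2_triangle = mmd2(p, q, ells, triangle_coeffs(ells))
print(mmd2_gauss, mmd2_triangle)
```

With these choices the triangle kernel gives a squared MMD of zero up to rounding even though P and Q differ, while the Gaussian spectrum gives a strictly positive value.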
Characteristic kernels (via Fourier, on $\mathbb{R}^d$)
Characteristic Kernels (via Fourier)
• Can we prove characteristic on $\mathbb{R}^d$?
• Characteristic function of P via Fourier transform
$$\phi_P(\omega) = \int_{\mathbb{R}^d} e^{i x^\top \omega} \, dP(x)$$
• Translation invariant kernels: $k(x, y) = k(x - y) = k(z)$
• Bochner's theorem:
$$k(z) = \int_{\mathbb{R}^d} e^{-i z^\top \omega} \, d\Lambda(\omega)$$
– $\Lambda$ finite non-negative Borel measure

Characteristic Kernels (via Fourier)
Fourier representation of MMD:
$$\mathrm{MMD}^2(P, Q; F) = \int_{\mathbb{R}^d} |\phi_P(\omega) - \phi_Q(\omega)|^2 \, d\Lambda(\omega)$$
$\phi_P$: characteristic function of P
Proof: using Bochner's theorem (a) and Fubini's theorem (b),
$$\begin{aligned}
\mathrm{MMD}^2(P, Q) &:= E_P k(x - x') + E_Q k(y - y') - 2 E_{P,Q} k(x - y) \\
&= \iint k(s - t) \, d(P - Q)(s) \, d(P - Q)(t) \\
&\overset{(a)}{=} \iint \int_{\mathbb{R}^d} e^{-i (s - t)^\top \omega} \, d\Lambda(\omega) \, d(P - Q)(s) \, d(P - Q)(t) \\
&\overset{(b)}{=} \int_{\mathbb{R}^d} \left[ \int_{\mathbb{R}^d} e^{-i s^\top \omega} \, d(P - Q)(s) \right] \left[ \int_{\mathbb{R}^d} e^{i t^\top \omega} \, d(P - Q)(t) \right] d\Lambda(\omega) \\
&= \int_{\mathbb{R}^d} |\phi_P(\omega) - \phi_Q(\omega)|^2 \, d\Lambda(\omega)
\end{aligned}$$

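The Fourier representation can be illustrated by Monte Carlo over frequencies. This is a sketch under stated assumptions rather than anything taken from the slides: a 1-D Gaussian kernel $k(z) = \exp(-z^2 / 2\sigma^2)$, for which $\Lambda$ is the $N(0, 1/\sigma^2)$ distribution (total mass $k(0) = 1$), empirical characteristic functions in place of $\phi_P$ and $\phi_Q$, and hypothetical function names.

```python
import numpy as np

def mmd2_biased(x, y, sigma=1.0):
    """Plug-in squared MMD from Gaussian kernel means (includes diagonal terms)."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd2_spectral(x, y, omegas):
    """Average of |phi_P(w) - phi_Q(w)|^2 over frequencies w drawn from Lambda."""
    phi_p = np.exp(1j * np.outer(omegas, x)).mean(axis=1)  # empirical phi_P
    phi_q = np.exp(1j * np.outer(omegas, y)).mean(axis=1)  # empirical phi_Q
    return float(np.mean(np.abs(phi_p - phi_q) ** 2))

# Assumed toy data: two unit-variance Gaussians with different means.
rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(1.0, 1.0, size=200)
omegas = rng.normal(0.0, 1.0 / sigma, size=5000)  # draws from Lambda

kernel_value = mmd2_biased(x, y, sigma)
spectral_value = mmd2_spectral(x, y, omegas)
print(kernel_value, spectral_value)
```

The spectral average is an unbiased Monte Carlo estimate of the kernel expression on the same samples, so the two values should agree up to sampling error in the frequencies.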
Example
• Example: P differs from Q at (roughly) one frequency
[Figure: densities P(X) and Q(X), the moduli of their characteristic functions $|\phi_P|$ and $|\phi_Q|$ (via the Fourier transform F), and the characteristic function difference $|\phi_P - \phi_Q|$ against frequency $\omega$]

Example
• Example: P differs from Q at (roughly) one frequency
Exponentiated quadratic kernel: characteristic
[Figure: exponentiated quadratic kernel spectrum and difference $|\phi_P - \phi_Q|$ against frequency $\omega$]

Example
• Example: P differs from Q at (roughly) one frequency
Sinc kernel: NOT characteristic
[Figure: sinc kernel spectrum and difference $|\phi_P - \phi_Q|$ against frequency $\omega$]

Example
• Example: P differs from Q at (roughly) one frequency
Triangle (B-spline) kernel: ??? Answer: characteristic
[Figure: triangle (B-spline) kernel spectrum and difference $|\phi_P - \phi_Q|$ against frequency $\omega$]

Summary: Characteristic Kernels
Characteristic kernel: (MMD = 0 iff P = Q) [NIPS07b, COLT08]
Main theorem: A translation invariant k is characteristic for prob. measures on $\mathbb{R}^d$ if and only if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$ (i.e. support zero on at most a countable set) [COLT08, JMLR10]
Corollary: continuous, compactly supported k is characteristic (since the Fourier spectrum $\Lambda(\omega)$ cannot be zero on an interval).
1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on $\mathbb{R}^d$ via distribution theory in [Sriperumbudur et al., 2010, Corollary 10 p. 1535]

k characteristic iff $\mathrm{supp}(\Lambda) = \mathbb{R}^d$
Proof: $\mathrm{supp}\{\Lambda\} = \mathbb{R}^d \implies$ k characteristic:
Recall Fourier definition of MMD:
$$\mathrm{MMD}^2(P, Q) = \int_{\mathbb{R}^d} |\phi_P(\omega) - \phi_Q(\omega)|^2 \, d\Lambda(\omega).$$
Characteristic functions $\phi_P(\omega)$ and $\phi_Q(\omega)$ are uniformly continuous, hence their difference cannot be non-zero only on a countable set.
Map $\phi_P$ uniformly continuous: $\forall \epsilon > 0$, $\exists \delta > 0$ such that $\forall (\omega_1, \omega_2) \in \Omega$ for which $d(\omega_1, \omega_2) < \delta$, we have $d(\phi_P(\omega_1), \phi_P(\omega_2)) < \epsilon$. Uniform: $\delta$ depends only on $\epsilon$, not on $\omega_1, \omega_2$.

k characteristic iff $\mathrm{supp}(\Lambda) = \mathbb{R}^d$
Proof: k characteristic $\implies \mathrm{supp}\{\Lambda\} = \mathbb{R}^d$:
Proof by contrapositive.
Given $\mathrm{supp}\{\Lambda\} \subsetneq \mathbb{R}^d$, hence $\exists$ open interval U such that $\Lambda(\omega)$ is zero on U.
Construct densities p(x), q(x) such that $\phi_P$, $\phi_Q$ differ only inside U

Further extensions
• Similar reasoning wherever extensions of Bochner's theorem exist: [Fukumizu et al., 2009]
– Locally compact Abelian groups (periodic domains, as we saw)
– Compact, non-Abelian groups (orthogonal matrices)
– The semigroup $\mathbb{R}_+^n$ (histograms)
• Related kernel statistics: Fisher statistic [Harchaoui et al., 2008] (zero iff P = Q for characteristic kernels), other distances [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), energy distances

Statistical hypothesis testing
Motivating question: differences in brain signals
The problem: Do local field potential (LFP) signals change when measured near a spike burst?
[Figure: LFP near spike burst and LFP without spike burst; x-axis time, y-axis LFP amplitude]

Statistical test using MMD (1)
• Two hypotheses:
– $H_0$: null hypothesis ($P = Q$)
– $H_1$: alternative hypothesis ($P \neq Q$)
• Observe samples $x := \{x_1, \dots, x_n\}$ from P and y from Q
• If empirical $\mathrm{MMD}(x, y; F)$ is
– "far from zero": reject $H_0$
– "close to zero": accept $H_0$

Statistical test using MMD (2)
• "far from zero" vs "close to zero": threshold?
• One answer: asymptotic distribution of $\widehat{\mathrm{MMD}}^2$
• An unbiased empirical estimate (quadratic cost):
$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} \underbrace{k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j)}_{h((x_i, y_i), (x_j, y_j))}$$
• When $P \neq Q$, asymptotically normal [Hoeffding, 1948, Serfling, 1980]
$$\sqrt{n} \left( \widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2 \right) \sim \mathcal{N}(0, \sigma_u^2)$$
• Expression for the variance, with $z_i := (x_i, y_i)$:
$$\sigma_u^2 = 4 \left( E_z \left[ \left( E_{z'} h(z, z') \right)^2 \right] - \left[ E_{z, z'} \, h(z, z') \right]^2 \right)$$

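The unbiased estimate can be written directly from the definition of $h$. The sketch below assumes a Gaussian kernel, 1-D samples of equal size n (so that the pairs $z_i = (x_i, y_i)$ are defined), and hypothetical function names; the Laplace toy data are also an assumption.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) for 1-D sample arrays a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    """U-statistic estimate: average of h(z_i, z_j) over all pairs with i != j."""
    if len(x) != len(y):
        raise ValueError("this form of the estimate needs equal sample sizes")
    n = len(x)
    kxy = gauss_kernel(x, y, sigma)
    # h[i, j] = k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j)
    h = gauss_kernel(x, x, sigma) + gauss_kernel(y, y, sigma) - kxy - kxy.T
    np.fill_diagonal(h, 0.0)  # drop the i == j terms
    return h.sum() / (n * (n - 1))

# Assumed toy data: Laplace distributions with different scales.
rng = np.random.default_rng(0)
x = rng.laplace(scale=1.0, size=500)
y = rng.laplace(scale=3.0, size=500)
x_again = rng.laplace(scale=1.0, size=500)

est_diff = mmd2_unbiased(x, y)        # P != Q
est_same = mmd2_unbiased(x, x_again)  # P == Q: near zero, possibly negative
print(est_diff, est_same)
```

Because the diagonal terms are removed, the estimate can be slightly negative when P = Q, which is expected for an unbiased estimate of a quantity that is zero.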
Statistical test using MMD (3)
• Example: Laplace distributions with different variance
[Figure: two Laplace distributions with different variances ($P_X$, $Q_X$); MMD distribution and Gaussian fit under $H_1$ (empirical PDF vs Gaussian fit); x-axis MMD, y-axis probability density]

Statistical test using MMD (4)
• When $P = Q$, U-statistic degenerate: $E_{z'}[h(z, z')] = 0$ [Anderson et al., 1994]
• Distribution is
$$n \, \widehat{\mathrm{MMD}}^2(x, y; F) \sim \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right]$$
• where
– $z_l \sim \mathcal{N}(0, 2)$ i.i.d.
– $\int_{\mathcal{X}} \tilde{k}(x, x') \, \psi_i(x) \, dP(x) = \lambda_i \psi_i(x')$, with $\tilde{k}$ the centred kernel
[Figure: MMD density under $H_0$: $\chi^2$ sum vs empirical PDF; x-axis $n \times \mathrm{MMD}^2$, y-axis probability density]

Statistical test using MMD (5)
• Given $P = Q$, want threshold T such that $P(\mathrm{MMD} > T) \le 0.05$
$$\widehat{\mathrm{MMD}}^2 = K_{P,P} + K_{Q,Q} - 2 K_{P,Q}$$
[Figure: MMD density under $H_0$ and $H_1$: null, alternative, $1 - \alpha$ null quantile, Type II error; x-axis $n \times \mathrm{MMD}^2$, y-axis probability density]
• Permutation for empirical CDF [Arcones and Giné, 1992, Alba Fernández et al., 2008]
• Pearson curves by matching first four moments [Johnson et al., 1994]
• Large deviation bounds [Hoeffding, 1963, McDiarmid, 1989]
• Consistent test using kernel eigenspectrum [NIPS09b]
[Figure: CDF of the MMD and Pearson fit; x-axis mmd, y-axis P(MMD < mmd)]

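Approximate null distribution of $\widehat{\mathrm{MMD}}^2$ via permutation
Empirical MMD:
$$w = (\underbrace{1, 1, 1, \dots, 1}_{n}, \underbrace{-1, \dots, -1, -1, -1}_{n})^\top$$
$$\widehat{\mathrm{MMD}}^2 \approx \frac{1}{n^2} \sum_{i,j} \left( \begin{bmatrix} K_{P,P} & K_{P,Q} \\ K_{Q,P} & K_{Q,Q} \end{bmatrix} \odot w w^\top \right)_{ij}$$
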
Approximate null distribution of $\widehat{\mathrm{MMD}}^2$ via permutation
Permuted case: [Alba Fernández et al., 2008]
$$w = (1, -1, 1, \dots, 1, -1, \dots, 1, -1, -1)^\top \quad \text{(equal number of } +1 \text{ and } -1\text{)}$$
$$\frac{1}{n^2} \sum_{i,j} \left( \begin{bmatrix} K_{P,P} & K_{P,Q} \\ K_{Q,P} & K_{Q,Q} \end{bmatrix} \odot w w^\top \right)_{ij} = \; ?$$
[Figure: Gram matrix, entrywise product with the permuted sign pattern $w w^\top$, and the result. Figure thanks to Kacper Chwialkowski.]

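Approximate null distribution of $\widehat{\mathrm{MMD}}^2$ via permutation
$$\widehat{\mathrm{MMD}}^2_p \approx \frac{1}{n^2} \sum_{i,j} \left( \begin{bmatrix} K_{P,P} & K_{P,Q} \\ K_{Q,P} & K_{Q,Q} \end{bmatrix} \odot w w^\top \right)_{ij}$$
[Figure: MMD density under $H_0$: null PDF vs null PDF from permutation; x-axis $n \times \mathrm{MMD}^2$, y-axis probability density]
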
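The sign-vector form makes a permutation test short to write, since $\sum_{i,j} (K \odot w w^\top)_{ij} = w^\top K w$. The following is a sketch under assumptions not fixed by the slides: a Gaussian kernel, equal sample sizes, the plug-in (biased) statistic, 500 random permutations of $w$, and level $\alpha = 0.05$; the function names are hypothetical.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) for 1-D sample arrays a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))

def mmd2_from_signs(K, w):
    """(1 / n^2) * sum_ij (K * w w^T)_ij, i.e. w^T K w / n^2, for 2n pooled points."""
    n = len(w) // 2
    return float(w @ K @ w) / n**2

def permutation_test(x, y, n_perm=500, alpha=0.05, sigma=1.0, seed=0):
    """Return (statistic, threshold, reject) using permuted sign vectors."""
    rng = np.random.default_rng(seed)
    n = len(x)
    z = np.concatenate([x, y])                     # pooled sample
    K = gauss_kernel(z, z, sigma)                  # blocks K_PP, K_PQ, K_QP, K_QQ
    w = np.concatenate([np.ones(n), -np.ones(n)])  # +1 for x, -1 for y
    stat = mmd2_from_signs(K, w)
    null = np.array([mmd2_from_signs(K, rng.permutation(w))
                     for _ in range(n_perm)])
    threshold = float(np.quantile(null, 1 - alpha))
    return stat, threshold, stat > threshold

# Assumed toy data: unit-variance Gaussians with means 0 and 1.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=100)
y = rng.normal(1.0, 1.0, size=100)
stat, threshold, reject = permutation_test(x, y)
print(stat, threshold, reject)
```

Permuting $w$ keeps the Gram matrix fixed and only reassigns points to the two groups, so the kernel is evaluated once and each permutation costs one quadratic form.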
Detecting differences in brain signals
Do local field potential (LFP) signals change when measured near a spike burst?
[Figure: LFP near spike burst and LFP without spike burst; x-axis time, y-axis LFP amplitude]

Neuro data: consistent test w/o permutation
• Maximum mean discrepancy (MMD): distance between P and Q
$$\mathrm{MMD}^2(P, Q; F) := \| \mu_P - \mu_Q \|_{\mathcal{F}}^2$$
• Is $\widehat{\mathrm{MMD}}^2$ significantly > 0?
• $P = Q$, null distribution of $\widehat{\mathrm{MMD}}^2$:
$$n \, \widehat{\mathrm{MMD}}^2 \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left( z_l^2 - 2 \right),$$
– $\lambda_l$ is the $l$th eigenvalue of the kernel $\tilde{k}(x_i, x_j)$
Use Gram matrix spectrum for $\hat{\lambda}_l$: consistent test without permutation
[Figure: $P \neq Q$ (neuro): Type II error against sample size m, for the spectral and permutation tests]

Hypothesis testing with HSIC
Distribution of HSIC at independence
• (Biased) empirical HSIC is a V-statistic:
$$\mathrm{HSIC}_b = \frac{1}{n^2} \mathrm{trace}(KHLH)$$
– Statistical testing: how do we find when this is large enough that the null hypothesis $P = P_x P_y$ is unlikely?
– Formally: given $P = P_x P_y$, what is the threshold T such that $P(\mathrm{HSIC} > T) < \alpha$ for small $\alpha$?

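One way to obtain such a threshold, by analogy with the permutation approach used for MMD earlier, is to permute one of the two samples, which breaks any dependence while keeping both marginals. This is an assumption on our part, since the slide does not say which method is used; the sketch also assumes Gaussian kernels, 300 permutations, $\alpha = 0.05$, and hypothetical function names.

```python
import numpy as np

def gauss_gram(a, sigma=1.0):
    """Gaussian Gram matrix for a 1-D sample array a."""
    return np.exp(-(a[:, None] - a[None, :]) ** 2 / (2 * sigma**2))

def hsic_b(K, L):
    """Biased empirical HSIC: trace(K H L H) / n^2 with H the centring matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n**2

def hsic_permutation_threshold(K, L, alpha=0.05, n_perm=300, seed=0):
    """(1 - alpha) quantile of HSIC_b when the y sample is randomly permuted."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    null = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(n)
        null[b] = hsic_b(K, L[np.ix_(idx, idx)])  # Gram matrix of permuted y
    return float(np.quantile(null, 1 - alpha))

# Assumed toy data: y depends on x through a noisy linear relation.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + 0.5 * rng.normal(size=100)

K, L = gauss_gram(x), gauss_gram(y)
hsic_value = hsic_b(K, L)
threshold = hsic_permutation_threshold(K, L)
print(hsic_value, threshold, hsic_value > threshold)
```

Permuting rows and columns of L together corresponds to re-pairing each $x_i$ with a randomly chosen $y_j$, so the permuted statistics approximate the distribution of $\mathrm{HSIC}_b$ under $P = P_x P_y$.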