On the Chi square and higher-order Chi distances for approximating f-divergences

Frank Nielsen¹   Richard Nock²
¹ Sony Computer Science Laboratories, Inc.
² UAG-CEREGMIA

www.informationgeometry.org

September 2013

© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.
Statistical divergences

Statistical divergences measure the separability between two distributions. Examples: the Pearson/Neyman $\chi^2$ and the Kullback-Leibler divergence:

$$\chi^2_P(X_1:X_2) = \int \frac{(x_2(x) - x_1(x))^2}{x_1(x)} \, \mathrm{d}\nu(x),$$
$$\chi^2_N(X_1:X_2) = \int \frac{(x_1(x) - x_2(x))^2}{x_2(x)} \, \mathrm{d}\nu(x),$$
$$\mathrm{KL}(X_1:X_2) = \int x_1(x) \log \frac{x_1(x)}{x_2(x)} \, \mathrm{d}\nu(x).$$
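As a concrete illustration (our sketch, not part of the original slides and not the authors' released package), here is a minimal Java program evaluating these three divergences for two discrete distributions on a common finite support, where the integral against $\mathrm{d}\nu$ reduces to a sum; class and method names are ours.

```java
// Sketch: Pearson/Neyman chi-square and KL between two discrete
// distributions x1, x2 given as probability vectors on a finite support.
public class Divergences {
    static double chi2Pearson(double[] x1, double[] x2) {
        double s = 0;
        for (int i = 0; i < x1.length; i++) {
            double d = x2[i] - x1[i];
            s += d * d / x1[i];                        // (x2 - x1)^2 / x1
        }
        return s;
    }
    static double chi2Neyman(double[] x1, double[] x2) {
        double s = 0;
        for (int i = 0; i < x1.length; i++) {
            double d = x1[i] - x2[i];
            s += d * d / x2[i];                        // (x1 - x2)^2 / x2
        }
        return s;
    }
    static double kl(double[] x1, double[] x2) {
        double s = 0;
        for (int i = 0; i < x1.length; i++)
            s += x1[i] * Math.log(x1[i] / x2[i]);      // x1 log(x1/x2)
        return s;
    }
    public static void main(String[] args) {
        double[] p = {0.5, 0.3, 0.2}, q = {0.4, 0.4, 0.2};
        System.out.printf("chi2_P=%.4f chi2_N=%.4f KL=%.4f%n",
                chi2Pearson(p, q), chi2Neyman(p, q), kl(p, q));
    }
}
```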
f-divergences: A generic definition

$$I_f(X_1:X_2) = \int x_1(x)\, f\!\left(\frac{x_2(x)}{x_1(x)}\right) \mathrm{d}\nu(x) \geq 0,$$

where $f$ is a convex function $f : (0,\infty) \subseteq \mathrm{dom}(f) \to [0,\infty]$ such that $f(1) = 0$.

Jensen's inequality: $I_f(X_1:X_2) \geq f\!\left(\int x_2(x)\,\mathrm{d}\nu(x)\right) = f(1) = 0$.

May consider $f'(1) = 0$ and fix the scale of the divergence by setting $f''(1) = 1$.

Can always be symmetrized: $S_f(X_1:X_2) = I_f(X_1:X_2) + I_{f^*}(X_1:X_2)$ with $f^*(u) = u f(1/u)$, and $I_{f^*}(X_1:X_2) = I_f(X_2:X_1)$.
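The generic definition translates directly into code by passing the generator $f$ as a function. A minimal Java sketch under the same finite-support assumption as above; all identifiers are illustrative:

```java
import java.util.function.DoubleUnaryOperator;

// Sketch: generic f-divergence I_f(X1:X2) = sum_x x1(x) f(x2(x)/x1(x))
// on a finite support, plus the symmetrization S_f via f*(u) = u f(1/u).
public class FDivergence {
    static double If(double[] x1, double[] x2, DoubleUnaryOperator f) {
        double s = 0;
        for (int i = 0; i < x1.length; i++)
            s += x1[i] * f.applyAsDouble(x2[i] / x1[i]);
        return s;
    }
    static double Sf(double[] x1, double[] x2, DoubleUnaryOperator f) {
        // I_{f*}(X1:X2) = I_f(X2:X1): symmetrize by swapping arguments.
        return If(x1, x2, f) + If(x2, x1, f);
    }
    public static void main(String[] args) {
        DoubleUnaryOperator pearson = u -> (u - 1) * (u - 1); // chi2_P generator
        DoubleUnaryOperator kl = u -> -Math.log(u);           // KL generator
        double[] p = {0.5, 0.3, 0.2}, q = {0.4, 0.4, 0.2};
        System.out.println(If(p, q, pearson)); // = chi2_P(P:Q)
        System.out.println(Sf(p, q, kl));      // = KL(P:Q) + KL(Q:P)
    }
}
```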
f-divergences: Some examples

Name of the f-divergence | Formula $I_f(P:Q)$ | Generator $f(u)$ with $f(1)=0$
Total variation (metric) | $\frac{1}{2}\int |p(x)-q(x)|\,\mathrm{d}\nu(x)$ | $\frac{1}{2}|u-1|$
Squared Hellinger | $\int (\sqrt{p(x)}-\sqrt{q(x)})^2\,\mathrm{d}\nu(x)$ | $(\sqrt{u}-1)^2$
Pearson $\chi^2_P$ | $\int \frac{(q(x)-p(x))^2}{p(x)}\,\mathrm{d}\nu(x)$ | $(u-1)^2$
Neyman $\chi^2_N$ | $\int \frac{(p(x)-q(x))^2}{q(x)}\,\mathrm{d}\nu(x)$ | $\frac{(1-u)^2}{u}$
Pearson-Vajda $\chi^k_P$ | $\int \frac{(q(x)-p(x))^k}{p^{k-1}(x)}\,\mathrm{d}\nu(x)$ | $(u-1)^k$
Pearson-Vajda $|\chi|^k_P$ | $\int \frac{|q(x)-p(x)|^k}{p^{k-1}(x)}\,\mathrm{d}\nu(x)$ | $|u-1|^k$
Kullback-Leibler | $\int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}\nu(x)$ | $-\log u$
reverse Kullback-Leibler | $\int q(x)\log\frac{q(x)}{p(x)}\,\mathrm{d}\nu(x)$ | $u\log u$
$\alpha$-divergence | $\frac{4}{1-\alpha^2}\left(1-\int p^{\frac{1-\alpha}{2}}(x)\, q^{\frac{1+\alpha}{2}}(x)\,\mathrm{d}\nu(x)\right)$ | $\frac{4}{1-\alpha^2}\left(1-u^{\frac{1+\alpha}{2}}\right)$
Jensen-Shannon | $\frac{1}{2}\int\left(p(x)\log\frac{2p(x)}{p(x)+q(x)}+q(x)\log\frac{2q(x)}{p(x)+q(x)}\right)\mathrm{d}\nu(x)$ | $-\frac{u+1}{2}\log\frac{1+u}{2}+\frac{u}{2}\log u$
Stochastic approximations of f-divergences

$$\hat{I}^{(n)}_f(X_1:X_2) = \frac{1}{2n}\sum_{i=1}^n \left[ f\!\left(\frac{x_2(s_i)}{x_1(s_i)}\right) + \frac{x_1(t_i)}{x_2(t_i)}\, f\!\left(\frac{x_2(t_i)}{x_1(t_i)}\right) \right],$$

with $s_1, \ldots, s_n$ and $t_1, \ldots, t_n$ i.i.d. sampled from $X_1$ and $X_2$, respectively.

$$\lim_{n\to\infty} \hat{I}^{(n)}_f(X_1:X_2) \to I_f(X_1:X_2)$$

◮ Works for any generator $f$, but...
◮ In practice, limited to small-dimension supports.
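A sketch of this two-sample estimator for a pair of Poisson laws, chosen so the densities $x_1, x_2$ are available in closed form; the Poisson sampler (Knuth's multiplication method, adequate for small rates) and all names are our illustrative choices:

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Sketch: Monte Carlo estimator of I_f between Poisson(0.6) and Poisson(0.3).
public class StochasticIf {
    static double poissonPmf(int x, double lambda) {
        double logP = x * Math.log(lambda) - lambda;
        for (int i = 2; i <= x; i++) logP -= Math.log(i); // subtract log x!
        return Math.exp(logP);
    }
    static int samplePoisson(double lambda, Random rng) { // Knuth's method
        double L = Math.exp(-lambda), p = 1.0;
        int k = 0;
        do { k++; p *= rng.nextDouble(); } while (p > L);
        return k - 1;
    }
    public static void main(String[] args) {
        double l1 = 0.6, l2 = 0.3;
        int n = 1_000_000;
        DoubleUnaryOperator f = u -> -Math.log(u); // KL generator
        Random rng = new Random(1);
        double sum = 0;
        for (int i = 0; i < n; i++) {
            int s = samplePoisson(l1, rng), t = samplePoisson(l2, rng);
            double rs = poissonPmf(s, l2) / poissonPmf(s, l1); // s_i ~ X1
            double rt = poissonPmf(t, l2) / poissonPmf(t, l1); // t_i ~ X2
            sum += f.applyAsDouble(rs) + f.applyAsDouble(rt) / rt;
        }
        // Estimates KL(0.6 : 0.3), whose exact value is ~ 0.1158 (see slide 14).
        System.out.println(sum / (2 * n));
    }
}
```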
Exponential families

Canonical decomposition of the probability measure:

$$p_\theta(x) = \exp(\langle t(x), \theta \rangle - F(\theta) + k(x)).$$

Here, consider the natural parameter space $\Theta$ affine.

$$\mathrm{Poi}(\lambda):\quad p(x\,|\,\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \lambda > 0,\ x \in \{0, 1, \ldots\}$$
$$\mathrm{Nor}_I(\mu):\quad p(x\,|\,\mu) = (2\pi)^{-\frac{d}{2}} e^{-\frac{1}{2}(x-\mu)^\top(x-\mu)}, \quad \mu \in \mathbb{R}^d,\ x \in \mathbb{R}^d$$

Family | $\theta$ | $\Theta$ | $F(\theta)$ | $k(x)$ | $t(x)$ | $\nu$
Poisson | $\log\lambda$ | $\mathbb{R}$ | $e^\theta$ | $-\log x!$ | $x$ | $\nu_c$ (counting)
Iso. Gaussian | $\mu$ | $\mathbb{R}^d$ | $\frac{1}{2}\theta^\top\theta + \frac{d}{2}\log 2\pi$ | $-\frac{1}{2}x^\top x$ | $x$ | $\nu_L$ (Lebesgue)
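To see the canonical decomposition in action, a small sketch (ours) checking that $\theta = \log\lambda$, $F(\theta) = e^\theta$, $t(x) = x$, $k(x) = -\log x!$ reproduces the Poisson probability mass function:

```java
// Sketch: the Poisson family in canonical exponential-family form,
// p_theta(x) = exp(<t(x), theta> - F(theta) + k(x)).
public class PoissonCanonical {
    static double F(double theta) { return Math.exp(theta); }
    static double k(int x) {
        double s = 0;
        for (int i = 2; i <= x; i++) s -= Math.log(i); // -log x!
        return s;
    }
    static double density(int x, double theta) {
        return Math.exp(x * theta - F(theta) + k(x));  // t(x) = x
    }
    public static void main(String[] args) {
        double lambda = 0.6, theta = Math.log(lambda);
        // Agrees with lambda^x e^{-lambda} / x! for each x:
        for (int x = 0; x <= 3; x++)
            System.out.printf("p(%d) = %.6f%n", x, density(x, theta));
    }
}
```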
χ² for affine exponential families

Bypass integral computation; closed-form formulas:

$$\chi^2_P(X_1:X_2) = e^{F(2\theta_2-\theta_1) - (2F(\theta_2) - F(\theta_1))} - 1,$$
$$\chi^2_N(X_1:X_2) = e^{F(2\theta_1-\theta_2) - (2F(\theta_1) - F(\theta_2))} - 1.$$

The Kullback-Leibler divergence amounts to a Bregman divergence [3]:

$$\mathrm{KL}(X_1:X_2) = B_F(\theta_2:\theta_1),$$
$$B_F(\theta:\theta') = F(\theta) - F(\theta') - (\theta-\theta')^\top \nabla F(\theta').$$
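A sketch of these closed forms for the Poisson family, where $F(\theta) = e^\theta$ and hence $\nabla F(\theta) = e^\theta$; identifiers are ours:

```java
// Sketch: closed-form chi-squares and Bregman-form KL for Poisson laws.
public class ClosedFormChi2 {
    static double F(double t) { return Math.exp(t); }
    static double chi2P(double t1, double t2) {
        return Math.exp(F(2 * t2 - t1) - (2 * F(t2) - F(t1))) - 1;
    }
    static double chi2N(double t1, double t2) {
        return Math.exp(F(2 * t1 - t2) - (2 * F(t1) - F(t2))) - 1;
    }
    static double klBregman(double t1, double t2) {
        // KL(X1:X2) = B_F(theta2 : theta1) = F(t2) - F(t1) - (t2 - t1) F'(t1)
        return F(t2) - F(t1) - (t2 - t1) * Math.exp(t1);
    }
    public static void main(String[] args) {
        double t1 = Math.log(0.6), t2 = Math.log(0.3);
        System.out.println(chi2P(t1, t2));     // Pearson chi2
        System.out.println(chi2N(t1, t2));     // Neyman chi2
        System.out.println(klBregman(t1, t2)); // ~ 0.1158 (see slide 14)
    }
}
```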
Higher-order Vajda χᵏ divergences

$$\chi^k_P(X_1:X_2) = \int \frac{(x_2(x) - x_1(x))^k}{x_1(x)^{k-1}} \, \mathrm{d}\nu(x),$$
$$|\chi|^k_P(X_1:X_2) = \int \frac{|x_2(x) - x_1(x)|^k}{x_1(x)^{k-1}} \, \mathrm{d}\nu(x),$$

are f-divergences for the generators $(u-1)^k$ and $|u-1|^k$.

◮ When $k = 1$, $\chi^1_P(X_1:X_2) = \int (x_2(x) - x_1(x))\,\mathrm{d}\nu(x) = 0$ (never discriminative), and $|\chi|^1_P(X_1:X_2)$ is twice the total variation distance.
◮ $\chi^0_P$ is the unit constant.
◮ $\chi^k_P$ is a signed distance (it may be negative for odd $k$).
Higher-order Vajda χᵏ divergences

Lemma
The (signed) $\chi^k_P$ distance between members $X_1 \sim E_F(\theta_1)$ and $X_2 \sim E_F(\theta_2)$ of the same affine exponential family is ($k \in \mathbb{N}$) always bounded and equal to:

$$\chi^k_P(X_1:X_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} \frac{e^{F((1-j)\theta_1 + j\theta_2)}}{e^{(1-j)F(\theta_1) + jF(\theta_2)}}.$$
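The lemma is a finite sum, hence immediate to implement. A sketch (ours) for a scalar natural parameter, with the log-normalizer $F$ passed as a function:

```java
import java.util.function.DoubleUnaryOperator;

// Sketch of the lemma: chi^k_P between members theta1, theta2 of an
// affine exponential family with scalar log-normalizer F.
public class VajdaChiK {
    static double binom(int k, int j) { // binomial coefficient C(k, j)
        double b = 1;
        for (int i = 0; i < j; i++) b = b * (k - i) / (i + 1);
        return b;
    }
    static double chiK(int k, double t1, double t2, DoubleUnaryOperator F) {
        double s = 0;
        for (int j = 0; j <= k; j++) {
            double sign = ((k - j) % 2 == 0) ? 1 : -1;    // (-1)^{k-j}
            double lin = (1 - j) * t1 + j * t2;           // (1-j)theta1 + j theta2
            double num = F.applyAsDouble(lin);
            double den = (1 - j) * F.applyAsDouble(t1) + j * F.applyAsDouble(t2);
            s += sign * binom(k, j) * Math.exp(num - den);
        }
        return s;
    }
}
```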
Higher-order Vajda χᵏ divergences

For Poisson/Normal distributions, we get closed-form formulas:

$$\chi^k_P(\lambda_1:\lambda_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} e^{\lambda_1^{1-j}\lambda_2^{j} - ((1-j)\lambda_1 + j\lambda_2)},$$
$$\chi^k_P(\mu_1:\mu_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} e^{\frac{1}{2}j(j-1)(\mu_1-\mu_2)^\top(\mu_1-\mu_2)}.$$

These are signed distances.
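For instance, plugging the Poisson log-normalizer into the `VajdaChiK.chiK` sketch above recovers the first formula, and cross-checks the earlier slides ($\chi^1_P = 0$, and $\chi^2_P$ matches the closed form of slide 7); a usage sketch with names of ours:

```java
// Usage: Poisson log-normalizer F(theta) = e^theta with theta = log(lambda).
public class VajdaChiKDemo {
    public static void main(String[] args) {
        double t1 = Math.log(0.6), t2 = Math.log(0.3);
        System.out.println(VajdaChiK.chiK(1, t1, t2, Math::exp)); // 0: never discriminative
        System.out.println(VajdaChiK.chiK(2, t1, t2, Math::exp)); // chi2_P = e^{0.15}-1 ~ 0.1618
    }
}
```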
f-divergences from Taylor series

Lemma (extends Theorem 1 of [1])
When bounded, the f-divergence $I_f$ can be expressed as a power series of higher-order chi-type distances:

$$I_f(X_1:X_2) = \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda) \int x_1(x) \left(\frac{x_2(x)}{x_1(x)} - \lambda\right)^{i} \mathrm{d}\nu(x) = \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda)\, \chi^i_{\lambda,P}(X_1:X_2),$$

provided $I_f < \infty$, where $\chi^i_{\lambda,P}(X_1:X_2)$ is a generalization of $\chi^i_P$ defined by:

$$\chi^i_{\lambda,P}(X_1:X_2) = \int \frac{(x_2(x) - \lambda x_1(x))^i}{x_1(x)^{i-1}} \, \mathrm{d}\nu(x),$$

with $\chi^0_{\lambda,P}(X_1:X_2) = 1$ by convention. Note that $\chi^i_{\lambda,P} \geq f(1) = (1-\lambda)^i$ (Jensen, for even $i$), and $f(u) = (u-\lambda)^i - (1-\lambda)^i$ is an f-divergence generator (with $f(1) = 0$).
f-divergences: Analytic formula

◮ $\lambda = 1 \in \mathrm{int}(\mathrm{dom}(f^{(i)}))$: truncated f-divergence series (Theorem 1 of [1]):

$$\left| I_f(X_1:X_2) - \sum_{k=0}^{s} \frac{f^{(k)}(1)}{k!}\, \chi^k_P(X_1:X_2) \right| \leq \frac{1}{(s+1)!} \|f^{(s+1)}\|_\infty (M-m)^s,$$

where $\|f^{(s+1)}\|_\infty = \sup_{t \in [m,M]} |f^{(s+1)}(t)|$ and $m \leq \frac{p}{q} \leq M$.

◮ $\lambda = 0$ (whenever $0 \in \mathrm{int}(\mathrm{dom}(f^{(i)}))$) and affine exponential families: simpler expression:

$$I_f(X_1:X_2) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!}\, I_{1-i,i}(\theta_1:\theta_2), \qquad I_{1-i,i}(\theta_1:\theta_2) = \frac{e^{F(i\theta_2 + (1-i)\theta_1)}}{e^{iF(\theta_2) + (1-i)F(\theta_1)}}.$$
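As a sanity check of the $\lambda = 0$ expansion (our example, not from the slides): for $f(u) = (u-1)^2$, only $i = 0, 1, 2$ contribute since higher derivatives vanish ($f^{(i)}(0)/i! = 1, -2, 1$), and the series must collapse to the closed-form Pearson $\chi^2$ of slide 7:

```java
import java.util.function.DoubleUnaryOperator;

// Sketch: the lambda = 0 expansion specialized to f(u) = (u - 1)^2.
public class LambdaZeroExpansion {
    static double I(int i, double t1, double t2, DoubleUnaryOperator F) {
        // I_{1-i,i}(theta1:theta2) = exp(F(i t2 + (1-i) t1) - (i F(t2) + (1-i) F(t1)))
        return Math.exp(F.applyAsDouble(i * t2 + (1 - i) * t1)
                - (i * F.applyAsDouble(t2) + (1 - i) * F.applyAsDouble(t1)));
    }
    public static void main(String[] args) {
        double t1 = Math.log(0.6), t2 = Math.log(0.3);
        DoubleUnaryOperator F = Math::exp; // Poisson log-normalizer
        double series = 1 * I(0, t1, t2, F) - 2 * I(1, t1, t2, F)
                + 1 * I(2, t1, t2, F);
        // I(0) = I(1) = 1, so this equals I(2) - 1 = chi2_P = e^{0.15} - 1.
        System.out.println(series);
    }
}
```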
Corollary: Approximating f-divergences by χ² divergences

Corollary
A second-order Taylor expansion yields

$$I_f(X_1:X_2) \sim f(1) + f'(1)\,\chi^1_N(X_1:X_2) + \frac{1}{2} f''(1)\,\chi^2_N(X_1:X_2).$$

Since $f(1) = 0$ and $\chi^1_N(X_1:X_2) = 0$, it follows that

$$I_f(X_1:X_2) \sim \frac{f''(1)}{2}\, \chi^2_N(X_1:X_2)$$

($f''(1) > 0$ follows from the strict convexity of the generator).

When $f(u) = u \log u$, this yields the well-known approximation [2]:

$$\chi^2_P(X_1:X_2) \sim 2\, \mathrm{KL}(X_1:X_2).$$
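A quick numeric check of the corollary (our example): for two nearby Poisson rates, $2\,\mathrm{KL}$ and $\chi^2_P$ computed from the closed forms of slide 7 agree to about two significant digits:

```java
// Sketch: checking chi2_P ~ 2 KL on two nearby Poisson laws.
public class SecondOrderCheck {
    public static void main(String[] args) {
        double l1 = 0.6, l2 = 0.58;
        double kl = l1 * Math.log(l1 / l2) + l2 - l1;       // exact KL(l1:l2)
        double chi2P = Math.exp(l2 * l2 / l1 + l1 - 2 * l2) - 1;
        System.out.println(2 * kl);  // ~ 6.82e-4
        System.out.println(chi2P);   // ~ 6.67e-4: close for nearby rates
    }
}
```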
Kullback-Leibler divergence: Analytic expression

Kullback-Leibler divergence: $f(u) = -\log u$, so $f^{(i)}(u) = (-1)^i (i-1)!\, u^{-i}$ and hence $\frac{f^{(i)}(1)}{i!} = \frac{(-1)^i}{i}$ for $i \geq 1$ (with $f(1) = 0$). Since $\chi^1_{1,P} = 0$, it follows that:

$$\mathrm{KL}(X_1:X_2) = \sum_{i=2}^{\infty} \frac{(-1)^i}{i}\, \chi^i_P(X_1:X_2)$$

→ an alternating-sign series.

Poisson distributions with $\lambda_1 = 0.6$ and $\lambda_2 = 0.3$: $\mathrm{KL} \simeq 0.1158$ (exact, using the Bregman divergence); stochastic evaluation with $n = 10^6$ yields $\widehat{\mathrm{KL}} \simeq 0.1156$.

KL divergence from Taylor truncation: $0.0809$ ($s=2$), $0.0910$ ($s=3$), $0.1017$ ($s=4$), $0.1135$ ($s=10$), $0.1150$ ($s=15$), etc.
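The truncation values above can be reproduced from the closed-form Poisson $\chi^k_P$ of slide 10; a self-contained sketch (identifiers ours):

```java
// Sketch: alternating chi-series for KL between Poisson(0.6) and Poisson(0.3),
// truncated at order s; reproduces the values on this slide.
public class KLTruncated {
    static double binom(int k, int j) { // binomial coefficient C(k, j)
        double b = 1;
        for (int i = 0; i < j; i++) b = b * (k - i) / (i + 1);
        return b;
    }
    // chi^k_P for Poisson rates (closed form of the lemma)
    static double chiK(int k, double l1, double l2) {
        double s = 0;
        for (int j = 0; j <= k; j++) {
            double sign = ((k - j) % 2 == 0) ? 1 : -1;   // (-1)^{k-j}
            double expo = Math.pow(l1, 1 - j) * Math.pow(l2, j)
                    - ((1 - j) * l1 + j * l2);
            s += sign * binom(k, j) * Math.exp(expo);
        }
        return s;
    }
    public static void main(String[] args) {
        double l1 = 0.6, l2 = 0.3;
        double exact = l1 * Math.log(l1 / l2) + l2 - l1;  // 0.1158...
        double sum = 0;
        for (int i = 2; i <= 15; i++) {
            sum += ((i % 2 == 0) ? 1.0 : -1.0) / i * chiK(i, l1, l2);
            System.out.printf("s=%2d  KL~%.4f%n", i, sum); // 0.0809, 0.0910, ...
        }
        System.out.printf("exact %.4f%n", exact);
    }
}
```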
Contributions

Statistical f-divergences between members of the same exponential family with affine natural parameter space:

◮ Generic closed-form formulas for the Pearson/Neyman χ² and Vajda χᵏ-type distances.
◮ Analytic expression of f-divergences using Pearson-Vajda-type distances.
◮ Second-order Taylor approximation for fast estimation of f-divergences.

Java™ package: www.informationgeometry.org/fDivergence/
Thank you.

@article{fDivChi-arXiv1309.3029,
  author = "Frank Nielsen and Richard Nock",
  title  = "On the {C}hi square and higher-order {C}hi distances for approximating $f$-divergences",
  year   = "2013",
  eprint = "arXiv/1309.3029"
}

www.informationgeometry.org