
Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. Frank Nielsen (Frank.Nielsen@acm.org), Sony Computer Science Laboratories, Inc. April 2013.


  1. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. Frank Nielsen, Frank.Nielsen@acm.org, Sony Computer Science Laboratories, Inc. April 2013.

  2. Why histogram clustering?
Task: classify documents into categories using the Bag-of-Words (BoW) modeling paradigm [3, 6]:
◮ Define a word dictionary, and
◮ Represent each document by a word-count histogram.
Centroid-based k-means clustering [1]:
◮ Cluster document histograms to learn categories,
◮ Build visual vocabularies by quantizing image features: Compressed Histogram of Gradient descriptors [4].
→ histogram centroids.
Notation: $w_h = \sum_{i=1}^d h^i$ is the cumulative sum of the bin values, and $\tilde{\cdot}$ is the normalization operator.
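As a concrete illustration of the BoW representation and of the normalization operator (a minimal sketch, not from the slides; the helper names and the toy vocabulary are mine):

    from collections import Counter
    import numpy as np

    def word_count_histogram(tokens, vocabulary):
        # Positive histogram h: one bin per dictionary word; w_h is the sum of the bins.
        counts = Counter(tokens)
        return np.array([counts[w] for w in vocabulary], dtype=float)

    def normalize(h):
        # The "~" operator of the slides: h / w_h lies on the probability simplex.
        return h / h.sum()

    vocabulary = ["cat", "dog", "fish"]                        # toy dictionary
    h = word_count_histogram("cat dog dog fish cat".split(), vocabulary)
    print(h, normalize(h))                                     # [2. 2. 1.] [0.4 0.4 0.2]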

  3. Why Jeffreys divergence?
Distance between two frequency histograms $\tilde{p}$ and $\tilde{q}$: the Kullback-Leibler divergence, or relative entropy:
$KL(\tilde{p} : \tilde{q}) = H^{\times}(\tilde{p} : \tilde{q}) - H(\tilde{p})$
$H^{\times}(\tilde{p} : \tilde{q}) = \sum_{i=1}^d \tilde{p}^i \log \frac{1}{\tilde{q}^i}$ (cross-entropy)
$H(\tilde{p}) = H^{\times}(\tilde{p} : \tilde{p}) = \sum_{i=1}^d \tilde{p}^i \log \frac{1}{\tilde{p}^i}$ (Shannon entropy)
→ expected extra number of bits per datum that must be transmitted when using the "wrong" distribution $\tilde{q}$ instead of the true distribution $\tilde{p}$. $\tilde{p}$ is hidden by nature (and hypothesized), $\tilde{q}$ is estimated.
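A minimal numerical sketch of these definitions (not from the slides; it uses natural logarithms, so the values are in nats rather than bits, and the function names are mine):

    import numpy as np

    def cross_entropy(p, q):
        # H_x(p:q) = sum_i p_i * log(1/q_i)
        return float(np.sum(p * np.log(1.0 / q)))

    def kl(p, q):
        # KL(p:q) = H_x(p:q) - H(p), with H(p) = H_x(p:p)
        return cross_entropy(p, q) - cross_entropy(p, p)

    p = np.array([0.2, 0.3, 0.5])
    q = np.array([0.1, 0.4, 0.5])
    print(kl(p, q), kl(q, p))   # asymmetric: KL(p:q) != KL(q:p) in general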

  4. Why Jeffreys divergence?
When clustering histograms, all histograms play the same role → Jeffreys [8] divergence:
$J(p, q) = KL(p : q) + KL(q : p) = \sum_{i=1}^d (p^i - q^i) \log \frac{p^i}{q^i} = J(q, p)$
→ symmetrizes the KL divergence (also called the J-divergence, symmetrical Kullback-Leibler divergence, etc.).
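Continuing the sketch above (same assumptions), the symmetrized divergence in one line:

    def jeffreys(p, q):
        # J(p,q) = KL(p:q) + KL(q:p) = sum_i (p_i - q_i) * log(p_i / q_i)
        return float(np.sum((p - q) * np.log(p / q)))

    print(jeffreys(p, q), jeffreys(q, p))   # symmetric: both calls return the same value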

  5. Jeffreys centroids: frequency and positive centroids
A set $\mathcal{H} = \{h_1, ..., h_n\}$ of weighted histograms, with positive weights $\pi_j > 0$ such that $\sum_{j=1}^n \pi_j = 1$.
◮ Jeffreys positive centroid $c$:
$c = \arg\min_{x \in \mathbb{R}^d_+} \sum_{j=1}^n \pi_j J(h_j, x)$
◮ Jeffreys frequency centroid $\tilde{c}$:
$\tilde{c} = \arg\min_{x \in \Delta_d} \sum_{j=1}^n \pi_j J(\tilde{h}_j, x)$
$\Delta_d$: the probability $(d-1)$-dimensional simplex.
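The shared objective of both problems can be sketched as follows (reusing `jeffreys` from above; the function name and the array layout are mine). The positive centroid minimizes it over $\mathbb{R}^d_+$, the frequency centroid over the simplex:

    def jeffreys_objective(hists, weights, x):
        # J(H, x) = sum_j pi_j * J(h_j, x); hists is an (n, d) array, weights sum to 1.
        return float(sum(w * jeffreys(h, x) for h, w in zip(hists, weights)))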

  6. Prior work
◮ Histogram clustering wrt. the χ² distance [10]
◮ Histogram clustering wrt. the Bhattacharyya distance [11, 13]
◮ Histogram clustering wrt. the Kullback-Leibler distance as Bregman k-means clustering [1]
◮ Jeffreys frequency centroid [16] (Newton numerical optimization)
◮ Jeffreys frequency centroid as an equivalent symmetrized Bregman centroid [14]
◮ Mixed Bregman clustering [15]
◮ Smooth family of symmetrized KL centroids including the Jensen-Shannon centroids and the Jeffreys centroids in the limit case [12]

  7. Jeffreys positive centroid
$c = \arg\min_{x \in \mathbb{R}^d_+} J(\mathcal{H}, x) = \arg\min_{x \in \mathbb{R}^d_+} \sum_{j=1}^n \pi_j J(h_j, x)$
Theorem 1. The Jeffreys positive centroid $c = (c^1, ..., c^d)$ of a set $\{h_1, ..., h_n\}$ of $n$ weighted positive histograms with $d$ bins can be calculated component-wise exactly using the Lambert $W$ analytic function:
$c^i = \frac{a^i}{W(\frac{a^i}{g^i} e)}$,
where $a^i = \sum_{j=1}^n \pi_j h^i_j$ denotes the coordinate-wise arithmetic weighted mean and $g^i = \prod_{j=1}^n (h^i_j)^{\pi_j}$ the coordinate-wise geometric weighted mean.
Lambert analytic function [2]: $W(x) e^{W(x)} = x$ for $x \geq 0$.
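A minimal sketch of Theorem 1 (not the paper's reference code; it assumes SciPy's principal-branch Lambert W and strictly positive bin values):

    import numpy as np
    from scipy.special import lambertw

    def jeffreys_positive_centroid(hists, weights):
        # Theorem 1: c_i = a_i / W(a_i * e / g_i), with a and g the coordinate-wise
        # weighted arithmetic and geometric means of the positive histograms.
        hists = np.asarray(hists, dtype=float)   # shape (n, d), all entries > 0
        a = np.average(hists, axis=0, weights=weights)
        g = np.exp(np.average(np.log(hists), axis=0, weights=weights))
        return a / lambertw(a * np.e / g).real   # principal branch W_0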

  8. Jeffreys positive centroid (proof)
$\min_x \sum_{j=1}^n \pi_j J(h_j, x) = \min_x \sum_{j=1}^n \pi_j \sum_{i=1}^d (h^i_j - x^i)(\log h^i_j - \log x^i)$
$\equiv \min_x \sum_{i=1}^d \sum_{j=1}^n \pi_j (x^i \log x^i - x^i \log h^i_j - h^i_j \log x^i)$
$= \min_x \sum_{i=1}^d \left( x^i \log \frac{x^i}{g^i} - a^i \log x^i \right)$,
using $\log g^i = \sum_{j=1}^n \pi_j \log h^i_j$ and $a^i = \sum_{j=1}^n \pi_j h^i_j$.

  9. Jeffreys positive centroid (proof)
Coordinate-wise, minimize:
$\min_x \; x \log \frac{x}{g} - a \log x$
Setting the derivative to zero, we solve:
$\log \frac{x}{g} + 1 - \frac{a}{x} = 0$
and get
$x = \frac{a}{W(\frac{a}{g} e)}$
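A quick numeric check of this stationarity condition (illustrative only; the values of a and g are arbitrary):

    import numpy as np
    from scipy.special import lambertw

    a, g = 0.4, 0.3                          # any positive pair
    x = a / lambertw(a * np.e / g).real      # closed-form minimizer
    print(np.log(x / g) + 1.0 - a / x)       # ~0 up to floating-point error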

  10. Jeffreys frequency centroid: A guaranteed approximation
$\tilde{c} = \arg\min_{x \in \Delta_d} \sum_{j=1}^n \pi_j J(\tilde{h}_j, x)$
Relaxing $x$ from the probability simplex $\Delta_d$ to $\mathbb{R}^d_+$, we get the normalized approximation
$c' = \frac{c}{w_c}$, i.e. $c'^i = \frac{c^i}{w_c}$, with $c^i = \frac{a^i}{W(\frac{a^i}{g^i} e)}$.
Lemma 1. The cumulative sum $w_c$ of the bin values of the Jeffreys positive centroid $c$ of a set of frequency histograms is less than or equal to one: $0 < w_c \leq 1$.
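In code, the relaxed-then-renormalized approximation is one extra line on top of the positive-centroid sketch above (assuming the inputs are frequency histograms):

    def jeffreys_frequency_centroid_approx(freq_hists, weights):
        # c' = c / w_c: project the closed-form positive centroid back onto the
        # simplex; by Lemma 1, the normalization factor w_c is at most 1 here.
        c = jeffreys_positive_centroid(freq_hists, weights)
        return c / c.sum()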

  11. Proof of Lemma 1
From Theorem 1:
$w_c = \sum_{i=1}^d c^i = \sum_{i=1}^d \frac{a^i}{W(\frac{a^i}{g^i} e)}$
Arithmetic-geometric mean inequality: $a^i \geq g^i$.
Therefore $W(\frac{a^i}{g^i} e) \geq W(e) = 1$ and $c^i \leq a^i$. Thus
$w_c = \sum_{i=1}^d c^i \leq \sum_{i=1}^d a^i = 1$
(the last equality holds because each $\tilde{h}_j$ sums to one and $\sum_{j=1}^n \pi_j = 1$).

  12. Lemma 2
Lemma 2. For any positive histogram $x$ and frequency histogram $\tilde{h}$, we have
$J(x, \tilde{h}) = J(\tilde{x}, \tilde{h}) + (w_x - 1)(KL(\tilde{x} : \tilde{h}) + \log w_x)$,
where $w_x$ denotes the normalization factor ($w_x = \sum_{i=1}^d x^i$).
Averaging over a weighted set $\tilde{\mathcal{H}}$ of frequency histograms:
$J(x, \tilde{\mathcal{H}}) = J(\tilde{x}, \tilde{\mathcal{H}}) + (w_x - 1)(KL(\tilde{x} : \tilde{\mathcal{H}}) + \log w_x)$,
where $J(x, \tilde{\mathcal{H}}) = \sum_{j=1}^n \pi_j J(x, \tilde{h}_j)$ and $KL(\tilde{x} : \tilde{\mathcal{H}}) = \sum_{j=1}^n \pi_j KL(\tilde{x} : \tilde{h}_j)$ (with $\sum_{j=1}^n \pi_j = 1$).
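Lemma 2 is easy to verify numerically (an illustrative sketch reusing `kl` and `jeffreys` from above; the random data is mine):

    rng = np.random.default_rng(0)
    x = rng.uniform(0.1, 1.0, size=5)                  # positive histogram (unnormalized)
    h = rng.uniform(0.1, 1.0, size=5); h /= h.sum()    # frequency histogram
    w_x, x_t = x.sum(), x / x.sum()
    lhs = jeffreys(x, h)
    rhs = jeffreys(x_t, h) + (w_x - 1.0) * (kl(x_t, h) + np.log(w_x))
    print(lhs - rhs)                                   # ~0: the identity of Lemma 2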

  13. Proof of Lemma 2
Write $x^i = w_x \tilde{x}^i$. Then
$J(x, \tilde{h}) = \sum_{i=1}^d (w_x \tilde{x}^i - \tilde{h}^i) \log \frac{w_x \tilde{x}^i}{\tilde{h}^i}$
$= \sum_{i=1}^d \left( w_x \tilde{x}^i \log w_x + w_x \tilde{x}^i \log \frac{\tilde{x}^i}{\tilde{h}^i} - \tilde{h}^i \log w_x - \tilde{h}^i \log \frac{\tilde{x}^i}{\tilde{h}^i} \right)$
$= (w_x - 1) \log w_x + J(\tilde{x}, \tilde{h}) + (w_x - 1) \sum_{i=1}^d \tilde{x}^i \log \frac{\tilde{x}^i}{\tilde{h}^i}$
$= J(\tilde{x}, \tilde{h}) + (w_x - 1)(KL(\tilde{x} : \tilde{h}) + \log w_x)$,
since $\sum_{i=1}^d \tilde{h}^i = \sum_{i=1}^d \tilde{x}^i = 1$.

  14. Guaranteed approximation of $\tilde{c}$
Theorem 2. Let $\tilde{c}$ denote the Jeffreys frequency centroid and $\tilde{c}' = \frac{c}{w_c}$ the normalized Jeffreys positive centroid. Then the approximation factor $\alpha_{\tilde{c}'} = \frac{J(\tilde{c}', \tilde{\mathcal{H}})}{J(\tilde{c}, \tilde{\mathcal{H}})}$ satisfies $1 \leq \alpha_{\tilde{c}'} \leq \frac{1}{w_c}$ (with $w_c \leq 1$).

  15. Proof of Theorem 2
$J(c, \tilde{\mathcal{H}}) \leq J(\tilde{c}, \tilde{\mathcal{H}}) \leq J(\tilde{c}', \tilde{\mathcal{H}})$
From Lemma 2,
$J(\tilde{c}', \tilde{\mathcal{H}}) = J(c, \tilde{\mathcal{H}}) + (1 - w_c)(KL(\tilde{c}' : \tilde{\mathcal{H}}) + \log w_c)$,
and since $J(c, \tilde{\mathcal{H}}) \leq J(\tilde{c}, \tilde{\mathcal{H}})$:
$1 \leq \alpha_{\tilde{c}'} \leq 1 + \frac{(1 - w_c)(KL(\tilde{c}' : \tilde{\mathcal{H}}) + \log w_c)}{J(\tilde{c}, \tilde{\mathcal{H}})}$
Since $KL(\tilde{c}' : \tilde{\mathcal{H}}) = \frac{1}{w_c} KL(c : \tilde{\mathcal{H}}) - \log w_c$:
$\alpha_{\tilde{c}'} \leq 1 + \frac{1 - w_c}{w_c} \cdot \frac{KL(c : \tilde{\mathcal{H}})}{J(\tilde{c}, \tilde{\mathcal{H}})}$
Since $J(\tilde{c}, \tilde{\mathcal{H}}) \geq J(c, \tilde{\mathcal{H}})$ and $KL(c : \tilde{\mathcal{H}}) \leq J(c, \tilde{\mathcal{H}})$, we get
$\alpha_{\tilde{c}'} \leq \frac{1}{w_c}$.
When $w_c = 1$ the bound is tight.

  16. In practice...
$c$ is available in closed form → compute $w_c$, $KL(c, \tilde{\mathcal{H}})$, $J(c, \tilde{\mathcal{H}})$.
Bound the approximation factor $\alpha_{\tilde{c}'}$ as:
$\alpha_{\tilde{c}'} \leq 1 + \left( \frac{1}{w_c} - 1 \right) \frac{KL(c, \tilde{\mathcal{H}})}{J(c, \tilde{\mathcal{H}})} \leq \frac{1}{w_c}$
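A sketch of this data-dependent bound, reusing the helpers above (here $KL(c, \tilde{\mathcal{H}})$ is taken as $\sum_j \pi_j \sum_i c^i \log \frac{c^i}{\tilde{h}^i_j}$, as in the proof of Theorem 2; the function name is mine):

    def approx_factor_bound(freq_hists, weights):
        # Everything needed is available in closed form from the positive centroid c.
        c = jeffreys_positive_centroid(freq_hists, weights)
        w_c = c.sum()
        kl_cH = sum(w * kl(c, h) for h, w in zip(freq_hists, weights))
        j_cH = sum(w * jeffreys(c, h) for h, w in zip(freq_hists, weights))
        return 1.0 + (1.0 / w_c - 1.0) * kl_cH / j_cH   # never exceeds 1 / w_c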

  17. Fine approximation
From [16, 14], the minimization of the Jeffreys frequency centroid is equivalent to:
$\tilde{c} = \arg\min_{\tilde{x} \in \Delta_d} KL(\tilde{a} : \tilde{x}) + KL(\tilde{x} : \tilde{g})$
Lagrangian function enforcing $\sum_i \tilde{c}^i = 1$:
$\log \frac{\tilde{c}^i}{\tilde{g}^i} + 1 - \frac{\tilde{a}^i}{\tilde{c}^i} + \lambda = 0$
$\tilde{c}^i = \frac{\tilde{a}^i}{W(\frac{\tilde{a}^i}{\tilde{g}^i} e^{\lambda + 1})}$
$\lambda = -KL(\tilde{c} : \tilde{g}) \leq 0$
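A sketch of this λ-parameterized solution (assumptions: $\tilde{a}$ is the arithmetic mean of the frequency histograms and $\tilde{g}$ the normalized geometric mean, following the tilde notation of slide 2; the function name is mine):

    import numpy as np
    from scipy.special import lambertw

    def c_of_lambda(a_t, g_t, lam):
        # Per-coordinate Lagrangian solution: c_i(lambda) = a_i / W((a_i / g_i) * e^(lambda + 1))
        return a_t / lambertw((a_t / g_t) * np.exp(lam + 1.0)).real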

  18. Fine approximation: Bisection search
$\tilde{c}^i \leq 1 \;\Rightarrow\; \tilde{c}^i = \frac{\tilde{a}^i}{W(\frac{\tilde{a}^i}{\tilde{g}^i} e^{\lambda + 1})} \leq 1$
$\Rightarrow\; \lambda \geq \log(\tilde{g}^i e^{\tilde{a}^i}) - 1 \;\; \forall i$, hence $\lambda \in [\max_i \log(\tilde{g}^i e^{\tilde{a}^i}) - 1, \; 0]$
$s(\lambda) = \sum_{i=1}^d \tilde{c}^i(\lambda) = \sum_{i=1}^d \frac{\tilde{a}^i}{W(\frac{\tilde{a}^i}{\tilde{g}^i} e^{\lambda + 1})}$
The function $s$ is monotonically decreasing with $s(0) \leq 1$.
→ Bisection search for $s(\lambda^*) \simeq 1$ up to arbitrary precision.
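A sketch of the bisection search under the same assumptions as the previous snippet (normalized means $\tilde{a}$ and $\tilde{g}$; the tolerance and the function name are mine):

    def jeffreys_frequency_centroid(freq_hists, weights, tol=1e-12):
        freq_hists = np.asarray(freq_hists, dtype=float)
        a_t = np.average(freq_hists, axis=0, weights=weights)                # arithmetic mean
        g = np.exp(np.average(np.log(freq_hists), axis=0, weights=weights))
        g_t = g / g.sum()                                                    # normalized geometric mean
        lo = np.max(a_t + np.log(g_t)) - 1.0   # from the constraint c_i(lambda) <= 1
        hi = 0.0                               # s(0) <= 1
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if c_of_lambda(a_t, g_t, mid).sum() > 1.0:
                lo = mid                       # s is decreasing: a larger lambda is needed
            else:
                hi = mid
        return c_of_lambda(a_t, g_t, 0.5 * (lo + hi))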

  19. Experiments: Caltech-256
Caltech-256 [7]: 30607 images labeled into 256 categories (256 Jeffreys centroids).
Arbitrary floating-point precision: http://www.apfloat.org/
Veldhuis' approximation: $\tilde{c}'' = \frac{\tilde{a} + \tilde{g}}{2}$

      $\alpha_c$ (optimal positive)   $\alpha_{\tilde{c}'}$ (normalized approx.)   $w_c \leq 1$ (normalizing coeff.)   $\alpha_{\tilde{c}''}$ (Veldhuis' approx.)
avg   0.9648680345638155              1.0002205080964255                           0.9338228644308926                  1.065590178484613
min   0.906414219584823               1.0000005079528809                           0.8342819488534723                  1.0027707382095195
max   0.9956399220678585              1.0000031489541772                           0.9931975105809021                  1.3582296675397754

  20. Experiments: Synthetic data-sets
Random binary histograms.
$\alpha = \frac{J(\tilde{c}')}{J(\tilde{c})} \geq 1$
Performance: $\bar{\alpha} \sim 1.0000009$, $\alpha_{\max} \sim 1.00181506$, $\alpha_{\min} = 1.000000$.
Can we express a better worst-case upper bound on the performance?
