
1. On Clustering Histograms with k-Means by Using Mixed α-Divergences
Entropy 16(6): 3273-3301 (2014)
Frank Nielsen 1,2, Richard Nock 3, Shun-ichi Amari 4
1 Sony Computer Science Laboratories, Japan. E-Mail: Frank.Nielsen@acm.org
2 École Polytechnique, France
3 NICTA/ANU, Australia
4 RIKEN Brain Science Institute, Japan

2. Clustering histograms
◮ Information Retrieval systems (IRs) are based on the bag-of-words paradigm (bag-of-textons, bag-of-features, bag-of-X).
◮ The role of distances:
  ◮ Initially, create a dictionary of "words" by quantizing with k-means clustering (the result depends on the underlying distance).
  ◮ At query time, find the "closest" (histogram) document by querying with the query histogram.
◮ Notation: positive arrays h (counting histograms) versus frequency histograms h̃ (normalized counts), with d bins.
For IRs, one prefers symmetric distances (not necessarily metrics) such as the Jeffreys divergence or the Jensen-Shannon divergence (unified by a one-parameter family of divergences in [11]).

3. Ali-Silvey-Csiszár f-divergences
An important class of divergences: the f-divergences [10, 1, 7], defined for a convex generator f (with f(1) = f'(1) = 0 and f''(1) = 1):

$$ I_f(p:q) := \sum_{i=1}^d q_i \, f\!\left(\frac{p_i}{q_i}\right). $$

These divergences preserve information monotonicity [3] under arbitrary transition probabilities (Markov morphisms).
f-divergences can be extended to positive arrays [3].
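As a concrete illustration, here is a minimal NumPy sketch (not from the paper; `f_divergence` and `kl_gen` are illustrative names) of a generic f-divergence, instantiated with the generator f(t) = t log t − t + 1, which yields the extended Kullback–Leibler divergence:

```python
import numpy as np

# Generic Csiszár f-divergence I_f(p:q) = sum_i q_i * f(p_i / q_i) on positive arrays.
def f_divergence(p, q, f):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Generator f(t) = t*log(t) - t + 1 satisfies f(1) = f'(1) = 0, f''(1) = 1 and
# recovers the extended KL divergence: sum_i p_i*log(p_i/q_i) + q_i - p_i.
kl_gen = lambda t: t * np.log(t) - t + 1.0

p = np.array([3.0, 1.0, 2.0])   # positive arrays, not necessarily normalized
q = np.array([2.0, 2.0, 2.0])
print(f_divergence(p, q, kl_gen))
```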

4. Mixed divergences
Defined on three parameters:

$$ M_\lambda(p:q:r) := \lambda D(p:q) + (1-\lambda)\, D(q:r), \qquad \lambda \in [0,1]. $$

Mixed divergences include:
◮ the sided divergences for λ ∈ {0, 1},
◮ the symmetrized (arithmetic mean) divergence for λ = 1/2.

5. Mixed divergence-based k-means clustering
Pick k distinct seeds from the dataset, with l_i = r_i.
Input: weighted histogram set H, divergence D(·,·), integer k > 0, real λ ∈ [0,1].
Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^k.
repeat
  // Assignment
  for i = 1, 2, ..., k do
    C_i ← { h ∈ H : i = arg min_j M_λ(l_j : h : r_j) }
  // Dual-sided centroid relocation
  for i = 1, 2, ..., k do
    r_i ← arg min_x D(C_i : x) = arg min_x Σ_{h ∈ C_i} w_h D(h : x)
    l_i ← arg min_x D(x : C_i) = arg min_x Σ_{h ∈ C_i} w_h D(x : h)
until convergence
Output: partition of H into k clusters following C.
→ Different from the k-means clustering with respect to the symmetrized divergence.
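The loop can be sketched in Python for a concrete choice of D. The sketch below (illustrative names, not the authors' code) uses the extended KL divergence, for which the sided centroids have closed forms consistent with the α = −1 case of the sided α-centroids given later in the deck: the right-sided centroid is the weighted arithmetic mean and the left-sided one is the weighted geometric mean. Convergence testing is omitted in favor of a fixed iteration count.

```python
import numpy as np

def ekl(p, q):
    """Extended KL divergence on positive arrays."""
    return np.sum(p * np.log(p / q) + q - p)

def mixed_div(l, h, r, lam):
    """M_lambda(l : h : r) = lam*D(l:h) + (1-lam)*D(h:r) with D = extended KL."""
    return lam * ekl(l, h) + (1.0 - lam) * ekl(h, r)

def mixed_kl_kmeans(H, w, k, lam, n_iter=50, rng=None):
    rng = np.random.default_rng(rng)
    seeds = rng.choice(len(H), size=k, replace=False)
    L, R = H[seeds].copy(), H[seeds].copy()          # left/right seeds l_i = r_i
    for _ in range(n_iter):
        # Assignment: h goes to the cluster minimizing M_lambda(l_j : h : r_j)
        costs = np.array([[mixed_div(L[j], h, R[j], lam) for j in range(k)]
                          for h in H])
        labels = costs.argmin(axis=1)
        # Dual-sided centroid relocation (closed forms for extended KL)
        for j in range(k):
            mask = labels == j
            if not mask.any():
                continue
            wj = w[mask] / w[mask].sum()
            R[j] = np.sum(wj[:, None] * H[mask], axis=0)                   # arithmetic mean
            L[j] = np.exp(np.sum(wj[:, None] * np.log(H[mask]), axis=0))   # geometric mean
    return labels, L, R

# Toy usage with random positive histograms
H = np.random.default_rng(0).random((20, 5)) + 0.1
w = np.ones(len(H))
labels, L, R = mixed_kl_kmeans(H, w, k=3, lam=0.5, rng=0)
print(labels)
```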

6. α-divergences
For α ∈ ℝ, α ≠ ±1, define the α-divergences [6] on positive arrays [18]:

$$ D_\alpha(p:q) := \frac{4}{1-\alpha^2} \sum_{i=1}^d \left( \frac{1-\alpha}{2}\, p_i + \frac{1+\alpha}{2}\, q_i - p_i^{\frac{1-\alpha}{2}} q_i^{\frac{1+\alpha}{2}} \right), $$

with D_α(p:q) = D_{−α}(q:p), and in the limit cases D_{−1}(p:q) = KL(p:q) and D_1(p:q) = KL(q:p), where KL is the extended Kullback–Leibler divergence:

$$ KL(p:q) := \sum_{i=1}^d \left( p_i \log\frac{p_i}{q_i} + q_i - p_i \right). $$
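A minimal sketch of D_α on positive arrays, with the extended KL limits at α = ±1 (function names are illustrative, not from the paper):

```python
import numpy as np

def ekl(p, q):
    return float(np.sum(p * np.log(p / q) + q - p))

def alpha_div(p, q, alpha):
    """alpha-divergence on positive arrays, including the limit cases alpha = +/-1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, -1.0):
        return ekl(p, q)              # D_{-1}(p:q) = KL(p:q)
    if np.isclose(alpha, 1.0):
        return ekl(q, p)              # D_{1}(p:q) = KL(q:p)
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return float(4.0 / (1.0 - alpha**2) * np.sum(a * p + b * q - p**a * q**b))

p, q = np.array([3.0, 1.0, 2.0]), np.array([2.0, 2.0, 2.0])
print(alpha_div(p, q, 0.5), alpha_div(q, p, -0.5))   # reference duality D_alpha(p:q) = D_{-alpha}(q:p)
print(alpha_div(p, q, 0.999), ekl(q, p))             # approaches KL(q:p) as alpha -> 1
```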

7. α-divergences belong to f-divergences
The α-divergences belong to the class of Csiszár f-divergences with the following generator:

$$ f(t) = \begin{cases} \frac{4}{1-\alpha^2}\left(1 - t^{\frac{1+\alpha}{2}}\right) & \text{if } \alpha \neq \pm 1,\\ t \ln t & \text{if } \alpha = 1,\\ -\ln t & \text{if } \alpha = -1. \end{cases} $$

The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3:

$$ D_3(\tilde p : \tilde q) = \frac{1}{2} \sum_i \frac{(\tilde q_i - \tilde p_i)^2}{\tilde p_i}, \qquad D_{-3}(\tilde p : \tilde q) = \frac{1}{2} \sum_i \frac{(\tilde q_i - \tilde p_i)^2}{\tilde q_i}. $$
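A quick numeric sanity check of the α = ±3 cases against the halved χ²-type distances (illustrative code, not from the paper):

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()       # frequency histograms
q = rng.random(5); q /= q.sum()

chi2_over_p = 0.5 * np.sum((q - p)**2 / p)   # (1/2) * sum (q_i - p_i)^2 / p_i
chi2_over_q = 0.5 * np.sum((q - p)**2 / q)   # (1/2) * sum (q_i - p_i)^2 / q_i
print(np.isclose(alpha_div(p, q, 3), chi2_over_p))    # D_3
print(np.isclose(alpha_div(p, q, -3), chi2_over_q))   # D_{-3}
```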

8. Squared Hellinger symmetric distance is the α = 0 divergence
The divergence D_0 is the squared Hellinger symmetric distance (scaled by 4), extended to positive arrays:

$$ D_0(p:q) = 2 \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx = 4\, H^2(p, q), $$

with the Hellinger distance

$$ H(p, q) = \sqrt{ \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx }. $$
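One can check numerically that D_0 equals four times the squared Hellinger distance on positive arrays (illustrative sketch):

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2))

p = np.array([3.0, 1.0, 2.0])
q = np.array([2.0, 2.0, 2.0])
print(np.isclose(alpha_div(p, q, 0.0), 4 * hellinger(p, q)**2))   # True
```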

9. Mixed α-divergences
◮ Mixed α-divergence of a histogram x to two histograms p and q:

$$ M_{\lambda,\alpha}(p:x:q) = \lambda D_\alpha(p:x) + (1-\lambda) D_\alpha(x:q) = \lambda D_{-\alpha}(x:p) + (1-\lambda) D_{-\alpha}(q:x) = M_{1-\lambda,-\alpha}(q:x:p). $$

◮ The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

$$ S_\alpha(p, q) = M_{\frac12,\alpha}(q:p:q) = M_{\frac12,\alpha}(p:q:p). $$

◮ The skew symmetrized α-divergence is defined by:

$$ S_{\lambda,\alpha}(p:q) = \lambda D_\alpha(p:q) + (1-\lambda) D_\alpha(q:p). $$
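A small sketch of the mixed α-divergence, checking the reference-duality identity M_{λ,α}(p:x:q) = M_{1−λ,−α}(q:x:p) and evaluating the α-Jeffreys case (names are illustrative):

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def mixed_alpha_div(p, x, q, lam, alpha):
    """M_{lambda,alpha}(p : x : q)."""
    return lam * alpha_div(p, x, alpha) + (1 - lam) * alpha_div(x, q, alpha)

p = np.array([3.0, 1.0, 2.0]); x = np.array([1.0, 2.0, 3.0]); q = np.array([2.0, 2.0, 2.0])
lam, alpha = 0.3, 0.5
# Reference duality: M_{lambda,alpha}(p:x:q) = M_{1-lambda,-alpha}(q:x:p)
print(np.isclose(mixed_alpha_div(p, x, q, lam, alpha),
                 mixed_alpha_div(q, x, p, 1 - lam, -alpha)))
# alpha-Jeffreys symmetrized divergence S_alpha(p,q) = M_{1/2,alpha}(p:q:p)
print(mixed_alpha_div(p, q, p, 0.5, alpha))
```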

10. Coupled k-Means++ α-Seeding
Algorithm 1: Mixed α-seeding, MAS(H, k, λ, α)
Input: weighted histogram set H, integer k ≥ 1, real λ ∈ [0,1], real α ∈ ℝ.
Let C ← {(h_j, h_j)} for a histogram h_j ∈ H chosen with uniform probability.
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability

$$ \pi_H(h) := \frac{w_h\, M_{\lambda,\alpha}(c_h : h : c_h)}{\sum_{y \in H} w_y\, M_{\lambda,\alpha}(c_y : y : c_y)}, \qquad (1) $$

  // where (c_h, c_h) := arg min_{(z,z) ∈ C} M_{λ,α}(z : h : z)
  C ← C ∪ {(h, h)}
Output: set of initial cluster centers C.
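A hedged Python sketch of this seeding procedure (the function `mixed_alpha_seeding` and all variable names are illustrative, not the authors' code); since each new seed h is added as the coupled pair (h, h), a single array of centers is kept:

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def mixed_alpha_div(p, x, q, lam, alpha):
    return lam * alpha_div(p, x, alpha) + (1 - lam) * alpha_div(x, q, alpha)

def mixed_alpha_seeding(H, w, k, lam, alpha, rng=None):
    rng = np.random.default_rng(rng)
    centers = [H[rng.integers(len(H))]]            # first seed picked uniformly
    for _ in range(1, k):
        # cost of each histogram to its closest current (coupled) center
        cost = np.array([min(mixed_alpha_div(c, h, c, lam, alpha) for c in centers)
                         for h in H])
        prob = w * cost
        prob = prob / prob.sum()                   # picking distribution (1)
        centers.append(H[rng.choice(len(H), p=prob)])
    return np.array(centers)                       # each center serves as (l_i, r_i)

H = np.random.default_rng(0).random((50, 8)) + 0.05
w = np.ones(len(H))
seeds = mixed_alpha_seeding(H, w, k=4, lam=0.5, alpha=0.5, rng=1)
print(seeds.shape)
```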

11. A guaranteed probabilistic initialization
Let C_{λ,α} denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys, or mixed) in MAS, and let C^{opt}_{λ,α} denote the cost of the optimal related clustering in k clusters, for λ ∈ [0,1] and α ∈ (−1,1). Then, on average with respect to distribution (1), the initial clustering of MAS satisfies:

$$ E_\pi[C_{\lambda,\alpha}] \le 4 \begin{cases} f(\lambda)\, g(k)\, h^2(\alpha)\, C^{opt}_{\lambda,\alpha} & \text{if } \lambda \in (0,1),\\ g(k)\, z(\alpha)\, h^4(\alpha)\, C^{opt}_{\lambda,\alpha} & \text{otherwise.} \end{cases} $$

Here, f(λ) = max{λ/(1−λ), (1−λ)/λ}, g(k) = 2(2 + log k), z(α) = 8|α|²/(1−|α|)², and h(α) = max_i p_i^{|α|/(1+|α|)} / min_i p_i^{|α|/(1−|α|)}; the min is defined over strictly positive coordinates, and π denotes the picking distribution (1).

12. Mixed α-hard clustering: MAhC(H, k, λ, α)
Input: weighted histogram set H, integer k > 0, real λ ∈ [0,1], real α ∈ ℝ.
Let C = {(l_i, r_i)}_{i=1}^k ← MAS(H, k, λ, α).
repeat
  // Assignment
  for i = 1, 2, ..., k do
    A_i ← { h ∈ H : i = arg min_j M_{λ,α}(l_j : h : r_j) }
  // Centroid relocation (coordinate-wise)
  for i = 1, 2, ..., k do
    r_i ← ( Σ_{h ∈ A_i} w_h h^{(1−α)/2} )^{2/(1−α)}
    l_i ← ( Σ_{h ∈ A_i} w_h h^{(1+α)/2} )^{2/(1+α)}
until convergence
Output: partition of H into k clusters following C.
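The closed-form centroid relocation can be sketched as coordinate-wise weighted α-means (illustrative code; cluster weights are assumed normalized to sum to one, and α ≠ ±1):

```python
import numpy as np

def right_alpha_centroid(Hc, w, alpha):
    """r = ( sum_h w_h * h^((1-alpha)/2) )^(2/(1-alpha)), coordinate-wise."""
    w = w / w.sum()                                    # assumption: normalize cluster weights
    e = (1 - alpha) / 2
    return (np.sum(w[:, None] * Hc**e, axis=0))**(1 / e)

def left_alpha_centroid(Hc, w, alpha):
    return right_alpha_centroid(Hc, w, -alpha)         # l_alpha = r_{-alpha}

Hc = np.array([[3.0, 1.0, 2.0],
               [2.0, 2.0, 2.0],
               [1.0, 4.0, 1.0]])                       # histograms of one cluster
w = np.ones(3)
print(right_alpha_centroid(Hc, w, 0.5))
print(left_alpha_centroid(Hc, w, 0.5))
```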

13. Sided Positive α-Centroids [14]
The left-sided l_α and right-sided r_α positive weighted α-centroid coordinates of a set of n positive histograms h_1, ..., h_n are weighted α-means:

$$ r_\alpha^i = f_\alpha^{-1}\!\left( \sum_{j=1}^n w_j\, f_\alpha(h_j^i) \right), \qquad l_\alpha^i = r_{-\alpha}^i, $$

with

$$ f_\alpha(x) = \begin{cases} x^{\frac{1-\alpha}{2}} & \alpha \neq \pm 1,\\ \log x & \alpha = 1. \end{cases} $$

14. Sided Frequency α-Centroids [2]
Theorem (Amari, 2007). The coordinates of the sided frequency α-centroids of a set of n weighted frequency histograms are the normalised weighted α-means.

15. Positive and Frequency α-Centroids
Summary:

◮ Right-sided positive centroid:

$$ r_\alpha^i = \begin{cases} \left( \sum_{j=1}^n w_j (h_j^i)^{\frac{1-\alpha}{2}} \right)^{\frac{2}{1-\alpha}} & \alpha \neq 1,\\ \prod_{j=1}^n (h_j^i)^{w_j} & \alpha = 1. \end{cases} $$

◮ Left-sided positive centroid:

$$ l_\alpha^i = r_{-\alpha}^i = \begin{cases} \left( \sum_{j=1}^n w_j (h_j^i)^{\frac{1+\alpha}{2}} \right)^{\frac{2}{1+\alpha}} & \alpha \neq -1,\\ \prod_{j=1}^n (h_j^i)^{w_j} & \alpha = -1. \end{cases} $$

◮ Frequency centroids are the normalized positive centroids: r̃_α^i = r_α^i / w(r_α) and l̃_α^i = r̃_{−α}^i = r_{−α}^i / w(r_{−α}), where w(·) denotes the cumulated sum of the bin values.

16. Mixed α-Centroids
The two centroids are the minimizers of:

$$ \sum_j w_j\, M_{\lambda,\alpha}(l : h_j : r). $$

Generalizing mixed Bregman divergences [16]:

Theorem. The two mixed α-centroids are the left-sided and right-sided α-centroids.

17. Symmetrized Jeffreys-Type α-Centroids

$$ S_\alpha(p, q) = \frac{1}{2}\left( D_\alpha(p:q) + D_\alpha(q:p) \right) = S_{-\alpha}(p, q) = M_{\frac12,\alpha}(p:q:p). $$

For α = ±1, we get half of the Jeffreys divergence:

$$ S_{\pm 1}(p, q) = \frac{1}{2} \sum_{i=1}^d (p_i - q_i) \log\frac{p_i}{q_i}. $$

18. Jeffreys α-divergence and Heinz means
When p and q are frequency histograms, we have for α ≠ ±1:

$$ J_\alpha(\tilde p : \tilde q) = \frac{8}{1-\alpha^2} \left( 1 - \sum_{i=1}^d H_{\frac{1-\alpha}{2}}(\tilde p_i, \tilde q_i) \right), $$

where H_β(a, b) is a symmetric Heinz mean [8, 5]:

$$ H_\beta(a, b) = \frac{a^\beta b^{1-\beta} + a^{1-\beta} b^\beta}{2}. $$

Heinz means interpolate between the geometric and arithmetic means and satisfy the inequality:

$$ \sqrt{ab} = H_{\frac12}(a, b) \le H_\beta(a, b) \le H_0(a, b) = \frac{a+b}{2}. $$
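This Heinz-mean identity is easy to verify numerically for the symmetrized α-divergence D_α(p:q) + D_α(q:p) on frequency histograms (illustrative sketch, using the α-divergence definition of slide 6):

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def heinz(a, b, beta):
    return (a**beta * b**(1 - beta) + a**(1 - beta) * b**beta) / 2

rng = np.random.default_rng(2)
p = rng.random(6); p /= p.sum()       # frequency histograms
q = rng.random(6); q /= q.sum()
alpha = 0.3

lhs = alpha_div(p, q, alpha) + alpha_div(q, p, alpha)
rhs = 8 / (1 - alpha**2) * (1 - np.sum(heinz(p, q, (1 - alpha) / 2)))
print(np.isclose(lhs, rhs))   # True
```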

19. Jeffreys divergence in the limit case
For α → ±1, S_α(p, q) tends to half of the Jeffreys divergence J(p, q), where

$$ J(p, q) = KL(p, q) + KL(q, p) = \sum_{i=1}^d (p_i - q_i)(\log p_i - \log q_i). $$

The Jeffreys divergence writes mathematically the same for frequency histograms:

$$ J(\tilde p, \tilde q) = KL(\tilde p, \tilde q) + KL(\tilde q, \tilde p) = \sum_{i=1}^d (\tilde p_i - \tilde q_i)(\log \tilde p_i - \log \tilde q_i). $$
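A short numeric check of the limit (illustrative): the symmetrized α-divergence approaches half of the Jeffreys divergence as α → 1.

```python
import numpy as np

def alpha_div(p, q, alpha):
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

def jeffreys(p, q):
    return np.sum((p - q) * (np.log(p) - np.log(q)))

p = np.array([3.0, 1.0, 2.0])
q = np.array([2.0, 2.0, 2.0])
for alpha in (0.9, 0.99, 0.999):
    s = 0.5 * (alpha_div(p, q, alpha) + alpha_div(q, p, alpha))
    print(alpha, s, 0.5 * jeffreys(p, q))   # s converges to J(p,q)/2
```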
