

1. Optimal interval clustering: Application to Bregman clustering and statistical mixture learning
Frank Nielsen (1), Richard Nock (2)
(1) Sony Computer Science Laboratories / Ecole Polytechnique
(2) UAG-CEREGMIA / NICTA
May 2014

2. Hard clustering: Partitioning the data set
◮ Partition $X = \{x_1, \ldots, x_n\} \subset \mathcal{X}$ into $k$ clusters $C_1, \ldots, C_k \subset X$: $X = \bigcup_{i=1}^{k} C_i$
◮ Center-based (prototype) hard clustering: $k$-means [2], $k$-medians, $k$-center, $\ell_r$-center [10], etc.
◮ Model-based hard clustering: statistical mixtures maximizing the complete likelihood (prototype = model parameter).
◮ $k$-means: NP-hard when $d > 1$ and $k > 1$ [11, 7, 1].
◮ $k$-medians and $k$-center: NP-hard [12] (1984).
◮ In 1D, $k$-means is polynomial [3, 15]: $O(n^2 k)$.

3. Euclidean 1D k-means
◮ 1D $k$-means [8] has contiguous partitions.
◮ Solved by enumerating all $\binom{n-1}{k-1}$ contiguous partitions in 1D (1958); better than the Stirling numbers of the second kind $S(n, k)$, which count all partitions.
◮ Polynomial $O(n^2 k)$ time using Dynamic Programming (DP) [3] (sketched in 1973 in two pages).
◮ R package Ckmeans.1d.dp [15] (2011).
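
To make the count concrete, here is a minimal Python sketch (function names are ours) that enumerates contiguous partitions by choosing the $k-1$ cut positions and checks the $\binom{n-1}{k-1}$ count:

from itertools import combinations
from math import comb

def contiguous_partitions(points, k):
    """Yield every partition of a sorted sequence into k contiguous blocks,
    one per choice of k-1 cut positions."""
    n = len(points)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [points[bounds[b]:bounds[b + 1]] for b in range(k)]

n, k = 6, 3
print(comb(n - 1, k - 1))                                  # 10
print(sum(1 for _ in contiguous_partitions(range(n), k)))  # 10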

4. Interval clustering: Structure
Sort $X \subset \mathcal{X}$ with respect to the total order $<$ on $\mathcal{X}$ in $O(n \log n)$ time. The output is represented by:
◮ $k$ intervals $I_i = [x_{l_i}, x_{r_i}]$ such that $C_i = I_i \cap X$,
◮ or better, $k - 1$ delimiters $l_i$ ($i \in \{2, \ldots, k\}$), since $r_i = l_{i+1} - 1$ (for $i < k$, with $r_k = n$) and $l_1 = 1$:
$\underbrace{[x_1 \ldots x_{l_2 - 1}]}_{C_1} \ \underbrace{[x_{l_2} \ldots x_{l_3 - 1}]}_{C_2} \ \ldots \ \underbrace{[x_{l_k} \ldots x_n]}_{C_k}$
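
A small sketch (ours) of the delimiter representation, recovering the $(l_i, r_i)$ index pairs from the $k-1$ delimiters:

def intervals_from_delimiters(n, delims):
    """Recover the k (left, right) 1-based index pairs from the k-1
    delimiters l_2, ..., l_k of a contiguous k-clustering of n points."""
    lefts = [1] + list(delims)
    rights = [l - 1 for l in delims] + [n]
    return list(zip(lefts, rights))

print(intervals_from_delimiters(8, [3, 6]))   # [(1, 2), (3, 5), (6, 8)]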

5. Objective function for interval clustering
Scalars $x_1 < \ldots < x_n$ are partitioned contiguously into $k$ clusters $C_1 < \ldots < C_k$. Clustering objective function:
$\min e_k(X) = \bigoplus_{j=1}^{k} e_1(C_j)$
◮ $e_1(\cdot)$: intra-cluster cost/energy
◮ $\oplus$: inter-cluster cost/energy operator (commutative, associative)
Note that the optimum need not be unique: $n = kp + 1$ equally spaced 1D points admit $k$ different optimal clustering partitions.

6. Examples of objective functions
In arbitrary dimension $\mathcal{X} = \mathbb{R}^d$:
◮ $\ell_r$-clustering ($r \geq 1$): $\oplus = \sum$ and
$e_1(C_j) = \min_{p \in \mathcal{X}} \sum_{x \in C_j} d(x, p)^r$
(the argmin prototype $p_j$ is the same whether or not we take the power $\frac{1}{r}$ of the sum). Euclidean $\ell_r$-clustering: $r = 1$ is $k$-medians, $r = 2$ is $k$-means.
◮ $k$-center ($\lim_{r \to \infty}$): $\oplus = \max$ and
$e_1(C_j) = \min_{p \in \mathcal{X}} \max_{x \in C_j} d(x, p)$
◮ Discrete clustering: the search space of the min is $C_j$ instead of $\mathcal{X}$.
Note that in 1D the $\ell_s$-norm distance is always $d(p, q) = |p - q|$, independent of $s \geq 1$.

7. Optimal interval clustering by Dynamic Programming
Notation: $X_{j,i} = \{x_j, \ldots, x_i\}$ ($j \leq i$), $X_i = X_{1,i} = \{x_1, \ldots, x_i\}$.
Let $E = [e_{i,m}]$ be the $n \times k$ cost matrix ($O(nk)$ memory), with $e_{i,m} = e_m(X_i)$. Optimality equation:
$e_{i,m} = \min_{m \leq j \leq i} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}$
for an associative/commutative operator $\oplus$ ($+$ or $\max$). Initialize the first column with $e_{i,1} = e_1(X_i)$, then fill $E$ column by column, from left to right. The optimal clustering cost is at $e_{n,k}$.
Time: $n \times k \times O(n) \times T_1(n) = O(n^2 k \, T_1(n))$, where $T_1(n)$ is the cost of one $e_1$ evaluation; $O(nk)$ memory.
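
A minimal Python sketch of this DP for the $k$-means cost ($\oplus = +$, $e_1$ = within-cluster sum of squared errors, computed naively here; all names are ours). The auxiliary matrix S stores the argmin for the backtracking step of the next slide:

import math

def interval_kmeans_dp(xs, k):
    """O(n^2 k) DP over sorted data (naive O(n) e_1 evaluation).
    Returns (optimal cost, cost table E, argmin table S)."""
    xs = sorted(xs)
    n = len(xs)

    def e1(j, i):
        # within-cluster squared error of x_j..x_i (0-based, inclusive)
        pts = xs[j:i + 1]
        mu = sum(pts) / len(pts)
        return sum((x - mu) ** 2 for x in pts)

    E = [[math.inf] * (k + 1) for _ in range(n)]   # E[i][m] = e_m(X_{1..i+1})
    S = [[0] * (k + 1) for _ in range(n)]          # S[i][m] = best left delimiter
    for i in range(n):
        E[i][1], S[i][1] = e1(0, i), 0
    for m in range(2, k + 1):
        for i in range(m - 1, n):                  # need at least m points
            for j in range(m - 1, i + 1):          # left index of cluster m
                c = E[j - 1][m - 1] + e1(j, i)
                if c < E[i][m]:
                    E[i][m], S[i][m] = c, j
    return E[n - 1][k], E, S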

8. Retrieving the solution: Backtracking
Use an auxiliary $n \times k$ matrix $S = [s_{i,j}]$ for storing the argmin. Backtrack in $O(k)$ time:
◮ The left index $l_k$ of $C_k$ is stored at $s_{n,k}$: $l_k = s_{n,k}$.
◮ Iteratively retrieve the previous left interval indexes: $l_{j-1} = s_{l_j - 1, j-1}$ for $j = k, \ldots, 2$.
Note that $l_j - 1 = n - \sum_{l=j}^{k} n_l = \sum_{l=1}^{j-1} n_l$, where $n_l = |C_l|$.
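
Continuing the sketch above, the backtracking step reads the delimiters off S:

def backtrack(S, n, k):
    """Recover the k left delimiters (0-based) in O(k) from the argmin table."""
    lefts = [0] * k
    i = n - 1
    for m in range(k, 0, -1):
        lefts[m - 1] = S[i][m]
        i = S[i][m] - 1
    return lefts

cost, E, S = interval_kmeans_dp([1, 2, 9, 10, 11, 20], 3)
print(cost, backtrack(S, 6, 3))   # 2.5 [0, 2, 5]: clusters {1,2} {9,10,11} {20}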

9. Optimizing time with a Look-Up Table (LUT)
Save time when computing $e_1(X_{j,i})$, since we perform $n \times k \times O(n)$ such computations. Look-Up Table (LUT): add an extra $n \times n$ matrix $E_1$ with $E_1[j][i] = e_1(X_{j,i})$, built in $O(n^2 T_1(n))$ time. The DP then runs in $O(n^2 k)$, for $O(n^2 (T_1(n) + k))$ total time.
→ but a quadratic amount of memory (prohibitive for $n > 10000$...)
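
For the squared-error cost, running sums even let the whole LUT be built in $O(n^2)$ total time; a sketch (ours):

def build_lut(xs):
    """E1[j][i] = squared-error cost of x_j..x_i (0-based, inclusive),
    filled with running sums so each cell costs O(1)."""
    n = len(xs)
    E1 = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = sq = 0.0
        for i in range(j, n):
            s += xs[i]
            sq += xs[i] ** 2
            m = i - j + 1
            E1[j][i] = sq - s * s / m   # sum of (x - mean)^2
    return E1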

10. DP solver with cluster size constraints
Let $n_i^-$ and $n_i^+$ be lower/upper bound constraints on $n_i = |C_i|$, with $\sum_{l=1}^{k} n_l^- \leq n$ and $\sum_{l=1}^{k} n_l^+ \geq n$. When there are no constraints, add the dummy constraints $n_i^- = 1$ and $n_i^+ = n - k + 1$.
Since $n_m = |C_m| = i - j + 1$ must satisfy $n_m^- \leq n_m \leq n_m^+$, we get $j \leq i + 1 - n_m^-$ and $j \geq i + 1 - n_m^+$:
$e_{i,m} = \min_{\max\{1 + \sum_{l=1}^{m-1} n_l^-, \; i + 1 - n_m^+\} \leq j \leq i + 1 - n_m^-} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}$
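
The constrained index range translates directly into code; a sketch (ours), with 1-based indices and bound lists padded with a dummy entry at position 0:

def j_range(i, m, n_minus, n_plus):
    """Valid left-delimiter range [lo, hi] for cluster m ending at point i,
    under size constraints n_minus[m] <= |C_m| <= n_plus[m]."""
    lo = max(1 + sum(n_minus[1:m]), i + 1 - n_plus[m])
    hi = i + 1 - n_minus[m]
    return lo, hi   # the range is empty whenever lo > hi

# unconstrained case, n = 10, k = 3: dummy bounds 1 and n - k + 1 = 8
print(j_range(10, 3, [0, 1, 1, 1], [0, 8, 8, 8]))   # (3, 10), i.e. m <= j <= i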

11. Model selection from the DP table
$m(k) = \frac{e_k(X)}{e_1(X)}$ decreases with $k$ and reaches its minimum when $k = n$. Model selection: trade off fit against complexity by choosing the best model among all models with $k \in [1, n]$. Regularized objective function: $e'_k(X) = e_k(X) + f(k)$, where $f(k)$ is related to the model complexity.
Compute the DP table once for $k = n$ (its columns contain all $k = 1, \ldots, n$, avoiding redundant computations), then evaluate the criterion on the last row (indexed by $n$) and choose the argmin of $e'_k$.
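
A sketch on top of the earlier DP code; the penalty f is a free choice (the linear penalty below is a hypothetical example, not one prescribed by the talk):

def select_k(xs, f):
    """Run the DP once with k = n, then pick argmin_k e_k(X) + f(k)
    from the last row of the cost table."""
    n = len(xs)
    _, E, _ = interval_kmeans_dp(xs, n)
    last_row = E[n - 1]                       # e_k(X) for k = 1..n
    return min(range(1, n + 1), key=lambda k: last_row[k] + f(k))

print(select_k([1, 2, 9, 10, 11, 20], f=lambda k: 5.0 * k))   # 3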

12. A Voronoi cell condition for DP optimality
elements → interval clusters → prototypes, and prototypes → interval clusters.
Voronoi diagram: partition of $\mathcal{X}$ w.r.t. $P = \{p_1, \ldots, p_k\}$. Voronoi cell:
$V(p_j) = \{x \in \mathcal{X} : d^r(x, p_j) \leq d^r(x, p_l) \ \forall l \in \{1, \ldots, k\}\}$
Since $x^r$ is a monotonically increasing function on $\mathbb{R}^+$, this is equivalent to $V'(p_j) = \{x \in \mathcal{X} : d(x, p_j) \leq d(x, p_l)\}$.
DP guarantees an optimal clustering when, for all $P$, the cells $V'(p_j)$ are intervals; the 2-clustering case exhibits the Voronoi bisector.

13. 1-mean (centroid): O(n) time
$\min_p \sum_{i=1}^{n} (x_i - p)^2$
With $D(x, p) = (x - p)^2$: $D'(x, p) = 2(x - p)$, $D''(x, p) = 2$, a convex optimization problem (existence and uniqueness of the solution):
$\sum_{i=1}^{n} D'(x_i, p) = 0 \Rightarrow \sum_{i=1}^{n} x_i - np = 0$
Center of mass (barycenter): $p = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Extends to Bregman divergences: $D_F(x, p) = F(x) - F(p) - (x - p)F'(p)$.

14. 2-means: O(n log n) time
Find $x_{l_2}$ ($n - 1$ potential locations for $x_{l_2}$: from $x_2$ to $x_n$):
$\min_{x_{l_2}} \{ e_1(C_1) + e_1(C_2) \}$
Browse from left to right, $l_2 = 2, \ldots, n$, and update the cost $E_2(l + 1)$ from $E_2(l)$ in constant time (SATs are also $O(1)$), where $E_2(l) = e_2(x_1 \ldots x_{l-1} \mid x_l \ldots x_n)$:
$\mu_1(l + 1) = \frac{(l - 1)\mu_1(l) + x_l}{l}, \quad \mu_2(l + 1) = \frac{(n - l + 1)\mu_2(l) - x_l}{n - l}$
$v_1(l + 1) = \sum_{i=1}^{l} (x_i - \mu_1(l + 1))^2 = \sum_{i=1}^{l} x_i^2 - l\,\mu_1^2(l + 1)$
$\Delta E_2(l) = \frac{l - 1}{l}(\mu_1(l) - x_l)^2 - \frac{n - l + 1}{n - l}(\mu_2(l) - x_l)^2$
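
A sketch of the incremental scan (0-based Python, names ours); the within-cluster sums of squares are maintained with the standard add-a-point/remove-a-point update rules:

def two_means_1d(xs):
    """Best split of sorted xs into two contiguous clusters: O(n) scan
    after sorting, constant work per candidate split."""
    xs = sorted(xs)
    n = len(xs)
    mu1, sse1 = xs[0], 0.0
    mu2 = sum(xs[1:]) / (n - 1)
    sse2 = sum((x - mu2) ** 2 for x in xs[1:])
    best = (sse1 + sse2, 1)          # split i: C1 = xs[:i], C2 = xs[i:]
    for l in range(1, n - 1):        # move x_l from cluster 2 into cluster 1
        x = xs[l]
        sse1 += (l / (l + 1)) * (x - mu1) ** 2
        mu1 = (l * mu1 + x) / (l + 1)
        m2 = n - l                   # size of cluster 2 before removal
        sse2 -= (m2 / (m2 - 1)) * (x - mu2) ** 2
        mu2 = (m2 * mu2 - x) / (m2 - 1)
        if sse1 + sse2 < best[0]:
            best = (sse1 + sse2, l + 1)
    return best                      # (cost, split index)

print(two_means_1d([1, 2, 10, 11]))  # (1.0, 2): {1,2} | {10,11}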

15. 2-means: Experiments
Intel Win7 i7-4800:

n         Brute force   SAT     Incremental
300000    155.022       0.010   0.0091
1000000   1814.44       0.018   0.015

Do we need sorting, and hence Ω(n log n) time? ($k = 1$ is linear time.)
Note that MaxGap does not yield the separator (because the centroid is the minimizer of the sum of squared distances).

16. Optimal 1D Bregman k-means
Bregman information [2] $e_1$ (generalizes the cluster variance):
$e_1(C_j) = \min_{p_j} \sum_{x_l \in C_j} w_l B_F(x_l : p_j). \quad (1)$
Expressed as [14]:
$e_1(C_j) = \sum_{x_l \in C_j} w_l F(x_l) + \left(\sum_{x_l \in C_j} w_l\right)\left(p_j F'(p_j) - F(p_j)\right) - F'(p_j)\left(\sum_{x_l \in C_j} w_l x_l\right)$
Preprocess Summed Area Tables [6] (SATs) $S_1(j) = \sum_{l=1}^{j} w_l$, $S_2(j) = \sum_{l=1}^{j} w_l x_l$, and $S_3(j) = \sum_{l=1}^{j} w_l F(x_l)$ in $O(n)$ time. Then evaluate the Bregman information $e_1(X_{j,i})$ in constant time $O(1)$: for example, $\sum_{l=j}^{i} w_l F(x_l) = S_3(i) - S_3(j - 1)$, with $S_3(0) = 0$.
Bregman Voronoi diagrams have connected cells [4], thus DP yields the optimal interval clustering.
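
A sketch (ours) of the SAT evaluation; F and its derivative are passed in, and F(x) = x^2 recovers the squared Euclidean case:

def bregman_sats(xs, ws, F):
    """O(n) prefix sums S1, S2, S3 of weights, weighted points, weighted F-values."""
    S1, S2, S3 = [0.0], [0.0], [0.0]
    for x, w in zip(xs, ws):
        S1.append(S1[-1] + w)
        S2.append(S2[-1] + w * x)
        S3.append(S3[-1] + w * F(x))
    return S1, S2, S3

def bregman_info(j, i, S1, S2, S3, F, Fprime):
    """e_1(X_{j,i}) in O(1) (1-based inclusive indices): the Bregman
    information of x_j..x_i, with centroid = the weighted average."""
    W = S1[i] - S1[j - 1]
    Sx = S2[i] - S2[j - 1]
    SF = S3[i] - S3[j - 1]
    p = Sx / W                                # Bregman centroid
    return SF + W * (p * Fprime(p) - F(p)) - Fprime(p) * Sx

F, Fp = (lambda x: x * x), (lambda x: 2 * x)
S1, S2, S3 = bregman_sats([1.0, 2.0, 9.0], [1.0, 1.0, 1.0], F)
print(bregman_info(1, 3, S1, S2, S3, F, Fp))  # 38.0 = squared error of {1,2,9}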

17. Exponential families in statistics
Family of probability distributions: $\mathcal{F} = \{p_F(x; \theta) : \theta \in \Theta\}$.
Exponential families [13]:
$p_F(x \mid \theta) = \exp(t(x)\theta - F(\theta) + k(x))$
For example, the univariate Rayleigh distribution $R(\sigma)$: $t(x) = x^2$, $k(x) = \log x$, $\theta = -\frac{1}{2\sigma^2}$, $\eta = -\frac{1}{\theta}$, $F(\theta) = \log\left(-\frac{1}{2\theta}\right)$ and $F^*(\eta) = -1 + \log\frac{2}{\eta}$.

18. Uniorder exponential families: MLE
Maximum Likelihood Estimator (MLE) [13]:
$e_1(X_{j,i}) = \hat{l}(x_j, \ldots, x_i) = F^*(\hat{\eta}_{j,i}) + \frac{1}{i - j + 1}\sum_{l=j}^{i} k(x_l)$
with $\hat{\eta}_{j,i} = \frac{1}{i - j + 1}\sum_{l=j}^{i} t(x_l)$.
By making the change of variable $y_l = t(x_l)$, and not accounting for the $\sum k(x_l)$ terms that are constant for any clustering, we get
$e_1(X_{j,i}) \equiv F^*\left(\frac{1}{i - j + 1}\sum_{l=j}^{i} y_l\right)$
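
For the Rayleigh example of the previous slide, this cost is evaluated in O(1) from a prefix sum of the sufficient statistics. A sketch (ours); as we read the slides, it would be plugged into the generic DP with $\oplus = +$ and argmax, since the complete log-likelihood is maximized:

import math

def e1_rayleigh(j, i, Y):
    """e_1(X_{j,i}) = F*(eta_hat) for the Rayleigh family, with
    F*(eta) = -1 + log(2/eta) and Y[i] = sum of y_l = x_l^2 for l <= i
    (1-based inclusive indices, Y[0] = 0); constant k(x_l) terms dropped."""
    m = i - j + 1
    eta_hat = (Y[i] - Y[j - 1]) / m     # average sufficient statistic
    return -1.0 + math.log(2.0 / eta_hat)

xs = [1.0, 1.2, 3.5, 3.7]
Y = [0.0]
for x in xs:
    Y.append(Y[-1] + x * x)
print(e1_rayleigh(1, 2, Y), e1_rayleigh(3, 4, Y))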
