  1. On the Worst-Case Complexity of the k-Means Method Sergei Vassilvitskii David Arthur (Stanford University)

  2. Clustering Given n points in R^d, split them into k similar groups.

  3. Clustering Objectives Let C(x) be the cluster center closest to x. k-Center: min max_{x ∈ X} ‖x − C(x)‖

  4. Clustering Objectives Let C(x) be the cluster center closest to x. k-Median: min Σ_{x ∈ X} ‖x − C(x)‖

  5. Clustering Objectives Let C(x) be the cluster center closest to x. k-Median Squared: min Σ_{x ∈ X} ‖x − C(x)‖². Much more sensitive to outliers.
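As a concrete reading of the three objectives above, here is a small Python sketch (an illustrative helper, not from the talk) that evaluates each of them for a fixed set of centers:

```python
import numpy as np

def objectives(points, centers):
    """Evaluate the three clustering objectives from the slides for a
    fixed set of centers (illustrative sketch only)."""
    # Distance from each point x to each center; min over centers is ||x - C(x)||.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    closest = dists.min(axis=1)
    return {
        "k-center": closest.max(),                  # max_x ||x - C(x)||
        "k-median": closest.sum(),                  # sum_x ||x - C(x)||
        "k-median-squared": (closest ** 2).sum(),   # the k-means objective
    }
```

The actual clustering problem then minimizes the chosen quantity over all placements of the k centers.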

  6. Lloyd’s Method: K-means Initialize with random clusters

  7. Lloyd’s Method: K-means Assign each point to the nearest center

  8. Lloyd’s Method: K-means Recompute optimum centers (means)

  9. Lloyd’s Method: K-means Repeat: Assign points to nearest center

  10. Lloyd’s Method: K-means Repeat: Recompute centers

  11. Lloyd’s Method: K-means Repeat...

  12. Lloyd’s Method: K-means Repeat... Until clustering does not change
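The loop on slides 6–12 can be sketched in Python (a minimal illustrative implementation, not the authors' code; initializing from k random data points is one common choice):

```python
import numpy as np

def lloyd(points, k, rng=np.random.default_rng(0)):
    """Sketch of Lloyd's method as described on the slides: initialize,
    then alternate assignment and mean recomputation until the
    clustering stops changing."""
    # Initialize centers with k distinct random input points.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    assignment = None
    while True:
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(assignment, new_assignment):
            return centers, assignment  # clustering did not change: done
        assignment = new_assignment
        # Recompute each center as the mean of its cluster.
        for j in range(k):
            members = points[assignment == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
```

As the next slides note, this always reaches a local optimum, but the local optimum can be arbitrarily bad and the number of iterations is the subject of the talk.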

  13. Analysis How good is this algorithm? It finds a local optimum, which can be arbitrarily worse than the optimal solution.

  14. Analysis How fast is this algorithm? In practice: VERY fast. E.g., on a Digit Recognition dataset with n = 60,000, d = 700, it converges after 60 iterations. In theory: stay tuned.

  15. Previous Work Lower Bounds: Ω(n) on the line, Ω(n²) in the plane. Upper Bounds: O(n∆²) on the line, where the spread ∆ = max_{x,y} ‖x − y‖ / min_{x,y} ‖x − y‖. Exponential bounds: O(k^n), O(n^(kd)).

  16. Our Results Lower Bound: 2^Ω(√n). Smoothed Upper Bounds: O((D/σ)² n^(2+2/d) 2^(2n/d)) and O((D/σ)² n^(k+2/d)), where σ is the smoothness factor and D is the diameter of the point set.

  17. Rest of the Talk Lower Bound Sketch Upper Bound Sketch Open Problems

  18. Lower Bound General Idea: Make a “Reset Widget”: if k-Means takes time t on X, create a new point set X′ such that k-Means takes time 2t to terminate.

  19. Lower Bound: Sketch Initial Clustering: t steps

  20. Lower Bound: Sketch With Widget: t steps, reset, t steps

  21. Lower Bound Details Three Main Ideas: Signaling - recognizing when to start flipping the switch. Resetting - setting the cluster centers back to their original positions. Odds & Ends - clean-up to make the process recursive.

  22. Signaling Suppose that when k-Means terminates, there is one cluster center p that has never appeared before. We use this as a signal to start the reset sequence.


  24. Signaling Suppose that when k-Means terminates, there is one cluster center p that has never appeared before. We use this as a signal to start the reset sequence. By setting the distance to d − ε, we can control exactly when p will switch.


  26. Resetting k properly placed points can reset the positions of the k current centers. It is easy to compute locations of the reset points so that the new cluster centers are placed correctly. (figure: intended center vs. current center)

  27. Resetting k properly placed points can reset the positions of the k current centers. It is easy to compute locations of the reset points so that the new cluster centers are placed correctly. (figure: add a point to the cluster to reset its mean)

  28. Resetting k properly placed points can reset the positions of the k current centers. It is easy to compute locations of the reset points so that the new cluster centers are placed correctly. (figure: the new center)
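The arithmetic behind a single reset point can be made explicit: to move the mean of a cluster S with current mean c to an intended center c′ by adding one point p, solve (|S|·c + p)/(|S| + 1) = c′, giving p = (|S| + 1)c′ − |S|c. A minimal Python sketch (illustrative only; the actual widget in the construction must also avoid grabbing other points, as the next slides discuss):

```python
import numpy as np

def reset_point(cluster, intended_center):
    """Point p to add to `cluster` so that its mean moves to
    `intended_center`. Solves (|S|*c + p) / (|S| + 1) = c' for p."""
    c = cluster.mean(axis=0)
    n = len(cluster)
    return (n + 1) * np.asarray(intended_center, dtype=float) - n * c
```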

  29. Resetting It is easy to compute locations of the reset points so that the new cluster centers are placed correctly. But we must avoid accidentally grabbing other points.

  30. Resetting Solution: add two new dimensions. (figure: axes x, y, z, w; the center to reset and its intended new position)

  31. Resetting Solution: add two new dimensions. (figure: new point added)


  33. Multi-Signaling So far we have shown how to signal and reset a single cluster. We can use one signal to induce a signal from all clusters.

  34. Multi-Signaling All centers are stable before the main signaling has taken place. (figure: point p at distances d and d + ε)


  36. Multi-Signaling Due to signaling, the center moves away; now all centers absorb the points above (at distance > d + ε).

  37. Multi-Signaling Due to signaling, the center q moves away; now all centers absorb the points above. All clusters have previously unseen centers.

  38. Put All Pieces Together Start with a signaling configuration. Transform it so that all clusters signal. Use the new signal to reset cluster centers (and therefore double the runtime of k-Means). Ensure the new configuration is signaling. Repeat...

  39. Construction in Pictures Construction: Reflected Points

  40. Construction in Pictures After t steps, there is a signal by all clusters.

  41. Construction in Pictures Main clusters absorb “catalyst” points. Yellow centers move away.

  42. Construction in Pictures The new points added are “reset” points, resetting the original cluster centers.

  43. Construction in Pictures Can ensure the “catalyst” points leave the main clusters.

  44. Construction in Pictures k-Means runs for another t steps. The original centers will be signaling.

  45. Construction Results If we repeat the resetting widget construction r times: O(r²) points in O(r) dimensions, O(r) clusters. Total running time: 2^Ω(r). Since n = O(r²) points are used, this gives the 2^Ω(√n) lower bound.

  46. Construction Remarks Currently the construction has a very large spread. More trickery can decrease the spread to a constant, albeit with a blow-up in the dimension. As presented, it requires a specific placement of the initial cluster centers; in practice, centers are chosen randomly from the points. The construction can be made to work even in this case. Open question: can we decrease the dimensionality to constant d?

  47. Outline k-Means Intuition Lower Bound Sketch Upper Bound Sketch Open Problems

  48. Smoothed Analysis Assume each point came from a Gaussian distribution with variance σ². Data collection is inherently noisy. Or add some Gaussian noise (the effect on the final clustering is minimal). Key Fact: the probability mass inside any ball of radius ε is at most (ε/σ)^d.

  49. Potential Function Use a potential function: Φ(C) = Σ_{x ∈ X} ‖x − C(x)‖². The original potential is at most nD². The potential decreases every step: reassignment reduces ‖x − C(x)‖, and center recomputation finds the optimal center for the given partition.

  50. Potential Decrease Lemma Let S be a point set with optimal center c* and let c be any other point; then Φ(c) − Φ(c*) = |S| ‖c − c*‖². Proof: Φ(c) = Σ_{x ∈ S} (x − c)·(x − c) = Σ_{x ∈ S} (x − c* + c* − c)·(x − c* + c* − c) = Σ_{x ∈ S} [(x − c*)·(x − c*) + (c − c*)·(c − c*) − 2(c − c*)·(x − c*)] = Φ(c*) + |S| ‖c − c*‖² − 2(c − c*)·Σ_{x ∈ S} (x − c*), and the last sum vanishes because c* is the mean of S.
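The lemma can be sanity-checked numerically (an illustrative sketch, not part of the slides); the identity is exact because the cross term vanishes when c* is the mean of S:

```python
import numpy as np

def potential(S, c):
    """Phi(c) = sum over x in S of ||x - c||^2."""
    return float(((S - c) ** 2).sum())

# Numeric check of the lemma on an arbitrary point set (sketch, not proof).
rng = np.random.default_rng(1)
S = rng.normal(size=(7, 3))
c_star = S.mean(axis=0)   # the optimal center is the mean of S
c = rng.normal(size=3)    # any other candidate center
lhs = potential(S, c) - potential(S, c_star)
rhs = len(S) * float(((c - c_star) ** 2).sum())
assert np.isclose(lhs, rhs)
```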

  51. Main Lemma In a smoothed point set, fix an ε > 0. Then with probability at least 1 − 2^(2n) (ε/σ)^d, for any two clusters S and T with optimal centers c(S) and c(T), we have: ‖c(S) − c(T)‖ ≥ ε / (2 min(|S|, |T|)).

  52. Proof Sketch Suppose |S| < |T| and x ∈ S, x ∉ T. Fix all points except x. To ensure ‖c(S) − c(T)‖ ≤ ε, the point x must lie in a ball of diameter |S|ε. Since x came from a Gaussian of variance σ², this probability is at most (|S|ε σ^(−1))^d. Finally, union bound the total error probability over all 2^(2n) possible pairs of sets: 2^(2n) (ε/σ)^d.

  53. Potential Drop At each iteration, examine a cluster S whose center changed from c to c′: ‖c − c′‖ ≥ ε / (2|S|). Therefore, the potential drops by at least |S| ‖c − c′‖² ≥ |S| · ε² / (4|S|²) = ε² / (4|S|) ≥ ε² / (4n). Since the initial potential is at most nD², after m = 4n²D²/ε² iterations the algorithm must terminate.

  54. To Finish Up Choose ε = σ n^(−1/d) 2^(−2n/d). Then the total probability of failure is 2^(2n) (ε/σ)^d = 1/n. The total running time is m = 4n²D²/ε² = O((D/σ)² n^(2+2/d) 2^(2n/d)). Remark: polynomial for d = Ω(n / log n).
