
New Developments in the Theory of Clustering
"That's all very well in practice, but does it work in theory?"
Sergei Vassilvitskii (Yahoo! Research) and Suresh Venkatasubramanian (U. Utah)

Overview


1. Seeding on Gaussians
But this fix is overly sensitive to outliers.

2. k-means++
What if we interpolate between the two methods? Let D(x) be the distance between a point x and its nearest cluster center. Choose the next point proportionally to D^α(x).
α = 0 → random initialization
α = ∞ → furthest-point heuristic
α = 2 → k-means++
More generally, set the probability of selecting a point proportional to its contribution to the overall error:
- If minimizing ∑_{c_i} ∑_{x ∈ C_i} ‖x − c_i‖, sample according to D.
- If minimizing ∑_{c_i} ∑_{x ∈ C_i} ‖x − c_i‖^∞, sample according to D^∞ (take the furthest point).
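To make the interpolation concrete, here is a minimal NumPy sketch of D^α seeding. The function name d_alpha_seeding and its structure are illustrative assumptions, not the authors' reference code:

```python
import numpy as np

def d_alpha_seeding(X, k, alpha=2.0, seed=None):
    """Pick k initial centers from X by D^alpha sampling (illustrative sketch).

    alpha = 0      -> uniform random initialization
    alpha = 2      -> k-means++
    alpha = np.inf -> furthest-point heuristic
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]   # first center: uniform at random
    d = np.full(n, np.inf)           # D(x) for each point x
    for _ in range(k - 1):
        # Update D(x): distance from each point to its nearest chosen center.
        d = np.minimum(d, np.linalg.norm(X - centers[-1], axis=1))
        if np.isinf(alpha):
            centers.append(X[np.argmax(d)])        # furthest-point heuristic
        else:
            p = d ** alpha                         # sample proportionally to D^alpha
            centers.append(X[rng.choice(n, p=p / p.sum())])
    return np.array(centers)
```

With alpha=2 this is the k-means++ seeding analyzed below; each round touches every point once, which is the O(nd)-per-center cost discussed later in the deck.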

3. Example of k-means++
If the data set looks Gaussian...

4. Example of k-means++
If the outlier should be its own cluster...

5. Analyzing k-means++
What can we say about the performance of k-means++?
Theorem (AV07). This algorithm always attains an O(log k) approximation in expectation.
Theorem (ORSS06). A slightly modified version of this algorithm attains an O(1) approximation if the data is 'nicely clusterable' with k clusters.

6. Nice Clusterings
What do we mean by 'nicely clusterable'? Intuitively, X is nicely clusterable if going from k − 1 to k clusters drops the total error by a constant factor.
Definition. A point set X is (k, ε)-separated if φ*_k(X) ≤ ε² · φ*_{k−1}(X).
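To illustrate the definition, the separation ratio can be estimated empirically. This sketch (function name hypothetical) uses scikit-learn's KMeans inertia as a stand-in for the optimal costs φ*_k and φ*_{k−1}, so it only approximates the true ratio:

```python
from sklearn.cluster import KMeans

def separation_ratio(X, k, seed=0):
    """Estimate phi*_k(X) / phi*_{k-1}(X); X is (k, eps)-separated when
    this ratio is at most eps^2. KMeans inertia only approximates the
    optimal cost phi*, so treat the result as an estimate."""
    cost_k  = KMeans(n_clusters=k,     n_init=10, random_state=seed).fit(X).inertia_
    cost_k1 = KMeans(n_clusters=k - 1, n_init=10, random_state=seed).fit(X).inertia_
    return cost_k / cost_k1
```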

7. Why does this work?
Intuition: look at the optimum clustering. In expectation:
1. If the algorithm selects a point from a new OPT cluster, that cluster is covered pretty well.
2. If the algorithm picks two points from the same OPT cluster, then the other clusters must contribute little to the overall error.
As long as the points are reasonably well separated, the first condition holds.
Two theorems: assume the points are (k, ε)-separated and get an O(1) approximation; or make no assumptions about separability and get an O(log k) approximation.

8. Summary: k-means++
To select the next cluster center, sample a point in proportion to its current contribution to the error.
- Works for k-means, k-median, and other objective functions.
- Universal O(log k) approximation; O(1) approximation under some assumptions.
- Can be implemented to run in O(nkd) time (the same as a single k-means step).
But does it actually work?

9. Large Evaluation

10. Typical Run
[Figure: clustering error vs. stage for LLOYD, HYBRID, and KM++ on one run.]

11. Other Runs
[Figure: clustering error vs. stage for LLOYD, HYBRID, and KM++ on another run.]

12. Convergence
How fast does k-means converge? It appears the algorithm converges in under 100 iterations (even faster with smart initialization).
Theorem (V09). There exists a point set X in ℝ² and a set of initial centers such that k-means takes 2^Ω(k) iterations to converge from those centers.

13. Theory vs. Practice
Finding the disconnect:
- In theory: k-means might run in exponential time.
- In practice: k-means converges after a handful of iterations.
It works in practice, but it does not work in theory!

14. Finding the disconnect
Robustness of worst-case examples: perhaps the worst-case examples are too precise and can never arise from natural data.
Quantifying the robustness: if we slightly perturb the points of the example, the optimum solution shouldn't change too much. Will the running time stay exponential?

15. Small Perturbations

16. Smoothed Analysis
Perturbation: to each point x ∈ X, add independent noise drawn from N(0, σ²).
Definition. The smoothed complexity of an algorithm is the maximum expected running time after adding the noise: max_X E_σ[Time(X + σ)].
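The perturbation model itself is easy to instantiate; a minimal sketch, assuming independent per-coordinate Gaussian noise (function name illustrative):

```python
import numpy as np

def perturb(X, sigma, seed=None):
    """Add independent N(0, sigma^2) noise to each coordinate of each
    point, as in the smoothed-analysis perturbation model above."""
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, sigma, size=X.shape)
```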

17. Smoothed Analysis
Theorem (AMR09). The smoothed complexity of k-means is bounded by O(n^34 · k^34 · d^8 · D^6 · log^4 n / σ^6).
Notes:
- While the bound is large, it is not exponential (2^k ≫ k^34 for large enough k).
- The (D/σ)^6 factor shows the bound is scale invariant.

18. Smoothed Analysis
Comparing bounds: the smoothed complexity of k-means is polynomial in n, k, and D/σ (where D is the diameter of X), whereas the worst-case complexity of k-means is exponential in k.
Implications: the pathological examples are very brittle and can be avoided with a little bit of random noise.

19. k-means Summary
Running time:
- Exponential worst-case running time.
- Polynomial typical-case running time.
Solution quality:
- An arbitrary local optimum, even with many random restarts.
- Simple initialization leads to a good solution.

20. Large Datasets
Implementing k-means++: initialization takes O(nd) time and one pass over the data to select each next center, so O(nkd) time total. Each round of k-means also takes O(nkd) time, and the algorithm typically finishes after a constant number of rounds.
Large data: what if O(nkd) is too much? Can we parallelize this algorithm?

21. Parallelizing k-means
Approach (a minimal sketch follows below):
1. Partition the data: split X into X_1, X_2, ..., X_m of roughly equal size.
2. In parallel, compute a clustering on each partition: find C^j = {C_1^j, ..., C_k^j}, a good clustering of X_j, and denote by w_i^j the number of points in cluster C_i^j.
3. Cluster the clusters: let Y = ∪_{1≤j≤m} C^j, and find a clustering of Y weighted by the weights W = {w_i^j}.
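A minimal sequential sketch of this three-step scheme; names are illustrative, the per-partition loop stands in for a parallel map, and the final step relies on scikit-learn's sample_weight argument to KMeans.fit:

```python
import numpy as np
from sklearn.cluster import KMeans

def partitioned_kmeans(X, k, m, seed=0):
    """Split X into m parts, cluster each part (in parallel in practice),
    then cluster the union of the resulting centers, weighting each
    center by the number of points it represents."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X), m)           # X_1, ..., X_m
    centers, weights = [], []
    for Xj in parts:                                        # parallel map in practice
        km = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(Xj)
        centers.append(km.cluster_centers_)                 # C_1^j, ..., C_k^j
        weights.append(np.bincount(km.labels_, minlength=k))  # w_i^j
    Y = np.vstack(centers)                                  # Y = union of all C^j
    W = np.concatenate(weights).astype(float)
    # Cluster the clusters, weighting each center by its cluster size.
    final = KMeans(n_clusters=k, n_init=5, random_state=seed)
    return final.fit(Y, sample_weight=W).cluster_centers_
```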

22. Parallelization Example
[Figures: given X; partition the dataset; cluster each partition separately.]
