
Fast K-Means with Accurate Bounds, James Newling & François Fleuret



  1. Fast K-Means with Accurate Bounds. James Newling & François Fleuret. Idiap Research Institute, Computer Vision and Learning Group & EPFL (École Polytechnique Fédérale de Lausanne). June 20th, 2016.

  2. K-Means Problem Statement and Lloyd’s Algorithm. Given data $(x_i)_{i=1}^{N} \in (\mathbb{R}^d)^N$, find centers $(c_k)_{k=1}^{K} \in (\mathbb{R}^d)^K$ minimising $\sum_{i=1}^{N} \min_{k=1:K} \|x_i - c_k\|^2$. The problem is NP-hard, so heuristic algorithms such as Lloyd’s are used. Lloyd’s algorithm run for $T$ iterations requires $dKNT$ FLOPs. We are interested in making it faster.
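The objective and the two alternating steps of Lloyd’s algorithm can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors’ implementation; the function names `assign`, `lloyd` and `energy`, and the choice to leave an empty cluster’s center where it is, are assumptions made for the example.

```python
import numpy as np

def assign(X, C):
    """Index of the nearest center for every row of X (N x K distances)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def lloyd(X, C, T):
    """Run T rounds of Lloyd's algorithm starting from centers C."""
    C = C.astype(float).copy()
    for _ in range(T):
        # Assignment step: about d*K*N FLOPs, the cost the talk wants to cut.
        a = assign(X, C)
        # Update step: each center moves to the mean of its assigned points.
        for k in range(C.shape[0]):
            members = X[a == k]
            if len(members) > 0:      # assumption: empty clusters stay put
                C[k] = members.mean(axis=0)
    return C, assign(X, C)

def energy(X, C):
    """K-means objective: sum over points of squared distance to nearest center."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()
```

Each round never increases `energy`, which gives a simple sanity check for any accelerated variant that claims to return the same output.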

  3. Lloyd’s Algorithm. Legend: × data, • centers. [Figure: data points and initial centers]

  4. Lloyd’s Algorithm. Assignment of a datapoint at iteration 1. [Figure]

  5. Lloyd’s Algorithm. All assignments at iteration 1. [Figure]

  6. Lloyd’s Algorithm. Updates at iteration 1. [Figure]

  7. Lloyd’s Algorithm. Assignment of a datapoint at iteration 2. [Figure]

  8. Lloyd’s Algorithm. All assignments at iteration 2. [Figure]

  9. Lloyd’s Algorithm. Updates at iteration 2. [Figure]

  10. Lloyd’s Algorithm. Assignment of a datapoint at iteration 3. [Figure]

  11. Lloyd’s Algorithm. All assignments at iteration 3. [Figure]

  12. Lloyd’s Algorithm. Updates at iteration 3. [Figure]

  13. Lloyd’s Algorithm. Assignment of a datapoint at iteration 4. [Figure]

  14. Lloyd’s Algorithm. All assignments at iteration 4. [Figure]

  15. Lloyd’s Algorithm. Updates at iteration 4. [Figure]

  16. Lloyd’s Algorithm: How to Accelerate. Two approaches: (1) approximate it; (2) be more efficient, that is, get exactly the same output as Lloyd’s algorithm without computing all data-center distances. Tree-based exact methods: Pelleg et al. (1999); Kanungo et al. (2002). Triangle-inequality (∆) exact methods: Elkan (2003), best in high $d$; Yinyang (2015), best in mid $d$; Hamerly (2010); Annular (2013), best in low $d$.

  17. Lloyd’s Algorithm: How to Accelerate (build of the previous slide). Added annotation: only exact methods, approach (2), for the next 13 minutes.

  18. Using the Triangle Inequality: Elkan’s Two Techniques. Elkan uses the triangle inequality in two distinct ways: (1) center-center distances are used to bound data-center distances; (2) bounds on data-center distances are maintained directly. [Figure: points, centers, and bounds labelled U and L]

  19. Using the Triangle Inequality: Elkan’s Two Techniques (build of the previous slide). Added conclusion: (A) We show that (1) + (2) is slower than just (2). Simplifying helps!
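The two uses of the triangle inequality named on these slides can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not Elkan’s or the authors’ code: the function names `inner_test` and `shift_bounds`, and the layout of the bound arrays (`u` with one entry per point, `l` with one entry per point and center), are hypothetical.

```python
import numpy as np

def inner_test(u, a, C):
    """Technique (1): center-center distances bound data-center distances.

    u[i] is an upper bound on the distance from point i to its assigned
    center C[a[i]]. If u[i] <= 0.5 * min_{k != a[i]} ||C[a[i]] - C[k]||,
    no other center can be closer, so point i needs no distance calculation.
    """
    cc = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    np.fill_diagonal(cc, np.inf)       # ignore the distance of a center to itself
    s = 0.5 * cc.min(axis=1)           # half the distance to the nearest other center
    return u <= s[a]

def shift_bounds(u, l, a, C_old, C_new):
    """Technique (2): keep data-center bounds valid after the centers move.

    l[i, k] is a lower bound on the distance from point i to center k.
    A center that moves by p[k] changes any distance to it by at most p[k].
    """
    p = np.linalg.norm(C_new - C_old, axis=1)
    return u + p[a], np.maximum(l - p[None, :], 0.0)
```

Points for which `inner_test` returns `True`, or whose upper bound stays below all of their lower bounds, can keep their assignment without any new distance calculation.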

  20. Using the Triangle Inequality. Elkan: $K - 1$ lower bounds. [Figure: a data point ×, centers •, upper bound U and lower bound L]

  21. Using the Triangle Inequality. Yinyang: group lower bounds. [Figure: a data point ×, centers •, upper bound U and lower bound L]

  22. Using the Triangle Inequality. Hamerly: 1 lower bound. [Figure: a data point ×, centers •, upper bound U and lower bound L]

  23. Lower bound updating. [Figure: animation frame]

  24. Lower bound updating. [Figure: animation frame]

  25. Lower bound updating. [Figure: animation frame]

  26. Lower bound updating. [Figure: animation frame]

  27. Lower bound updating. [Figure: animation frame]

  28. Lower bound updating. [Figure: animation frame]

  29. Lower bound updating. [Figure: animation frame]

  30. Lower bound updating. [Figure: animation frame]

  31. Lower bound updating. [Figure: animation frame]

  32. Lower bound updating. [Figure: animation frame]

  33. Lower bound updating. [Figure: animation frame]

  34. Lower bound updating. $\|\sum\cdot\|$-bound (norm of the sum) versus $\sum\|\cdot\|$-bound (sum of the norms). [Figure: animation frame with both bounds drawn]

  35. $\|\sum\cdot\|$-bounds. All upper and lower bounds in Elkan, Hamerly, Yinyang and Annular are $\sum\|\cdot\|$-bounds (sum of norms), and can be replaced by tighter $\|\sum\cdot\|$-bounds (norm of sum). The cost of $\|\sum\cdot\|$-bounds is additional memory: (i) the historical centers from all rounds must be stored; (ii) the round in which each bound was made tight must be stored. This memory overhead can be controlled by periodically clearing the history, which requires a $\sum\|\cdot\|$-bound update.

  36. $\|\sum\cdot\|$-bounds (build of the previous slide). Added conclusion: (B) We show that $\|\sum\cdot\|$-bounding generally improves algorithms.
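The difference between the two kinds of bound can be made concrete for a single lower bound. Suppose the bound on the distance from a point to one center was last made tight in some earlier round, and the center has moved several times since. A sum-of-norms update subtracts the length of every individual move, whereas a norm-of-sum update subtracts only the net displacement, which the triangle inequality guarantees is no larger. The sketch below is an illustration under assumptions: the function names and the `history` array (the stored positions of one center, from the round the bound was tight up to the current round) are hypothetical.

```python
import numpy as np

def sn_lower_bound(d_tight, history):
    """Sum-of-norms update: subtract the length of every per-round move."""
    steps = np.diff(history, axis=0)
    return d_tight - np.linalg.norm(steps, axis=1).sum()

def ns_lower_bound(d_tight, history):
    """Norm-of-sum update: subtract the net displacement since the tight round."""
    return d_tight - np.linalg.norm(history[-1] - history[0])

# A center that steps away and then returns to where it started.
x = np.array([3.0, 0.0])
history = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
d_tight = np.linalg.norm(x - history[0])

print(sn_lower_bound(d_tight, history))  # 1.0
print(ns_lower_bound(d_tight, history))  # 3.0
```

In this example the true distance is still 3, so the norm-of-sum bound stays tight while the sum-of-norms bound has decayed to 1. The price is the one named on the slide: the old center positions and the round of the last tight bound have to be kept.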

  37. Hamerly (2010): bound test, failure 1. [Figure: a data point × among centers •]

  38. Hamerly (2010): bound test, failure 2. [Figure: a data point × among centers •]

  39. Hamerly (2010): compute all distances. [Figure: a data point × among centers •]

  40. Hamerly (2010): reset bounds. [Figure: a data point × among centers •]

  41. Eliminating distance calculations. $c \notin B(x, r) \Rightarrow c \notin \{c_a^{\text{new}}, c_b^{\text{new}}\}$, where $r = \max_{c \in \{c_a^{\text{old}}, c_b^{\text{old}}\}} \|x - c\|$. [Figure: a data point ×, centers •, the centers $c_a^{\text{old}}$ and $c_b^{\text{old}}$, and the ball of radius $r$ around ×]
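The implication on this slide says that once two centers are known to lie within distance $r$ of $x$, any center outside the ball $B(x, r)$ can be neither the new nearest nor the new second-nearest center. One way to exploit this without computing every data-center distance, assumed here purely for illustration, is to filter on precomputed center-center distances: a center farther than $r + \|x - c_a\|$ from $c_a$ must lie outside the ball. The function name `candidate_centers` and its arguments are hypothetical.

```python
import numpy as np

def candidate_centers(x, C, a_old, b_old, cc):
    """Indices of centers that may be the new nearest or second nearest.

    C holds the current centers; a_old and b_old index the previously
    nearest and second-nearest centers; cc[j, k] = ||C[j] - C[k]||.
    Returns the candidate indices and the ball radius r.
    """
    da = np.linalg.norm(x - C[a_old])
    db = np.linalg.norm(x - C[b_old])
    r = max(da, db)
    # Triangle inequality: ||x - c|| >= ||c - C[a_old]|| - da, so a center with
    # ||c - C[a_old]|| > r + da lies outside B(x, r) and can be skipped.
    return np.flatnonzero(cc[a_old] <= r + da), r
```

Only the returned candidates need a distance calculation to `x`; the filter is conservative, so it may keep some centers that are in fact outside the ball.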
