

Data-driven Clustering via Parameterized Lloyds Families
Travis Dick
Joint work with Maria-Florina Balcan and Colin White
Carnegie Mellon University, NeurIPS 2018

Data-driven Clustering
• Clustering aims to divide a dataset into self-similar clusters.
• Goal: find some unknown natural clustering.
• However, most clustering algorithms minimize a clustering cost function.
• Hope that low-cost clusterings recover the natural clusters.
• There are many algorithms and many objectives. How do we choose the best algorithm for a specific application? Can we automate this process?

Learning Model
• An unknown distribution D over clustering instances.
• Given a sample S_1, …, S_N ~ D annotated by their target clusterings.
• Find an algorithm A that produces clusterings similar to the target clusterings.
• Want A to also work well for new instances from D!
• In this work:
1. Introduce a large parametric family of clustering algorithms, (α, β)-Lloyds.
2. Give efficient procedures for finding the best parameters on a sample (a simplified tuning sketch follows this list).
3. Generalization: parameters that are optimal on the sample are nearly optimal on D.
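To make the tuning step concrete, here is a minimal sketch of picking the initialization parameter α on an annotated sample. This plain grid search is a simplification standing in for the talk's more efficient procedures; `run_lloyds` and `clustering_loss` are hypothetical helpers representing, respectively, a run of an (α, β)-Lloyds algorithm on an instance and a measure of disagreement with the target clustering.

```python
# Illustrative grid search for alpha on annotated training instances.
# NOT the paper's efficient procedure: a simplified stand-in.
# `run_lloyds(S, alpha)` and `clustering_loss(C, T)` are hypothetical
# callables supplied by the user.
import numpy as np

def tune_alpha(instances, targets, alphas, run_lloyds, clustering_loss):
    """Return the alpha in `alphas` with the lowest average loss on the sample."""
    avg_losses = []
    for alpha in alphas:
        # Run the algorithm on each training instance and score the
        # resulting clustering against its annotated target.
        losses = [clustering_loss(run_lloyds(S, alpha), T)
                  for S, T in zip(instances, targets)]
        avg_losses.append(np.mean(losses))
    return alphas[int(np.argmin(avg_losses))]
```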

Lloyd's Method
• Maintains k centers c_1, …, c_k that define clusters.
• Performs local search to improve the k-means cost of the centers (see the sketch after this list):
1. Assign each point to its nearest center.
2. Update each center to be the mean of its assigned points.
3. Repeat until convergence.
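A minimal sketch of these three steps in Python/NumPy, assuming Euclidean data; the function name and the convergence test are illustrative choices, not from the talk.

```python
# Minimal sketch of Lloyd's local search from given initial centers.
import numpy as np

def lloyds(X, centers, max_iters=100):
    """X: (n, d) points; centers: (k, d) initial centers.
    Returns the final centers and the cluster assignment of each point."""
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its assigned points
        # (keep the old center if no points were assigned to it).
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(len(centers))
        ])
        # Step 3: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```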

Initial Centers are Important!
• Lloyd's method can get stuck if the initial centers are chosen poorly.
• Initialization is a well-studied problem with many proposed procedures (e.g., k-means++).
• The best method depends on properties of the clustering instances.

  35. The (", $) -Lloyds Family

  36. The (", $) -Lloyds Family Initialization: Parameter "

  37. The (", $) -Lloyds Family Initialization: Parameter " • Use & ' -sampling (generalizing & ( -sampling of ) -means++)

  38. The (", $) -Lloyds Family Initialization: Parameter " • Use & ' -sampling (generalizing & ( -sampling of ) -means++) • Choose initial centers from dataset * randomly.

  39. The (", $) -Lloyds Family Initialization: Parameter " • Use & ' -sampling (generalizing & ( -sampling of ) -means++) • Choose initial centers from dataset * randomly. ' . • Probability that point + ∈ * is center - . is proportional to & +, - / , … , - .1/

  40. The (", $) -Lloyds Family Initialization: Parameter " • Use & ' -sampling (generalizing & ( -sampling of ) -means++) • Choose initial centers from dataset , randomly. ' . • Probability that point - ∈ , is center / 0 is proportional to & -, / 1 , … , / 031 " = 0 : random initialization

  41. The (", $) -Lloyds Family Initialization: Parameter " • Use & ' -sampling (generalizing & ( -sampling of ) -means++) • Choose initial centers from dataset - randomly. ' . • Probability that point . ∈ - is center 0 1 is proportional to & ., 0 2 , … , 0 142 " = 0 : random initialization " = 2 : ) -means++

  42. The (", $) -Lloyds Family Initialization: Parameter " • Use & ' -sampling (generalizing & ( -sampling of ) -means++) • Choose initial centers from dataset . randomly. ' . • Probability that point / ∈ . is center 1 2 is proportional to & /, 1 3 , … , 1 253 " = 0 : random initialization " = 2 : ) -means++ " = ∞ : farthest first
