Partitional Clustering


  1. Partitional Clustering • Clustering: David Arthur, Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. In SODA 2007 • Thanks to A. Gionis and S. Vassilvitskii for the slides

  2. What is clustering? • a grouping of data objects such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups

  3. How to capture this objective? • a grouping of data objects such that the objects within a group are similar (or near) to one another and dissimilar (or far) from the objects in other groups • minimize intra-cluster distances, maximize inter-cluster distances

  4. The clustering problem • Given a collection of data objects • Find a grouping so that • similar objects are in the same cluster • dissimilar objects are in different clusters ✦ Why do we care? ✦ stand-alone tool to gain insight into the data ✦ visualization ✦ preprocessing step for other algorithms ✦ indexing or compression often relies on clustering

  5. Applications of clustering • image processing • cluster images based on their visual content • web mining • cluster groups of users based on their access patterns on webpages • cluster webpages based on their content • bioinformatics • cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.) • many more...

  6. The clustering problem • Given a collection of data objects • Find a grouping so that • similar objects are in the same cluster • dissimilar objects are in different clusters ✦ Basic questions: ✦ what does similar mean? ✦ what is a good partition of the objects? i.e., how is the quality of a solution measured? ✦ how to find a good partition?

  7. Notion of a cluster can be ambiguous • How many clusters? [Figure: the same point set shown as two clusters, four clusters, or six clusters]

  8. Types of clusterings • Partitional • each object belongs in exactly one cluster • Hierarchical • a set of nested clusters organized in a tree

  9. Hierarchical clustering • [Figures: a traditional hierarchical clustering of points p1-p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]

  10. Partitional clustering • [Figure: original points and a partitional clustering of them]

  11. Partitional algorithms • partition the n objects into k clusters • each object belongs to exactly one cluster • the number of clusters k is given in advance

  12. The k-means problem • consider a set X = {x_1, ..., x_n} of n points in R^d • assume that the number k is given • problem: • find k points c_1, ..., c_k (named centers or means) so that the cost
  $$\sum_{i=1}^{n} \min_{j} L_2^2(x_i, c_j) = \sum_{i=1}^{n} \min_{j} \| x_i - c_j \|_2^2$$
  is minimized

  13. The k-means problem • consider a set X = {x_1, ..., x_n} of n points in R^d • assume that the number k is given • problem: • find k points c_1, ..., c_k (named centers or means) • and partition X into {X_1, ..., X_k} by assigning each point x_i in X to its nearest cluster center, • so that the cost
  $$\sum_{i=1}^{n} \min_{j} \| x_i - c_j \|_2^2 = \sum_{j=1}^{k} \sum_{x \in X_j} \| x - c_j \|_2^2$$
  is minimized
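To make the objective concrete, here is a minimal NumPy sketch of the cost (our addition; the name `kmeans_cost` is not from the slides). `X` is an (n, d) array of points and `centers` a (k, d) array:

```python
import numpy as np

def kmeans_cost(X, centers):
    """Sum over all points of the squared L2 distance to the nearest center."""
    # pairwise squared distances between points and centers, shape (n, k)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # each point pays only for its nearest center
    return d2.min(axis=1).sum()
```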

  14. The k-means problem • k=1 and k=n are easy special cases (why? see the note below) • an NP-hard problem if the dimension of the data is at least 2 (d ≥ 2) • for d ≥ 2, no algorithm is known that finds the optimal solution in polynomial time • for d=1 the problem is solvable in polynomial time • in practice, a simple iterative algorithm works quite well
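A short answer to the "why?", added here for completeness: for k = n every point can be its own center, so the cost is zero; for k = 1 the optimal center is the mean of X, since setting the gradient of the cost to zero gives
$$\nabla_c \sum_{i=1}^{n} \| x_i - c \|_2^2 = -2 \sum_{i=1}^{n} (x_i - c) = 0 \quad \Rightarrow \quad c = \frac{1}{n} \sum_{i=1}^{n} x_i$$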

  15. The k-means algorithm • voted among the top-10 algorithms in data mining • one way of solving the k-means problem

  16. The k-means algorithm • 1. randomly (or with another method) pick k cluster centers {c_1, ..., c_k} • 2. for each j, set the cluster X_j to be the set of points in X that are closest to center c_j • 3. for each j, let c_j be the center of cluster X_j (the mean of the vectors in X_j) • 4. repeat (go to step 2) until convergence
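A minimal NumPy sketch of these four steps (our addition, not the lecture's reference code; the crude empty-cluster handling is a simplifying assumption):

```python
import numpy as np

def lloyd(X, k, n_iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct data points as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # step 2: assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: move each center to the mean of its cluster
        # (an empty cluster simply keeps its old center)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```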

  17. Sample execution

  18. Properties of the k-means algorithm • finds a local optimum • often converges quickly, but not always • the choice of initial points can have a large influence on the result

  19. Effects of bad initialization

  20. Limitations of k-means: different sizes • [Figure: original points vs. k-means with 3 clusters]

  21. Limitations of k-means: different density • [Figure: original points vs. k-means with 3 clusters]

  22. Limitations of k-means: non-spherical shapes • [Figure: original points vs. k-means with 2 clusters]

  23. Discussion on the k-means algorithm • finds a local optimum • often converges quickly, but not always • the choice of initial points can have a large influence on the result • tends to find spherical clusters • outliers can cause a problem • different densities may cause a problem

  24. Initialization • random initialization • random, but repeat many times and take the best solution • helps, but the solution can still be bad • pick points that are distant from each other • k-means++ • provable guarantees
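As a practical aside (our addition, not from the slides): scikit-learn's KMeans exposes these strategies directly, with n_init controlling repeated restarts (the lowest-cost run is kept) and init selecting the seeding method:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # toy data

# random initialization, repeated 10 times; the best run is kept
km_rand = KMeans(n_clusters=3, init="random", n_init=10).fit(X)

# careful seeding via k-means++ (scikit-learn's default init)
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=1).fit(X)

print(km_rand.inertia_, km_pp.inertia_)  # inertia_ is the k-means cost
```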

  25. k-means++ • David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. SODA 2007

  26. k-means algorithm: random initialization

  27. k-means algorithm: random initialization

  28. k-means algorithm: initialization with furthest-first traversal • [Figure: centers picked in order 1, 2, 3, 4]

  29. k-means algorithm: initialization with furthest-first traversal
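A minimal sketch of furthest-first traversal (our addition; the function name is ours): start from an arbitrary point, then repeatedly pick the point furthest from all centers chosen so far:

```python
import numpy as np

def furthest_first(X, k, seed=0):
    """Pick k centers, each the point furthest from the centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]        # arbitrary first center
    # squared distance from each point to its nearest chosen center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        centers.append(X[d2.argmax()])         # the furthest point becomes a center
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)
```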

  30. but... sensitive to outliers • [Figure: furthest-first traversal picks outliers as centers 1, 2, 3]

  31. but... sensitive to outliers

  32. Here random may work well

  33. k-means++ algorithm • interpolate between the two methods • let D(x) be the distance between x and the nearest center selected so far • choose the next center with probability proportional to $(D(x))^a = D^a(x)$ ✦ a = 0: random initialization ✦ a = ∞: furthest-first traversal ✦ a = 2: k-means++

  34. k-means++ algorithm • initialization phase: • choose the first center uniformly at random • choose each next center with probability proportional to $D^2(x)$ • iteration phase: • iterate as in the k-means algorithm until convergence
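A minimal sketch of the seeding phase (our addition, not the paper's reference implementation). The exponent a is left as a parameter so that a = 0, a = 2, and a very large a reproduce the three variants of the previous slide; pass the result to a standard k-means loop for the iteration phase:

```python
import numpy as np

def seed_centers(X, k, a=2.0, seed=0):
    """D^a sampling: a=0 is uniform random, a=2 is k-means++ seeding."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]   # first center uniformly at random
    # D(x): distance from each point to its nearest chosen center
    d = np.sqrt(((X - centers[0]) ** 2).sum(axis=1))
    for _ in range(k - 1):
        w = d ** a                        # sampling weight D(x)^a
        centers.append(X[rng.choice(len(X), p=w / w.sum())])
        d = np.minimum(d, np.sqrt(((X - centers[-1]) ** 2).sum(axis=1)))
    return np.array(centers)
```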

  35. k-means++ initialization • [Figure: centers 1, 2, 3 chosen by D²-sampling]

  36. k-means++ result

  37. k-means++ provable guarantee • Theorem: k-means++ is O(log k)-approximate in expectation

  38. k-means++ provable guarantee • the approximation guarantee comes just from the first iteration (initialization) • subsequent iterations can only improve the cost

  39. k-means++ analysis • consider an optimal clustering C* • assume that k-means++ selects a center from a new optimal cluster • then k-means++ is 8-approximate in expectation • intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error • an inductive proof shows that the algorithm is O(log k)-approximate

  40. k-means++ proof: first cluster • fix an optimal clustering C* • the first center is selected uniformly at random • bound the total error of the points in the optimal cluster of the first center

  41. k-means++ proof: first cluster • let A be the first cluster • each point a_0 ∈ A is equally likely to be selected as the center ✦ expected error:
  $$\mathbb{E}[\phi(A)] = \frac{1}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \| a - a_0 \|^2 = 2 \sum_{a \in A} \| a - \bar{A} \|^2 = 2\, \phi^*(A)$$
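The middle equality is the standard decomposition around the mean; a short derivation, added here for completeness. For any fixed $a_0$,
$$\sum_{a \in A} \| a - a_0 \|^2 = \sum_{a \in A} \| a - \bar{A} \|^2 + |A| \, \| \bar{A} - a_0 \|^2,$$
since the cross term $\sum_{a \in A} (a - \bar{A})$ vanishes. Averaging this over the $|A|$ equally likely choices of $a_0$ yields
$$\mathbb{E}[\phi(A)] = \sum_{a \in A} \| a - \bar{A} \|^2 + \sum_{a_0 \in A} \| \bar{A} - a_0 \|^2 = 2\, \phi^*(A)$$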

  42. k-means++ proof: other clusters • suppose the next center is selected from a new cluster in the optimal clustering C* • bound the total error of that cluster
