
Lecture 26: Clustering - Prof. Julia Hockenmaier



  1. CS446 Introduction to Machine Learning (Spring 2015), University of Illinois at Urbana-Champaign, http://courses.engr.illinois.edu/cs446. Lecture 26: Clustering. Prof. Julia Hockenmaier, juliahmr@illinois.edu

  2. Clustering. What should a clustering algorithm achieve? – A cluster is a set of entities that are alike. – Entities in different clusters are not alike. What does "alike" mean? This depends on the application/task.

  3. Clustering. Can we formalize this? A cluster is a set of points such that the distance between any two points in the same cluster is less than the distance between any point in the cluster and any point not in it. The distance metric has to be appropriate for the task at hand.

  4. Distance Measures for Clustering

  5. Distance measures. In studying clustering techniques we will assume that we are given a matrix of distances between all pairs of data points, i.e. an m × m matrix whose entry (i, j) is d(x_i, x_j) for points x_1, …, x_m.
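
A minimal NumPy sketch (not from the lecture) of how such a pairwise distance matrix can be computed; the function name `distance_matrix` and the sample data are illustrative only:

```python
# Sketch: building the m x m matrix of pairwise Euclidean distances.
import numpy as np

def distance_matrix(X):
    """Return D with D[i, j] = Euclidean distance d(x_i, x_j)."""
    # X has shape (m, d): one row per data point.
    diffs = X[:, None, :] - X[None, :, :]      # shape (m, m, d)
    return np.sqrt((diffs ** 2).sum(axis=-1))  # shape (m, m)

X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
print(distance_matrix(X))  # D[0, 1] = 5.0, diagonal is 0
```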

  6. Distance measures. A distance measure d: R^d × R^d → R is a function that satisfies d(x, y) ≥ 0, with d(x, y) > 0 ⇔ x ≠ y and d(x, y) = 0 ⇔ x = y. d is a metric if it also satisfies: – the triangle inequality: d(x, y) + d(y, z) ≥ d(x, z), and – symmetry: d(x, y) = d(y, x).
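
As an illustration (my own sketch, not course code), the metric axioms can be spot-checked numerically on a finite sample of points; `is_metric_on_sample` is a hypothetical helper:

```python
# Sketch: numerically spot-check the metric axioms for a candidate distance d.
import numpy as np

def is_metric_on_sample(d, points, tol=1e-9):
    """Check non-negativity, d(x,y)=0 iff x=y, symmetry, triangle inequality."""
    for x in points:
        for y in points:
            if d(x, y) < -tol:
                return False                          # non-negativity
            if np.allclose(x, y) != (d(x, y) <= tol):
                return False                          # d(x, y) = 0  iff  x = y
            if abs(d(x, y) - d(y, x)) > tol:
                return False                          # symmetry
            for z in points:
                if d(x, y) + d(y, z) < d(x, z) - tol:
                    return False                      # triangle inequality
    return True

euclidean = lambda x, y: np.linalg.norm(x - y)
pts = [np.random.randn(3) for _ in range(5)]
print(is_metric_on_sample(euclidean, pts))            # True
```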

  7. Distance measures.
   Euclidean (L2): d(x, y) = ( ∑_{i=1}^{d} (x_i − y_i)² )^(1/2)
   Manhattan (L1): d(x, y) = ||x − y||_1 = ∑_{i=1}^{d} |x_i − y_i|
   Infinity (sup) distance (L∞): d(x, y) = max_{1 ≤ i ≤ d} |x_i − y_i|
   Note that L∞ ≤ L2 ≤ L1, but different distances do not induce the same ordering on points.
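
A short sketch (assumed helper code, not from the slides) implementing the three distances; the sample points mirror the example on the next slide, where the two coordinates differ by 2 and 4:

```python
# Sketch: the three distances from this slide.
import numpy as np

def euclidean(x, y):   # L2
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):   # L1
    return np.sum(np.abs(x - y))

def sup_dist(x, y):    # L-infinity
    return np.max(np.abs(x - y))

x = np.array([0.0, 0.0])
y = np.array([-2.0, 4.0])
print(euclidean(x, y))   # sqrt(4 + 16) ≈ 4.47
print(manhattan(x, y))   # 6.0
print(sup_dist(x, y))    # 4.0
```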

  8. Distance measures. Example: x = (x_1, x_2), y = (x_1 − 2, x_2 + 4). Euclidean: (4² + 2²)^(1/2) ≈ 4.47. Manhattan: 4 + 2 = 6. Sup: max(4, 2) = 4.

  9. Distance Measures. Different distances do not induce the same ordering on points. Example: suppose a and b differ by 5 in one coordinate and by a small ε in the other, while c and d differ by 4 in both coordinates. Then
   L∞(a, b) = 5,  L2(a, b) = (5² + ε²)^(1/2) ≈ 5
   L∞(c, d) = 4,  L2(c, d) = (4² + 4²)^(1/2) = 4√2 ≈ 5.66
   so L∞(c, d) < L∞(a, b) but L2(c, d) > L2(a, b).

  10. Distance measures. Clustering is sensitive to the distance measure. Sometimes it is beneficial to use a distance measure that is invariant to transformations that are natural to the problem: – Mahalanobis distance: shift and scale invariance

  11. Mahalanobis Distance. d(x, y) = ( (x − y)ᵀ Σ⁻¹ (x − y) )^(1/2), where Σ is the (symmetric) covariance matrix of the data:
   µ = (1/m) ∑_{i=1}^{m} x_i (the mean of the data)
   Σ = (1/m) ∑_{i=1}^{m} (x_i − µ)(x_i − µ)ᵀ, a matrix of size d × d
   This effectively translates all axes to mean 0 and variance 1 (shift and scale invariance).
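
A hedged sketch of the Mahalanobis distance with the covariance matrix estimated from a data matrix; the synthetic data and variable names are assumptions for illustration:

```python
# Sketch: Mahalanobis distance with Sigma estimated from the data itself.
import numpy as np

def mahalanobis(x, y, cov):
    """d(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y))."""
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

# Estimate mu and Sigma from a data matrix X of shape (m, d).
X = np.random.randn(200, 2) * np.array([1.0, 10.0])  # second axis has larger variance
mu = X.mean(axis=0)
cov = (X - mu).T @ (X - mu) / X.shape[0]              # d x d covariance matrix

# A gap of 5 along the high-variance axis counts for less than
# a gap of 5 along the low-variance axis.
print(mahalanobis(np.array([0.0, 0.0]), np.array([0.0, 5.0]), cov))
print(mahalanobis(np.array([0.0, 0.0]), np.array([5.0, 0.0]), cov))
```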

  12. Distance measures. Some algorithms require distances between a point x and a set of points A, written d(x, A). This might be defined, e.g., as the min/max/avg distance between x and any point in A. Others require distances between two sets of points A and B, written d(A, B). This might be defined, e.g., as the min/max/avg distance between any point in A and any point in B.
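
One possible way to code these point-to-set and set-to-set distances (a sketch under the assumption that min/max/avg are the aggregations of interest):

```python
# Sketch: point-to-set and set-to-set distances built on a point distance d.
import numpy as np

d = lambda x, y: np.linalg.norm(x - y)     # any point distance works here

def point_to_set(x, A, agg=min):
    """d(x, A): min (or max / mean) distance from x to a point in A."""
    return agg(d(x, a) for a in A)

def set_to_set(A, B, agg=min):
    """d(A, B): min (or max / mean) distance over all pairs (a, b)."""
    return agg(d(a, b) for a in A for b in B)

A = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
B = [np.array([3.0, 0.0]), np.array([5.0, 0.0])]
print(point_to_set(np.array([2.0, 0.0]), A))          # 1.0
print(set_to_set(A, B), set_to_set(A, B, agg=max))    # 2.0, 5.0
```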

  13. Clustering Methods

  14. Clustering Methods Do the clusters partition the data? – Hard (yes) vs. soft clustering (no) Do the clusters have structure? – Hierarchical (yes) vs. flat clustering (no) Is the hierarchy induced top-down or bottom-up? – Top-down: divisive – Bottom-up: agglomerative How do we represent the data points? – As vectors or as vertices in a graph?

  15. Graph-based clustering. Each data point is a vertex in an undirected graph. Edge weights correspond to (non-zero) similarities, not distances: 0 ≤ sim(x, y) ≤ 1. Clustering = graph partitioning! – Graph cuts, minimum spanning trees, etc.
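
A small illustrative sketch of building such a similarity graph; the choice sim(x, y) = exp(−d(x, y)) and the pruning threshold are my own assumptions, not something the slide prescribes:

```python
# Sketch: turning pairwise distances into edge weights sim(x, y) in (0, 1]
# and keeping only edges above a threshold.
import numpy as np

def similarity_graph(X, threshold=0.5):
    """Return an adjacency dict {i: {j: sim}} using sim = exp(-distance)."""
    m = len(X)
    graph = {i: {} for i in range(m)}
    for i in range(m):
        for j in range(i + 1, m):
            sim = float(np.exp(-np.linalg.norm(X[i] - X[j])))  # 0 < sim <= 1
            if sim >= threshold:               # drop weak (near-zero) edges
                graph[i][j] = graph[j][i] = sim
    return graph

X = np.array([[0.0, 0.0], [0.3, 0.0], [5.0, 5.0], [5.2, 5.1]])
print(similarity_graph(X))  # two tight pairs, no cross edges
```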

  16. Vector-based clustering. Each data point is a vector in a vector space. We can define a distance metric in this space.

  17. (Hard) clustering. Given an unlabeled dataset D = {x_1, …, x_N}, a distance metric d(x, x′) over pairs of points, and a clustering algorithm A, return a partition C of D. – Partition: a set of sets C = {C_1, …, C_k} such that each element of D belongs to exactly one C_i.
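
A tiny sketch (hypothetical helper, not course code) that checks whether a proposed clustering C is indeed a partition of D:

```python
# Sketch: verify that a clustering C is a partition of D,
# i.e. every point of D lies in exactly one cluster.
def is_partition(D, C):
    seen = []
    for cluster in C:
        seen.extend(cluster)
    # every element covered exactly once, and nothing outside D
    return sorted(seen) == sorted(D)

D = ["x1", "x2", "x3", "x4"]
print(is_partition(D, [["x1", "x2"], ["x3", "x4"]]))        # True (a partition)
print(is_partition(D, [["x1", "x2"], ["x2", "x3", "x4"]]))  # False (x2 appears twice)
```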

  18. Hierarchical Clustering. Hierarchical clustering is a nested sequence of partitions. Agglomerative: place each object in its own cluster and gradually merge the atomic clusters into larger and larger clusters. Divisive: start with all objects in one cluster and subdivide it into smaller clusters. Example of a nested (agglomerative) sequence over objects a, b, c, d, e: {(a),(b),(c),(d),(e)} → {(a,b),(c),(d),(e)} → {(a,b),(c,d),(e)} → {(a,b,c,d),(e)} → {(a,b,c,d,e)}

  19. Agglomerative Clustering. Assume a distance measure between points, d(x_1, x_2), and define a distance measure between clusters, D(C_1, C_2). Algorithm: – Initialization: put each point in a separate cluster. – At each stage, merge the two D-closest clusters. Different definitions of D (for the same d) give rise to radically different partitions of the data.
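
A minimal sketch of this generic agglomerative loop, assuming a naive implementation and single-link as one possible choice of the cluster distance D; it is illustrative, not the course's reference code:

```python
# Sketch: repeatedly merge the two D-closest clusters until k remain.
import numpy as np

def agglomerative(points, cluster_dist, k):
    """points: list of vectors; cluster_dist(A, B): distance between two
    clusters (lists of vectors); k: desired number of clusters."""
    clusters = [[p] for p in points]             # each point starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = cluster_dist(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Single-link cluster distance as one possible choice of D.
single_link = lambda A, B: min(np.linalg.norm(a - b) for a in A for b in B)
pts = [np.array(p) for p in [(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)]]
print(agglomerative(pts, single_link, k=3))
```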

  20. Agglomerative Clustering.
   Single link clustering: define the cluster distance as the distance of the closest pair: D_SL(C1, C2) = min_{x1 ∈ C1, x2 ∈ C2} d(x1, x2)
   Complete link clustering: define the cluster distance as the distance of the furthest pair: D_CL(C1, C2) = max_{x1 ∈ C1, x2 ∈ C2} d(x1, x2)
   Group average clustering: define the cluster distance as the average distance over all pairs: D_GA(C1, C2) = avg_{x1 ∈ C1, x2 ∈ C2} d(x1, x2)
   Error-sum-of-squares clustering (Ward): ESS(C) = ∑_{x ∈ C} ||x − η_C||², where η_C is the cluster mean; D_ESS(C1, C2) = ESS(C1 ∪ C2) − ESS(C1) − ESS(C2)
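
The four cluster-distance definitions above, written out as a sketch (assumed code, with d taken to be Euclidean):

```python
# Sketch: the four cluster-distance definitions from this slide.
import numpy as np

d = lambda x, y: np.linalg.norm(x - y)

def D_SL(C1, C2):   # single link: closest pair
    return min(d(x1, x2) for x1 in C1 for x2 in C2)

def D_CL(C1, C2):   # complete link: furthest pair
    return max(d(x1, x2) for x1 in C1 for x2 in C2)

def D_GA(C1, C2):   # group average: mean over all pairs
    return np.mean([d(x1, x2) for x1 in C1 for x2 in C2])

def ESS(C):         # error sum of squares around the cluster mean
    eta = np.mean(C, axis=0)
    return sum(np.sum((x - eta) ** 2) for x in C)

def D_ESS(C1, C2):  # Ward: increase in ESS caused by merging C1 and C2
    return ESS(C1 + C2) - ESS(C1) - ESS(C2)

C1 = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
C2 = [np.array([4.0, 0.0]), np.array([6.0, 0.0])]
print(D_SL(C1, C2), D_CL(C1, C2), D_GA(C1, C2), D_ESS(C1, C2))
```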

  21. Association-Dissociation. Given a collection of points, one way to define the goal of a clustering process is to use two measures: – a measure of similarity within a group of points, and – a measure of similarity between different groups. Ideally, we would like to define these so that the within-group similarity can be maximized and the between-group similarity minimized at the same time. This turns out to be hard, so we often optimize only one of these objectives.
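
An illustrative sketch of what "within" and "between" similarity could look like for a hard clustering; the similarity function exp(−distance) and the simple averaging are my assumptions, not definitions from the lecture:

```python
# Sketch: average within-cluster and between-cluster similarity.
import numpy as np
from itertools import combinations

sim = lambda x, y: float(np.exp(-np.linalg.norm(x - y)))

def within_similarity(clusters):
    pairs = [sim(a, b) for C in clusters for a, b in combinations(C, 2)]
    return np.mean(pairs) if pairs else 1.0

def between_similarity(clusters):
    pairs = [sim(a, b)
             for C1, C2 in combinations(clusters, 2)
             for a in C1 for b in C2]
    return np.mean(pairs) if pairs else 0.0

clusters = [[np.array([0.0, 0.0]), np.array([0.5, 0.0])],
            [np.array([5.0, 5.0]), np.array([5.5, 5.0])]]
print(within_similarity(clusters))    # high: points in a cluster are alike
print(between_similarity(clusters))   # low: points across clusters are not
```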
