
Lecture 22: Clustering, Distance Measures, K-Means (Aykut Erdem, May 2016, Hacettepe University)



  1. Lecture 22: − Clustering − Distance measures − K-Means Aykut Erdem May 2016 Hacettepe University

  2. Last time… Boosting • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote • On each iteration t: - weight each training example by how incorrectly it was classified - learn a hypothesis h_t - assign a strength α_t to this hypothesis • Final classifier: a linear combination of the votes of the different classifiers, weighted by their strengths • Practically useful • Theoretically interesting (slide by Aarti Singh & Barnabas Poczos)

  3. Last time… The AdaBoost Algorithm (slide by Jiri Matas and Jan Šochman)

  4. This week • Clustering • Distance measures • K-Means • Spectral clustering • Hierarchical clustering • What is a good clustering? 4

  5. Distance measures 5

  6. Distance measures • In studying clustering techniques we will assume that we are given a matrix of distances between all pairs of data points x_1, …, x_m: the entry in row i and column j is d(x_i, x_j). (slide by Julia Hockenmaier)
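The following is a minimal NumPy sketch of building such a pairwise distance matrix; the toy data matrix X and the choice of Euclidean distance are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: m points (rows) in d dimensions (columns).
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [3.0, 1.0]])

# D[i, j] = d(x_i, x_j); the matrix is symmetric with a zero diagonal.
diffs = X[:, None, :] - X[None, :, :]      # shape (m, m, d)
D = np.sqrt((diffs ** 2).sum(axis=-1))     # shape (m, m)
print(D)
```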

  7. What is Similarity/Dissimilarity? • Hard to define! But we know it when we see it. • The real meaning of similarity is a philosophical question; we will take a more pragmatic approach. • It depends on the representation and the algorithm. For many representations/algorithms, it is easier to think in terms of a distance (rather than a similarity) between vectors. (slide by Eric Xing)

  8. Defining Distance Measures • Definition: Let O_1 and O_2 be two objects from the universe of possible objects. The distance (dissimilarity) between O_1 and O_2 is a real number denoted by D(O_1, O_2). (Figure: two gene expression profiles, gene1 and gene2, with example distance values 0.23, 3, 342.7.) (slide by Andrew Moore)

  9. A few examples • Euclidean distance: d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} • Correlation coefficient: s(x, y) = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y} - a similarity rather than a distance - can detect similar trends even when magnitudes differ (slide by Andrew Moore)
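As a quick illustration of the two measures, here is a small sketch on made-up vectors; note that correlation is a similarity (higher means more alike), while Euclidean distance is smaller for more similar vectors.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # same trend as x, different magnitude

# Euclidean distance: square root of the sum of squared coordinate differences.
d = np.sqrt(np.sum((x - y) ** 2))

# Correlation coefficient: centered dot product, normalized by the spreads.
s = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(d)   # about 5.48: the vectors are far apart in Euclidean terms
print(s)   # 1.0: but perfectly correlated, i.e. they follow the same trend
```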

  10. What properties should a distance measure have? • Symmetric - D(A, B) = D(B, A) - Otherwise, we could say A looks like B but B does not look like A • Positivity and self-similarity - D(A, B) ≥ 0, and D(A, B) = 0 iff A = B - Otherwise there will be different objects that we cannot tell apart • Triangle inequality - D(A, B) + D(B, C) ≥ D(A, C) - Otherwise one could say "A is like B, B is like C, but A is not like C at all" (slide by Alan Fern)

  11. Distance measures • Euclidean (L_2): d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} • Manhattan (L_1): d(x, y) = \lVert x - y \rVert_1 = \sum_{i=1}^{d} |x_i - y_i| • Infinity (Sup) distance (L_∞): d(x, y) = \max_{1 \le i \le d} |x_i - y_i| • Note that L_∞(x, y) ≤ L_2(x, y) ≤ L_1(x, y), but different distances do not induce the same ordering on points. (slide by Julia Hockenmaier)
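A short sketch comparing the three distances on one pair of points (the coordinates are made up to match the worked example on the next slide):

```python
import numpy as np

x = np.array([1.0, 5.0])
y = np.array([3.0, 1.0])
diff = np.abs(x - y)               # per-coordinate differences: [2, 4]

l1   = diff.sum()                  # Manhattan: 2 + 4 = 6
l2   = np.sqrt((diff ** 2).sum())  # Euclidean: sqrt(20) ≈ 4.47
linf = diff.max()                  # Sup distance: max(2, 4) = 4

# For any single pair of points, L_inf <= L_2 <= L_1.
assert linf <= l2 <= l1
```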

  12. Distance measures • Example: x = (x_1, x_2), y = (x_1 - 2, x_2 + 4), so the coordinates differ by 4 and 2. Euclidean: (4^2 + 2^2)^{1/2} ≈ 4.47; Manhattan: 4 + 2 = 6; Sup: max(4, 2) = 4. (slide by Julia Hockenmaier)

  13. Distance measures • Different distances do not induce the same ordering on points. Example (a and b differ by 5 along one axis and by a small ε along the other; c and d differ by 4 along each axis): L_∞(a, b) = 5, L_2(a, b) = (5^2 + ε^2)^{1/2} ≈ 5; L_∞(c, d) = 4, L_2(c, d) = (4^2 + 4^2)^{1/2} ≈ 5.66. So L_∞(c, d) < L_∞(a, b) but L_2(c, d) > L_2(a, b). (slide by Julia Hockenmaier)
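A quick numerical check of this example, assuming a and b differ by (5, ε) along the two axes and c and d differ by (4, 4); the actual coordinates are only stand-ins:

```python
import numpy as np

eps = 0.1
a, b = np.array([0.0, 0.0]), np.array([5.0, eps])
c, d = np.array([0.0, 0.0]), np.array([4.0, 4.0])

linf = lambda u, v: np.abs(u - v).max()
l2   = lambda u, v: np.sqrt(((u - v) ** 2).sum())

print(linf(a, b), linf(c, d))   # 5.0 vs 4.0  -> (a, b) are farther under L_inf
print(l2(a, b), l2(c, d))       # ~5.0 vs ~5.66 -> (c, d) are farther under L_2
```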

  14. Distance measures • Clustering is sensitive to the distance measure. • Sometimes it is beneficial to use a distance measure that is invariant to transformations that are natural to the problem: - Mahalanobis distance: ✓ shift and scale invariance (slide by Julia Hockenmaier)

  15. Mahalanobis Distance • d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}, where Σ is the (symmetric) covariance matrix of the data: μ = \frac{1}{m} \sum_{i=1}^{m} x_i (the average of the data), Σ = \frac{1}{m} \sum_{i=1}^{m} (x_i - μ)(x_i - μ)^T, a d × d matrix (d = dimensionality of the data points). • This effectively translates and rescales all axes to mean 0 and variance 1 (shift and scale invariance). (slide by Julia Hockenmaier)
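A minimal sketch of the Mahalanobis distance on made-up data; note the inverse covariance in the quadratic form, and that the data matrix here is purely illustrative.

```python
import numpy as np

# Toy data: 200 points stretched along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

mu = X.mean(axis=0)                    # mean of the data
Sigma = np.cov(X, rowvar=False)        # d x d covariance matrix
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, y):
    diff = x - y
    return np.sqrt(diff @ Sigma_inv @ diff)

# The same Euclidean displacement counts less along the high-variance axis:
print(mahalanobis(mu, mu + np.array([3.0, 0.0])))   # ~1: along the stretched axis
print(mahalanobis(mu, mu + np.array([0.0, 3.0])))   # ~6: along the compressed axis
```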

  16. Distance measures • Some algorithms require distances between a point x and a set of points A, written d(x, A). This might be defined e.g. as the min/max/avg distance between x and any point in A. • Others require distances between two sets of points A and B, written d(A, B). This might be defined e.g. as the min/max/avg distance between any point in A and any point in B (see the sketch below). (slide by Julia Hockenmaier)
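The min/max/avg variants mentioned above can be sketched as follows; the helper names are hypothetical, not from the slides.

```python
import numpy as np
from statistics import mean

euclid = lambda u, v: float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

def point_to_set(x, A, dist=euclid, agg=min):
    """d(x, A): aggregate the distances from x to every point in A."""
    return agg(dist(x, a) for a in A)

def set_to_set(A, B, dist=euclid, agg=min):
    """d(A, B): aggregate the distances over all pairs (a, b)."""
    return agg(dist(a, b) for a in A for b in B)

A = [(0, 0), (1, 1)]
B = [(4, 4), (5, 5)]
print(point_to_set((2, 2), A))        # min distance from the point to the set
print(set_to_set(A, B, agg=max))      # max ("complete-link" style) distance
print(set_to_set(A, B, agg=mean))     # average distance
```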

  17. Clustering algorithms • Partitioning algorithms - Construct various partitions and then evaluate them by some criterion - Examples: K-means, Mixture of Gaussians, Spectral Clustering • Hierarchical algorithms - Create a hierarchical decomposition of the set of objects using some criterion - Bottom-up: agglomerative - Top-down: divisive (slide by Eric Xing)

  18. Desirable Properties of a Clustering Algorithm • Scalability (in terms of both time and space) • Ability to deal with different data types • Minimal requirements for domain knowledge to determine input parameters • Ability to deal with noisy data • Interpretability and usability • Optional: incorporation of user-specified constraints (slide by Andrew Moore)

  19. K-Means 19

  20. K-Means • An iterative clustering algorithm - Initialize: pick K random points as cluster centers (means) - Alternate: assign each data instance to its closest mean, then move each mean to the average of its assigned points - Stop when no point's assignment changes (slide by David Sontag)

  21. K-Means • An iterative clustering algorithm - Initialize: pick K random points as cluster centers (means) - Alternate: assign each data instance to its closest mean, then move each mean to the average of its assigned points - Stop when no point's assignment changes (slide by David Sontag)
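A compact sketch of this procedure (random initialization, assign to the closest mean, recompute means, stop when no assignment changes); the function and variable names are my own, not from the slides.

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K random data points as the cluster centers (means).
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    while True:
        # Assign each data instance to its closest mean (O(KN) per iteration).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)  # (N, K)
        new_assign = dists.argmin(axis=1)
        # Stop when no point's assignment changes.
        if assign is not None and np.array_equal(new_assign, assign):
            return means, assign
        assign = new_assign
        # Move each mean to the average of its assigned points (O(N) per iteration).
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)

# Usage on a made-up 2-D dataset with two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```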

  22. K-Means Clustering: Example • Pick K random points as cluster centers (means) Shown here for K=2 slide by David Sontag 22

  23. K-Means Clustering: Example Iterative Step 1 • Assign data points to closest cluster centers slide by David Sontag 23

  24. K-Means Clustering: Example Iterative Step 2 • Change the cluster center to the average of the assigned points slide by David Sontag 24

  25. K-Means Clustering: Example • Repeat until convergence slide by David Sontag 25

  26. K-Means Clustering: Example slide by David Sontag 26

  27. K-Means Clustering: Example slide by David Sontag 27

  28. Properties of K-Means Algorithms • Guaranteed to converge in a finite number of iterations • Running time per iteration: 1. Assigning data points to the closest cluster center: O(KN) time 2. Moving each cluster center to the average of its assigned points: O(N) time (slide by David Sontag)

  29. K-Means Convergence Objective • K-Means performs alternating optimization of the within-cluster sum of squared distances: 1. Fix μ, optimize C (assign each point to its nearest mean) 2. Fix C, optimize μ (taking the partial derivative with respect to each μ_j and setting it to zero gives the average of the points assigned to cluster j) • Each step is guaranteed to decrease the objective, thus K-Means is guaranteed to converge (slide by Alan Fern)
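The objective being alternated on is the usual within-cluster sum of squared distances; a sketch of both steps in LaTeX, using C for the cluster assignments and μ for the means as on the slide (the exact formula shown in the slide's figure did not survive extraction):

```latex
% K-Means objective: total squared distance from each point to its assigned mean.
J(C, \mu) = \sum_{i=1}^{N} \lVert x_i - \mu_{C(i)} \rVert^2

% Step 1: fix \mu, optimize C -- assign every point to its nearest mean.
C(i) \leftarrow \operatorname*{arg\,min}_{j \in \{1,\dots,K\}} \lVert x_i - \mu_j \rVert^2

% Step 2: fix C, optimize \mu -- setting \partial J / \partial \mu_j = 0 yields the
% average of the points currently assigned to cluster j.
\mu_j \leftarrow \frac{1}{\lvert \{ i : C(i) = j \} \rvert} \sum_{i \,:\, C(i) = j} x_i
```

Each step can only decrease J, and there are finitely many possible assignments, which is why the alternation must converge.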

  30. Demo time… 30

  31. Example: K-Means for Segmentation • The goal of segmentation is to partition an image into regions, each of which has reasonably homogeneous visual appearance. (Figure: original image and segmentations for K = 2, K = 3, K = 10.) (slide by David Sontag)

  32. Example: K-Means for Segmentation (Figure: original image and segmentations for K = 2, K = 3, K = 10.) (slide by David Sontag)

  33. Example: K-Means for Segmentation (Figure: original image and segmentations for K = 2, K = 3, K = 10.) (slide by David Sontag)

  34. Example: Vector Quantization • FIGURE 14.9. Sir Ronald A. Fisher (1890-1962) was one of the founders of modern-day statistics, to whom we owe maximum likelihood, sufficiency, and many other fundamental concepts. The image on the left is a 1024 × 1024 grayscale image at 8 bits per pixel. The center image is the result of 2 × 2 block VQ, using 200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses only four code vectors, with a compression rate of 0.50 bits/pixel. [Figure from Hastie et al. book] (slide by David Sontag)
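A rough sketch of 2×2 block vector quantization via K-means, along the lines of the figure; the stand-in image, the scikit-learn KMeans call, and the codebook size are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in grayscale image (values in [0, 255]); the real example uses 1024 x 1024.
img = np.random.default_rng(0).integers(0, 256, size=(64, 64)).astype(float)
h, w = img.shape

# Cut the image into non-overlapping 2x2 blocks -> one 4-D vector per block.
blocks = img.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)

# Learn a codebook of 200 code vectors with K-means.
km = KMeans(n_clusters=200, n_init=10, random_state=0).fit(blocks)

# Replace every block by its nearest code vector and reassemble the image.
quantized = km.cluster_centers_[km.labels_]
recon = quantized.reshape(h // 2, w // 2, 2, 2).transpose(0, 2, 1, 3).reshape(h, w)
```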

  35. Bag of Words model • Example word-count vector for one document: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0 (slide by Carlos Guestrin)
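A tiny sketch of forming such a count vector for one document over a fixed vocabulary (both the vocabulary and the document here are made up):

```python
from collections import Counter

vocab = ["aardvark", "about", "all", "Africa", "apple", "anxious", "gas", "oil", "Zaire"]
doc = "about all about oil and gas exports from Africa all".split()

counts = Counter(doc)
bow = [counts.get(word, 0) for word in vocab]   # one count per vocabulary word
print(dict(zip(vocab, bow)))
```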

  36. (figure-only slide) (slide by Fei Fei Li)

  37. Object Bag of ‘words’ slide by Fei Fei Li 37

  38. Interest Point Features • Detect patches [Mikolajczyk and Schmid '02] [Matas et al. '02] [Sivic et al. '03] • Normalize patch • Compute SIFT descriptor [Lowe '99] (slide by Josef Sivic)

  39. Patch Features … (slide by Josef Sivic)

  40. Dictionary Formation … slide by Josef Sivic 40

  41. Clustering (usually K-means) … Vector quantization slide by Josef Sivic 41
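A hedged sketch of the dictionary-formation and vector-quantization steps for a bag of visual words, assuming the local patch descriptors (e.g. 128-D SIFT vectors) have already been computed; the arrays are random stand-ins and the KMeans call is from scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

# Descriptors pooled from many training images (random stand-ins for SIFT vectors).
train_descriptors = np.random.default_rng(0).normal(size=(5000, 128))

# Dictionary formation: cluster descriptors into K visual words (the codebook).
codebook = KMeans(n_clusters=100, n_init=4, random_state=0).fit(train_descriptors)

# Vector quantization: describe one image by a histogram over the visual words.
image_descriptors = np.random.default_rng(1).normal(size=(300, 128))
words = codebook.predict(image_descriptors)
hist = np.bincount(words, minlength=100).astype(float)
hist /= hist.sum()      # normalized bag-of-visual-words representation
```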

  42. Clustered Image Patches slide by Fei Fei Li 42
