
Machine Learning 2, DS 4420 (Spring 2020): Clustering I. Byron C. Wallace



  1. Machine Learning 2 DS 4420 - Spring 2020 Clustering I Byron C. Wallace

  2. Unsupervised learning • So far we have reviewed some fundamentals, discussed Maximum Likelihood Estimation (MLE) for probabilistic models, and covered neural networks, backpropagation, and SGD • We have mostly considered supervised settings (implicitly), although the above methods are general; we will shift focus to unsupervised learning for a few weeks • Both the probabilistic and neural perspectives will continue to be relevant here, and we will consider the former explicitly for clustering next week

  5. Clustering

  6. Clustering Unsupervised learning (no labels for training) Group data into clusters that • Maximize intra-cluster similarity (members of the same cluster are similar) • Minimize inter-cluster similarity (members of different clusters are dissimilar)

  8. What is a natural grouping? The choice of clustering criterion can be task-dependent. [Figure: the same set of Simpsons characters can be grouped as Simpson's Family vs. School Employees, or as Females vs. Males.]

  11. Defining Distance Measures [Figure: possible distances between the names “Peter” and “Piotr”: 3, 0.2, 342.7, depending on the measure.] Dissimilarity/distance: $d(x_1, x_2)$; Similarity: $s(x_1, x_2)$; collectively referred to as Proximity: $p(x_1, x_2)$

  14. Distance Measures: Euclidean distance $\sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$; Manhattan distance $\sum_{i=1}^{k} |x_i - y_i|$; Minkowski distance $\left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}$

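A minimal Python/NumPy sketch of these three distances (the function names and example vectors are mine, not from the course materials):

import numpy as np

def euclidean(x, y):
    # square root of the summed squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def minkowski(x, y, q):
    # generalizes the other two: q=1 gives Manhattan, q=2 gives Euclidean
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(euclidean(x, y), manhattan(x, y), minkowski(x, y, q=2))  # first and last agree
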
  17. Similarity over functions of inputs • The preceding measures are distances defined on the original input space X • A better representation may be some function φ(x) of these features

  18. Similarity: Kernels • Linear (inner product) • Polynomial • Radial Basis Function (RBF)

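A sketch of these three kernels as plain NumPy functions; the degree, offset, and gamma values are illustrative defaults of mine, not values from the slides:

import numpy as np

def linear_kernel(x, z):
    # plain inner product <x, z>
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, c=1.0):
    # (<x, z> + c)^degree
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2): similarity decays with squared distance
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 0.0])
z = np.array([0.5, 0.5])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
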
  19. [Figure: two plots of First feature vs. Second feature comparing a linear kernel and an RBF kernel. Figure from the MML book.]

  20. Why kernels? “The key insight in kernel-based learning is that you can rewrite many linear models in a way that doesn’t require you to ever explicitly compute φ(x).” - Daumé, CIML

  25. Similarities vs. Distance Measures. A distance measure satisfies: • D(A, B) = D(B, A) (Symmetry) • D(A, A) ≥ 0 (Reflexivity) • D(A, B) = 0 iff A = B (Positivity / Separation) • D(A, B) ≤ D(A, C) + D(B, C) (Triangle inequality). Similarity functions: • Less formal; encode some notion of similarity but are not necessarily well defined • Can be negative • May not satisfy the triangle inequality

  26. Cosine similarity

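The slide gives only the title; for reference, cosine similarity is the inner product of two vectors divided by the product of their norms. A one-function sketch:

import numpy as np

def cosine_similarity(x, y):
    # cos of the angle between x and y: <x, y> / (||x|| * ||y||), in [-1, 1]
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 1.0]), np.array([2.0, 2.0])))  # ≈ 1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: orthogonal
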
  27. Four Types of Clustering 1. Centroid-based (K-means, K-medoids)

  28. Four Types of Clustering 2. Connectivity-based (Hierarchical) Notion of Clusters: Cut off dendrogram at some depth

  29. Four Types of Clustering 3. Density-based (DBSCAN, OPTICS) Notion of Clusters: Connected regions of high density

  30. Four Types of Clustering 4. Distribution-based (Mixture Models) Notion of Clusters: Distributions on features

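For a concrete sense of these four families, each has a standard scikit-learn estimator; a hedged sketch of running one of each on toy data (the parameter values are arbitrary illustrations, not recommendations from the slides):

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# toy 2D data with three blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

results = {
    "centroid-based (K-means)": KMeans(n_clusters=3, n_init=10).fit_predict(X),
    "connectivity-based (hierarchical)": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "density-based (DBSCAN)": DBSCAN(eps=0.8, min_samples=5).fit_predict(X),
    "distribution-based (Gaussian mixture)": GaussianMixture(n_components=3).fit_predict(X),
}
for name, labels in results.items():
    # DBSCAN may also emit the label -1 for points it treats as noise
    print(name, "->", len(set(labels)), "distinct labels")
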
  31. K-Means clustering (board)

  32. K-means Algorithm
     Input: data $X = \{x_1, x_2, \ldots, x_N\}$, number of clusters $K$
     Initialize: $K$ random centroids $\mu_1, \mu_2, \ldots, \mu_K$
     Repeat until convergence:
       1. For $i = 1, \ldots, K$: $C_i = \{x \in X \mid i = \arg\min_{1 \le j \le K} \lVert x - \mu_j \rVert^2\}$
       2. For $i = 1, \ldots, K$: $\mu_i = \arg\min_z \sum_{x \in C_i} \lVert z - x \rVert^2$
     Output: $C_1, C_2, \ldots, C_K$

  36. K-means Clustering (algorithm: K-means, distance metric: Euclidean distance) [Figure: 2D scatter plot with three randomly initialized centroids μ1, μ2, μ3.] Randomly initialize K centroids μk

  37. K-means Clustering [Figure: one assignment/update iteration.] Assign each point to the closest centroid, then update each centroid to the average of its assigned points

  38. K-means Clustering [Figure: a further assignment/update iteration.] Assign each point to the closest centroid, then update each centroid to the average of its assigned points

  39. K-means Clustering [Figure: a later iteration.] Repeat until convergence (no points reassigned, means unchanged)

  40. K-means Clustering [Figure: the final iteration.] Repeat until convergence (no points reassigned, means unchanged)

  41. K-means Algorithm
     Input: data $X = \{x_1, x_2, \ldots, x_N\}$, number of clusters $K$
     Initialize: $K$ random centroids $\mu_1, \mu_2, \ldots, \mu_K$
     Repeat until convergence:
       1. For $i = 1, \ldots, K$: $C_i = \{x \in X \mid i = \arg\min_{1 \le j \le K} \lVert x - \mu_j \rVert^2\}$
       2. For $i = 1, \ldots, K$: $\mu_i = \arg\min_z \sum_{x \in C_i} \lVert z - x \rVert^2$
     Output: $C_1, C_2, \ldots, C_K$
     • K-means: set $\mu_i$ to the mean of the points in $C_i$
     • K-medoids: set $\mu_i = x$ for the point $x \in C_i$ with minimum SSE (so the medoid is an actual data point)

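A small sketch of the K-medoids update step described in the last bullet: within each cluster, pick the member point whose summed squared distance to the other members is smallest (the function name and toy data are mine):

import numpy as np

def medoid(cluster):
    # cluster: (n, d) array of points assigned to one cluster.
    # Return the member with minimum summed squared distance to the rest
    # (the "minimum SSE" point); unlike the mean, it is an actual data point.
    sq_dists = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(-1)
    return cluster[sq_dists.sum(axis=1).argmin()]

C = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
print(medoid(C))  # [1. 1.]; the outlier pulls the mean to (1.4, 1.4) but not the medoid
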
  42. Let's see some examples in Python

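The deck cuts to a live notebook here; as a stand-in, here is a minimal NumPy sketch of the loop from slide 32 on toy data (the function name and the toy blobs are mine):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: K random centroids (here, K distinct data points)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # shape (N, K)
        assign = d2.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        new_mu = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else mu[i]
                           for i in range(K)])
        if np.allclose(new_mu, mu):                            # converged: means unchanged
            return assign, mu
        mu = new_mu
    return assign, mu

# toy data: three Gaussian blobs in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 3), (0, 3)]])
labels, centroids = kmeans(X, K=3)
print(centroids.round(2))
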
  43. “Good” Initialization of Centroids [Figure: six panels showing K-means iterations 1 through 6 on a 2D dataset (x vs. y); starting from a good initialization, the centroids (marked “+”) converge to the natural clusters.]

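Because the final clustering depends on where the centroids start, a quick way to see this (and the standard remedy) is to run K-means from several random initializations and compare the within-cluster SSE. A sketch using scikit-learn on toy data; the parameter choices are illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)

# A single random initialization (n_init=1) can land in a poor local optimum;
# compare the final within-cluster SSE (inertia_) across a few random seeds.
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  SSE={km.inertia_:.1f}")

# Keeping the run with the lowest SSE (or increasing n_init) is the usual remedy.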