
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 13



  1. Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 13 Jan-Willem van de Meent (credit: David Lopez-Paz, David Duvenaud, Laurens van der Maaten)

  2. Homework • Homework 3 is out today (due 4 Nov) • Homework 1 has been graded (we will grade Homework 2 a little faster) • Regrading policy: Step 1: e-mail the TAs to resolve simple problems (e.g. code not running). Step 2: e-mail the instructor to request regrading. We will regrade the entire problem set; the final grade can be lower than before.

  3. Review: PCA. Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$. Change of basis (to $z = (z_1, \ldots, z_k)^\top$): $z = U^\top x$, with $z_j = u_j^\top x$. Inverse change of basis: $\tilde{x} = U z$.

  4. Review: PCA. Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$. The basis vectors are eigenvectors of the covariance $C$ (for centered data, $C = \frac{1}{n} X X^\top$), obtained from the eigen-decomposition $C = U \Lambda U^\top$.

  5. Review: PCA. Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, the eigenvectors of the covariance, obtained from the eigen-decomposition $C = U \Lambda U^\top$. Claim: the eigenvectors of a symmetric matrix are orthogonal.

  6. Review: PCA (illustration of the principal components, from Stack Exchange).

  7. Review: PCA. Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, the eigenvectors of the covariance, obtained from the eigen-decomposition $C = U \Lambda U^\top$. Claim: the eigenvectors of a symmetric matrix are orthogonal.
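A one-line supporting argument for this claim (not spelled out on the slide), for eigenvectors $u_i, u_j$ of a symmetric matrix $C$ with distinct eigenvalues $\lambda_i \neq \lambda_j$:

```latex
\lambda_i\, u_i^\top u_j
  = (C u_i)^\top u_j
  = u_i^\top C^\top u_j
  = u_i^\top C u_j
  = \lambda_j\, u_i^\top u_j
\quad\Longrightarrow\quad
(\lambda_i - \lambda_j)\, u_i^\top u_j = 0
\quad\Longrightarrow\quad
u_i^\top u_j = 0 .
```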

  8. Review: PCA. Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$, the top $k$ eigenvectors of the covariance, giving the truncated decomposition $C \approx U \Lambda U^\top$ with $\Lambda \in \mathbb{R}^{k \times k}$.

  9. Review: PCA. Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$. Projection / encoding: $z = U^\top x$. Reconstruction / decoding: $\tilde{x} = U z$.
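As a concrete companion to the encode/decode formulas above, here is a minimal NumPy sketch; the function names (pca_fit, pca_encode, pca_decode), the toy data, and the assumption that X is centered are all illustrative, not from the slides.

```python
import numpy as np

def pca_fit(X, k):
    """X is d x n and assumed centered; returns the top-k basis U in R^{d x k}."""
    d, n = X.shape
    C = (X @ X.T) / n                      # covariance of the centered data
    eigvals, eigvecs = np.linalg.eigh(C)   # symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1]      # largest eigenvalues first
    return eigvecs[:, order[:k]]

def pca_encode(U, x):
    return U.T @ x                         # projection / encoding: z = U^T x

def pca_decode(U, z):
    return U @ z                           # reconstruction / decoding: x_tilde = U z

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 200))
X = X - X.mean(axis=1, keepdims=True)      # center the data
U = pca_fit(X, k=2)
Z = pca_encode(U, X)                       # 2 x 200 code matrix
X_rec = pca_decode(U, Z)                   # rank-2 reconstruction of X
```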

  10. Review: PCA. Top 2 components vs. bottom 2 components. Data: three varieties of wheat (Kama, Rosa, Canadian). Attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, length of groove.

  11. PCA: Complexity. Data: $X \in \mathbb{R}^{d \times n}$; truncated basis $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$. Using the eigenvalue decomposition: • Computation of the covariance $C$: $O(n d^2)$ • Eigenvalue decomposition: $O(d^3)$ • Total complexity: $O(n d^2 + d^3)$

  12. PCA: Complexity. Data: $X \in \mathbb{R}^{d \times n}$; truncated basis $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$. Using the singular value decomposition: • Full decomposition: $O(\min\{n d^2, n^2 d\})$ • Rank-$k$ decomposition: $O(k\,d\,n \log n)$ (with the power method)
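A hedged sketch of the SVD route for the same computation, assuming a centered data matrix: the top-k principal directions are the first k left singular vectors of X, so the d x d covariance never has to be formed explicitly.

```python
import numpy as np

def pca_basis_via_svd(X, k):
    """X is d x n and assumed centered; returns the truncated basis in R^{d x k}."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # full SVD: O(min{n d^2, n^2 d})
    return U[:, :k]                                   # first k left singular vectors

# For the rank-k cost quoted on the slide one would use an iterative or
# randomized truncated SVD instead of the full decomposition; the full SVD
# above is just the simplest way to check the equivalence with PCA.
```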


  13. Singular Value Decomposition. Idea: decompose a $d \times d$ matrix $M$ into 1. a change of basis $V$ (unitary matrix), 2. a scaling $\Sigma$ (diagonal matrix), 3. a change of basis $U$ (unitary matrix).

  14. Singular Value Decomposition. Idea: decompose the $d \times n$ matrix $X$ into 1. an $n \times n$ basis $V$ (unitary matrix), 2. a $d \times n$ matrix $\Sigma$ (diagonal projection), 3. a $d \times d$ basis $U$ (unitary matrix): $X = U_{d \times d}\, \Sigma_{d \times n}\, V^\top_{n \times n}$.
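A small NumPy check of the shapes described on this slide (the toy dimensions are arbitrary):

```python
import numpy as np

# X (d x n) factors as U (d x d, unitary) @ Sigma (d x n, diagonal) @ V^T (n x n, unitary).
d, n = 5, 8
X = np.random.default_rng(1).normal(size=(d, n))
U, s, Vt = np.linalg.svd(X, full_matrices=True)

Sigma = np.zeros((d, n))
Sigma[:min(d, n), :min(d, n)] = np.diag(s)   # singular values on the diagonal

assert np.allclose(X, U @ Sigma @ Vt)        # X = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(d))       # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(n))     # V is orthogonal
```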

  15. Random Projections. Borrowing from: David Lopez-Paz & David Duvenaud.

  16. Random Projections. Fast, efficient, and distance-preserving dimensionality reduction! A random linear map $w \in \mathbb{R}^{40500 \times 1000}$ takes points $x_1, x_2 \in \mathbb{R}^{40500}$ to $y_1, y_2 \in \mathbb{R}^{1000}$ with $\|y_1 - y_2\| \approx (1 \pm \epsilon)\,\|x_1 - x_2\|$, i.e. $(1 - \epsilon)\,\|x_1 - x_2\|^2 \leq \|y_1 - y_2\|^2 \leq (1 + \epsilon)\,\|x_1 - x_2\|^2$. This result is formalized in the Johnson-Lindenstrauss Lemma.

  17. Johnson-Lindenstrauss Lemma. For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$, there exists $f : \mathbb{R}^N \to \mathbb{R}^k$ such that for all $u, v \in V$: $(1 - \epsilon)\,\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \epsilon)\,\|u - v\|^2$. The proof is a great example of Erdős' probabilistic method (1947). Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William B. Johnson (1944-).

  18. Johnson-Lindenstrauss Lemma. For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$, there exists $f : \mathbb{R}^N \to \mathbb{R}^k$ such that for all $u, v \in V$: $(1 - \epsilon)\,\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \epsilon)\,\|u - v\|^2$. This holds when $f$ is a linear function with random coefficients, $f = \frac{1}{\sqrt{k}} A$, with $A \in \mathbb{R}^{k \times N}$, $k < N$, and $A_{ij} \sim \mathcal{N}(0, 1)$.
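A minimal sketch of the random linear map from the lemma, with toy dimensions N and k and two random points standing in for the set V:

```python
import numpy as np

# f(u) = (1 / sqrt(k)) A u with A_ij ~ N(0, 1); N and k below are toy values,
# not the constants from the lemma.
rng = np.random.default_rng(0)
N, k = 10_000, 500
A = rng.normal(size=(k, N))
f = lambda u: (A @ u) / np.sqrt(k)

u = rng.normal(size=N)
v = rng.normal(size=N)
ratio = np.sum((f(u) - f(v)) ** 2) / np.sum((u - v) ** 2)
print(ratio)   # concentrates around 1: squared distances are roughly preserved
```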

  19. Example: 20-newsgroups data. Data: 20-newsgroups, from 100,000 features to 300 (0.3%).

  20. Example: 20-newsgroups data. Data: 20-newsgroups, from 100,000 features to 1,000 (1%).

  21. Example: 20-newsgroups data. Data: 20-newsgroups, from 100,000 features to 10,000 (10%).

  22. Example: 20-newsgroups data. Data: 20-newsgroups, from 100,000 features to 10,000 (10%). Conclusion: random projections preserve distances like PCA, but are faster than PCA when the number of dimensions is very large.
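A hedged sketch of how a reduction of this kind could be reproduced with scikit-learn; this is not the original experiment, and the TF-IDF features and the 300-dimensional target are illustrative choices:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import GaussianRandomProjection

# Bag-of-words style features for 20-newsgroups (on the order of 10^5 features),
# projected down to 300 random Gaussian dimensions.
texts = fetch_20newsgroups(subset="train").data
X = TfidfVectorizer().fit_transform(texts)
Z = GaussianRandomProjection(n_components=300).fit_transform(X)
print(X.shape, "->", Z.shape)
```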

  23. Stochastic Neighbor Embeddings. Borrowing from: Laurens van der Maaten (Delft -> Facebook AI).

  24. Manifold Learning. Idea: perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances).

  25. Manifold Learning

  26. PCA on MNIST Digits

  27. Swiss Roll. Euclidean distance is not always a good notion of proximity.

  28. Non-linear Projection Bad projection: relative position to neighbors changes

  29. Non-linear Projection Intuition: Want to preserve local neighborhood

  30. Stochastic Neighbor Embedding. Similarity in high dimension: $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$. Similarity in low dimension: $q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$.

  31. Stochastic Neighbor Embedding. Similarity of datapoints in high dimension: $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$. Similarity of datapoints in low dimension: $q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$. Cost function: $C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$. Idea: optimize the $y_i$ via gradient descent on $C$.

  32. Stochastic Neighbor Embedding. Similarity of datapoints in high dimension: $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$. Similarity of datapoints in low dimension: $q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$. Cost function: $C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$. Idea: optimize the $y_i$ via gradient descent on $C$.
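A direct NumPy transcription of these definitions (a sketch, not a full SNE implementation); X is assumed to be n x d with one row per point, Y is the low-dimensional embedding, and sigma is a vector of per-point bandwidths:

```python
import numpy as np

def p_conditional(X, sigma):
    """p_{j|i}: Gaussian similarities in the high-dimensional space."""
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # ||x_i - x_j||^2
    P = np.exp(-D / (2.0 * sigma[:, None] ** 2))                # row i uses sigma_i
    np.fill_diagonal(P, 0.0)                                    # exclude k = i
    return P / P.sum(axis=1, keepdims=True)

def q_conditional(Y):
    """q_{j|i}: similarities in the low-dimensional embedding (unit bandwidth)."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def sne_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) = sum_{i,j} p_{j|i} log(p_{j|i} / q_{j|i})."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```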

  33. Stochastic Neighbor Embedding. The gradient has a surprisingly simple form: $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$. The gradient update with a momentum term is given by $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \beta(t)\,(Y^{(t-1)} - Y^{(t-2)})$.

  34. Stochastic Neighbor Embedding. The gradient has a surprisingly simple form: $\frac{\partial C}{\partial y_i} = \sum_{j \neq i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$. The gradient update with a momentum term is given by $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \beta(t)\,(Y^{(t-1)} - Y^{(t-2)})$. Problem: $p_{j|i}$ is not equal to $p_{i|j}$.
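A sketch of the gradient and the momentum update, reusing the conditional matrices from the previous sketch (indexed so that P[i, j] = p_{j|i}); the update is written here as a descent step on the cost C, and the step size and momentum values are illustrative:

```python
import numpy as np

def sne_grad(P, Q, Y):
    """dC/dy_i = sum_{j != i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    coeff = (P - Q) + (P - Q).T              # coeff[i, j] = p_{j|i}-q_{j|i}+p_{i|j}-q_{i|j}
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        grad[i] = coeff[i] @ (Y[i] - Y)      # sum over j of coeff_ij * (y_i - y_j)
    return grad

def sne_step(Y, Y_prev, P, Q, eta=0.1, beta=0.5):
    """One descent step on C with a momentum term; returns (Y_new, Y)."""
    Y_new = Y - eta * sne_grad(P, Q, Y) + beta * (Y - Y_prev)
    return Y_new, Y
```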

  35. Symmetric SNE. Minimize a single KL divergence between joint probability distributions: $C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$. The obvious way to redefine the pairwise similarities is $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$ and $q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_l - y_k\|^2)}$.

  36. Symmetric SNE. Minimize a single KL divergence between joint probability distributions: $C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$. The obvious way to redefine the pairwise similarities is $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$ and $q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_l - y_k\|^2)}$. Problem: how should we choose $\sigma$?

  37. Choosing the bandwidth. $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$. Bad $\sigma$: the neighborhood is not local in the manifold.

  38. Choosing the bandwidth. $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$. Good $\sigma$: the neighborhood contains 5-50 points.

  39. Choosing the bandwidth. $p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$. Problem: the optimal $\sigma$ may vary if the density is not uniform.

  40. Choosing the bandwidth. Solution: define $\sigma_i$ per point, with $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$ and $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$.

  41. Choosing the bandwidth. Solution: define $\sigma_i$ per point, with $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$ and $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$.

  42. Choosing the bandwidth. Solution: define $\sigma_i$ per point, with $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$ and $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$.
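Continuing the earlier NumPy sketch, the symmetrized joint similarities with per-point bandwidths could be computed as follows (joint_p is an illustrative name):

```python
import numpy as np

def joint_p(X, sigma):
    """p_ij = (p_{j|i} + p_{i|j}) / (2N), with per-point bandwidths sigma_i."""
    n = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared distances
    P = np.exp(-D / (2.0 * sigma[:, None] ** 2))                # row i uses sigma_i
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)                        # conditional p_{j|i}
    return (P + P.T) / (2.0 * n)                                # symmetrized joint p_ij
```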

  43. Choosing the bandwidth Set σ i to ensure constant perplexity
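A minimal sketch of how a perplexity target can pin down each $\sigma_i$ by binary search; the target value, tolerance, and search bounds are illustrative choices, not from the slides:

```python
import numpy as np

def perplexity(p):
    """Perplexity 2^{H(P_i)} of a discrete distribution p (entropy in bits)."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def find_sigma(sq_dists_i, target=30.0, tol=1e-4, max_iter=50):
    """Binary search for sigma_i so that P_i has roughly the target perplexity.
    sq_dists_i: squared distances from point i to every other point."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p = p / p.sum()
        perp = perplexity(p)
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma        # distribution too flat: shrink the bandwidth
        else:
            lo = sigma        # distribution too peaked: widen the bandwidth
    return sigma
```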
