  1. Machine Learning 2, DS 4420 - Spring 2020. Dimensionality reduction 2. Byron C. Wallace

  2. Today • A bit of wrap-up on PCA • Then: non-linear dimensionality reduction! (SNE/t-SNE)

  3. In Sum: Principal Component Analysis. Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Eigendecomposition of the covariance gives eigenvalues $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ and the corresponding eigenvectors. Idea: take the top-$k$ eigenvectors to maximize variance.

  4. Why? Idea: take the top-$k$ eigenvectors to maximize variance. Last time, we saw that we can derive this by maximizing the variance in the compressed space.

  5. Why? Idea: take the top-$k$ eigenvectors to maximize variance. Last time, we saw that we can derive this by maximizing the variance in the compressed space. We can also motivate it by explicitly minimizing reconstruction error.

  6. Minimizing reconstruction error
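The slide carries only the title; for reference, the standard objective (a reminder, not new material from the deck) is the following, for centered data and an orthonormal $U_k \in \mathbb{R}^{d \times k}$:

```latex
% PCA as reconstruction-error minimization (centered data, U_k^T U_k = I_k)
\min_{U_k \in \mathbb{R}^{d \times k},\; U_k^\top U_k = I_k}
  \;\frac{1}{n} \sum_{i=1}^{n} \bigl\| x_i - U_k U_k^\top x_i \bigr\|_2^2
```

By Pythagoras, $\|x_i\|^2 = \|U_k^\top x_i\|^2 + \|x_i - U_k U_k^\top x_i\|^2$, so minimizing reconstruction error is the same as maximizing the variance $\frac{1}{n}\sum_i \|U_k^\top x_i\|^2$ of the projections, which is the view from last lecture.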

  7. Getting the eigenvalues, two ways • Direct eigenvalue decomposition of the covariance matrix $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top$

  8. Getting the eigenvalues, two ways • Direct eigenvalue decomposition of the covariance matrix $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top = \frac{1}{N} X X^\top$ • Singular Value Decomposition (SVD)
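A minimal numpy sketch of the first route (variable names are assumptions; columns of `X` are data points, already centered):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)      # center each feature

S = (X @ X.T) / n                        # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # ascending eigenvalues, orthonormal eigenvectors

k = 2
top = np.argsort(eigvals)[::-1][:k]      # indices of the top-k eigenvalues
U_k = eigvecs[:, top]                    # principal directions (d x k)
Z = U_k.T @ X                            # compressed representation (k x n)
```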

  9. Singular Value Decomposition. Idea: decompose the $d \times n$ matrix $X$ into 1. an $n \times n$ basis $V$ (unitary matrix), 2. a $d \times n$ matrix $\Sigma$ (diagonal projection), 3. a $d \times d$ basis $U$ (unitary matrix): $X = U_{d \times d} \, \Sigma_{d \times n} \, V^\top_{n \times n}$
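A quick shape check of this factorization with numpy's full SVD; note that `np.linalg.svd` returns the singular values as a vector, so the $d \times n$ diagonal $\Sigma$ has to be rebuilt:

```python
import numpy as np

d, n = 5, 8
X = np.random.default_rng(1).normal(size=(d, n))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: d x d, s: min(d, n), Vt: n x n
Sigma = np.zeros((d, n))
Sigma[:len(s), :len(s)] = np.diag(s)              # embed singular values on the diagonal

assert np.allclose(X, U @ Sigma @ Vt)             # X = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(d))            # U unitary
assert np.allclose(Vt @ Vt.T, np.eye(n))          # V unitary
```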

  10. SVD, geometrically (rotation, scaling, rotation):
      1. Rotation: $V^\top \vec{x} = \sum_{i=1}^{n} \langle \vec{x}, \vec{v}_i \rangle \, \vec{e}_i$
      2. Scaling: $\Sigma V^\top \vec{x} = \sum_{i=1}^{n} s_i \langle \vec{x}, \vec{v}_i \rangle \, \vec{e}_i$
      3. Rotation: $U \Sigma V^\top \vec{x} = \sum_{i=1}^{n} s_i \langle \vec{x}, \vec{v}_i \rangle \, \vec{u}_i$

  11. SVD for PCA: $X = U \Sigma V^\top$, with shapes $D \times N = (D \times D)(D \times N)(N \times N)$. Then $S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma V^\top V \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top$, using $V^\top V = I$.

  12. SVD for PCA: $X = U \Sigma V^\top$, with shapes $D \times N = (D \times D)(D \times N)(N \times N)$. Then $S = \frac{1}{N} X X^\top = \frac{1}{N} U \Sigma V^\top V \Sigma^\top U^\top = \frac{1}{N} U \Sigma \Sigma^\top U^\top$, using $V^\top V = I$. It turns out the columns of $U$ are the eigenvectors of $X X^\top$.

  13. Computing PCA. Method 1: eigendecomposition. $U$ are eigenvectors of the covariance matrix $C = \frac{1}{n} X X^\top$. Computing $C$ already takes $O(nd^2)$ time (very expensive). Method 2: singular value decomposition (SVD). Find $X = U_{d \times d} \Sigma_{d \times n} V^\top_{n \times n}$ where $U^\top U = I_{d \times d}$, $V^\top V = I_{n \times n}$, and $\Sigma$ is diagonal. Computing the top $k$ singular vectors takes only $O(ndk)$.

  14. Computing PCA. Method 1: eigendecomposition. $U$ are eigenvectors of the covariance matrix $C = \frac{1}{n} X X^\top$. Computing $C$ already takes $O(nd^2)$ time (very expensive). Method 2: singular value decomposition (SVD). Find $X = U_{d \times d} \Sigma_{d \times n} V^\top_{n \times n}$ where $U^\top U = I_{d \times d}$, $V^\top V = I_{n \times n}$, and $\Sigma$ is diagonal. Computing the top $k$ singular vectors takes only $O(ndk)$. Relationship between eigendecomposition and SVD: the left singular vectors are the principal components ($C = \frac{1}{n} U \Sigma^2 U^\top$).
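A numpy sketch (hypothetical variable names) checking the stated relationship between the two methods: the left singular vectors match the covariance eigenvectors up to sign, and $\sigma_i^2 / n$ match the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 300
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

# Method 1: eigendecomposition of C = (1/n) X X^T  (O(n d^2) just to form C)
C = (X @ X.T) / n
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Method 2: SVD of X directly (no covariance matrix needed)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(s**2 / n, eigvals)                              # sigma_i^2 / n = lambda_i
assert np.allclose(np.abs(U.T @ eigvecs), np.eye(d), atol=1e-6)    # same vectors up to sign
```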

  15. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$

  16. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$ • $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, where $Z = (z_1 \cdots z_n)$ holds the low-dimensional codes

  17. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$ • $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, where $Z = (z_1 \cdots z_n)$ holds the low-dimensional codes • Idea: $z_i$ is a more "meaningful" representation of the $i$-th face than $x_i$; can use $z_i$ for nearest-neighbor classification

  18. Eigen-faces [Turk & Pentland 1991] • $d$ = number of pixels • Each $x_i \in \mathbb{R}^d$ is a face image • $x_{ji}$ = intensity of the $j$-th pixel in image $i$ • $X_{d \times n} \approx U_{d \times k} Z_{k \times n}$, where $Z = (z_1 \cdots z_n)$ holds the low-dimensional codes • Idea: $z_i$ is a more "meaningful" representation of the $i$-th face than $x_i$; can use $z_i$ for nearest-neighbor classification • Much faster: $O(dk + nk)$ time instead of $O(dn)$ when $n, d \gg k$
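A rough sketch of this pipeline with synthetic stand-in data (the `faces`, `labels`, and query below are made up for illustration; column $i$ of `faces` plays the role of image $x_i$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 1024, 200, 40                       # e.g. 32x32 "images", 200 faces, 40 components
faces = rng.normal(size=(d, n))               # stand-in for real face images (one per column)
labels = rng.integers(0, 10, size=n)          # stand-in identity labels

mean_face = faces.mean(axis=1, keepdims=True)
Xc = faces - mean_face                        # center

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
U_k = U[:, :k]                                # top-k "eigenfaces" (d x k)
Z = U_k.T @ Xc                                # k x n codes, the z_i from the slide

def nearest_neighbor_label(x_query):
    """Classify by nearest neighbor in the compressed k-dimensional space."""
    z_query = U_k.T @ (x_query - mean_face.ravel())
    dists = np.linalg.norm(Z - z_query[:, None], axis=0)
    return labels[np.argmin(dists)]

print(nearest_neighbor_label(faces[:, 0]))    # recovers labels[0]
```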

  19. Aside: How many components? • The magnitude of the eigenvalues indicates the fraction of variance captured. • [Plot: eigenvalues $\lambda_i$ on a face image dataset for $i = 2, \ldots, 11$; axis values 1353.2, 1086.7, 820.1, 553.6, 287.1.] • Eigenvalues typically drop off sharply, so we don't need that many components. • Of course, variance isn't everything...
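A tiny sketch of the usual recipe this slide implies: keep the smallest $k$ whose top eigenvalues capture a target fraction of total variance (the 95% threshold and the synthetic spectrum below are illustrative assumptions, not values from the slides):

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Smallest k whose top-k eigenvalues capture `target` of the total variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    frac = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(frac, target) + 1)

lam_example = 1000.0 * 0.5 ** np.arange(12)   # synthetic, sharply decaying spectrum
print(choose_k(lam_example))                  # a handful of components suffices
```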

  20. Wrapping up PCA • PCA is a linear model for dimensionality reduction: it finds a mapping to a lower-dimensional space that maximizes variance • We saw that this is equivalent to performing an eigendecomposition on the covariance matrix of $X$ • Next time: auto-encoders and neural compression for non-linear projections

  22. Non-linear dimensionality reduction

  23. Limitations of Linearity. [Figure panels: a dataset where PCA is effective, and one where PCA is ineffective.]

  24. Nonlinear PCA. [Figure: broken solution (the linear direction $u_1$) vs. the desired solution.] We want the desired solution: $S = \{(x_1, x_2) : x_2 = x_1^2\}$, a parabola.

  25. Nonlinear PCA. We want the desired solution: $S = \{(x_1, x_2) : x_2 = x_1^2\}$, a parabola. Idea: use kernels. We can get this as $S = \{x : \phi(x) = Uz\}$ with $\phi(x) = (x_1^2, x_2)^\top$. Linear dimensionality reduction in $\phi(x)$ space $\Leftrightarrow$ nonlinear dimensionality reduction in $x$ space.
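A small numpy sketch of that idea (see also the kernel PCA sketch after the next slide): apply $\phi(x) = (x_1^2, x_2)$ explicitly and run ordinary linear PCA in $\phi$-space, which recovers the parabola as a one-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
X = np.stack([x1, x1**2], axis=0)             # points on the parabola x2 = x1^2 (2 x n)

Phi = np.stack([X[0] ** 2, X[1]], axis=0)     # feature map phi(x) = (x1^2, x2)
Phi_c = Phi - Phi.mean(axis=1, keepdims=True)

C = (Phi_c @ Phi_c.T) / Phi_c.shape[1]
eigvals, eigvecs = np.linalg.eigh(C)

# On the parabola, phi(x) = (x1^2, x1^2) lies on the 1-D span of (1, 1),
# so linear PCA in phi-space recovers the curve: the smallest eigenvalue is ~0.
print(np.sort(eigvals))
```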

  26. Kernel PCA
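The slide is a pointer only; as one concrete way to do this in practice, here is a sketch with scikit-learn's `KernelPCA` on the same noisy-parabola setup (the RBF kernel and its gamma are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
X = np.column_stack([x1, x1**2 + 0.05 * rng.normal(size=200)])   # noisy parabola, n x d

# Kernel PCA: linear PCA in an implicit feature space defined by the kernel.
kpca = KernelPCA(n_components=1, kernel="rbf", gamma=1.0)
Z = kpca.fit_transform(X)                                        # n x 1 nonlinear embedding
print(Z.shape)
```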

  27. Alternatively: t-SNE!

  28. Stochastic Neighbor Embeddings. Borrowing from: Laurens van der Maaten (Delft -> Facebook AI)

  29. Manifold learning. Idea: perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances)

  30. Manifold learning

  31. PCA on MNIST digits

  32. t-SNE on MNIST Digits
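A sketch of the comparison behind these two slides, using scikit-learn's small `load_digits` dataset as a stand-in for MNIST; the perplexity and init settings are illustrative, not taken from the slides:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 x 64, a small MNIST-like dataset

Z_pca = PCA(n_components=2).fit_transform(X)                      # linear projection
Z_tsne = TSNE(n_components=2, perplexity=30.0,
              init="pca", random_state=0).fit_transform(X)        # nonlinear embedding

# Colored by the labels y (e.g. with matplotlib scatter plots), Z_tsne typically
# shows much cleaner per-digit clusters than Z_pca.
print(Z_pca.shape, Z_tsne.shape)
```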

  33. Swiss roll. Euclidean distance is not always a good notion of proximity

  34. Non-linear projection Bad projection: relative position to neighbors changes

  35. Non-linear projection Intuition: Want to preserve local neighborhood

  36. Stochastic Neighbor Embedding. [Figure: the original space and "the map" (the low-dimensional embedding).]

  37. SNE to t-SNE ( on board )

  38. t-SNE: SNE with a t-Distribution.
      Similarity in high dimension: $p_{ij} = \dfrac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)}$
      Similarity in low dimension: $q_{ij} = \dfrac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$
      Gradient: $\dfrac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\,(1 + \|y_i - y_j\|^2)^{-1}\,(y_i - y_j)$
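A direct numpy transcription of these three formulas (a sketch only: it uses one global σ instead of the per-point bandwidths that the perplexity parameter sets in real t-SNE, and rows of `X`/`Y` are taken to be data points):

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between rows of A."""
    sq = np.sum(A**2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * A @ A.T

def p_matrix(X, sigma=1.0):
    """High-dimensional similarities p_ij (Gaussian kernel, one global sigma)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def q_matrix(Y):
    """Low-dimensional similarities q_ij (Student-t kernel, 1 degree of freedom)."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def tsne_grad(P, Q, Y):
    """Gradient dC/dy_i = 4 * sum_j (p_ij - q_ij) (1 + ||y_i - y_j||^2)^-1 (y_i - y_j)."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    PQW = (P - Q) * W                                   # n x n pairwise weights
    return 4.0 * (np.diag(PQW.sum(axis=1)) - PQW) @ Y   # n x 2 gradient
```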

  39. Algorithm 1: Simple version of t-Distributed Stochastic Neighbor Embedding.
      Data: data set $X = \{x_1, x_2, \ldots, x_n\}$; cost function parameters: perplexity $Perp$; optimization parameters: number of iterations $T$, learning rate $\eta$, momentum $\alpha(t)$.
      Result: low-dimensional data representation $Y^{(T)} = \{y_1, y_2, \ldots, y_n\}$.
      begin
        compute pairwise affinities $p_{j|i}$ with perplexity $Perp$ (using Equation 1)
        set $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$
        sample initial solution $Y^{(0)} = \{y_1, y_2, \ldots, y_n\}$ from $\mathcal{N}(0, 10^{-4} I)$
        for $t = 1$ to $T$ do
          compute low-dimensional affinities $q_{ij}$ (using Equation 4)
          compute gradient $\frac{\partial C}{\partial Y}$ (using Equation 5)
          set $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\,\bigl(Y^{(t-1)} - Y^{(t-2)}\bigr)$
        end
      end
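A compact Python sketch of this loop, reusing the `p_matrix`, `q_matrix`, and `tsne_grad` helpers sketched after slide 38 (so it inherits their single global σ rather than per-point bandwidths calibrated by perplexity; the learning rate, iteration count, and fixed momentum are illustrative guesses, and the step subtracts the gradient so that the KL cost decreases):

```python
import numpy as np
# Assumes p_matrix, q_matrix, tsne_grad from the previous sketch are in scope.

def tsne(X, n_iter=500, learning_rate=10.0, momentum=0.8, sigma=1.0, seed=0):
    """Simplified t-SNE loop following Algorithm 1 (fixed sigma, fixed momentum)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    P = p_matrix(X, sigma=sigma)                 # pairwise affinities in the original space
    Y = rng.normal(scale=1e-2, size=(n, 2))      # initial solution ~ N(0, 1e-4 I)
    Y_prev = Y.copy()
    for t in range(n_iter):
        Q = q_matrix(Y)                          # low-dimensional affinities
        grad = tsne_grad(P, Q, Y)                # gradient of the KL cost
        Y_new = Y - learning_rate * grad + momentum * (Y - Y_prev)   # step + momentum
        Y_prev, Y = Y, Y_new
    return Y

# Usage: Y = tsne(X) with X an n x d array whose rows are data points.
```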

  41. set $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\,\bigl(Y^{(t-1)} - Y^{(t-2)}\bigr)$

  42. set $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\,\bigl(Y^{(t-1)} - Y^{(t-2)}\bigr)$. The $\eta\,\partial C / \partial Y$ term is regular gradient descent.

  43. set $Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\,\bigl(Y^{(t-1)} - Y^{(t-2)}\bigr)$. The $\alpha(t)\,(Y^{(t-1)} - Y^{(t-2)})$ term is the "momentum".

  44. Figure credit: Bisong, 2019

  45. What's magical about $t$? Basically, the gradient has nice properties. [Plot (a): gradient of SNE as a function of high-dimensional and low-dimensional pairwise distance.]

  46. What's magical about $t$? Basically, the gradient has nice properties. [Plot (a): gradient of SNE as a function of high-dimensional and low-dimensional pairwise distance.] Positive gradient -> "attraction" between points.

  47. What's magical about $t$? [Plots: (a) gradient of SNE, (c) gradient of t-SNE, each as a function of high-dimensional and low-dimensional pairwise distance.] Positive gradient -> "attraction" between points.
