CS 6316 Machine Learning Dimensionality Reduction Yangfeng Ji - PowerPoint PPT Presentation

  1. CS 6316 Machine Learning Dimensionality Reduction Yangfeng Ji Department of Computer Science University of Virginia

  2. Overview: 1. Reducing Dimensions 2. Principal Component Analysis 3. A Different Viewpoint of PCA

  3. Reducing Dimensions

  4. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$?

  5. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$? ◮ $d = 2$: $\pi (r_1^2 - r_2^2) \approx 0.06$ ◮ $d = 3$: $\frac{4}{3}\pi (r_1^3 - r_2^3) \approx 0.12$

  6. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$? ◮ $d = 2$: $\pi (r_1^2 - r_2^2) \approx 0.06$ ◮ $d = 3$: $\frac{4}{3}\pi (r_1^3 - r_2^3) \approx 0.12$ ◮ General form: $\frac{\pi^{d/2}}{\Gamma(d/2 + 1)} (r_1^d - r_2^d)$, with $r_2^d \to 0$ as $d \to \infty$ ◮ E.g., $r_2^{500} \approx 0.00657$

  7. Curse of Dimensionality: What is the volume difference between two $d$-dimensional balls with radii $r_1 = 1$ and $r_2 = 0.99$? ◮ $d = 2$: $\pi (r_1^2 - r_2^2) \approx 0.06$ ◮ $d = 3$: $\frac{4}{3}\pi (r_1^3 - r_2^3) \approx 0.12$ ◮ General form: $\frac{\pi^{d/2}}{\Gamma(d/2 + 1)} (r_1^d - r_2^d)$, with $r_2^d \to 0$ as $d \to \infty$ ◮ E.g., $r_2^{500} \approx 0.00657$ Question: what will happen if we uniformly sample from a $d$-dimensional ball?
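
The numbers on this slide can be checked directly. The following minimal sketch (not from the slides; it assumes only the Python standard library) evaluates the $d$-ball volume formula via logarithms, since $\Gamma(d/2 + 1)$ overflows ordinary floats for large $d$, and also prints the fraction $1 - (r_2/r_1)^d$ of the volume that sits in the thin outer shell.

```python
# Minimal sketch: volume of a d-ball and the outer-shell fraction from the slide.
# Uses log-gamma so the computation stays stable for large d.
from math import exp, lgamma, log, pi

def ball_volume(r, d):
    """Volume of a d-dimensional ball of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return exp((d / 2) * log(pi) - lgamma(d / 2 + 1) + d * log(r))

r1, r2 = 1.0, 0.99
for d in (2, 3, 10, 100, 500):
    diff = ball_volume(r1, d) - ball_volume(r2, d)   # absolute volume difference
    shell = 1 - (r2 / r1) ** d                       # fraction of volume in the outer shell
    print(f"d = {d:3d}: volume difference = {diff:.3e}, shell fraction = {shell:.5f}")

# For d = 500, (0.99)^500 ~ 0.00657, so about 99.3% of a uniform sample from the
# unit ball lands in the outer shell of thickness 0.01 (while the absolute volume
# of the ball itself shrinks toward zero as d grows).
```

This is one way to see the answer to the question on the slide: in high dimensions, points sampled uniformly from a ball concentrate near its boundary.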

  8. Dimensionality Reduction: Dimensionality reduction is the process of taking data in a high-dimensional space and mapping it into a new space whose dimensionality is much smaller.

  9. Dimensionality Reduction: Dimensionality reduction is the process of taking data in a high-dimensional space and mapping it into a new space whose dimensionality is much smaller. Mathematically, it means $f: x \to \tilde{x}$ (1), where $x \in \mathbb{R}^{d}$ and $\tilde{x} \in \mathbb{R}^{n}$ with $n < d$.

  10. Reducing Dimensions: A toy example. For the purpose of reducing dimensions, we can project $x = (x_1, x_2)$ onto the direction along $x_1$ or $x_2$. [Figure: two data points in the $(x_1, x_2)$ plane] Question: Given these two data examples, which direction should we pick, $x_1$ or $x_2$?

  12. Reducing Dimensions: A toy example (II). There is a better solution if we are allowed to rotate the coordinate axes. [Figure: the two points with axes $x_1$ and $x_2$]

  13. Reducing Dimensions: A toy example (II). There is a better solution if we are allowed to rotate the coordinate axes. [Figure: the two points with axes $x_1$, $x_2$ and rotated directions $u_1$, $u_2$] Pick $u_1$; then we preserve all the variance of the examples.

  14. Reducing Dimensions: A toy example (III). Consider a general case, where the examples do not lie on a perfect line. [Bishop, 2006, Section 12.1]

  15. Reducing Dimensions: A toy example (III). Consider a general case, where the examples do not lie on a perfect line. We can follow the same idea by finding a direction that preserves most of the variance of the examples. [Bishop, 2006, Section 12.1]
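
To make the toy example concrete, here is a small sketch (not from the deck; the data-generating line and the use of NumPy's eigh are illustrative assumptions) comparing the variance kept by projecting 2-D points onto the $x_1$ axis, the $x_2$ axis, and the leading direction $u_1$.

```python
# Sketch: for points lying roughly on a line, the direction u1 of largest
# variance keeps more variance than either coordinate axis.
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.8 * t + 0.1 * rng.normal(size=200)])  # points near a line
X = X - X.mean(axis=0)                                          # center the data

Sigma = X.T @ X / len(X)                   # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                        # direction of largest variance

print("variance along the x1 axis:", X[:, 0].var())
print("variance along the x2 axis:", X[:, 1].var())
print("variance along u1         :", ((X @ u1) ** 2).mean())  # the largest eigenvalue
print("total variance            :", eigvals.sum())
```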

  16. Principal Component Analysis

  17. Formulation: Given a set of examples $S = \{x_1, \ldots, x_m\}$ ◮ Center the data by removing the mean: $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$, $x_i \leftarrow x_i - \bar{x}$ for all $i \in [m]$ (2)

  18. Formulation: Given a set of examples $S = \{x_1, \ldots, x_m\}$ ◮ Center the data by removing the mean: $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$, $x_i \leftarrow x_i - \bar{x}$ for all $i \in [m]$ (2) ◮ Assume $u$ is the direction onto which we would like to project the data; then the objective function is the projected data variance $J(u) = \frac{1}{m}\sum_{i=1}^{m} (u^{\top} x_i)^2$ (3)

  19. Formulation: Given a set of examples $S = \{x_1, \ldots, x_m\}$ ◮ Center the data by removing the mean: $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$, $x_i \leftarrow x_i - \bar{x}$ for all $i \in [m]$ (2) ◮ Assume $u$ is the direction onto which we would like to project the data; then the objective function is the projected data variance $J(u) = \frac{1}{m}\sum_{i=1}^{m} (u^{\top} x_i)^2$ (3) ◮ Maximizing $J(u)$ is trivial if there is no constraint on $u$; therefore, we require $\|u\|_2^2 = u^{\top} u = 1$
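
A minimal sketch of these two steps, assuming the data are stored as an $(m, d)$ NumPy array X with one example per row (the array layout and helper names are my own, not the instructor's):

```python
# Sketch of Eq. (2) (centering) and Eq. (3) (the projected-variance objective).
import numpy as np

def center(X):
    """Remove the mean: x_i <- x_i - x_bar for every row of X."""
    return X - X.mean(axis=0)

def J(u, X):
    """Variance of the centered data projected onto the unit vector u."""
    u = u / np.linalg.norm(u)        # enforce the constraint u^T u = 1
    return np.mean((X @ u) ** 2)     # (1/m) sum_i (u^T x_i)^2
```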

  20. Covariance Matrix: The definition of $J(u)$ can be written as $J(u) = \frac{1}{m}\sum_{i=1}^{m} (u^{\top} x_i)^2$ (4) $= \frac{1}{m}\sum_{i=1}^{m} u^{\top} x_i \, u^{\top} x_i$ (5) $= \frac{1}{m}\sum_{i=1}^{m} u^{\top} x_i x_i^{\top} u$ (6) $= u^{\top} \left( \frac{1}{m}\sum_{i=1}^{m} x_i x_i^{\top} \right) u$ (7) $= u^{\top} \Sigma u$ (8), where $\Sigma$ is the data covariance matrix
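
Equations (4)-(8) can also be verified numerically. The sketch below is an illustration (it reuses the center and J helpers defined in the previous sketch) that checks $J(u) = u^{\top} \Sigma u$ for a random unit vector $u$:

```python
# Check that the projected variance equals the quadratic form u^T Sigma u.
import numpy as np

rng = np.random.default_rng(1)
X = center(rng.normal(size=(500, 5)))      # m = 500 examples in d = 5 dimensions
Sigma = X.T @ X / len(X)                   # (1/m) sum_i x_i x_i^T

u = rng.normal(size=5)
u = u / np.linalg.norm(u)
print(np.isclose(J(u, X), u @ Sigma @ u))  # True: Eqs. (4) and (8) agree
```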

  21. Optimization ◮ The optimization problem for finding a single projection direction is $\max_u J(u) = u^{\top} \Sigma u$ (9) s.t. $u^{\top} u = 1$ (10)

  22. Optimization ◮ The optimization problem for finding a single projection direction is $\max_u J(u) = u^{\top} \Sigma u$ (9) s.t. $u^{\top} u = 1$ (10) ◮ It can be converted to an unconstrained optimization problem with a Lagrange multiplier: $\max_u \left[ u^{\top} \Sigma u + \lambda (1 - u^{\top} u) \right]$ (11)

  23. Optimization ◮ The optimization problem for finding a single projection direction is $\max_u J(u) = u^{\top} \Sigma u$ (9) s.t. $u^{\top} u = 1$ (10) ◮ It can be converted to an unconstrained optimization problem with a Lagrange multiplier: $\max_u \left[ u^{\top} \Sigma u + \lambda (1 - u^{\top} u) \right]$ (11) ◮ The optimal solution is given by $\Sigma u - \lambda u = 0$ (12), i.e., $\Sigma u = \lambda u$ (13)

  24. Two Observations: There are two observations from $\Sigma u = \lambda u$ (14) ◮ First, $\lambda$ is an eigenvalue of $\Sigma$ and $u$ is the corresponding eigenvector (Lecture 01, page 29).

  25. Two Observations: There are two observations from $\Sigma u = \lambda u$ (14) ◮ First, $\lambda$ is an eigenvalue of $\Sigma$ and $u$ is the corresponding eigenvector (Lecture 01, page 29). ◮ Second, multiplying both sides by $u^{\top}$, we have $u^{\top} \Sigma u = \lambda$ (15). In order to maximize $J(u)$, $\lambda$ has to be the largest eigenvalue and $u$ the corresponding eigenvector.
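
Both observations are easy to confirm in code. The sketch below continues the running example (it assumes Sigma is the covariance matrix computed in the previous sketch) and checks Eqs. (14) and (15) for the largest eigenpair returned by NumPy:

```python
# Verify the two observations: Sigma u = lambda u, and u^T Sigma u = lambda
# for the largest eigenvalue.
import numpy as np

eigvals, eigvecs = np.linalg.eigh(Sigma)   # symmetric matrix: eigenvalues ascending
lam, u1 = eigvals[-1], eigvecs[:, -1]      # largest eigenvalue and its eigenvector

print(np.allclose(Sigma @ u1, lam * u1))   # Eq. (14)
print(np.isclose(u1 @ Sigma @ u1, lam))    # Eq. (15): the maximal projected variance
```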

  26. Principal Component Analysis ◮ As $u$ indicates the first major direction that can preserve the data variance, it is called the first principal component

  27. Principal Component Analysis ◮ As $u$ indicates the first major direction that can preserve the data variance, it is called the first principal component ◮ In general, with eigendecomposition we have $U^{\top} \Sigma U = \Lambda$ (16) ◮ Eigenvalues: $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ ◮ Eigenvectors: $U = [u_1, \ldots, u_d]$

  28. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17)

  29. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17) To reduce the dimensionality of $x$ from $d$ to $n$, with $n < d$: ◮ Take the first $n$ eigenvectors in $U$ and form $\tilde{U} = [u_1, \ldots, u_n] \in \mathbb{R}^{d \times n}$ (18)

  30. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17) To reduce the dimensionality of $x$ from $d$ to $n$, with $n < d$: ◮ Take the first $n$ eigenvectors in $U$ and form $\tilde{U} = [u_1, \ldots, u_n] \in \mathbb{R}^{d \times n}$ (18) ◮ Reduce the dimensionality of $x$ as $\tilde{x} = \tilde{U}^{\top} x \in \mathbb{R}^{n}$ (19)

  31. Principal Component Analysis (II): Assume that in $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ (17) To reduce the dimensionality of $x$ from $d$ to $n$, with $n < d$: ◮ Take the first $n$ eigenvectors in $U$ and form $\tilde{U} = [u_1, \ldots, u_n] \in \mathbb{R}^{d \times n}$ (18) ◮ Reduce the dimensionality of $x$ as $\tilde{x} = \tilde{U}^{\top} x \in \mathbb{R}^{n}$ (19) ◮ The value of $n$ can be determined by requiring $\frac{\sum_{i=1}^{n} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \approx 0.95$ (20)
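
Putting Eqs. (16)-(20) together gives a compact PCA routine. The sketch below is one possible implementation (the function name, the 95% default, and the use of numpy.linalg.eigh are my choices, not the course's reference code):

```python
# PCA by eigendecomposition of the covariance matrix, keeping enough components
# to explain roughly 95% of the variance.
import numpy as np

def pca_reduce(X, variance_kept=0.95):
    Xc = X - X.mean(axis=0)                              # Eq. (2): center the data
    Sigma = Xc.T @ Xc / len(Xc)                          # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort so lambda_1 >= ... >= lambda_d
    ratios = np.cumsum(eigvals) / eigvals.sum()
    n = int(np.searchsorted(ratios, variance_kept)) + 1  # Eq. (20): smallest n reaching the threshold
    U_tilde = eigvecs[:, :n]                             # Eq. (18): first n eigenvectors, shape (d, n)
    return Xc @ U_tilde, U_tilde                         # Eq. (19): rows are U_tilde^T x_i

# Usage: X_reduced, U_tilde = pca_reduce(X) for an (m, d) data matrix X.
```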

  32. Applications: Image Processing. Reduce the dimensionality of an image dataset from $28 \times 28 = 784$ to $M$. [Figure (a): original data] [Bishop, 2006, Section 12.1]

  33. Applications: Image Processing. Reduce the dimensionality of an image dataset from $28 \times 28 = 784$ to $M$. [Figure (a): original data] [Figure (b): with the first $M$ principal components] [Bishop, 2006, Section 12.1]
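
One way to produce images like panel (b) is to reconstruct each image from its first $M$ principal components, $\hat{x} = \bar{x} + \tilde{U}_M \tilde{U}_M^{\top}(x - \bar{x})$. The sketch below only illustrates the mechanics: the random array is a stand-in for a real $784$-dimensional image dataset, and the choice $M = 50$ is arbitrary.

```python
# Reconstruct (stand-in) 784-dimensional images from their first M principal components.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 784))          # placeholder for m images of 28 x 28 = 784 pixels
x_bar = X.mean(axis=0)
Xc = X - x_bar
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))

M = 50
U_M = eigvecs[:, ::-1][:, :M]             # first M principal components
X_hat = x_bar + Xc @ U_M @ U_M.T          # reconstruction from M components
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```

On real image data most of the variance concentrates in a small number of components, so modest values of $M$ already give recognizable reconstructions; random noise, by contrast, has no such low-dimensional structure.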

  34. A Different Viewpoint of PCA

  35. Data Reconstruction: Another way to formulate the objective function of PCA is $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (21), where ◮ $W \in \mathbb{R}^{n \times d}$ maps $x_i$ from the original space to a lower-dimensional space $\mathbb{R}^{n}$ ◮ $U \in \mathbb{R}^{d \times n}$ maps it back to the original space $\mathbb{R}^{d}$ [Shalev-Shwartz and Ben-David, 2014, Chap 23]

  36. Data Reconstruction: Another way to formulate the objective function of PCA is $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (21), where ◮ $W \in \mathbb{R}^{n \times d}$ maps $x_i$ from the original space to a lower-dimensional space $\mathbb{R}^{n}$ ◮ $U \in \mathbb{R}^{d \times n}$ maps it back to the original space $\mathbb{R}^{d}$ ◮ Dimensionality reduction is performed as $\tilde{x} = W x$, while $U$ maps $\tilde{x}$ back to make sure the reduction does not lose much information [Shalev-Shwartz and Ben-David, 2014, Chap 23]

  37. Optimization: Consider the optimization problem $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (22) ◮ Let $(W, U)$ be a solution of the problem in Eq. (22) [Shalev-Shwartz and Ben-David, 2014, Lemma 23.1]; then ◮ the columns of $U$ are orthonormal ◮ $W = U^{\top}$

  38. Optimization: Consider the optimization problem $\min_{W, U} \sum_{i=1}^{m} \|x_i - U W x_i\|_2^2$ (22) ◮ Let $(W, U)$ be a solution of the problem in Eq. (22) [Shalev-Shwartz and Ben-David, 2014, Lemma 23.1]; then ◮ the columns of $U$ are orthonormal ◮ $W = U^{\top}$ ◮ The optimization problem can therefore be simplified to $\min_{U: U^{\top} U = I} \sum_{i=1}^{m} \|x_i - U U^{\top} x_i\|_2^2$ (23). The solution will be the same.
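
The following sketch (my illustration, not from the lecture) connects the two viewpoints numerically: with orthonormal $U$ and $W = U^{\top}$, the reconstruction error $\sum_i \|x_i - U U^{\top} x_i\|_2^2$ is minimized by the top-$n$ eigenvectors of the covariance matrix, i.e., exactly the PCA directions, and any other orthonormal choice does at least as badly.

```python
# Compare the reconstruction error of the PCA directions with a random
# orthonormal alternative; the PCA directions attain the minimum of Eq. (23).
import numpy as np

def reconstruction_error(U, X):
    """sum_i ||x_i - U U^T x_i||_2^2 for an orthonormal (d, n) matrix U."""
    return np.sum((X - X @ U @ U.T) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))   # correlated 6-dimensional data
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))
U_pca = eigvecs[:, ::-1][:, :2]                       # top-2 eigenvectors (PCA solution)
U_rand, _ = np.linalg.qr(rng.normal(size=(6, 2)))     # a random orthonormal (6, 2) matrix

print("error with PCA directions   :", reconstruction_error(U_pca, X))
print("error with random directions:", reconstruction_error(U_rand, X))  # at least as large
```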
