  1. Dimensionality Reduction Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Administrative • HW 3 due March 27. • HW 4 out tonight

  3. J. Mark Sowers Distinguished Lecture • Michael Jordan • Pehong Chen Distinguished Professor Department of Statistics and Electrical Engineering and Computer Sciences • University of California, Berkeley • 3/28/19 • 7:30 PM, McBryde 100

  4. ECE Faculty Candidate Talk • Siheng Chen • Ph.D. Carnegie Mellon University • Data science with graphs: From social network analysis to autonomous driving • Time: 10:00 AM - 11:00 AM March 28 • Location: 457B Whittemore

  5. Expectation Maximization (EM) Algorithm • Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
     $\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$
     $= \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
     $\geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
     Jensen's inequality (for concave $f$): $f(E[X]) \geq E[f(X)]$

  6. Expectation Maximization (EM) Algorithm • Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
     $\sum_i \log p(x^{(i)}; \theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
     - The lower bound holds for any choice of distributions $Q_i$
     - We want a tight lower bound: $f(E[X]) = E[f(X)]$
     - When does that happen? When $X = E[X]$ with probability 1 ($X$ is a constant), i.e., $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$

  7. How should we choose $Q_i(z^{(i)})$?
     • We need $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$, so $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$
     • $\sum_z Q_i(z^{(i)}) = 1$ (because it is a distribution)
     • $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$

  8. EM algorithm
     Repeat until convergence {
       (E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
       (M-step) Set $\theta := \operatorname{argmax}_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
     }
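
To make the loop structure concrete, here is a rough Octave-style sketch of the procedure above. It is only a skeleton under assumed names: estep and mstep are hypothetical user-supplied functions for the model at hand (the Gaussian-mixture case on slide 13 gives one concrete instantiation), and theta0, X, max_iter, and tol are likewise assumptions.

    theta = theta0;                       % initial parameter guess
    for iter = 1:max_iter
      Q = estep(X, theta);                % E-step: Q_i(z^(i)) = p(z^(i) | x^(i); theta)
      theta_new = mstep(X, Q);            % M-step: maximize the expected complete-data log-likelihood
      if norm(theta_new - theta) < tol    % simple convergence test (treats theta as a vector)
        break;
      end
      theta = theta_new;
    end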

  9. Expectation Maximization (EM) Algorithm • Goal: $\hat{\theta} = \operatorname{argmax}_\theta \log \sum_z p(x, z \mid \theta)$ • The log of a sum is intractable • Jensen's inequality: for concave functions $f(x)$, $f(E[X]) \geq E[f(X)]$ (so we maximize the lower bound!) • See here for a proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps

  10. Expectation Maximization (EM) Algorithm • Goal: $\hat{\theta} = \operatorname{argmax}_\theta \log \sum_z p(x, z \mid \theta)$
      1. E-step: compute $E_{z \mid x, \theta^{(t)}}\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
      2. M-step: solve $\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$

  11. Expectation Maximization (EM) Algorithm • Goal: $\hat{\theta} = \operatorname{argmax}_\theta \log \sum_z p(x, z \mid \theta)$ (a log of an expectation of $p(x \mid z)$); Jensen's inequality $f(E[X]) \geq E[f(X)]$ lets us work with an expectation of the log instead
      1. E-step: compute the expectation of the log of $p(x \mid z)$: $E_{z \mid x, \theta^{(t)}}\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
      2. M-step: solve $\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$

  12. EM for Mixture of Gaussians - derivation
      $p(x_n \mid \pi, \mu, \sigma) = \sum_m p(x_n, z_n = m \mid \pi, \mu, \sigma), \quad p(x_n, z_n = m \mid \pi, \mu, \sigma) = \pi_m \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right)$
      1. E-step: compute $E_{z \mid x, \theta^{(t)}}\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
      2. M-step: solve $\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$

  13. EM for Mixture of Gaussians
      $p(x_n \mid \pi, \mu, \sigma) = \sum_m \pi_m \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right)$
      1. E-step: compute the responsibilities $\gamma_{nm} = p(z_n = m \mid x_n, \pi^{(t)}, \mu^{(t)}, \sigma^{(t)})$
      2. M-step: $\hat{\mu}_m^{(t+1)} = \frac{\sum_n \gamma_{nm} x_n}{\sum_n \gamma_{nm}}, \quad \hat{\sigma}_m^{2\,(t+1)} = \frac{\sum_n \gamma_{nm} \left(x_n - \hat{\mu}_m^{(t+1)}\right)^2}{\sum_n \gamma_{nm}}, \quad \hat{\pi}_m^{(t+1)} = \frac{1}{N} \sum_n \gamma_{nm}$
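
The updates above translate almost line for line into code. Below is a minimal Octave-style sketch of one EM iteration for a 1-D, K-component mixture; the variable names (x, mu, s2, ppi, G) are my assumptions, not from the slides.

    % x: N-by-1 data; mu, s2, ppi: 1-by-K means, variances, and mixing weights.
    gauss = @(x, mu, s2) exp(-(x - mu).^2 ./ (2*s2)) ./ sqrt(2*pi*s2);

    % E-step: responsibilities G(n,m) = p(z_n = m | x_n; current parameters)
    G = ppi .* gauss(x, mu, s2);           % N-by-K via implicit broadcasting
    G = G ./ sum(G, 2);                    % normalize each row to sum to 1

    % M-step: re-estimate the parameters from the responsibilities
    Nm  = sum(G, 1);                       % effective number of points per component
    mu  = sum(G .* x, 1) ./ Nm;            % weighted means
    s2  = sum(G .* (x - mu).^2, 1) ./ Nm;  % weighted variances
    ppi = Nm / numel(x);                   % mixing weights (pi itself is a built-in constant)

Iterating these two blocks until the parameters stop changing is exactly the EM loop of slide 8.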

  14. EM algorithm - derivation http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf

  15. EM algorithm – E-Step

  16. EM algorithm – E-Step

  17. EM algorithm – M-Step

  18. EM algorithm – M-Step • Take the derivative with respect to $\mu_m$

  19. EM algorithm – M-Step • Take the derivative with respect to $\sigma_m^{-1}$

  20. EM Algorithm for GMM

  21. EM Algorithm • Maximizes a lower bound on the data likelihood at each iteration • Each step increases the data likelihood • Converges to a local maximum • Common tricks in the derivation • Find terms that sum or integrate to 1 • Use a Lagrange multiplier to deal with constraints (a worked example follows below)
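
As an example of the Lagrange-multiplier trick, here is the standard derivation of the mixing-weight update from slide 13 (my own write-up, not from the slides): maximize $\sum_n \sum_m \gamma_{nm} \log \pi_m$ subject to $\sum_m \pi_m = 1$.

    $L(\pi, \lambda) = \sum_n \sum_m \gamma_{nm} \log \pi_m + \lambda \left(1 - \sum_m \pi_m\right)$
    $\frac{\partial L}{\partial \pi_m} = \frac{\sum_n \gamma_{nm}}{\pi_m} - \lambda = 0 \;\Rightarrow\; \pi_m = \frac{1}{\lambda} \sum_n \gamma_{nm}$

Summing over $m$ and using $\sum_m \pi_m = 1$ and $\sum_m \gamma_{nm} = 1$ gives $\lambda = N$, hence $\hat{\pi}_m = \frac{1}{N} \sum_n \gamma_{nm}$.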

  22. Convergence of EM Algorithm

  23. “Hard EM” • Same as EM, except compute $z^*$ as the single most likely value of the hidden variables • K-means is an example (see the sketch below) • Advantages • Simpler: can be applied when we cannot derive EM • Sometimes works better if you want to make hard predictions at the end • But • Generally, the estimated pdf parameters are not as accurate as with EM
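
Since the slide names k-means as the canonical example, here is a minimal Octave-style sketch of hard EM for a Gaussian mixture with a shared spherical covariance, which reduces to the k-means loop; X, C, k, and max_iter are assumed names, and empty clusters are not handled.

    % X: m-by-n data; C: k-by-n current means (e.g., k randomly chosen rows of X).
    for iter = 1:max_iter
      % "E-step": hard-assign each point to its most likely component z*
      D = sum(X.^2, 2) + sum(C.^2, 2)' - 2 * X * C';   % m-by-k squared distances
      [~, z] = min(D, [], 2);
      % "M-step": re-estimate each mean from the points assigned to it
      for j = 1:k
        C(j, :) = mean(X(z == j, :), 1);
      end
    end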

  24. Dimensionality Reduction • Motivation • Data compression • Data visualization • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA

  25. Dimensionality Reduction • Motivation • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA

  26. Data Compression • Reduces the required time and storage space • Removing multi-collinearity improves the interpretation of the parameters of the machine learning model.
      Reduce data from 2D to 1D:
      $x^{(1)} \in \mathbb{R}^2 \rightarrow z^{(1)} \in \mathbb{R}$
      $x^{(2)} \in \mathbb{R}^2 \rightarrow z^{(2)} \in \mathbb{R}$
      $\vdots$
      $x^{(m)} \in \mathbb{R}^2 \rightarrow z^{(m)} \in \mathbb{R}$
      [Figure: 2-D data in the $(x_1, x_2)$ plane mapped to a 1-D coordinate $z_1$]

  27. Data Compression • Reduces the required time and storage space • Removing multi-collinearity improves the interpretation of the parameters of the machine learning model.
      Reduce data from 2D to 1D: $x^{(i)} \in \mathbb{R}^2 \rightarrow z^{(i)} \in \mathbb{R}$ for $i = 1, \ldots, m$
      [Figure: 2-D data in the $(x_1, x_2)$ plane with the projected coordinate $z_1$]

  28. Data Compression • Reduce data from 3D to 2D (in general 1000D -> 100D)
      [Figure: 3-D data with axes $x_1, x_2, x_3$ projected onto a 2-D plane with axes $z_1, z_2$]

  29. Dimensionality Reduction • Motivation • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA

  30. Principal Component Analysis Formulation
      [Figure: 2-D data in the $(x_1, x_2)$ plane]

  31. Principal Component Analysis Formulation
      [Figure: the same 2-D data with candidate projection directions $u^{(1)}$ and $u^{(2)}$]
      • Reduce from n-D to k-D: find $k$ directions $u^{(1)}, u^{(2)}, \ldots, u^{(k)} \in \mathbb{R}^n$ onto which to project the data, so as to minimize the projection error
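
One standard way to write that objective explicitly (an assumption on my part; the slide states it only in words): collect unit-norm, mutually orthogonal directions $u^{(1)}, \ldots, u^{(k)}$ as the columns of $U_k \in \mathbb{R}^{n \times k}$ and minimize the average squared projection error

    $\min_{U_k:\; U_k^\top U_k = I} \; \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - U_k U_k^\top x^{(i)} \right\|^2$

Minimizing this projection error is equivalent to maximizing the variance retained along the chosen directions.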

  32. PCA vs. Linear regression
      [Figure: two scatter plots, one with axes $x_1$ and $y$ (linear regression) and one with axes $x_1$ and $x_2$ (PCA)]
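
For contrast (my own summary of the standard distinction): linear regression minimizes vertical errors in order to predict a distinguished target $y$, while PCA has no target and minimizes the orthogonal projection error:

    $\text{Regression: } \min_\theta \sum_i \left(y^{(i)} - \theta^\top x^{(i)}\right)^2 \qquad \text{PCA: } \min_{U_k} \sum_i \left\| x^{(i)} - U_k U_k^\top x^{(i)} \right\|^2$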

  33. Data pre-processing • Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ • Preprocessing (feature scaling / mean normalization):
      $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$
      Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
      If different features are on different scales, scale the features to have a comparable range of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$
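
A minimal Octave-style sketch of this preprocessing step, assuming the training set is stored as an m-by-n matrix X with one example per row (the variable names are mine):

    mu     = mean(X);          % 1-by-n row vector of feature means mu_j
    s      = std(X);           % per-feature scale s_j (standard deviation; the range also works)
    X_norm = (X - mu) ./ s;    % mean-normalize, then scale each feature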

  34. Principal Component Analysis Algorithm • Goal: Reduce data from n-dimensions to k-dimensions
      • Step 1: Compute the “covariance matrix” $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} {x^{(i)}}^\top$
      • Step 2: Compute the “eigenvectors” of the covariance matrix: [U, S, V] = svd(Sigma);
      $U = \left[u^{(1)}, u^{(2)}, \ldots, u^{(n)}\right] \in \mathbb{R}^{n \times n}$
      Principal components: $u^{(1)}, u^{(2)}, \ldots, u^{(k)} \in \mathbb{R}^n$

  35. Principal Component Analysis Algorithm • Goal: Reduce data from n-dimensions to k-dimensions
      • Principal components: $u^{(1)}, u^{(2)}, \ldots, u^{(k)} \in \mathbb{R}^n$
      • Projection: $z^{(i)} = \left[u^{(1)}, u^{(2)}, \ldots, u^{(k)}\right]^\top x^{(i)} \in \mathbb{R}^k$
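
Putting slides 33-35 together, here is a minimal Octave-style sketch of the whole pipeline; X and k are assumed names, and the last line is the reconstruction step listed in the outline.

    % X: m-by-n data matrix (one example per row); k: target dimensionality.
    [m, n] = size(X);
    X_norm = (X - mean(X)) ./ std(X);      % preprocessing (slide 33)
    Sigma  = (1/m) * (X_norm' * X_norm);   % n-by-n covariance matrix (slide 34)
    [U, S, V] = svd(Sigma);                % columns of U are the eigenvectors u^(j)
    U_k   = U(:, 1:k);                     % keep the top-k principal components
    Z     = X_norm * U_k;                  % rows are z^(i) = U_k' * x^(i), in R^k
    X_rec = Z * U_k';                      % approximate reconstruction x_approx^(i) = U_k * z^(i)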
