Dimensionality Reduction Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824
Administrative • HW 3 due March 27. • HW 4 out tonight
J. Mark Sowers Distinguished Lecture • Michael Jordan • Pehong Chen Distinguished Professor, Department of Statistics and Department of Electrical Engineering and Computer Sciences • University of California, Berkeley • 3/28/19 • 7:30 PM, McBryde 100
ECE Faculty Candidate Talk • Siheng Chen • Ph.D. Carnegie Mellon University • Data science with graphs: From social network analysis to autonomous driving • Time: 10:00 AM - 11:00 AM March 28 • Location: 457B Whittemore
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
$\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$
$= \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
$\geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
Jensen's inequality (for concave $f$): $f(E[X]) \geq E[f(X)]$
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
$\sum_i \log p(x^{(i)}; \theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
- The lower bound holds for any choice of the distributions $Q_i$
- We want a tight lower bound: $f(E[X]) = E[f(X)]$
- When will that happen? When $X = E[X]$ with probability 1 ($X$ is a constant), i.e., $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$
How should we choose $Q_i(z^{(i)})$?
• $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$
• $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$
• $\sum_z Q_i(z) = 1$ (because it is a distribution)
• $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$
EM algorithm
Repeat until convergence {
(E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
(M-step) Set $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
}
Expectation Maximization (EM) Algorithm
• Goal: $\hat{\theta} = \arg\max_\theta \log \sum_z p(x, z \mid \theta)$ (the log of a sum is intractable)
• Jensen's inequality: for concave functions $f(x)$, $f(E[X]) \geq E[f(X)]$ (so we maximize the lower bound!)
• See here for proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps
Expectation Maximization (EM) Algorithm
• Goal: $\hat{\theta} = \arg\max_\theta \log \sum_z p(x, z \mid \theta)$
1. E-step: compute $E_{z \mid x, \theta^{(t)}}\!\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
2. M-step: solve $\theta^{(t+1)} = \arg\max_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
Expectation Maximization (EM) Algorithm
• Goal: $\hat{\theta} = \arg\max_\theta \log \sum_z p(x, z \mid \theta)$ (a log of an expectation of $p(x \mid z)$)
1. E-step: compute the expectation of the log of $p(x \mid z)$: $E_{z \mid x, \theta^{(t)}}\!\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
2. M-step: solve $\theta^{(t+1)} = \arg\max_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
EM for Mixture of Gaussians - derivation
$p(x_n, z_n = m \mid \mu, \sigma, \pi) = \pi_m \, \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right)$
1. E-step: compute $E_{z \mid x, \theta^{(t)}}\!\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
2. M-step: solve $\theta^{(t+1)} = \arg\max_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
EM for Mixture of Gaussians
$p(x_n, z_n = m \mid \mu, \sigma, \pi) = \pi_m \, \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right)$
1. E-step: compute the responsibilities $\alpha_{nm}^{(t)} = p(z_n = m \mid x_n, \mu^{(t)}, \sigma^{(t)}, \pi^{(t)})$
2. M-step: re-estimate the parameters
$\hat{\mu}_m^{(t+1)} = \frac{\sum_n \alpha_{nm} x_n}{\sum_n \alpha_{nm}}, \qquad \hat{\sigma}_m^{2\,(t+1)} = \frac{\sum_n \alpha_{nm} \left(x_n - \hat{\mu}_m^{(t+1)}\right)^2}{\sum_n \alpha_{nm}}, \qquad \hat{\pi}_m^{(t+1)} = \frac{1}{N} \sum_n \alpha_{nm}$
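These updates map directly to code. Below is a hedged NumPy sketch of EM for a 1-D Gaussian mixture; the function name `em_gmm_1d`, the initialization, and the synthetic data are my own choices, not the course's reference implementation.

```python
import numpy as np

def em_gmm_1d(x, k, n_iters=100, seed=0):
    """Illustrative 1-D GMM fit using the EM updates above."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize means from random data points, shared variance, uniform mixing weights.
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: responsibilities alpha[n, m] = p(z_n = m | x_n, mu, sigma, pi)
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        alpha = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted re-estimation of mu_m, sigma_m^2, pi_m
        Nm = alpha.sum(axis=0)                        # effective count per component
        mu = (alpha * x[:, None]).sum(axis=0) / Nm
        var = (alpha * (x[:, None] - mu) ** 2).sum(axis=0) / Nm
        pi = Nm / n
    return mu, var, pi

# Example usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
mu, var, pi = em_gmm_1d(x, k=2)
print(mu, var, pi)
```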
EM algorithm - derivation http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf
EM algorithm – E-Step
EM algorithm – M-Step
EM algorithm – M-Step: take the derivative with respect to $\mu_m$
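A short sketch of this step (my own rendering of the standard calculation; it recovers the $\hat{\mu}_m$ update on the "EM for Mixture of Gaussians" slide):

$$
\frac{\partial}{\partial \mu_m} \sum_n \alpha_{nm}\left[-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right]
= \sum_n \alpha_{nm}\,\frac{x_n - \mu_m}{\sigma_m^2} = 0
\;\Longrightarrow\;
\hat{\mu}_m = \frac{\sum_n \alpha_{nm}\, x_n}{\sum_n \alpha_{nm}}.
$$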
EM algorithm – M-Step −1 Take derivative with respect to σ 𝑚
EM Algorithm for GMM
EM Algorithm • Maximizes a lower bound on the data likelihood at each iteration • Each step increases the data likelihood • Converges to a local maximum • Common tricks in the derivation: • Find terms that sum or integrate to 1 • Use a Lagrange multiplier to deal with constraints (see the worked example below)
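As an illustration of the Lagrange-multiplier trick (my own worked rendering, not from the slides): maximizing the expected complete-data log-likelihood over the mixing weights subject to $\sum_m \pi_m = 1$ recovers the $\hat{\pi}_m$ update from the GMM slide.

$$
\mathcal{L} = \sum_n \sum_m \alpha_{nm} \log \pi_m + \lambda\Big(1 - \sum_m \pi_m\Big), \qquad
\frac{\partial \mathcal{L}}{\partial \pi_m} = \frac{\sum_n \alpha_{nm}}{\pi_m} - \lambda = 0
\;\Longrightarrow\;
\hat{\pi}_m = \frac{1}{\lambda}\sum_n \alpha_{nm} = \frac{1}{N}\sum_n \alpha_{nm},
$$

since summing the stationarity condition over $m$ (using $\sum_m \alpha_{nm} = 1$) gives $\lambda = N$.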
Convergence of EM Algorithm
“Hard EM” • Same as EM, except compute $z^*$ as the single most likely value of the hidden variables • K-means is an example • Advantages • Simpler: can be applied when the full EM updates cannot be derived • Sometimes works better if you want to make hard predictions at the end • But • Generally, the pdf parameters are not as accurate as with (soft) EM
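For contrast with the soft responsibilities above, here is a minimal K-means-style hard-EM sketch (my own variable names and initialization; it assumes equal variances, so the most likely component is simply the nearest mean):

```python
import numpy as np

def hard_em_1d(x, k, n_iters=50, seed=0):
    """Hard EM for a 1-D mixture: each point is assigned to its single most likely component."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)
    for _ in range(n_iters):
        # "Hard" E-step: z*_n = index of the closest mean.
        z = np.argmin((x[:, None] - mu) ** 2, axis=1)
        # M-step: re-estimate each mean from the points assigned to it.
        mu = np.array([x[z == m].mean() if np.any(z == m) else mu[m] for m in range(k)])
    return mu, z

# Usage: cluster the same kind of synthetic data as before.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
mu, z = hard_em_1d(x, k=2)
print(mu)
```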
Dimensionality Reduction • Motivation • Data compression • Data visualization • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA
Dimensionality Reduction • Motivation • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA
Data Compression
• Reduces the required time and storage space
• Removing multi-collinearity improves the interpretation of the parameters of the machine learning model
• Reduce data from 2D to 1D:
$x^{(1)} \in \mathbb{R}^2 \rightarrow z^{(1)} \in \mathbb{R}$
$x^{(2)} \in \mathbb{R}^2 \rightarrow z^{(2)} \in \mathbb{R}$
$\vdots$
$x^{(m)} \in \mathbb{R}^2 \rightarrow z^{(m)} \in \mathbb{R}$
[Figure: 2D points in the $(x_1, x_2)$ plane projected onto a single direction $z_1$]
Data Compression
• Reduce data from 3D to 2D (in general, e.g., 1000D -> 100D)
[Figure: 3D points $(x_1, x_2, x_3)$ projected onto a 2D plane with coordinates $(z_1, z_2)$]
Dimensionality Reduction • Motivation • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA
Principal Component Analysis Formulation
[Figure: 2D data in the $(x_1, x_2)$ plane]
Principal Component Analysis Formulation
[Figure: data in the $(x_1, x_2)$ plane with candidate projection directions $u^{(1)}$ and $u^{(2)}$]
• Reduce n-D to k-D: find $k$ directions $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$ onto which to project the data, so as to minimize the projection error
PCA vs. Linear regression
[Figure: linear regression fits $y$ from $x_1$ and measures vertical errors; PCA treats $x_1$ and $x_2$ symmetrically and measures orthogonal projection errors]
Data pre-processing
• Training set: $x^{(1)}, x^{(2)}, \cdots, x^{(m)}$
• Preprocessing (feature scaling / mean normalization):
$\mu_j = \frac{1}{m} \sum_i x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
If different features are on different scales, scale features to have comparable ranges of values:
$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$
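A minimal NumPy sketch of this preprocessing step, assuming `X` is an m×n data matrix with one example per row (the helper name `mean_normalize` is mine, not from the course):

```python
import numpy as np

def mean_normalize(X, scale=True):
    """Center each feature; optionally divide by its standard deviation (feature scaling)."""
    mu = X.mean(axis=0)            # mu_j = (1/m) * sum_i x_j^(i)
    Xc = X - mu                    # replace x_j^(i) with x_j^(i) - mu_j
    if scale:
        s = X.std(axis=0)
        s[s == 0] = 1.0            # avoid dividing by zero for constant features
        Xc = Xc / s
    return Xc, mu

# Example: standardize a small random data matrix with features on very different scales.
X = np.random.default_rng(0).normal(size=(100, 3)) * [1.0, 10.0, 100.0]
Xn, mu = mean_normalize(X)
print(Xn.mean(axis=0).round(6), Xn.std(axis=0).round(6))
```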
Principal Component Analysis Algorithm
• Goal: Reduce data from n-dimensions to k-dimensions
• Step 1: Compute the "covariance matrix"
$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^\top$
• Step 2: Compute the "eigenvectors" of the covariance matrix
[U, S, V] = svd(Sigma);
$U = \left[u^{(1)}, u^{(2)}, \cdots, u^{(n)}\right] \in \mathbb{R}^{n \times n}$
Principal components: $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$
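The slide's `svd` call is Octave/MATLAB; an equivalent hedged NumPy sketch (my own helper name `pca_components`, assuming `X` has already been mean-normalized as above) is:

```python
import numpy as np

def pca_components(X):
    """Return U (columns = principal directions) and S for mean-normalized data X (m x n)."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m              # covariance matrix: (1/m) * sum_i x^(i) x^(i)^T
    U, S, Vt = np.linalg.svd(Sigma)    # columns of U are the eigenvectors of Sigma
    return U, S

# Usage: the first k columns of U are the top-k principal components.
X = np.random.default_rng(0).normal(size=(200, 5))
X = X - X.mean(axis=0)
U, S = pca_components(X)
U_k = U[:, :2]
```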
Principal Component Analysis Algorithm
• Goal: Reduce data from n-dimensions to k-dimensions
• Principal components: $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$
• Projection: $z^{(i)} = \left[u^{(1)}, u^{(2)}, \cdots, u^{(k)}\right]^\top x^{(i)} \in \mathbb{R}^k$
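Putting the pieces together, here is a self-contained sketch of the projection and an approximate reconstruction (my own variable names; the variance-retained ratio is one common way to guide the choice of $k$, as in the outline's "Choosing the number of principal components"):

```python
import numpy as np

# Assumed setup: X is mean-normalized (m x n); U, S come from the SVD of its covariance matrix.
X = np.random.default_rng(0).normal(size=(200, 5))
X = X - X.mean(axis=0)
U, S, _ = np.linalg.svd((X.T @ X) / X.shape[0])

k = 2
U_k = U[:, :k]

Z = X @ U_k            # z^(i) = U_k^T x^(i), shape (m, k): the compressed representation
X_approx = Z @ U_k.T   # approximate reconstruction in R^n: x_approx^(i) = U_k z^(i)

# Fraction of variance retained by the top-k components.
print(S[:k].sum() / S.sum())
```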