Dimensionality Reduction Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824
Administrative • HW 3 due March 27. • HW 4 out tonight
J. Mark Sowers Distinguished Lecture • Michael Jordan • Pehong Chen Distinguished Professor, Department of Statistics and Department of Electrical Engineering and Computer Sciences • University of California, Berkeley • 3/28/19 • 7:30 PM, McBryde 100
ECE Faculty Candidate Talk • Siheng Chen • Ph.D. Carnegie Mellon University • Data science with graphs: From social network analysis to autonomous driving • Time: 10:00 AM - 11:00 AM March 28 • Location: 457B Whittemore
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
$\sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$
$= \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \, \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
$\geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
Jensen's inequality (for concave $f$): $f(E[X]) \geq E[f(X)]$
Expectation Maximization (EM) Algorithm
• Goal: Find $\theta$ that maximizes the log-likelihood $\sum_i \log p(x^{(i)}; \theta)$
$\sum_i \log p(x^{(i)}; \theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
- The lower bound holds for any choice of the distributions $Q_i$
- We want a tight lower bound: $f(E[X]) = E[f(X)]$
- When will that happen? When $X = E[X]$ with probability 1 ($X$ is a constant), i.e., $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$
How should we choose $Q_i(z^{(i)})$?
• $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$
• $Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)$
• $\sum_z Q_i(z) = 1$ (because it is a distribution)
• $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$
EM algorithm
Repeat until convergence {
(E-step) For each $i$, set $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$ (probabilistic inference)
(M-step) Set $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
}
Expectation Maximization (EM) Algorithm
• Goal: $\hat{\theta} = \arg\max_\theta \log \sum_z p(x, z \mid \theta)$ (the log of a sum is intractable)
• Jensen's inequality: for concave functions $f(x)$, $f(E[X]) \geq E[f(X)]$ (so we maximize the lower bound!)
• See here for proof: www.stanford.edu/class/cs229/notes/cs229-notes8.ps
Expectation Maximization (EM) Algorithm
• Goal: $\hat{\theta} = \arg\max_\theta \log \sum_z p(x, z \mid \theta)$
1. E-step: compute $E_{z \mid x, \theta^{(t)}}\!\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
2. M-step: solve $\theta^{(t+1)} = \arg\max_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
Expectation Maximization (EM) Algorithm
• Goal: $\hat{\theta} = \arg\max_\theta \log \sum_z p(x, z \mid \theta)$ (a log of an expectation of $p(x \mid z)$)
1. E-step: compute the expectation of the log of $p(x \mid z)$: $E_{z \mid x, \theta^{(t)}}\!\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
2. M-step: solve $\theta^{(t+1)} = \arg\max_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
EM for Mixture of Gaussians - derivation
$p(x_n, z_n = m \mid \mu, \sigma, \pi) = \pi_m \, \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right)$
1. E-step: compute $E_{z \mid x, \theta^{(t)}}\!\left[\log p(x, z \mid \theta)\right] = \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
2. M-step: solve $\theta^{(t+1)} = \arg\max_\theta \sum_z p(z \mid x, \theta^{(t)}) \log p(x, z \mid \theta)$
EM for Mixture of Gaussians
$p(x_n, z_n = m \mid \mu, \sigma, \pi) = \pi_m \, \frac{1}{\sqrt{2\pi\sigma_m^2}} \exp\!\left(-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right)$
1. E-step: compute the responsibilities $\alpha_{nm}^{(t)} = p(z_n = m \mid x_n, \mu^{(t)}, \sigma^{(t)}, \pi^{(t)})$
2. M-step: re-estimate the parameters
$\hat{\mu}_m^{(t+1)} = \frac{\sum_n \alpha_{nm} x_n}{\sum_n \alpha_{nm}}, \qquad \hat{\sigma}_m^{2\,(t+1)} = \frac{\sum_n \alpha_{nm} \left(x_n - \hat{\mu}_m^{(t+1)}\right)^2}{\sum_n \alpha_{nm}}, \qquad \hat{\pi}_m^{(t+1)} = \frac{1}{N} \sum_n \alpha_{nm}$
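These updates map directly to code. Below is a hedged NumPy sketch of EM for a 1-D Gaussian mixture; the function name `em_gmm_1d`, the initialization, and the synthetic data are my own choices, not the course's reference implementation.

```python
import numpy as np

def em_gmm_1d(x, k, n_iters=100, seed=0):
    """Illustrative 1-D GMM fit using the EM updates above."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize means from random data points, shared variance, uniform mixing weights.
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: responsibilities alpha[n, m] = p(z_n = m | x_n, mu, sigma, pi)
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        alpha = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted re-estimation of mu_m, sigma_m^2, pi_m
        Nm = alpha.sum(axis=0)                        # effective count per component
        mu = (alpha * x[:, None]).sum(axis=0) / Nm
        var = (alpha * (x[:, None] - mu) ** 2).sum(axis=0) / Nm
        pi = Nm / n
    return mu, var, pi

# Example usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
mu, var, pi = em_gmm_1d(x, k=2)
print(mu, var, pi)
```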
EM algorithm - derivation http://lasa.epfl.ch/teaching/lectures/ML_Phd/Notes/GP-GMM.pdf
EM algorithm – E-Step
EM algorithm – M-Step
EM algorithm – M-Step: take the derivative with respect to $\mu_m$
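A short sketch of this step (my own rendering of the standard calculation; it recovers the $\hat{\mu}_m$ update on the "EM for Mixture of Gaussians" slide):

$$
\frac{\partial}{\partial \mu_m} \sum_n \alpha_{nm}\left[-\frac{(x_n - \mu_m)^2}{2\sigma_m^2}\right]
= \sum_n \alpha_{nm}\,\frac{x_n - \mu_m}{\sigma_m^2} = 0
\;\Longrightarrow\;
\hat{\mu}_m = \frac{\sum_n \alpha_{nm}\, x_n}{\sum_n \alpha_{nm}}.
$$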
EM algorithm – M-Step −1 Take derivative with respect to σ 𝑚
EM Algorithm for GMM
EM Algorithm • Maximizes a lower bound on the data likelihood at each iteration • Each step increases the data likelihood • Converges to a local maximum • Common tricks in the derivation: • Find terms that sum or integrate to 1 • Use a Lagrange multiplier to deal with constraints (see the worked example below)
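As an illustration of the Lagrange-multiplier trick (my own worked rendering, not from the slides): maximizing the expected complete-data log-likelihood over the mixing weights subject to $\sum_m \pi_m = 1$ recovers the $\hat{\pi}_m$ update from the GMM slide.

$$
\mathcal{L} = \sum_n \sum_m \alpha_{nm} \log \pi_m + \lambda\Big(1 - \sum_m \pi_m\Big), \qquad
\frac{\partial \mathcal{L}}{\partial \pi_m} = \frac{\sum_n \alpha_{nm}}{\pi_m} - \lambda = 0
\;\Longrightarrow\;
\hat{\pi}_m = \frac{1}{\lambda}\sum_n \alpha_{nm} = \frac{1}{N}\sum_n \alpha_{nm},
$$

since summing the stationarity condition over $m$ (using $\sum_m \alpha_{nm} = 1$) gives $\lambda = N$.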
Convergence of EM Algorithm
“Hard EM” • Same as EM, except compute $z^*$ as the single most likely value of the hidden variables • K-means is an example • Advantages • Simpler: can be applied when the full EM updates cannot be derived • Sometimes works better if you want to make hard predictions at the end • But • Generally, the pdf parameters are not as accurate as with (soft) EM
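For contrast with the soft responsibilities above, here is a minimal K-means-style hard-EM sketch (my own variable names and initialization; it assumes equal variances, so the most likely component is simply the nearest mean):

```python
import numpy as np

def hard_em_1d(x, k, n_iters=50, seed=0):
    """Hard EM for a 1-D mixture: each point is assigned to its single most likely component."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)
    for _ in range(n_iters):
        # "Hard" E-step: z*_n = index of the closest mean.
        z = np.argmin((x[:, None] - mu) ** 2, axis=1)
        # M-step: re-estimate each mean from the points assigned to it.
        mu = np.array([x[z == m].mean() if np.any(z == m) else mu[m] for m in range(k)])
    return mu, z

# Usage: cluster the same kind of synthetic data as before.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
mu, z = hard_em_1d(x, k=2)
print(mu)
```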
Dimensionality Reduction • Motivation • Data compression • Data visualization • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA
Dimensionality Reduction • Motivation • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA
Data Compression
• Reduces the required time and storage space
• Removing multi-collinearity improves the interpretation of the parameters of the machine learning model
• Reduce data from 2D to 1D:
$x^{(1)} \in \mathbb{R}^2 \rightarrow z^{(1)} \in \mathbb{R}$
$x^{(2)} \in \mathbb{R}^2 \rightarrow z^{(2)} \in \mathbb{R}$
$\vdots$
$x^{(m)} \in \mathbb{R}^2 \rightarrow z^{(m)} \in \mathbb{R}$
[Figure: 2D points in the $(x_1, x_2)$ plane projected onto a single direction $z_1$]
Data Compression
• Reduce data from 3D to 2D (in general, e.g., 1000D -> 100D)
[Figure: 3D points $(x_1, x_2, x_3)$ projected onto a 2D plane with coordinates $(z_1, z_2)$]
Dimensionality Reduction • Motivation • Principal component analysis • Formulation • Algorithm • Reconstruction • Choosing the number of principal components • Applying PCA
Principal Component Analysis Formulation
[Figure: 2D data in the $(x_1, x_2)$ plane]
Principal Component Analysis Formulation
[Figure: data in the $(x_1, x_2)$ plane with candidate projection directions $u^{(1)}$ and $u^{(2)}$]
• Reduce n-D to k-D: find $k$ directions $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$ onto which to project the data, so as to minimize the projection error
PCA vs. Linear regression
[Figure: linear regression fits $y$ from $x_1$ and measures vertical errors; PCA treats $x_1$ and $x_2$ symmetrically and measures orthogonal projection errors]
Data pre-processing
• Training set: $x^{(1)}, x^{(2)}, \cdots, x^{(m)}$
• Preprocessing (feature scaling / mean normalization):
$\mu_j = \frac{1}{m} \sum_i x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$
If different features are on different scales, scale features to have comparable ranges of values:
$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$
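A minimal NumPy sketch of this preprocessing step, assuming `X` is an m×n data matrix with one example per row (the helper name `mean_normalize` is mine, not from the course):

```python
import numpy as np

def mean_normalize(X, scale=True):
    """Center each feature; optionally divide by its standard deviation (feature scaling)."""
    mu = X.mean(axis=0)            # mu_j = (1/m) * sum_i x_j^(i)
    Xc = X - mu                    # replace x_j^(i) with x_j^(i) - mu_j
    if scale:
        s = X.std(axis=0)
        s[s == 0] = 1.0            # avoid dividing by zero for constant features
        Xc = Xc / s
    return Xc, mu

# Example: standardize a small random data matrix with features on very different scales.
X = np.random.default_rng(0).normal(size=(100, 3)) * [1.0, 10.0, 100.0]
Xn, mu = mean_normalize(X)
print(Xn.mean(axis=0).round(6), Xn.std(axis=0).round(6))
```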
Principal Component Analysis Algorithm
• Goal: Reduce data from n-dimensions to k-dimensions
• Step 1: Compute the "covariance matrix"
$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^\top$
• Step 2: Compute the "eigenvectors" of the covariance matrix
[U, S, V] = svd(Sigma);
$U = \left[u^{(1)}, u^{(2)}, \cdots, u^{(n)}\right] \in \mathbb{R}^{n \times n}$
Principal components: $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$
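The slide's `svd` call is Octave/MATLAB; an equivalent hedged NumPy sketch (my own helper name `pca_components`, assuming `X` has already been mean-normalized as above) is:

```python
import numpy as np

def pca_components(X):
    """Return U (columns = principal directions) and S for mean-normalized data X (m x n)."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m              # covariance matrix: (1/m) * sum_i x^(i) x^(i)^T
    U, S, Vt = np.linalg.svd(Sigma)    # columns of U are the eigenvectors of Sigma
    return U, S

# Usage: the first k columns of U are the top-k principal components.
X = np.random.default_rng(0).normal(size=(200, 5))
X = X - X.mean(axis=0)
U, S = pca_components(X)
U_k = U[:, :2]
```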
Principal Component Analysis Algorithm
• Goal: Reduce data from n-dimensions to k-dimensions
• Principal components: $u^{(1)}, u^{(2)}, \cdots, u^{(k)} \in \mathbb{R}^n$
• Projection: $z^{(i)} = \left[u^{(1)}, u^{(2)}, \cdots, u^{(k)}\right]^\top x^{(i)} \in \mathbb{R}^k$
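Putting the pieces together, here is a self-contained sketch of the projection and an approximate reconstruction (my own variable names; the variance-retained ratio is one common way to guide the choice of $k$, as in the outline's "Choosing the number of principal components"):

```python
import numpy as np

# Assumed setup: X is mean-normalized (m x n); U, S come from the SVD of its covariance matrix.
X = np.random.default_rng(0).normal(size=(200, 5))
X = X - X.mean(axis=0)
U, S, _ = np.linalg.svd((X.T @ X) / X.shape[0])

k = 2
U_k = U[:, :k]

Z = X @ U_k            # z^(i) = U_k^T x^(i), shape (m, k): the compressed representation
X_approx = Z @ U_k.T   # approximate reconstruction in R^n: x_approx^(i) = U_k z^(i)

# Fraction of variance retained by the top-k components.
print(S[:k].sum() / S.sum())
```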