Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis CMSC 678 UMBC
Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)
Covariance

covariance: how (linearly) correlated are variables

$s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (y_{ki} - \bar{y}_i)(y_{kj} - \bar{y}_j)$

where $\bar{y}_i$ and $\bar{y}_j$ are the means of variables $i$ and $j$, $y_{ki}$ and $y_{kj}$ are the values of variables $i$ and $j$ in object $k$, and $s_{ij}$ is the covariance of variables $i$ and $j$.

Collecting all pairs gives the covariance matrix

$\Sigma = \begin{pmatrix} s_{11} & \cdots & s_{1L} \\ \vdots & \ddots & \vdots \\ s_{L1} & \cdots & s_{LL} \end{pmatrix}$
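A minimal numpy sketch of this formula (the small data matrix below is made up for illustration); `np.cov` applies the same $1/(n-1)$ normalization, so the two should agree:

```python
import numpy as np

# Illustrative data matrix Y: n = 4 objects (rows), L = 2 variables (columns)
Y = np.array([[ 4.0,  2.0],
              [ 2.0,  1.0],
              [ 2.0, -1.0],
              [-2.0, -1.0]])
n, L = Y.shape

means = Y.mean(axis=0)                # per-variable means  y-bar_j
centered = Y - means                  # y_kj - y-bar_j
S = centered.T @ centered / (n - 1)   # s_ij = 1/(n-1) sum_k (y_ki - y-bar_i)(y_kj - y-bar_j)

# np.cov uses the same 1/(n-1) normalization (rows are observations here)
assert np.allclose(S, np.cov(Y, rowvar=False))
print(S)
```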
Eigenvalues and Eigenvectors

$B x = \lambda x$, where $B$ is a matrix, $x$ a vector, and $\lambda$ a scalar

for a given matrix operation (multiplication): which non-zero vector(s) change only by a single scalar multiplication?

Example: $B = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}$, so $B \begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} y + 5z \\ z \end{pmatrix}$

$\begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix}$: here $(1, 0)$ is the only non-zero vector (up to scaling) that $B$ merely scales.
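A quick numpy check of this example (a hypothetical snippet, not from the slides): `np.linalg.eig` recovers the repeated eigenvalue 1 and the single eigenvector direction (1, 0).

```python
import numpy as np

B = np.array([[1.0, 5.0],
              [0.0, 1.0]])

# Multiplying by B shears the plane: (y, z) -> (y + 5z, z)
print(B @ np.array([1.0, 2.0]))   # [11.  2.]

# Eigenvectors are the directions B only rescales.  For this shear the
# eigenvalue 1 is repeated, and (1, 0) is the only eigenvector up to scale.
eigvals, eigvecs = np.linalg.eig(B)
print(eigvals)                    # [1. 1.]
print(eigvecs[:, 0])              # [1. 0.]
```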
Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)
Dimensionality Reduction

Map N instances with D input features (the original, lightly preprocessed data) to N instances with L reduced features (a compressed representation).
Dimensionality Reduction

Trade-off: clarity of representation vs. ease of understanding; oversimplification risks losing important or relevant information. (Courtesy Antano Žilinsko)
Why "maximize" the variance? How can we efficiently summarize? We maximize the variance captured within our summarization; we do not increase the variance in the dataset itself. How can we capture the most information with the fewest axes? (See the sketch below comparing two projection directions.)
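As a rough illustration (made-up points and directions, not from the slides), projecting the same 2-D data onto different axes captures different amounts of variance:

```python
import numpy as np

pts = np.array([[ 4.0,  2.0],
                [ 2.0,  1.0],
                [ 2.0, -1.0],
                [-2.0, -1.0]])

# Compare the variance of the 1-D summaries along two unit-length directions
for direction in (np.array([0.0, 1.0]),                  # the y-axis
                  np.array([2.0, 1.0]) / np.sqrt(5.0)):  # along (2, 1)
    proj = pts @ direction                               # 1-D summary of each point
    print(direction, proj.var(ddof=1))                   # variance captured
```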
Summarizing Redundant Information

Consider the points (4,2), (2,1), (2,-1), (-2,-1).

In the standard basis: (2,1) = 2*(1,0) + 1*(0,1)

Using the basis vectors $u_1 = (2,1)$ and $u_2 = (2,-1)$ instead:
(2,1) = 1*$u_1$ + 0*$u_2$
(4,2) = 2*$u_1$ + 0*$u_2$
(2,-1) = 0*$u_1$ + 1*$u_2$
(-2,-1) = -1*$u_1$ + 0*$u_2$

(Is this the most general choice? These vectors aren't orthogonal.)
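A small sketch of this change of basis (illustrative, not from the slides): stacking $u_1$ and $u_2$ as the columns of a matrix and solving a linear system gives each point's coordinates in the new, non-orthogonal basis.

```python
import numpy as np

# Columns are the (non-orthogonal) basis vectors u1 = (2, 1) and u2 = (2, -1)
U = np.array([[2.0,  2.0],
              [1.0, -1.0]])

points = np.array([[ 4.0,  2.0],
                   [ 2.0,  1.0],
                   [ 2.0, -1.0],
                   [-2.0, -1.0]])

# Solve U c = p for each point p: c holds the coordinates in the new basis
coeffs = np.linalg.solve(U, points.T).T
print(coeffs)
# [[ 2.  0.]     (4,2)   = 2*u1 + 0*u2
#  [ 1.  0.]     (2,1)   = 1*u1 + 0*u2
#  [ 0.  1.]     (2,-1)  = 0*u1 + 1*u2
#  [-1.  0.]]    (-2,-1) = -1*u1 + 0*u2
```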
Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA, LDiscA) and Principal Component Analysis (PCA) Summarize D-dimensional input data by uncorrelated axes Uncorrelated axes are also called principal components Use the first L components to account for as much variance as possible
Geometric Rationale of LDiscA & PCA

Objective: rigidly rotate the axes of the D-dimensional space to new positions (principal axes), ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis D has the lowest variance; the covariance among each pair of principal axes is zero (the principal axes are uncorrelated). (Courtesy Antano Žilinsko)
Remember: MAP Classifiers are Optimal for Classification

$\min \mathbb{E}\left[\ell_{0/1}(z, \hat{z})\right] \;\Longleftrightarrow\; \max \prod_i p(\hat{z}_i = z_i \mid x_i)$

$p(\hat{z}_i = z_i \mid x_i) \propto p(x_i \mid \hat{z}_i)\, p(\hat{z}_i)$

posterior $\propto$ class-conditional likelihood $\times$ class prior, with $x_i \in \mathbb{R}^D$
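A toy numeric sketch of the MAP rule (all numbers made up): the evidence $p(x_i)$ cancels in the arg max, so we only compare likelihood times prior.

```python
import numpy as np

likelihoods = np.array([0.02, 0.10, 0.05])   # p(x_i | k) for classes k = 0, 1, 2
priors = np.array([0.5, 0.2, 0.3])           # p(k)

posteriors_unnorm = likelihoods * priors     # proportional to p(k | x_i)
print(np.argmax(posteriors_unnorm))          # MAP class: 1
```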
Linear Discriminant Analysis MAP Classifier where: 1. class-conditional likelihoods are Gaussian 2. common covariance among class likelihoods
LDiscA: (1) What if likelihoods are Gaussian?

$p(\hat{z}_i = z_i \mid x_i) \propto p(x_i \mid \hat{z}_i)\, p(\hat{z}_i)$

$p(x_i \mid k) = \mathcal{N}(\mu_k, \Sigma_k) = \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right)}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}}$

https://upload.wikimedia.org/wikipedia/commons/5/57/Multivariate_Gaussian.png
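A sketch that evaluates this density directly and checks it against scipy (the parameters $\mu_k$, $\Sigma_k$ and the point $x_i$ below are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu_k = np.array([0.0, 1.0])
Sigma_k = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
x_i = np.array([0.5, 0.5])

D = x_i.shape[0]
diff = x_i - mu_k
quad = diff @ np.linalg.solve(Sigma_k, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
dens = np.exp(-0.5 * quad) / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma_k)))

# should match scipy's multivariate normal pdf
assert np.isclose(dens, multivariate_normal(mu_k, Sigma_k).pdf(x_i))
print(dens)
```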
LDiscA: (2) Shared Covariance

Compare two classes $k$ and $l$ via the log-ratio of their posteriors:

$\log \frac{p(\hat{z}_i = k \mid x_i)}{p(\hat{z}_i = l \mid x_i)} = \log \frac{p(x_i \mid k)}{p(x_i \mid l)} + \log \frac{p(k)}{p(l)}$

$= \log \frac{p(k)}{p(l)} + \log \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right) \big/ \left((2\pi)^{D/2}|\Sigma_k|^{1/2}\right)}{\exp\left(-\frac{1}{2}(x_i - \mu_l)^T \Sigma_l^{-1} (x_i - \mu_l)\right) \big/ \left((2\pi)^{D/2}|\Sigma_l|^{1/2}\right)}$

Now assume a shared covariance, $\Sigma_k = \Sigma_l = \Sigma$. Then

$= \log \frac{p(k)}{p(l)} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x_i^T \Sigma^{-1} (\mu_k - \mu_l)$

which is linear in $x_i$ (check for yourself: why did the quadratic $x_i$ terms cancel?)

Rewriting only in terms of $x_i$ (the data) and single-class terms:

$= \left( x_i^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log p(k) \right) - \left( x_i^T \Sigma^{-1} \mu_l - \tfrac{1}{2} \mu_l^T \Sigma^{-1} \mu_l + \log p(l) \right)$
Classify via Linear Discriminant Functions

$\delta_k(x_i) = x_i^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log p(k)$

$\arg\max_k \delta_k(x_i)$ is equivalent to the MAP classifier.
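A vectorized sketch of these discriminant functions (the helper names `lda_discriminants` and `lda_predict` are mine, not a standard API):

```python
import numpy as np

def lda_discriminants(X, mus, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log p(k).

    X: (N, D) data, mus: (K, D) class means, Sigma: (D, D) shared covariance,
    priors: (K,) class priors.
    """
    Sigma_inv_mus = np.linalg.solve(Sigma, mus.T)    # (D, K): Sigma^{-1} mu_k
    linear = X @ Sigma_inv_mus                       # (N, K): x^T Sigma^{-1} mu_k
    const = -0.5 * np.sum(mus.T * Sigma_inv_mus, axis=0) + np.log(priors)   # (K,)
    return linear + const                            # (N, K)

def lda_predict(X, mus, Sigma, priors):
    # MAP decision: pick the class with the largest discriminant
    return np.argmax(lda_discriminants(X, mus, Sigma, priors), axis=1)
```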
LDiscA

Parameters to learn: $p(k)\ \forall k$, $\mu_k\ \forall k$, $\Sigma$

$p(k) = \frac{N_k}{N}$, where $N_k$ is the number of items labeled with class $k$

$\mu_k = \frac{1}{N_k} \sum_{i : z_i = k} x_i$

$\Sigma = \frac{1}{N - K} \sum_k \text{scatter}_k = \frac{1}{N - K} \sum_k \sum_{i : z_i = k} (x_i - \mu_k)(x_i - \mu_k)^T$ (one option for the within-class covariance)
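A sketch of these estimates in numpy (the helper name `fit_ldisca` is mine); it returns the class priors, class means, and the pooled within-class covariance with the $N - K$ normalization above:

```python
import numpy as np

def fit_ldisca(X, z, K):
    """Estimate p(k), mu_k, and the pooled within-class covariance Sigma
    from data X (N, D) and integer labels z in {0, ..., K-1}."""
    N, D = X.shape
    priors = np.array([np.mean(z == k) for k in range(K)])        # N_k / N
    mus = np.array([X[z == k].mean(axis=0) for k in range(K)])    # (K, D)
    Sigma = np.zeros((D, D))
    for k in range(K):
        diff = X[z == k] - mus[k]                                 # x_i - mu_k for class k
        Sigma += diff.T @ diff                                    # class-k scatter
    Sigma /= (N - K)                                              # pooled estimate
    return priors, mus, Sigma
```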
Computational Steps for Full-Dimensional LDiscA

1. Compute means, priors, and covariance
2. Diagonalize the covariance: $\Sigma = U D U^T$ (eigendecomposition, where $U$ is an orthonormal matrix of eigenvectors and $D$ is a diagonal matrix of eigenvalues)
3. Sphere the data (get unit covariance): $X^* = D^{-1/2} U^T X$
4. Classify according to the linear discriminant functions $\delta_k(x_i^*)$
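A sketch of steps 2-4, assuming the parameters come from an estimation step like the one above; after sphering, the shared covariance is the identity, so the discriminant simplifies to $x^{*T}\mu_k^* - \frac{1}{2}\|\mu_k^*\|^2 + \log p(k)$.

```python
import numpy as np

def sphere_and_classify(X, mus, Sigma, priors):
    # 2. Diagonalize the shared covariance: Sigma = U D U^T
    eigvals, U = np.linalg.eigh(Sigma)          # assumes Sigma is full rank
    # 3. Sphere the data and the class means: x* = D^{-1/2} U^T x
    W = U / np.sqrt(eigvals)                    # columns of U scaled by D^{-1/2}
    X_star, mus_star = X @ W, mus @ W           # rows become sphered vectors
    # 4. With identity covariance: delta_k(x*) = x*^T mu_k* - 1/2 ||mu_k*||^2 + log p(k)
    deltas = X_star @ mus_star.T - 0.5 * np.sum(mus_star ** 2, axis=1) + np.log(priors)
    return np.argmax(deltas, axis=1)
```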
Two Extensions to LDiscA

Quadratic Discriminant Analysis (QDA): keep separate covariances per class

$\delta_k(x_i) = -\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log p(k)$

Regularized LDiscA: interpolate between the shared covariance estimate (LDiscA) and the class-specific estimates (QDA)

$\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma$
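A sketch of both extensions (helper names are mine): the QDA discriminant keeps $\Sigma_k$ per class, and the regularized covariance interpolates with a shared $\Sigma$ via $\alpha$.

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, prior_k):
    """QDA: delta_k(x) with a class-specific covariance Sigma_k."""
    diff = x - mu_k
    quad = diff @ np.linalg.solve(Sigma_k, diff)   # (x - mu_k)^T Sigma_k^{-1} (x - mu_k)
    _, logdet = np.linalg.slogdet(Sigma_k)         # log |Sigma_k|
    return -0.5 * quad - 0.5 * logdet + np.log(prior_k)

def regularized_covariance(Sigma_k, Sigma, alpha):
    """Regularized LDiscA: Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma."""
    return alpha * Sigma_k + (1 - alpha) * Sigma
```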
Vowel Classification

[Figure: decision regions for LDiscA (left) vs. QDA (right) on the vowel data; see ESL 4.3]

Regularized LDiscA: $\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma$