
Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis - PowerPoint PPT Presentation



  1. Dimensionality Reduction: Linear Discriminant Analysis and Principal Component Analysis CMSC 678 UMBC

  2. Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

  3. Covariance. Covariance measures how (linearly) correlated two variables are: \sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \mu_i)(x_{kj} - \mu_j), where \mu_i and \mu_j are the means of variables i and j, x_{ki} is the value of variable i in object k, and \sigma_{ij} is the covariance of variables i and j.

  4. Covariance. Covariance measures how (linearly) correlated two variables are: \sigma_{ij} = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ki} - \mu_i)(x_{kj} - \mu_j), where \mu_i and \mu_j are the means of variables i and j and x_{ki} is the value of variable i in object k. Collecting all pairs gives the covariance matrix \Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1D} \\ \vdots & \ddots & \vdots \\ \sigma_{D1} & \cdots & \sigma_{DD} \end{pmatrix}, with \sigma_{ij} = \sigma_{ji}.
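
The covariance formula above maps directly onto a few lines of numpy; this is a minimal sketch (the toy data matrix X and its shape are illustrative assumptions, not from the slides), and np.cov computes the same (N-1)-normalized estimate.

```python
import numpy as np

# toy data: N = 4 objects (rows), D = 2 variables (columns)
X = np.array([[ 4.0,  2.0],
              [ 2.0,  1.0],
              [ 2.0, -1.0],
              [-2.0, -1.0]])

N, D = X.shape
mu = X.mean(axis=0)              # per-variable means mu_j
Xc = X - mu                      # center each variable
Sigma = Xc.T @ Xc / (N - 1)      # D x D covariance matrix, sigma_ij = sigma_ji

# numpy's built-in estimator agrees (rows = observations)
assert np.allclose(Sigma, np.cov(X, rowvar=False))
```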

  5. Eigenvalues and Eigenvectors: A x = \lambda x, where A is a matrix, x a vector, and \lambda a scalar. For a given matrix operation (multiplication): what non-zero vector(s) change only linearly, i.e., by a single scalar multiplication?

  6. Eigenvalues and Eigenvectors: A x = \lambda x (matrix A, vector x, scalar \lambda), with A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}.

  7. Eigenvalues and Eigenvectors: A x = \lambda x with A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}. For any vector, \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + 5y \\ y \end{pmatrix}, and we want this to equal \lambda \begin{pmatrix} x \\ y \end{pmatrix}.

  8. Eigenvalues and Eigenvectors: with A = \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix}, we have \begin{pmatrix} 1 & 5 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix}; up to scaling, \begin{pmatrix} 1 \\ 0 \end{pmatrix} is the only non-zero vector that A merely scales.
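
A quick numerical check of this example; np.linalg.eig returns the eigenvalues and unit-norm eigenvectors (as columns) and confirms that \lambda = 1 with eigendirection (1, 0) is the only one.

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [0.0, 1.0]])

vals, vecs = np.linalg.eig(A)
print(vals)         # [1. 1.]  -- a repeated eigenvalue, lambda = 1
print(vecs[:, 0])   # [1. 0.]  -- the only eigendirection (up to scale)

v = np.array([1.0, 0.0])
assert np.allclose(A @ v, 1.0 * v)   # A merely scales v (here by 1)
```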

  9. Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

  10. Dimensionality Reduction: map N original instances (lightly preprocessed data) with D input features to a compressed representation of the same N instances with L reduced features.
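
As one concrete instance of this N x D to N x L mapping, here is a sketch using scikit-learn's PCA (covered later in the outline) purely as an illustration; the shapes and random data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(100, 10)   # N = 100 instances, D = 10 input features
L = 3
Z = PCA(n_components=L).fit_transform(X)      # compressed representation

print(X.shape, "->", Z.shape)                 # (100, 10) -> (100, 3)
```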

  11. Dimensionality Reduction: clarity of representation vs. ease of understanding; oversimplification: loss of important or relevant information. Courtesy Antanas Žilinskas

  12. Why "maximize" the variance? How can we efficiently summarize? We maximize the variance within our summarization; we don't increase the variance in the dataset. How can we capture the most information with the fewest number of axes?

  13. Summarizing Redundant Information: the points (4,2), (2,1), (2,-1), and (-2,-1).

  14. Summarizing Redundant Information: the points (4,2), (2,1), (2,-1), and (-2,-1); in the standard basis, (2,1) = 2*(1,0) + 1*(0,1).

  15. Summarizing Redundant Information: take u_1 = (2,1) and u_2 = (2,-1) as basis vectors. Then (2,1) = 1*u_1 + 0*u_2, (4,2) = 2*u_1 + 0*u_2, and (-2,-1) = -u_1.

  16. Summarizing Redundant Information: take u_1 = (2,1) and u_2 = (2,-1) as basis vectors. Then (2,1) = 1*u_1 + 0*u_2, (4,2) = 2*u_1 + 0*u_2, and (-2,-1) = -u_1. (Is this the most general choice? These basis vectors aren't orthogonal.)
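
A small sketch of this change of basis: stacking the basis vectors u_1 = (2,1) and u_2 = (2,-1) from the slide as columns of a matrix B and solving B c = x recovers the coefficients quoted above.

```python
import numpy as np

# basis vectors as columns: u1 = (2, 1), u2 = (2, -1)
B = np.array([[2.0,  2.0],
              [1.0, -1.0]])

for x in ([2.0, 1.0], [4.0, 2.0], [-2.0, -1.0]):
    coeffs = np.linalg.solve(B, np.array(x))   # coefficients of x in the u1/u2 basis
    print(x, "->", coeffs)                     # e.g. (4,2) -> [2. 0.], i.e. 2*u1 + 0*u2
```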

  17. Outline Linear Algebra/Math Review Two Methods of Dimensionality Reduction Linear Discriminant Analysis (LDA, LDiscA) Principal Component Analysis (PCA)

  18. Linear Discriminant Analysis (LDA, LDiscA) and Principal Component Analysis (PCA) Summarize D-dimensional input data by uncorrelated axes Uncorrelated axes are also called principal components Use the first L components to account for as much variance as possible

  19. Geometric Rationale of LDiscA & PCA. Objective: rigidly rotate the axes of the D-dimensional space to new positions (principal axes), ordered such that principal axis 1 has the highest variance, axis 2 the next highest variance, ..., and axis D the lowest variance, and such that the covariance among each pair of principal axes is zero (the principal axes are uncorrelated). Courtesy Antanas Žilinskas

  20. Remember: MAP Classifiers are Optimal for Classification. Minimizing the expected 0/1 loss, \min \sum_i \mathbb{E}[\ell_{0/1}(y_i, \hat{y}_i)], is equivalent to maximizing \sum_i p(\hat{y}_i = y_i \mid x_i), where p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid y_i)\, p(\hat{y}_i = y_i): posterior \propto class-conditional likelihood \times class prior, with x_i \in \mathbb{R}^D.

  21. Linear Discriminant Analysis: a MAP classifier where (1) the class-conditional likelihoods are Gaussian and (2) the classes share a common covariance.

  22. LDiscA: (1) What if likelihoods are Gaussian? p(\hat{y}_i = y_i \mid x_i) \propto p(x_i \mid y_i)\, p(\hat{y}_i = y_i), with class-conditional likelihood p(x_i \mid k) = \mathcal{N}(\mu_k, \Sigma_k) = \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k)\right)}{(2\pi)^{D/2} |\Sigma_k|^{1/2}}. https://upload.wikimedia.org/wikipedia/commons/5/57/Multivariate_Gaussian.png
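
A sketch of evaluating that class-conditional Gaussian density directly from the formula above; the example mean, covariance, and query point are assumptions, and scipy's multivariate_normal gives the same value.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """p(x | k) for a D-dimensional Gaussian with mean mu and covariance Sigma."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5  # (2 pi)^{D/2} |Sigma|^{1/2}
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

assert np.isclose(gaussian_density(x, mu, Sigma),
                  multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```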

  23. LDiscA: (2) Shared Covariance. Compare two classes k and l through the log posterior ratio: \log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(x_i \mid k)}{p(x_i \mid l)} + \log \frac{p(k)}{p(l)}.

  24. LDiscA: (2) Shared Covariance. \log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(k)}{p(l)} + \log \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k)\right) / \left((2\pi)^{D/2} |\Sigma_k|^{1/2}\right)}{\exp\left(-\frac{1}{2}(x_i - \mu_l)^T \Sigma_l^{-1}(x_i - \mu_l)\right) / \left((2\pi)^{D/2} |\Sigma_l|^{1/2}\right)}.

  25. LDiscA: (2) Shared Covariance. Now assume \Sigma_k = \Sigma_l = \Sigma: \log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(k)}{p(l)} + \log \frac{\exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma^{-1}(x_i - \mu_k)\right)}{\exp\left(-\frac{1}{2}(x_i - \mu_l)^T \Sigma^{-1}(x_i - \mu_l)\right)}, since the (2\pi)^{D/2} |\Sigma|^{1/2} normalizers now cancel.

  26. LDiscA: (2) Shared Covariance. \log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \log \frac{p(k)}{p(l)} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1}(\mu_k - \mu_l) + x_i^T \Sigma^{-1}(\mu_k - \mu_l), which is linear in x_i (check for yourself: why did the quadratic x_i terms cancel?).

  27. LDiscA: (2) Shared Covariance. Rewriting only in terms of x_i (the data) and single-class terms: \log \frac{p(\hat{y}_i = k \mid x_i)}{p(\hat{y}_i = l \mid x_i)} = \left[ x_i^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log p(k) \right] - \left[ x_i^T \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T \Sigma^{-1}\mu_l + \log p(l) \right], still linear in x_i (check for yourself: why did the quadratic x_i terms cancel?).
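
For that check-for-yourself question, expanding the two quadratic forms under the shared \Sigma makes the cancellation explicit:

```latex
\begin{aligned}
-\tfrac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k)
  &+ \tfrac{1}{2}(x_i-\mu_l)^T\Sigma^{-1}(x_i-\mu_l) \\
&= -\tfrac{1}{2}\bigl(x_i^T\Sigma^{-1}x_i - 2\,x_i^T\Sigma^{-1}\mu_k + \mu_k^T\Sigma^{-1}\mu_k\bigr)
   + \tfrac{1}{2}\bigl(x_i^T\Sigma^{-1}x_i - 2\,x_i^T\Sigma^{-1}\mu_l + \mu_l^T\Sigma^{-1}\mu_l\bigr) \\
&= x_i^T\Sigma^{-1}(\mu_k-\mu_l)
   - \tfrac{1}{2}\bigl(\mu_k^T\Sigma^{-1}\mu_k - \mu_l^T\Sigma^{-1}\mu_l\bigr)
\end{aligned}
```

The x_i^T \Sigma^{-1} x_i terms appear with opposite signs only because both classes use the same \Sigma; with per-class covariances (QDA, below) they do not cancel and the discriminant stays quadratic in x_i.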

  28. Classify via Linear Discriminant Functions: \delta_k(x_i) = x_i^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log p(k); choosing \arg\max_k \delta_k(x_i) is equivalent to the MAP classifier.
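
A minimal sketch of these discriminant functions, assuming the class means (a K x D array), the shared covariance, and the priors have already been estimated; the helper names (discriminant_scores, predict) are illustrative, not from the slides.

```python
import numpy as np

def discriminant_scores(x, means, Sigma, priors):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 * mu_k^T Sigma^{-1} mu_k + log p(k), for all k."""
    Sigma_inv_means = np.linalg.solve(Sigma, means.T).T   # row k is Sigma^{-1} mu_k
    return (x @ Sigma_inv_means.T
            - 0.5 * np.einsum('kd,kd->k', means, Sigma_inv_means)
            + np.log(priors))

def predict(x, means, Sigma, priors):
    """MAP prediction: the class index with the largest discriminant score."""
    return int(np.argmax(discriminant_scores(x, means, Sigma, priors)))
```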

  29. LDiscA. Parameters to learn: the priors \{p(k)\}_k, the class means \{\mu_k\}_k, and the shared covariance \Sigma. Priors: p(k) \propto N_k, the number of items labeled with class k.

  30. LDiscA. Parameters to learn: \{p(k)\}_k, \{\mu_k\}_k, and \Sigma. Priors: p(k) \propto N_k. Class means: \mu_k = \frac{1}{N_k}\sum_{i: y_i = k} x_i.

  31. LDiscA. Parameters to learn: \{p(k)\}_k, \{\mu_k\}_k, and \Sigma. Priors: p(k) \propto N_k. Class means: \mu_k = \frac{1}{N_k}\sum_{i: y_i = k} x_i. One option for \Sigma is the within-class covariance: \Sigma = \frac{1}{N - K}\sum_k \mathrm{scatter}_k = \frac{1}{N - K}\sum_k \sum_{i: y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T.
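
A sketch of estimating all three sets of parameters from a labeled dataset (X of shape N x D, y with integer class labels); fit_ldisca is an assumed helper name, and the pooled within-class covariance is the "one option" named above.

```python
import numpy as np

def fit_ldisca(X, y):
    """Estimate priors p(k), class means mu_k, and the pooled within-class covariance."""
    N, D = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])         # p(k) proportional to N_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])   # mu_k
    Sigma = np.zeros((D, D))
    for k, mu_k in zip(classes, means):
        diffs = X[y == k] - mu_k
        Sigma += diffs.T @ diffs                                   # class-k scatter
    Sigma /= (N - K)                                               # within-class covariance
    return priors, means, Sigma
```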

  32. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance.

  33. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance. 2. Diagonalize the covariance: \Sigma = U D U^T (an eigendecomposition, with U a D \times D orthonormal matrix of eigenvectors and D a diagonal matrix of eigenvalues).

  34. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance. 2. Diagonalize the covariance: \Sigma = U D U^T. 3. Sphere the data: X^* = D^{-1/2} U^T X.

  35. Computational Steps for Full-Dimensional LDiscA: 1. Compute means, priors, and covariance. 2. Diagonalize the covariance: \Sigma = U D U^T. 3. Sphere the data (get unit covariance): X^* = D^{-1/2} U^T X. 4. Classify according to the linear discriminant functions \delta_k(x_i^*).
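
Putting the four steps together in one sketch, reusing the fit_ldisca and predict helpers assumed in the earlier sketches (and assuming \Sigma is full rank so D^{-1/2} exists):

```python
import numpy as np

def ldisca_classify(X_train, y_train, X_test):
    priors, means, Sigma = fit_ldisca(X_train, y_train)   # step 1: means, priors, covariance
    eigvals, U = np.linalg.eigh(Sigma)                     # step 2: Sigma = U D U^T
    W = U / np.sqrt(eigvals)                               # columns of U scaled by D^{-1/2}
    X_star = X_test @ W                                    # step 3: sphere the data
    means_star = means @ W                                 # project class means the same way
    I = np.eye(W.shape[1])                                 # sphered data has unit covariance
    return np.array([predict(x, means_star, I, priors)    # step 4: linear discriminants
                     for x in X_star])
```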

  36. Two Extensions to LDiscA. Quadratic Discriminant Analysis (QDA): keep a separate covariance per class, \delta_k(x_i) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k) + \log p(k).

  37. Two Extensions to LDiscA. Quadratic Discriminant Analysis (QDA): keep a separate covariance per class, \delta_k(x_i) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1}(x_i - \mu_k) + \log p(k). Regularized LDiscA: interpolate between the shared covariance estimate (LDiscA) and the class-specific estimates (QDA), \Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma.
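
A sketch of the two extensions, assuming per-class covariances and the pooled \Sigma have already been estimated; \alpha = 1 recovers QDA's class-specific covariances and \alpha = 0 recovers LDiscA's shared one.

```python
import numpy as np

def qda_score(x, mu_k, Sigma_k, prior_k):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log p(k)."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma_k, diff) + np.log(prior_k)

def regularized_covariances(Sigmas_k, Sigma_pooled, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma, one matrix per class."""
    return [alpha * S_k + (1.0 - alpha) * Sigma_pooled for S_k in Sigmas_k]
```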

  38. Vowel Classification: LDiscA (left) vs. QDA (right). (Figure: ESL 4.3)

  39. Vowel Classification: LDiscA (left) vs. QDA (right), and Regularized LDiscA with \Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha)\Sigma. (Figure: ESL 4.3)
