  1. Tufts COMP 135: Introduction to Machine Learning
     https://www.cs.tufts.edu/comp/135/2019s/
     Dimensionality Reduction & Embedding (part 2/2)
     Prof. Mike Hughes
     Many ideas/slides attributable to: Emily Fox (UW), Erik Sudderth (UCI)

  2. What will we learn?
     [Course overview diagram: data, examples $\{x_n\}_{n=1}^N$, a task, and a performance measure feed into Supervised Learning, Unsupervised Learning (a summary of the data $x$), and Reinforcement Learning.]

  3. Task: Embedding
     [Diagram: embedding shown as an Unsupervised Learning task, illustrated by a scatter plot with axes $x_1$ and $x_2$; Supervised Learning and Reinforcement Learning listed alongside.]

  4. Dim. Reduction/Embedding Unit Objectives
     • Goals of dimensionality reduction
       • Reduce feature vector size (keep signal, discard noise)
       • “Interpret” features: visualize/explore/understand
     • Common approaches
       • Principal Component Analysis (PCA) + Factor Analysis
       • t-SNE (“tee-snee”)
       • word2vec and other neural embeddings
     • Evaluation metrics
       • Storage size
       • Reconstruction error
       • “Interpretability”
       • Prediction error

  5. Example: Genes vs. geography (Nature, 2008)

  6. Example: Genes vs. geography (Nature, 2008)
     Where possible, we based the geographic origin on the observed country data for grandparents. We used a ‘strict consensus’ approach: if all observed grandparents originated from a single country, we used that country as the origin. If an individual’s observed grandparents originated from different countries, we excluded the individual. Where grandparental data were unavailable, we used the individual’s country of birth.
     Total sample size after exclusion: 1,387 subjects
     Features: over half a million variable DNA sites in the human genome

  7. Eigenvectors and Eigenvalues

  8. Source: https://textbooks.math.gatech.edu/ila/eigenvectors.html

  9. Demo: What is an Eigenvector?
     • http://setosa.io/ev/eigenvectors-and-eigenvalues/

  10. Centering the Data
      Goal: each feature’s mean = 0.0
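
      A minimal sketch of the centering step, assuming a NumPy feature matrix X of shape (N, F); the helper name `center` and the toy data are illustrative, not from the slides.

```python
import numpy as np

def center(X):
    """Return X with every column (feature) shifted to mean 0.0, plus the mean vector."""
    m = X.mean(axis=0)          # F-dimensional mean vector
    return X - m, m

# Example: after centering, each feature's mean is (numerically) zero
X = np.random.default_rng(0).normal(loc=5.0, size=(100, 3))
X_centered, m = center(X)
print(np.allclose(X_centered.mean(axis=0), 0.0))   # True
```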

  11. Why center?
      • Think of the mean vector as the simplest possible “reconstruction” of a dataset
      • No example-specific parameters, just one F-dim vector:
        $$\min_{m \in \mathbb{R}^F} \sum_{n=1}^N (x_n - m)^T (x_n - m), \qquad m^* = \mathrm{mean}(x_1, \ldots, x_N)$$
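
      A quick numerical check of the claim above, assuming “reconstruction error” means total squared error: a generic optimizer recovers the per-feature mean as the minimizer. The toy data here is made up.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # toy data: N=100, F=4

# Minimize sum_n (x_n - m)^T (x_n - m) over a single F-dimensional vector m
objective = lambda m: np.sum((X - m) ** 2)
m_star = minimize(objective, x0=np.zeros(4)).x

print(np.allclose(m_star, X.mean(axis=0), atol=1e-4))   # True: the mean is the minimizer
```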

  12. Principal Component Analysis

  13. Reconstruction with PCA
      $$x_i = W z_i + m$$
      where $x_i$ is the high-dim. data (F-vector), $z_i$ the low-dim. vector (K-vector), $W$ the basis (F x K), and $m$ the mean (F-vector).
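
      A sketch of this reconstruction equation using scikit-learn's PCA (the library is not named on the slide); note that sklearn stores the basis transposed as `components_` with shape (K, F), and the toy data here is made up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                    # toy data: N=200 examples, F=6 features
K = 2

pca = PCA(n_components=K).fit(X)
Z = pca.transform(X)                             # N x K low-dim codes z_i
X_hat = Z @ pca.components_ + pca.mean_          # x_hat_i = W z_i + m (components_ is K x F)
print(np.allclose(X_hat, pca.inverse_transform(Z)))   # True
```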

  14. Principal Component Analysis, training step: .fit()
      • Input:
        • X: training data, N x F (N high-dim. example vectors)
        • K: int, number of components, satisfying 1 <= K <= F
      • Output:
        • m: mean vector, size F
        • W: learned basis of eigenvectors, F x K
          • One F-dim. vector (magnitude 1) for each component
          • Each of the K vectors is orthogonal to every other

  15. Principal Component Analysis, transformation step: .transform()
      • Input:
        • X: training data, N x F (N high-dim. example vectors)
        • Trained PCA “model”:
          • m: mean vector, size F
          • W: learned basis of eigenvectors, F x K
            • One F-dim. vector (magnitude 1) for each component
            • Each of the K vectors is orthogonal to every other
      • Output:
        • Z: projected data, N x K
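
      A sketch of the .fit()/.transform() interface on these two slides, using scikit-learn's PCA on made-up data; the last check mirrors the “magnitude 1, mutually orthogonal” bullets.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N, F, K = 300, 8, 3
X = rng.normal(size=(N, F))

pca = PCA(n_components=K)
pca.fit(X)                           # training step
Z = pca.transform(X)                 # projected data, N x K

print(pca.mean_.shape)               # (F,)   the mean vector m
print(pca.components_.shape)         # (K, F) sklearn stores W transposed: rows are components
print(Z.shape)                       # (N, K)
# Each component has unit length and the K components are mutually orthogonal
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(K)))   # True
```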

  16. PCA Demo
      • http://setosa.io/ev/principal-component-analysis/

  17. Example: EigenFaces
      Credit: Erik Sudderth

  18. PCA Principles
      • Minimize reconstruction error
        • Should be able to recreate x from z
      • Equivalent to maximizing variance
        • Want reconstructions to retain maximum information

  19. PCA: How to Select K?
      • 1) Use downstream supervised task metric
        • Regression error
      • 2) Use memory constraints of task
        • Can’t store more than 50 dims for 1M examples? Take K=50
      • 3) Plot cumulative “variance explained” (see the sketch below)
        • Take K that seems to capture most or all variance
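
      A sketch of option 3: pick the smallest K whose cumulative “variance explained” passes a threshold. The 0.95 cutoff is an arbitrary choice, not from the slides, and scikit-learn's built-in digits dataset stands in for real data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                          # N x 64 digit images as a stand-in dataset
pca = PCA().fit(X)                              # keep all components

cum_pve = np.cumsum(pca.explained_variance_ratio_)
K = int(np.searchsorted(cum_pve, 0.95)) + 1     # smallest K with cumulative PVE >= 0.95
print(K, cum_pve[K - 1])
```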

  20. Empirical Variance of Data X
      $$\mathrm{Var}(X) = \frac{1}{N} \sum_{n=1}^N \sum_{f=1}^F x_{nf}^2 = \frac{1}{N} \sum_{n=1}^N x_n^T x_n$$
      • (Assumes each feature is centered)

  21. Variance of reconstructions
      Writing $\hat{x}_n = z_{n1} w_1 + \ldots + z_{nK} w_K$ for the (centered) reconstruction of example $n$:
      $$\frac{1}{N} \sum_{n=1}^N \hat{x}_n^T \hat{x}_n = \frac{1}{N} \sum_{n=1}^N (z_{n1} w_1 + \ldots + z_{nK} w_K)^T (z_{n1} w_1 + \ldots + z_{nK} w_K) = \frac{1}{N} \sum_{n=1}^N \sum_{k=1}^K z_{nk}^2 = \sum_{k=1}^K \lambda_k$$
      Just sum up the top K eigenvalues!
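
      A numerical check of the identity above on made-up data: the variance of the rank-K reconstructions equals the sum of the top K eigenvalues of the (1/N-scaled) covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, K = 500, 10, 3
X = rng.normal(size=(N, F)) @ rng.normal(size=(F, F))   # correlated toy features
Xc = X - X.mean(axis=0)                                  # center each feature

cov = Xc.T @ Xc / N                                      # (1/N)-scaled covariance
eigvals, eigvecs = np.linalg.eigh(cov)                   # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # sort descending

W = eigvecs[:, :K]                 # F x K orthonormal basis
Z = Xc @ W                         # N x K projections z_n
var_recon = np.sum(Z ** 2) / N     # variance of the reconstructions sum_k z_nk w_k
print(np.allclose(var_recon, eigvals[:K].sum()))         # True
```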

  22. Proportion of Variance Explained by first K components
      $$\mathrm{PVE}(K) = \frac{\sum_{k=1}^K \lambda_k}{\sum_{f=1}^F \lambda_f}$$

  23. Variance explained curve
      [Plot: cumulative proportion of variance explained vs. number of components K]

  24. PCA Summary
      PRO
      • Usually fast to train, fast to test
        • Slowest step: finding K eigenvectors of an F x F matrix
      • Nested model
        • PCA with K=5 overlaps with PCA with K=4
      CON
      • Sensitive to rescaling of input data features
      • Learned basis known only up to +/- scaling
      • Not often best for supervised tasks

  25. PCA: Best Practices
      • If features all have different units:
        • Try rescaling them all to lie within (-1, +1) or to have variance 1
      • If features have the same units, this may not be necessary

  26. Beyond PCA: Factor Analysis

  27. A Probabilistic Model
      $$x_i = W z_i + m + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, I_F)$$
      where $x_i$ is the high-dim. data (F-vector), $z_i$ the low-dim. vector (K-vector), $W$ the basis (F x K), $m$ the mean (F-vector), and $\epsilon_i$ the noise (F-vector).

  28. A Probabilistic Model
      $$x_i = W z_i + m + \epsilon_i$$
      In terms of matrix math: $X = WZ + M + E$

  29. A Probabilistic Model
      $$x_i = W z_i + m + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}\!\left(0, \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}\right)$$
      where $x_i$ is the high-dim. data (F-vector), $z_i$ the low-dim. vector (K-vector), $W$ the basis (F x K), $m$ the mean (F-vector), and $\epsilon_i$ the noise (F-vector).

  30. Face Dataset
      Is this noise model realistic?
      $$\epsilon_i \sim \mathcal{N}\!\left(0, \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}\right)$$

  31. Each pixel might need its own variance!
      $$\epsilon_i \sim \mathcal{N}\!\left(0, \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix}\right)$$

  32. Factor Analysis
      • Finds a linear basis like PCA, but allows per-feature estimation of variance:
        $$\epsilon_i \sim \mathcal{N}\!\left(0, \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{bmatrix}\right)$$
      • Small detail: columns of the estimated basis may not be orthogonal
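
      A sketch contrasting the noise models, using scikit-learn's FactorAnalysis on made-up data where each feature has a different noise level; `noise_variance_` gives one estimate per feature, which PCA's single implicit noise level cannot capture.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
N, F, K = 2000, 6, 2
Z_true = rng.normal(size=(N, K))                         # toy latent codes
W_true = rng.normal(size=(F, K))                         # toy linear basis
noise_std = np.array([0.1, 0.2, 0.5, 1.0, 1.5, 2.0])     # a different sigma_f per feature
X = Z_true @ W_true.T + noise_std * rng.normal(size=(N, F))

fa = FactorAnalysis(n_components=K).fit(X)
print(fa.noise_variance_)     # one estimated variance per feature (roughly noise_std ** 2)
```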

  33. PCA vs Factor Analysis

  34. Matrix Factorization and Singular Value Decomposition

  35. Matrix Factorization (MF)
      • User $i$ represented by vector $u_i \in \mathbb{R}^K$
      • Item $j$ represented by vector $v_j \in \mathbb{R}^K$
      • Inner product $u_i^T v_j$ approximates the utility $y_{ij}$
      • Intuition:
        • Two items with similar vectors get similar utility scores from the same user
        • Two users with similar vectors give similar utility scores to the same item
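
      A minimal sketch of the inner-product scoring rule described above; the user and item vectors here are random placeholders rather than learned factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 4, 5, 3
U = rng.normal(size=(n_users, K))      # one K-dim vector per user
V = rng.normal(size=(n_items, K))      # one K-dim vector per item

# Predicted utility of item j for user i is the inner product u_i . v_j
Y_hat = U @ V.T                        # n_users x n_items matrix of predicted utilities
print(np.allclose(Y_hat[2, 4], U[2] @ V[4]))   # True: each entry is one inner product
```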

  37. General Matrix Factorization: $X = ZW$

  38. SVD: Singular Value Decomposition
      Credit: Wikipedia

  39. Truncated SVD
      $$X = U D V^T$$
      truncated to the top K singular values: keep the first K columns of $U$, the K x K upper-left block of $D$, and the first K rows of $V^T$.
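
      A sketch of truncation with NumPy on made-up data: compute the full SVD, then keep only the top K singular values and their singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
K = 5

U, d, Vt = np.linalg.svd(X, full_matrices=False)
X_K = U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]     # rank-K approximation of X
print(X_K.shape)                                # (100, 20), but only rank K
print(np.linalg.matrix_rank(X_K))               # 5
```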

  40. Recall: Eigendecomposition
      Eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_K$ and eigenvectors $w_1, w_2, \ldots, w_K$

  41. Two ways to “fit” PCA
      • First, apply “centering” to X
      • Then, do one of these two options:
        • 1) Compute the SVD of X
          • Eigenvalues are rescaled entries of the diagonal D
          • Basis = first K columns of V
        • 2) Compute the covariance Cov(X)
          • Eigenvalues = largest eigenvalues of Cov(X)
          • Basis = corresponding eigenvectors of Cov(X)
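
      A sketch checking that the two options agree on made-up data: the eigenvalues match, and the bases match up to a +/- sign; the covariance is scaled by 1/N in both places for consistency.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                   # centering first
K = 2

# Option 1: SVD of the centered data
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt[:K].T                          # basis = first K columns of V
eigvals_svd = d[:K] ** 2 / len(Xc)        # eigenvalues are rescaled squared singular values

# Option 2: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)    # ascending order
W_cov = eigvecs[:, ::-1][:, :K]           # top-K eigenvectors
eigvals_cov = eigvals[::-1][:K]

print(np.allclose(eigvals_svd, eigvals_cov))              # True
print(np.allclose(np.abs(W_svd.T @ W_cov), np.eye(K)))    # True: same basis up to sign
```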

  42. Visualization with t-SNE

  43. Reducing Dimensionality of Digit Images
      INPUT: each image represented by a 784-dimensional vector
      Apply the PCA transformation with K=2
      OUTPUT: each image is a 2-dimensional vector
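
      A sketch of this pipeline in scikit-learn. It uses the built-in 8x8 digits (64-dim vectors) as a lightweight stand-in for the 784-dim MNIST images on the slide, and also computes a t-SNE embedding for comparison with the plots credited on the next slides.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)              # 8x8 digit images flattened to 64-dim vectors

Z_pca = PCA(n_components=2).fit_transform(X)     # linear 2-D embedding
Z_tsne = TSNE(n_components=2).fit_transform(X)   # nonlinear 2-D embedding
print(Z_pca.shape, Z_tsne.shape)                 # (1797, 2) (1797, 2)
```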

  44. Credit: Luuk Derksen (https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)

  45. Credit: Luuk Derksen (https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)
