

  1. PCA CS 446

  2-3. Supervised learning
So far, we've done supervised learning: given $((x_i, y_i))_{i=1}^n$, find $f$ with $f(x_i) \approx y_i$ (k-nn, decision trees, ...).
Most methods used (regularized) ERM: minimize the empirical risk
$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$
and hope the true risk $R$ is small (least squares, logistic regression, deep networks, SVM, perceptron, ...).
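
The slide only states the objective; as a purely illustrative aside (not from the slides, with hypothetical names and random data), here is a minimal NumPy sketch of the empirical risk for a linear predictor under the squared loss, minimized by least squares:

```python
import numpy as np

def empirical_risk(w, X, y):
    """Empirical risk (1/n) * sum_i loss(f(x_i), y_i) for f(x) = <w, x> and squared loss."""
    predictions = X @ w                  # f(x_i) for every row x_i
    losses = (predictions - y) ** 2      # squared loss on each example
    return losses.mean()                 # average over the n examples

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares minimizes this empirical risk
print(empirical_risk(w_hat, X, y))
```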

  4-6. Unsupervised learning
Now we only receive $(x_i)_{i=1}^n$, and the goal is... ?
◮ Encoding data in some compact representation (and decoding it).
◮ Data analysis: recovering "hidden structure" in data (e.g., recovering cliques or clusters).
◮ Features for supervised learning.
◮ ... ?
The task is less clear-cut; as of 2019, people are still trying to formalize it!

  7. Part 1: PCA (Principal Component Analysis)

  8-14. PCA motivation
Let's formulate a simplistic linear unsupervised method.
◮ Encoding (and decoding) data in some compact representation.
  Let's linearly map data in $\mathbb{R}^d$ to $\mathbb{R}^k$ and back.
◮ Data analysis: recovering "hidden structure" in data.
  Let's find out whether the data mostly lies on a low-dimensional subspace.
◮ Features for supervised learning.
  Let's feed the $k$-dimensional encoding (in $\mathbb{R}^k$) to supervised methods.

  15-21. SVD reminder
1. Singular-value triples: $(s, u, v)$ satisfies $Mv = su$ and $M^T u = sv$.
2. Thin SVD: $M = \sum_{i=1}^{r} s_i u_i v_i^T$.
3. Full SVD: $M = USV^T$.
4. "Operational" view of the SVD: for $M \in \mathbb{R}^{n \times d}$,
$$
M =
\begin{pmatrix} u_1 & \cdots & u_r & u_{r+1} & \cdots & u_n \end{pmatrix}
\begin{pmatrix} s_1 & & & 0 \\ & \ddots & & \\ & & s_r & \\ 0 & & & 0 \end{pmatrix}
\begin{pmatrix} v_1 & \cdots & v_r & v_{r+1} & \cdots & v_d \end{pmatrix}^{\top}.
$$
The first parts of $U$ and $V$ span the column space and row space (respectively); the second parts span the left and right nullspaces (respectively).
New: let $(U_k, S_k, V_k)$ denote the truncated SVD, where $U_k \in \mathbb{R}^{n \times k}$ and $V_k \in \mathbb{R}^{d \times k}$ consist of the first $k$ columns of $U$ and $V$, and $S_k \in \mathbb{R}^{k \times k}$ holds the top $k$ singular values.
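
As a sanity check on the reminders above (not part of the slides; a sketch with illustrative names), the following NumPy snippet computes the full SVD, verifies $M = USV^T$ and the triple relations $Mv = su$, $M^T u = sv$, and forms the truncated factors $(U_k, S_k, V_k)$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))                    # M in R^{n x d} with n = 6, d = 4

# Full SVD: U is n x n, Vt is d x d, s holds the nonzero singular values in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=True)
S = np.zeros_like(M)
S[:len(s), :len(s)] = np.diag(s)
assert np.allclose(M, U @ S @ Vt)              # M = U S V^T

# Each singular-value triple satisfies M v = s u and M^T u = s v.
assert np.allclose(M @ Vt[0], s[0] * U[:, 0])
assert np.allclose(M.T @ U[:, 0], s[0] * Vt[0])

# Truncated SVD: first k columns of U and V, top k singular values.
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T
M_k = U_k @ S_k @ V_k.T                        # best rank-k approximation of M
```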

  22-24. PCA (Principal Component Analysis)
Input: data as the rows of $X \in \mathbb{R}^{n \times d}$ with SVD $X = USV^T$, and an integer $k$.
Output: encoder $V_k$, decoder $V_k^T$, encoded data $XV_k = U_k S_k$.
The goal in unsupervised learning is unclear. We'll try to define it as "best encoding/decoding in the Frobenius sense":
$$
\min_{\substack{E \in \mathbb{R}^{d \times k} \\ D \in \mathbb{R}^{k \times d}}} \|X - XED\|_F^2 = \|X - XV_k V_k^T\|_F^2 .
$$
Note that $V_k V_k^T$ performs the orthogonal projection onto the subspace spanned by the columns of $V_k$; thus we are finding the "best $k$-dimensional projection of the data".
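
To make the encoder/decoder concrete, here is a minimal NumPy sketch of the procedure above (illustrative names, not taken from the course code): the encoder is $V_k$, the decoder is $V_k^T$, and the encoded data satisfies $XV_k = U_k S_k$:

```python
import numpy as np

def pca(X, k):
    """PCA via the SVD: return the encoder V_k, the encoded data X V_k, and the decoded data X V_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD of the data matrix
    V_k = Vt[:k].T                                     # encoder: first k right singular vectors (d x k)
    encoded = X @ V_k                                  # n x k representation; equals U_k S_k
    decoded = encoded @ V_k.T                          # back in R^d: the orthogonal projection X V_k V_k^T
    return V_k, encoded, decoded

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
V_k, encoded, decoded = pca(X, k=3)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(encoded, U[:, :3] * s[:3])          # X V_k = U_k S_k
```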

  25-28. PCA properties
Theorem. Let $X \in \mathbb{R}^{n \times d}$ with SVD $X = USV^T$ and an integer $k \le r$ be given. Then
$$
\min_{\substack{E \in \mathbb{R}^{d \times k} \\ D \in \mathbb{R}^{k \times d}}} \|X - XED\|_F^2
= \min_{\substack{D \in \mathbb{R}^{d \times k} \\ D^T D = I}} \|X - XDD^T\|_F^2
= \|X - XV_k V_k^T\|_F^2
= \sum_{i=k+1}^{r} s_i^2 .
$$
Additionally,
$$
\min_{\substack{D \in \mathbb{R}^{d \times k} \\ D^T D = I}} \|X - XDD^T\|_F^2
= \|X\|_F^2 - \max_{\substack{D \in \mathbb{R}^{d \times k} \\ D^T D = I}} \|XD\|_F^2
= \|X\|_F^2 - \|XV_k\|_F^2
= \|X\|_F^2 - \sum_{i=1}^{k} s_i^2 .
$$
Remark 1. The SVD is not unique, but $\sum_{i=1}^{r} s_i^2$ is identical across SVD choices.
Remark 2. As written, this is not a convex optimization problem!
Remark 3. The second form is interesting...
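
The identities in the theorem are easy to check numerically. The sketch below (illustrative, not from the slides) verifies that the projection error equals the sum of the squared tail singular values, and that it matches the second form $\|X\|_F^2 - \|XV_k\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
V_k = Vt[:k].T

# ||X - X V_k V_k^T||_F^2 = sum of the squared singular values beyond the k-th.
err = np.linalg.norm(X - X @ V_k @ V_k.T, 'fro') ** 2
assert np.isclose(err, np.sum(s[k:] ** 2))

# Second form: ||X||_F^2 - ||X V_k||_F^2 = ||X||_F^2 - sum of the top k squared singular values.
alt = np.linalg.norm(X, 'fro') ** 2 - np.linalg.norm(X @ V_k, 'fro') ** 2
assert np.isclose(err, alt)
assert np.isclose(alt, np.linalg.norm(X, 'fro') ** 2 - np.sum(s[:k] ** 2))
```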

  29. Centered PCA
Some treatments replace $X$ with $X - \mathbf{1}\mu^T$, where $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the mean of the data.
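
A minimal sketch of this centering step (illustrative; the slide only states the replacement): subtract the column mean before the SVD so the principal directions describe variation around the mean rather than around the origin, and add the mean back after decoding.

```python
import numpy as np

def centered_pca(X, k):
    """Center the rows of X, then run PCA on the centered matrix X - 1 mu^T."""
    mu = X.mean(axis=0)                                # mean mu = (1/n) sum_i x_i
    X_centered = X - mu                                # X - 1 mu^T (broadcast over the rows)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    V_k = Vt[:k].T                                     # principal directions of the centered data
    return mu, V_k, X_centered @ V_k                   # mean, encoder, encoded (centered) data

# To decode: X_hat = encoded @ V_k.T + mu.
```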
