PCA CS 446
Supervised learning

So far, we've done supervised learning: given $((x_i, y_i))_{i=1}^n$, find $f$ with $f(x_i) \approx y_i$.
  k-nn, decision trees, ...

Most methods used (regularized) ERM: minimize $\widehat{R}(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$, and hope the true risk $R$ is also small.
  least squares, logistic regression, deep networks, SVM, perceptron, ...
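As an aside (not from the slides), here is a minimal numpy sketch of ERM for the squared loss, i.e., ordinary least squares over linear predictors; the synthetic data and variable names are only meant to illustrate the objective $\widehat{R}$.

```python
# Minimal ERM sketch (assumed example): least squares minimizes the average
# squared loss (1/n) * sum_i (f(x_i) - y_i)^2 over linear predictors f(x) = w^T x.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # n = 100 examples, d = 5 features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

w, *_ = np.linalg.lstsq(X, y, rcond=None)          # empirical risk minimizer for squared loss
emp_risk = np.mean((X @ w - y) ** 2)               # \hat{R}(f) evaluated at the minimizer
print(emp_risk)
```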
Unsupervised learning

Now we only receive $(x_i)_{i=1}^n$, and the goal is...?

◮ Encoding data in some compact representation (and decoding it).
◮ Data analysis; recovering "hidden structure" in data (e.g., recovering cliques or clusters).
◮ Features for supervised learning.
◮ ...?

The task is less clear-cut. In 2019 we still have people trying to formalize it!
1. PCA (Principal Component Analysis)
PCA motivation

Let's formulate a simplistic linear unsupervised method.

◮ Encoding (and decoding) data in some compact representation.
  Let's linearly map data in $\mathbb{R}^d$ to $\mathbb{R}^k$ and back.
◮ Data analysis; recovering "hidden structure" in data.
  Let's find out whether the data mostly lies on a low-dimensional subspace.
◮ Features for supervised learning.
  Let's feed the $\mathbb{R}^k$ encoding to supervised methods.
SVD reminder

1. SV triples: $(s, u, v)$ satisfies $Mv = su$ and $M^T u = sv$.
2. Thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^T$.
3. Full SVD: $M = USV^T$.
4. "Operational" view of the SVD: for $M \in \mathbb{R}^{n \times d}$,
$$
M =
\begin{bmatrix} u_1 & \cdots & u_r & u_{r+1} & \cdots & u_n \end{bmatrix}
\begin{bmatrix} s_1 & & & 0 \\ & \ddots & & \\ & & s_r & \\ 0 & & & 0 \end{bmatrix}
\begin{bmatrix} v_1 & \cdots & v_r & v_{r+1} & \cdots & v_d \end{bmatrix}^T .
$$

The first blocks of columns of $U$ and $V$ span the column space and row space (respectively); the remaining columns span the left and right nullspaces (respectively).

New: let $(U_k, S_k, V_k)$ denote the truncated SVD, with $U_k \in \mathbb{R}^{n \times k}$ (the first $k$ columns of $U$), and similarly for the others.
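As a quick illustration (not part of the original slides), a small numpy sketch of the three SVD forms above; the test matrix and tolerance are arbitrary.

```python
# Sketch: full SVD, thin SVD as a sum of rank-one terms, and the truncated pieces.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=True)    # full: U is 6x6, Vt is 4x4
r = int(np.sum(s > 1e-12))                         # numerical rank

# Thin SVD: M = sum_{i=1}^r s_i u_i v_i^T
M_thin = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
assert np.allclose(M, M_thin)

# Truncated SVD: first k columns of U and V, top-left k x k block of S.
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
print(U_k.shape, S_k.shape, V_k.shape)             # (6, 2) (2, 2) (4, 2)
```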
PCA (Principal Component Analysis)

Input: data as the rows of $X \in \mathbb{R}^{n \times d}$ with SVD $X = USV^T$, and an integer $k$.
Output: encoder $V_k$, decoder $V_k^T$, encoded data $XV_k = U_k S_k$.

The goal in unsupervised learning is unclear. We'll define it here as the "best encoding/decoding in the Frobenius sense":
$$
\min_{\substack{E \in \mathbb{R}^{d \times k} \\ D \in \mathbb{R}^{k \times d}}} \left\| X - XED \right\|_F^2
= \left\| X - X V_k V_k^T \right\|_F^2 .
$$

Note that $V_k V_k^T$ performs an orthogonal projection onto the subspace spanned by the columns of $V_k$; thus we are finding the "best $k$-dimensional projection of the data".
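A minimal sketch of PCA in exactly these conventions (data as rows, encoder $V_k$, decoder $V_k^T$), written against numpy as an illustration; the helper name `pca` and the synthetic data are my own.

```python
# PCA via the SVD: encode with V_k, decode with V_k^T.
import numpy as np

def pca(X, k):
    """Return the encoder V_k, the encoded data X V_k, and the decoded (projected) data."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k, :].T                  # encoder: d x k
    encoded = X @ V_k                  # equals U_k S_k, an n x k representation
    decoded = encoded @ V_k.T          # X V_k V_k^T: best rank-k projection of the rows
    return V_k, encoded, decoded

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
V_k, Z, X_hat = pca(X, k=3)
print(Z.shape, np.linalg.norm(X - X_hat))   # (50, 3) and the Frobenius reconstruction error
```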
PCA properties

Theorem. Let $X \in \mathbb{R}^{n \times d}$ with SVD $X = USV^T$ and an integer $k \le r$ be given. Then
$$
\min_{\substack{E \in \mathbb{R}^{d \times k} \\ D \in \mathbb{R}^{k \times d}}} \left\| X - XED \right\|_F^2
= \min_{\substack{D \in \mathbb{R}^{d \times k} \\ D^T D = I}} \left\| X - XDD^T \right\|_F^2
= \left\| X - X V_k V_k^T \right\|_F^2
= \sum_{i=k+1}^r s_i^2 .
$$
Additionally,
$$
\min_{\substack{D \in \mathbb{R}^{d \times k} \\ D^T D = I}} \left\| X - XDD^T \right\|_F^2
= \|X\|_F^2 - \max_{\substack{D \in \mathbb{R}^{d \times k} \\ D^T D = I}} \|XD\|_F^2
= \|X\|_F^2 - \|X V_k\|_F^2
= \|X\|_F^2 - \sum_{i=1}^k s_i^2 .
$$

Remark 1. The SVD is not unique, but the squared singular values $s_1^2, \ldots, s_r^2$ are identical across SVD choices, so the quantities above are well-defined.
Remark 2. As written, this is not a convex optimization problem!
Remark 3. The second form is interesting...
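A small numerical check of the theorem (an illustrative sketch, not from the slides): the rank-$k$ projection error should match the tail sum of squared singular values, and the second form should agree with it.

```python
# Check: ||X - X V_k V_k^T||_F^2 == sum_{i>k} s_i^2, and the "second form" matches.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

for k in range(1, 8):
    V_k = Vt[:k, :].T
    err = np.linalg.norm(X - X @ V_k @ V_k.T) ** 2        # ||X - X V_k V_k^T||_F^2
    tail = np.sum(s[k:] ** 2)                             # sum_{i=k+1}^r s_i^2
    assert np.isclose(err, tail)

# Second form: ||X||_F^2 - ||X V_k||_F^2 gives the same value.
k = 3
V_k = Vt[:k, :].T
print(np.linalg.norm(X) ** 2 - np.linalg.norm(X @ V_k) ** 2, np.sum(s[k:] ** 2))
```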
Centered PCA

Some treatments replace $X$ with $X - \mathbf{1}\mu^T$, where $\mu = \frac{1}{n} \sum_{i=1}^n x_i$ is the mean of the data.
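A short sketch of this centered variant (assumed conventions, not spelled out on the slide): subtract $\mu$ before taking the SVD, and add it back when decoding, so the principal directions describe variation around the mean rather than around the origin.

```python
# Centered PCA: run PCA on X - 1 mu^T, decode by adding the mean back.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6)) + 5.0          # data with a large nonzero mean
mu = X.mean(axis=0)                         # mu = (1/n) sum_i x_i
Xc = X - mu                                 # X - 1 mu^T

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:2, :].T
encoded = Xc @ V_k                          # encode the centered data
decoded = encoded @ V_k.T + mu              # decode, then add the mean back
print(np.linalg.norm(X - decoded))          # reconstruction error of the centered variant
```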