COMPSCI 514: Algorithms for Data Science
Cameron Musco, University of Massachusetts Amherst. Fall 2020. Lecture 15.

  1. COMPSCI 514: Algorithms for Data Science. Cameron Musco, University of Massachusetts Amherst. Fall 2020. Lecture 15.

  2. Logistics
     • Problem Set 3 is due next Friday 10/23, 8pm.
     • This week's quiz is due Monday at 8pm.
     • Problem set grades seem to be strongly correlated with whether people are working in groups. So if you don't have a group, I encourage you to join one. There are multiple people looking, so post on Piazza to find one.

  3. Summary
     Last Class: Low-Rank Approximation
     • When data lies in a $k$-dimensional subspace $\mathcal{V}$, we can perfectly embed it into $k$ dimensions using an orthonormal span $V \in \mathbb{R}^{d \times k}$.
     • When data lies close to $\mathcal{V}$, the optimal approximation with rows in $\mathcal{V}$ is given by projecting onto that space: $XVV^T = \arg\min_{B \text{ with rows in } \mathcal{V}} \|X - B\|_F^2$.
     This Class: Finding $V$ via eigendecomposition.
     • How do we find the best low-dimensional subspace to approximate $X$?
     • PCA and its connection to eigendecomposition.

  4. Basic Set Up
     Reminder of set up: Assume that $\vec{x}_1, \ldots, \vec{x}_n$ lie close to a $k$-dimensional subspace $\mathcal{V}$ of $\mathbb{R}^d$. Let $X \in \mathbb{R}^{n \times d}$ be the data matrix.
     Let $\vec{v}_1, \ldots, \vec{v}_k$ be an orthonormal basis for $\mathcal{V}$ and $V \in \mathbb{R}^{d \times k}$ be the matrix with these vectors as its columns.
     • $VV^T \in \mathbb{R}^{d \times d}$ is the projection matrix onto $\mathcal{V}$.
     • $X \approx X(VV^T)$ gives the closest approximation to $X$ with rows in $\mathcal{V}$.
     Notation: $\vec{x}_1, \ldots, \vec{x}_n \in \mathbb{R}^d$: data points; $X \in \mathbb{R}^{n \times d}$: data matrix; $\vec{v}_1, \ldots, \vec{v}_k \in \mathbb{R}^d$: orthonormal basis for subspace $\mathcal{V}$; $V \in \mathbb{R}^{d \times k}$: matrix with columns $\vec{v}_1, \ldots, \vec{v}_k$.
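To make the projection concrete, here is a minimal numpy sketch (the synthetic data, dimensions, and variable names are illustrative assumptions, not from the slides): it builds an orthonormal $V$, forms the projection matrix $VV^T$, and projects the rows of $X$ onto the subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 3

# Synthetic data lying close to a k-dimensional subspace, plus small noise.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) + 0.01 * rng.standard_normal((n, d))

# An orthonormal basis V in R^{d x k} (here just an arbitrary one from a QR factorization).
V, _ = np.linalg.qr(rng.standard_normal((d, k)))

P = V @ V.T                       # d x d projection matrix onto the span of V's columns
X_proj = X @ P                    # each row of X projected onto the subspace
print(np.allclose(P @ P, P))      # a projection matrix is idempotent: PP = P
```

In practice one would compute $(XV)V^T$ rather than materializing the $d \times d$ matrix $VV^T$; the explicit $P$ above is only for illustration.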

  5. Dual View of Low-Rank Approximation

  6. Best Fit Subspace
     If $\vec{x}_1, \ldots, \vec{x}_n$ are close to a $k$-dimensional subspace $\mathcal{V}$ with orthonormal basis $V \in \mathbb{R}^{d \times k}$, the data matrix can be approximated as $XVV^T$. $XV$ gives the optimal embedding of $X$ in $\mathcal{V}$.
     How do we find $\mathcal{V}$ (equivalently $V$)?
     $$\arg\min_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|X - XVV^T\|_F^2 = \arg\max_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|XV\|_F^2.$$
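A quick numerical check of this equivalence, as a sketch assuming numpy (the random data and arbitrary orthonormal $V$ are stand-ins): since $XVV^T$ is an orthogonal projection of the rows of $X$, the Pythagorean theorem gives $\|X - XVV^T\|_F^2 = \|X\|_F^2 - \|XV\|_F^2$, so minimizing the error is the same as maximizing $\|XV\|_F^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 5
X = rng.standard_normal((n, d))
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # an arbitrary orthonormal V

lhs = np.linalg.norm(X - X @ V @ V.T, "fro") ** 2
rhs = np.linalg.norm(X, "fro") ** 2 - np.linalg.norm(X @ V, "fro") ** 2
print(np.isclose(lhs, rhs))                        # True for any orthonormal V
```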

  7. Solution via Eigendecomposition
     The orthonormal $V \in \mathbb{R}^{d \times k}$ minimizing $\|X - XVV^T\|_F^2$ is given by
     $$\arg\max_{\text{orthonormal } V \in \mathbb{R}^{d \times k}} \|XV\|_F^2, \qquad \text{where } \|XV\|_F^2 = \sum_{i=1}^{n} \|V^T \vec{x}_i\|_2^2 = \sum_{j=1}^{k} \|X\vec{v}_j\|_2^2.$$
     Surprisingly, we can find the columns of $V$, $\vec{v}_1, \ldots, \vec{v}_k$, greedily:
     $$\vec{v}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \|X\vec{v}\|_2^2 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T X^T X \vec{v},$$
     $$\vec{v}_2 = \arg\max_{\|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_1 \rangle = 0} \vec{v}^T X^T X \vec{v}, \quad \ldots, \quad \vec{v}_k = \arg\max_{\|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_j \rangle = 0\ \forall j < k} \vec{v}^T X^T X \vec{v}.$$
     These are exactly the top $k$ eigenvectors of $X^T X$.
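A sketch of this solution in numpy (illustrative, not the course's reference code): rather than running the greedy optimization directly, it takes the top $k$ eigenvectors of $X^T X$, which the slide states are exactly the greedy maximizers, and checks that they beat an arbitrary orthonormal $V$ in the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 5
X = rng.standard_normal((n, d))

# eigh returns eigenvalues of the symmetric matrix X^T X in ascending order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
V_k = eigvecs[:, ::-1][:, :k]                 # columns: top-k eigenvectors, largest eigenvalue first

# V_k should beat (or tie) any other orthonormal V in the objective ||XV||_F^2.
V_rand, _ = np.linalg.qr(rng.standard_normal((d, k)))
print(np.linalg.norm(X @ V_k, "fro") ** 2 >= np.linalg.norm(X @ V_rand, "fro") ** 2)   # True
```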

  8. Review of Eigenvectors and Eigendecomposition
     Eigenvector: $\vec{x} \in \mathbb{R}^d$ is an eigenvector of a matrix $A \in \mathbb{R}^{d \times d}$ if $A\vec{x} = \lambda\vec{x}$ for some scalar $\lambda$ (the eigenvalue corresponding to $\vec{x}$).
     • That is, $A$ just 'stretches' $\vec{x}$.
     • If $A$ is symmetric, we can find $d$ orthonormal eigenvectors $\vec{v}_1, \ldots, \vec{v}_d$. Let $V \in \mathbb{R}^{d \times d}$ have these vectors as columns. Then
     $$AV = \begin{bmatrix} A\vec{v}_1 & A\vec{v}_2 & \cdots & A\vec{v}_d \end{bmatrix} = \begin{bmatrix} \lambda_1\vec{v}_1 & \lambda_2\vec{v}_2 & \cdots & \lambda_d\vec{v}_d \end{bmatrix} = V\Lambda.$$
     This yields the eigendecomposition: $AVV^T = A = V\Lambda V^T$.
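A small numpy illustration of this review (the example matrix is an assumption, not from the slides): `np.linalg.eigh` returns orthonormal eigenvectors of a symmetric matrix, and we can check $AV = V\Lambda$ and $A = V\Lambda V^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B + B.T                                    # a symmetric matrix

lam, V = np.linalg.eigh(A)                     # eigenvalues lam, orthonormal eigenvectors as columns of V
Lam = np.diag(lam)

print(np.allclose(A @ V, V @ Lam))             # A V = V Lambda
print(np.allclose(V.T @ V, np.eye(6)))         # columns of V are orthonormal
print(np.allclose(A, V @ Lam @ V.T))           # eigendecomposition A = V Lambda V^T
```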

  9. Review of Eigenvectors and Eigendecomposition
     We typically order the eigenvectors in decreasing order of their eigenvalues: $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$.

  10. Courant-Fischer Principle
     Courant-Fischer Principle: For symmetric $A$, the eigenvectors are given via the greedy optimization:
     $$\vec{v}_1 = \arg\max_{\|\vec{v}\|_2 = 1} \vec{v}^T A \vec{v}, \quad \vec{v}_2 = \arg\max_{\|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_1 \rangle = 0} \vec{v}^T A \vec{v}, \quad \ldots, \quad \vec{v}_d = \arg\max_{\|\vec{v}\|_2 = 1,\ \langle \vec{v}, \vec{v}_j \rangle = 0\ \forall j < d} \vec{v}^T A \vec{v}.$$
     • $\vec{v}_j^T A \vec{v}_j = \lambda_j \cdot \vec{v}_j^T \vec{v}_j = \lambda_j$, the $j$th largest eigenvalue.
     • The first $k$ eigenvectors of $X^T X$ (those corresponding to the largest $k$ eigenvalues) are exactly the directions of greatest variance in $X$ that we use for low-rank approximation.
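A numerical illustration of the Courant-Fischer characterization, as a sketch assuming numpy (the test matrix is an arbitrary example): the top eigenvector attains $\vec{v}_1^T A \vec{v}_1 = \lambda_1$, and no unit vector exceeds that value.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 8))
A = B @ B.T                                    # symmetric PSD matrix, like X^T X

lam, V = np.linalg.eigh(A)                     # ascending eigenvalues
v1, lam1 = V[:, -1], lam[-1]                   # top eigenvector and largest eigenvalue
print(np.isclose(v1 @ A @ v1, lam1))           # v_1^T A v_1 = lambda_1

v = rng.standard_normal(8)
v /= np.linalg.norm(v)                         # a random unit vector
print(v @ A @ v <= lam1 + 1e-10)               # the quadratic form never exceeds lambda_1
```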

  11. Low-Rank Approximation via Eigendecomposition

  12. Low-Rank Approximation via Eigendecomposition
     Upshot: Letting $V_k$ have columns $\vec{v}_1, \ldots, \vec{v}_k$ corresponding to the top $k$ eigenvectors of the covariance matrix $X^T X$, $V_k$ is the orthonormal basis minimizing $\|X - XV_kV_k^T\|_F^2$.
     This is principal component analysis (PCA).
     How accurate is this low-rank approximation? We can understand the error using the eigenvalues of $X^T X$.
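A minimal PCA-style low-rank approximation sketch in numpy (the synthetic data and names are illustrative assumptions): compute the top $k$ eigenvectors of $X^T X$, compress to $C = XV_k$, and reconstruct $XV_kV_k^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 30, 4
# Synthetic data that is approximately rank k, plus small noise.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) + 0.05 * rng.standard_normal((n, d))

lam, V = np.linalg.eigh(X.T @ X)
V_k = V[:, ::-1][:, :k]                        # top-k eigenvectors of X^T X (the principal components)

C = X @ V_k                                    # compressed dataset in R^{n x k}
X_approx = C @ V_k.T                           # rank-k approximation X V_k V_k^T
print(np.linalg.norm(X - X_approx, "fro") / np.linalg.norm(X, "fro"))   # small relative error
```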

  13. Spectrum Analysis
     Let $\vec{v}_1, \ldots, \vec{v}_k$ be the top $k$ eigenvectors of $X^T X$ (the top $k$ principal components). The approximation error is:
     $$\|X - XV_kV_k^T\|_F^2 = \|X\|_F^2 - \|XV_kV_k^T\|_F^2 = \operatorname{tr}(X^T X) - \operatorname{tr}(V_k^T X^T X V_k)$$
     $$= \sum_{i=1}^{d} \lambda_i(X^T X) - \sum_{i=1}^{k} \vec{v}_i^T X^T X \vec{v}_i = \sum_{i=1}^{d} \lambda_i(X^T X) - \sum_{i=1}^{k} \lambda_i(X^T X) = \sum_{i=k+1}^{d} \lambda_i(X^T X).$$
     • For any matrix $A$ with columns $\vec{a}_1, \ldots, \vec{a}_d$: $\|A\|_F^2 = \sum_{i=1}^{d} \|\vec{a}_i\|_2^2 = \operatorname{tr}(A^T A)$ (the trace, the sum of the diagonal entries, equals the sum of the eigenvalues).
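A numerical check of this error formula, assuming numpy (illustrative data): the rank-$k$ projection error should match the sum of the bottom $d - k$ eigenvalues of $X^T X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 15))
k = 6

lam, V = np.linalg.eigh(X.T @ X)      # eigenvalues in ascending order
V_k = V[:, ::-1][:, :k]               # top-k eigenvectors of X^T X

err = np.linalg.norm(X - X @ V_k @ V_k.T, "fro") ** 2
tail = lam[:-k].sum()                 # sum of the d - k smallest eigenvalues
print(np.isclose(err, tail))          # True (up to floating point error)
```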

  14. Spectrum Analysis
     Claim: The error in approximating $X$ with the best rank-$k$ approximation (projecting onto the top $k$ eigenvectors of $X^T X$) is:
     $$\|X - XV_kV_k^T\|_F^2 = \sum_{i=k+1}^{d} \lambda_i(X^T X).$$

  15. Spectrum Analysis
     Plotting the spectrum of the covariance matrix $X^T X$ (its eigenvalues) shows how compressible $X$ is using low-rank approximation (i.e., how close $\vec{x}_1, \ldots, \vec{x}_n$ are to a low-dimensional subspace).
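A sketch of such a spectrum plot using numpy and matplotlib (synthetic data with a planted rank-5 structure; all names and parameters are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, d, k_true = 500, 40, 5
# Data with a planted rank-5 structure plus noise, so the spectrum shows a visible drop.
X = rng.standard_normal((n, k_true)) @ rng.standard_normal((k_true, d)) + 0.1 * rng.standard_normal((n, d))

lam = np.linalg.eigvalsh(X.T @ X)[::-1]        # eigenvalues of X^T X, largest first
plt.plot(np.arange(1, d + 1), lam, marker="o")
plt.xlabel("eigenvalue index i")
plt.ylabel("lambda_i(X^T X)")
plt.title("Spectrum of the covariance matrix")
plt.show()
```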

  16. Spectrum Analysis
     Exercises:
     1. Show that the eigenvalues of $X^T X$ are always nonnegative. Hint: Use that $\lambda_j = \vec{v}_j^T X^T X \vec{v}_j$.
     2. Show that for symmetric $A$, the trace is the sum of the eigenvalues: $\operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i(A)$.
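Not a proof, but a quick numerical sanity check of both exercise claims in numpy (random example matrices, chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Claim 1: eigenvalues of X^T X are nonnegative.
X = rng.standard_normal((50, 10))
lam = np.linalg.eigvalsh(X.T @ X)
print((lam >= -1e-10).all())                   # True up to floating point error

# Claim 2: for symmetric A, tr(A) equals the sum of the eigenvalues.
B = rng.standard_normal((10, 10))
A = B + B.T
print(np.isclose(np.trace(A), np.linalg.eigvalsh(A).sum()))   # True
```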

  17. Summary
     • Many (most) datasets can be approximated via projection onto a low-dimensional subspace.
     • We find this subspace via a maximization problem: $\max_{\text{orthonormal } V} \|XV\|_F^2$.
     • Greedy solution via eigendecomposition of $X^T X$.
     • The columns of $V$ are the top eigenvectors of $X^T X$.
     • The error of the best low-rank approximation (the compressibility of the data) is determined by the tail of $X^T X$'s eigenvalue spectrum.

  18. Interpretation in Terms of Correlation
     Recall: Low-rank approximation is possible when our data features are correlated.
     Our compressed dataset is $C = XV_k$, where the columns of $V_k$ are the top $k$ eigenvectors of $X^T X$.
     What is the covariance of $C$?
     $$C^T C = V_k^T X^T X V_k = V_k^T V \Lambda V^T V_k = \Lambda_k.$$
     The covariance becomes diagonal, i.e., all correlations have been removed: maximal compression.
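A numerical check, assuming numpy (illustrative data), that the compressed features are uncorrelated: the covariance $C^T C$ should be the diagonal matrix $\Lambda_k$ of the top $k$ eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 12))
k = 4

lam, V = np.linalg.eigh(X.T @ X)               # ascending eigenvalues, orthonormal eigenvectors
V_k = V[:, ::-1][:, :k]                        # top-k eigenvectors of X^T X

C = X @ V_k                                    # compressed dataset
cov_C = C.T @ C                                # covariance of the compressed features
Lam_k = np.diag(lam[::-1][:k])                 # diagonal matrix of the top-k eigenvalues
print(np.allclose(cov_C, Lam_k))               # C^T C = Lambda_k: off-diagonal correlations vanish
```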
