  1. Topics in Algorithms and Data Science Singular Value Decomposition (SVD) Omid Etesami

  2. The problem of the best-fit subspace

  3. Best-fit subspace: n points in d-dimensional Euclidean space are given. The best-fit subspace of dimension k minimizes the sum of squared distances from the points to the subspace.

  4. Centering data

  5. Centering data: For the best-fit affine subspace, translate so that the center of mass of the points lies at the origin. Then find the best-fit linear subspace.

  6. Why does centering work? Lemma. The best-fit affine subspace of dimension k for the points a_1, …, a_n passes through their center of mass.

  7. Proof of lemma: W.l.o.g. assume the center of mass is 0, i.e. a_1 + … + a_n = 0. Let a be the projection of 0 onto the affine subspace l, so l = a + S where S is a linear subspace and a is orthogonal to S. Then dist(a_i, l)² = dist(a_i, S)² − 2⟨a_i, a⟩ + |a|², and when we sum over i the cross terms vanish (since Σ a_i = 0), leaving Σ dist(a_i, S)² + n|a|². This is minimized for a = 0, i.e. when l passes through the center of mass.

  8. The greedy approach to the best subspace yields the singular vectors

  9. The greedy approach to finding the best k-dimensional subspace:
        S_0 = {0}
        for i = 1 to k do
            S_i = best-fit i-dimensional subspace among those that contain S_{i-1}

  10. Best-fit line: Instead of minimizing the sum of squared distances we maximize the sum of squared lengths of the projections onto the line. The two are equivalent: by the Pythagorean theorem, for each point the squared distance to the line plus the squared projection length equals |a_i|², which is fixed.

  11. 1st singular vector and value: Let A be the n×d matrix whose rows are the data points. For a unit vector v, |⟨a_i, v⟩| is the length of the projection of a_i onto v, so the sum of squared projection lengths is |Av|². The 1st right singular vector is v_1 = argmax_{|v|=1} |Av|, and the 1st singular value is σ_1 = |Av_1|.
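
A quick numerical check of this characterization (a minimal sketch, assuming NumPy; the matrix is random illustrative data, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 8))          # n = 50 points in d = 8 dimensions (rows of A)

    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    v1 = Vt[0]                                # first right singular vector

    # |A v1| equals the first singular value ...
    assert np.isclose(np.linalg.norm(A @ v1), s[0])

    # ... and no random unit vector does better.
    for _ in range(1000):
        v = rng.standard_normal(8)
        v /= np.linalg.norm(v)
        assert np.linalg.norm(A @ v) <= s[0] + 1e-9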

  12. Other singular vectors: {v_1, …, v_i} is an orthonormal basis for S_i. This is because, for any i-dimensional subspace containing S_{i-1}, the sum of squared projection lengths onto it equals the sum of squared projection lengths onto S_{i-1} plus |Av|², where v is a unit vector in the subspace orthogonal to S_{i-1}; so the greedy step picks v_i = argmax_{|v|=1, v ⊥ S_{i-1}} |Av|. Here v_i is the i'th right singular vector and σ_i(A) = |Av_i| is the i'th singular value.
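
The greedy process of slides 9 and 12 can be sketched in code by repeatedly taking the best direction and deflating it away. This only illustrates the greedy structure (the inner maximization is delegated to np.linalg.svd rather than solved from scratch); NumPy and random data are assumed:

    import numpy as np

    def greedy_right_singular_vectors(A, k):
        B = A.astype(float).copy()
        vs = []
        for _ in range(k):
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            v = Vt[0]                       # best direction for the deflated data
            vs.append(v)
            B = B - np.outer(B @ v, v)      # remove the component along v
        return np.array(vs)

    rng = np.random.default_rng(1)
    A = rng.standard_normal((30, 6))
    V_greedy = greedy_right_singular_vectors(A, 3)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Same directions (up to sign) as the top right singular vectors of A.
    print(np.abs(V_greedy @ Vt[:3].T).round(3))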

  13. Why does greedy work? Proof by induction on k: Consider any k-dimensional subspace T_k. Since S_{k-1} has dimension k − 1, T_k contains a unit vector w_k orthogonal to S_{k-1}. Write T_k = T_{k-1} + span(w_k), with T_{k-1} orthogonal to w_k. Then the sum of squared lengths of projections onto T_{k-1} is at most that onto S_{k-1} (induction hypothesis), and |Aw_k| ≤ |Av_k| because v_k maximizes |Av| over unit vectors orthogonal to S_{k-1}. Adding the two bounds: sum of squared projections onto T_k = (sum onto T_{k-1}) + |Aw_k|² ≤ (sum onto S_{k-1}) + |Av_k|² = sum onto S_k.

  14. Singular values: We only consider nonzero singular values, i.e. the i'th singular value is defined only for 1 ≤ i ≤ r = rank(A). Lemma. The sum of squared lengths of the rows of A equals the sum of squares of the singular values: Σ_i |a_i|² = Σ_{j=1}^r σ_j²(A) = ‖A‖_F².

  15. Singular Value Decomposition

  16. Singular Value Decomposition (SVD): A = U D V^T, where the left singular vectors are u_i = A v_i / σ_i, D is diagonal with diagonal entries σ_i, U has columns u_i, and V has columns v_i.
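
A minimal sketch checking the factorization A = U D V^T and the relation u_i = A v_i / σ_i, assuming NumPy; the matrix is random illustrative data:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((20, 5))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    D = np.diag(s)

    assert np.allclose(A, U @ D @ Vt)                  # A = U D V^T
    for i in range(len(s)):
        assert np.allclose(U[:, i], A @ Vt[i] / s[i])  # u_i = A v_i / sigma_i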

  17. Thm. The left singular vectors u_1, …, u_r are pairwise orthogonal.

  18. Uniqueness of singular values/vectors:
        • The sequence of singular values is unique and non-increasing.
        • The singular vectors corresponding to a particular singular value σ are any orthonormal basis of a unique subspace associated with σ. So individual singular vectors are not unique: when σ_1 = σ_2, any rotation of v_1, v_2 within their plane, or a sign flip such as v'_3 = −v_3, gives valid singular vectors.

  19. Best rank-k approximation: Frobenius and spectral norms

  20. Rank-k approximation: A_k = σ_1 u_1 v_1^T + … + σ_k u_k v_k^T is the best rank-k approximation to A under the Frobenius norm.
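
A sketch of forming A_k from the top k singular triples, assuming NumPy and random illustrative data; it also checks the standard fact (not stated explicitly on the slide) that the Frobenius error of A_k is (σ_{k+1}² + … + σ_r²)^{1/2}:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((40, 10))
    k = 3

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]               # sum of the top k terms sigma_i u_i v_i^T

    err = np.linalg.norm(A - A_k, ord='fro')
    assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))    # Frobenius error = sqrt(sum of tail sigma_i^2)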

  21. Representing documents as vectors: n×d term-document matrix (rows = documents, columns = terms):

                I   like  football  John  likes  basketball
        Doc 1   1    1       2        0      0       0
        Doc 2   0    0       2        1      1       0
        Doc 3   0    0       0        1      1       1

  22. Answering queries:
        • Each query is a d-dimensional vector x giving the importance of each term.
        • Answer = similarity (dot product) of x with each document = Ax.
        • O(nd) time to process each query.

  23. SVD as preprocessing: When there are many queries, we preprocess A to compute u_1, …, u_k, v_1, …, v_k and the σ_i (i.e. A_k). Each query can then be answered as A_k x in O(kd + kn) time, which is good when k << d, n.
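
A sketch of this O(kd + kn) query scheme on the toy term-document matrix from slide 21, assuming NumPy; the query vector below is an invented example:

    import numpy as np

    A = np.array([[1, 1, 2, 0, 0, 0],
                  [0, 0, 2, 1, 1, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)   # 3 docs x 6 terms

    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k]              # preprocessing: keep top-k triples

    x = np.array([0, 0, 1, 0, 0, 0], dtype=float)     # query: the term "football"
    answer = Uk @ (sk * (Vk @ x))                     # O(kd) + O(k) + O(kn) work, never forming A_k
    print(answer)                                     # similarity of each document to the query

    assert np.allclose(answer, (Uk @ np.diag(sk) @ Vk) @ x)   # agrees with A_k x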

  24. Spectral norm

  25. Spectral norm of the error of A_k: the spectral norm of a matrix M is ‖M‖_2 = σ_1(M) = max_{|x|=1} |Mx|. For the error of the rank-k approximation, ‖A − A_k‖_2 = σ_{k+1}(A) (taking σ_{k+1} = 0 when k ≥ r).

  26. The best rank-k approximation according to the spectral norm is also A_k. (Figure: rank-4 approximation of an adjacency matrix.)

  27. Connection of SVD with eigenvalues

  28. Singular values and eigenvalues:
        • Let B = A^T A. Then B = V D² V^T, so the eigenvalues of B are the squares of the singular values of A, and the eigenvectors of B are the right singular vectors of A.
        • If A is symmetric, the absolute values of the eigenvalues of A are the singular values of A, and the eigenvectors of A are right singular vectors of A.
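
A numerical check of the first bullet, assuming NumPy and random illustrative data:

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((25, 6))
    B = A.T @ A

    eigvals, eigvecs = np.linalg.eigh(B)                 # eigenvalues in ascending order
    _, s, Vt = np.linalg.svd(A)

    assert np.allclose(eigvals[::-1], s ** 2)            # eigenvalues of B = sigma_i^2
    # The top eigenvector of B matches the first right singular vector (up to sign).
    assert np.isclose(abs(eigvecs[:, -1] @ Vt[0]), 1.0)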

  29. Analogue of eigenvectors and eigenvalues: A v_i = σ_i u_i and A^T u_i = σ_i v_i.

  30. Computing SVD

  31. Computing the SVD by the power method: let B = A^T A. If σ_1 > σ_2, then B^k = Σ_i σ_i^{2k} v_i v_i^T ≈ σ_1^{2k} v_1 v_1^T for large k. Estimate of v_1 = a normalized column of B^k.

  32. Inefficiency of the previous method:
        • Matrix-matrix multiplication is expensive.
        • We cannot exploit the potential sparsity of A. E.g. A may be 10^8 × 10^8 but be represented by, say, its 10^9 nonzero entries; B may have 10^16 nonzero entries, too big to even write down.

  33. Faster power method: use matrix-vector multiplication instead of matrix-matrix multiplication. Algorithm (sketched in code below):
        • Choose a random vector x.
        • Compute B^k x = A^T (A (A^T (A ⋯ (A^T (A x))))), i.e. 2k matrix-vector products.
        • Output B^k x, normalized, as the estimate of v_1.
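
A minimal sketch of this faster power method, assuming NumPy; only matrix-vector products with A and A^T are used, and the iteration count and data are illustrative choices:

    import numpy as np

    def power_method_v1(A, k=50, rng=np.random.default_rng(5)):
        d = A.shape[1]
        x = rng.standard_normal(d)                # random start vector
        for _ in range(k):
            x = A.T @ (A @ x)                     # one application of B = A^T A
            x /= np.linalg.norm(x)                # normalize to avoid overflow
        return x                                  # estimate of v_1

    A = np.random.default_rng(6).standard_normal((100, 20))
    v1_est = power_method_v1(A)
    _, _, Vt = np.linalg.svd(A)
    print(abs(v1_est @ Vt[0]))                    # close to 1 when sigma_1 > sigma_2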

  34. Component of a random unit vector along the 1st singular vector v_1.
        Lemma. For a uniformly random unit vector x, Pr[ |⟨x, v_1⟩| ≤ 1/(20 d^{1/2}) ] ≤ 1/10 + exp(−Θ(d)).
        Proof:
        • Write x = y / |y|, where y is a spherical Gaussian with unit variance in each coordinate.
        • Pr[ |⟨y, v_1⟩| ≤ 1/10 ] ≤ 1/10, since ⟨y, v_1⟩ is a standard normal.
        • Pr[ |y| ≥ 2 d^{1/2} ] ≤ exp(−Θ(d)) by the Gaussian annulus theorem.
        • If neither event occurs, |⟨x, v_1⟩| ≥ (1/10) / (2 d^{1/2}) = 1/(20 d^{1/2}).

  35. Analysis of the power method:
        • Let V = span of the right singular vectors with singular values ≥ (1 − ε) σ_1.
        • Assume |⟨x, v_1⟩| ≥ δ.
        • Then |B^k x| ≥ σ_1^{2k} δ.
        • The component of B^k x perpendicular to V has length at most [(1 − ε) σ_1]^{2k}.
        • Hence, after normalizing B^k x, the component perpendicular to V has length at most (1 − ε)^{2k} / δ, which becomes small once k = Ω(ln(1/(εδ)) / ε).

  36. Traditional application of SVD: Principal Component Analysis

  37. Movie recommendation: n customers, d movies; n×d matrix A where a_ij = rating of user i for movie j.

  38. Principal Component Analysis (PCA):
        • Assume there are k underlying factors, e.g. "amount of comedy", "novelty of story", …
        • Each movie = a k-dimensional vector.
        • Each user = a k-dimensional vector representing the importance of each factor to the user.
        • Rating = dot product ⟨movie, user⟩.
        • A_k, the best rank-k approximation to A, yields U and V.
        • A − UV is treated as noise.
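
A sketch of this PCA model, assuming NumPy; the ratings below are synthetic (a random low-rank matrix plus noise), and the variable names are illustrative:

    import numpy as np

    rng = np.random.default_rng(7)
    n_users, n_movies, k = 100, 40, 3

    # Hypothetical "true" low-rank ratings plus noise.
    A = (rng.standard_normal((n_users, k)) @ rng.standard_normal((k, n_movies))
         + 0.1 * rng.standard_normal((n_users, n_movies)))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    user_factors = U[:, :k] * s[:k]               # each user  -> k-dimensional vector
    movie_factors = Vt[:k].T                      # each movie -> k-dimensional vector

    predicted = user_factors @ movie_factors.T    # = A_k, the rank-k approximation
    print(np.linalg.norm(A - predicted) / np.linalg.norm(A))  # small residual, treated as noise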

  39. Collaborative filtering:
        • A has missing entries: recommend a movie or target an ad based on previous purchases.
        • Assume A = low-rank matrix + noise.
        • One approach: fill the missing values reasonably, e.g. by the average rating, then apply SVD to recover the missing entries.

  40. Application of SVD: clustering a mixture of spherical Gaussians

  41. Clustering:
        • Partition d-dimensional points into k groups.
        • Finding the "best" solution is often NP-hard; thus, assume a stochastic model of the data.

  42. Mixture models: A class of stochastic models are mixture models, e.g. a mixture of spherical Gaussians F = w_1 p_1 + … + w_k p_k, where the p_i are the component densities and the w_i are nonnegative mixing weights summing to 1.

  43. Model fitting problem: Given n i.i.d. samples drawn according to F, fit a mixture of k Gaussians to them. A possible solution: first cluster the points into k clusters, then fit a Gaussian to each cluster (using the empirical mean and variance).

  44. Inter-center distance:
        • If two Gaussian centers are very close, the clustering is unresolvable.
        • If every two Gaussian centers are at least, say, six standard deviations apart, the clustering is unambiguous.

  45. Distance-based clustering:
        • If x, y are independent samples from the same spherical Gaussian with coordinate variance σ², then |x − y|² = 2 (d^{1/2} ± O(1))² σ².
        • If x, y are independent samples from two such Gaussians whose centers are at distance Δ, then |x − y|² = 2 (d^{1/2} ± O(1))² σ² + Δ².
        Thus, to distinguish the two cases we need an inter-center distance Δ = Ω(σ d^{1/4}), since the fluctuation of the first quantity is of order σ² d^{1/2}.
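
A small simulation of these concentration claims, assuming NumPy; d, σ, and Δ are illustrative choices, not values from the slides:

    import numpy as np

    rng = np.random.default_rng(8)
    d, sigma, n = 1000, 1.0, 2000
    Delta = 6 * sigma * d ** 0.25                 # separation of order sigma * d^(1/4)

    mu1 = np.zeros(d)
    mu2 = np.zeros(d); mu2[0] = Delta

    same = rng.normal(mu1, sigma, (n, d)) - rng.normal(mu1, sigma, (n, d))
    diff = rng.normal(mu1, sigma, (n, d)) - rng.normal(mu2, sigma, (n, d))

    print(np.mean(np.sum(same ** 2, axis=1)))     # about 2 d sigma^2
    print(np.mean(np.sum(diff ** 2, axis=1)))     # about 2 d sigma^2 + Delta^2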

  46. Projection onto the subspace spanned by the k centers: If we knew the subspace spanned by the k centers, we could project the points onto it. Inter-center distances would not change, and the projected samples are still spherical Gaussians but now in k dimensions, so a separation of Θ(σ k^{1/4}) is enough.

  47. How to find the subspace spanned by the k centers? Theorem. The best-fit subspace of dimension k for points sampled according to the mixture distribution passes through the centers. Thus, we can find the subspace by running SVD on a large number of sampled points.
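
A sketch of the resulting pipeline (project onto the top-k right singular vectors, then cluster by distance in k dimensions), assuming NumPy; the data, the separation, and the crude Lloyd-style clustering step are illustrative choices, not the slides' algorithm:

    import numpy as np

    rng = np.random.default_rng(9)
    d, sigma, k, n = 200, 1.0, 2, 1000

    centers = np.zeros((k, d))
    centers[1, 0] = 4 * sigma                      # separation below sigma * d^(1/4), above sigma * k^(1/4)
    labels = rng.integers(0, k, n)
    X = centers[labels] + sigma * rng.standard_normal((n, d))

    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Y = X @ Vt[:k].T                               # project onto the best-fit k-dimensional subspace

    # Crude distance-based clustering in the projected space: seed with one point
    # and the point farthest from it, then run a few Lloyd iterations.
    i0 = 0
    i1 = int(np.argmax(((Y - Y[i0]) ** 2).sum(axis=1)))
    seeds = Y[[i0, i1]]
    for _ in range(10):
        assign = np.argmin(((Y[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2), axis=1)
        seeds = np.array([Y[assign == j].mean(axis=0) for j in range(k)])

    agreement = max((assign == labels).mean(), (assign != labels).mean())
    print(agreement)                               # close to 1 when the clusters are recovered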

  48. Why does the best-fit subspace pass through the centers?
        • The best-fit line for a single spherical Gaussian passes through its center μ. Proof: for a unit vector v and a sample point x, E[⟨x, v⟩²] = σ² + ⟨μ, v⟩², which is maximized when v points in the direction of μ.
        • The best-fit dim-k subspace for a single Gaussian is any subspace that passes through the center. Proof: build it greedily; after the first direction (through the center), every remaining orthogonal direction is equally good.
        • A subspace of dimension k passing through the k centers is therefore simultaneously best for all the Gaussians, hence best for the mixture.

  49. Application of SVD: ranking documents and webpages

  50. Ranking documents: Given the documents in a collection, how do we rank them according to their relevance to the collection? Solution: rank according to the length of the projection of each document (row of A) onto the first right singular vector.

                I   like  football  John  likes  basketball
        Doc 1   1    1       2        0      0       0
        Doc 2   0    0       2        1      1       0
        Doc 3   0    0       0        1      1       1
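
A sketch of this ranking rule on the toy term-document matrix, assuming NumPy:

    import numpy as np

    A = np.array([[1, 1, 2, 0, 0, 0],
                  [0, 0, 2, 1, 1, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)   # Doc 1..3 x terms

    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.abs(A @ Vt[0])                        # projection lengths onto v_1
    ranking = np.argsort(-scores)                     # most relevant to the collection first
    print(scores, ranking + 1)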

  51. Ranking webpages:
        • View the web as a directed graph: webpages are vertices, hyperlinks are edges.
        • Authorities: sources of information (pointed to by many hubs).
        • Hubs: pages that identify authorities (pointing to many authorities).
        This looks like a "circular" definition.
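
The deck's text ends here. As a hedged sketch of how this circularity is usually resolved (Kleinberg's HITS iteration, which ties back to the SVD: hub scores converge to the top left singular vector of the adjacency matrix and authority scores to the top right singular vector), assuming NumPy and an invented four-page graph:

    import numpy as np

    # Adjacency matrix: M[i, j] = 1 if page i links to page j (toy example).
    M = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [1, 0, 1, 0]], dtype=float)

    a = np.ones(4)                                # authority scores
    for _ in range(100):
        h = M @ a;  h /= np.linalg.norm(h)        # good hubs point to good authorities
        a = M.T @ h; a /= np.linalg.norm(a)       # good authorities are pointed to by good hubs
    print("hub scores:      ", h.round(3))
    print("authority scores:", a.round(3))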
