compsci 514: algorithms for data science
Cameron Musco
University of Massachusetts Amherst. Spring 2020.
Lecture 20
logistics
• Problem Set 3 is due tomorrow at 8pm. Problem Set 4 will be released very shortly.
• This is the last day of our spectral unit. Then will have 4 classes on optimization before end of semester.
summary
Last Two Classes: Spectral Graph Partitioning
• Focus on separating graphs with small but relatively balanced cuts.
• Connection to second smallest eigenvector of graph Laplacian.
• Provable guarantees for stochastic block model.
• Idealized analysis in class. See slides for full analysis.
This Class: Computing the SVD/eigendecomposition.
• Discuss efficient algorithms for SVD/eigendecomposition.
• Iterative methods: power method, Krylov subspace methods.
• High level: a glimpse into fast methods for linear algebraic computation, which are workhorses behind data science.
efficient eigendecomposition and svd
We have talked about the eigendecomposition and SVD as ways to compress data, to embed entities like words and documents, to compress/cluster non-linearly separable data.
How efficient are these techniques? Can they be run on massive datasets?
computing the svd
Basic Algorithm: To compute the SVD of full-rank A ∈ R^{n×d}, A = UΣV^T:
• Compute A^T A – O(nd^2) runtime.
• Find eigendecomposition A^T A = VΛV^T – O(d^3) runtime.
• Compute L = AV – O(nd^2) runtime. Note that L = UΣ.
• Set σ_i = ∥L_i∥_2 and U_i = L_i / ∥L_i∥_2 – O(nd) runtime.
Total runtime: O(nd^2 + d^3) = O(nd^2) (assuming w.l.o.g. n ≥ d).
• If we have n = 10 million images with 200 × 200 × 3 = 120,000 pixel values each, the runtime is 1.5 × 10^{17} operations!
• The world's fastest supercomputers compute at ≈ 100 petaFLOPS = 10^{17} FLOPS (floating point operations per second).
• This is a relatively easy task for them – but for no one else.
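As a concrete illustration, here is a minimal numpy sketch of this basic algorithm (my own code, not from the slides; it assumes A has full column rank so that no singular value is zero):

```python
import numpy as np

def svd_via_eig(A):
    """Basic O(nd^2) SVD: eigendecompose A^T A, then recover U and the sigma_i."""
    M = A.T @ A                                  # A^T A is d x d -- O(nd^2)
    lam, V = np.linalg.eigh(M)                   # eigendecomposition -- O(d^3), ascending order
    idx = np.argsort(lam)[::-1]                  # reorder so largest singular values come first
    lam, V = lam[idx], V[:, idx]
    L = A @ V                                    # L = AV = U Sigma -- O(nd^2)
    sigma = np.linalg.norm(L, axis=0)            # sigma_i = ||L_i||_2
    U = L / sigma                                # U_i = L_i / ||L_i||_2 (needs full rank)
    return U, sigma, V

# Usage on a small random matrix: reconstructs A up to floating point error.
A = np.random.randn(100, 20)
U, s, V = svd_via_eig(A)
print(np.allclose(A, (U * s) @ V.T))
```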
faster algorithms
To speed up SVD computation we will take advantage of the fact that we typically only care about computing the top (or bottom) k singular vectors of a matrix X ∈ R^{n×d} for k ≪ d.
• Suffices to compute V_k ∈ R^{d×k} and then compute U_k Σ_k = XV_k.
• Use an iterative algorithm to compute an approximation to the top k singular vectors V_k.
• Runtime will be roughly O(ndk) instead of O(nd^2).
Sparse (iterative) vs. direct method: svds vs. svd.
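In SciPy, for example, the direct and iterative routines are `numpy.linalg.svd` and `scipy.sparse.linalg.svds`; a small sketch (the matrix size and k are arbitrary choices for illustration):

```python
import numpy as np
from scipy.sparse.linalg import svds

X = np.random.randn(10_000, 500)
k = 10

# Direct method: full SVD, roughly O(nd^2) work.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Iterative method: only the top k singular triplets, roughly O(ndk) work.
Uk, sk, Vtk = svds(X, k=k)

# svds does not guarantee a particular ordering, so sort descending before comparing.
order = np.argsort(sk)[::-1]
print(np.allclose(s[:k], sk[order]))
```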
power method
Power Method: The most fundamental iterative method for approximate SVD. Applies to computing k = 1 singular vectors, but can easily be generalized to larger k.
Goal: Given X ∈ R^{n×d} with SVD X = UΣV^T, find ⃗z ≈ ⃗v_1.
• Initialize: Choose ⃗z^{(0)} randomly. E.g. ⃗z^{(0)}(i) ∼ N(0, 1).
• For i = 1, . . . , t:
  • ⃗z^{(i)} = (X^T X) · ⃗z^{(i−1)}  (Runtime: 2 · nd)
  • n_i = ∥⃗z^{(i)}∥_2  (Runtime: d)
  • ⃗z^{(i)} = ⃗z^{(i)} / n_i  (Runtime: d)
• Return ⃗z^{(t)}.
Total Runtime: O(ndt).
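A direct numpy translation of this pseudocode (a sketch; the iteration count and the comparison against a full SVD are my own illustrative choices):

```python
import numpy as np

def power_method(X, t=100, seed=0):
    """Approximate the top right singular vector v_1 of X by power iteration on X^T X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.standard_normal(d)              # z^(0)(i) ~ N(0, 1)
    for _ in range(t):
        z = X.T @ (X @ z)                   # z^(i) = (X^T X) z^(i-1), as two O(nd) mat-vecs
        z = z / np.linalg.norm(z)           # normalize: z^(i) = z^(i) / ||z^(i)||_2
    return z

# Usage: compare against the top right singular vector from a full SVD.
X = np.random.randn(1000, 50)
z = power_method(X, t=200)
v1 = np.linalg.svd(X, full_matrices=False)[2][0]
print(min(np.linalg.norm(z - v1), np.linalg.norm(z + v1)))   # small, up to sign ambiguity
```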
power method
Why is it converging towards ⃗v_1?
power method intuition
Write ⃗z^{(0)} in the right singular vector basis:
⃗z^{(0)} = c_1⃗v_1 + c_2⃗v_2 + . . . + c_d⃗v_d.
Update step: ⃗z^{(i)} = X^T X · ⃗z^{(i−1)} = VΣ^2V^T · ⃗z^{(i−1)} (then normalize).
⃗z^{(1)} = VΣ^2V^T · ⃗z^{(0)}, where V^T⃗z^{(0)} = (c_1, . . . , c_d) gives the coefficients in the ⃗v basis and Σ^2V^T⃗z^{(0)} = (σ_1^2 c_1, . . . , σ_d^2 c_d) scales each by its squared singular value.
X ∈ R^{n×d}: input matrix with SVD X = UΣV^T. ⃗v_1: top right singular vector, being computed. ⃗z^{(i)}: iterate at step i, converging to ⃗v_1.
power method intuition
Claim 1: Writing ⃗z^{(0)} = c_1⃗v_1 + c_2⃗v_2 + . . . + c_d⃗v_d,
⃗z^{(1)} = c_1 · σ_1^2 ⃗v_1 + c_2 · σ_2^2 ⃗v_2 + . . . + c_d · σ_d^2 ⃗v_d.
Claim 2: ⃗z^{(2)} = X^T X ⃗z^{(1)} = VΣ^2V^T ⃗z^{(1)}, and in general
⃗z^{(t)} = c_1 · σ_1^{2t} ⃗v_1 + c_2 · σ_2^{2t} ⃗v_2 + . . . + c_d · σ_d^{2t} ⃗v_d
(up to the normalization steps, which only rescale ⃗z^{(i)}).
X ∈ R^{n×d}: input matrix with SVD X = UΣV^T. ⃗v_1: top right singular vector, being computed. ⃗z^{(i)}: iterate at step i, converging to ⃗v_1.
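A quick numerical sanity check of the ⃗z^{(t)} formula (my own illustrative code with arbitrary sizes; normalization is dropped since it only rescales the iterate):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

z = rng.standard_normal(8)
c = Vt @ z                     # coefficients c_i of z^(0) in the right singular vector basis
t = 5
zt = z.copy()
for _ in range(t):
    zt = X.T @ (X @ zt)        # pure (X^T X)^t z^(0), no normalization

# Claim 2: the coefficients of z^(t) are c_i * sigma_i^(2t).
print(np.allclose(Vt @ zt, c * s ** (2 * t)))
```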
power method convergence
⃗z^{(0)} = c_1⃗v_1 + c_2⃗v_2 + . . . + c_d⃗v_d  =⇒  ⃗z^{(t)} = c_1σ_1^{2t}⃗v_1 + c_2σ_2^{2t}⃗v_2 + . . . + c_dσ_d^{2t}⃗v_d
After t iterations, we have 'powered' up the singular values, making the component in the direction of ⃗v_1 much larger, relative to the other components.
[Figure: bar plots of the coefficients c_1, . . . , c_10 at iteration 0 and iteration 1.]
When will convergence be slow?
power method slow convergence
Slow Case: X has singular values: σ_1 = 1, σ_2 = .99, σ_3 = .9, σ_4 = .8, . . .
⃗z^{(0)} = c_1⃗v_1 + c_2⃗v_2 + . . . + c_d⃗v_d  =⇒  ⃗z^{(t)} = c_1σ_1^{2t}⃗v_1 + c_2σ_2^{2t}⃗v_2 + . . . + c_dσ_d^{2t}⃗v_d
[Figure: bar plots of the coefficients c_1, . . . , c_10 at iteration 0 and iteration 1 in the slow case.]
power method convergence rate
⃗z^{(0)} = c_1⃗v_1 + c_2⃗v_2 + . . . + c_d⃗v_d  =⇒  ⃗z^{(t)} = c_1σ_1^{2t}⃗v_1 + c_2σ_2^{2t}⃗v_2 + . . . + c_dσ_d^{2t}⃗v_d
Write σ_2 = (1 − γ)σ_1 for 'gap' γ = (σ_1 − σ_2)/σ_1.
How many iterations t does it take to have σ_2^{2t} ≤ 1/2 · σ_1^{2t}? O(1/γ).
How many iterations t does it take to have σ_2^{2t} ≤ δ · σ_1^{2t}? O(log(1/δ)/γ).
How small must we set δ to ensure that c_1σ_1^{2t} dominates all other components and so ⃗z^{(t)} is very close to ⃗v_1?
X ∈ R^{n×d}: matrix with SVD X = UΣV^T. Singular values σ_1, σ_2, . . . , σ_d. ⃗v_1: top right singular vector, being computed. ⃗z^{(i)}: iterate at step i, converging to ⃗v_1.
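Filling in the standard calculation behind these iteration counts (not spelled out on the slide): since σ_2 = (1 − γ)σ_1,

```latex
\sigma_2^{2t} = (1-\gamma)^{2t}\sigma_1^{2t} \le e^{-2t\gamma}\sigma_1^{2t},
\qquad\text{so } \sigma_2^{2t} \le \delta\cdot\sigma_1^{2t}
\text{ once } t \ge \frac{\ln(1/\delta)}{2\gamma} = O\!\left(\frac{\log(1/\delta)}{\gamma}\right).
```

Taking δ = 1/2 gives the O(1/γ) bound.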
random initialization
Claim: When ⃗z^{(0)} is chosen with random Gaussian entries, writing ⃗z^{(0)} = c_1⃗v_1 + c_2⃗v_2 + . . . + c_d⃗v_d, with very high probability, for all i: O(1/d^2) ≤ |c_i| ≤ O(log d).
Corollary: max_j |c_j| / |c_1| ≤ O(d^2 log d).
X ∈ R^{n×d}: matrix with SVD X = UΣV^T. Singular values σ_1, σ_2, . . . , σ_d. ⃗v_1: top right singular vector, being computed. ⃗z^{(i)}: iterate at step i, converging to ⃗v_1.
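A small experiment illustrating (not proving) the claim: the coefficients of a Gaussian vector in any orthonormal basis are themselves i.i.d. Gaussian, so no coefficient is likely to be extremely large or extremely small. The dimension and the random stand-in basis below are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
# The right singular vectors form some orthonormal basis; any orthonormal basis behaves
# the same for a Gaussian vector, so use a random one as a stand-in.
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
z0 = rng.standard_normal(d)                 # z^(0) with N(0,1) entries
c = V.T @ z0                                # coefficients c_1, ..., c_d (again i.i.d. N(0,1))
print(np.abs(c).max(), np.abs(c).min())     # typically around sqrt(log d) and around 1/d
print(np.abs(c).max() / np.abs(c[0]))       # ratio is polynomial in d with high probability
```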
random initialization
Claim 1: When ⃗z^{(0)} is chosen with random Gaussian entries, writing ⃗z^{(0)} = c_1⃗v_1 + c_2⃗v_2 + . . . + c_d⃗v_d, with very high probability, max_j |c_j| / |c_1| ≤ O(d^2 log d).
Claim 2: For gap γ = (σ_1 − σ_2)/σ_1, after t = O(log(1/δ)/γ) iterations:
⃗z^{(t)} = c_1σ_1^{2t}⃗v_1 + c_2σ_2^{2t}⃗v_2 + . . . + c_dσ_d^{2t}⃗v_d ∝ c_1⃗v_1 + c_2δ⃗v_2 + . . . + c_dδ⃗v_d.
If we set δ = O(ε/(d^3 log d)), by Claim 1 will have:
⃗z^{(t)} ∝ ⃗v_1 + (ε/d)(⃗v_2 + . . . + ⃗v_d).
Gives ∥⃗z^{(t)} − ⃗v_1∥_2 ≤ O(ε).
X ∈ R^{n×d}: matrix with SVD X = UΣV^T. Singular values σ_1, σ_2, . . . , σ_d. ⃗v_1: top right singular vector, being computed. ⃗z^{(i)}: iterate at step i, converging to ⃗v_1.
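Plugging this choice of δ into Claim 2's iteration count (a step worked out here, consistent with the theorem on the next slide):

```latex
t = O\!\left(\frac{\log(1/\delta)}{\gamma}\right)
  = O\!\left(\frac{\log\!\big(d^3 \log d / \epsilon\big)}{\gamma}\right)
  = O\!\left(\frac{\log(d/\epsilon)}{\gamma}\right).
```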
power method theorem
Theorem (Basic Power Method Convergence): Let γ = (σ_1 − σ_2)/σ_1 be the relative gap between the first and second largest singular values. If Power Method is initialized with a random Gaussian vector ⃗z^{(0)} then, with high probability, after t = O(log(d/ε)/γ) steps:
∥⃗z^{(t)} − ⃗v_1∥_2 ≤ ε.
Total runtime: O(t) matrix-vector multiplications:
O(nnz(X) · log(d/ε)/γ) = O(nd · log(d/ε)/γ).
How is the ε dependence? How is the γ dependence?
krylov subspace methods
Krylov subspace methods (Lanczos method, Arnoldi method):
• How svds / eigs are actually implemented. Only need t = O(log(d/ε)/√γ) steps for the same guarantee.
Main Idea: Need to separate σ_1 from σ_i for i ≥ 2.
• Power method: power up to σ_1^{2t} and σ_i^{2t}.
• Krylov methods: apply a better degree t polynomial T_t(σ_1^2) and T_t(σ_i^2).
• Still requires just 2t matrix-vector multiplies. Why?
krylov subspace methods
[Figure: the power method's polynomial x^t vs. the optimal 'jump' polynomial.]
The optimal 'jump' polynomial in general is given by a degree t Chebyshev polynomial. Krylov methods find a polynomial tuned to the input matrix that does at least as well.
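A minimal, numerically naive sketch of the Krylov idea (my own code, not the slides' algorithm): build the subspace spanned by ⃗z, (X^T X)⃗z, . . . , (X^T X)^t ⃗z, then take the best vector inside it via a Rayleigh-Ritz step. Real implementations (Lanczos, as used by svds/eigs) do this with careful re-orthogonalization; this version only shows where the "better polynomial" comes from, since the returned vector is the best polynomial in X^T X applied to ⃗z within the subspace, and building the subspace costs just 2 matrix-vector multiplies per step.

```python
import numpy as np

def krylov_top_singular_vector(X, t=15, seed=0):
    """Rayleigh-Ritz on the Krylov subspace span{z, (X^T X)z, ..., (X^T X)^t z}."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    z = rng.standard_normal(d)
    K = [z / np.linalg.norm(z)]
    for _ in range(t):
        w = X.T @ (X @ K[-1])               # one more power of X^T X: 2 mat-vecs per step
        K.append(w / np.linalg.norm(w))
    Q, _ = np.linalg.qr(np.column_stack(K)) # orthonormal basis of the Krylov subspace
    M = Q.T @ (X.T @ (X @ Q))               # project X^T X into the subspace (Rayleigh-Ritz)
    evals, evecs = np.linalg.eigh(M)
    return Q @ evecs[:, -1]                 # best approximation to v_1 within the subspace

X = np.random.randn(2000, 300)
v = krylov_top_singular_vector(X, t=15)
v1 = np.linalg.svd(X, full_matrices=False)[2][0]
print(min(np.linalg.norm(v - v1), np.linalg.norm(v + v1)))
```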
generalizations to larger k
To accurately compute the top k singular vectors:
• Block Power Method aka Simultaneous Iteration aka Subspace Iteration aka Orthogonal Iteration.
• Block Krylov methods.
Runtime: O(ndk · log(d/ε)/√γ).
'Gapless' Runtime: O(ndk · log(d/ε)/√ε) if you just want a set of vectors that gives an ε-optimal low-rank approximation when you project onto them.
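A sketch of the block power method / simultaneous iteration for the top k singular vectors (illustrative only; the iteration count and matrix sizes are my own choices):

```python
import numpy as np

def simultaneous_iteration(X, k, t=50, seed=0):
    """Block power method: apply (X^T X)^t to a random d x k block, re-orthonormalizing."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Z = rng.standard_normal((d, k))
    for _ in range(t):
        Z = X.T @ (X @ Z)          # apply X^T X to all k vectors: O(ndk) per iteration
        Z, _ = np.linalg.qr(Z)     # re-orthonormalize to keep the k directions separated
    return Z                        # columns approximate v_1, ..., v_k

X = np.random.randn(5000, 200)
Vk = simultaneous_iteration(X, k=10)
# Quality check: projecting onto these vectors should give a near-optimal rank-10 approximation.
V10 = np.linalg.svd(X, full_matrices=False)[2][:10].T
print(np.linalg.norm(X - X @ Vk @ Vk.T), np.linalg.norm(X - X @ V10 @ V10.T))
```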
connection to random walks
Consider a random walk on a graph G with adjacency matrix A. At each step, move to a random vertex, chosen uniformly at random from the neighbors of the current vertex.
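As a small concrete example of the walk described above (a sketch; the column-stochastic transition matrix W = A D^{-1} is the standard construction, though this slide has not defined it yet):

```python
import numpy as np

# A small undirected graph: adjacency matrix of a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

deg = A.sum(axis=0)
W = A / deg                           # column j: uniform distribution over the neighbors of vertex j
p = np.array([1.0, 0.0, 0.0, 0.0])    # walk starts at vertex 0
for _ in range(3):
    p = W @ p                         # one step of the walk = one matrix-vector multiply
print(p)
```

Each step of the walk is a matrix-vector multiply with W, which is exactly the kind of power-iteration computation analyzed above.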