  1. Low Rank Approximation, Lecture 5. Daniel Kressner, Chair for Numerical Algorithms and HPC, Institute of Mathematics, EPFL. daniel.kressner@epfl.ch

  2. Randomized column/row sampling. Aim: obtain a rank-$r$ approximation from randomly selected rows and columns of $A$. Popular sampling strategies: ◮ Uniform sampling. ◮ Sampling based on row/column norms. ◮ Sampling based on more complicated quantities.

  3. Preliminaries on randomized sampling. Exponential function example from Lecture 4 (Slide 14). Comparison between the best approximation, the greedy approximation, and the approximation obtained by randomly selecting rows. [Figure: four semilogarithmic plots of approximation error, on a log scale from $10^{0}$ down to $10^{-10}$, against rank $0$ to $10$.]

  4. Preliminaries on randomized sampling. A simple way to fool uniformly random row selection: $U = \begin{bmatrix} I_r \\ 0_{(n-r)\times r} \end{bmatrix}$ for $n$ very large and $r \ll n$. All the information sits in the first $r$ rows, which uniform sampling will almost surely miss.

  5. Column sampling. Basic algorithm aiming at a rank-$r$ approximation:
1. Sample (and possibly rescale) $k > r$ columns of $A$, giving an $m \times k$ matrix $C$.
2. Compute the SVD $C = U \Sigma V^T$ and set $Q = U_r \in \mathbb{R}^{m \times r}$, the $r$ dominant left singular vectors.
3. Return the low-rank approximation $QQ^T A$.
◮ Can be combined with a streaming algorithm [Liberty'2007] to limit the memory and cost of working with $C$.
◮ The quality of the approximation crucially depends on the sampling strategy.
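The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the algorithm analyzed below: uniform sampling stands in for the unspecified strategy, and the function name is ours.

```python
import numpy as np

def sampled_lowrank(A, r, k, rng=None):
    """Rank-r approximation Q @ Q.T @ A built from k > r sampled columns.
    Uniform column sampling is used here only as a placeholder strategy."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    # 1. Sample k columns of A (no rescaling in this basic sketch).
    C = A[:, rng.integers(0, n, size=k)]
    # 2. SVD of C; Q holds the r dominant left singular vectors.
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    Q = U[:, :r]
    # 3. Low-rank approximation: orthogonal projection of A onto range(Q).
    return Q @ (Q.T @ A), Q

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 40))  # exact rank 8
A_approx, Q = sampled_lowrank(A, r=8, k=20, rng=1)
```

For this exactly rank-8 matrix, 20 sampled columns generically span the column space, so the projection recovers $A$ up to rounding error; for matrices with slowly decaying singular values the sampling strategy matters far more.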

  6. Column sampling. Lemma. For any matrix $C \in \mathbb{R}^{m \times k}$, let $Q$ be the matrix computed above. Then
$\|A - QQ^T A\|_2^2 \le \sigma_{r+1}(A)^2 + 2\,\|AA^T - CC^T\|_2 .$
Proof. We have
$(A - QQ^T A)(A - QQ^T A)^T = (I - QQ^T)\,CC^T\,(I - QQ^T) + (I - QQ^T)(AA^T - CC^T)(I - QQ^T).$
Hence
$\|A - QQ^T A\|_2^2 = \lambda_{\max}\big((A - QQ^T A)(A - QQ^T A)^T\big) \le \lambda_{\max}\big((I - QQ^T)\,CC^T\,(I - QQ^T)\big) + \|AA^T - CC^T\|_2 = \sigma_{r+1}(C)^2 + \|AA^T - CC^T\|_2 .$
The proof is completed by applying Weyl's inequality:
$\sigma_{r+1}(C)^2 = \lambda_{r+1}(CC^T) \le \lambda_{r+1}(AA^T) + \|AA^T - CC^T\|_2 = \sigma_{r+1}(A)^2 + \|AA^T - CC^T\|_2 .$
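The bound in this lemma is deterministic, so it can be checked numerically for any concrete $A$ and $C$. A small NumPy sketch (matrix sizes and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, k = 20, 30, 3, 8
A = rng.standard_normal((m, n))

# Build C from k rescaled uniform column samples: p_l = 1/n,
# c_t = a_{j_t} / sqrt(k * p_{j_t}) = a_{j_t} * sqrt(n / k).
idx = rng.integers(0, n, size=k)
C = A[:, idx] * np.sqrt(n / k)

# Q = r dominant left singular vectors of C.
U, _, _ = np.linalg.svd(C, full_matrices=False)
Q = U[:, :r]

# Both sides of the lemma's inequality.
lhs = np.linalg.norm(A - Q @ (Q.T @ A), 2) ** 2
sigma = np.linalg.svd(A, compute_uv=False)
rhs = sigma[r] ** 2 + 2 * np.linalg.norm(A @ A.T - C @ C.T, 2)
```

Here `sigma[r]` is $\sigma_{r+1}(A)$ (0-based indexing), and `lhs <= rhs` holds for every realization, not just in expectation.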

  7. Random column sampling. Using the lemma, the goal now becomes to approximate the matrix product $AA^T$ using column samples of $A$. Notation:
$A = [\, a_1 \; \cdots \; a_n \,], \qquad C = [\, c_1 \; \cdots \; c_k \,].$
General sampling method:
Input: $A \in \mathbb{R}^{m \times n}$, probabilities $p_1, \ldots, p_n \ne 0$, integer $k$.
Output: $C \in \mathbb{R}^{m \times k}$ containing selected (rescaled) columns of $A$.
1: for $t = 1, \ldots, k$ do
2:   Pick $j_t \in \{1, \ldots, n\}$ with $P[j_t = \ell] = p_\ell$, $\ell = 1, \ldots, n$, independently and with replacement.
3:   Set $c_t = a_{j_t} / \sqrt{k\, p_{j_t}}$.
4: end for
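The general sampling method translates directly into NumPy; the function name `sample_columns` is ours, not from the slides.

```python
import numpy as np

def sample_columns(A, p, k, rng=None):
    """Pick k column indices j_t i.i.d. with P[j_t = l] = p_l (with
    replacement), rescaling each sample: c_t = a_{j_t} / sqrt(k * p_{j_t})."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(A.shape[1], size=k, replace=True, p=p)
    return A[:, idx] / np.sqrt(k * p[idx])

A = np.arange(12.0).reshape(3, 4)
p = np.full(4, 0.25)  # uniform probabilities, just as a simple example
C = sample_columns(A, p, k=100, rng=0)
```

The rescaling by $1/\sqrt{k\,p_{j_t}}$ is what makes $CC^T$ an unbiased estimator of $AA^T$, as the next slide shows.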

  8. Random column sampling. Lemma. For the matrix $C$ returned by the algorithm, it holds that
$E[CC^T] = AA^T, \qquad \mathrm{Var}[(CC^T)_{ij}] = \frac{1}{k} \sum_{\ell=1}^{n} \frac{a_{i\ell}^2\, a_{j\ell}^2}{p_\ell} - \frac{1}{k} (AA^T)_{ij}^2 .$
Proof. For fixed $i, j$, consider $X_t = (c_t c_t^T)_{ij} = \frac{1}{k\, p_{j_t}}\, a_{i, j_t}\, a_{j, j_t}$, for which
$E[X_t] = \sum_{\ell=1}^{n} p_\ell \, \frac{1}{k\, p_\ell}\, a_{i\ell}\, a_{j\ell} = \frac{1}{k} (AA^T)_{ij} .$
Analogously,
$\mathrm{Var}(X_t) = E[(X_t - E[X_t])^2] = E[X_t^2] - E[X_t]^2 = \frac{1}{k^2} \sum_{\ell=1}^{n} \frac{a_{i\ell}^2\, a_{j\ell}^2}{p_\ell} - \frac{1}{k^2} (AA^T)_{ij}^2 .$
Because of independence, it follows that $E[\sum_t X_t] = k \cdot E[X_t] = (AA^T)_{ij}$, and analogously for the variance.
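The unbiasedness claim $E[CC^T] = AA^T$ is easy to check by Monte Carlo simulation. A small sketch, assuming NumPy; the trial count is chosen ad hoc:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 12))
# Norm-based probabilities (any nonzero p gives unbiasedness).
p = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, "fro") ** 2
k = 10

# Average CC^T over many independent runs; the lemma says the mean is AA^T.
trials = 5000
acc = np.zeros((5, 5))
for _ in range(trials):
    idx = rng.choice(12, size=k, replace=True, p=p)
    C = A[:, idx] / np.sqrt(k * p[idx])
    acc += C @ C.T
mean_CCt = acc / trials
```

The empirical mean converges to $AA^T$ at the usual $1/\sqrt{\text{trials}}$ Monte Carlo rate, independently of the choice of $p$; only the variance depends on $p$.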

  9. Random column sampling. As a consequence of the lemma,
$E[\|AA^T - CC^T\|_F^2] = \sum_{ij} E[(AA^T - CC^T)_{ij}^2] = \sum_{ij} \mathrm{Var}[(CC^T)_{ij}] = \sum_{ij} \Big( \frac{1}{k} \sum_{\ell=1}^{n} \frac{a_{i\ell}^2\, a_{j\ell}^2}{p_\ell} - \frac{1}{k} (AA^T)_{ij}^2 \Big) = \frac{1}{k} \Big( \sum_{\ell=1}^{n} \frac{\|a_\ell\|_2^4}{p_\ell} - \|AA^T\|_F^2 \Big).$
Lemma. The choice $p_\ell = \|a_\ell\|_2^2 / \|A\|_F^2$ minimizes $E[\|AA^T - CC^T\|_F^2]$ and yields
$E[\|AA^T - CC^T\|_F^2] = \frac{1}{k} \big( \|A\|_F^4 - \|AA^T\|_F^2 \big).$
Proof. Established by showing that this choice of $p_\ell$ satisfies the first-order conditions of the constrained optimization problem $\min \sum_\ell \|a_\ell\|_2^4 / p_\ell$ subject to $\sum_\ell p_\ell = 1$.

  10. Random column sampling. Norm-based sampling:
Input: $A \in \mathbb{R}^{m \times n}$, integer $k$.
Output: rank-$r$ approximation $QQ^T A$.
1: Set $p_\ell = \|a_\ell\|_2^2 / \|A\|_F^2$ for $\ell = 1, \ldots, n$.
2: for $t = 1, \ldots, k$ do
3:   Pick $j_t \in \{1, \ldots, n\}$ with $P[j_t = \ell] = p_\ell$, $\ell = 1, \ldots, n$, independently and with replacement.
4:   Set $c_t = a_{j_t} / \sqrt{k\, p_{j_t}}$.
5: end for
6: Compute the SVD $C = U \Sigma V^T$ and set $Q = U_r \in \mathbb{R}^{m \times r}$.
7: Return the low-rank approximation $QQ^T A$.
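Putting the pieces together, the full norm-based algorithm can be sketched as follows (a NumPy sketch; the function name is ours):

```python
import numpy as np

def norm_sampling_lowrank(A, r, k, rng=None):
    """Norm-based column sampling: p_l = ||a_l||^2 / ||A||_F^2, followed by
    the SVD step that yields the rank-r approximation Q @ Q.T @ A."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    # Step 1: column-norm probabilities.
    p = np.linalg.norm(A, axis=0) ** 2 / np.linalg.norm(A, "fro") ** 2
    # Steps 2-5: sample k columns i.i.d. with replacement and rescale.
    idx = rng.choice(n, size=k, replace=True, p=p)
    C = A[:, idx] / np.sqrt(k * p[idx])
    # Steps 6-7: SVD of C, keep r dominant left singular vectors, project.
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    Q = U[:, :r]
    return Q @ (Q.T @ A)

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 10)) @ rng.standard_normal((10, 80))  # rank 10
err = np.linalg.norm(A - norm_sampling_lowrank(A, r=10, k=40, rng=1), 2)
```

For this exactly rank-10 example the sampled columns generically span the column space and the error is at the level of rounding; the theorem below quantifies the additive error for general $A$.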

  11. Random column sampling. Lemma. For the matrix $C$ returned by the algorithm, it holds with probability $1 - \delta$ that
$\|AA^T - CC^T\|_F \le \frac{\eta}{\sqrt{k}} \|A\|_F^2 ,$
where $\eta = 1 + \sqrt{8 \log(1/\delta)}$.
Proof. We aim at applying the Azuma-Hoeffding inequality. Define
$F(i_1, i_2, \ldots, i_k) = \|AA^T - CC^T\|_F , \quad \text{with } C = [\, a_{i_1}/\sqrt{k p_{i_1}} \; \cdots \; a_{i_k}/\sqrt{k p_{i_k}} \,].$
Quantify the effect of varying an index (w.l.o.g. the first one) on $F$:
$|F(i_1, i_2, \ldots, i_k) - F(i_1', i_2, \ldots, i_k)| = \big| \|AA^T - CC^T\|_F - \|AA^T - C'C'^T\|_F \big| \le \|CC^T - C'C'^T\|_F \le \frac{1}{k\, p_{i_1}} \|a_{i_1}\|_2^2 + \frac{1}{k\, p_{i_1'}} \|a_{i_1'}\|_2^2 \le \frac{2}{k} \|A\|_F^2 =: \Delta .$

  12. Random column sampling. This implies that the Doob martingale $g_\ell = E[F(i_1, \ldots, i_k) \mid i_1, \ldots, i_\ell]$ for $0 \le \ell \le k$ satisfies $|g_{\ell+1} - g_\ell| \le \Delta$. Note that $g_k = \|AA^T - CC^T\|_F$ and $g_0 = E[\|AA^T - CC^T\|_F]$. By the previous lemma and Jensen's inequality we know that $g_0 \le \|A\|_F^2 / \sqrt{k}$. Applying the Azuma-Hoeffding inequality yields
$P\big[ \|AA^T - CC^T\|_F \ge \|A\|_F^2/\sqrt{k} + \gamma \big] \le \exp\big( -\gamma^2 / (2 k \Delta^2) \big) =: \delta .$
Setting $\gamma = \sqrt{8 \log(1/\delta)}\, \|A\|_F^2 / \sqrt{k}$ completes the proof.

  13. Random column sampling. Theorem (Drineas/Kannan/Mahoney'2006). For the matrix $Q$ returned by the algorithm above it holds that
$E\big[ \|A - QQ^T A\|_2^2 \big] \le \sigma_{r+1}(A)^2 + \varepsilon \|A\|_F^2 \quad \text{for } k \ge 4/\varepsilon^2 .$
With probability at least $1 - \delta$,
$\|A - QQ^T A\|_2^2 \le \sigma_{r+1}(A)^2 + \varepsilon \|A\|_F^2 \quad \text{for } k \ge 4 \big( 1 + \sqrt{8 \log(1/\delta)} \big)^2 / \varepsilon^2 .$
Proof. Follows from combining the very first lemma with the last two lemmas.
Remarks:
◮ The dependence of $k$ on $\varepsilon$ is pretty bad. One is unlikely to achieve something significantly better with sampling based on column norms only, without assuming further properties of $A$ (e.g., incoherence of singular vectors).
◮ Simple "counter example":
$A = \frac{1}{\sqrt{n}} \big[\, e_1 \; e_1 \; \cdots \; e_1 \; e_2 \,\big] \in \mathbb{R}^{n \times (n+1)} .$

  14. Random column sampling. [Drineas/Mahoney/Muthukrishnan'2007]: Let $V_k$ contain the $k$ dominant right singular vectors of $A$. Setting
$p_\ell = \|V_k(\ell, :)\|_2^2 / k , \quad \ell = 1, \ldots, n ,$
and sampling $O(k^2 (\log 1/\delta)/\varepsilon^2)$ columns yields
$\|A - QQ^T A\|_F \le (1 + \varepsilon)\, \|A - T_k(A)\|_F$
with probability $1 - \delta$, where $T_k(A)$ denotes the best rank-$k$ approximation. A relative error bound! (There are variants that improve the number of sampled columns to $O(k \log k \log(1/\delta)/\varepsilon^2)$.)
A CUR decomposition can be obtained by applying these ideas to rows and columns (yielding $R$ and $C$, respectively) and choosing $U$ appropriately.
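The subspace-sampling probabilities (leverage scores) can be computed from an SVD. A sketch assuming NumPy; an exact SVD is used here purely for illustration, whereas in practice $V_k$ would typically be approximated:

```python
import numpy as np

def leverage_score_probabilities(A, k):
    """Subspace sampling probabilities p_l = ||V_k(l, :)||^2 / k, where V_k
    holds the k dominant right singular vectors of A (exact SVD here)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :].T                    # n x k, orthonormal columns
    return np.sum(Vk ** 2, axis=1) / k  # row norms squared, divided by k

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 20))
p = leverage_score_probabilities(A, k=5)
```

Since the columns of $V_k$ are orthonormal, the row norms squared sum to $k$, so `p` is automatically a probability vector; it can be fed directly into the general sampling method from Slide 7.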
