Low-Rank Approximation, Lecture 3
Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
daniel.kressner@epfl.ch
Basic randomized algorithm for low-rank approx

Must read: Halko/Martinsson/Tropp '2010: Finding Structure with Randomness...

Randomized algorithm:
1. Draw Gaussian random matrix Ω ∈ R^{n×r}.
2. Perform block mat-vec Y = AΩ.
3. Compute (economic) QR decomposition Y = QR.
4. Form B = Q^T A.
5. Return Â = QB (in factorized form).

Exact recovery: If A has rank r, we recover Â = A with probability 1.
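A minimal NumPy sketch of the five steps above; the function name and the use of NumPy's reduced QR are implementation choices, not part of the slides.

```python
import numpy as np

def randomized_lowrank(A, r, rng=None):
    """Basic randomized low-rank approximation: return Q, B with A ≈ Q @ B."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((A.shape[1], r))  # 1. Gaussian random matrix
    Y = A @ Omega                                 # 2. block mat-vec
    Q, _ = np.linalg.qr(Y)                        # 3. economic QR (Q has r columns)
    B = Q.T @ A                                   # 4. project A onto range(Q)
    return Q, B                                   # 5. Â = Q @ B in factorized form
```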
Three test matrices

(a) The 100 × 100 Hilbert matrix A defined by A(i, j) = 1/(i + j − 1).
(b) The matrix A defined by A(i, j) = exp(−γ |i − j| / n) with γ = 0.1.
(c) The 30 × 30 diagonal matrix with diagonal entries
    1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, 0.99/100, 0.98/100, ...

[Figure: singular values of the three test matrices, decaying from 10^0 down to about 10^{-20}.]
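A sketch for constructing the three test matrices; the staircase matrix follows the pattern reconstructed above (groups of 1, 0.99, 0.98 scaled by successive powers of 1/10), which is an interpretation of the garbled slide.

```python
import numpy as np

def hilbert(n=100):
    i, j = np.ogrid[1:n+1, 1:n+1]
    return 1.0 / (i + j - 1)                      # (a) Hilbert matrix

def exp_decay(n=100, gamma=0.1):
    i, j = np.ogrid[1:n+1, 1:n+1]
    return np.exp(-gamma * np.abs(i - j) / n)     # (b) slower singular value decay

def staircase(n=30):
    # (c) groups of (1, 0.99, 0.98) scaled by 10^{-k}, k = 0, 1, 2, ...
    d = [c * 10.0**(-k) for k in range(n // 3) for c in (1.0, 0.99, 0.98)]
    return np.diag(d)
```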
Randomized algorithm applied to test matrices

Errors measured in the spectral norm:

(a) Hilbert matrix, r = 5:
      Exact     mean      std
      0.0019    0.0092    0.0099

(b) Matrix with slower decay, r = 25:
      Exact     mean      std
      0.0034    0.012     0.002

(c) Matrix with staircase singular values, r = 7:
      Exact     mean      std
      0.010     0.038     0.025
Randomized algorithm applied to test matrices

Errors measured in the Frobenius norm:

(a) Hilbert matrix, r = 5:
      Exact     mean      std
      0.0019    0.0093    0.0099

(b) Matrix with slower decay, r = 25:
      Exact     mean      std
      0.011     0.024     0.001

(c) Matrix with staircase singular values, r = 7:
      Exact     mean      std
      0.014     0.041     0.024
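A sketch of how such statistics could be reproduced, reusing randomized_lowrank from the sketch above; the 100-trial sample size is an assumption, since the slides do not state how many runs were averaged.

```python
import numpy as np

def experiment(A, r, trials=100, norm_ord='fro'):
    # "Exact" = best rank-r error from the SVD; compare with mean/std of
    # the randomized error over repeated trials.
    s = np.linalg.svd(A, compute_uv=False)
    exact = s[r] if norm_ord == 2 else np.sqrt(np.sum(s[r:]**2))
    errs = []
    for _ in range(trials):
        Q, B = randomized_lowrank(A, r)            # sketch from earlier slide
        errs.append(np.linalg.norm(A - Q @ B, ord=norm_ord))
    return exact, np.mean(errs), np.std(errs)
```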
Basic randomized algorithms for low-rank approx

Add oversampling by a (usually small) integer p.

Randomized algorithm:
1. Draw standard Gaussian random matrix Ω ∈ R^{n×(r+p)}.
2. Perform block mat-vec Y = AΩ.
3. Compute (economic) QR decomposition Y = QR.
4. Form B = Q^T A.
5. Return Â = QB (in factorized form).

Problem: Â has rank r + p > r.
Solution: Compress B ≈ T_r(B); then Q T_r(B) has rank r.

Error:
‖Q T_r(B) − A‖ = ‖Q T_r(B) − QB + QB − A‖ ≤ ‖T_r(B) − B‖ + ‖(I − QQ^T) A‖.
Basic randomized algorithms for low-rank approx

Add oversampling by a (usually small) integer p.

Randomized algorithm:
1. Draw standard Gaussian random matrix Ω ∈ R^{n×(r+p)}.
2. Perform block mat-vec Y = AΩ.
3. Compute (economic) QR decomposition Y = QR.
4. Form B = Q^T A.
5. Return Â = Q T_r(B) (in factorized form).

Gold standard, the best rank-r approximation error:
◮ spectral norm: σ_{r+1}
◮ Frobenius norm: sqrt(σ_{r+1}^2 + · · · + σ_n^2).
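A possible NumPy version of the oversampled algorithm, where T_r(B) is computed from an SVD of the small matrix B; the function name and the default p = 10 are choices, not from the slides.

```python
import numpy as np

def randomized_lowrank_oversampled(A, r, p=10, rng=None):
    """Randomized approximation with oversampling, truncated back to rank r."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((A.shape[1], r + p))
    Q, _ = np.linalg.qr(A @ Omega)                # Q has r + p columns
    B = Q.T @ A
    # T_r(B): best rank-r approximation of B via its (small) SVD
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U[:, :r], s[:r, None] * Vt[:r]     # Â = (Q U_r) diag(s_r) V_r^T
```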
Randomized algorithm applied to test matrices

Errors measured in the spectral norm:

(a) Hilbert matrix, r = 5:
              Exact     mean      std
    p = 0     0.0019    0.0092    0.0099
    p = 1     0.0019    0.0026    0.0019
    p = 2     0.0019    0.0019    0.0001

(b) Matrix with slower decay, r = 25:
              Exact     mean      std
    p = 0     0.0034    0.012     0.002
    p = 1     0.0034    0.011     0.0017
    p = 2     0.0034    0.010     0.0015
    p = 10    0.0034    0.0064    0.0008
    p = 25    0.0034    0.0037    0.0002

(c) Matrix with staircase singular values, r = 7:
              Exact     mean      std
    p = 0     0.010     0.038     0.025
    p = 1     0.010     0.021     0.012
    p = 2     0.010     0.012     0.005
Analysis: basic setting

Goal: Say something sensible about ‖(I − QQ^T) A‖: expected value, tail bounds, ...

The multivariate normal distribution X ∼ N(µ, Σ) with mean µ ∈ R^n and (positive definite) covariance matrix Σ ∈ R^{n×n} has density

    f_X(x) = 1 / sqrt((2π)^n det(Σ)) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) ).

X ∼ N(0, I_m) is called a Gaussian random vector.

Lemma. Let V ∈ R^{m×n} have orthonormal columns. If X ∼ N(0, I_m), then V^T X ∼ N(0, I_n).

This invariance of Gaussian random vectors allows us to assume w.l.o.g. in the analysis that A = Σ is diagonal (and square).
Analysis: r = 1, p = 0

Partition

    Ω = [ ω_1 ; ω_2 ],    A = [ σ_1  0 ; 0  Σ_2 ].

Then

    Y = AΩ = [ σ_1 ω_1 ; Σ_2 ω_2 ] = σ_1 ω_1 [ 1 ; σ_1^{−1} ω_1^{−1} Σ_2 ω_2 ].

Problem: Need to control the norm of σ_1^{−1} ω_1^{−1} Σ_2 ω_2, but ω_1^{−1} (the reciprocal of a standard normal random variable) has undefined mean and variance; each entry of ω_1^{−1} ω_2, a ratio of independent standard normals, is Cauchy distributed. Need to consider p ≥ 2.
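An empirical illustration (not from the slides) of the heavy tails: sample means of 1/ω fail to settle down as the sample grows, consistent with an undefined mean.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (10**3, 10**5, 10**7):
    w = rng.standard_normal(n)
    print(n, np.mean(1.0 / w))   # wildly fluctuating, does not converge
```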
Analysis: r = 1, p ≥ 2

Lemma. Let Π_Ỹ, Π_Y be the orthogonal projectors onto subspaces Ỹ ⊂ Y. Then

    ‖Π_Ỹ A‖_F ≤ ‖Π_Y A‖_F,    ‖(I − Π_Ỹ) A‖_F ≥ ‖(I − Π_Y) A‖_F

for any matrix A. Proof: EFY.

Set

    Y = AΩ = [ σ_1 Ω_1 ; Σ_2 Ω_2 ],    Ỹ = [ 1 ; σ_1^{−1} Σ_2 Ω_2 Ω_1^† ],

where Ω_1^† is the pseudoinverse of Ω_1. Because Ω_1 is a row vector and nonzero (with probability one), we obtain Ω_1^† = Ω_1^T / ‖Ω_1‖_2^2. Thus range(Ỹ) ⊂ range(Y).

Thus, with f = σ_1^{−1} Σ_2 Ω_2 Ω_1^†, we get

    ‖(I − QQ^T) A‖_F ≤ ‖(I − Q̃ Q̃^T) A‖_F,    Q̃ = 1/sqrt(1 + ‖f‖_2^2) · [ 1 ; f ].
Analysis: r = 1, p ≥ 2

    ‖(I − Q̃ Q̃^T) A‖_F^2 = ‖A‖_F^2 − ‖Q̃^T A‖_F^2

    ‖Q̃^T A‖_F^2 = (σ_1^2 + ‖Σ_2 f‖_2^2) / (1 + ‖f‖_2^2) ≥ σ_1^2 / (1 + ‖f‖_2^2)
                ≥ σ_1^2 (1 − ‖f‖_2^2) = σ_1^2 − ‖Σ_2 Ω_2 Ω_1^†‖_2^2

In summary, we have:

Lemma. ‖(I − Q̃ Q̃^T) A‖_F^2 ≤ ‖Σ_2‖_F^2 + ‖Σ_2 Ω_2 Ω_1^†‖_F^2.

To analyze the second term, we use

    E ‖Σ_2 Ω_2 Ω_1^†‖_F^2 = E [ E [ ‖Σ_2 Ω_2 Ω_1^†‖_F^2 | Ω_1 ] ] = ‖Σ_2‖_F^2 · E ‖Ω_1^†‖_F^2.

(See exercises for a proof that E ‖AΩB‖_F^2 = ‖A‖_F^2 ‖B‖_F^2 for a Gaussian matrix Ω and constant matrices A, B.)
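A Monte Carlo sanity check (illustrative, not from the slides) of the identity E ‖AΩB‖_F^2 = ‖A‖_F^2 ‖B‖_F^2; the matrix sizes and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A, B = rng.standard_normal((4, 6)), rng.standard_normal((5, 3))
vals = [np.linalg.norm(A @ rng.standard_normal((6, 5)) @ B)**2
        for _ in range(20000)]
# Empirical mean vs. ‖A‖_F² ‖B‖_F²; the two numbers should be close.
print(np.mean(vals), np.linalg.norm(A)**2 * np.linalg.norm(B)**2)
```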
Analysis: r = 1, p ≥ 2

It remains to control ‖Ω_1‖_F^2 and ‖Ω_1^†‖_F^2 = 1/‖Ω_1‖_F^2. Here, ‖Ω_1‖_F^2 is a sum of p + 1 squared independent standard normal random variables.

The pdf of X ∼ N(0, 1) is f_X(x) = (1/sqrt(2π)) e^{−x²/2}. The pdf of Y = X² is zero for nonpositive values. For y > 0, we obtain

    Pr(0 ≤ Y ≤ y) = Pr(−sqrt(y) ≤ X ≤ sqrt(y)) = (2/sqrt(2π)) ∫_0^{sqrt(y)} e^{−x²/2} dx
                  = (1/sqrt(2π)) ∫_0^y t^{−1/2} e^{−t/2} dt,

so Y has the chi-squared distribution (1 degree of freedom): Y ∼ χ²_1.

Hence ‖Ω_1‖_F^2 ∼ χ²_{p+1}, the chi-squared distribution with p + 1 degrees of freedom, with pdf

    f(x) = 2^{−(p+1)/2} / Γ((p+1)/2) · x^{(p+1)/2 − 1} exp(−x/2),    x > 0.
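A quick empirical check (illustrative): ‖Ω_1‖_F^2 for a Gaussian row vector Ω_1 of length p + 1 matches the mean p + 1 and variance 2(p + 1) of a χ²_{p+1} variable.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3
samples = rng.standard_normal((100000, p + 1))
q = np.sum(samples**2, axis=1)          # ‖Ω_1‖_F² over many draws
print(q.mean(), p + 1)                  # empirical mean vs. p + 1
print(q.var(), 2 * (p + 1))             # empirical variance vs. 2(p + 1)
```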
Analysis: r = 1, p ≥ 2

    ‖Ω_1^†‖_F^2 = 1/‖Ω_1‖_F^2 = ( Σ_{i=1}^{p+1} Ω_{1,i}^2 )^{−1} ∼ Inv-χ²(p + 1),

the inverse-chi-squared distribution with p + 1 degrees of freedom, with pdf

    f(x) = 2^{−(p+1)/2} / Γ((p+1)/2) · x^{−(p+1)/2 − 1} exp(−1/(2x)).

[Figure: pdf of Inv-χ²(p + 1) for p = 1, p = 3, p = 9.]
Analysis: r = 1, p ≥ 2

Textbook results:
◮ E ‖Ω_1‖_F^2 = p + 1,    E ‖Ω_1^†‖_F^2 = (p − 1)^{−1}.
◮ Tail bound by [Laurent/Massart'2000]:

    P( ‖Ω_1‖_F^2 ≤ p + 1 − t ) ≤ exp( −t² / (4(p + 1)) ).

Theorem. For r = 1, p ≥ 2, we have

    E ‖(I − QQ^T) A‖_F ≤ sqrt(1 + 1/(p − 1)) · ‖Σ_2‖_F.

The probability of deviating from this upper bound decays exponentially, as indicated by the tail bound for χ²_{p+1}.
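A Monte Carlo check (illustrative) of the theorem's bound for r = 1 on a small diagonal matrix; the matrix, p, and trial count are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(3)
A = np.diag([1.0, 0.1, 0.05, 0.01])     # Σ with σ_1 = 1, Σ_2 = diag(0.1, 0.05, 0.01)
p = 2
errs = []
for _ in range(10000):
    Omega = rng.standard_normal((4, 1 + p))
    Q, _ = np.linalg.qr(A @ Omega)
    errs.append(np.linalg.norm(A - Q @ (Q.T @ A)))   # Frobenius norm error
sigma2 = np.linalg.norm(np.diag(A)[1:])
# Empirical E‖(I − QQᵀ)A‖_F vs. the bound sqrt(1 + 1/(p−1)) ‖Σ_2‖_F:
print(np.mean(errs), np.sqrt(1 + 1/(p - 1)) * sigma2)
```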
Analysis: general r, p ≥ 2

Lemma. ‖(I − QQ^T) A‖_F^2 ≤ ‖Σ_2‖_F^2 + ‖Σ_2 Ω_2 Ω_1^†‖_F^2.

Proof. Set

    Y = AΩ = [ Σ_1 Ω_1 ; Σ_2 Ω_2 ],    Ỹ = [ I_r ; F ],    F = Σ_2 Ω_2 Ω_1^† Σ_1^{−1}.

An ONB of range(Ỹ) is given by

    Q̃ = [ I_r ; F ] (I_r + F^T F)^{−1/2}.

Then

    ‖(I − QQ^T) A‖_F^2 ≤ ‖(I − Q̃ Q̃^T) A‖_F^2 = ‖A‖_F^2 − ‖Q̃^T A‖_F^2

and

    ‖Q̃^T A‖_F^2 = ‖(I_r + F^T F)^{−1/2} Σ_1‖_F^2 + ‖(I_r + F^T F)^{−1/2} F^T Σ_2‖_F^2
                ≥ ‖(I_r + F^T F)^{−1/2} Σ_1‖_F^2 = trace(Σ_1 (I_r + F^T F)^{−1} Σ_1)
                ≥ trace(Σ_1 (I_r − F^T F) Σ_1) = ‖Σ_1‖_F^2 − ‖Σ_2 Ω_2 Ω_1^†‖_F^2.

(Proof of trace(Σ_1 (I_r + F^T F)^{−1} Σ_1) ≥ trace(Σ_1 (I_r − F^T F) Σ_1) on next slide.)
Let G = F^T F. Then we have

    (I + G) [ (I + G)^{−1} − (I − G) ] = I − (I + G)(I − G) = G²

and hence

    (I + G)^{−1} − (I − G) = (I + G)^{−1} G² = G (I + G)^{−1} G

is symmetric positive semidefinite. This implies that

    Σ_1 [ (I + G)^{−1} − (I − G) ] Σ_1

is symmetric positive semidefinite as well, and hence

    0 ≤ trace( Σ_1 [ (I + G)^{−1} − (I − G) ] Σ_1 )

or, equivalently, trace(Σ_1 (I_r + F^T F)^{−1} Σ_1) ≥ trace(Σ_1 (I_r − F^T F) Σ_1).
Analysis: general r, p ≥ 2

Again use E ‖Σ_2 Ω_2 Ω_1^†‖_F^2 = ‖Σ_2‖_F^2 · E ‖Ω_1^†‖_F^2. By standard results in multivariate statistics, we have

    E ‖Ω_1^†‖_F^2 = r / (p − 1).

Sketch of argument:
◮ Ω_1 Ω_1^T ∼ W_r(r + p) (Wishart distribution with r + p degrees of freedom)
◮ (Ω_1 Ω_1^T)^{−1} ∼ W_r^{−1}(r + p) (inverse Wishart distribution with r + p degrees of freedom)
◮ E (Ω_1 Ω_1^T)^{−1} = 1/(p − 1) · I_r; see Page 96 in [Muirhead'1982]
◮ Result follows from ‖Ω_1^†‖_F^2 = ‖Ω_1^T (Ω_1 Ω_1^T)^{−1}‖_F^2 = trace( (Ω_1 Ω_1^T)^{−1} )

R. J. Muirhead, Aspects of Multivariate Statistical Theory, Wiley, New York, NY, 1982.
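A Monte Carlo check (illustrative) of E ‖Ω_1^†‖_F^2 = r/(p − 1) for a Gaussian r × (r + p) matrix; r, p, and the trial count are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(4)
r, p = 5, 3
vals = [np.linalg.norm(np.linalg.pinv(rng.standard_normal((r, r + p))))**2
        for _ in range(20000)]
# Empirical E‖Ω_1†‖_F² vs. r/(p − 1); the two numbers should be close.
print(np.mean(vals), r / (p - 1))
```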