Optimized Polar Decomposition for Modern Computer Architectures Joint work with Pierre.Blanchard † @manchester.ac.uk NLA Group Meeting Manchester, UK. October 2, 2018 † School of Mathematics, The University of Manchester
Introduction Introduction 2
Context NLAFET (H2020 Project) - end April 2019 • Task-based implementation of algorithms in Plasma library • Porting from QUARK to OpenMP Runtime System • Novel SVD algorithms: • 2-stage SVD : reduction to band format then eigensolver or SVD (D&C or QR). • Polar Decomposition-based SVD : QDWH iterations. • Benefit of QDWH only at large scale on many-core architectures. 3
Polar Decomposition The Polar Decomposition For all non-singular A ∈ C m × n , there exists a unique decomposition A = UH where • U ∈ C m × n a set of n orthonormal columns • H ∈ C n × n a Hermitian pos. semi-def. matrix Reverse or Left PD A = H ℓ U , with H ℓ = UHU ∗ ∈ C m × m Relation with SVD • Decompositions are equivalent • Proofs of PD rely on SVD 4
Polar Decomposition Applications • Matrix nearness [Higham, 1986] • Nearest orthogonal matrix ( Procrustes problem) • Factor Analysis, Multidimensional Scaling, . . . • Optimization (Gradient descent) • Nearby ( H ) or nearest ( . 5( H + A )) pos. def. matrix • Matrix functions: square root, p -th root, . . . • ex: Square root of spd A ∈ R n × n � If A = LL T ( Chol ) H = A 1 / 2 ⇒ and L T = UH ( PD ) 5
Polar Decomposition Application to Matrix Decompositions • SVD of A ∈ C m × n � A = W Σ V T If A = UH ( PD ) ( SVD ) ⇒ and H = V Σ V T ( EVD ) with W = UV • QDWH-Eig not detailled here • Only a portion of the spectrum • Spectral D&C algorithm (QDWH + subspace iterations) 6
Polar Decomposition Algorithms Root finding algorithms on singular values of U . • Scaled Newton’s iterations • 9 iterations (2nd order) • Backward stable • Q R-based ( D ynamically W eighted) Halley’s iterations • 6 iterations (3rd order - Pad´ e family) • Backward stable [Nakatsukasa et al., 2010] • Inverse-free + communications-friendly • Many cheap (Lvl3 Blas) flops 7
Polar Decomposition Algorithms Root finding algorithms on singular values of U . • Scaled Newton’s iterations • 9 iterations (2nd order) • Backward stable • Q R-based ( D ynamically W eighted) Halley’s iterations • 6 iterations (3rd order - Pad´ e family) • Backward stable [Nakatsukasa et al., 2010] • Inverse-free + communications-friendly • Many cheap (Lvl3 Blas) flops • Zolotarev functions • 2 iterations (order 17!) • More flops but more paralellism • Well-suited for high granularity computing resources! • Best rational approximant of matrix sign function , see [Nakatsukasa and Freund, 2014] or [Higham, 2008, Ch.5]. 7
State of the art implementation H. Ltaief, D. Sukkari & D. Keyes (KAUST) • PD, PD-SVD and PD-eig (PD=QDWH or Zolo) • Distributed memory: Scalapack + Chameleon + StarPU • Massively Parallel PD [Ltaief et al., 2018] • Zolo up to 2 . 3 × faster than QDWH • Cray XC40 system on 3200 Intel 16-core Haswell nodes Other implementations • QDWH in Elemental library • QDWH-(S,E)VD with Scalapack + ELPA [Li et al., 2018] 8
QDWH based ma- trix decomposition QDWH based matrix decomposi- tion Algorithm Convergence Flops count 9
Polar Factor: U = lim k →∞ X k QDWH iterations [ U ] = qdwh ( A , α, β, ε ) X 0 = A /α , ℓ 0 = β/α 1 k = 0 2 while | 1 − ℓ k | < ε 3 a k = h ( ℓ k ), b k = g ( a k ), c k = a k + b k − 1 4 5 � √ c k X k ; I n � [ Q 1 ; Q 2 ] R = qr ( ) 6 X k +1 = ( b k / c k ) X k + (1 / c k )( a k − b k / c k ) Q 1 Q H 7 2 8 ℓ k +1 = ℓ k ( a k + b k ℓ 2 k ) / (1 + c k ℓ 2 k ) 9 k = k + 1 10 end 11 U = X k +1 12 10
Convergence of QDWH iterations Number of iterations depends on conditioning κ 2 ( A ) = 1 /ℓ 0 . • Goal: Mapping all singular values of X 0 in [ ℓ 0 , 1] to 1 • Criterion: closeness of σ ( X k ) to 1, i.e. the distance | 1 − ℓ k | • Estimate # of iterations a priori using σ i ( X k +1 ) = σ i ( X k ) a k + b k σ 2 i ( X k ) 1 + c k σ 2 i ( X k ) • Parameters ( a k , b k , c k ) → (3 , 1 , 3) and optimized to ensure cubical convergence • In practice, less than 6 iterations in double precision 11
Convergence of QDWH iterations Estimating parameters α = � A � 2 and ℓ 0 = β/α with β = 1 / � A − 1 � 2 • Upper bound for α = σ max ( A ) • Can use ˆ α = � A � F or • 2-norm estimate based on power iterations ( normest ) • Lower bound for ℓ 0 = σ min ( X 0 ) ℓ 0 = � X 0 � 1 / ( √ n κ 1 ( X 0 )) • ˆ • Estimate κ 1 ( X 0 ) ( 8 / 3n 3 ) • using condest [Higham and Tisseur, 2000] • or simply � X − 1 ( 5 / 3n 3 ) � 1 using QR + triangular solve 0 • Poor (over)-estimation of ℓ can increase # of iterations Optimized QR it. [Nakatsukasa and Higham, 2013] • Re-use Q in first it QR 3 ) n 3 to 5 n 3 • Use identity structure to decrease it QR cost from (6 + 2 12
Fast Cholesky iterations Optimized QDWH Iterations [ U ] = qdwh ( A , α, β, ε ) [ . . . ] 1 - 3mn 2 + n 3 / 3 if c k < 100 // PO-based it. 2 Z = I n + c k X ∗ k X k 3 W = chol ( Z ) 4 � U k W − 1 � X k +1 = ( a k / b k ) X k + ( a k − b k / c k ) W −∗ 5 - 5mn 2 else // QR-based it. 6 [ . . . ] 7 Cholesky-based iterations are not stable [Nakatsukasa and Higham, 2012] • Forward error in X k +1 is bounded by c k ε • c k → 3 and c k ≤ 100 for k ≥ 2 for all practical ℓ 0 • Hence, switch to PO iterations when c k < 100 13
Matrix Decomposition QDWH-PD QDWH-SVD � W , Σ , V H � [ U , H ] = qdwh − pd ( A ) = qdwh − svd ( A ) U = qdwh ( A ) [ U , H ] = qdwh − pd ( A ) 1 1 H = U H A +2 m 2 n +4 n 3 [ V , Σ] = syev ( H ) 2 2 H = ( H + H H ) / 2 +2 mn 2 W = UV 3 3 Additional cost • PD needs 1 extra matrix multiplication ( gemm ) • SVD needs 2 extra gemm + 1 syev • Can both be implemented with similar memory footprint 14
Flops count: QDWH-PD Overall count 1 of QDWH-PD with m = n � 8 + 2 � � 4 + 1 � n 3 # it QR + n 3 # it PO + 2 n 3 3 3 Nature of iterations with respect to condition number 10 1 -10 2 10 3 -10 5 10 6 -10 13 10 14 -10 16 κ 2 1 QR 1 1 2 2 3 PO 3 4 3 4 3 23 + 2 32 + 1 36 + 2 flops 28 41 3 3 3 + it opt 20 + 1 24 + 2 33 + 1 37 + 2 29 QR 3 3 3 3 Table 1: # of QR and PO iterations and flops count ( / n 3 ) for QDWH-PD. 3 n 3 for estimating ℓ 0 or exploiting trailing identity matrix structure. 1 without 5 15
Flops count: QDWH-PD Overall count 1 of QDWH-PD with m = n � 8 + 2 � � 4 + 1 � n 3 # it QR + n 3 # it PO + 2 n 3 3 3 Nature of iterations with respect to condition number 10 1 -10 2 10 3 -10 5 10 6 -10 13 10 14 -10 16 κ 2 1 QR 0 . . . 2 PO 2 . . . 4 10 + 2 36 + 2 flops . . . 3 3 + it opt 12 + 1 33 + 1 . . . QR 3 3 Table 1: # of QR and PO iterations and flops count ( / n 3 ) for QDWH-PD. 3 n 3 for estimating ℓ 0 or exploiting trailing identity matrix structure. 1 without 5 15
Flops count: QDWH-SVD 3 ) ≤ # flops • QDWH-PD: (10 + 2 ≤ (36 + 2 3 ) n 3 • Symmetric Eigensolver + Multiplication • QDWH-eig: • 2-stage-eig: • (16 + 1 9 ) ≤ . . . ≤ (50 + 7 4 9 ) • 3 • (17 + 4 9 ) ≤ . . . ≤ (52 + 4 9 ) with vecs • 4 with vecs Σ U , Σ , V 2 + 2 22 dges(vd/dd) 3 10 3 = 3 + 1 10 3 + 4 = 7 + 1 2-stage-svd 3 3 14 + 2 3 ≤ . . . ≤ 40 + 2 (+ 2-stage-eig ) 12 ≤ . . . ≤ 38 3 QDWH-svd (+ QDWH-eig ) 26 + 5 9 ≤ . . . ≤ 52 + 5 27 + 8 9 ≤ . . . ≤ 53 + 8 9 9 Table 2: Floating point operations counts ( / n 3 ) for SVD. 16
Flops Count Singular values only Singular values and vectors 2 - stage - svd 2 - stage - svd 400 qdwh - svd = qdwh + syev qdwh - svd = qdwh + syev 400 qdwh - svd = qdwh + qdwh - eig qdwh - svd = qdwh + qdwh - eig Flops count (GFlops) Flops count (GFlops) 300 300 200 200 100 100 0 0 5 10 15 20 5 10 15 20 Matrix Size: n ( / 1 . 000) Matrix Size: n ( / 1 . 000) 17
Memory footprint Real double precision - QDWH-PD Stored matrices m = n m = 3 n 20 m = 10 n • A ∈ R m × n Memory Footprint (GiB) • U ∈ R m × n and H ∈ R n × n 15 � √ c k X k ; I n ∈ R ( m + n ) × n 10 � • B = • Q = [ Q 1 ; Q 2 ] ∈ R ( m + n ) × n 5 ⇒ 4 mn + 3 n 2 entries 0 5 10 15 20 Number of rows: m ( / 1 . 000) Memory available on Intel nodes or NVIDIA accelerators • Intel KNL has up to 16GiB of MCDRAM (depending on mode) • Haswell/Sandy Bridge around 64GiB • Skylake: 64/128GiB • Tesla V100 GPU: 16/32GiB 18
Experiments Experiments Runtime System QR-Optimization Architecture 19
Numerical Experiments Real square matrices in double precision • dlatms : prescribed condition number and spectrum • dlarnv : entries sampled at random in [0 , 1] • m = n = 2 . 000 , . . . , 16 . 000 Computer architectures (Intel) Runtime systems • Haswell: 20 cores • Quark (Plasma-2.8) • Sandy Bridge: 16 cores • NEW OpenMP (Plasma-17) • KNL: 68 cores 20
QDWH on runtime systems Intel Haswell - 20 cores 150 QUARK qdwh QUARK qdwh opt. OpenMP qdwh OpenMP qdwh opt. 100 Time (s) 50 0 2 4 6 8 10 12 Matrix Size: n ( / 1 . 000) 21
Recommend
More recommend