Optimized Polar Decomposition for Modern Computer Architectures - PowerPoint PPT Presentation

Optimized Polar Decomposition for Modern Computer Architectures Joint work with Pierre.Blanchard † @manchester.ac.uk NLA Group Meeting Manchester, UK. October 2, 2018 † School of Mathematics, The University of Manchester

Introduction Introduction 2

Context NLAFET (H2020 Project) - end April 2019 • Task-based implementation of algorithms in Plasma library • Porting from QUARK to OpenMP Runtime System • Novel SVD algorithms: • 2-stage SVD : reduction to band format then eigensolver or SVD (D&C or QR). • Polar Decomposition-based SVD : QDWH iterations. • Benefit of QDWH only at large scale on many-core architectures. 3

Polar Decomposition The Polar Decomposition For all non-singular A ∈ C m × n , there exists a unique decomposition A = UH where • U ∈ C m × n a set of n orthonormal columns • H ∈ C n × n a Hermitian pos. semi-def. matrix Reverse or Left PD A = H ℓ U , with H ℓ = UHU ∗ ∈ C m × m Relation with SVD • Decompositions are equivalent • Proofs of PD rely on SVD 4

Polar Decomposition Applications • Matrix nearness [Higham, 1986] • Nearest orthogonal matrix ( Procrustes problem) • Factor Analysis, Multidimensional Scaling, . . . • Optimization (Gradient descent) • Nearby ( H ) or nearest ( . 5( H + A )) pos. def. matrix • Matrix functions: square root, p -th root, . . . • ex: Square root of spd A ∈ R n × n � If A = LL T ( Chol ) H = A 1 / 2 ⇒ and L T = UH ( PD ) 5

Polar Decomposition Application to Matrix Decompositions • SVD of A ∈ C m × n � A = W Σ V T If A = UH ( PD ) ( SVD ) ⇒ and H = V Σ V T ( EVD ) with W = UV • QDWH-Eig not detailled here • Only a portion of the spectrum • Spectral D&C algorithm (QDWH + subspace iterations) 6

Polar Decomposition Algorithms Root finding algorithms on singular values of U . • Scaled Newton’s iterations • 9 iterations (2nd order) • Backward stable • Q R-based ( D ynamically W eighted) Halley’s iterations • 6 iterations (3rd order - Pad´ e family) • Backward stable [Nakatsukasa et al., 2010] • Inverse-free + communications-friendly • Many cheap (Lvl3 Blas) flops 7

Polar Decomposition Algorithms Root finding algorithms on singular values of U . • Scaled Newton’s iterations • 9 iterations (2nd order) • Backward stable • Q R-based ( D ynamically W eighted) Halley’s iterations • 6 iterations (3rd order - Pad´ e family) • Backward stable [Nakatsukasa et al., 2010] • Inverse-free + communications-friendly • Many cheap (Lvl3 Blas) flops • Zolotarev functions • 2 iterations (order 17!) • More flops but more paralellism • Well-suited for high granularity computing resources! • Best rational approximant of matrix sign function , see [Nakatsukasa and Freund, 2014] or [Higham, 2008, Ch.5]. 7

State of the art implementation H. Ltaief, D. Sukkari & D. Keyes (KAUST) • PD, PD-SVD and PD-eig (PD=QDWH or Zolo) • Distributed memory: Scalapack + Chameleon + StarPU • Massively Parallel PD [Ltaief et al., 2018] • Zolo up to 2 . 3 × faster than QDWH • Cray XC40 system on 3200 Intel 16-core Haswell nodes Other implementations • QDWH in Elemental library • QDWH-(S,E)VD with Scalapack + ELPA [Li et al., 2018] 8

QDWH based matrix decomposition QDWH based matrix decomposition Algorithm Convergence Flops count 9

Polar Factor: U = lim k →∞ X k QDWH iterations [ U ] = qdwh ( A , α, β, ε ) X 0 = A /α , ℓ 0 = β/α 1 k = 0 2 while | 1 − ℓ k | < ε 3 a k = h ( ℓ k ), b k = g ( a k ), c k = a k + b k − 1 4 5 � √ c k X k ; I n � [ Q 1 ; Q 2 ] R = qr ( ) 6 X k +1 = ( b k / c k ) X k + (1 / c k )( a k − b k / c k ) Q 1 Q H 7 2 8 ℓ k +1 = ℓ k ( a k + b k ℓ 2 k ) / (1 + c k ℓ 2 k ) 9 k = k + 1 10 end 11 U = X k +1 12 10

Convergence of QDWH iterations Number of iterations depends on conditioning κ 2 ( A ) = 1 /ℓ 0 . • Goal: Mapping all singular values of X 0 in [ ℓ 0 , 1] to 1 • Criterion: closeness of σ ( X k ) to 1, i.e. the distance | 1 − ℓ k | • Estimate # of iterations a priori using σ i ( X k +1 ) = σ i ( X k ) a k + b k σ 2 i ( X k ) 1 + c k σ 2 i ( X k ) • Parameters ( a k , b k , c k ) → (3 , 1 , 3) and optimized to ensure cubical convergence • In practice, less than 6 iterations in double precision 11

Convergence of QDWH iterations Estimating parameters α = � A � 2 and ℓ 0 = β/α with β = 1 / � A − 1 � 2 • Upper bound for α = σ max ( A ) • Can use ˆ α = � A � F or • 2-norm estimate based on power iterations ( normest ) • Lower bound for ℓ 0 = σ min ( X 0 ) ℓ 0 = � X 0 � 1 / ( √ n κ 1 ( X 0 )) • ˆ • Estimate κ 1 ( X 0 ) ( 8 / 3n 3 ) • using condest [Higham and Tisseur, 2000] • or simply � X − 1 ( 5 / 3n 3 ) � 1 using QR + triangular solve 0 • Poor (over)-estimation of ℓ can increase # of iterations Optimized QR it. [Nakatsukasa and Higham, 2013] • Re-use Q in first it QR 3 ) n 3 to 5 n 3 • Use identity structure to decrease it QR cost from (6 + 2 12

Fast Cholesky iterations Optimized QDWH Iterations [ U ] = qdwh ( A , α, β, ε ) [ . . . ] 1 - 3mn 2 + n 3 / 3 if c k < 100 // PO-based it. 2 Z = I n + c k X ∗ k X k 3 W = chol ( Z ) 4 � U k W − 1 � X k +1 = ( a k / b k ) X k + ( a k − b k / c k ) W −∗ 5 - 5mn 2 else // QR-based it. 6 [ . . . ] 7 Cholesky-based iterations are not stable [Nakatsukasa and Higham, 2012] • Forward error in X k +1 is bounded by c k ε • c k → 3 and c k ≤ 100 for k ≥ 2 for all practical ℓ 0 • Hence, switch to PO iterations when c k < 100 13

Matrix Decomposition QDWH-PD QDWH-SVD � W , Σ , V H � [ U , H ] = qdwh − pd ( A ) = qdwh − svd ( A ) U = qdwh ( A ) [ U , H ] = qdwh − pd ( A ) 1 1 H = U H A +2 m 2 n +4 n 3 [ V , Σ] = syev ( H ) 2 2 H = ( H + H H ) / 2 +2 mn 2 W = UV 3 3 Additional cost • PD needs 1 extra matrix multiplication ( gemm ) • SVD needs 2 extra gemm + 1 syev • Can both be implemented with similar memory footprint 14

Flops count: QDWH-PD Overall count 1 of QDWH-PD with m = n � 8 + 2 � � 4 + 1 � n 3 # it QR + n 3 # it PO + 2 n 3 3 3 Nature of iterations with respect to condition number 10 1 -10 2 10 3 -10 5 10 6 -10 13 10 14 -10 16 κ 2 1 QR 1 1 2 2 3 PO 3 4 3 4 3 23 + 2 32 + 1 36 + 2 flops 28 41 3 3 3 + it opt 20 + 1 24 + 2 33 + 1 37 + 2 29 QR 3 3 3 3 Table 1: # of QR and PO iterations and flops count ( / n 3 ) for QDWH-PD. 3 n 3 for estimating ℓ 0 or exploiting trailing identity matrix structure. 1 without 5 15

Flops count: QDWH-PD Overall count 1 of QDWH-PD with m = n � 8 + 2 � � 4 + 1 � n 3 # it QR + n 3 # it PO + 2 n 3 3 3 Nature of iterations with respect to condition number 10 1 -10 2 10 3 -10 5 10 6 -10 13 10 14 -10 16 κ 2 1 QR 0 . . . 2 PO 2 . . . 4 10 + 2 36 + 2 flops . . . 3 3 + it opt 12 + 1 33 + 1 . . . QR 3 3 Table 1: # of QR and PO iterations and flops count ( / n 3 ) for QDWH-PD. 3 n 3 for estimating ℓ 0 or exploiting trailing identity matrix structure. 1 without 5 15

Flops count: QDWH-SVD 3 ) ≤ # flops • QDWH-PD: (10 + 2 ≤ (36 + 2 3 ) n 3 • Symmetric Eigensolver + Multiplication • QDWH-eig: • 2-stage-eig: • (16 + 1 9 ) ≤ . . . ≤ (50 + 7 4 9 ) • 3 • (17 + 4 9 ) ≤ . . . ≤ (52 + 4 9 ) with vecs • 4 with vecs Σ U , Σ , V 2 + 2 22 dges(vd/dd) 3 10 3 = 3 + 1 10 3 + 4 = 7 + 1 2-stage-svd 3 3 14 + 2 3 ≤ . . . ≤ 40 + 2 (+ 2-stage-eig ) 12 ≤ . . . ≤ 38 3 QDWH-svd (+ QDWH-eig ) 26 + 5 9 ≤ . . . ≤ 52 + 5 27 + 8 9 ≤ . . . ≤ 53 + 8 9 9 Table 2: Floating point operations counts ( / n 3 ) for SVD. 16

Flops Count Singular values only Singular values and vectors 2 - stage - svd 2 - stage - svd 400 qdwh - svd = qdwh + syev qdwh - svd = qdwh + syev 400 qdwh - svd = qdwh + qdwh - eig qdwh - svd = qdwh + qdwh - eig Flops count (GFlops) Flops count (GFlops) 300 300 200 200 100 100 0 0 5 10 15 20 5 10 15 20 Matrix Size: n ( / 1 . 000) Matrix Size: n ( / 1 . 000) 17

Memory footprint Real double precision - QDWH-PD Stored matrices m = n m = 3 n 20 m = 10 n • A ∈ R m × n Memory Footprint (GiB) • U ∈ R m × n and H ∈ R n × n 15 � √ c k X k ; I n ∈ R ( m + n ) × n 10 � • B = • Q = [ Q 1 ; Q 2 ] ∈ R ( m + n ) × n 5 ⇒ 4 mn + 3 n 2 entries 0 5 10 15 20 Number of rows: m ( / 1 . 000) Memory available on Intel nodes or NVIDIA accelerators • Intel KNL has up to 16GiB of MCDRAM (depending on mode) • Haswell/Sandy Bridge around 64GiB • Skylake: 64/128GiB • Tesla V100 GPU: 16/32GiB 18

Experiments Experiments Runtime System QR-Optimization Architecture 19

Numerical Experiments Real square matrices in double precision • dlatms : prescribed condition number and spectrum • dlarnv : entries sampled at random in [0 , 1] • m = n = 2 . 000 , . . . , 16 . 000 Computer architectures (Intel) Runtime systems • Haswell: 20 cores • Quark (Plasma-2.8) • Sandy Bridge: 16 cores • NEW OpenMP (Plasma-17) • KNL: 68 cores 20

QDWH on runtime systems Intel Haswell - 20 cores 150 QUARK qdwh QUARK qdwh opt. OpenMP qdwh OpenMP qdwh opt. 100 Time (s) 50 0 2 4 6 8 10 12 Matrix Size: n ( / 1 . 000) 21

Optimized Polar Decomposition for Modern Computer Architectures - PowerPoint PPT Presentation

Optimized Polar Decomposition for Modern Computer Architectures Joint work with Pierre.Blanchard @manchester.ac.uk NLA Group Meeting Manchester, UK. October 2, 2018 School of Mathematics, The University of Manchester Introduction

Polar Decomposition of a Matrix Garrett Buffington May 4, 2014 The Polar Decomposition SVD and

Research on polar bears at Norwegian Polar Tracking the Polar Bear Institute Dr. Jon Aars,

Polar Tankers - ConocoPhillips Five Endeavour Class Tankers Built in 2001-2006 30,000

Thermal decomposition of the Thermal decomposition of the Thermal decomposition of the Thermal

Observations for the Polar Upper Atmosphere Research at Korea Polar Research Institute (KOPRI)

Polar Knowledge Canada Collaborating for the Future Canadas Polar Agency An Integrated

POLAR CAPITAL HOLDINGS PLC Results Presentation June 2007 Contents Overview of Polar

JUST THE MATHS SLIDES NUMBER 6.3 COMPLEX NUMBERS 3 (The polar & exponential forms)

First Results of POLAR: A dedicated Gamma-Ray Burst Polarimeter Merlin Kole on behalf of the

Polar Passage Art & Interpretives Oregon Zoo Bond Citizens Oversight Committee Meeting

POLAR CAPITAL HOLDINGS PLC 2008 Interim Results Mark Kary (CEO) John Mansell (COO) December

1 Organic Chemistry The Functional Group Approach Br OH alkane alcohol halide alkene

Helium Isotopes with BESS-Polar II Nicolas PICOT-CLEMENTE for the BESS-Polar Collaboration

1 Organic Chemistry The Functional Group Approach Br OH alkane alcohol halide alkene

Organic Chemistry The Functional Group Approach Br OH alkane alcohol halide alkene (no

JUST THE MATHS UNIT NUMBER 5.11 GEOMETRY 11 (Polar curves) by A.J.Hobson 5.11.1

Forward-Secure ID-Based Setting Madeline Gonz lez Mu iz* and Peeter Laud Theory Days Trve

Testing Gravity in the Cosmic Web Bridget Falck Institute of Cosmology and Gravitation

Can your UI change colors like chameleon? Vikrant Patil 1 Whats in? 2 Whats in? The talk

detector at the CAST experiment Christoph Krieger University of Bonn On behalf of the CAST

Testbeds for Reproducible Research Lucas Nussbaum lucas.nussbaum@loria.fr Lucas Nussbaum

Logical Structures in Natural Language: Exercises Propositional Logic (In class on Wed. 12th)

Revisiting TESLA in the quantum random oracle model Selected history of Fiat-Shamir style

The RTOS Chameleon for Linux Jan Kiszka Dipl.-Ing. Leibniz Universitt Hannover 2007-01-24

Optimized Polar Decomposition for Modern Computer Architectures - PowerPoint PPT Presentation

Optimized Polar Decomposition for Modern Computer Architectures Joint work with Pierre.Blanchard @manchester.ac.uk NLA Group Meeting Manchester, UK. October 2, 2018 School of Mathematics, The University of Manchester Introduction

Polar Decomposition of a Matrix Garrett Buffington May 4, 2014 The Polar Decomposition SVD and

Research on polar bears at Norwegian Polar Tracking the Polar Bear Institute Dr. Jon Aars,

Polar Tankers - ConocoPhillips Five Endeavour Class Tankers Built in 2001-2006 30,000

Thermal decomposition of the Thermal decomposition of the Thermal decomposition of the Thermal

Observations for the Polar Upper Atmosphere Research at Korea Polar Research Institute (KOPRI)

Polar Knowledge Canada Collaborating for the Future Canadas Polar Agency An Integrated

POLAR CAPITAL HOLDINGS PLC Results Presentation June 2007 Contents Overview of Polar

JUST THE MATHS SLIDES NUMBER 6.3 COMPLEX NUMBERS 3 (The polar &amp; exponential forms)

First Results of POLAR: A dedicated Gamma-Ray Burst Polarimeter Merlin Kole on behalf of the

Polar Passage Art &amp; Interpretives Oregon Zoo Bond Citizens Oversight Committee Meeting

POLAR CAPITAL HOLDINGS PLC 2008 Interim Results Mark Kary (CEO) John Mansell (COO) December

1 Organic Chemistry The Functional Group Approach Br OH alkane alcohol halide alkene

Helium Isotopes with BESS-Polar II Nicolas PICOT-CLEMENTE for the BESS-Polar Collaboration

1 Organic Chemistry The Functional Group Approach Br OH alkane alcohol halide alkene

Organic Chemistry The Functional Group Approach Br OH alkane alcohol halide alkene (no

JUST THE MATHS UNIT NUMBER 5.11 GEOMETRY 11 (Polar curves) by A.J.Hobson 5.11.1

Forward-Secure ID-Based Setting Madeline Gonz lez Mu iz* and Peeter Laud Theory Days Trve

Testing Gravity in the Cosmic Web Bridget Falck Institute of Cosmology and Gravitation

Can your UI change colors like chameleon? Vikrant Patil 1 Whats in? 2 Whats in? The talk

detector at the CAST experiment Christoph Krieger University of Bonn On behalf of the CAST

Testbeds for Reproducible Research Lucas Nussbaum lucas.nussbaum@loria.fr Lucas Nussbaum

Logical Structures in Natural Language: Exercises Propositional Logic (In class on Wed. 12th)

Revisiting TESLA in the quantum random oracle model Selected history of Fiat-Shamir style

The RTOS Chameleon for Linux Jan Kiszka Dipl.-Ing. Leibniz Universitt Hannover 2007-01-24

JUST THE MATHS SLIDES NUMBER 6.3 COMPLEX NUMBERS 3 (The polar & exponential forms)

Polar Passage Art & Interpretives Oregon Zoo Bond Citizens Oversight Committee Meeting