Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators (pap167s1)
David Williams-Young, Chao Yang
Scalable Solvers Group, Computational Research Division, Lawrence Berkeley National Lab
The 49th International Conference on Parallel Processing (ICPP20)
Diagonalization is the Bottleneck for Large Scale DFT Calculations
Requires repeated partial diagonalization: each SCF step solves the eigenvalue problem F(C_k) C_k = C_k E_k for M (M < N) eigenpairs, with F ∈ ℝ^{N×N}, C_k ∈ ℝ^{N×M}, E_k ∈ ℝ^{M×M}.
SCF loop: Guess C_k → Form F(C_k) → Solve EVP → C_{k+1} → Converged? Yes: terminate; No: update and repeat.
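A minimal NumPy/SciPy sketch of the SCF loop above, assuming a user-supplied build_fock(C) (a hypothetical placeholder, not part of SISLICE) that forms F from the current coefficients; scipy.linalg.eigh with subset_by_index performs the partial diagonalization of the lowest M eigenpairs.

```python
# Minimal sketch of the SCF loop above; build_fock is a hypothetical callable.
# eigh(..., subset_by_index=...) computes only the lowest M eigenpairs.
import numpy as np
from scipy.linalg import eigh

def scf_loop(build_fock, C0, M, tol=1e-8, max_iter=50):
    C, E_old = C0, None
    for _ in range(max_iter):
        F = build_fock(C)                            # Form F(C_k)
        E, C = eigh(F, subset_by_index=[0, M - 1])   # Partial diagonalization (the bottleneck)
        if E_old is not None and np.max(np.abs(E - E_old)) < tol:
            break                                    # Converged
        E_old = E
    return C, E
```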
General Methods are for General Problems
We must exploit the structure of the SCF problem and aspects of modern computing architectures to obtain performance improvements.
FLOPs are Cheap vs Communication
Reliance on accelerators (GPUs):
• IBM POWER9: 1 TFLOP/s
• Intel KNL: 3 TFLOP/s
• NVIDIA Tesla V100: 7.8 TFLOP/s
• 46.8 TFLOP/s per Summit node (6x V100)
Cheap FLOPs → Exposing Bottlenecks
E.g. 10k x 10k DGEMM on V100 (PCI-E):
• DGEMM time = 2 sec
• Communication (H2D + D2H) = 6 sec
The SISLICE Method
A Parallel Implementation of Shift-Invert Spectrum Slicing for the SCF Eigenvalue Problem
CPU Implementation: arXiv:1908.06043 (to appear in ACM Trans. Math. Softw.)
Spectrum Slicing Partitions the Eigenspectrum into Independent Tasks
[Diagram: the eigenspectrum is divided by shifts σ_1, σ_2, …, σ_{n_s} into slices.]
• Eigenvalues in each slice are to be determined “independently”
• Trades redundant FLOPs for less communication
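As a concrete illustration of forming a shift partition, here is one simple heuristic (an assumption for illustration, not necessarily the shift-selection strategy used in SISLICE): place the n_s shifts at quantiles of the current eigenvalue estimates, so each slice targets roughly the same number of eigenvalues.

```python
# Hypothetical shift-selection sketch: place n_s shifts at quantiles of the
# current eigenvalue estimates so each slice targets ~M/n_s eigenvalues.
# This is one plausible heuristic, not necessarily the SISLICE strategy.
import numpy as np

def partition_spectrum(eig_estimates, n_s):
    """Return shifts sigma_1..sigma_{n_s} spread over the wanted spectrum."""
    quantiles = (np.arange(n_s) + 0.5) / n_s      # slice midpoints in [0, 1]
    return np.quantile(np.asarray(eig_estimates), quantiles)
```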
The SISLICE Method Exploits the Convergent Properties of the SCF Procedure
Flowchart: Initial guess → First SCF iteration? (Yes: partition the spectrum → {σ_j}; No: update {σ_j}) → Perform shift-invert subspace iterations in parallel (cheap / replicated computation, FLOPs) → Synchronization → Extract the eigenpairs (C, E).
DBWY, et al. arXiv:1908.06043; Lin, et al. SIAM Rev. 58, 34, 2016
Shift-Invert Spectrum Slicing Trades Diagonalization for Linear System Solves
Shift-invert transformation: f(F) = (F − σI)^{-1}, which maps eigenvalues λ near the shift σ to large values 1/(λ − σ).
Algorithm:
Input: symmetric F ∈ ℝ^{N×N}, shift partition {σ_j}_{j=1…n_s}, number of desired eigenpairs M, basis dimension K, and max iterations n_iter.
Output: eigenvectors C ∈ ℝ^{N×M} and eigenvalues E ∈ ℝ^{M×M}.
Distribute work over j.
for each j assigned to this execution context do
    Form initial guess V_j ∈ ℝ^{N×K}
    Factorize (F − σ_j I)   (TRF)
    for i = 1 : n_iter do
        V_j ← (F − σ_j I)^{-1} V_j   (TRS)
        V_j ← orth(V_j)   (CholQR)
    end for
    (V_j, E_j, r_j) ← RayleighRitz(F, V_j)   (RR)
end for
(C, E) ← DistValidate({(V_j, E_j, r_j)})
DBWY, et al. arXiv:1908.06043
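A serial NumPy/SciPy sketch of the per-slice loop (TRF, TRS, CholQR, RR) for a dense symmetric F; the distribution over shifts and the DistValidate step are omitted, and LU (lu_factor/lu_solve) stands in for the LU/LDLT factorization named above.

```python
# Sketch of one slice (TRF/TRS/CholQR/RR) for a dense symmetric F.
# A simplified serial illustration, not the distributed SISLICE implementation.
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cholesky, eigh

def si_subspace_iteration(F, sigma, K, n_iter=3, seed=0):
    N = F.shape[0]
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((N, K))                 # initial guess V_j
    lu = lu_factor(F - sigma * np.eye(N))           # TRF: factorize (F - sigma*I) once
    for _ in range(n_iter):
        V = lu_solve(lu, V)                         # TRS: V <- (F - sigma*I)^{-1} V
        R = cholesky(V.T @ V, lower=False)          # CholQR: V^T V = R^T R
        V = np.linalg.solve(R.T, V.T).T             # V <- V R^{-1}
    # Rayleigh-Ritz: project F onto span(V) and diagonalize the small problem
    theta, Y = eigh(V.T @ F @ V)
    X = V @ Y                                       # Ritz vectors
    resid = np.linalg.norm(F @ X - X * theta, axis=0)
    return theta, X, resid
```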
Shift-Invert Spectrum Slicing Trades Diagonalization for Linear System Solves
f(F) = (F − σI)^{-1}: triangular factorization (LU / LDLT) + back solve
✓ Lower prefactor / better strong scaling
✓ Able to exploit sparsity (SuperLU, PARDISO, etc.)
✗ Orders of magnitude more FLOPs
✓ Shift independence → massive parallelism
DBWY, et al. arXiv:1908.06043
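To illustrate how sparsity is exploited, a sketch using SciPy's serial SuperLU wrapper (splu); the actual implementation uses distributed solvers such as SuperLU_DIST or PARDISO. The point is that the factorization is computed once per shift, after which only cheap back solves are repeated (QR replaces CholQR here for brevity).

```python
# Sparse TRF/TRS sketch: factorize (F - sigma*I) once with SuperLU (SciPy's
# serial splu; SISLICE uses SuperLU_DIST / PARDISO) and reuse it for every
# solve in the subspace iteration.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def sparse_shift_invert_iterate(F_sparse, sigma, V, n_iter=3):
    N = F_sparse.shape[0]
    lu = splu((F_sparse - sigma * sp.identity(N)).tocsc())  # TRF: once per shift
    for _ in range(n_iter):
        V = lu.solve(V)          # TRS: triangular back solves only
        V, _ = np.linalg.qr(V)   # keep the basis well conditioned
    return V
```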
Synchronization Only Requires Communication of O(K) Data
[Diagram: pre- vs. post-synchronization. Before synchronization each rank holds only its local slice data (Λ_j, X_j, r_j); after synchronization every rank holds all (Λ_j, r_j), while the eigenvectors X_j remain local.]
[Plot: strong scaling of SISLICE synchronization (MPI_Allgather) for various values of K (K = 100, 500, 1000) on up to ~200 nodes; wall times of a few ms. Timings were obtained on the Summit supercomputer.]
DBWY, et al. arXiv:1908.06043; DBWY, et al. ICPP20
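An mpi4py sketch of this synchronization step: each rank contributes only its K Ritz values and residual norms (O(K) data) via allgather, while the O(NK) eigenvectors stay local. The de-duplication rule below is a hypothetical stand-in for SISLICE's DistValidate step.

```python
# Sketch of the O(K)-data synchronization: ranks exchange only Ritz values and
# residual norms (MPI_Allgather); eigenvectors stay local. The de-duplication
# below is a hypothetical stand-in for SISLICE's DistValidate step.
import numpy as np
from mpi4py import MPI

def synchronize_slices(theta_local, resid_local, comm=MPI.COMM_WORLD):
    all_vals = np.concatenate(comm.allgather(np.asarray(theta_local)))
    all_res = np.concatenate(comm.allgather(np.asarray(resid_local)))
    # Sort by (rounded) Ritz value, then by residual norm, so that among
    # near-duplicate values reported by neighboring slices the best copy comes first.
    order = np.lexsort((all_res, np.round(all_vals, 8)))
    vals, res = all_vals[order], all_res[order]
    keep = np.concatenate(([True], np.abs(np.diff(vals)) > 1e-8))  # drop near-duplicates
    return vals[keep], res[keep]
```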
The SISLICE CPU Implementation Exhibits Linear Strong Scaling
Test problem: Si10H16 (UF Sparse Matrix Collection)
• N = 17,077, M = 8,500
• n_s = 100, K = 100
• NNZ = 87,592 (99.7% zeros)
• SuperLU for distributed LU factorization
• NB = 128 for ScaLAPACK/ELPA
[Plot: wall time (s) vs. number of processors (~10^2-10^4) for ScaLAPACK, ELPA, SISLICE 8x8, and SISLICE 16x16, with a 2.7x speedup called out.]
DBWY, et al. arXiv:1908.06043
The Proxy Application for GPU-SISLICE
SISLICE ≈ SISUBIT + Synchronization
Three limiting cases for the shift-invert subspace iteration:
1. Shared memory, dense matrices (Dense SM)
2. Distributed memory, dense matrices (Dense DM)
3. Sparse matrices
Dense SM-SISUBIT
Implementations — GPU: cuSOLVER + cuBLAS; Intel CPU: MKL; IBM CPU: ESSL
[Plots: wall time of the SISUBIT kernels (TRF, TRS, CholQR, RR, H2D + D2H) on V100 vs. XG; and SISS vs. SYEVD wall times on SM-V100, SM-POWER9, SM-KNL, SM-XG, with ELPA1-GPU, ELPA2-CPU, and ScaLAPACK references.]
Speedup (architecture in parentheses):
Kernel  | N ≤ 1,000     | N ≥ 10,000
TRF     | 6x (XG)       | 3x (KNL)
TRS     | 1.5x (POWER9) | 4-5x (XG)
CholQR  | 50x (POWER9)  | 20x (POWER9)
RR      | 1.5-2x (XG)   | 6x (XG)
SISUBIT | 1.5-2x (XG)   | 4x (XG)
DBWY, et al. ICPP20
Dense DM-SISUBIT
Implementations — GPU: SLATE; CPU: ScaLAPACK / ELPA
[Plots: wall time vs. nodes (4-64) for ScaLAPACK and SLATE at N = 50,000 / 100,000 / 200,000 / 300,000; and SISS vs. SYEVD wall times (ScaLAPACK, ELPA1-GPU, ELPA2-CPU, SLATE).]
Speedup:
Kernel  | N = 100,000: 4 nodes | 64 nodes | N = 300,000: 32 nodes | 64 nodes
TRF     | 2.1x  | 0.8x  | 1.7x  | 1.8x
TRS     | 0.4x  | 0.2x  | 0.1x  | 0.1x
CholQR  | 0.07x | 0.04x | 0.07x | 0.04x
RR      | 0.07x | 0.02x | 0.02x | 0.01x
SISUBIT | 2.3x  | 0.5x  | 1.1x  | 0.9x
DBWY, et al. ICPP20
Sparse SISUBIT
Test problem: SuiteSparse Ga10As10H30
• N = 113,081
• NNZ = 6,115,633 (99.95% zero)
[Plots: SISS (SuperLU_DIST, PARDISO) vs. SYEVD (ScaLAPACK, ELPA1-GPU, ELPA2-CPU) wall times (s); and TRF/TRS wall time vs. nodes (1-16) for SuperLU_DIST (CPU TRF, GPU TRF, TRS) and PARDISO (TRF, TRS).]
DBWY, et al. ICPP20
Conclusions
• For matrices that fit in the memory of a single compute node, the GPU implementation of SISLICE exhibits performance gains over CPU implementations of SISLICE as well as SYEVD
• Further improvements in the distributed-memory GPU linear algebra software stack will yield drastic improvements in the years to come
Acknowledgements
The development of SISLICE has been supported by the U.S. Department of Energy:
• Scientific Discovery Through Advanced Computing (SciDAC-4)
• Exascale Computing Project (NWChemEx 17-SC-20-SC)
Calculations were performed using DOE computing facilities:
• Cori (NERSC)
• Summit (OLCF)