

  1. Accelerated Sparse Matrix Multiplication for Quantum Chemistry with CP2K on Hybrid Supercomputers
     Ole Schütt (ole.schuett@mat.ethz.ch), Nanoscale Simulations, Department of Materials

  2. Application: Emerging Photovoltaics
     - Processes at the TiO2 interface: electron transport across the hole-transporting material (HTM), spiro-MeOTAD, and TiO2 nanoparticles; 17k atoms, 80k electrons (Schiffmann et al., 2010).
     - Requirements: electronic properties ⇒ Schrödinger equation (HΨ = EΨ); lack of symmetries ⇒ large simulation cells (> 1000 atoms).

  3. Linear Scaling Self-Consistent Field
     SCF iteration: guess initial density ρ → calculate matrix H from ρ (costs O(N), but dominates for small systems) → obtain the new density from H → calculate the energy from ρ.
     - Dense linear algebra: calculate the eigenvectors ψ_i of H, then the new density ρ = Σ_i |ψ_i|²; costs O(N³).
     - Sparse linear algebra: calculate ρ directly as a matrix function of H; costs O(N). The density P as a matrix function of H:
         P = [1 + exp((H − μ𝟙)/kT)]⁻¹ = ½ [1 − sign(H − μ𝟙)]   (limit of small kT, ground state)
     - Evaluate sign() as a polynomial series (the Newton–Schulz iteration):
         X₀ = A · ||A||⁻¹,   X_{n+1} = ½ X_n (3·𝟙 − X_n²),   sign(A) = X_∞
     LS-SCF is entirely based on sparse linear algebra (a sketch follows below).
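A minimal NumPy sketch of the sign-function route above (dense arrays purely for illustration; in CP2K every operation is a sparse DBCSR multiplication, and the function names are not from the library):

```python
import numpy as np

def matrix_sign(A, tol=1e-10, max_iter=100):
    """Newton-Schulz iteration: X_{n+1} = 1/2 * X_n (3*I - X_n^2) -> sign(A)."""
    I = np.eye(A.shape[0])
    X = A / np.linalg.norm(A)            # X_0 = A * ||A||^-1 brings the spectrum into the convergence region
    for _ in range(max_iter):
        X_next = 0.5 * X @ (3 * I - X @ X)
        if np.linalg.norm(X_next - X) < tol:
            return X_next
        X = X_next
    return X

def density_matrix(H, mu):
    """Ground-state density: P = 1/2 * (I - sign(H - mu*I))."""
    I = np.eye(H.shape[0])
    return 0.5 * (I - matrix_sign(H - mu * I))
```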

  4. Benchmarks of Condensed Phase Systems
     [Plot: wall time [min] vs. number of atoms, comparing diagonalization with linear scaling DFT on 46,656 cores and DFTB on 9,216 cores.]
     Linear scaling O(N) methods are inevitable for large systems.
     VandeVondele et al. (2012): Linear Scaling Self-Consistent Field Calculations with Millions of Atoms in the Condensed Phase.

  5. The DBCSR Library
     DBCSR = Distributed Block Compressed Sparse Row, the workhorse of CP2K's linear scaling DFT code.
     - Non-zero elements are small dense blocks, e.g. 13 × 13; each block corresponds to the interaction between two atoms.
     - Sparsity comes from neglecting distant atom pairs; symmetry is exploited.
     - Additions are local operations; multiplications are more elaborate (see the sketch below).
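A toy sketch of the block-sparse idea: small dense blocks keyed by atom-pair indices, multiplied block-wise and filtered. This is not DBCSR's actual data layout or API, only an illustration:

```python
import numpy as np

# Block-sparse matrix as a dict: (block_row, block_col) -> small dense block.
# In DBCSR the blocks live in a distributed CSR-like structure; a dict is just
# the simplest stand-in for this illustration.
def block_multiply(A, B, filter_eps=1e-6):
    """C = A @ B on block-sparse matrices; negligible result blocks are filtered."""
    C = {}
    for (i, k), a_blk in A.items():
        for (k2, j), b_blk in B.items():
            if k != k2:
                continue
            prod = a_blk @ b_blk                     # small dense GEMM per block pair
            C[(i, j)] = C.get((i, j), 0.0) + prod
    # Filtering keeps the product sparse (threshold like the 1e-6 used on Daint).
    return {ij: blk for ij, blk in C.items() if np.linalg.norm(blk) > filter_eps}

# Example: block matrices built from 13x13 dense blocks.
rng = np.random.default_rng(0)
A = {(0, 0): rng.standard_normal((13, 13)), (0, 1): rng.standard_normal((13, 13))}
B = {(0, 0): rng.standard_normal((13, 13)), (1, 0): rng.standard_normal((13, 13))}
C = block_multiply(A, B)
```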

  6. Architecture of DBCSR's Multiplication Code (layered, top to bottom)
     - Cluster: Cannon algorithm, MPI parallelization
     - Node: Multrec, cache optimization
     - CSR: stack generation
     - Scheduler: CPU/GPU load balancing
     - Host driver (BLAS / Libsmm, also the fallback path) and CUDA driver (Libcusmm on the GPU)

  7. Hiding Communication with Double Buffering
     [Timeline over three Cannon ticks: while buffer 1 is being worked on (generate stacks, host-to-device copy, process stacks on the device), buffer 2 is already in MPI send/receive, and vice versa.]
     Ideally, the network and the GPU are always busy (a sketch of the loop structure follows).
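A minimal mpi4py sketch of the double-buffered loop, assuming a simple ring of ranks; the buffer contents and the placeholder computation are illustrative, only the overlap structure matters:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

# Two host buffers so one panel can be shifted over MPI while the other one is
# being turned into stacks and processed.
panels = [np.random.rand(1 << 20), np.empty(1 << 20)]
n_ticks = size          # simplified ring; real Cannon shifts on a 2-D process grid

for tick in range(n_ticks):
    cur, nxt = tick % 2, (tick + 1) % 2
    reqs = []
    if tick + 1 < n_ticks:
        # Start shifting the *next* panel in the background.
        reqs.append(comm.Isend(panels[cur], dest=left))
        reqs.append(comm.Irecv(panels[nxt], source=right))
    # Meanwhile: generate stacks from the current panel and process them
    # (stand-in computation; in DBCSR this is the host/GPU driver layer).
    local_result = panels[cur].sum()
    MPI.Request.Waitall(reqs)   # the shift finished while we were computing
```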

  8. Managing Dependencies with CUDA Events and Streams
     [Timeline of concurrent streams: A-panel host-to-device copy, B-panel host-to-device copy, C-panel zeroing and device-to-host copy, plus one stream per stack buffer (host-to-device copy followed by the calculation).]
     Events are queried before a host stack buffer is reused (see the sketch below).
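The dependency pattern can be sketched in Python with CuPy streams and events (DBCSR itself does this in CUDA C through its driver layer; the stream and buffer names here are illustrative):

```python
import numpy as np
import cupy as cp

copy_stream = cp.cuda.Stream(non_blocking=True)   # host-to-device transfers
calc_stream = cp.cuda.Stream(non_blocking=True)   # stack calculations
stack_uploaded = cp.cuda.Event()

host_stack = np.arange(1 << 20, dtype=np.float64)  # a "stack" of work, illustrative

with copy_stream:
    dev_stack = cp.asarray(host_stack)             # copy issued on copy_stream
    stack_uploaded.record(copy_stream)             # marks the end of the upload

calc_stream.wait_event(stack_uploaded)             # calc must not start before the copy
with calc_stream:
    result = dev_stack * 2.0                       # stand-in for the stack calculation

# Query the event before refilling the host stack buffer:
if stack_uploaded.done:
    host_stack[:] = 0.0                            # safe to reuse the host buffer
# (otherwise keep generating other stacks, or call stack_uploaded.synchronize())
```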

  9. CUDA Kernel Implementation
     [Diagram: GPU memory usage.]
     - Larger matrices are processed in slabs P_A, P_B, P_C.
     - Each thread computes a tile T of the result slab P_C; the results T are kept in the thread's registers.
     - Outer-product style multiplication reduces accesses to P_A and P_B (illustrated below).
     - P_B is stored transposed to allow coalesced memory access.
     - Write-back to global memory uses compare-and-swap.
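A pure-Python illustration of the outer-product style accumulation into a per-thread result tile; this mirrors the algorithmic idea only, not the CUDA kernel itself:

```python
import numpy as np

def outer_product_gemm(A, B, tile=4):
    """C = A @ B accumulated as a sum of outer products over k.

    For each k, one column slice of A and one row slice of B are loaded once and
    update a whole tile of C that stays local (here: the variable acc, playing
    the role of the per-thread register tile T).  Assumes dimensions divisible
    by `tile` for brevity.
    """
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            acc = np.zeros((tile, tile))            # the result tile T
            for kk in range(k):
                a_col = A[i0:i0 + tile, kk]         # slice of column kk of A
                b_row = B[kk, j0:j0 + tile]         # slice of row kk of B
                acc += np.outer(a_col, b_row)       # rank-1 (outer product) update
            C[i0:i0 + tile, j0:j0 + tile] = acc     # single write-back per tile
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(outer_product_gemm(A, B), A @ B)
```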

  10. CUDA Kernel Auto-Tuning
     [Plot: performance [GFlop/s] vs. parameter-set index; the winning set is marked.]
     - Six parameters to optimize: v, w, N, M, #threads, #minBlocksPerSM.
     - On average > 8500 parameter sets per kernel (heuristically pruned).
     - Number of kernels optimized so far: 2349 (the search loop is sketched below).
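The tuner is essentially an exhaustive benchmark over a pruned parameter grid; a schematic sketch, where the parameter ranges and the benchmark stub are placeholders rather than the actual libcusmm tuner:

```python
import itertools

# Hypothetical (illustrative) search space; the real tuner prunes combinations
# heuristically, e.g. tilings that exceed register or shared-memory limits.
search_space = {
    "v": [1, 2, 4], "w": [1, 2, 4],
    "M": [1, 2, 3, 4], "N": [1, 2, 3, 4],
    "threads": [64, 96, 128, 192, 256],
    "min_blocks_per_sm": [1, 2, 4],
}

def benchmark_kernel(m, n, k, params):
    """Placeholder: compile the kernel with `params`, time C += A*B for
    (m x k)(k x n) blocks on the GPU, and return GFlop/s."""
    return 0.0   # stand-in value; the real measurement happens on the device

def tune(m, n, k):
    best_gflops, best_params = -1.0, None
    keys = list(search_space)
    for values in itertools.product(*(search_space[key] for key in keys)):
        params = dict(zip(keys, values))
        gflops = benchmark_kernel(m, n, k, params)
        if gflops > best_gflops:
            best_gflops, best_params = gflops, params
    return best_params, best_gflops   # the "winner" for this (m, n, k) kernel
```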

  11. CUDA Kernel Performance
     [Plot: performance [GFlop/s] and arithmetic intensity vs. block size (n = m = k), comparing libcusmm, cuBLAS, the roofline limit, and libcusmm without write-back.]
     The K20X GPU delivers 1.3 TFlop/s and 180 GB/s memory bandwidth with ECC.

  12. GPU Model Comparison
     [Plot: performance [GFlop/s] vs. block size (n = m = k) for Tesla K80, Tesla K40, and Tesla K20X.]

  13. Single Node Performance
     [Plot: performance [GFlop/s] vs. number of cores, GPU+CPU vs. CPU-only.]
     4.5x speedup GPU+CPU vs. CPU-only. Artificial benchmark with a favorable 23 × 23 block size; dual Sandy Bridge (E5-2620, 2.0 GHz, 6 cores); Nvidia K20 GPU.

  14. Full Daint System Science Case
     80'000-atom DFT with high accuracy settings: aggregated nanoparticles in explicit solution, relevant for 3rd-generation solar cells.
     - Matrix dims: 772868 × 772868; filter threshold: 10⁻⁶; matrix occupation ≈ 4%
     - SCF steps ≈ 50; # multiplies needed ≈ 2000
     - Dense flops needed: 1846613343679824128000; actual flops needed: 849928403736295802; sparsity boost: 2172×
     - GPU flop share: 99.4%; walltime on 5184 nodes: 6264 s
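These figures are mutually consistent: one dense multiplication of an N × N matrix costs 2N³ flops, and the sparsity boost is the ratio of dense to actual flops. A quick check:

```python
N = 772868                          # matrix dimension
n_mult = 2000                       # multiplications over the whole SCF run
dense_flops = 2 * N**3 * n_mult
actual_flops = 849928403736295802

print(dense_flops)                  # 1846613343679824128000, matching the slide
print(dense_flops / actual_flops)   # ~2172.7, the quoted sparsity boost
```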

  15. Bridging from Linear Scaling SCF to Materials Properties
     2D polymers: synthetically tailored 2D materials beyond graphene. Based on linear scaling MD simulations of 10'000s of atoms, the morphology and properties of the proposed 2D polymer sheets have been investigated.
     Payamyar et al. (2013), Advanced Materials, DOI: 10.1002/adma.201304705

  16. Bridging from Linear Scaling SCF to Materials Properties
     [Figure: two sheet configurations with areas of 223 Å² and 168 Å², separated by 2 ps of MD.]
     Payamyar et al. (2013), Advanced Materials, DOI: 10.1002/adma.201304705

  17. Outlook: Strong Scaling of Dense Matrix Multiplications
     Matrix functions: diagonalization → Taylor series. Matrix inverse: Cholesky → Hotelling iteration (sketched below).
     [Plot: total performance [TFlop/s] vs. number of nodes for cuBLAS w/o communication, the 32er kernel w/o communication, DBCSR (32er blocks), and Cray's libsci_acc; benchmark of pdgemm on a 32k × 32k double precision matrix.]
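Hotelling's iteration computes the inverse with nothing but matrix multiplications, so the same multiplication machinery can replace a Cholesky-based inversion. A minimal NumPy sketch (dense and unoptimized, just to show the recurrence):

```python
import numpy as np

def hotelling_inverse(A, tol=1e-12, max_iter=200):
    """Hotelling(-Bodewig) iteration: X_{n+1} = X_n (2I - A X_n) -> A^{-1}.

    Only matrix-matrix multiplications are needed, which is why it strong-scales
    like the multiplication itself.
    """
    n = A.shape[0]
    I = np.eye(n)
    # Classical starting guess that guarantees convergence for nonsingular A:
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    for _ in range(max_iter):
        X = X @ (2 * I - A @ X)
        if np.linalg.norm(I - A @ X) < tol:
            break
    return X

A = np.random.rand(50, 50) + 50 * np.eye(50)     # well-conditioned test matrix
assert np.allclose(hotelling_inverse(A), np.linalg.inv(A), atol=1e-8)
```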

  18. Conclusion
     Our DBCSR library enables O(N) quantum chemistry methods, which allow for novel science.
     Lessons learned:
     - Overlapping communication with computation is key.
     - Auto-tuning is the way to go.
     - Avoid manual scheduling; use CUDA events.
     Acknowledgements: Joost VandeVondele, Florian Schiffmann, Urban Borstnik, Peter Messmer.
     Contacts: ole.schuett@mat.ethz.ch, http://nanosim.ethz.ch, http://dbcsr.cp2k.org, http://cp2k.org
     Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!
