

  1. Parallelizing the Hamiltonian Computation in DQMC Simulations: Checkerboard Method for Sparse Matrix Exponentials on Multicore and GPU
  Che-Rung Lee (cherung@cs.nthu.edu.tw), National Tsing Hua University
  Joint work with Zhi-Hung Chen and Quey-Liang Kao
  Second International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 25th, 2012

  2. Outline
  1. Determinant quantum Monte Carlo simulations
  2. Matrix multiplication of sparse matrix exponentials
  3. Parallel block checkerboard methods on multicore and GPU
  4. Experiments and results
  5. Concluding remarks


  4. Computational Material Science
  Goal: to study the properties of solid-state materials: magnetism, metal-insulator transition, high-temperature superconductivity, ...
  The Hubbard model: the energy operator H is associated with a lattice of particles, and the Boltzmann weight is expressed as a path integral
      e^{-βH} ≈ e^{-τH(h_1)} e^{-τH(h_2)} ⋯ e^{-τH(h_L)},
  where β = 1/T is the "imaginary time", τ = β/L is the discretized time step, and {h_i} is the "Hubbard-Stratonovich field".
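
The splitting above can be checked numerically on a toy problem. The sketch below is an illustration of mine, not the talk's code: it splits a small symmetric stand-in "Hamiltonian" into non-commuting parts K and V, approximates each time step by e^{-τK} e^{-τV}, and compares the L-step product against the exact e^{-βH}; the splitting error shrinks as the number of time slices L grows.

```python
import numpy as np

def expm_sym(A):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

def trotter_error(H_kin, H_pot, beta, L):
    """Relative error of (e^{-tau K} e^{-tau V})^L vs. e^{-beta (K+V)}."""
    tau = beta / L
    exact = expm_sym(-beta * (H_kin + H_pot))
    step = expm_sym(-tau * H_kin) @ expm_sym(-tau * H_pot)
    approx = np.linalg.matrix_power(step, L)
    return np.linalg.norm(exact - approx) / np.linalg.norm(exact)

rng = np.random.default_rng(0)
n = 4
K = rng.standard_normal((n, n)); K = (K + K.T) / 2  # toy "kinetic" term
V = np.diag(rng.standard_normal(n))                 # toy "potential" term

for L in (8, 16, 32):            # the error shrinks as L grows
    print(L, trotter_error(K, V, beta=2.0, L=L))
```

In DQMC the factors additionally depend on the HS field h_i, but the source of the discretization error, non-commuting pieces inside the exponent, is the same.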

  5. Determinant Quantum Monte Carlo (DQMC) Simulations
  DQMC algorithm:
  1. Given a random HS field h = (h_{ℓ,i}) = (±1).
  2. Until there are enough measurements, for ℓ = 1, ..., L and i = 1, ..., N:
     1. Propose a new HS configuration h′.
     2. Compute the ratio γ of the determinants of the new/old configurations.
     3. Generate a random number ρ ∈ [0, 1].
     4. If γ > ρ, accept h = h′.
     5. If the system is thermalized, sample the physical measurements of interest.
  3. Aggregate the sampled measurements.
  (Flowchart: random HS field → warmup DQMC steps until thermalized → sampling DQMC steps with measurements until enough samples → aggregation.)
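
The accept/reject loop in steps 2.1-2.5 can be sketched generically. In the toy below the expensive determinant ratio γ is replaced by a caller-supplied `log_weight` function (a stand-in of mine, not the talk's kernel); real DQMC computes γ cheaply from Green's-function updates.

```python
import numpy as np

def dqmc_sweep(h, log_weight, rng):
    """One Metropolis sweep over all L x N Hubbard-Stratonovich spins."""
    L, N = h.shape
    lw = log_weight(h)
    accepted = 0
    for l in range(L):
        for i in range(N):
            h[l, i] *= -1                  # propose the local change h -> h'
            lw_new = log_weight(h)
            gamma = np.exp(lw_new - lw)    # weight ratio of new/old configs
            if rng.random() < gamma:       # accept when gamma > rho
                lw, accepted = lw_new, accepted + 1
            else:
                h[l, i] *= -1              # reject: restore the old config
    return accepted / (L * N)

rng = np.random.default_rng(42)
h = rng.choice([-1, 1], size=(8, 8))       # random HS field of +/-1 spins
field_bias = lambda x: 0.5 * x.sum()       # toy weight favoring +1 spins
for _ in range(20):                        # warmup sweeps
    dqmc_sweep(h, field_bias, rng)
print(h.mean())                            # drifts toward +1 under the bias
```

The structure mirrors the flowchart: sweeps before thermalization are warmup, sweeps after it interleave with measurements.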


  8. DQMC Parallelization
  The parallel Monte Carlo method can speed up DQMC simulations by parallelizing the sampling stage.
  Coarse-grained parallelization: communication happens only before sampling and during aggregation.
  Strong scalability holds if the number of desired samplings is much larger than the number of processors.
  (Flowchart: after thermalization, many independent DQMC-step/measurement chains run in parallel, followed by aggregation.)


  11. Computational Challenges
  By Amdahl's law, the speedup of the parallel Monte Carlo method is limited by the warmup stage, which is not parallelizable:
      Speedup = (T_warmup + T_sampling) / (T_warmup + T_sampling / p) → (T_warmup + T_sampling) / T_warmup as p → ∞.
  The parallel Monte Carlo method does not scale with problem size, i.e., with the number of particles and the discretized time length.
  Coarse-grained parallelization does not fit well on multicore and GPU:
  - The computation of each DQMC step is complicated.
  - Execution is slower because of resource contention.
  - Memory per core shrinks as the number of cores grows.
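
The saturation behavior of the formula above is easy to tabulate; the stage times below are illustrative values I chose, not measurements from the talk.

```python
def mc_speedup(t_warmup, t_sampling, p):
    """Speedup when only the sampling stage runs on p processors."""
    return (t_warmup + t_sampling) / (t_warmup + t_sampling / p)

t_w, t_s = 1.0, 10.0                  # assumed warmup : sampling = 1 : 10
for p in (1, 4, 16, 64, 1024):
    print(p, round(mc_speedup(t_w, t_s, p), 2))
# As p grows the speedup saturates at (t_w + t_s) / t_w = 11,
# no matter how many processors are added.
```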


  13. Inside Each DQMC Step
  A DQMC step:
  1. Propose a local change h → h′.
  2. Throw a random number 0 < r < 1.
  3. Accept the change if r < det(e^{-βH(h′)}) / det(e^{-βH(h)}).
  Computational kernel: the Green's function calculation
      G = (I + B_L ⋯ B_2 B_1)^{-1},
  used to compute det(e^{-βH(h′)}) and the physical measurements.
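
A direct, numerically naive evaluation of G for stand-in B matrices (random, of my choosing) looks like the sketch below. Production DQMC replaces the plain product with stabilized factorizations such as pivoted QR, because the B_ℓ are matrix exponentials whose product quickly over- or underflows.

```python
import numpy as np

def greens_function(Bs):
    """G = (I + B_L ... B_2 B_1)^{-1}, computed without stabilization."""
    N = Bs[0].shape[0]
    prod = np.eye(N)
    for B in Bs:            # Bs = [B_1, ..., B_L]; left-multiply so that
        prod = B @ prod     # the accumulated product is B_L ... B_2 B_1
    return np.linalg.inv(np.eye(N) + prod)

rng = np.random.default_rng(1)
N, L = 16, 8
Bs = [np.eye(N) + 0.1 * rng.standard_normal((N, N)) for _ in range(L)]
G = greens_function(Bs)     # satisfies (I + B_L ... B_1) @ G = I
```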


  17. Green's Function Calculation
      G = (I + B_L ⋯ B_2 B_1)^{-1}.
  N: the number of particles; L: the number of time slices.
  The time complexity of computing G is O(N^3 L).
  For 10^3 warmup steps and 10^4 sampling steps, a simulation takes about 15 hours. For large simulations, N = O(10^4) and L = O(10^2), the projected execution time ranges from several days to months.
  Profile of a DQMC simulation (N = 256, L = 96):
      Matrix kernel                   Execution time
      Matrix-matrix multiplication    72.39%
      Pivoted QR decomposition        17.83%
      Matrix inversion                 3.02%
      Others                           6.76%
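
The profile above bounds what any single-kernel optimization can buy. A quick Amdahl-style computation over the table's percentages shows that even making the matrix-matrix multiplications free caps the overall speedup near 3.6x, which confirms that the multiplications are the kernel worth attacking first.

```python
# Execution-time fractions from the profile table (N = 256, L = 96).
fractions = {
    "matrix-matrix multiplication": 0.7239,
    "pivoted QR decomposition": 0.1783,
    "matrix inversion": 0.0302,
    "others": 0.0676,
}
assert abs(sum(fractions.values()) - 1.0) < 1e-9   # percentages are complete

def speedup_bound(kernel):
    """Amdahl bound on overall speedup if `kernel` took zero time."""
    return 1.0 / (1.0 - fractions[kernel])

print(round(speedup_bound("matrix-matrix multiplication"), 2))   # 3.62
```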


  19. Outline
  1. Determinant quantum Monte Carlo simulations
  2. Matrix multiplication of sparse matrix exponentials
  3. Parallel block checkerboard methods on multicore and GPU
  4. Experiments and results
  5. Concluding remarks


  21. Matrix-matrix Multiplication
  Some tuned results on multicore and on GPU (Fermi):
  - DGEMM on a 4-core Intel Core i7-920 with MKL reaches about 40 Gflop/s (my laptop).
  - SGEMM can reach 662 Gflop/s on Fermi. [Jakub Kurzak, LAWN 245, 2010]
  - DGEMM reaches 362 Gflop/s on Fermi. [Guangming Tan et al., SC11]
  These rates are impressive, but the running time still grows cubically with the problem size N.
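
To make the closing point concrete: at a fixed sustained rate, GEMM time scales with the 2N^3 flop count, so doubling N multiplies the time by 8. The rate below reuses the Fermi DGEMM figure quoted above; the matrix sizes are arbitrary examples of mine.

```python
def gemm_seconds(N, gflops):
    """Idealized time for one N x N matrix multiply (2*N^3 flops)."""
    return 2.0 * N**3 / (gflops * 1e9)

rate = 362.0                       # Gflop/s, the Fermi DGEMM figure above
t_small = gemm_seconds(2048, rate)
t_large = gemm_seconds(4096, rate)
print(t_large / t_small)           # 8.0: doubling N costs 8x the time
```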
