
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers



  1. Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
     Haohuan Fu (haohuan@tsinghua.edu.cn)
     High Performance Geo-Computing (HPGC) Group, Center for Earth System Science, Tsinghua University
     April 5, 2016

  2. Tsinghua HPGC Group
     • HPGC: high performance geo-computing (http://www.thuhpgc.org)
     • High performance computational solutions for geoscience applications
     • simulation-oriented research: providing highly efficient and highly scalable simulation applications (exploration geophysics, climate modeling)
     • data-oriented research: data processing, data compression, and data mining
     • Combine optimizations from three different perspectives (Application, Algorithm, and Architecture), with a particular focus on new accelerator architectures

  3. A Design Process That Combines Optimizations from Different Layers
     (diagram: Application, Algorithm, and Architecture combine into the “Best” Computational Solution)

  4. Tsinghua HPGC Group: a Quick Overview of Existing Projects
     • Exploration Geophysics
       • GPU-based BEAM Migration (sponsored by Statoil)
       • GPU-based ETE Forward Modeling (sponsored by BGP)
       • Parallel Finite Element Electromagnetic Forward Modeling Method (sponsored by NSFC)
       • FPGA-based RTM (sponsored by NSFC and IBM)
     • Climate Modeling Application
       • global-scale atmospheric simulation (800 Tflops Shallow Water Equation Solver on Tianhe-1A, 1.4 Pflops atmospheric simulation 3D Euler Equation Solver on Tianhe-2)
       • FPGA-based atmospheric simulation (selected as one of the 27 Significant papers in the 25 years of the FPL conference)
     • Remote Sensing Data Processing
       • data analysis and visualization (sponsored by Microsoft)
       • deep learning based land cover mapping
     • Parallel Stencil on Different HPC Architectures
     • Parallel Sparse Matrix Solver Algorithm
     • Parallel Data Compression (PLZMA) (sponsored by ZTE)
     • Hardware-Based Gaussian Mixture Model Clustering Engine: 517x speedup
     • multi-core/many-core (CPU, GPU, MIC) Architecture
     • reconfigurable hardware (FPGA)

  5. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers

  6. The Gap between Software and Hardware
     • millions of lines of legacy code
     • poor scalability
     • written for multi-core, rather than many-core
     • China’s supercomputers (50P): heterogeneous systems with GPUs or MICs; millions of cores
     • China’s models (100T): pure CPU code; scaling to hundreds or thousands of cores

  7. Our Research Goals
     • highly scalable framework that can efficiently utilize many-core accelerators
     • automated tools to deal with the legacy code
     • China’s supercomputers: heterogeneous systems with GPUs or MICs; millions of cores
     • China’s models (100T~1P): pure CPU code; scaling to hundreds or thousands of cores


  9. Example: Highly-Scalable Atmospheric Simulation Framework
     • Application: cloud resolving
     • Algorithm: cube-sphere grid or other grid; explicit, implicit, or semi-implicit method
     • Architecture: CPU, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, …
     • Yang, Chao (Institute of Software, CAS): computational mathematics
     • Wang, Lanning (Beijing Normal University): climate modeling
     • Xue, Wei (Tsinghua University): computer science
     • Fu, Haohuan (Tsinghua University): geo-computing
     (diagram: Application, Algorithm, and Architecture combine into the “Best” Computational Solution)

  10. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Previous Efforts

  11. Highly-Scalable Framework for Atmospheric Modeling
      • 2012: solving 2D SWE using CPU + GPU
        • 800 Tflops on 40,000 CPU cores and 3750 GPUs
      For more details, please refer to our PPoPP 2013 paper: “A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations”, in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1-12, Shenzhen, 2013.

  12. Highly-Scalable Framework for Atmospheric Modeling
      • 2012: solving 2D SWE using CPU + GPU
        • 800 Tflops on 40,000 CPU cores and 3750 GPUs
      • 2013: 2D SWE on MIC and FPGA
        • 1.26 Pflops on 207,456 CPU cores and 25,932 MICs
        • another 10x on FPGA
      For more details, please refer to our IPDPS 2014 paper: “Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2”; and our FPL 2013 paper: “Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine”.

  13. Highly-Scalable Framework for Atmospheric Modeling
      • 2012: solving 2D SWE using CPU + GPU
        • 800 Tflops on 40,000 CPU cores and 3750 GPUs
      • 2013: 2D SWE on CPU+MIC and CPU+FPGA
        • 1.26 Pflops on 207,456 CPU cores and 25,932 MICs
        • another 10x on FPGA
      • 2014: 3D Euler on MIC
        • 1.7 Pflops on 147,456 CPU cores and 18,432 MICs
      For more details, please refer to our paper: “Ultra-scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2”, IEEE Transactions on Computers.

  14. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: 3D Euler on CPU+GPU

  15. CPU-only Algorithm
      • Parallel Version
        - Multi-node & Multi-core
        - MPI Parallelism
      (figure: 25-point stencil; 3D channel)

  16. CPU-only Algorithm
      • Parallel Version: Multi-node & Multi-core; MPI Parallelism
      • CPU Algorithm Workflow, per stencil sweep, for each subdomain:
        ① Update Halo
        ② Calculate Euler stencil
           a. Compute Local Coordinate
           b. Compute Fluxes
           c. Compute Source Terms
      (figure: CPU workflow per stencil sweep: ① Halo Updating, ② CPU Stencil Computation)
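The two steps of the CPU sweep above can be sketched on a single subdomain. This is a minimal illustration under stated assumptions, not the solver from the talk: the `Grid` type, `update_halo`, and `stencil_sweep` are hypothetical names, the halo is filled with a constant instead of being exchanged over MPI, and the stencil body simply averages the 25 points of a radius-4 star stencil (1 + 6*4) in place of the local-coordinate, flux, and source-term computations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kHalo = 4;  // radius-4 star stencil: 1 + 6*4 = 25 points

// Cubic subdomain of n^3 interior cells padded by kHalo halo cells per side.
struct Grid {
    int n;
    std::vector<double> data;

    explicit Grid(int cells)
        : n(cells),
          data(static_cast<std::size_t>(cells + 2 * kHalo) *
                   (cells + 2 * kHalo) * (cells + 2 * kHalo),
               0.0) {
        assert(cells > 0);
    }

    // Valid indices run from -kHalo to n + kHalo - 1; [0, n) is the interior.
    std::size_t index(int i, int j, int k) const {
        const std::size_t s = static_cast<std::size_t>(n + 2 * kHalo);
        return (static_cast<std::size_t>(k + kHalo) * s +
                static_cast<std::size_t>(j + kHalo)) * s +
               static_cast<std::size_t>(i + kHalo);
    }
    double& at(int i, int j, int k) { return data[index(i, j, k)]; }
    double at(int i, int j, int k) const { return data[index(i, j, k)]; }
};

// Step 1 stand-in: a real run would receive halo cells from neighbours via MPI.
void update_halo(Grid& g, double boundary_value) {
    for (int k = -kHalo; k < g.n + kHalo; ++k)
        for (int j = -kHalo; j < g.n + kHalo; ++j)
            for (int i = -kHalo; i < g.n + kHalo; ++i) {
                const bool interior = i >= 0 && i < g.n && j >= 0 &&
                                      j < g.n && k >= 0 && k < g.n;
                if (!interior) g.at(i, j, k) = boundary_value;
            }
}

// Step 2 stand-in: averages the 25 stencil points instead of evaluating
// local coordinates, fluxes and source terms.
void stencil_sweep(const Grid& in, Grid& out) {
    assert(in.n == out.n);
    for (int k = 0; k < in.n; ++k)
        for (int j = 0; j < in.n; ++j)
            for (int i = 0; i < in.n; ++i) {
                double sum = in.at(i, j, k);
                for (int r = 1; r <= kHalo; ++r) {
                    sum += in.at(i - r, j, k) + in.at(i + r, j, k) +
                           in.at(i, j - r, k) + in.at(i, j + r, k) +
                           in.at(i, j, k - r) + in.at(i, j, k + r);
                }
                out.at(i, j, k) = sum / 25.0;
            }
}
```

One sweep is then `update_halo(in, value)` followed by `stencil_sweep(in, out)`. The halo width of 4 is what lets every interior cell read its full stencil without bounds checks.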

  17. Hybrid (CPU+GPU) Algorithm
      • Hybrid Partition
        • GPU: Inner Stencil Computation
        • CPU: Halo Updating & Outer Stencil Computation
      • CPU-GPU Hybrid Algorithm, per stencil sweep, for each subdomain:
        GPU side: Inner-part Euler stencil
        CPU side: ① Update Halo ② Outer-part Euler stencil
        BARRIER
      (figure labels: PETSc; 3D channel; Inner part (GPU); Outer part, 4 layers (CPU); CPU-GPU Exchange)
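To get a feel for how much work each side receives under the inner/outer partition, the split can be counted per subdomain. The sketch below is illustrative only: `Partition`, `is_outer`, and `split_subdomain` are hypothetical names, and it assumes the 4-layer outer part wraps all six faces of a box-shaped subdomain, which the slide does not state explicitly.

```cpp
#include <algorithm>
#include <cassert>

struct Partition {
    long long inner;  // cells assigned to the GPU
    long long outer;  // cells assigned to the CPU
};

// Assumption: cells closer than `layers` to any face form the outer shell.
bool is_outer(int i, int j, int k, int nx, int ny, int nz, int layers = 4) {
    return i < layers || i >= nx - layers ||
           j < layers || j >= ny - layers ||
           k < layers || k >= nz - layers;
}

// Counts inner and outer cells of an nx * ny * nz subdomain.
Partition split_subdomain(int nx, int ny, int nz, int layers = 4) {
    assert(nx > 0 && ny > 0 && nz > 0 && layers >= 0);
    const long long total = 1LL * nx * ny * nz;
    const long long inner = 1LL * std::max(nx - 2 * layers, 0) *
                            std::max(ny - 2 * layers, 0) *
                            std::max(nz - 2 * layers, 0);
    return {inner, total - inner};
}
```

For a 16 x 16 x 16 subdomain this gives 512 inner and 3584 outer cells, which shows why small subdomains leave little for the GPU; larger subdomains shift the balance toward the inner part.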

  18. Hybrid Algorithm Design
      (figure: CPU-only workflow per stencil sweep: ① Halo Updating, ② CPU Stencil Computation)
      (figure: hybrid workflow per stencil sweep: GPU row: Inner Stencil Computation; CPU row: G2C, Halo Updating, Outer Stencil Computation, C2G; steps ① ② ③; Barrier)

  19. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: GPU-related Optimizations

  20. Optimizations
      • GPU Opt: Pinned Memory; SMEM/L1; AoS -> SoA; Register Adjustment; Kernel Splitting; Other Methods; Customizable Data Cache; Inner-thread Rescheduling
      • CPU Opt: OpenMP; SIMD Vectorization; Cache blocking
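Of the items listed, the AoS -> SoA change is the easiest to show in isolation. The sketch below is a host-side illustration under assumptions: the field names (`rho`, `u`, `v`, `w`, `energy`) and the types `CellAoS` and `FieldsSoA` are hypothetical and not taken from the talk. The point is only the layout: in the SoA form each field is contiguous across cells, so neighbouring GPU threads reading the same field touch neighbouring addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: the five fields of one cell sit next to each other.
struct CellAoS {
    double rho, u, v, w, energy;
};

// Structure of arrays: each field is stored contiguously across all cells.
struct FieldsSoA {
    std::vector<double> rho, u, v, w, energy;
    explicit FieldsSoA(std::size_t n) : rho(n), u(n), v(n), w(n), energy(n) {}
    std::size_t size() const { return rho.size(); }
};

// Copies an AoS cell list into the SoA layout, field by field.
FieldsSoA to_soa(const std::vector<CellAoS>& cells) {
    FieldsSoA out(cells.size());
    for (std::size_t c = 0; c < cells.size(); ++c) {
        out.rho[c] = cells[c].rho;
        out.u[c] = cells[c].u;
        out.v[c] = cells[c].v;
        out.w[c] = cells[c].w;
        out.energy[c] = cells[c].energy;
    }
    assert(out.size() == cells.size());
    return out;
}
```

In a solver the conversion would normally happen once, at setup, so that every later sweep works on the SoA arrays directly.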


  22. Optimizations: Pinned Memory
      • Pageable transfer (time T1): Virtual Memory -> Physical Memory -> GPU
      • Pinned transfer (time T2): Physical Memory -> GPU
      • Theoretical: T2 = 1/3 * T1
      • Measured: T2 < 1/2 * T1

  23. Optimizations: SMEM/L1
      • Compiler option: -Xptxas -dlcm=ca


  25. Optimizations: Register Adjustment
      • Streaming Multi-Processor: 64K registers, 2048 threads
      • Rt: registers per thread
      • Occupancy = (64*1024) / (2048*Rt)
      • 256 registers per thread (Rt = 256), 1 block per SM: Occupancy = (64*1024) / (2048*256) = 12.5%

  26. Optimizations: Register Adjustment
      • Streaming Multi-Processor: 64K registers, 2048 threads
      • Rt: registers per thread
      • Occupancy = (64*1024) / (2048*Rt)
      • 64 registers per thread (Rt = 64), 4 blocks per SM: Occupancy = (64*1024) / (2048*64) = 50%
      • Compiler option: -maxrregcount=64
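The occupancy formula from the register-adjustment slides is simple enough to check directly. The helper below is a small sketch: the function name and default parameters are hypothetical, the defaults are the 64K registers and 2048 threads per SM quoted on the slides, and the clamp to 1.0 is an addition here for register counts small enough that registers stop being the limiting resource.

```cpp
#include <cassert>

// Register-limited occupancy: (registers per SM) / (max threads per SM * Rt).
// Other limits (shared memory, block size) are deliberately ignored.
double register_limited_occupancy(int regs_per_thread,
                                  int regs_per_sm = 64 * 1024,
                                  int max_threads_per_sm = 2048) {
    assert(regs_per_thread > 0 && regs_per_sm > 0 && max_threads_per_sm > 0);
    const double occupancy =
        static_cast<double>(regs_per_sm) /
        (static_cast<double>(max_threads_per_sm) * regs_per_thread);
    return occupancy > 1.0 ? 1.0 : occupancy;  // clamp: cannot exceed 100%
}
```

With Rt = 256 the function returns 0.125 and with Rt = 64 it returns 0.5, matching the 12.5% and 50% on the slides. Capping registers per thread can cause spilling to local memory, so the best setting generally has to be found by measurement.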



  31. A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Results

  32. Experimental Result
      • CPU Opt (OpenMP; SIMD Vectorization; Cache blocking): 19.7s
      • GPU Opt (Pinned Memory; SMEM/L1; AoS -> SoA): 5.91s (70% reduction)
      • GPU Opt (Kernel Splitting; Register Adjustment; Other Methods): 1.80s (69% reduction)
      • GPU Opt (Customized Data Cache; Inner-thread Rescheduling): 0.92s (49% reduction)
      • 31.64x speedup over 12-core CPU (E5-2697 v2)
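The stage-to-stage percentages can be recomputed from the four timings on the slide. The helpers below are a sketch with hypothetical names. Note that the ratio of the first to the last timing, 19.7 / 0.92, is about 21.4x, so the quoted 31.64x presumably refers to a 12-core CPU baseline whose timing does not appear in this transcript.

```cpp
#include <cassert>

// Fraction of run time removed when going from `before` to `after` seconds.
double reduction(double before, double after) {
    assert(before > 0.0 && after >= 0.0);
    return 1.0 - after / before;
}

// Speedup factor when going from `before` to `after` seconds.
double speedup(double before, double after) {
    assert(before > 0.0 && after > 0.0);
    return before / after;
}
```

For example, `reduction(19.7, 5.91)` is about 0.70, `reduction(5.91, 1.80)` about 0.695, and `reduction(1.80, 0.92)` about 0.49, in line with the 70%, 69%, and 49% shown.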

