Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Haohuan Fu (haohuan@tsinghua.edu.cn)
High Performance Geo-Computing (HPGC) Group, Center for Earth System Science, Tsinghua University
April 5th, 2016
Tsinghua HPGC Group
HPGC: high performance geo-computing (http://www.thuhpgc.org)
High performance computational solutions for geoscience applications:
• simulation-oriented research: providing highly efficient and highly scalable simulation applications (exploration geophysics, climate modeling)
• data-oriented research: data processing, data compression, and data mining
Combining optimizations from three different perspectives (Application, Algorithm, and Architecture), with a particular focus on new accelerator architectures.
A Design Process That Combines Optimizations from Different Layers
Application + Algorithm + Architecture → the “best” computational solution
Tsinghua HPGC Group: a Quick Overview of Existing Projects
• Application
  • Exploration Geophysics
    • GPU-based BEAM Migration (sponsored by Statoil)
    • GPU-based ETE Forward Modeling (sponsored by BGP)
    • Parallel Finite Element Electromagnetic Forward Modeling Method (sponsored by NSFC)
    • FPGA-based RTM (sponsored by NSFC and IBM)
  • Climate Modeling
    • global-scale atmospheric simulation (800 Tflops shallow water equation solver on Tianhe-1A; 1.4 Pflops 3D Euler equation solver on Tianhe-2)
    • FPGA-based atmospheric simulation (selected as one of the 27 Significant Papers in the 25 years of the FPL conference)
  • Remote Sensing Data Processing
    • data analysis and visualization (sponsored by Microsoft)
    • deep learning based land cover mapping
• Algorithm
  • parallel stencil on different HPC architectures
  • parallel sparse matrix solver
  • parallel data compression (PLZMA) (sponsored by ZTE)
  • hardware-based Gaussian Mixture Model clustering engine: 517x speedup
• Architecture
  • multi-core / many-core (CPU, GPU, MIC)
  • reconfigurable hardware (FPGA)
A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers
The Gap between Software and Hardware
China’s supercomputers (~50P):
• heterogeneous systems with GPUs or MICs
• millions of cores
China’s models (~100T):
• millions of lines of legacy code, with poor scalability
• written for multi-core rather than many-core
• pure CPU code, scaling to hundreds or thousands of cores
Our Research Goals
• a highly scalable framework that can efficiently utilize many-core accelerators
• automated tools to work with the legacy code
China’s supercomputers: heterogeneous systems with GPUs or MICs; millions of cores.
China’s models (target: 100T~1P): pure CPU code, scaling to hundreds or thousands of cores.
Example: Highly-Scalable Atmospheric Simulation Framework
Team:
• Yang, Chao — Institute of Software, CAS (computational mathematics)
• Wang, Lanning — Beijing Normal University (climate modeling)
• Xue, Wei — Tsinghua University (computer science)
• Fu, Haohuan — Tsinghua University (geo-computing)
Design layers:
• Application: cloud resolving
• Algorithm: cube-sphere grid or other grid; explicit, implicit, or semi-implicit method
• Architecture: CPU, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, …
→ the “best” computational solution
A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Previous Efforts
Highly-Scalable Framework for Atmospheric Modeling
2012: solving 2D SWE using CPU + GPU — 800 Tflops on 40,000 CPU cores and 3,750 GPUs.
For more details, please refer to our PPoPP 2013 paper: “A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations”, in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1-12, Shenzhen, 2013.
Highly-Scalable Framework for Atmospheric Modeling
2012: solving 2D SWE using CPU + GPU — 800 Tflops on 40,000 CPU cores and 3,750 GPUs.
2013: 2D SWE on MIC and FPGA — 1.26 Pflops on 207,456 CPU cores and 25,932 MICs; another 10x on FPGA.
For more details, please refer to our IPDPS 2014 paper: “Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2”, and our FPL 2013 paper: “Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine”.
Highly-Scalable Framework for Atmospheric Modeling
2012: solving 2D SWE using CPU + GPU — 800 Tflops on 40,000 CPU cores and 3,750 GPUs.
2013: 2D SWE on CPU+MIC and CPU+FPGA — 1.26 Pflops on 207,456 CPU cores and 25,932 MICs; another 10x on FPGA.
2014: 3D Euler on MIC — 1.7 Pflops on 147,456 CPU cores and 18,432 MICs.
For more details, please refer to our paper: “Ultra-scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2”, IEEE Transactions on Computers.
A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: 3D Euler on CPU+GPU
CPU-only Algorithm (Parallel Version)
• multi-node & multi-core
• MPI parallelism
• 25-point stencil over a 3D channel
CPU-only Algorithm (Parallel Version)
CPU algorithm per stencil sweep — for each subdomain:
① update halo
② calculate the Euler stencil
  a. compute local coordinates
  b. compute fluxes
  c. compute source terms
CPU workflow per stencil sweep: halo updating ① → CPU stencil computation ②
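A minimal C++/MPI sketch of this per-sweep workflow, assuming a 1-D Cartesian decomposition along x and a placeholder stencil body; the names Subdomain, update_halo, and compute_euler_stencil are illustrative, not taken from the actual solver.

```cpp
// Hedged sketch of the per-sweep CPU workflow (illustrative names).
// Build: mpicxx -O3 sweep.cpp -o sweep
#include <mpi.h>
#include <vector>

struct Subdomain {
    int nx, ny, nz, halo;          // interior extents and halo width
    std::vector<double> u, u_new;  // prognostic field, flattened with x slowest
};

// Step ①: exchange halo layers with the two x-neighbours (1-D decomposition).
void update_halo(Subdomain& d, MPI_Comm cart) {
    int left, right;
    MPI_Cart_shift(cart, 0, 1, &left, &right);
    int slab = d.halo * d.ny * d.nz;           // elements in one halo block
    // send the leftmost interior block to the left, receive into the right halo
    MPI_Sendrecv(&d.u[size_t(d.halo) * d.ny * d.nz], slab, MPI_DOUBLE, left, 0,
                 &d.u[size_t(d.nx + d.halo) * d.ny * d.nz], slab, MPI_DOUBLE, right, 0,
                 cart, MPI_STATUS_IGNORE);
    // send the rightmost interior block to the right, receive into the left halo
    MPI_Sendrecv(&d.u[size_t(d.nx) * d.ny * d.nz], slab, MPI_DOUBLE, right, 1,
                 &d.u[0], slab, MPI_DOUBLE, left, 1,
                 cart, MPI_STATUS_IGNORE);
}

// Step ②: Euler stencil (a. local coordinates, b. fluxes, c. source terms),
// collapsed here into a single placeholder update per interior point.
void compute_euler_stencil(Subdomain& d) {
    for (int i = d.halo; i < d.nx + d.halo; ++i)
        for (int j = 0; j < d.ny; ++j)
            for (int k = 0; k < d.nz; ++k) {
                size_t idx = (size_t(i) * d.ny + j) * d.nz + k;
                d.u_new[idx] = d.u[idx];       // real code applies the 25-point stencil
            }
    d.u.swap(d.u_new);
}

// One stencil sweep for one subdomain: ① update halo, ② compute the stencil.
void stencil_sweep(Subdomain& d, MPI_Comm cart) {
    update_halo(d, cart);
    compute_euler_stencil(d);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int dims[1] = {nprocs}, periods[1] = {1};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);
    Subdomain d{32, 32, 32, 2, {}, {}};
    d.u.assign(size_t(d.nx + 2 * d.halo) * d.ny * d.nz, 1.0);
    d.u_new = d.u;
    stencil_sweep(d, cart);
    MPI_Finalize();
    return 0;
}
```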
Hybrid (CPU+GPU) Algorithm
Hybrid partition of the 3D channel: the GPU handles the inner stencil computation; the CPU handles halo updating and the outer stencil computation (the outer part is 4 layers thick).
CPU-GPU hybrid algorithm, per stencil sweep, for each subdomain:
• GPU side: inner-part Euler stencil (PETSc)
• CPU side: ① update halo, ② outer-part Euler stencil
• BARRIER, then CPU-GPU exchange of the boundary layers
Hybrid Algorithm Design
CPU-only workflow, per stencil sweep: ① halo updating → ② CPU stencil computation.
Hybrid workflow, per stencil sweep:
• GPU: inner stencil computation, then G2C copy
• CPU: ① halo updating, ② outer stencil computation, ③ C2G copy
• barrier at the end of the sweep
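A hedged CUDA C++ sketch of one hybrid sweep: the inner-part kernel is launched asynchronously on a stream while the host performs the halo update and the outer-part stencil, and the boundary layers are exchanged around the barrier. All kernel and helper names (inner_stencil_kernel, update_halo, compute_outer_stencil) and the copy offsets are illustrative, not the framework's actual interface.

```cpp
// Hedged sketch of one hybrid CPU+GPU stencil sweep (illustrative names).
// Build: nvcc -O3 hybrid_sweep.cu -o hybrid_sweep
#include <cuda_runtime.h>

__global__ void inner_stencil_kernel(const double* u, double* u_new,
                                     int nx, int ny, int nz) {
    // Each thread updates one inner-part grid point (stencil body omitted).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = nx * ny * nz;
    if (i < n) u_new[i] = u[i];                // placeholder for the real Euler stencil
}

// Placeholders for the CPU-side work of the sweep.
void update_halo(double* /*u_host*/) {}            // ① MPI halo exchange
void compute_outer_stencil(double* /*u_host*/) {}  // ② outer 4 layers on the CPU

void hybrid_sweep(double* u_dev, double* u_new_dev,
                  double* u_host, double* boundary_host,
                  size_t boundary_bytes, int nx, int ny, int nz,
                  cudaStream_t stream) {
    // GPU side: inner-part Euler stencil, launched asynchronously on its stream.
    int n = nx * ny * nz, threads = 256, blocks = (n + threads - 1) / threads;
    inner_stencil_kernel<<<blocks, threads, 0, stream>>>(u_dev, u_new_dev, nx, ny, nz);

    // CPU side, overlapped with the kernel: ① update halo, ② outer-part stencil.
    update_halo(u_host);
    compute_outer_stencil(u_host);

    // Barrier, then CPU-GPU exchange of the 4 boundary layers
    // (offsets into the fields are elided; boundary_bytes covers the 4 layers).
    cudaStreamSynchronize(stream);
    cudaMemcpyAsync(boundary_host, u_new_dev, boundary_bytes,
                    cudaMemcpyDeviceToHost, stream);                 // G2C
    cudaMemcpyAsync(u_new_dev, u_host, boundary_bytes,
                    cudaMemcpyHostToDevice, stream);                 // C2G
    cudaStreamSynchronize(stream);
}

int main() {
    int nx = 64, ny = 64, nz = 64;
    size_t bytes = size_t(nx) * ny * nz * sizeof(double);
    size_t boundary_bytes = size_t(4) * ny * nz * sizeof(double);    // 4 layers
    double *u_dev, *u_new_dev, *u_host, *boundary_host;
    cudaMalloc(&u_dev, bytes);
    cudaMalloc(&u_new_dev, bytes);
    cudaMallocHost(&u_host, bytes);              // pinned, so the copies can be async
    cudaMallocHost(&boundary_host, boundary_bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    hybrid_sweep(u_dev, u_new_dev, u_host, boundary_host, boundary_bytes, nx, ny, nz, stream);
    cudaStreamDestroy(stream);
    cudaFree(u_dev); cudaFree(u_new_dev);
    cudaFreeHost(u_host); cudaFreeHost(boundary_host);
    return 0;
}
```

The point of the split is that the inner-part kernel is large enough to hide the halo exchange and the outer-part CPU work behind it, so only the boundary transfers and the barrier remain on the critical path.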
A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: GPU-related Optimizations
Optimizations
GPU Opt:
• Pinned Memory
• SMEM/L1
• AoS → SoA (see the sketch after this list)
• Register Adjustment
• Kernel Splitting
• Other Methods: Customizable Data Cache, Inner-thread Rescheduling
CPU Opt:
• OpenMP
• SIMD Vectorization
• Cache Blocking
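The AoS → SoA item above refers to reorganizing the prognostic state so that each field occupies a contiguous array. A generic CUDA sketch of the two layouts is shown below; the field names (rho, rho_u, …) are illustrative, not the model's actual variables.

```cpp
// AoS vs. SoA layout for a multi-field state (illustrative field names).
// With AoS, consecutive threads reading the same field access strided memory;
// with SoA, the same reads coalesce into contiguous global-memory transactions.
#include <cuda_runtime.h>

struct CellAoS { double rho, rho_u, rho_v, rho_w, rho_e; };   // array of structures

struct StateSoA {                                              // structure of arrays
    double *rho, *rho_u, *rho_v, *rho_w, *rho_e;
};

__global__ void scale_rho_aos(CellAoS* cells, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cells[i].rho *= a;      // stride of sizeof(CellAoS) between threads
}

__global__ void scale_rho_soa(StateSoA s, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) s.rho[i] *= a;          // unit stride: fully coalesced
}

int main() {
    int n = 1 << 20;
    CellAoS* aos;
    cudaMalloc(&aos, n * sizeof(CellAoS));
    cudaMemset(aos, 0, n * sizeof(CellAoS));          // contents irrelevant to the sketch
    StateSoA soa;
    cudaMalloc(&soa.rho,   n * sizeof(double));
    cudaMalloc(&soa.rho_u, n * sizeof(double));
    cudaMalloc(&soa.rho_v, n * sizeof(double));
    cudaMalloc(&soa.rho_w, n * sizeof(double));
    cudaMalloc(&soa.rho_e, n * sizeof(double));
    cudaMemset(soa.rho, 0, n * sizeof(double));
    scale_rho_aos<<<(n + 255) / 256, 256>>>(aos, n, 0.5);
    scale_rho_soa<<<(n + 255) / 256, 256>>>(soa, n, 0.5);
    cudaDeviceSynchronize();
    return 0;
}
```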
Optimizations: Pinned Memory
With pageable host memory, a CPU→GPU transfer first copies the data from virtual (pageable) memory into a pinned staging area in physical memory and then DMAs it to the GPU (total time T1); allocating the buffer as pinned memory skips the staging copy and DMAs directly (time T2).
Theoretical: T2 = 1/3 × T1. In practice: T2 < 1/2 × T1.
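A minimal CUDA illustration of this point, contrasting a pageable malloc buffer with a page-locked buffer from cudaMallocHost; the buffer size is arbitrary and the labels T1/T2 refer to the slide, not to measurements made by this sketch.

```cpp
// Pinned vs. pageable host memory for CPU<->GPU transfers (illustrative size).
// Build: nvcc -O3 pinned.cu -o pinned
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u << 20;             // 256 MB field
    double* d_buf;
    cudaMalloc(&d_buf, bytes);

    // Pageable host memory: the driver stages it through an internal pinned buffer (time T1).
    double* pageable = static_cast<double*>(std::malloc(bytes));
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned (page-locked) host memory: direct DMA, no staging copy (time T2 < T1),
    // and the transfer can overlap with kernels when issued with cudaMemcpyAsync.
    double* pinned;
    cudaMallocHost(&pinned, bytes);
    cudaMemcpyAsync(d_buf, pinned, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    cudaFreeHost(pinned);
    std::free(pageable);
    cudaFree(d_buf);
    return 0;
}
```

Pinned memory is a finite resource: page-locking very large buffers can degrade overall system performance, so in practice only the transfer buffers (e.g. the halo layers) are pinned.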
Optimizations: SMEM/L1
Compiler option: -Xptxas -dlcm=ca (set the default load cache modifier to “ca”, i.e. cache global loads in L1 as well as L2).
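A small sketch of combining that compile-time switch with the runtime shared-memory/L1 preference; the kernel is a placeholder, not the framework's actual stencil.

```cpp
// Favour L1 over shared memory for a global-memory-bound kernel (illustrative kernel).
// Build: nvcc -O3 -Xptxas -dlcm=ca l1_config.cu -o l1_config
//   -Xptxas -dlcm=ca  -> cache global loads at all levels (L1 and L2)
#include <cuda_runtime.h>

__global__ void euler_stencil_kernel(const double* u, double* u_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u_new[i] = u[i];            // placeholder for the real stencil body
}

int main() {
    // On Fermi/Kepler-class GPUs the 64 KB on-chip memory is split between shared
    // memory and L1; prefer a larger L1 when the kernel uses little shared memory.
    cudaFuncSetCacheConfig(euler_stencil_kernel, cudaFuncCachePreferL1);

    int n = 1 << 20;
    size_t bytes = n * sizeof(double);
    double *u, *u_new;
    cudaMalloc(&u, bytes);
    cudaMalloc(&u_new, bytes);
    euler_stencil_kernel<<<(n + 255) / 256, 256>>>(u, u_new, n);
    cudaDeviceSynchronize();
    cudaFree(u);
    cudaFree(u_new);
    return 0;
}
```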
Optimizations: Register Adjustment
A streaming multiprocessor has 64K registers and can host up to 2048 resident threads. With Rt registers per thread:
occupancy = (64 × 1024) / (2048 × Rt)
• Rt = 256 (1 block per SM): occupancy = 12.5%
• Rt = 64 (4 blocks per SM): occupancy = 50%
Compiler option: -maxrregcount=64
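The cap can be imposed globally with -maxrregcount=64 at compile time, or per kernel with __launch_bounds__, which tells the compiler how many blocks should stay resident; a hedged sketch with a placeholder kernel follows. Capping registers too aggressively can spill values to local memory, so the 64-register setting is a tuning trade-off rather than a universal rule.

```cpp
// Capping registers per thread to raise occupancy (illustrative kernel).
// Build option A (global cap):     nvcc -O3 -maxrregcount=64 regcap.cu -o regcap
// Build option B (per-kernel cap): use __launch_bounds__ as below.
#include <cstdio>
#include <cuda_runtime.h>

// 256 threads per block, at least 4 resident blocks per SM
//   -> the compiler limits itself to 64K regs / (4 * 256) = 64 registers per thread.
__global__ void __launch_bounds__(256, 4)
euler_stencil_kernel(const double* u, double* u_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u_new[i] = u[i];    // placeholder for the register-heavy stencil body
}

int main() {
    // Occupancy formula from the slide: 64K registers per SM, up to 2048 resident
    // threads, occupancy = 65536 / (2048 * Rt): Rt = 256 -> 12.5%, Rt = 64 -> 50%.
    int max_blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, euler_stencil_kernel, 256, 0);
    std::printf("resident blocks per SM at 256 threads/block: %d\n", max_blocks);
    return 0;
}
```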
A Highly Scalable Framework for Atmospheric Modeling on Heterogeneous Supercomputers: Results
Experimental Result
Optimizations applied (CPU Opt: OpenMP, SIMD vectorization, cache blocking; GPU Opt: pinned memory, SMEM/L1, AoS → SoA, kernel splitting, register adjustment, customized data cache, inner-thread rescheduling).
Runtime as the optimizations are applied:
• 19.7 s → 5.91 s (70% reduction)
• 5.91 s → 1.80 s (69% reduction)
• 1.80 s → 0.92 s (49% reduction)
Overall: 31.64x speedup over the 12-core CPU (E5-2697 v2).