Daino: A High-level Framework for Parallel and Efficient AMR on GPUs
Mohamed Wahib (1), Naoya Maruyama (1,2), Takayuki Aoki (2)
(1) RIKEN Advanced Institute for Computational Science, Kobe, Japan
(2) Tokyo Institute of Technology, GSIC, Tokyo, Japan
11th May 2017, GTC17
Summary
Motivation & problem:
- "AMR is one of the paths to multi-scale exascale applications"
- Producing efficient AMR code is hard (especially for GPUs)
Solution: a framework for producing efficient AMR code for GPUs
- Architecture-independent interface provided to the user
- A speedup model for quantifying the efficiency of AMR code
Key results: we evaluate three AMR applications
- Speedups & scalability comparable to hand-written code (up to 3,642 K20x GPUs)
Adaptive Mesh Refinement (AMR)
For meshes in some simulations using PDEs:
- We only require high resolution in areas of interest
- The resolution changes dynamically during the simulation
Achieving efficient AMR is challenging:
- Managing an adaptive mesh can be complicated
- Compute load and communication costs must be balanced
Structured Tree-based AMR
There are many ways to represent the mesh; we focus on the octree representation (quadtree in 2D)
- The mesh is divided into blocks, which are refined or coarsened as required
- Operations applied on the tree are distributed
[Figure: Octree-based meshes. (a) Adaptive mesh partitioned across PE 1, PE 2, and PE 3; (b) tree representation.]
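To make the block/tree layout concrete, the sketch below shows one plausible way to represent a tree-based AMR mesh in C/CUDA host code. The struct, field names, and block dimensions are illustrative assumptions, not Daino's actual data structures.

/* Illustrative octree node for a block-structured AMR mesh (a hypothetical
   layout, not Daino's implementation). Each leaf owns one fixed-size block
   of cells; interior nodes only carry tree topology. */
#define BLOCK_NX 16
#define BLOCK_NY 16
#define BLOCK_NZ 16

typedef struct OctNode {
    int             level;        /* refinement level (0 = root)            */
    int             owner_rank;   /* MPI rank / PE that owns this block     */
    struct OctNode *parent;
    struct OctNode *children[8];  /* NULL for leaf nodes                    */
    double         *cells;        /* BLOCK_NX*BLOCK_NY*BLOCK_NZ values; a   */
                                  /* device pointer in the data-centric     */
                                  /* model, since block data lives on GPU   */
} OctNode;

/* Refining a leaf allocates 8 children, each covering one octant of the
   parent's region; consolidating merges 8 sibling leaves back into the
   parent. The tree itself stays on the CPU. */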
How AMR Works
    Initialize the mesh
    FOR simulation time DO
        Execute stencil operations for all blocks            // computation (reduced: less data in the mesh)
        Exchange ghost layers with neighbor nodes
        IF time to remesh THEN                                // remeshing (AMR overhead)
            Calculate the remeshing criterion for all blocks
            Refine or consolidate blocks
            Balance the mesh
        ENDIF
        IF time to load balance THEN                          // load balancing (AMR overhead)
            Apply the load balancing algorithm
        ENDIF
    ENDFOR
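The remeshing criterion in the loop above is typically a cheap per-block error indicator computed on the GPU. The kernel below is a generic illustration using the maximum local gradient of a field as the indicator; the field layout, block size, and the criterion itself are assumptions for the example, not Daino's implementation (launch with one CUDA block of 256 threads per AMR block).

__global__ void block_max_gradient(const double *u, double *indicator,
                                   int nx, int nblocks)
{
    int b = blockIdx.x;                          // one CUDA block per AMR block
    if (b >= nblocks) return;
    const double *ub = u + (size_t)b * nx;       // this block's field (1D for brevity)
    __shared__ double smax[256];                 // assumes blockDim.x == 256

    double g = 0.0;
    for (int i = threadIdx.x + 1; i < nx - 1; i += blockDim.x)
        g = fmax(g, fabs(ub[i + 1] - ub[i - 1]) * 0.5);    // centered difference
    smax[threadIdx.x] = g;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {         // tree reduction
        if (threadIdx.x < s)
            smax[threadIdx.x] = fmax(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        indicator[b] = smax[0];   // compared against a threshold to refine or consolidate
}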
AMR on GPUs
It is hard to achieve efficient AMR with GPUs; the few existing AMR frameworks that support GPUs have drawbacks:
- (1) The user must provide code optimized for the GPU
- (2) Scalability problems due to CPU-GPU data movement
- (3) No speedup-bound model
Our framework contributes a solution to each of these three points.
Framework for Efficient AMR
A compiler and a runtime
Input: serial code applying a stencil on a uniform grid
- The user adds directives to identify the relevant data arrays
- The interface is architecture-neutral
Output: an executable binary for the target architecture
- The generated code is parallel and optimized for the GPU (MPI+CUDA)
Architecture-neutral Interface (contribution 1, 1 of 2)
[Figure: With existing AMR frameworks the user supplies separate, hand-optimized CUDA and OpenMP codes. With our framework, a single serial C code for a uniform mesh, annotated with directives such as #pragma daino kernel and #pragma daino data, is compiled by the framework into both a GPU AMR executable and a CPU AMR executable.]
Two benefits:
- Productivity
- Ability to apply low-level GPU optimizations
Architecture-neutral Interface (contribution 1, 2 of 2)
Minimal example of using directives in our framework:

#pragma dno kernel                              // a target kernel
void func(float ***a, float ***b, ...)
{
    #pragma dno data domName(i, j, k) a, b;     // data arrays + iterators
    #pragma dno timeloop                        // target loop
    for (int t = 0; t < TIME_MAX; t++) {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++) {
                ...                             // computation not related to a and b
                for (int k = 0; k < NZ; k++) {
                    a[i][j][k] = c * (b[i-1][j][k] + b[i+1][j][k] + b[i][j][k]
                                    + b[i][j+1][k] + b[i][j-1][k]);
                }
            }
    }
}
Scalable AMR: Data-centric Model (contribution 2, 1 of 2)
A data-centric approach: each computing element specializes in its own data
- Blocks reside on the GPU; the octree data structure resides on the CPU
- All operations that touch block data are migrated to the GPU
- The CPU only processes the octree data structure
Scalable AMR: Data-centric Model (contribution 2, 2 of 2)
[Figure: Conceptual overview of data-centric GPU AMR. The CPU holds the octree (AMR metadata) and orchestrates the loop; the octants' data arrays stay resident in GPU memory. Steps: (1) initialize and copy the initial arrays to the GPU; (2) exchange/copy ghost layers; (3) invoke the stencil kernel; (4) invoke the post-stencil correction kernel; (5) invoke the error-estimation kernel and, depending on the error relative to the threshold δ, invoke the refine or consolidate kernel; (6) update and balance the octree, moving blocks between GPUs when needed; (7) finalize and copy the final arrays back to the CPU.]
All kernels are data parallel (i.e. well-suited to the GPU). A host-side sketch of this loop follows.
[1] Mohamed Wahib, Naoya Maruyama. Data-centric GPU-based Adaptive Mesh Refinement. IA^3 2015: 5th Workshop on Irregular Applications: Architectures and Algorithms, co-located with SC'15.
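The sketch below illustrates the shape of such a host-side time loop in CUDA. Kernel names, arguments, buffer handling, and the remeshing policy are assumptions for illustration, not Daino's actual runtime API; the point is that block data never leaves the GPU, while only small per-block error indicators and octree updates involve the CPU.

#include <cuda_runtime.h>

// Placeholder kernels (bodies omitted); in practice these are generated by
// the framework from the user's annotated stencil code.
__global__ void exchange_ghosts(double *blocks, const int *neighbors, int nblocks) { /* ... */ }
__global__ void stencil(double *out, const double *in, int nblocks)                { /* ... */ }
__global__ void estimate_error(double *err, const double *blocks, int nblocks)     { /* ... */ }

void time_loop(double *d_blocks, double *d_tmp, int *d_neighbors,
               double *d_err, double *h_err,
               int nblocks, int nsteps, int remesh_every)
{
    for (int step = 0; step < nsteps; ++step) {
        exchange_ghosts<<<nblocks, 256>>>(d_blocks, d_neighbors, nblocks);
        stencil<<<nblocks, 256>>>(d_tmp, d_blocks, nblocks);
        double *t = d_blocks; d_blocks = d_tmp; d_tmp = t;   // ping-pong buffers

        if (step % remesh_every == 0) {
            estimate_error<<<nblocks, 256>>>(d_err, d_blocks, nblocks);
            // Only the small per-block error array crosses the bus; the CPU
            // updates the octree, decides which blocks to refine/consolidate,
            // rebalances, and launches the corresponding kernels (nblocks
            // would change after remeshing; omitted here for brevity).
            cudaMemcpy(h_err, d_err, nblocks * sizeof(double),
                       cudaMemcpyDeviceToHost);
            // remesh_on_cpu(h_err, nblocks);   // hypothetical helper
        }
    }
    cudaDeviceSynchronize();
}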
Speedup Model (contribution 3)
AMR promises reduced computation; the problem is the overhead of managing the hierarchical mesh
We project a speedup bound:
- It informs the framework designer of the efficiency of the AMR code
- We compare the achieved speedup vs. the projected upper-bound speedup
- It takes the AMR overhead into account
If the projected speedup is far from the achieved speedup, some AMR overhead(s) were not properly accounted for
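One plausible way to formalize such a bound, written here only as an illustration (the symbols and the exact form are assumptions, not necessarily the model used by Daino): let T_u be the uniform-mesh runtime, r the fraction of fine-mesh work retained by the adaptive mesh, and T_ovh the AMR overhead (remeshing, load balancing, extra communication).

% Illustrative upper bound on AMR speedup over a uniform fine mesh
% (an assumed formulation, not necessarily Daino's exact model)
\[
  S_{\max} \;=\; \frac{T_u}{\, r\, T_u + T_{\mathrm{ovh}} \,},
  \qquad
  S_{\text{achieved}} \;\le\; S_{\max} \;\le\; \frac{1}{r}.
\]

A large gap between the achieved speedup and such a bound signals that some overhead term is larger than assumed.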
Framework Implementation (1 of 2)
[Figure 1: Overview of the framework implementation. Annotated fixed-mesh code is lowered to LLVM IR by the compiler front end; the framework's passes produce optimized LLVM IR, which is compiled by LLVM and linked with the Daino runtime (AMR library and communication library) into an adapted-mesh executable.]
[Figure: The standard LLVM compilation pipeline: C/C++ source, Clang front end, IR, a sequence of passes (LLVM proper), back end, machine code.]
Translations and optimizations are applied as compiler passes.
Framework Implementation (2 of 2)
[Figure: The Daino framework overview. Application C code (stencil code) is parsed by the translator into an AST, which emits application LLVM IR and GPU stencil IR (NVVM IR), together with generated refine, coarsen, and error kernels (NVVM IR / CUDA). IR passes and the NVPTX back end produce PTX; the application is compiled and linked with the CUDA Driver API and the Daino runtime (AMR library and communication library) into the final executable. In the original figure, the Daino components are enclosed in a red dotted line.]
Runtime Libraries
AMR management:
- Maintains the octree
- Orchestrates the work
- Memory manager (especially important with GPUs)
Communication (MPI processes):
- Halo data exchange (a sketch follows below)
- Transparent access to blocks
- Moving blocks (load balancing)
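For illustration, the sketch below shows the common pattern for exchanging a ghost layer between two GPUs with MPI, staging through pinned host buffers. The function name, buffer handling, and tags are assumptions for the example, not Daino's communication library (a CUDA-aware MPI could pass device pointers directly and skip the staging copies).

#include <mpi.h>
#include <cuda_runtime.h>

// Exchange n doubles of ghost-layer data with one neighbor rank.
void exchange_halo(const double *d_send, double *d_recv, size_t n,
                   int neighbor_rank, MPI_Comm comm)
{
    double *h_send, *h_recv;
    cudaMallocHost((void **)&h_send, n * sizeof(double));   // pinned staging buffers
    cudaMallocHost((void **)&h_recv, n * sizeof(double));

    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    MPI_Sendrecv(h_send, (int)n, MPI_DOUBLE, neighbor_rank, 0,
                 h_recv, (int)n, MPI_DOUBLE, neighbor_rank, 0,
                 comm, MPI_STATUS_IGNORE);

    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaFreeHost(h_send);
    cudaFreeHost(h_recv);
}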
Evaluation
Application / Description:
- Hydrodynamics solver: 2nd-order directionally split hyperbolic schemes to solve the Euler equations [RTVD scheme modified from GAMER [1]]
- Shallow-water solver: we model shallow-water simulations by depth-averaging the Navier-Stokes equations [2nd-order Runge-Kutta method]
- Phase-field simulation: 3D dendritic growth during binary alloy solidification [2] [time integration by the Allen-Cahn equation]
[1] H.-Y. Schive, U.-H. Zhang, and T. Chiueh. Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR. Int. J. High Perform. Comput. Appl., 26(4):367-377, Nov. 2012.
[2] T. Shimokawabe et al. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. SC'11.
Results (1 of 4)
We use the TSUBAME2.5 supercomputer (Tokyo Tech), with up to 3,642 K20x GPUs (TSUBAME Grand Challenge Category A: full machine)
[Figure: Weak scaling of uniform mesh, hand-written AMR, and automated (Daino) AMR for the three applications; GAMER-generated AMR is also included for hydrodynamics. Runtime in seconds vs. number of GPUs (16 to 3,600). Mesh sizes: 4,096^3 per GPU (hydrodynamics), 8,192^3 per GPU (phase-field), 4,096x512x512 per 16 GPUs (shallow-water). Annotated speedups of AMR over the uniform mesh range from roughly 1.7x to 9.4x.]
Results (2 of 4)
Notes:
- Phase-field achieves a 1.7x speedup; the original implementation is the Gordon Bell 2011 winner
- Daino is faster than the GAMER AMR version; GAMER is a leading framework for AMR on GPUs
[Figure: Strong scaling of uniform mesh, hand-written AMR, and automated (Daino) AMR for the three applications; GAMER-generated AMR is also included for hydrodynamics. Runtime in seconds vs. number of GPUs (16 to 3,600); annotated speedups range from roughly 1.3x to 9.6x.]
Results (3 of 4)
Overhead of the AMR framework (weak scaling):
- The AMR overhead grows from 12% on 16 GPUs to 16% on 3,600 GPUs
- Remeshing kernels are well-suited to the GPU