Daino: A High-level Framework for Parallel and Efficient AMR on GPUs
Mohamed Wahib (1), Naoya Maruyama (1,2), Takayuki Aoki (2)
(1) RIKEN Advanced Institute for Computational Science, Kobe, Japan
(2) Tokyo Institute of Technology, GSIC, Tokyo, Japan
11th May 2017, GTC17
Summary
Motivation & problem:
- "AMR is one of the paths to multi-scale exascale applications"
- Producing efficient AMR code is hard (especially for GPUs)
Solution: a framework for producing efficient AMR code for GPUs
- Architecture-independent interface provided to the user
- A speedup model for quantifying the efficiency of AMR code
Key results: we evaluate three AMR applications
- Speedups & scalability comparable to hand-written code (up to 3,642 K20x GPUs)
Adaptive Mesh Refinement (AMR)
For meshes in some simulations using PDEs:
- We only require high resolution in areas of interest
- The resolution changes dynamically during the simulation
Achieving efficient AMR is challenging:
- Managing an adaptive mesh can be complicated
- Compute load and communication costs must be balanced
Structured Tree-based AMR
There are many ways to represent the mesh; we focus on the octree representation (quadtree in 2D)
- The mesh is divided into blocks, which are refined or coarsened as required
- Operations applied on the tree are distributed
[Figure: Octree-based meshes. (a) Adaptive mesh partitioned across PE 1, PE 2, and PE 3; (b) tree representation.]
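To make the block/tree layout concrete, the sketch below shows one plausible way to represent a tree-based AMR mesh in C/CUDA host code. The struct, field names, and block dimensions are illustrative assumptions, not Daino's actual data structures.

/* Illustrative octree node for a block-structured AMR mesh (a hypothetical
   layout, not Daino's implementation). Each leaf owns one fixed-size block
   of cells; interior nodes only carry tree topology. */
#define BLOCK_NX 16
#define BLOCK_NY 16
#define BLOCK_NZ 16

typedef struct OctNode {
    int             level;        /* refinement level (0 = root)            */
    int             owner_rank;   /* MPI rank / PE that owns this block     */
    struct OctNode *parent;
    struct OctNode *children[8];  /* NULL for leaf nodes                    */
    double         *cells;        /* BLOCK_NX*BLOCK_NY*BLOCK_NZ values; a   */
                                  /* device pointer in the data-centric     */
                                  /* model, since block data lives on GPU   */
} OctNode;

/* Refining a leaf allocates 8 children, each covering one octant of the
   parent's region; consolidating merges 8 sibling leaves back into the
   parent. The tree itself stays on the CPU. */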
How AMR Works
    Initialize the mesh
    FOR simulation time DO
        Execute stencil operations for all blocks            // computation (reduced: less data in the mesh)
        Exchange ghost layers with neighbor nodes
        IF time to remesh THEN                                // remeshing (AMR overhead)
            Calculate the remeshing criterion for all blocks
            Refine or consolidate blocks
            Balance the mesh
        ENDIF
        IF time to load balance THEN                          // load balancing (AMR overhead)
            Apply the load balancing algorithm
        ENDIF
    ENDFOR
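The remeshing criterion in the loop above is typically a cheap per-block error indicator computed on the GPU. The kernel below is a generic illustration using the maximum local gradient of a field as the indicator; the field layout, block size, and the criterion itself are assumptions for the example, not Daino's implementation (launch with one CUDA block of 256 threads per AMR block).

__global__ void block_max_gradient(const double *u, double *indicator,
                                   int nx, int nblocks)
{
    int b = blockIdx.x;                          // one CUDA block per AMR block
    if (b >= nblocks) return;
    const double *ub = u + (size_t)b * nx;       // this block's field (1D for brevity)
    __shared__ double smax[256];                 // assumes blockDim.x == 256

    double g = 0.0;
    for (int i = threadIdx.x + 1; i < nx - 1; i += blockDim.x)
        g = fmax(g, fabs(ub[i + 1] - ub[i - 1]) * 0.5);    // centered difference
    smax[threadIdx.x] = g;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {         // tree reduction
        if (threadIdx.x < s)
            smax[threadIdx.x] = fmax(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        indicator[b] = smax[0];   // compared against a threshold to refine or consolidate
}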
AMR on GPUs
It is hard to achieve efficient AMR with GPUs; the few existing AMR frameworks that support GPUs have drawbacks:
- (1) The user must provide code optimized for the GPU
- (2) Scalability problems due to CPU-GPU data movement
- (3) No speedup-bound model
Our framework contributes a solution to each of these three points.
Framework for Efficient AMR
A compiler and a runtime
Input: serial code applying a stencil on a uniform grid
- The user adds directives to identify the relevant data arrays
- The interface is architecture-neutral
Output: an executable binary for the target architecture
- The generated code is parallel and optimized for the GPU (MPI+CUDA)
Architecture-neutral Interface (contribution 1, 1 of 2)
[Figure: With existing AMR frameworks the user supplies separate, hand-optimized CUDA and OpenMP codes. With our framework, a single serial C code for a uniform mesh, annotated with directives such as #pragma daino kernel and #pragma daino data, is compiled by the framework into both a GPU AMR executable and a CPU AMR executable.]
Two benefits:
- Productivity
- Ability to apply low-level GPU optimizations
Architecture-neutral Interface (contribution 1, 2 of 2)
Minimal example of using directives in our framework:

#pragma dno kernel                              // a target kernel
void func(float ***a, float ***b, ...)
{
    #pragma dno data domName(i, j, k) a, b;     // data arrays + iterators
    #pragma dno timeloop                        // target loop
    for (int t = 0; t < TIME_MAX; t++) {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++) {
                ...                             // computation not related to a and b
                for (int k = 0; k < NZ; k++) {
                    a[i][j][k] = c * (b[i-1][j][k] + b[i+1][j][k] + b[i][j][k]
                                    + b[i][j+1][k] + b[i][j-1][k]);
                }
            }
    }
}
Scalable AMR: Data-centric Model (contribution 2, 1 of 2)
A data-centric approach: each computing element specializes in its own data
- Blocks reside on the GPU; the octree data structure resides on the CPU
- All operations that touch block data are migrated to the GPU
- The CPU only processes the octree data structure
Scalable AMR: Data-centric Model (contribution 2, 2 of 2)
[Figure: Conceptual overview of data-centric GPU AMR. The CPU holds the octree (AMR metadata) and orchestrates the loop; the octants' data arrays stay resident in GPU memory. Steps: (1) initialize and copy the initial arrays to the GPU; (2) exchange/copy ghost layers; (3) invoke the stencil kernel; (4) invoke the post-stencil correction kernel; (5) invoke the error-estimation kernel and, depending on the error relative to the threshold δ, invoke the refine or consolidate kernel; (6) update and balance the octree, moving blocks between GPUs when needed; (7) finalize and copy the final arrays back to the CPU.]
All kernels are data parallel (i.e. well-suited to the GPU). A host-side sketch of this loop follows.
[1] Mohamed Wahib, Naoya Maruyama. Data-centric GPU-based Adaptive Mesh Refinement. IA^3 2015: 5th Workshop on Irregular Applications: Architectures and Algorithms, co-located with SC'15.
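The sketch below illustrates the shape of such a host-side time loop in CUDA. Kernel names, arguments, buffer handling, and the remeshing policy are assumptions for illustration, not Daino's actual runtime API; the point is that block data never leaves the GPU, while only small per-block error indicators and octree updates involve the CPU.

#include <cuda_runtime.h>

// Placeholder kernels (bodies omitted); in practice these are generated by
// the framework from the user's annotated stencil code.
__global__ void exchange_ghosts(double *blocks, const int *neighbors, int nblocks) { /* ... */ }
__global__ void stencil(double *out, const double *in, int nblocks)                { /* ... */ }
__global__ void estimate_error(double *err, const double *blocks, int nblocks)     { /* ... */ }

void time_loop(double *d_blocks, double *d_tmp, int *d_neighbors,
               double *d_err, double *h_err,
               int nblocks, int nsteps, int remesh_every)
{
    for (int step = 0; step < nsteps; ++step) {
        exchange_ghosts<<<nblocks, 256>>>(d_blocks, d_neighbors, nblocks);
        stencil<<<nblocks, 256>>>(d_tmp, d_blocks, nblocks);
        double *t = d_blocks; d_blocks = d_tmp; d_tmp = t;   // ping-pong buffers

        if (step % remesh_every == 0) {
            estimate_error<<<nblocks, 256>>>(d_err, d_blocks, nblocks);
            // Only the small per-block error array crosses the bus; the CPU
            // updates the octree, decides which blocks to refine/consolidate,
            // rebalances, and launches the corresponding kernels (nblocks
            // would change after remeshing; omitted here for brevity).
            cudaMemcpy(h_err, d_err, nblocks * sizeof(double),
                       cudaMemcpyDeviceToHost);
            // remesh_on_cpu(h_err, nblocks);   // hypothetical helper
        }
    }
    cudaDeviceSynchronize();
}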
Speedup Model (contribution 3)
AMR promises reduced computation; the problem is the overhead of managing the hierarchical mesh
We project a speedup bound:
- It informs the framework designer of the efficiency of the AMR code
- We compare the achieved speedup vs. the projected upper-bound speedup
- It takes the AMR overhead into account
If the projected speedup is far from the achieved speedup, some AMR overhead(s) were not properly accounted for
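One plausible way to formalize such a bound, written here only as an illustration (the symbols and the exact form are assumptions, not necessarily the model used by Daino): let T_u be the uniform-mesh runtime, r the fraction of fine-mesh work retained by the adaptive mesh, and T_ovh the AMR overhead (remeshing, load balancing, extra communication).

% Illustrative upper bound on AMR speedup over a uniform fine mesh
% (an assumed formulation, not necessarily Daino's exact model)
\[
  S_{\max} \;=\; \frac{T_u}{\, r\, T_u + T_{\mathrm{ovh}} \,},
  \qquad
  S_{\text{achieved}} \;\le\; S_{\max} \;\le\; \frac{1}{r}.
\]

A large gap between the achieved speedup and such a bound signals that some overhead term is larger than assumed.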
Framework Implementation (1 of 2)
[Figure 1: Overview of the framework implementation. Annotated fixed-mesh code is lowered to LLVM IR by the compiler front end; the framework's passes produce optimized LLVM IR, which is compiled by LLVM and linked with the Daino runtime (AMR library and communication library) into an adapted-mesh executable.]
[Figure: The standard LLVM compilation pipeline: C/C++ source, Clang front end, IR, a sequence of passes (LLVM proper), back end, machine code.]
Translations and optimizations are applied as compiler passes.
Framework Implementation (2 of 2)
[Figure: The Daino framework overview. Application C code (stencil code) is parsed by the translator into an AST, which emits application LLVM IR and GPU stencil IR (NVVM IR), together with generated refine, coarsen, and error kernels (NVVM IR / CUDA). IR passes and the NVPTX back end produce PTX; the application is compiled and linked with the CUDA Driver API and the Daino runtime (AMR library and communication library) into the final executable. In the original figure, the Daino components are enclosed in a red dotted line.]
Runtime Libraries
AMR management:
- Maintains the octree
- Orchestrates the work
- Memory manager (especially important with GPUs)
Communication (MPI processes):
- Halo data exchange (a sketch follows below)
- Transparent access to blocks
- Moving blocks (load balancing)
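For illustration, the sketch below shows the common pattern for exchanging a ghost layer between two GPUs with MPI, staging through pinned host buffers. The function name, buffer handling, and tags are assumptions for the example, not Daino's communication library (a CUDA-aware MPI could pass device pointers directly and skip the staging copies).

#include <mpi.h>
#include <cuda_runtime.h>

// Exchange n doubles of ghost-layer data with one neighbor rank.
void exchange_halo(const double *d_send, double *d_recv, size_t n,
                   int neighbor_rank, MPI_Comm comm)
{
    double *h_send, *h_recv;
    cudaMallocHost((void **)&h_send, n * sizeof(double));   // pinned staging buffers
    cudaMallocHost((void **)&h_recv, n * sizeof(double));

    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    MPI_Sendrecv(h_send, (int)n, MPI_DOUBLE, neighbor_rank, 0,
                 h_recv, (int)n, MPI_DOUBLE, neighbor_rank, 0,
                 comm, MPI_STATUS_IGNORE);

    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);

    cudaFreeHost(h_send);
    cudaFreeHost(h_recv);
}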
Evaluation
Application / Description:
- Hydrodynamics solver: 2nd-order directionally split hyperbolic schemes to solve the Euler equations [RTVD scheme modified from GAMER [1]]
- Shallow-water solver: we model shallow-water simulations by depth-averaging the Navier-Stokes equations [2nd-order Runge-Kutta method]
- Phase-field simulation: 3D dendritic growth during binary alloy solidification [2] [time integration by the Allen-Cahn equation]
[1] H.-Y. Schive, U.-H. Zhang, and T. Chiueh. Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR. Int. J. High Perform. Comput. Appl., 26(4):367-377, Nov. 2012.
[2] T. Shimokawabe et al. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer. SC'11.
Results (1 of 4)
We use the TSUBAME2.5 supercomputer (Tokyo Tech), with up to 3,642 K20x GPUs (TSUBAME Grand Challenge Category A: full machine)
[Figure: Weak scaling of uniform mesh, hand-written AMR, and automated (Daino) AMR for the three applications; GAMER-generated AMR is also included for hydrodynamics. Runtime in seconds vs. number of GPUs (16 to 3,600). Mesh sizes: 4,096^3 per GPU (hydrodynamics), 8,192^3 per GPU (phase-field), 4,096x512x512 per 16 GPUs (shallow-water). Annotated speedups of AMR over the uniform mesh range from roughly 1.7x to 9.4x.]
Results (2 of 4)
Notes:
- Phase-field achieves a 1.7x speedup; the original implementation is the Gordon Bell 2011 winner
- Daino is faster than the GAMER AMR version; GAMER is a leading framework for AMR on GPUs
[Figure: Strong scaling of uniform mesh, hand-written AMR, and automated (Daino) AMR for the three applications; GAMER-generated AMR is also included for hydrodynamics. Runtime in seconds vs. number of GPUs (16 to 3,600); annotated speedups range from roughly 1.3x to 9.6x.]
Results (3 of 4)
Overhead of the AMR framework (weak scaling):
- The AMR overhead grows from 12% on 16 GPUs to 16% on 3,600 GPUs
- Remeshing kernels are well-suited to the GPU