First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster
YALES2: a semi-industrial code for turbulent combustion and flows
Jean-Matthieu Etancelin, ROMEO, NVIDIA GPU Application Lab, University of Reims
GTC Europe, München, October 11th, 2017
Outline
1. Introduction: context (ROMEO HPC Center and GPU Application Lab, YALES2)
2. Existing code profiling
3. Code porting (porting strategies, internal kernel performance)
4. Benchmarks
5. Conclusions (limitations and future work)
ROMEO HPC Center
University of Reims:
- About 25,000 students
- Multidisciplinary university (undergraduate, graduate, PhD, research labs)
ROMEO HPC Center:
- HPC resources for both academic and industrial research
- Expertise and teaching in HPC and GPU technologies
- Integrated in the European HPC ecosystem (French Tier 1.5, Equip@Meso, ETP4HPC)
- Fully hybrid cluster (2 × Intel Ivy Bridge + 2 × NVIDIA K20 per node, InfiniBand QDR)
GPU Application Lab
Objectives: intensive exploitation of ROMEO
- Expertise in hybrid HPC, in particular in GPU technologies
- GPU code porting
- Optimization and scaling up towards a large number of GPUs
- Training and teaching for ROMEO users
Activities:
- Optimization of GPU, hybrid, and parallel codes
- Algorithm improvements targeting specific architectures
- Adaptation of numerical methods to hybrid and parallel architectures
Various collaborations:
- Local URCA laboratories, plus external collaborations (ONERA, University of Normandy)
- Several application domains (fluid mechanics, chemistry, computer science, applied mathematics, ...)
YALES2
Massively parallel solver for multi-physics problems in fluid dynamics, from primary atomization to pollutant dispersion in complex geometries
- Code developed at CORIA (University of Normandy) since 2007
- V. Moureau, G. Lartigue, P. Bénard (project leaders)
- ~10 developers (engineers, researchers, PhD students, ...) plus contributors
Code:
- Simulation of two-phase and reactive flows at low Mach number in complex geometries
- LES and DNS solvers on unstructured meshes
- 3D flow simulations on massively parallel architectures
- Used by more than 160 academic and industrial researchers
- 60+ scientific publications
YALES2, a complete library
Main features:
- 350,000 lines of Fortran code (f90 and f03)
- Portable
- Python interface
Main solvers:
- Scalar solver (SCS)
- Level-set solver (LSS)
- Lagrangian solver (LGS)
- Incompressible solver (ICS)
- Variable-density solver (VDS)
- Spray solver (SPS)
- Magnetohydrodynamics solver (MHD)
- Heat transfer solver (HTS)
- Chemical reactor solver (CRS)
- Darcy solver (DCY)
- Mesh movement solver (MMS)
- ALE solver (ALE)
- Linear acoustics solver (ACS)
- 5+ solvers in progress
HPC with YALES2 in combustion
Multi-scale and multi-physics applications:
- More than 85% of the energy we use comes from combustion
- Relevant to many fields (transportation, industry, energy, ...)
Examples in aeronautics: (simulation snapshots shown on the slide)
HPC with YALES2
HPC:
- Runs on up to 10,000 cores on French national clusters (IDRIS, CINES, ...), regional (CRIANN) and local machines
- Uses advanced parallel programming techniques (hybrid computing, automatic mesh adaptation, ...)
- Collaborations with the Exascale Lab (Intel/CEA/GENCI/UVSQ)
- Code used as a benchmark on prototype machines (Ouessant at IDRIS: POWER8 + P100), within GENCI's Cellule de Veille Technologique
- Collaboration on GPU porting with the GPU Application Lab, ROMEO
2. Existing code profiling
Profiling the existing code
Specific tools (MAQAO + TAU + PAPI) for in-depth profiling:
- Computational time (per function, per internal and external loop)
- Number of floating-point operations
- Number of cache misses
- ...
Hot spot: the matrix-vector product in the Preconditioned Conjugate Gradient (PCG)
(Figures on the slide: per-function profile and external-loop profile)
Profiling the existing code
Identifying the hot spot:
- Preconditioned conjugate gradient: 250 lines of code for 55% of the total time
- Matrix-vector product: 30 lines of code for 30% of the total time
3. Code porting
How to port the hot spot to the GPU?
Main feature of the code: data-centered structure
- Hierarchical, well-defined data structures based on a bloc decomposition of the mesh
- Every computing loop follows the same skeleton: two levels of nested loops, over mesh blocs, then over vertices, edges, or elements (see the CUDA mapping sketch below)
Code porting: three major possibilities
- CUDA/C with Intel compilers:
  - Fine-grained management of the GPU (code and data)
  - Goes through intermediary C interfaces
  - No deep copy for complex data structures
  - Code rewriting (only for the computational loops)
- OpenACC with PGI compilers:
  - Non-intrusive for the code (macros)
  - Complementary with the in-progress OpenMP version
  - Strong potential with unified memory
  - No deep copy for complex data structures
  - No support for Fortran pointers
- CUDA/Fortran with PGI (not tested):
  - Similar to CUDA/C, without the C interface
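As an illustration of how this common loop skeleton can map onto a CUDA launch in the CUDA/C strategy, here is a minimal sketch. All names (compute_on_bloc, bloc_offsets, ...) are hypothetical and the computation is a placeholder; the real kernels operate on the actual YALES2 data structures.

```c
/* Minimal sketch (hypothetical names): the outer loop over mesh blocs
 * becomes the CUDA grid, the inner loop over vertices becomes the
 * threads of each block. */
#include <cuda_runtime.h>

__global__ void compute_on_bloc(const int *bloc_offsets, double *vertex_data)
{
    int b     = blockIdx.x;                 /* one CUDA block per mesh bloc    */
    int first = bloc_offsets[b];
    int last  = bloc_offsets[b + 1];
    int v     = first + threadIdx.x;        /* one thread per vertex of bloc b */

    if (v < last)
        vertex_data[v] *= 2.0;              /* placeholder local computation   */
}

void launch_over_blocs(int n_blocs, int max_bloc_size,
                       const int *d_bloc_offsets, double *d_vertex_data)
{
    compute_on_bloc<<<n_blocs, max_bloc_size>>>(d_bloc_offsets, d_vertex_data);
    cudaDeviceSynchronize();
}
```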
Code porting with CUDA: key points
Data management:
- Exploiting Fortran/C interoperability for the data structures
- Translation of Fortran derived types to C typedefs (automatic, YALES2-specific translation tool)
GPU memory management:
- Allocation and management of GPU-specific data and utility arrays
- CPU-GPU transfers optimized with a buffer array in pinned memory
Execution model:
- Mapping the mesh decomposition and the hierarchical data structure to CUDA blocks/threads
Algorithm adaptation: inverse connectivity for mesh exploration
- Loop first over vertices instead of edges (the Finite Volume method works on edges by construction)
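A minimal sketch of what the C side of this interoperability and pinned-memory buffering might look like. The struct fields, bloc_dev_t, and the transfer helpers are hypothetical illustrations, not the actual YALES2 types or the output of the translation tool; on the Fortran side the corresponding derived type would use ISO_C_BINDING kinds so that the layouts match.

```c
#include <string.h>
#include <cuda_runtime.h>

/* Hypothetical C mirror of a Fortran derived type describing one mesh bloc
 * (the real typedefs are generated by the YALES2-specific translation tool). */
typedef struct {
    int     n_vertices;     /* number of vertices in this bloc        */
    int     n_edges;        /* number of edges in this bloc           */
    double *data;           /* device pointer: vertex values          */
    double *edge_value;     /* device pointer: edge coefficients      */
    double *result;         /* device pointer: matrix-vector product  */
} bloc_dev_t;

/* Single page-locked (pinned) staging buffer for CPU <-> GPU transfers,
 * as on the slide; pinned memory enables faster, asynchronous copies. */
static double *pinned_buf = NULL;

void init_pinned_buffer(size_t max_elems)
{
    cudaMallocHost((void **)&pinned_buf, max_elems * sizeof(double));
}

void push_field(const double *host_field, double *dev_field, size_t n,
                cudaStream_t stream)
{
    memcpy(pinned_buf, host_field, n * sizeof(double));   /* stage on the host */
    cudaMemcpyAsync(dev_field, pinned_buf, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);       /* async H2D copy    */
}
```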
CUDA code porting: inverse connectivity for mesh exploration
Matrix-vector product computation (op_product)

Initial algorithm (not well suited to the GPU):

    Foreach bloc b of mesh                 // blocks
      Foreach edge e of b                  // threads
        vs, ve = vertex(e)
        result(vs) += f(value(e), data(vs), data(ve))
        result(ve) -= f(value(e), data(vs), data(ve))

Algorithm with inverse connectivity:

    Foreach bloc b of mesh                 // blocks
      Foreach vertex v of b                // threads
        r = 0                              // register
        Foreach edge e from vertex v
          ve = end(e)
          r += f(value(e), data(v), data(ve))
        Foreach edge e to vertex v
          vs = start(e)
          r -= f(value(e), data(vs), data(v))
        result(v) = r
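Below is a minimal CUDA sketch of the inverse-connectivity version. It assumes a CSR-like storage of the incoming and outgoing edges of each vertex and a placeholder edge operator; all names (op_product_kernel, out_offsets, edge_flux, ...) are hypothetical, and the real YALES2 kernel works on the bloc-decomposed data structures shown earlier.

```c
#include <cuda_runtime.h>

/* Placeholder for the edge operator f(value, data_vs, data_ve); the actual
 * operator comes from the discretized system solved by the PCG. */
__device__ double edge_flux(double value, double a, double b)
{
    return value * (b - a);   /* assumed form, for illustration only */
}

__global__ void op_product_kernel(
    int n_vertices,
    const int    *out_offsets, const int *out_edge, const int *out_end,
    const int    *in_offsets,  const int *in_edge,  const int *in_start,
    const double *edge_value,  const double *data,  double *result)
{
    /* One thread per vertex (the real code maps one CUDA block per mesh
     * bloc and one thread per vertex of that bloc). */
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_vertices) return;

    double r = 0.0;   /* register accumulator: no write conflicts between threads */

    /* Edges leaving v (v is the start vertex) */
    for (int k = out_offsets[v]; k < out_offsets[v + 1]; ++k)
        r += edge_flux(edge_value[out_edge[k]], data[v], data[out_end[k]]);

    /* Edges arriving at v (v is the end vertex) */
    for (int k = in_offsets[v]; k < in_offsets[v + 1]; ++k)
        r -= edge_flux(edge_value[in_edge[k]], data[in_start[k]], data[v]);

    result[v] = r;
}
```

Compared with the edge-based version, every thread writes only to its own result(v), so no atomic operations or edge colouring are needed on the GPU.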