SLIDE 1

Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor

Stephen Wang†1, James Lin†1,4, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3 and Satoshi Matsuoka†4

†1 Shanghai Jiao Tong University, Center for HPC; †2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL); †3 NVIDIA Corporation; †4 Tokyo Institute of Technology


GTC 2017, San Jose, USA May 11, 2017

SLIDE 2

[Chart: core number per processor, 10–1000, across recent architectures]

Challenges of supporting multi- and many-core processors, the traditional territory of OpenMP

SLIDE 3

GTC-P: Gyrokinetic Toroidal Code - Princeton

  • Developed by Princeton to accelerate progress in highly scalable plasma turbulence HPC Particle-in-Cell (PIC) codes.

  • Successfully applied to high-resolution, problem-size-scaling studies relevant to fusion's next-generation International Thermonuclear Experimental Reactor (ITER).

  • A modern "co-design" version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scale with unprecedented resolution and speed on a variety of architectures worldwide.

  • Runs on present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., which feature GPU, multi-core CPU, and many-core processors.

  • KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing (SC16), Salt Lake City, Utah, USA, 2016.

SLIDE 4

OpenACC Implementations

  • Challenges:
    a. Memory-bound kernels
    b. Data hazards
    c. Random memory access

  • Implementations:
    a. Increase memory bandwidth
    b. Use atomic operations
    c. Take advantage of local memory

[Figure: the six major subroutines (hotspots) of GTC-P]

slide-5
SLIDE 5

OpenACC Implementations – present directive

SLIDE 6

OpenACC Implementations – atomic directive

SLIDE 7

Run the single OpenACC code base: a huge performance gap on x86 and Sunway.

Platform                          Baseline   Baseline (s)   OpenACC (s)   Slowdown
Sunway SW26010 (1 MPE / 64 CPE)   Serial     4.7            2360.5        504x slower!!!
GPU (NVIDIA K20)                  CUDA       7.9            16.7          2x slower
x86 multicore (Intel SNB)         OpenMP     7.9            1572.8        201x slower!

Note: the OpenMP baseline allocates a copy of the array on each thread and reduces the copies, so it needs no atomic operations. The slowdowns on x86 and Sunway are unacceptable.

SLIDE 8

Data Hazard

With threads T1–T4 depositing in parallel, several threads update the same entry of array[n]: an irregular memory access that requires a Fetch-and-Add atomic operation.

Our solution for multi- and many-core: use the thread id to give each thread its own copy, array[thread-id][n] for T1…T4, then reduce (add) the copies back into array[n], replacing the Fetch-and-Add atomic operation.

SLIDE 9

Performance w/o atomic operations on x86 CPU

  • Thread ID is not yet supported for x86 in the OpenACC standard.

  • A private function of the PGI compiler is used here instead: __pgi_blockidx().

[Chart: performance with PGI compiler 16.10 vs. the baseline]

SLIDE 10

Implementation on the Sunway many-core processor: acc_thread_id, a customized thread-id extension provided by Sunway OpenACC.

[Figure: architecture overview of SW26010]

SLIDE 11

Optimization on Sunway many-core processor: data locality in 64KB Scratch Pad Memory

  • Use the tile directive to coalesce data accesses so that each DMA request transfers a contiguous block; the optimum tile size makes full use of the 64 KB SPM.

[Chart: elapsed time (s) vs. tile_size; lower is better]

[Figure: memory hierarchy of the CPE, showing the SPM]

  • Keep data in the SPM instead of accessing global memory.

SLIDE 12

(*) Optimization on Sunway many-core processor

  • 256-bit SIMD intrinsics.

Compilation flow: OpenACC code → intermediate code (.host and .slave) → execution file, via swacc (the source-to-source compiler) and sw5cc (the native compiler). The "-keep" or "-dumpcommand" flag makes the compiler emit the intermediate code.

This part of the push kernel achieves a 5.6x speedup, but the kernel's cost is too small relative to the entire GTC-P code to matter much overall.

SLIDE 13

Performance on Sunway many-core processor

[Chart: elapsed time [sec] per kernel (Charge, Push, Poisson, Field, Smooth, Shift), 500–2500 s scale, for Sequential (MPE), OpenACC (CPE), +w/o atomics, +Tile, +SPM library; lower is better; 1.1x and 2.5x speedups over the OpenACC baseline]

  • Avoid atomic operations.
  • Increase DMA bandwidth.
  • Strengthen data locality in SPM.
  • (*) Built-in SIMD code.

SLIDE 14

Performance and portability of GTC-P on GPU


SLIDE 15

Use native atomic instructions on P100

  • Native atomic instructions (FP64) are supported on the Pascal architecture.

  • Compare the PTX code generated by the PGI 16.10 compiler on K80 and P100.
SLIDE 16

OpenACC version of GTC-P on K80 and P100

  • Performance of the OpenACC version on P100 is close to the CUDA code, due to the better atomic-instruction support.

  • OpenACC benefits from hardware support on the latest GPU architecture.

SLIDE 17

Use a GPU-specific algorithm in the OpenACC code

Remove the auxiliary array used to store the 4 points.
SLIDE 18

Performance results of OpenACC version with new algorithm on GPU

Device memory usage (Tesla K40 GPU):

Problem size       CUDA          OpenACC       new OpenACC
B 100 1, 2 GPUs    1399 MB/GPU   3070 MB/GPU   1501 MB/GPU
B 100 1, 4 GPUs    742 MB/GPU    1569 MB/GPU   785 MB/GPU

The new version reduces device memory usage by about 50% relative to the original OpenACC code.

SLIDE 19

[Chart: core number per processor, 10–1000, annotated with "hardware support for key operations" and "gap of memory hierarchy"]

SLIDE 20
Summary

  • Architecture-specific optimizations are necessary to reach reasonable performance with the GTC-P code.

  • With native atomic support, OpenACC code on the GPU now achieves better performance than the same operations do on multi- and many-core processors.

  • The gap in memory hierarchies between architectures may call for different algorithms in the OpenACC code.

SLIDE 21

References

  • Stephen Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. "Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC." HPC China, 2016. Best Paper Award (acceptance rate < 3%).

  • Yueming Wei, Stephen Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. "Performance and Portability Studies with OpenACC Accelerated Version of GTC-P." PDCAT, 2016.