Achieving Portable Performance for GTC-P with OpenACC on GPU, Multi-core CPU, and Sunway Many-core Processor

Stephen Wang†1, James Lin†1,4, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3 and Satoshi Matsuoka†4

†1 Shanghai Jiao Tong University, Center for HPC
†2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
†3 NVIDIA Corporation
†4 Tokyo Institute of Technology

GTC 2017, San Jose, USA, May 11, 2017
Challenges of supporting multi- and many-cores, the territory of OpenMP

[Figure: core-count trend across processor generations; y-axis "Core Number" on a log scale (10, 100, 1000)]
GTC-P: Gyrokinetic Toroidal Code - Princeton
• Developed by Princeton to accelerate progress in highly scalable plasma-turbulence HPC Particle-in-Cell (PIC) codes
• Successfully applied to high-resolution problem-size-scaling studies relevant to fusion's next-generation International Thermonuclear Experimental Reactor (ITER)
• A modern "co-design" version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scales, with unprecedented resolution & speed, on a variety of architectures worldwide
• Runs on present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., that feature GPU, multicore-CPU, and many-core processors
• KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing Conference (SC16), Salt Lake City, Utah, USA, 2016
OpenACC Implementations
• Challenges
  a. Memory-bound kernels
  b. Data hazards
  c. Random memory access
• Implementations
  a. Increase memory bandwidth
  b. Use atomic operations
  c. Take advantage of local memory

[Figure: hotspots of the six major subroutines of GTC-P]
OpenACC Implementations – present directive
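The listing on this slide is an image; below is a minimal sketch of the pattern it illustrates, with hypothetical array and function names: the arrays are placed on the device once with a data region, and each kernel then asserts residency with present instead of re-copying at every call.

#include <stdlib.h>

/* Hypothetical push-like kernel: 'present' asserts the arrays are
   already on the device, so no host-device copy happens per call. */
void push_sketch(double *restrict xp, double *restrict vp,
                 const double *restrict efield, long n, double dt)
{
    #pragma acc parallel loop present(xp[0:n], vp[0:n], efield[0:n])
    for (long i = 0; i < n; i++) {
        vp[i] += dt * efield[i];
        xp[i] += dt * vp[i];
    }
}

int main(void)
{
    long n = 1 << 20;
    double *xp = malloc(n * sizeof *xp);
    double *vp = malloc(n * sizeof *vp);
    double *ef = malloc(n * sizeof *ef);
    for (long i = 0; i < n; i++) { xp[i] = 0.0; vp[i] = 0.0; ef[i] = 1.0; }

    /* One data region for many kernel launches: copy in once, copy out once. */
    #pragma acc data copy(xp[0:n], vp[0:n]) copyin(ef[0:n])
    {
        for (int step = 0; step < 100; step++)
            push_sketch(xp, vp, ef, n, 0.01);
    }
    free(xp); free(vp); free(ef);
    return 0;
}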
OpenACC Implementations – atomic directive
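The slide's code is again an image; this is a minimal sketch of the data hazard it addresses, with hypothetical names: many particles can deposit charge into the same grid cell, so when threads share the grid the update needs the atomic directive.

/* Hypothetical charge-deposition scatter: atomic serializes conflicting
   updates to the same cell (the fetch-and-add discussed below). */
void charge_sketch(double *restrict grid, const int *restrict cell,
                   const double *restrict w, long nparticles, long ncells)
{
    #pragma acc parallel loop present(grid[0:ncells], \
            cell[0:nparticles], w[0:nparticles])
    for (long i = 0; i < nparticles; i++) {
        #pragma acc atomic update
        grid[cell[i]] += w[i];
    }
}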
Running the single OpenACC code base: huge performance gaps on x86 and Sunway
• GPU (NVIDIA K20) — baseline: CUDA
  CUDA 7.9 s vs. OpenACC 16.7 s elapsed → 2x slower
• x86 multicore (Intel SNB) — baseline: OpenMP
  OpenMP 7.9 s vs. OpenACC 1572.8 s elapsed → 201x slower!
  (The OpenMP version allocates a copy of the array on each thread and reduces, without atomic operations.)
• Sunway many-core (SW26010) — baseline: serial code on 1 MPE
  Serial 4.7 s vs. OpenACC on 64 CPEs 2360.5 s elapsed → 504x slower — unacceptable!
Our solution for multi- and many-core: use the thread id to duplicate copies for reduction, replacing the fetch-and-add atomic operation

[Figure: threads T1..T4 each deposit into a private copy array[thread-id][n], removing the data hazard of irregular fetch-and-add access to the shared array[n]; the copies are then reduced (added) into array[n]]
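A sketch of this replicate-and-reduce scheme in standard OpenACC (all names hypothetical): each replica deposits into its own private copy of the grid, then the copies are summed, so neither loop needs atomics.

#define NT 64   /* number of replicas, e.g. one per gang/CPE */

void charge_replicated(double *restrict copies,   /* [NT][ncells], zeroed */
                       double *restrict grid,
                       const int *restrict cell,
                       const double *restrict w,
                       long nparticles, long ncells)
{
    /* Phase 1: replica t handles one slice of particles and writes
       only to its own copy, so there is no data hazard. */
    #pragma acc parallel loop gang num_gangs(NT) \
            present(copies[0:NT*ncells], cell[0:nparticles], w[0:nparticles])
    for (int t = 0; t < NT; t++) {
        long lo = t * nparticles / NT;
        long hi = (t + 1) * nparticles / NT;
        for (long i = lo; i < hi; i++)
            copies[(long)t * ncells + cell[i]] += w[i];
    }

    /* Phase 2: reduce the NT copies into the real grid, parallel over cells. */
    #pragma acc parallel loop \
            present(copies[0:NT*ncells], grid[0:ncells])
    for (long c = 0; c < ncells; c++) {
        double s = 0.0;
        for (int t = 0; t < NT; t++)
            s += copies[(long)t * ncells + c];
        grid[c] += s;
    }
}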
Performance w/o atomic operations on x86 CPU
• A thread id is not yet supported for x86 in the OpenACC standard.
• A private (internal) function of the PGI compiler is used here: __pgi_blockidx()
• Compiler: PGI 16.10

[Figure: performance compared against the baseline]
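A heavily hedged sketch of the x86 variant: the OpenACC standard exposes no thread id, so the slide relies on a PGI-internal helper. Its exact prototype is not publicly documented; a zero-argument form returning the current gang/block index is assumed here, and all array names are hypothetical.

extern int __pgi_blockidx(void);   /* ASSUMED prototype of the PGI-internal helper */

void charge_pgi_tid(double *restrict copies,      /* [ncopies][ncells] */
                    const int *restrict cell, const double *restrict w,
                    long nparticles, long ncells, int ncopies)
{
    #pragma acc parallel loop gang num_gangs(ncopies) \
            present(copies[0:(long)ncopies*ncells], \
            cell[0:nparticles], w[0:nparticles])
    for (long i = 0; i < nparticles; i++) {
        /* On the PGI multicore target a gang is one OS thread, so the
           selected copy is private to this thread and no atomic is needed. */
        int tid = __pgi_blockidx();
        copies[(long)tid * ncells + cell[i]] += w[i];
    }
}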
Implementation on the Sunway many-core processor: a customized thread-id extension available in Sunway OpenACC
• acc_thread_id is a customized extension provided by Sunway OpenACC.

[Figure: architecture overview of the SW26010]
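The Sunway counterpart, sketched under the assumption that acc_thread_id (the extension named on the slide) is callable like a function returning the CPE id 0..63; array names are hypothetical.

void charge_sunway(double *restrict copies,       /* [64][ncells], one per CPE */
                   const int *restrict cell, const double *restrict w,
                   long nparticles, long ncells)
{
    #pragma acc parallel loop \
            present(copies[0:64*ncells], cell[0:nparticles], w[0:nparticles])
    for (long i = 0; i < nparticles; i++) {
        int tid = acc_thread_id();   /* Sunway OpenACC extension: CPE id */
        copies[(long)tid * ncells + cell[i]] += w[i];   /* private copy, no atomic */
    }
}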
Optimization on the Sunway many-core processor: data locality in the 64KB Scratch Pad Memory (SPM)
• Use the tile directive to coalesce data accesses into one DMA request per tile. The optimum tile size makes full use of the 64KB SPM.
• Keep data in the SPM instead of accessing global memory.

[Figure: memory hierarchy of the CPE; elapsed time (s) vs. tile_size, lower is better]
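A sketch of the tiling idea with illustrative sizes: the tile clause itself is standard OpenACC; mapping each tile to one DMA request into the 64KB SPM is Sunway OpenACC behavior as described on the slide.

/* 8 x 64 doubles per array per tile keeps the working set
   well under the 64KB SPM (sizes are illustrative only). */
void stencil_tiled(double *restrict out, const double *restrict in,
                   int ni, int nj)
{
    #pragma acc parallel loop tile(8, 64) \
            present(out[0:ni*nj], in[0:ni*nj])
    for (int i = 1; i < ni - 1; i++)
        for (int j = 0; j < nj; j++)
            out[i * nj + j] =
                0.5 * (in[(i - 1) * nj + j] + in[(i + 1) * nj + j]);
}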
(*) Optimization on the Sunway many-core processor: 256-bit SIMD intrinsics
• Compilation flow: OpenACC code → swacc (source-to-source compiler) → intermediate code (.host and .slave) → sw5cc (native compiler) → executable
• The "-keep" or "-dumpcommand" option makes the compiler keep the intermediate code.
• SIMD intrinsics in the push kernel achieve a 5.6x speedup on that part, but the cost of this kernel is too small relative to the entire GTC-P code.
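A hedged sketch of what such hand-vectorized CPE code looks like, assuming the Sunway <simd.h> vector extensions (doublev4, simd_set_doublev4, simd_load, simd_store) as commonly described for the SW26010; n is assumed to be a multiple of 4 and the pointers 32-byte aligned.

#include <simd.h>

void axpy_simd(double *restrict y, const double *restrict x,
               double a, long n)
{
    doublev4 va = simd_set_doublev4(a, a, a, a);   /* broadcast scalar */
    for (long i = 0; i < n; i += 4) {
        doublev4 vx, vy;
        simd_load(vx, (double *)&x[i]);    /* 256-bit (4 x double) load */
        simd_load(vy, &y[i]);
        vy = vy + va * vx;
        simd_store(vy, &y[i]);             /* 256-bit store */
    }
}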
Performance on the Sunway many-core processor
• Avoid atomic operations
• Increase DMA bandwidth
• Strengthen data locality in the SPM
• (*) In-built SIMD code

[Figure: elapsed time (s) per kernel (Charge, Push, Poisson, Field, Smooth, Shift), lower is better, across Sequential (MPE), OpenACC (CPE), +Tile, +SPM library, and +w/o atomics configurations; 1.1x and 2.5x speedups annotated]
Performance and portability of GTC-P on GPU
Use native atomic instructions on P100
• Native FP64 atomic instructions are supported by the Pascal architecture.
• Compare the PTX code generated by the PGI 16.10 compiler on K80 and P100.
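For intuition, here is what the compiler must emit on Kepler-class GPUs, sketched in portable C11: an FP64 fetch-and-add emulated by a compare-and-swap loop on the raw 64-bit pattern. Pascal collapses this entire loop into a single native FP64 atomic-add instruction, which is where the speedup comes from.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* CAS-loop emulation of atomic 'x += val' on a double. */
static double cas_add_double(_Atomic uint64_t *addr, double val)
{
    uint64_t old_bits = atomic_load(addr);
    for (;;) {
        double old_val, new_val;
        memcpy(&old_val, &old_bits, sizeof old_val);   /* bit-cast */
        new_val = old_val + val;
        uint64_t new_bits;
        memcpy(&new_bits, &new_val, sizeof new_bits);
        /* Succeeds only if *addr still holds old_bits; otherwise
           old_bits is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(addr, &old_bits, new_bits))
            return old_val;
    }
}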
OpenACC version of GTC-P on K80 and P100
• On P100, the performance of the OpenACC version is close to the CUDA code thanks to the better atomic-instruction support.
• OpenACC benefits from the hardware support in the latest GPU architecture.
Use a GPU-specific algorithm in the OpenACC code
• Remove the auxiliary array used to store the 4 points.
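A sketch of the restructured deposition (all names hypothetical, and the 4 points are assumed to lie on the gyro-ring at 90-degree spacing): the points are recomputed inside the loop instead of being precomputed into an auxiliary array of 4 entries per particle, trading a few flops for the memory savings shown on the next slide.

void charge_recompute(double *restrict grid, const double *restrict x,
                      const double *restrict y, const double *restrict rho,
                      const double *restrict w, long nparticles,
                      int ni, int nj, double dx, double dy)
{
    #pragma acc parallel loop present(grid[0:ni*nj], x[0:nparticles], \
            y[0:nparticles], rho[0:nparticles], w[0:nparticles])
    for (long k = 0; k < nparticles; k++) {
        /* 4 gyro-ring offsets, recomputed on the fly (assumes the
           points stay inside the grid). */
        double ox[4] = { rho[k], 0.0, -rho[k], 0.0 };
        double oy[4] = { 0.0, rho[k], 0.0, -rho[k] };
        for (int p = 0; p < 4; p++) {
            int ci = (int)((x[k] + ox[p]) / dx);
            int cj = (int)((y[k] + oy[p]) / dy);
            #pragma acc atomic update
            grid[ci * nj + cj] += 0.25 * w[k];
        }
    }
}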
Performance results of the OpenACC version with the new algorithm on GPU

Device memory usage, problem size B100, Tesla K40 GPUs:

              CUDA           OpenACC        new OpenACC
  2 x GPU     1399 MB/GPU    3070 MB/GPU    1501 MB/GPU
  4 x GPU      742 MB/GPU    1569 MB/GPU     785 MB/GPU

→ ~50% reduction in device memory usage vs. the original OpenACC version.
[Figure: core-count trend chart, y-axis "Core Number" on a log scale (10, 100, 1000)]
• Hardware support for key operations
• Gap of memory hierarchy
Summary
• Architecture-specific optimizations are necessary to achieve reasonable performance with the GTC-P code.
• Native atomic support on the GPU now lets the OpenACC code achieve better performance than the same operations do on multi- and many-core processors.
• The gap in memory hierarchy between architectures may require different algorithms in the OpenACC code.
References
• Stephen Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka, "Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC", HPC China, 2016. Best Paper Award (acceptance rate < 3%).
• Yueming Wei, Stephen Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin, "Performance and Portability Studies with OpenACC Accelerated Version of GTC-P", PDCAT, 2016.