Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P
Yichao Wang†1, James Lin†1, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3
†1 Shanghai Jiao Tong University, Center for HPC
†2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
†3 NVIDIA Corporation
GTC 2018, San Jose, USA, March 27, 2018
Background
• Sunway TaihuLight is currently the No. 1 supercomputer on the Top500 list. In the near future, Summit at ORNL will be the next leap in leadership-class supercomputers. → Maintaining a single code base across different supercomputers.
• Real-world applications written with OpenACC can achieve portability across NVIDIA GPUs and Sunway processors; the GTC-P code serves as a case study. → We propose to analyze the performance gap between the OpenACC version and the native programming approach on these two architectures.
GTC-P: Gyrokinetic Toroidal Code - Princeton
• Developed by Princeton to accelerate progress in highly scalable plasma turbulence HPC Particle-in-Cell (PIC) codes
• A modern “co-design” version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scale with unprecedented resolution and speed on a variety of architectures worldwide
• Targets present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., that feature GPU, CPU multicore, and many-core processors
• KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler, et al., “Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide”, Proceedings of SC16 (Supercomputing 2016), Salt Lake City, Utah, USA
The case study of the GTC-P code with OpenACC
• Charge: particle-to-grid interpolation (SCATTER)
• Smooth/Poisson/Field: grid work (local stencil)
• Push:
  • grid-to-particle interpolation (GATHER)
  • update position and velocity
• Shift: in a distributed-memory environment, exchange particles among processors
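To make the SCATTER/GATHER terminology concrete, here is a minimal C sketch of the two access patterns. The array names and the nearest-grid-point interpolation are illustrative only; the real GTC-P charge and push kernels use a four-point gyro-averaged stencil on a toroidal grid.

/* Minimal sketch of the SCATTER (charge) and GATHER (push) patterns; not GTC-P source. */
void charge_scatter(int nparticles, const double *pos, double weight_scale,
                    double *density, int ngrid, double dx)
{
    for (int i = 0; i < nparticles; i++) {
        int cell = (int)(pos[i] / dx);          /* grid cell touched by particle i   */
        if (cell >= 0 && cell < ngrid)
            density[cell] += weight_scale;      /* data hazard once this is threaded */
    }
}

void push_gather(int nparticles, const double *pos, double *vel,
                 const double *field, int ngrid, double dx, double dt)
{
    for (int i = 0; i < nparticles; i++) {
        int cell = (int)(pos[i] / dx);
        if (cell >= 0 && cell < ngrid)
            vel[i] += dt * field[cell];         /* random reads, but no write hazard */
    }
}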
The case study of the GTC-P code with OpenACC
• Challenges
  a. Memory-bound kernels
  b. Data hazards
  c. Random memory access
• Methodology
  a. Reduce the demand on memory bandwidth
  b. Use atomic operations, or duplication and reduction
  c. Take full advantage of local memory
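A sketch of methodology (b) on the GPU side, assuming the simplified scatter loop shown earlier: the data hazard on the grid array is resolved with an OpenACC atomic update. This illustrates the directive usage only, not the exact GTC-P source.

/* Scatter with the write hazard resolved by an OpenACC atomic update (sketch). */
void charge_scatter_acc(int nparticles, const double *restrict pos,
                        double weight_scale,
                        double *restrict density, int ngrid, double dx)
{
    #pragma acc parallel loop present(pos[0:nparticles], density[0:ngrid])
    for (int i = 0; i < nparticles; i++) {
        int cell = (int)(pos[i] / dx);          /* random, data-dependent index      */
        if (cell >= 0 && cell < ngrid) {
            #pragma acc atomic update
            density[cell] += weight_scale;      /* colliding updates are serialized  */
        }
    }
}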
The performance of atomic operations on P100 and SW26010

NVIDIA GPU (P100)        CUDA    OpenACC
Elapsed time (s)          5.9      6.0
CUDA supports global atomics in a coalesced way by transposing in shared memory.

Sunway processor (SW26010)   Serial code on 1 MPE   OpenACC code on 64 CPEs
Elapsed time (s)                     4.7                  2360.5
The OpenACC code is about 504x slower, which is unacceptable. Atomic operations on SW26010 are implemented with a lock-and-unlock mechanism.
Performance evaluation on NVIDIA P100
• The native atomicAdd instruction is used on P100, instead of the compare-and-swap loop implemented with the atomicCAS instruction on K80.
• The performance gap of GTC-P between CUDA and OpenACC narrows with the hardware upgrade.
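For reference, this is the well-known emulation pattern behind that statement, shown as a minimal CUDA sketch (not taken from GTC-P): before compute capability 6.0 (e.g. the K80), double-precision atomicAdd had to be emulated with an atomicCAS loop, whereas the P100 provides it natively.

#include <cuda_runtime.h>

/* Compare-and-swap emulation of double-precision atomicAdd (pre-Pascal GPUs). */
__device__ double atomicAddCAS(double *addr, double val)
{
    unsigned long long int *addr_ull = (unsigned long long int *)addr;
    unsigned long long int old = *addr_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);                   /* retry if another thread won the race */
    return __longlong_as_double(old);
}

__global__ void deposit(double *density, const int *cell, double w, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
#if __CUDA_ARCH__ >= 600
        atomicAdd(&density[cell[i]], w);        /* native on P100 (sm_60) and newer     */
#else
        atomicAddCAS(&density[cell[i]], w);     /* emulated on K80 (sm_37)              */
#endif
    }
}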
Implementation of the OpenACC version on SW26010
• A duplication-and-reduction algorithm is used instead of atomic operations, implemented with the help of the global variable acc_thread_id.
• The tile directive is used to access data in a coalesced way through DMA requests and to fill the 64 KB LDM.
[Figure: data movement between main memory and the 64 KB LDM via DMA]
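A conceptual C sketch of the duplication-and-reduction scheme (not the actual Sunway OpenACC source; the acc_thread_id bookkeeping and the tile/DMA directives from the slide are only paraphrased, and data clauses are omitted): each of the 64 threads scatters into its own private copy of the grid, and the copies are summed afterwards, so no atomics are needed.

#define NTHREADS 64   /* one private grid copy per CPE */

void charge_dup_reduce(int nparticles, const double *pos, double w,
                       double *density,        /* shared grid, size ngrid           */
                       double *density_priv,   /* private copies, NTHREADS * ngrid  */
                       int ngrid, double dx)
{
    int chunk = (nparticles + NTHREADS - 1) / NTHREADS;

    /* Phase 1: hazard-free scatter -- thread t (a stand-in for acc_thread_id)
     * deposits its own particle chunk into its own copy of the grid. */
    #pragma acc parallel loop
    for (int t = 0; t < NTHREADS; t++) {
        int lo = t * chunk;
        int hi = (lo + chunk < nparticles) ? lo + chunk : nparticles;
        for (int i = lo; i < hi; i++) {
            int cell = (int)(pos[i] / dx);
            if (cell >= 0 && cell < ngrid)
                density_priv[t * ngrid + cell] += w;
        }
    }

    /* Phase 2: reduce the 64 private copies back into the shared grid. */
    #pragma acc parallel loop
    for (int c = 0; c < ngrid; c++) {
        double sum = 0.0;
        for (int t = 0; t < NTHREADS; t++)
            sum += density_priv[t * ngrid + c];
        density[c] += sum;
    }
}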
Performance evaluation of the OpenACC version on SW26010
• The performance is acceptable after removing the atomic operations on SW26010.
• Taking full advantage of the DMA bandwidth is the key factor for the memory-bound kernels.
• The Charge kernel is the hotspot of the OpenACC version.
[Figure: elapsed time (sec, lower is better) broken down by kernel (Charge, Push, Poisson, Field, Smooth, Shift) for Sequential (MPE), OpenACC (CPE), +Tile, +SPM library, and +w/o atomics; annotations: Baseline, 1.1X, 2.5X]
Register-level communication on SW26010
• The low-latency register communication mechanism operates within the 8x8 CPE cluster and is the key factor for exploiting data locality.
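A heavily hedged sketch of the row-broadcast idea behind register-level communication (RLC). REG_PUTR/REG_GETR below are placeholder names for the row put/get register primitives of the Sunway toolchain, and the column derivation from _MYID assumes the standard 8x8 layout; the point is only to show how a column-0 CPE can serve data to the other CPEs of its row through registers instead of main memory.

#include <slave.h>   /* Sunway slave-side (CPE) header; athread environment assumed */

/* REG_PUTR / REG_GETR are placeholders for the toolchain's row register
 * put/get primitives -- not the verbatim API. */
void row_serve_example(const double *ldm_slice, int nvals)
{
    int col = _MYID % 8;                 /* column within the 8x8 CPE cluster (assumed)       */
    for (int k = 0; k < nvals; k++) {
        double v;
        if (col == 0) {
            v = ldm_slice[k];            /* owner reads the value from its own LDM ...        */
            REG_PUTR(v, 8);              /* ... and broadcasts it along its row (placeholder)  */
        } else {
            REG_GETR(v);                 /* the other 7 CPEs in the row receive via registers  */
        }
        /* ... use v, e.g. accumulate a charge contribution ... */
    }
}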
The RLC optimization for the charge kernel on SW26010
[Figure: irregular memory access pattern in the charge kernel]
• The index values are preconditioned on the MPE and then transferred to the first column of the CPE cluster.
• Irregular accesses are handled on the remaining CPEs via row communication.
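One plausible reading of the MPE preconditioning step, shown as a hedged C sketch (the function and array names are hypothetical, and the real distribution of grid slices across CPE rows is more elaborate): the MPE bins each particle's grid index by the CPE row that owns that slice of the grid, so the column-0 CPE of each row can later serve those accesses over the register network.

/* Hypothetical sketch: bin particle indices by owning CPE row on the MPE. */
void precondition_indices(int nparticles, const int *cell,  /* grid cell touched by each particle */
                          int ngrid, int nrows,             /* nrows = 8 CPE rows                  */
                          int *binned,                      /* size nrows * nparticles             */
                          int *count)                       /* particles per row, size nrows       */
{
    int slice = (ngrid + nrows - 1) / nrows;      /* grid cells owned by each CPE row */
    for (int r = 0; r < nrows; r++)
        count[r] = 0;
    for (int i = 0; i < nparticles; i++) {
        int r = cell[i] / slice;                  /* row that owns this cell          */
        binned[r * nparticles + count[r]++] = i;  /* particle list for that row       */
    }
}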
The async optimization for the charge kernel on SW26010
• The irregular accesses handled by RLC on the CPE cluster and the remaining part, which does not fit in the SPM, run simultaneously.
• The overlap is tuned manually for best performance.
Performance tuning of the charge kernel on SW26010
[Figure: charge-kernel tuning results]
Finally, the native approach achieved around a 4x speedup over the OpenACC version on the SW26010 processor.
How does the OpenACC version of the GTC-P code scale on real supercomputers? (Early results)
Experimental results of the scaling evaluation on the GPU cluster at SJTU
[Figure: weak-scaling results]
Experimental results of the scaling evaluation on the Titan supercomputer
• One K20X GPU per node
• “Gemini” interconnect
• Strong-scaling runs are still to be done …
Experimental results of the scaling evaluation on the Sunway TaihuLight supercomputer
Summary
• The case study demonstrated the portability of OpenACC across the GPU and the Chinese home-grown many-core processor, although the algorithm had to be refactored for SW26010 compared with the GPU version.
• The performance gap between the OpenACC and CUDA versions of GTC-P on the NVIDIA P100 has narrowed with the hardware upgrade.
• The experiments showed that the performance gap on SW26010 cannot be ignored, due to the lack of an efficient general-purpose software cache on the CPE cluster. We designed a specific register-level communication scheme to address the problem.
References
• Performance and Portability Studies with OpenACC Accelerated Version of GTC-P. Yueming Wei, Yichao Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See, and James Lin. The 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Guangzhou, China, December 16-18, 2016.
• Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC. Yichao Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See, and Satoshi Matsuoka. Journal of Computer Research and Development, 2018, 55(4).