SLIDE 1

Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor

Stephen Wang†1, James Lin†1,4, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3 and Satoshi Matsuoka†4

†1 Shanghai Jiao Tong University, Center for HPC; †2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL); †3 NVIDIA Corporation; †4 Tokyo Institute of Technology


GTC 2017, San Jose, USA May 11, 2017

SLIDE 2

[Chart: core number per processor, 10–1000, across recent architectures]

Challenges of supporting multi- and many-core processors, the traditional territory of OpenMP

SLIDE 3

GTC-P: Gyrokinetic Toroidal Code - Princeton

  • Developed by Princeton to accelerate progress in highly scalable plasma turbulence HPC Particle-in-Cell (PIC) codes.

  • Successfully applied to high-resolution, problem-size-scaling studies relevant to fusion's next-generation International Thermonuclear Experimental Reactor (ITER).

  • A modern "co-design" version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scale with unprecedented resolution and speed on a variety of architectures worldwide.

  • Runs on present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., which feature GPU, multi-core CPU, and many-core processors.

  • KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing (SC16), Salt Lake City, Utah, USA, 2016.

SLIDE 4

OpenACC Implementations

  • Challenges:
    a. Memory-bound kernels
    b. Data hazards
    c. Random memory access

  • Implementations:
    a. Increase memory bandwidth
    b. Use atomic operations
    c. Take advantage of local memory

[Figure: the six major subroutines (hotspots) of GTC-P]

slide-5
SLIDE 5

OpenACC Implementations – present directive

SLIDE 6

OpenACC Implementations – atomic directive

SLIDE 7

Run the single OpenACC code base: a huge performance gap on x86 and Sunway.

Platform                          Baseline   Baseline (s)   OpenACC (s)   Slowdown
Sunway SW26010 (1 MPE / 64 CPE)   Serial     4.7            2360.5        504x slower!!!
GPU (NVIDIA K20)                  CUDA       7.9            16.7          2x slower
x86 multicore (Intel SNB)         OpenMP     7.9            1572.8        201x slower!

Note: the OpenMP baseline allocates a copy of the array on each thread and reduces the copies, so it needs no atomic operations. The slowdowns on x86 and Sunway are unacceptable.

SLIDE 8

Data Hazard

With threads T1–T4 depositing in parallel, several threads update the same entry of array[n]: an irregular memory access that requires a Fetch-and-Add atomic operation.

Our solution for multi- and many-core: use the thread id to give each thread its own copy, array[thread-id][n] for T1…T4, then reduce (add) the copies back into array[n], replacing the Fetch-and-Add atomic operation.

SLIDE 9

Performance w/o atomic operations on x86 CPU

  • Thread ID is not yet supported for x86 in the OpenACC standard.

  • A private function of the PGI compiler is used here instead: __pgi_blockidx().

[Chart: performance with PGI compiler 16.10 vs. the baseline]

SLIDE 10

Implementation on the Sunway many-core processor: acc_thread_id, a customized thread-id extension provided by Sunway OpenACC.

[Figure: architecture overview of SW26010]

SLIDE 11

Optimization on Sunway many-core processor: data locality in 64KB Scratch Pad Memory

  • Use the tile directive to coalesce data accesses so that each DMA request transfers a contiguous block; the optimum tile size makes full use of the 64 KB SPM.

[Chart: elapsed time (s) vs. tile_size; lower is better]

[Figure: memory hierarchy of the CPE, showing the SPM]

  • Keep data in the SPM instead of accessing global memory.

SLIDE 12

(*) Optimization on Sunway many-core processor

  • 256-bit SIMD intrinsics.

Compilation flow: OpenACC code → intermediate code (.host and .slave) → execution file, via swacc (the source-to-source compiler) and sw5cc (the native compiler). The "-keep" or "-dumpcommand" flag makes the compiler emit the intermediate code.

This part of the push kernel achieves a 5.6x speedup, but the kernel's cost is too small relative to the entire GTC-P code to matter much overall.

SLIDE 13

Performance on Sunway many-core processor

[Chart: elapsed time [sec] per kernel (Charge, Push, Poisson, Field, Smooth, Shift), 500–2500 s scale, for Sequential (MPE), OpenACC (CPE), +w/o atomics, +Tile, +SPM library; lower is better; 1.1x and 2.5x speedups over the OpenACC baseline]

  • Avoid atomic operations.
  • Increase DMA bandwidth.
  • Strengthen data locality in SPM.
  • (*) Built-in SIMD code.

SLIDE 14

Performance and portability of GTC-P on GPU


SLIDE 15

Use native atomic instructions on P100

  • Native atomic instructions (FP64) are supported on the Pascal architecture.

  • Compare the PTX code generated by the PGI 16.10 compiler on K80 and P100.
SLIDE 16

OpenACC version of GTC-P on K80 and P100

  • Performance of the OpenACC version on P100 is close to the CUDA code, due to the better atomic-instruction support.

  • OpenACC benefits from hardware support on the latest GPU architecture.

SLIDE 17

Use a GPU-specific algorithm in the OpenACC code

Remove the auxiliary array used to store the 4 points.
SLIDE 18

Performance results of OpenACC version with new algorithm on GPU

Device memory usage (Tesla K40 GPU):

Problem size       CUDA          OpenACC       new OpenACC
B 100 1, 2 GPUs    1399 MB/GPU   3070 MB/GPU   1501 MB/GPU
B 100 1, 4 GPUs    742 MB/GPU    1569 MB/GPU   785 MB/GPU

The new version reduces device memory usage by about 50% relative to the original OpenACC code.

SLIDE 19

[Chart: core number per processor, 10–1000, annotated with "hardware support for key operations" and "gap of memory hierarchy"]

SLIDE 20
Summary

  • Architecture-specific optimizations are necessary to reach reasonable performance with the GTC-P code.

  • With native atomic support, OpenACC code on the GPU now achieves better performance than the same operations do on multi- and many-core processors.

  • The gap in memory hierarchies between architectures may call for different algorithms in the OpenACC code.

SLIDE 21

References

  • Stephen Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. "Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC." HPC China, 2016. Best Paper Award (acceptance rate < 3%).

  • Yueming Wei, Stephen Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. "Performance and Portability Studies with OpenACC Accelerated Version of GTC-P." PDCAT, 2016.