

  1. Evaluation of Productivity and Performance of the XcalableACC programming language. LENS2015 INTERNATIONAL WORKSHOP, Oct. 29th, 2015. Masahiro Nakao (RIKEN AICS)

  2. HA-PACS/TCA Cluster System. Each node has four GPUs (NVIDIA K20X), so we assign four processes to each node and each process handles one GPU. (Node diagram: two CPUs and four GPUs per node.) http://www.ccs.tsukuba.ac.jp
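  The slides do not show how each process is attached to its GPU; the following is a minimal sketch of one common way to do it, assuming four GPUs per node and consecutive MPI ranks placed on the same node. The names and the use of the OpenACC runtime API are assumptions, not taken from the presentation.

      #include <mpi.h>
      #include <openacc.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          /* Four NVIDIA K20X GPUs per node; one MPI process per GPU. */
          const int gpus_per_node = 4;
          int local_rank = rank % gpus_per_node;             /* assumes block placement of ranks */
          acc_set_device_num(local_rank, acc_device_nvidia); /* bind this process to its own GPU */

          /* ... benchmark kernels run here, each process driving its own GPU ... */

          MPI_Finalize();
          return 0;
      }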

  3. Objectives. Evaluate the performance and productivity of XcalableACC (XACC) with four benchmarks. HIMENO: evaluates the performance of an incompressible fluid analysis code (a stencil code). NPB CG: solves for the minimum eigenvalue of a symmetric, positive-definite sparse matrix using the Conjugate Gradient method. STREAM: evaluates sustainable memory bandwidth. HPL: High Performance Linpack, which evaluates the floating-point rate of execution for solving a linear system of equations.

  4. Objectives (repeated as a section divider). Evaluate the performance and productivity of XcalableACC (XACC) with four benchmarks. HIMENO: evaluates the performance of an incompressible fluid analysis code (a stencil code). NPB CG: solves for the minimum eigenvalue of a symmetric, positive-definite sparse matrix using the Conjugate Gradient method. STREAM: evaluates sustainable memory bandwidth. HPL: High Performance Linpack, which evaluates the floating-point rate of execution for solving a linear system of equations.

  5. Implementation of HIMENO. Only XMP and OpenACC directives are added to the sequential Himeno benchmark: define the distributed array with a halo region, transfer the distributed array to the accelerator, exchange the halo region, and parallelize the loop statements.

      float p[I][J][K];
      #pragma xmp template t(0:K-1, 0:J-1, 0:I-1)
      #pragma xmp nodes n(1, NDY, NDX)
      #pragma xmp distribute t(block, block, block) onto n
      #pragma xmp align p[k][j][i] with t(i, j, k)     /* define distributed array   */
      #pragma xmp shadow p[1:2][1:2][0:1]              /* ... with halo region       */

      #pragma acc data copy(p) ..                      /* transfer array to accelerator */
      {
        ..
      #pragma xmp reflect (p) acc ..                   /* exchange halo region          */
      #pragma xmp loop (k,j,i) on t(k,j,i)             /* parallelize loop statement    */
      #pragma acc parallel loop ..
        for(i=1; i<MIMAX; ++i)
          for(j=1; j<MJMAX; ++j){
      #pragma acc loop vector ..
            for(k=1; k<MKMAX; ++k){
              S0 = p[i+1][j][k] * ..;
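  For comparison, the single reflect directive stands in for code like the following, which a hand-written OpenACC + MPI version needs for each direction of the halo exchange. This is an illustrative sketch, not the authors' code: the neighbour ranks (north, south), the plane indices, the staging through host memory, and the fixed array bounds (one of the sizes used on the slides) are assumptions.

      #include <mpi.h>

      #define MIMAX 128
      #define MJMAX 128
      #define MKMAX 256

      /* Exchange one i-plane of p with the neighbouring ranks: send the first inner
         plane "north" and receive the halo plane from "south". Assumes p is already
         present on the device from an enclosing acc data region. */
      void exchange_halo_i(float p[MIMAX][MJMAX][MKMAX], int north, int south)
      {
          int count = MJMAX * MKMAX;

      #pragma acc update host(p[1:1][0:MJMAX][0:MKMAX])        /* boundary plane to host */
          MPI_Sendrecv(&p[1][0][0],       count, MPI_FLOAT, north, 0,
                       &p[MIMAX-1][0][0], count, MPI_FLOAT, south, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      #pragma acc update device(p[MIMAX-1:1][0:MJMAX][0:MKMAX]) /* halo back to device   */
      }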

  6. Pingpong on HA-PACS/TCA. Device-memory-to-device-memory ping-pong latency (in microseconds, lower is better) between neighbouring nodes for transfer sizes from 8 bytes to 2 MB, comparing PEACH2 (PCIe Gen.2 x 8 links: 4 GB/s) with GPUDirect RDMA over InfiniBand 4xQDR x 2 rails (8 GB/s, MVAPICH2-GDR 2.0).

  7. Performance of HIMENO (1/2). Comparison of "XACC with PEACH2" and "XACC with GDR" (MVAPICH2-GDR) on 1 to 16 nodes for two array sizes, p[64][64][128] and p[128][128][256] (performance in GFlops, higher is better). "XACC with PEACH2" is better than "XACC with GDR" for p[64][64][128]; the improvements marked on the slide are 4% and 14%.

  8. Performance of HIMENO (2/2). At 16 nodes (64 GPUs) the two versions reach 1,518 GFlops and 1,507 GFlops (with GDR): the performance of XACC is almost the same as that of OpenACC + MPI. The SLOC (source lines of code) of XACC (198) is about 60% of that of OpenACC + MPI (328). (Plot: performance in GFlops vs. number of nodes, 1 to 16.)

  9. Objectives (repeated as a section divider). Evaluate the performance and productivity of XcalableACC (XACC) with four benchmarks. HIMENO: evaluates the performance of an incompressible fluid analysis code (a stencil code). NPB CG: solves for the minimum eigenvalue of a symmetric, positive-definite sparse matrix using the Conjugate Gradient method. STREAM: evaluates sustainable memory bandwidth. HPL: High Performance Linpack, which evaluates the floating-point rate of execution for solving a linear system of equations.

  10. Implementation of NPB CG. The distributed array is defined, the loop statements are parallelized, and the reduction among nodes is performed directly on device memory.

      double w[NA];
      #pragma xmp nodes p(PROC_COLS,PROC_ROWS)
      #pragma xmp nodes sub_p(PROC_COLS)=p(:,*)
      #pragma xmp template t(0:NA-1, 0:NA-1)
      #pragma xmp distribute t(block, block) onto p
      #pragma xmp align w[i] with t(*,i)               /* define distributed array */

      for(cgit=1; cgit<=cgitmax; cgit++){
        rho0 = rho; d = 0.0; rho = 0.0;
      #pragma xmp loop on t(*,j)                       /* parallelize loop statement */
      #pragma acc parallel loop gang
        for(j=0; j<NA; j++){
          double sum = 0.0;
          int rowstr_j  = rowstr[j];
          int rowstr_j1 = rowstr[j+1];
      #pragma acc loop vector reduction(+:sum)
          for(k=rowstr_j; k<rowstr_j1; k++){
            sum = sum + a[k]*p[colidx[k]];
          }
          w[j] = sum;
        } // for j
      #pragma xmp reduction(+:w) on sub_p(:) acc       /* reduction among nodes on device memory */
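  For comparison, the single line "#pragma xmp reduction(+:w) on sub_p(:) acc" replaces something like the following in a hand-written OpenACC + MPI version: an all-reduce of the partial vector w over the processes of one row of the process grid, operating on device memory. This is an illustrative sketch; row_comm and the reliance on a CUDA-aware MPI library are assumptions.

      #include <mpi.h>

      /* Sum the partial results in w across the ranks of row_comm, passing the
         device copy of w straight to MPI (requires a CUDA-aware MPI); assumes w
         is already present on the device. */
      void reduce_w_on_device(double *w, int na, MPI_Comm row_comm)
      {
      #pragma acc host_data use_device(w)
          MPI_Allreduce(MPI_IN_PLACE, w, na, MPI_DOUBLE, MPI_SUM, row_comm);
      }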

  11. Performance of NPB CG. At 64 nodes (256 GPUs) the two versions reach 246 Gops and 236 Gops: the performance of XACC is almost the same as that of OpenACC + MPI. The SLOC of XACC (609) is 79% of that of OpenACC + MPI (772). (Plot: performance in Gops vs. number of nodes, 1 to 64.)

  12. Objectives (repeated as a section divider). Evaluate the performance and productivity of XcalableACC (XACC) with four benchmarks. HIMENO: evaluates the performance of an incompressible fluid analysis code (a stencil code). NPB CG: solves for the minimum eigenvalue of a symmetric, positive-definite sparse matrix using the Conjugate Gradient method. STREAM: evaluates sustainable memory bandwidth. HPL: High Performance Linpack, which evaluates the floating-point rate of execution for solving a linear system of equations.

  13. Implementation of STREAM. STREAM evaluates sustainable memory bandwidth with the triad a[i] = b[i] + scalar * c[i]. The XMP version runs the triad on the CPU only; the XACC version splits the iteration space, computing the first GSIZE elements on the GPU (asynchronously) while the CPU computes the remaining elements.

      XMP version:
        #pragma xmp nodes p(*)
        #pragma xmp barrier
        time = -xmp_wtime();
        #pragma omp parallel for
        for(i=0; i<N; i++)
          a[i] = b[i] + scalar * c[i];
        #pragma xmp barrier
        time += xmp_wtime();
        GBs = calc_performance(time);
        #pragma xmp reduction(+:GBs)

      XACC version:
        #pragma xmp nodes p(*)
        #pragma acc data copy(a[:GSIZE], b[:GSIZE], c[:GSIZE])
        {
        #pragma xmp barrier
          time = -xmp_wtime();
        #pragma acc parallel loop async            /* GPU computes elements 0 .. GSIZE-1 */
          for(int j=0; j<GSIZE; j++)
            a[j] = b[j] + scalar * c[j];
        #pragma omp parallel for                   /* CPU computes elements GSIZE .. N-1 */
          for(i=GSIZE; i<N; i++)
            a[i] = b[i] + scalar * c[i];
        #pragma acc wait
        #pragma xmp barrier
          time += xmp_wtime();
        }
        GBs = calc_performance(time);
        #pragma xmp reduction(+:GBs)
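  The helper calc_performance is not shown on the slides. A plausible definition, assuming the standard STREAM accounting of three double-precision arrays touched per triad iteration (two reads, one write), is sketched below; the slide passes only the elapsed time, whereas here the local array length is made an explicit parameter for self-containedness.

      #include <stddef.h>

      /* Hypothetical sketch: bytes moved by one triad pass over n elements,
         divided by the elapsed time in seconds; returns GB/s. */
      static double calc_performance(double time_sec, size_t n)
      {
          double bytes = 3.0 * sizeof(double) * (double)n;
          return bytes / time_sec / 1.0e9;
      }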

  14. Performance of STREAM. The performance of XACC is 3.2 times better than that of XMP (note that XACC uses both the GPU and the CPU, whereas XMP uses only the CPU). At 64 nodes (256 GPUs), XACC reaches 18,895 GB/s and XMP 6,067 GB/s. SLOC: 90 for XACC, 78 for XMP. (Plot: performance in GB/s vs. number of nodes, 1 to 64.)

  15. Objectives (repeated as a section divider). Evaluate the performance and productivity of XcalableACC (XACC) with four benchmarks. HIMENO: evaluates the performance of an incompressible fluid analysis code (a stencil code). NPB CG: solves for the minimum eigenvalue of a symmetric, positive-definite sparse matrix using the Conjugate Gradient method. STREAM: evaluates sustainable memory bandwidth. HPL: High Performance Linpack, which evaluates the floating-point rate of execution for solving a linear system of equations.

  16. Implementation of HPL.

      1. Block-cyclic distribution of the coefficient matrix A[N][N] over the nodes (block size NB):
        double A[N][N];
        #pragma xmp nodes p(P,Q)
        #pragma xmp template t(0:N-1, 0:N-1)
        #pragma xmp distribute t(cyclic(NB), cyclic(NB)) onto p
        #pragma xmp align A[i][j] with t(j,i)

      2. Panel broadcast from host memory to device memory (A[N][N] resides on the host, the panel L[N][NB] on the device; the panel starts at row k and has length len):
        double L[N][NB];
        #pragma xmp align L[i][*] with t(*,i)
        #pragma acc enter data create(L[:][:])
        #pragma xmp gmove acc(L)
        L[k:len][0:NB] = A[k:len][k-NB:NB];

      3. Update of the matrix: the DGEMM of cuBLAS, developed by NVIDIA, is used.
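  The slides only name the cuBLAS DGEMM for step 3; below is a minimal sketch of what such a trailing-matrix update call could look like. The function name, the device pointer names (d_L, d_U, d_C), and the leading dimensions are assumptions, not taken from the presentation.

      #include <cublas_v2.h>

      /* Trailing-matrix update C := C - L * U on device memory via cuBLAS DGEMM. */
      void update_trailing_matrix(cublasHandle_t handle,
                                  const double *d_L, const double *d_U, double *d_C,
                                  int m, int n, int k, int ldl, int ldu, int ldc)
      {
          const double alpha = -1.0;
          const double beta  =  1.0;
          cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                      m, n, k, &alpha, d_L, ldl, d_U, ldu, &beta, d_C, ldc);
      }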

  17. Performance of HPL. The expected performance is derived from the Top500 result, which was measured with the CUDA + MPI version of HPL developed by NVIDIA. At 8 nodes, XMP reaches 3.3 TFlops and XACC 11.6 TFlops (26%), against an expected 34.6 TFlops (76%). The DGEMM kernel may be different; this is under investigation. SLOC: 437 for XACC, 343 for XMP. (Plot: performance in GFlops vs. number of nodes, 1 to 8.)

  18. Conclusion. Objective: evaluation of the productivity and performance of XACC. Evaluation: in HIMENO, XACC using PEACH2 is better than XACC using MVAPICH2-GDR for small data sizes; the SLOC of XACC is smaller than that of OpenACC + MPI, the typical programming model; and the performance of XACC is almost the same as that of OpenACC + MPI except for HPL (we are now tuning the XACC version of HPL). Future plan: a real-world application, an N-body simulation in the space science field (in collaboration with Yohei Miki).
