Evaluation of Productivity and Performance of the XcalableACC Programming Language
LENS2015 International Workshop, Oct. 29, 2015
Masahiro Nakao (RIKEN AICS)
HA-PACS/TCA Cluster System
Each node has two CPUs and four GPUs (NVIDIA K20X), so we assign four processes to each node and each process handles one GPU.
http://www.ccs.tsukuba.ac.jp
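As an illustration of this process-to-GPU mapping, the following is a minimal sketch, not from the slides, of how each MPI process could select its own device through the OpenACC runtime API; the modulo mapping assumes that the four ranks of a node are numbered consecutively.

#include <mpi.h>
#include <openacc.h>

/* Hypothetical device selection: bind each MPI process to one of the
 * node's GPUs (four K20X on HA-PACS/TCA), assuming consecutive ranks
 * within a node. */
static void bind_process_to_gpu(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ngpus = acc_get_num_devices(acc_device_nvidia);    /* 4 per node here */
    acc_set_device_num(rank % ngpus, acc_device_nvidia);   /* one GPU per process */
}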
Objectives
Evaluate the performance and productivity of XcalableACC (XACC) with four benchmarks:
- HIMENO: incompressible fluid analysis code (a stencil computation)
- NPB CG: estimates the smallest eigenvalue of a symmetric positive-definite sparse matrix with the Conjugate Gradient method
- STREAM: measures sustainable memory bandwidth
- HPL: High Performance Linpack, which measures the floating-point rate of execution for solving a linear system of equations
Implementation of HIMENO
Only XMP and OpenACC directives are added to the sequential Himeno benchmark: define the distributed array with a halo region, transfer it to the accelerator, exchange the halo region on the accelerator, and parallelize the loop statement.

float p[I][J][K];
#pragma xmp template t(0:K-1,0:J-1,0:I-1)
#pragma xmp nodes n(1, NDY, NDX)
#pragma xmp distribute t(block, block, block) onto n
#pragma xmp align p[k][j][i] with t(i, j, k)
#pragma xmp shadow p[1:2][1:2][0:1]      /* define distributed array with halo region */

#pragma acc data copy(p) ..              /* transfer distributed array */
{
  ..
  #pragma xmp reflect (p) acc            /* exchange halo region on the accelerator */
  ..
  #pragma xmp loop (k,j,i) on t(k,j,i)   /* parallelize loop statement */
  #pragma acc parallel loop ..
  for(i=1; i<MIMAX; ++i)
    for(j=1; j<MJMAX; ++j){
      #pragma acc loop vector ..
      for(k=1; k<MKMAX; ++k){
        S0 = p[i+1][j][k] * ..;
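For comparison, a hand-written MPI + OpenACC version has to code the halo exchange itself. The following is a minimal sketch, not taken from the slides, of what a single-dimension, single-plane exchange corresponding to "#pragma xmp reflect (p) acc" might look like; the local extents IMAX/JMAX/KMAX and the neighbor ranks up/down are illustrative, and a CUDA-aware MPI such as MVAPICH2-GDR is assumed so that device addresses can be passed to MPI directly.

#include <mpi.h>

/* Hypothetical hand-written halo exchange along the first dimension only,
 * with a halo width of one plane; p is the local block of the distributed
 * array, already present on the GPU. */
static void exchange_halo_i(float p[IMAX][JMAX][KMAX], int up, int down)
{
#pragma acc host_data use_device(p)   /* hand MPI the device addresses */
    {
        /* first interior plane to "up", halo plane filled from "down" */
        MPI_Sendrecv(&p[1][0][0],      JMAX*KMAX, MPI_FLOAT, up,   0,
                     &p[IMAX-1][0][0], JMAX*KMAX, MPI_FLOAT, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* last interior plane to "down", halo plane filled from "up" */
        MPI_Sendrecv(&p[IMAX-2][0][0], JMAX*KMAX, MPI_FLOAT, down, 1,
                     &p[0][0][0],      JMAX*KMAX, MPI_FLOAT, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}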
Pingpong on HA-PACS/TCA
Device-memory-to-device-memory pingpong between neighboring nodes, comparing two communication paths:
- PEACH2: PCIe Gen2 x 8 links, 4 GB/s
- GPUDirect RDMA (MVAPICH2-GDR 2.0): InfiniBand 4xQDR x 2 rails, 8 GB/s
[Figure: latency (microseconds, lower is better) versus transfer size (8 B to 2 MB) for the two paths]
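A minimal sketch, not from the slides, of how such a device-to-device pingpong latency can be measured with a CUDA-aware MPI; buf is assumed to be a buffer that already resides in GPU memory.

#include <mpi.h>

/* Hypothetical pingpong measurement between two ranks; "buf" is assumed to
 * be device-resident, so a CUDA-aware MPI moves it GPU to GPU. */
static double pingpong_latency(void *buf, int nbytes, int rank, int peer, int iters)
{
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank < peer) {   /* one side sends first ... */
            MPI_Send(buf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {             /* ... the other replies */
            MPI_Recv(buf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / (2.0 * iters);   /* one-way latency in seconds */
}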
Performance of HIMENO (1/2)
Comparison of "XACC with PEACH2" and "XACC with GDR" (MVAPICH2-GDR) on 1 to 16 nodes, for two array sizes: p[64][64][128] and p[128][128][256].
"XACC with PEACH2" is faster than "XACC with GDR" for the small array p[64][64][128]; the charts annotate improvements of 14% in one configuration and 4% in the other.
[Figure: two charts of performance (GFlops, higher is better) versus number of nodes (1, 2, 4, 8, 16), one per array size]
Performance of HIMENO (2/2)
With GDR, XACC and OpenACC + MPI achieve almost the same performance: 1518 GFlops and 1507 GFlops at 16 nodes (64 GPUs).
The SLOC (source lines of code) of the XACC version, 198, is about 60% of the 328 lines of the OpenACC + MPI version.
[Figure: performance (GFlops) versus number of nodes (1 to 16) for XACC and OpenACC + MPI]
Implementation of NPB CG
The work vector is defined as a distributed array, the sparse matrix-vector product loop is parallelized, and the partial results are reduced among nodes directly on device memory:

double w[NA];
#pragma xmp nodes p(PROC_COLS,PROC_ROWS)
#pragma xmp nodes sub_p(PROC_COLS)=p(:,*)
#pragma xmp template t(0:NA-1,0:NA-1)
#pragma xmp distribute t(block, block) onto p
#pragma xmp align w[i] with t(*,i)              /* define distributed array */

for(cgit=1; cgit<=cgitmax; cgit++){
  rho0 = rho; d = 0.0; rho = 0.0;
  #pragma xmp loop on t(*,j)                    /* parallelize loop statement */
  #pragma acc parallel loop gang
  for(j=0; j<NA; j++){
    double sum = 0.0;
    int rowstr_j  = rowstr[j];
    int rowstr_j1 = rowstr[j+1];
    #pragma acc loop vector reduction(+:sum)
    for(k=rowstr_j; k<rowstr_j1; k++){
      sum = sum + a[k]*p[colidx[k]];
    }
    w[j] = sum;
  } // for j
  #pragma xmp reduction(+:w) on sub_p(:) acc    /* reduction among nodes on device memory */
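A minimal sketch, not from the slides, of what the last directive replaces in a hand-written MPI + OpenACC version: an all-reduce of w over the processes selected by sub_p(:), performed on device memory. The communicator row_comm is illustrative, and a CUDA-aware MPI is assumed.

#include <mpi.h>

/* Hypothetical hand-coded equivalent of
 * "#pragma xmp reduction(+:w) on sub_p(:) acc". */
static void reduce_w_on_device(double *w, int na, MPI_Comm row_comm)
{
#pragma acc host_data use_device(w)   /* hand MPI the device address of w */
    MPI_Allreduce(MPI_IN_PLACE, w, na, MPI_DOUBLE, MPI_SUM, row_comm);
}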
Performance of NPB CG
At 64 nodes (256 GPUs), XACC and OpenACC + MPI again perform almost identically: 246 Gops and 236 Gops.
The SLOC of the XACC version (609) is 79% of that of the OpenACC + MPI version (772).
[Figure: performance (Gops) versus number of nodes (1 to 64) for XACC and OpenACC + MPI]
Implementation of STREAM
STREAM measures sustainable memory bandwidth with the triad kernel a[i] = b[i] + scalar * c[i]. The XMP version runs the whole loop on the CPU, while the XACC version launches the first GSIZE elements on the GPU asynchronously and computes the remaining elements on the CPU at the same time.

XMP version:
#pragma xmp nodes p(*)
#pragma xmp barrier
time = -xmp_wtime();
#pragma omp parallel for
for(i=0; i<N; i++)
  a[i] = b[i] + scalar * c[i];
#pragma xmp barrier
time += xmp_wtime();
GBs = calc_performance(time);
#pragma xmp reduction(+:GBs)

XACC version:
#pragma xmp nodes p(*)
#pragma acc data copy(a[:GSIZE], b[:GSIZE], c[:GSIZE])
{
  #pragma xmp barrier
  time = -xmp_wtime();
  #pragma acc parallel loop async       /* GPU computes the first GSIZE elements */
  for(int j=0; j<GSIZE; j++)
    a[j] = b[j] + scalar * c[j];
  #pragma omp parallel for              /* CPU computes the remaining elements */
  for(i=GSIZE; i<N; i++)
    a[i] = b[i] + scalar * c[i];
  #pragma acc wait                      /* wait for the asynchronous GPU loop */
  #pragma xmp barrier
  time += xmp_wtime();
}
GBs = calc_performance(time);
#pragma xmp reduction(+:GBs)
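The slides do not show calc_performance; the sketch below is only a guess at the kind of value it returns. Assuming double-precision elements as in standard STREAM, the triad reads b and c and writes a, so roughly 3 * sizeof(double) * N bytes move per pass; the final xmp reduction then sums the per-process GB/s values.

/* Hypothetical per-process bandwidth calculation for the triad; not from
 * the slides. n_local is the local array length. */
static double triad_bandwidth_gbs(double seconds, size_t n_local)
{
    double bytes = 3.0 * sizeof(double) * (double)n_local;  /* read b, read c, write a */
    return bytes / seconds / 1.0e9;                          /* GB/s for this process */
}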
Performance of STREAM
The performance of XACC is 3.2 times better than that of XMP: 18,895 GB/s versus 6,067 GB/s at 64 nodes (256 GPUs). (Note that XACC uses both the GPUs and the CPUs, while XMP uses only the CPUs.)
SLOC: 90 for the XACC version and 78 for the XMP version.
[Figure: memory bandwidth (GB/s) versus number of nodes (1 to 64) for XACC and XMP]
Implementation of HPL
1. Block-cyclic distribution of the coefficient matrix A[N][N] over the node grid, with block size NB:

double A[N][N];
#pragma xmp nodes p(P,Q)
#pragma xmp template t(0:N-1, 0:N-1)
#pragma xmp distribute t(cyclic(NB), cyclic(NB)) onto p
#pragma xmp align A[i][j] with t(j,i)

2. Panel broadcast from host memory (A[N][N]) to device memory (L[N][NB]):

double L[N][NB];
#pragma xmp align L[i][*] with t(*,i)
#pragma acc enter data create(L[:][:])
 ..
#pragma xmp gmove acc(L)
L[k:len][0:NB] = A[k:len][k-NB:NB];

3. Matrix update: cuBLAS DGEMM, developed by NVIDIA.
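A minimal sketch, not from the slides, of what step 3 could look like: the trailing-submatrix update C := C - L * U via cublasDgemm. The variable names, dimensions, and the assumption that L, U, and C already reside in device memory in column-major order are illustrative.

#include <cublas_v2.h>

/* Hypothetical trailing-matrix update C = C - L * U with cuBLAS.
 * L is m x nb, U is nb x n, C is m x n; "handle" is created elsewhere. */
static void update_trailing_matrix(cublasHandle_t handle,
                                   const double *L, const double *U, double *C,
                                   int m, int n, int nb,
                                   int ldl, int ldu, int ldc)
{
    const double alpha = -1.0, beta = 1.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, nb,
                &alpha, L, ldl, U, ldu,
                &beta,  C, ldc);
}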
Performance of HPL
At 8 nodes, the XACC version reaches 11.6 TFlops (26%), the XMP version 3.3 TFlops, while the performance expected from the Top500 result is 34.6 TFlops (76%). The expected performance is based on the CUDA + MPI version of HPL developed by NVIDIA.
A possible reason for the gap is that the DGEMM kernel is different; this is under investigation.
SLOC: 437 for the XACC version and 343 for the XMP version.
[Figure: performance (GFlops) versus number of nodes (1 to 8) for XACC, XMP, and the expected performance from Top500]
Conclusion
Objective: evaluate the productivity and performance of XACC.
Evaluation:
- In HIMENO, XACC with PEACH2 outperforms XACC with MVAPICH2-GDR for small data sizes.
- The SLOC of the XACC versions is smaller than that of OpenACC + MPI, the typical programming model.
- The performance of XACC is almost the same as that of OpenACC + MPI, except for HPL; we are now tuning the XACC version of HPL.
Future plan: apply XACC to a real-world application, an N-body simulation in space science (in collaboration with Yohei Miki).