

  1. Mary Hall October 24, 2017

  2. • Postdoctoral Researcher Opening: ~Jan. 2018 • >10 Open Faculty Positions, 1 in Programming Languages

  3. Stencils and Geometric Multigrid: Protonu Basu, Sam Williams, Brian Van Straalen, Lenny Oliker, Phil Colella
     Sparse Matrix Computations: Anand Venkat, Khalid Ahmad, Michelle Strout, Huihui Zhang
     Tensor Contractions: Thomas Nelson, Axel Rivera (Intel), Prasanna Balaprakash, Paul Hovland, Liz Jessup, Boyana Norris
     Funded in part by the Department of Energy Office of Advanced Scientific Computing Research under award DE-SC0008682 and Scientific Discovery through Advanced Computation (SciDAC) award DE-SC0006947, and by the National Science Foundation under award CCF-1018881.

  4. Which version would you prefer to write?

     Code A: miniGMG baseline smooth operator, approximately 13 lines of code:

       /* Laplacian 7-point variable-coefficient stencil */
       for (k=0; k<N; k++)
         for (j=0; j<N; j++)
           for (i=0; i<N; i++)
             temp[k][j][i] = b * h2inv * (
                 beta_i[k][j][i+1] * ( phi[k][j][i+1] - phi[k][j][i]   )
               - beta_i[k][j][i]   * ( phi[k][j][i]   - phi[k][j][i-1] )
               + beta_j[k][j+1][i] * ( phi[k][j+1][i] - phi[k][j][i]   )
               - beta_j[k][j][i]   * ( phi[k][j][i]   - phi[k][j-1][i] )
               + beta_k[k+1][j][i] * ( phi[k+1][j][i] - phi[k][j][i]   )
               - beta_k[k][j][i]   * ( phi[k][j][i]   - phi[k-1][j][i] ) );

       /* Helmholtz */
       for (k=0; k<N; k++)
         for (j=0; j<N; j++)
           for (i=0; i<N; i++)
             temp[k][j][i] = a * alpha[k][j][i] * phi[k][j][i] - temp[k][j][i];

       /* Gauss-Seidel Red-Black update */
       for (k=0; k<N; k++)
         for (j=0; j<N; j++)
           for (i=0; i<N; i++) {
             if ((i+j+k+color)%2 == 0)
               phi[k][j][i] = phi[k][j][i] - lambda[k][j][i] * (temp[k][j][i] - rhs[k][j][i]);
           }

     Code B: miniGMG optimized smooth operator, approximately 170 lines of code. Its added optimizations target:
     • Memory hierarchy: prefetch, data staged in registers/buffers, AVX SIMD intrinsics
     • Parallelism: ghost zones (trade off computation for communication), parallel wavefronts (reduce sweeps over the 3D grid), nested OpenMP and MPI, spin locks in OpenMP
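     To make the contrast concrete, the following is a minimal sketch (assumed code, not taken from miniGMG) of one Code-B-style ingredient applied to the baseline Laplacian loop above: OpenMP thread parallelism plus tiling of the j loop for cache reuse. TJ is a hypothetical, autotunable tile size; the real optimized operator layers many more transformations (wavefronts, SIMD intrinsics, ghost-zone tradeoffs) on top of this.

       /* Hedged sketch: OpenMP over k plus j-tiling; TJ is an assumed tuning parameter */
       #pragma omp parallel for
       for (int k = 0; k < N; k++)
         for (int jj = 0; jj < N; jj += TJ)
           for (int j = jj; j < (jj + TJ < N ? jj + TJ : N); j++)
             for (int i = 0; i < N; i++)
               temp[k][j][i] = b * h2inv * (
                   beta_i[k][j][i+1] * ( phi[k][j][i+1] - phi[k][j][i]   )
                 - beta_i[k][j][i]   * ( phi[k][j][i]   - phi[k][j][i-1] )
                 + beta_j[k][j+1][i] * ( phi[k][j+1][i] - phi[k][j][i]   )
                 - beta_j[k][j][i]   * ( phi[k][j][i]   - phi[k][j-1][i] )
                 + beta_k[k+1][j][i] * ( phi[k+1][j][i] - phi[k][j][i]   )
                 - beta_k[k][j][i]   * ( phi[k][j][i]   - phi[k-1][j][i] ) );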

  5. And now GPU code? Code C: miniGMG optimized smooth operator for GPU, 308 lines of code for just the kernel.

  6. Which version would you prefer to write?

     Code A: Multiple SpMV computations (SpMM), 7 lines of code:

       /* SpMM from LOBCG on a symmetric matrix */
       for (i = 0; i < n; i++) {
         for (j = index[i]; j < index[i+1]; j++)
           for (k = 0; k < m; k++)
             y[i][k] += A[j] * x[col[j]][k];
         /* transposed computation exploiting symmetry */
         for (j = index[i]; j < index[i+1]; j++)
           for (k = 0; k < m; k++)
             y[col[j]][k] += A[j] * x[i][k];
       }

     Code B: Manually-optimized SpMM from LOBCG, 2109 lines of code. Its optimizations include:
     • Convert matrix format: CSR → CSB
     • OpenMP with scheduling pragmas, AVX SIMD
     • Indexing simplification
     • 11 different block sizes/implementations
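     As a rough sketch of one of those steps (assumed code, not drawn from the 2109-line implementation): the non-transposed half parallelizes cleanly over rows with an OpenMP scheduling clause, while the symmetric update y[col[j]][k] += ... writes into rows owned by other threads, which is one reason the hand-tuned version converts to a blocked format (CSB) before parallelizing.

       /* Hedged sketch: row-parallel SpMM for the non-transposed half only;
          the schedule(dynamic, 64) chunk size is an assumption, not a tuned value */
       #pragma omp parallel for schedule(dynamic, 64)
       for (int i = 0; i < n; i++)
         for (int j = index[i]; j < index[i+1]; j++)
           for (int k = 0; k < m; k++)
             y[i][k] += A[j] * x[col[j]][k];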

  7. Which version would you prefer to write?

     Code A: 1-line mathematical representation (input to OCTOPI):

       /* local_grad_3 computation from nek5000 */
       w[nelt i j k] += Dt[l k] U[nelt n m l] D[j m] D[i n]

     Code B: local_grad3 from nek5000, generated CUDA code plus harness, 122 lines of code.
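     Written out as ordinary C loops, that one line corresponds to something like the sketch below (P, the number of points per dimension in an element, and the array layouts are assumptions; this is an illustration, not the OCTOPI-generated code):

       /* Hedged sketch: naive loop nest for w[e,i,j,k] += sum over l,m,n of
          Dt[l,k] * U[e,n,m,l] * D[j,m] * D[i,n] */
       for (int e = 0; e < nelt; e++)
         for (int i = 0; i < P; i++)
           for (int j = 0; j < P; j++)
             for (int k = 0; k < P; k++) {
               double sum = 0.0;
               for (int l = 0; l < P; l++)
                 for (int m = 0; m < P; m++)
                   for (int n = 0; n < P; n++)
                     sum += Dt[l][k] * U[e][n][m][l] * D[j][m] * D[i][n];
               w[e][i][j][k] += sum;
             }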

  8. Code B/C is not unusual
     • Performance portability? Particularly across fundamentally different CPU and GPU architectures
     • Programmer productivity? High-performance implementations will require low-level specification that exposes the architecture
     • Software maintainability and portability? May require multiple implementations of the application
     Current solutions:
     • Follow MPI and OpenMP standards: the same code is unlikely to perform well across CPU and GPU, and vendor C and Fortran compilers are not optimized for HPC workloads
     • Some domain-specific framework strategies (libraries, C++ template expansion, standalone DSLs): not composable with other optimizations

  9. Our Approach
     • CHiLL: polyhedral compiler transformation and code generation framework with domain-specific specialization (supports C-like C++)
       • Target is loop-based scientific applications
       • Composable transformations
     • Optimization strategy can be specified or derived with transformation recipes
       • Optimization parameters also exposed
       • Separates code from mapping!
     • Autotuning
       • Systematic exploration of alternate transformation recipes and their optimization parameter values
       • Search technology to prune the combinatorial space
     Goal: automate the process of generating Code B from Code A.

     Running example of a loop nest and its iteration space:

       for (i=0;i<N;i++) {
         for (j=1;j<M;j++) {
       S0:   a[i][j] = b[j] - a[i][j-1];

       I = {[i,j] | 0<=i<N ∧ 1<=j<M}

  10. • Immediate: improve performance of production applications
      • Medium term: new research ideas
      • Long term:
        • Change the workflow for HPC application development
        • Move facilities toward more rapid adoption of new tools
        • Impact compiler and autotuning technology
      • Projects:
        • DOE Exascale Computing Project
        • DOE Scientific Discovery through Advanced Computing
        • NSF Blue Waters PAID project

  11. a. Original loop nest and iteration space:

        for (i=0;i<N;i++) {
          for (j=1;j<M;j++) {
        S0:   a[i][j] = b[j] - a[i][j-1];

        I = {[i,j] | 0<=i<N ∧ 1<=j<M}

      b. Statement macro:

        #define S0(i,j) a[(i)][(j)] = b[(j)] - a[(i)][(j-1)]

      c. Transformed loop nest:

        T = {[i,j] → [j,i]}

        for (j=1;j<M;j++) {
          for (i=0;i<N;i++) {
        S0:   a[i][j] = b[j] - a[i][j-1];

      d. Dependence relation for S0:

        {[i,j] → [i',j'] | 0<=i,i'<N ∧ 1<=j,j'<M ∧ (i=i' ∧ j=j'-1)}
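      A small sketch of how these pieces combine (illustrative, following the slide's own example): CHiLL scans the transformed iteration space to regenerate the loop structure and re-inserts the statement through its macro, so the generated code for the interchanged nest looks roughly like:

        #define S0(i,j) a[(i)][(j)] = b[(j)] - a[(i)][(j-1)]

        /* loop order follows T = {[i,j] -> [j,i]}; the interchange is legal because
           the dependence (i=i', j=j'-1) remains lexicographically positive afterward */
        for (j = 1; j < M; j++)
          for (i = 0; i < N; i++)
            S0(i, j);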

  12. • Inspector/executor methodology
        • The inspector analyzes indirect accesses at runtime and/or reorganizes data and its representation
        • The executor is the reordered computation
        • Composes with polyhedral transformations
      • Inspector code: matrix format conversion, non-affine transformation and parallelization
      • Executor code: iterations are optimized and use the new representation
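      As an illustration of the pattern (a hedged sketch, not CHiLL-generated code): the inspector below builds level sets for a sparse lower-triangular solve in CSR form, and the executor then runs the levels in order, with the rows inside one level free to execute in parallel. The perm and level_ptr arrays that group rows by level are assumed to be produced from level[] by a counting sort that is omitted here.

        /* Inspector: level[i] = 1 + max level of any earlier row that row i reads */
        void inspector(int n, const int *rowptr, const int *col, int *level) {
          for (int i = 0; i < n; i++) {
            int lev = 0;
            for (int jj = rowptr[i]; jj < rowptr[i+1]; jj++) {
              int j = col[jj];
              if (j < i && level[j] + 1 > lev) lev = level[j] + 1;
            }
            level[i] = lev;
          }
        }

        /* Executor: solve Lx = b level by level; rows within a level are independent,
           so the k loop of each level could become an OpenMP parallel loop */
        void executor(int n, int nlevels, const int *rowptr, const int *col,
                      const double *val, const int *perm, const int *level_ptr,
                      double *x, const double *b) {
          for (int l = 0; l < nlevels; l++)
            for (int k = level_ptr[l]; k < level_ptr[l+1]; k++) {
              int i = perm[k];
              double s = b[i], diag = 1.0;
              for (int jj = rowptr[i]; jj < rowptr[i+1]; jj++) {
                if (col[jj] < i)       s -= val[jj] * x[col[jj]];
                else if (col[jj] == i) diag = val[jj];
              }
              x[i] = s / diag;
            }
        }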

  13. Three Application Domains, One Compiler
      • Sparse Linear Algebra
        • Specialize matrix representation: data transformations
        • Incorporate runtime information: inspector/executor
        • Support non-affine input/transformations
      • Stencils and GMG
        • Memory-bandwidth bound: communication-avoiding optimizations
        • Compute bound: eliminate redundant computation (partial sums)
      • Tensor Contractions
        • Reduce computation: reassociate
        • Optimize memory access pattern: modify loop order to best match data layout and memory hierarchy
        • Adjust parallelism
      Publications: [HIPC'13], [WOSC'13], [CGO'14], [WOSC'14], [PLDI'15], [ICPP'15], [IPDPS'15], [IPDPS'16], [LCPC'16], [IA^3'16], [SC'16], [PARCO'17]
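      To make the "reassociate" bullet concrete, the sketch below factors the nek5000 contraction from slide 7 into three smaller contractions through temporaries t1 and t2 (the temporaries, their layout, and this particular factorization order are assumptions; the compiler may choose a different one). Per element, the arithmetic drops from O(P^7) for the naive loops to O(P^5).

        /* Hedged sketch: w[e,i,j,k] += sum_n D[i,n] * ( sum_m D[j,m] * ( sum_l U[e,n,m,l] * Dt[l,k] ) ) */
        for (int e = 0; e < nelt; e++) {
          for (int n = 0; n < P; n++)            /* t1[n][m][k] = sum_l U[e][n][m][l] * Dt[l][k] */
            for (int m = 0; m < P; m++)
              for (int k = 0; k < P; k++) {
                double s = 0.0;
                for (int l = 0; l < P; l++) s += U[e][n][m][l] * Dt[l][k];
                t1[n][m][k] = s;
              }
          for (int n = 0; n < P; n++)            /* t2[n][j][k] = sum_m t1[n][m][k] * D[j][m] */
            for (int j = 0; j < P; j++)
              for (int k = 0; k < P; k++) {
                double s = 0.0;
                for (int m = 0; m < P; m++) s += t1[n][m][k] * D[j][m];
                t2[n][j][k] = s;
              }
          for (int i = 0; i < P; i++)            /* w[e][i][j][k] += sum_n t2[n][j][k] * D[i][n] */
            for (int j = 0; j < P; j++)
              for (int k = 0; k < P; k++) {
                double s = 0.0;
                for (int n = 0; n < P; n++) s += t2[n][j][k] * D[i][n];
                w[e][i][j][k] += s;
              }
        }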

  14. Communication Avoiding: Sometimes Code A Beats Code B!
      • Fused operations
      • Communication-avoiding wavefront
      • Parallelized (OpenMP)
      • Autotuning finds the best implementation for each box size:
        • wavefront depth
        • nested OpenMP configuration
        • inter-thread synchronization (barrier vs. point-to-point)
      • For fine grids (large arrays), CHiLL attains nearly a 4.5x speedup over baseline

      [Figure: GSRB Smooth (Edison), miniGMG with CHiLL. Speedup over baseline smoother (0.0x to 5.0x) vs. box size 64^3, 32^3, 16^3, 8^3, 4^3 (box size == level in the V-cycle), comparing CHiLL-generated, manually tuned, and baseline versions.]
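      The wavefront idea can be pictured with a small sketch (the smooth_plane() helper and the grid[] buffers are assumptions, not the actual miniGMG code): after one deep ghost-zone exchange, S smooth sweeps are pipelined over the k planes of a box, with the skew k = t - 2*s guaranteeing that every plane a sweep reads was produced on an earlier wavefront.

        /* Hedged sketch of a pipelined (wavefront) sequence of S Jacobi-style sweeps.
           smooth_plane(dst, src, k) updates plane k of dst from planes k-1..k+1 of src;
           grid[0..S] are hypothetical copies of the box, ghost zones already exchanged. */
        for (int t = 0; t < N + 2*(S-1); t++) {
          /* the (s, k) pairs with k = t - 2*s form one wavefront; each reads only
             planes written on earlier wavefronts, so the plane updates within a
             wavefront (and the i,j loops inside them) can run in parallel */
          for (int s = 0; s < S; s++) {
            int k = t - 2*s;
            if (k >= 0 && k < N)
              smooth_plane(grid[s+1], grid[s], k);
          }
        }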

  15. Retargetable and Performance Portable: Optimized Code A Can Beat Code C!
      • CHiLL can obviate the need for architecture-specific programming models like CUDA
      • CUDA-CHiLL took the sequential GSRB implementation (.c) and generated CUDA that runs on NVIDIA GPUs
      • CUDA-CHiLL autotuned over the thread block sizes and is ultimately 2% faster than the hand-optimized minigmg-cuda (Code C)
      • Adaptable to new GPU generations

      [Figure: GSRB Smooth on 64^3 boxes. Time in seconds vs. 2D thread blocks <TX,TY>, comparing CUDA-CHiLL, Handtuned, and Handtuned-VL; reported times range from roughly 4.77 to 5.22 seconds.]

  16. Example Transformation Recipes
      • These can be manually written (miniGMG, LOBCG) or automatically generated (tensor contraction)

      jacobi_box_4_64.py (27-pt stencil, 64^3 box size):

        from chill import *
        # select which computation to optimize
        source('jacobi_box_4_64.c')
        procedure('smooth_box_4_64')
        loop(0)
        original()
        # fuse wherever possible
        # create a parallel wavefront
        skew([0,1,2,3,4,5],2,[2,1])
        permute([2,1,3,4])
        # partial sum for high-order stencils and fuse result
        distribute([0,1,2,3,4,5],2)
        stencil_temp(0)
        stencil_temp(5)
        fuse([2,3,4,5,6,7,8,9],1)
        fuse([2,3,4,5,6,7,8,9],2)
        fuse([2,3,4,5,6,7,8,9],3)
        fuse([2,3,4,5,6,7,8,9],4)

      gsrb.lua (variable-coefficient GSRB, 64^3 box size):

        init("gsrb_mod.cu", "gsrb", 0, 0)
        dofile("cudaize.lua")   -- custom commands in lua
        -- set up parallel decomposition, adjust via autotuning
        TI=32
        TJ=4
        TK=64
        TZ=64
        tile_by_index(0, {"box","k","j","i"}, {TZ,TK,TJ,TI},
          {l1_control="bb", l2_control="kk", l3_control="jj", l4_control="ii"},
          {"bb","box","kk","k","jj","j","ii","i"})
        cudaize(0, "kernel_GPU",
          {_temp=N*N*N*N, _beta_i=N*N*N*N, _phi=N*N*N*N},
          {block={"ii","jj","box"}, thread={"i","j"}}, {})

  17. • Data layout
        • A brick is a 4x4x4 mini domain without a ghost zone
        • Application of a stencil reaches into other bricks (affinity is important)
        • Implemented with contiguous storage and adjacency lists
      • Optimization advantages
        • Flexible mapping to SIMD, threads
        • Rapid copying; simplifies scheduling and code generation; can improve ghost zone exchange
        • Better memory hierarchy behavior (including TLB on KNL)
      Collaboration with Tuowen Zhao (Utah), Protonu Basu, Sam Williams, Hans Johansen (LBNL)
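      A minimal sketch of what such a layout might look like in C (the struct, the neighbor numbering, and the accessor are illustrative assumptions, not the actual implementation):

        #define BDIM 4
        enum { NBR_EAST = 0 /* ... plus 25 more neighbor slots; numbering is illustrative */ };

        typedef struct {
          double cell[BDIM][BDIM][BDIM];   /* contiguous 4x4x4 storage, no ghost zone */
          int    neighbor[26];             /* adjacency list: indices of surrounding bricks */
        } Brick;

        /* Reading one cell past the +i face of brick b: follow the adjacency list
           into the neighboring brick instead of reading from a ghost region. */
        static inline double east_value(const Brick *bricks, int b, int k, int j) {
          return bricks[bricks[b].neighbor[NBR_EAST]].cell[k][j][0];
        }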
