
Combining Polyhedral and AST Transformations in CHiLL

Huihui Zhang, Anand Venkat, Protonu Basu, Mary Hall. University of Utah. January 19, 2016.


  1. Combining Polyhedral and AST Transformations in CHiLL
     Huihui Zhang, Anand Venkat, Protonu Basu, Mary Hall
     University of Utah
     January 19, 2016

  2. Outline
     • Introduction
       • Problem
       • Limitations of polyhedral transformation
     • CHiLL Compiler Abstractions
       • Combining polyhedral and AST transformations
     • Case Studies
       • Inspector/executor transformation for sparse matrix computation
       • Partial sum transformation for stencil optimization
       • Parallel code generation
         • CUDA
         • OpenMP
     • Related Work
     • Conclusion

  3. Introduction
     • Limitation of typical polyhedral transformation
       • Limited to affine domain
       • Transform iteration spaces
         • Array indices of statements updated
     • Complicated optimizations
       • AST transformation as a post-pass outside of polyhedral framework
     • Challenges
       • Leverage the power of composability of polyhedral framework

  4. CHiLL Compiler Abstractions
     Input code:
         for(i=0; i < n; i++)
           s0: a[i+1]=a[i] + 5;
     Transformation: shift by 4
     CHiLL abstractions (loop transformation framework):
         Dep: <+1>
         Statement: s0: a[i+1]=a[i] + 5;
         IS: {[i] : 0 <= i < n}
         xform: {[i]->[0,i+4,0]}    (polyhedral)
         code: a[i+1]=a[i] + 5;
         xform_inv = {[i]->[i-4]}
     Generated code (after code generation):
         for(i=4; i < n+4; i++)
           s0: a[i-3]=a[i-4]+5;
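The shift on this slide can be checked in plain C. The sketch below is only an illustration, not CHiLL output: the function names `original_loop` and `shifted_loop` are hypothetical, and the array is assumed to hold at least n+1 elements. The second function uses the bounds of the shifted iteration space and substitutes xform_inv (i -> i-4) into each subscript, so both loops should write the same values.

```c
#include <assert.h>

/* Original loop from the slide: s0 runs for 0 <= i < n. */
void original_loop(int *a, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        a[i + 1] = a[i] + 5;
}

/* After xform {[i] -> [0, i+4, 0]}: the loop runs for 4 <= i < n+4 and
 * every use of the old iterator is rewritten through xform_inv (i - 4). */
void shifted_loop(int *a, int n) {
    assert(n >= 0);
    for (int i = 4; i < n + 4; i++)
        a[i - 3] = a[i - 4] + 5;
}
```

Calling both functions on identical copies of an array leaves the copies equal, which is the property the statement update is meant to preserve.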

  5. CHiLL Compiler Abstractions
     Input code:
         for(i=0; i < n; i++)
           s0: a[i+1]=a[i] + 5;
     CHiLL abstractions (loop transformation framework):
         Dep: <+1>
         Statement: s0: a[i+1]=a[i] + 5;
         IS: {[i] : 0 <= i < n}
         xform: {[i] -> [0,i,0]}    (polyhedral)
         code: AST
     Code generation: modified AST -> generated code

  6. Non-Affine Extension – Coalesce Transformation
     • Sparse matrix computation
       • Non-affine indirection through index arrays
         • Subscript expressions
           • x[col[j]]
         • Upper/lower loop bounds
           • index[i], index[i+1]
     • Uninterpreted function symbol abstraction
       • Model functions or mappings (non-affine)
     • Inspector/Executor mechanism
       • Inspector collects information at runtime
       • used by optimized executor
     CSR:
         for(i=0; i < n; i++)
           for(j=index[i];j<index[i+1];j++)
             y[i]+=a[j]*x[col[j]]

  7. Inspector Construction - Coalesce Transformation
     Input code:
         for(i=0; i < n; i++)
           for(j=index[i];j<index[i+1];j++)
             y[i]+=a[j]*x[col[j]]
     AST & iteration space manipulation:
         T_coalesce = {[i,j]->[k] | k=c(i,j) ∧ 0 ≤ k < NNZ}
     Generated data structure (AST):
         struct c {
           int c_inv[][2];
           int k;
           void create_mapping(int i, int j) {
             c_inv[k][0] = i;
             c_inv[k][1] = j;
             k++;
           }}
     Inspector code (AST code):
         for(i = 0; i < n; i++)
           for(j = index[i]; j < index[i+1]; j++)
             c.create_mapping(i,j);
     Executor code (polyhedral code):
         for (k = 0; k < NNZ; k++)
           y[c_inv[k][0]] += a[c_inv[k][1]]*x[col[c_inv[k][1]]];
     • Statement update
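The inspector/executor pair for coalescing can be sketched in plain C. This is an illustration under assumptions, not the code CHiLL generates: the slide's struct with a member function is replaced by a C struct `coalesce_map` plus free functions `inspect` and `execute` (hypothetical names), the caller is assumed to supply storage for at least NNZ rows of `c_inv`, and `spmv_csr` is a reference loop added only for comparison.

```c
#include <assert.h>

/* c_inv[k] holds the original (i, j) pair for coalesced index k. */
typedef struct {
    int (*c_inv)[2]; /* caller-provided storage, at least NNZ rows */
    int k;           /* number of pairs recorded so far */
} coalesce_map;

/* Inspector: walk the original CSR loop nest once and record each (i, j). */
void inspect(coalesce_map *c, int n, const int *index) {
    assert(n >= 0);
    c->k = 0;
    for (int i = 0; i < n; i++)
        for (int j = index[i]; j < index[i + 1]; j++) {
            c->c_inv[c->k][0] = i;
            c->c_inv[c->k][1] = j;
            c->k++;
        }
}

/* Executor: a single loop over the nonzeros, with i and j recovered from
 * c_inv (the "statement update" on the slide). */
void execute(const coalesce_map *c, double *y, const double *a,
             const double *x, const int *col) {
    for (int k = 0; k < c->k; k++) {
        int i = c->c_inv[k][0];
        int j = c->c_inv[k][1];
        y[i] += a[j] * x[col[j]];
    }
}

/* Reference: the untransformed CSR loop nest from the slide. */
void spmv_csr(int n, const int *index, const int *col, const double *a,
              const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int j = index[i]; j < index[i + 1]; j++)
            y[i] += a[j] * x[col[j]];
}
```

The inspector cost is paid once per sparsity pattern; the executor can then be reused for every multiplication with the same `index` and `col` arrays.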

  8. More Complicated I/E Transformations - BCSR
     Input code:
         for(i = 0; i < n; i++)
           for(j = index[i]; j < index[i+1]; j++)
             y[i] += a[j]*x[col[j]];
     After make-dense:
         for(i = 0; i < n; i++)
           for(k = 0; k < n; k++)
             for(j = index[i]; j < index[i+1]; j++)
               if(k == col[j])
                 y[i]+=a[j]*x[k];
     After Tile(i,k):
         for(ii=0; ii < n/r; ii++)
           for(kk=0; kk < n/c; kk++)
             for(i=0; i < r; i++)
               for(k=0; k < c; k++)
                 for(j=index[ii*r+i]; j < index[ii*r+i+1]; j++)
                   if(kk*c+k == col[j])
                     y[ii*r+i] += a[j]*x[kk*c+k];
     After Compact-and-pad(kk,a,a'), inspector code:
         for(ii=0; ii < n/r; ii++){
           //reset marked to false (code not shown)
           for(i=0; i < r; i++)
             for(j=index[ii*r+i]; j < index[ii*r+i+1]; j++) {
               kk = col[j]/c;
               k = col[j] - kk*c;
               if(marked[kk] == false){
                 marked[kk] = true;
                 explicit_index[kk] = count;
                 //initialize a'[count][0-r][0-c] to 0
                 count++; }
               a'[count][i][k] = a[j]; }
           offset_index[ii+1] = count; }
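The first two steps on this slide, make-dense and Tile(i,k), only reorder and guard the original iterations, so they should leave the result of the CSR product unchanged. The C sketch below checks that claim on ordinary arrays. It is an illustration under assumptions: the function names are hypothetical, the matrix is assumed square with n divisible by both r and c, and the compact-and-pad step that builds a' is not modeled.

```c
#include <assert.h>

/* Original CSR loop nest. */
void spmv_csr(int n, const int *index, const int *col, const double *a,
              const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int j = index[i]; j < index[i + 1]; j++)
            y[i] += a[j] * x[col[j]];
}

/* make-dense: add a loop over every column k and guard the statement so it
 * only fires where a nonzero actually exists. */
void spmv_make_dense(int n, const int *index, const int *col, const double *a,
                     const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = index[i]; j < index[i + 1]; j++)
                if (k == col[j])
                    y[i] += a[j] * x[k];
}

/* Tile(i,k): split i into (ii, i) with tile height r, and k into (kk, k)
 * with tile width c, so each (ii, kk) pair visits one r-by-c block. */
void spmv_tiled(int n, int r, int c, const int *index, const int *col,
                const double *a, const double *x, double *y) {
    assert(r > 0 && c > 0 && n % r == 0 && n % c == 0);
    for (int ii = 0; ii < n / r; ii++)
        for (int kk = 0; kk < n / c; kk++)
            for (int i = 0; i < r; i++)
                for (int k = 0; k < c; k++)
                    for (int j = index[ii * r + i]; j < index[ii * r + i + 1]; j++)
                        if (kk * c + k == col[j])
                            y[ii * r + i] += a[j] * x[kk * c + k];
}
```

The guarded versions do far more loop iterations than the CSR loop; the point of the later compact-and-pad step is to remove that overhead by storing the nonzero blocks densely.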

  9. Partial Sum Transformation – Stencil Optimization
     • Constant-coefficient stencils (Jacobi)
     • Weighted sum
     • High-order stencils
     [Figure: stencil shapes for p = 2, p = 4, p = 6, p = 10]

  10. Partial Sum Transformation - Reuse
     • Still affine
     2D 9-point stencil:
         for (j=0; j<N; j++)
           for (i=0; i<N; i++) {
             out[j][i] = w1*( in[j-1][i] + in[j+1][i] + in[j][i-1] + in[j][i+1] ) +
                         w2*( in[j-1][i-1] + in[j+1][i-1] + in[j-1][i+1] + in[j+1][i+1] ) +
                         w3*( in[j][i] );
           }
     Loop body after the partial sum transformation (AST):
         r1 = in[j][i+1];
         r2 = in[j+1][i+1] + in[j-1][i+1];
         R[i] = w1 * r1 + w2 * r2;      // (1)
         C[i+1] = w3 * r1 + w1 * r2;    // (2)
         L[i+2] = R[i];                 // (3)
         out[j][i] = L[i] + C[i] + R[i];
     [Figure: the column read at i+1 feeds buffer R at i for point (j,i) (1), buffer C at i+1 for point (j,i+1) (2), and buffer L at i+2 for point (j,i+2) (3)]
     • Composable with communication-avoiding optimizations
       • Overlapped tiling
       • Loop fusion
       • Wavefront
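The reuse pattern in the transformed loop body can be tried out in plain C. The sketch below is an illustration under assumptions rather than CHiLL output: the grid size `N`, the function names, and the restriction to interior points are choices made here, the slide's `R[i]` array is reduced to a scalar because it is only read in the iteration that writes it, and the inner loop starts at i = -1 so that `L[1]` and `C[1]` are filled before the first interior point is produced.

```c
#include <assert.h>

#define N 8 /* grid size, chosen only for this illustration */

/* Direct 2D 9-point stencil over the interior points. */
void stencil_direct(double in[N][N], double out[N][N],
                    double w1, double w2, double w3) {
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < N - 1; i++)
            out[j][i] =
                w1 * (in[j-1][i] + in[j+1][i] + in[j][i-1] + in[j][i+1]) +
                w2 * (in[j-1][i-1] + in[j+1][i-1] + in[j-1][i+1] + in[j+1][i+1]) +
                w3 * in[j][i];
}

/* Partial-sum version: each column of three inputs is read once and its
 * weighted sums are reused by the three outputs that touch that column. */
void stencil_partial_sum(double in[N][N], double out[N][N],
                         double w1, double w2, double w3) {
    double C[N], L[N + 1];
    assert(in != out); /* out-of-place update, as in a Jacobi sweep */
    for (int j = 1; j < N - 1; j++) {
        for (int i = -1; i < N - 1; i++) {
            double r1 = in[j][i + 1];
            double r2 = in[j + 1][i + 1] + in[j - 1][i + 1];
            double R = w1 * r1 + w2 * r2; /* column i+1 as right neighbour of i */
            C[i + 1] = w3 * r1 + w1 * r2; /* same column as centre of i+1 */
            L[i + 2] = R;                 /* same column as left neighbour of i+2 */
            if (i >= 1)
                out[j][i] = L[i] + C[i] + R;
        }
    }
}
```

Because the additions are regrouped, the two versions can differ in the last bits for general floating-point inputs; with small integer-valued inputs and weights they agree exactly.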

  11. Parallel Code Generation
     • Introduces
       • Parallel threads
       • Synchronization
       • Scaffolding code
     • Approach
       • Apply transformations to set up for parallelization
         • E.g., tiling, datacopy
       • Annotate AST with aspects of parallel code generation
       • AST and polyhedral abstractions preserved until code generation, to facilitate composing transformations
       • Code generation emits specialized code

  12. Parallel Code Generation - CUDA
     Input code:
         void MM(int c[N][N], int a[N][N], int b[N][N]) {
           for (i = 0; i < N; i++)
             for (j = 0; j < N; j++)
               for (k = 0; k < N; k++)
                 c[j][i] = c[j][i] + a[k][i] * b[j][k]; }
     • Impact on AST
       • AST annotation of block/thread loops
       • Loops are marked for elimination
       • Polyhedral and AST abstractions remain until code generation
     Transformation script:
         tile_by_index(0,{"i","j"},{Ti,Tj},{l1_control="ii",l2_control="jj"},{"ii","jj","i","j","k"})
         cudaize(0,"mm_GPU",{},{block={"ii","jj"},thread={"i","j"}},{})
     Loop nest after tiling:
         for(t2 = 0; t2 <= 7; t2++) {                      // loop ii, block dimension x
           for(t4 = 0; t4 <= 15; t4++) {                   // loop jj, block dimension y
             for(t6 = 128*t2; t6 <= 128*t2+127; t6++) {    // loop i
               for(t8 = 64*t4; t8 <= 64*t4+63; t8++) {     // loop j
                 for(t10 = 0; t10 <= 1023; t10++) {        // loop k
                   s0(t2,t4,t6,t8,t10); }}}}}

  13. Parallel Code Generation - CUDA
     Input code:
         void MM(int c[N][N], int a[N][N], int b[N][N]) {
           for (i = 0; i < N; i++)
             for (j = 0; j < N; j++)
               for (k = 0; k < N; k++)
                 c[j][i] = c[j][i] + a[k][i] * b[j][k]; }
     • Impact on AST
       • AST annotation of block/thread loops
       • Loops are marked for elimination
       • Polyhedral and AST abstractions remain until code generation
       • Loop iterators are replaced with block/thread index
         • E.g., ii, jj replaced with blockIdx.x, blockIdx.y
     Transformation script:
         tile_by_index(0,{"i","j"},{Ti,Tj},{l1_control="ii",l2_control="jj"},{"ii","jj","i","j","k"})
         cudaize(0,"mm_GPU",{},{block={"ii","jj"},thread={"i","j"}},{})
     Loop nest after tiling (ii and jj become blockIdx.x and blockIdx.y):
         for(t2 = 0; t2 <= 7; t2++) {                      // loop ii, block dimension x
           for(t4 = 0; t4 <= 15; t4++) {                   // loop jj, block dimension y
             for(t6 = 128*t2; t6 <= 128*t2+127; t6++) {    // loop i
               for(t8 = 64*t4; t8 <= 64*t4+63; t8++) {     // loop j
                 for(t10 = 0; t10 <= 1023; t10++) {        // loop k
                   s0(t2,t4,t6,t8,t10); }}}}}
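The effect of eliminating the annotated loops can be imitated on the CPU. The sketch below is a sequential C illustration under assumptions, not CUDA and not CHiLL output: `N`, `TI`, and `TJ` are small stand-ins for the sizes implied by the slide's loop bounds, and `mm_body` and `mm_tiled` are hypothetical names. The body function receives the four indices that a kernel would read from blockIdx and threadIdx, and the four outer loops stand in for the grid launch.

```c
#include <assert.h>

#define N  8 /* matrix size, chosen only for this illustration */
#define TI 4 /* tile size for i (128 on the slide) */
#define TJ 2 /* tile size for j (64 on the slide) */

/* Original matrix multiply from the slide. */
void mm_original(int c[N][N], int a[N][N], int b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[j][i] = c[j][i] + a[k][i] * b[j][k];
}

/* What remains once the block/thread loops are eliminated: the iterators
 * are rebuilt from (bx, by, tx, ty), which play the roles of blockIdx.x,
 * blockIdx.y, threadIdx.x, threadIdx.y. Only the k loop survives. */
static void mm_body(int c[N][N], int a[N][N], int b[N][N],
                    int bx, int by, int tx, int ty) {
    int i = TI * bx + tx;
    int j = TJ * by + ty;
    for (int k = 0; k < N; k++)
        c[j][i] = c[j][i] + a[k][i] * b[j][k];
}

/* Tiled loop nest in the order {ii, jj, i, j, k}; the four outer loops are
 * the ones a GPU launch would replace with its grid of blocks and threads. */
void mm_tiled(int c[N][N], int a[N][N], int b[N][N]) {
    assert(N % TI == 0 && N % TJ == 0);
    for (int ii = 0; ii < N / TI; ii++)        /* block dimension x */
        for (int jj = 0; jj < N / TJ; jj++)    /* block dimension y */
            for (int i = 0; i < TI; i++)       /* thread dimension x */
                for (int j = 0; j < TJ; j++)   /* thread dimension y */
                    mm_body(c, a, b, ii, jj, i, j);
}
```

Each (bx, by, tx, ty) combination writes a distinct element of c, which is why these four loops can be handed to independent threads.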

  14. Parallel Code Generation - CUDA
     • Data Copy Transformation
       • Synchronization
       • AST annotation
     • Scaffolding code
       • Kernel inlining
     Before the data copy:
         for (kk = 0; kk <= 63; kk += 1)
           for (iii = 0; iii <= 7; iii += 1)
             for (jjj = 0; jjj <= 3; jjj += 1)
               for (k = 16 * kk; k <= 16 * kk + 15; k += 1)
                 c[...][...] = c[...][...] + a[...][...] * b[...][...];
     Transformation script:
         copy_to_shared(0,"tx","a",-16)
     After the data copy (AST):
         for (kk = 0; kk <= 63; kk += 1) {
           for (tmp_tx = 0; tmp_tx <= 7; tmp_tx += 1)
             _P1[...][...] = a[...][...];
           __syncthreads();
           for (iii = 0; iii <= 7; iii += 1)
             for (jjj = 0; jjj <= 3; jjj += 1)
               for (k = 16 * kk; k <= 16 * kk + 15; k += 1)
                 c[...][...] = c[...][...] + _P1[...][...] * b[...][...];
           __syncthreads(); }
     Scaffolding (AST):
         ...
         mm_GPU<<<dimGrid0, dimBlock0>>>(...);
         ...
         __global__ void mm_GPU(...)
         { ... }
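The structure produced by the data copy can be imitated sequentially. The sketch below is a plain C illustration under assumptions, not the generated kernel: `N` and `TK` are small stand-in sizes, `mm_datacopy` is a hypothetical name, the local array `P1` stands in for the shared-memory buffer `_P1`, and the subscripts elided on the slide are filled with one plausible choice for this example only. On a single CPU thread the two barriers have no counterpart, so they appear only as comments.

```c
#include <assert.h>

#define N  8 /* matrix size, chosen only for this illustration */
#define TK 4 /* tile size of the k loop (16 on the slide) */

/* Original matrix multiply, used as the reference. */
void mm_original(int c[N][N], int a[N][N], int b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[j][i] = c[j][i] + a[k][i] * b[j][k];
}

/* k is tiled by TK; for each tile, the rows of a that the tile needs are
 * staged in P1 first, and the compute loops then read P1 instead of a. */
void mm_datacopy(int c[N][N], int a[N][N], int b[N][N]) {
    int P1[TK][N]; /* stand-in for the shared-memory buffer */
    assert(N % TK == 0);
    for (int kk = 0; kk < N / TK; kk++) {
        /* Copy-in phase: on a GPU the threads of a block share this work. */
        for (int k = 0; k < TK; k++)
            for (int i = 0; i < N; i++)
                P1[k][i] = a[TK * kk + k][i];
        /* A barrier belongs here on a GPU: P1 must be complete before use. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = TK * kk; k < TK * kk + TK; k++)
                    c[j][i] = c[j][i] + P1[k - TK * kk][i] * b[j][k];
        /* A second barrier belongs here: no thread may overwrite P1 for the
         * next tile while another thread is still reading the current one. */
    }
}
```

The two comment positions match the two `__syncthreads()` calls on the slide: one after the copy, one at the end of each kk iteration.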
