Combining Polyhedral and AST Transformations in CHiLL
Huihui Zhang, Anand Venkat, Protonu Basu, Mary Hall
University of Utah
January 19, 2016
Outline
• Introduction
• Problem
    • Limitations of polyhedral transformation
• CHiLL Compiler Abstractions
    • Combining polyhedral and AST transformations
• Case Studies
    • Inspector/executor transformation for sparse matrix computation
    • Partial sum transformation for stencil optimization
    • Parallel code generation
        • CUDA
        • OpenMP
• Related Work
• Conclusion
Introduction
• Limitations of typical polyhedral transformation
    • Limited to the affine domain
    • Transforms iteration spaces
        • Array indices of statements are updated
• Complicated optimizations
    • AST transformation as a post-pass outside of the polyhedral framework
• Challenges
    • Leverage the composability of the polyhedral framework
CHiLL Compiler Abstractions (polyhedral transformation: shift by 4)
Pipeline: Input code -> CHiLL Compiler (loop transformation framework, polyhedral code generation) -> Generated code

Input code:
    for(i=0; i < n; i++)
      s0: a[i+1]=a[i] + 5;

Transformation: Shift by 4

CHiLL Abstractions:
    Dep: <+1>
    Statement: s0: a[i+1]=a[i] + 5;
    IS: {[i] : 0 <= i < n}
    xform: {[i]->[0,i+4,0]}
    code: a[i+1]=a[i] + 5;
    xform_inv = {[i]->[i-4]}

Generated code:
    for(i=4; i < n+4; i++)
      s0: a[i-3]=a[i-4]+5;
CHiLL Compiler Abstractions (AST transformation)
Pipeline: Input code -> CHiLL Compiler (loop transformation framework, polyhedral code generation) -> Generated code

Input code:
    for(i=0; i < n; i++)
      s0: a[i+1]=a[i] + 5;

CHiLL Abstractions:
    Dep: <+1>
    Statement: s0: a[i+1]=a[i] + 5;
    IS: {[i] : 0 <= i < n}
    xform: {[i] -> [0,1,0]}
    code: AST

AST -> Modified AST -> Generated code
Non-Affine Extension – Coalesce Transformation
• Sparse matrix computation
    • Non-affine indirection through index arrays
    • Subscript expressions
        • x[col[j]]
    • Upper/lower loop bounds
        • index[i], index[i+1]
• Uninterpreted function symbol abstraction
    • Models functions or mappings (non-affine)
• Inspector/Executor mechanism
    • Inspector collects information at runtime
    • Information is used by the optimized executor

CSR:
    for(i=0; i < n; i++)
      for(j=index[i]; j<index[i+1]; j++)
        y[i]+=a[j]*x[col[j]]
Inspector Construction - Coalesce Transformation

Input code:
    for(i=0; i < n; i++)
      for(j=index[i]; j<index[i+1]; j++)
        y[i]+=a[j]*x[col[j]]

AST & Iteration Space Manipulation:
    T_coalesce = {[i,j]->[k] | k=c(i,j) ∧ 0 ≤ k < NNZ}

    struct c {
      int c_inv[][2];
      int k;
      void create_mapping(int i, int j) {
        c_inv[k][0] = i;
        c_inv[k][1] = j;
        k++;
      }}

Inspector code (AST):
    for(i = 0; i < n; i++)
      for(j = index[i]; j < index[i+1]; j++)
        c.create_mapping(i,j);

Executor code (polyhedral code generation, statement update):
    for (k = 0; k < NNZ; k++)
      y[c_inv[k][0]] += a[c_inv[k][1]]*x[col[c_inv[k][1]]];
More Complicated I/E Transformations - BCSR

Input code:
    for(i = 0; i < n; i++)
      for(j = index[i]; j < index[i+1]; j++)
        y[i] += a[j]*x[col[j]];

After make-dense:
    for(i = 0; i < n; i++)
      for(k = 0; k < n; k++)
        for(j = index[i]; j < index[i+1]; j++)
          if(k == col[j])
            y[i]+=a[j]*x[k];

After Tile(i,k):
    for(ii=0; ii < n/r; ii++)
      for(kk=0; kk < n/c; kk++)
        for(i=0; i < r; i++)
          for(k=0; k < c; k++)
            for(j=index[ii*r+i]; j < index[ii*r+i+1]; j++)
              if(kk*c+k == col[j])
                y[ii*r+i] += a[j]*x[kk*c+k];

After Compact-and-pad(kk,a,a'), inspector code:
    for(ii=0; ii < n/r; ii++){
      //reset marked to false (code not shown)
      for(i=0; i < r; i++)
        for(j=index[ii*r+i]; j < index[ii*r+i+1]; j++) {
          kk = col[j]/c;
          k = col[j] - kk*c;
          if(marked[kk] == false){
            marked[kk] = true;
            explicit_index[kk] = count;
            //initialize a'[count][0-r][0-c] to 0
            count++; }
          a'[count][i][k] = a[j]; }
      offset_index[ii+1] = count;
    }
Partial Sum Transformation – Stencil Optimization
• Constant-coefficient stencils
    • Jacobi
    • Weighted sum
• High-order stencils

[Figure: stencil diagrams, panels labeled p = 2, p = 4, p = 6, p = 10]
Partial Sum Transformation - Reuse (still affine)

2D 9-point stencil:
    for (j=0; j<N; j++)
      for (i=0; i<N; i++) {
        out[j][i] = w1*( in[j-1][i] + in[j+1][i] +
                         in[j][i-1] + in[j][i+1] ) +
                    w2*( in[j-1][i-1] + in[j+1][i-1] +
                         in[j-1][i+1] + in[j+1][i+1] ) +
                    w3*( in[j][i] );
      }

After the AST transformation (loop body):
    r1 = in[j][i+1];
    r2 = in[j+1][i+1] + in[j-1][i+1];
    R[i] = w1 * r1 + w2 * r2;        // (1)
    C[i+1] = w3 * r1 + w1 * r2;      // (2)
    L[i+2] = R[i];                   // (3)
    out[j][i] = L[i] + C[i] + R[i];

[Figure: input column i+1 feeds three buffers: (1) R at i, the right column of output (j,i); (2) C at i+1, the center column of output (j,i+1); (3) L at i+2, the left column of output (j,i+2)]

• Composable with communication-avoiding optimizations
    • Overlapped tiling
    • Loop fusion
    • Wavefront
Parallel Code Generation
• Introduces
    • Parallel threads
    • Synchronization
    • Scaffolding code
• Approach
    • Apply transformations to set up for parallelization
        • E.g., tiling, datacopy
    • Annotate the AST with aspects of parallel code generation
    • AST and polyhedral abstractions are preserved until code generation, to facilitate composing transformations
    • Code generation emits specialized code
Parallel Code Generation - CUDA
• Impact to AST
    • AST annotation of block/thread loops
    • Loops are marked for elimination
    • Polyhedral and AST abstractions remain until code generation

Input code:
    void MM(int c[N][N], int a[N][N], int b[N][N]) {
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
          for (k = 0; k < N; k++)
            c[j][i] = c[j][i] + a[k][i] * b[j][k]; }

Script:
    tile_by_index(0,{"i","j"},{Ti,Tj},
                  {l1_control="ii",l2_control="jj"},
                  {"ii","jj","i","j","k"})

Tiled loop nest:
    for(t2 = 0; t2 <= 7; t2++) {                    // loop ii, block dimension x
      for(t4 = 0; t4 <= 15; t4++) {                 // loop jj, block dimension y
        for(t6 = 128*t2; t6 <= 128*t2+127; t6++) {  // loop i
          for(t8 = 64*t4; t8 <= 64*t4+63; t8++) {   // loop j
            for(t10 = 0; t10 <= 1023; t10++) {      // loop k
              s0(t2,t4,t6,t8,t10); }}}}}

Script:
    cudaize(0,"mm_GPU",{},
            {block={"ii","jj"},thread={"i","j"}},{})
Parallel Code Generation - CUDA
• Impact to AST
    • AST annotation of block/thread loops
    • Loops are marked for elimination
    • Polyhedral and AST abstractions remain until code generation
    • Loop iterators are replaced with block/thread indices
        • E.g., ii, jj replaced with blockIdx.x, blockIdx.y

Input code:
    void MM(int c[N][N], int a[N][N], int b[N][N]) {
      for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
          for (k = 0; k < N; k++)
            c[j][i] = c[j][i] + a[k][i] * b[j][k]; }

Script:
    tile_by_index(0,{"i","j"},{Ti,Tj},
                  {l1_control="ii",l2_control="jj"},
                  {"ii","jj","i","j","k"})

Tiled loop nest (t2 and t4 become blockIdx.x and blockIdx.y):
    for(t2 = 0; t2 <= 7; t2++) {                    // loop ii, block dimension x
      for(t4 = 0; t4 <= 15; t4++) {                 // loop jj, block dimension y
        for(t6 = 128*t2; t6 <= 128*t2+127; t6++) {  // loop i
          for(t8 = 64*t4; t8 <= 64*t4+63; t8++) {   // loop j
            for(t10 = 0; t10 <= 1023; t10++) {      // loop k
              s0(t2,t4,t6,t8,t10); }}}}}

Script:
    cudaize(0,"mm_GPU",{},
            {block={"ii","jj"},thread={"i","j"}},{})
Parallel Code Generation - CUDA
• Data Copy Transformation
    • Synchronization
    • AST annotation
• Scaffolding code
    • Kernel inlining

Before data copy:
    for (kk = 0; kk <= 63; kk += 1)
      for (iii = 0; iii <= 7; iii += 1)
        for (jjj = 0; jjj <= 3; jjj += 1)
          for (k = 16 * kk; k <= 16 * kk + 15; k += 1)
            c[...][...] = c[...][...] + a[...][...] * b[...][...];

Script:
    copy_to_shared(0,"tx","a",-16)

After data copy (AST):
    for (kk = 0; kk <= 63; kk += 1) {
      for (tmp_tx = 0; tmp_tx <= 7; tmp_tx += 1)
        _P1[...][...] = a[...][...];
      __syncthreads();
      for (iii = 0; iii <= 7; iii += 1)
        for (jjj = 0; jjj <= 3; jjj += 1)
          for (k = 16 * kk; k <= 16 * kk + 15; k += 1)
            c[...][...] = c[...][...] + _P1[...][...] * b[...][...];
      __syncthreads();
    }

Scaffolding code (AST):
    ...
    mm_GPU<<<dimGrid0, dimBlock0>>>(...);
    ...
    __global__ void mm_GPU(...)
    { ... }