
Locus: A System and a Language for Program Optimization

Thiago Teixeira*, Corinne Ancourt+, David Padua*, William Gropp*
* Department of Computer Science, University of Illinois at Urbana-Champaign, USA
+ MINES ParisTech, PSL University, France


  2. Contributions
     • Defined the Locus language:
       – concisely describes complex spaces of optimizations
       – agnostic of any specific traversal method
       – decouples the performance expert role from the application expert role
     • Implemented a system with a flexible API for plugging in:
       – different variant selection techniques (optimization space traversal)
       – collections of transformations developed internally and externally
     • Optimizer and interpreter for Locus programs:
       – prune the space automatically
       – speed up the empirical search

  3. Locus Approach
     • Baseline code: defined by the developer, with no platform- or compiler-specific optimizations
     • Annotated regions of interest (i.e., code regions)
     • Program the application of the optimizations for each code region

  4. Locus System
     Annotated Source Code

         #pragma @Locus loop = matmul
         for (i=0; i<M; i++)
           for (j=0; j<N; j++)
             for (k=0; k<K; k++)
               C[i][j] = beta*C[i][j] + alpha*A[i][k]*B[k][j];
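
The annotation step can be illustrated with a small sketch that scans source text for Locus pragmas. It assumes the annotation always has the shape shown on the slide (`#pragma @Locus loop = <name>`); the `find_regions` helper and its return format are hypothetical and are not part of the Locus system.

```python
import re

# Assumed annotation shape, copied from the slide: "#pragma @Locus loop = matmul".
PRAGMA = re.compile(r"^\s*#pragma\s+@Locus\s+(\w+)\s*=\s*(\w+)\s*$")


def find_regions(source):
    """Return (line_number, kind, region_name) for every Locus annotation found."""
    regions = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = PRAGMA.match(line)
        if match:
            regions.append((number, match.group(1), match.group(2)))
    return regions


SOURCE = """\
#pragma @Locus loop = matmul
for (i=0; i<M; i++)
  for (j=0; j<N; j++)
    for (k=0; k<K; k++)
      C[i][j] = beta*C[i][j] + alpha*A[i][k]*B[k][j];
"""

print(find_regions(SOURCE))  # [(1, 'loop', 'matmul')]
```

The region name (`matmul`) is what ties the annotated loop nest to a `CodeReg` of the same name in the Locus program.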

  8. Locus System
     Annotated Source Code

         #pragma @Locus loop = matmul
         for (i=0; i<M; i++)
           for (j=0; j<N; j++)
             for (k=0; k<K; k++)
               C[i][j] = beta*C[i][j] + alpha*A[i][k]*B[k][j];

     Locus Program

         CodeReg matmul {
           tiledim = 4;
           tiletype = Tiling2D() OR Tiling3D();
           printstatus(tiletype);
           if (tiletype == "2D") {
             RoseLocus.Unroll(loop=innermost, factor=tiledim);
           }
         }

     • Optimizations are target-specific and region-specific
     • Separated from the application's code
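
To make the effect of the OR statement concrete, the sketch below enumerates the transformation plans that the `matmul` program above can produce. It is a plain Python model, not Locus itself, and it assumes that `Tiling2D()` and `Tiling3D()` yield the strings "2D" and "3D" (which the slide's comparison `tiletype == "2D"` suggests). The `matmul_variants` name is hypothetical.

```python
def matmul_variants(tiledim=4):
    """Enumerate the plans implied by `tiletype = Tiling2D() OR Tiling3D();`."""
    variants = []
    for tiletype in ("2D", "3D"):  # one variant per alternative of the OR statement
        plan = [f"Tiling{tiletype}()"]
        if tiletype == "2D":  # the unroll is applied only on the 2D branch
            plan.append(f"RoseLocus.Unroll(loop=innermost, factor={tiledim})")
        variants.append(plan)
    return variants


for plan in matmul_variants():
    print(plan)
```

Under these assumptions the program describes two variants: 2D tiling followed by unrolling, and 3D tiling alone.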

  12. Locus Optimization Language
     • Optimization recipes for each code region (CodeReg, OptSeq)
     • Loops, if-then-else
     • Special search constructs:
       – OR blocks and statements
       – optional statements
       – enum, integer, permutation, poweroftwo, …
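
One way to picture the search constructs is as finite domains whose sizes multiply. The sketch below models `enum`, `integer`, `poweroftwo`, and `permutation` as Python helpers. The argument conventions (inclusive bounds, a list of items to permute) and the parameter names in `space` are assumptions made for illustration; the slides do not give the Locus signatures.

```python
import itertools
import math


def enum(*values):
    return list(values)


def integer(lo, hi):
    return list(range(lo, hi + 1))  # inclusive bounds are an assumption


def poweroftwo(lo, hi):
    return [v for v in range(lo, hi + 1) if v > 0 and (v & (v - 1)) == 0]


def permutation(items):
    return list(itertools.permutations(items))


# Hypothetical parameters, chosen only to show how domain sizes combine.
space = {
    "schedule": enum("static", "dynamic"),
    "unroll": integer(1, 4),
    "tile": poweroftwo(2, 64),
    "order": permutation(["i", "j", "k"]),
}

size = math.prod(len(domain) for domain in space.values())
print(size)  # 2 * 4 * 6 * 6 = 288
```

Even four small parameters give hundreds of points, which is why the space description is kept separate from the method used to traverse it.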

  19. Locus Optimization Language
     [Diagram: Interchange, followed by an OR of two branches. Branch 1: Tiling, Distribute, Unroll. Branch 2: Unroll-and-jam, Distribute*, Unroll. The asterisk marks Distribute as optional.]

         CodeReg test {
           Interchange(…);
           {
             Tiling(…);
             Distribute(…);
             Unroll(…);
           } OR {
             Unroll-and-jam(…);
             *Distribute(…);
             Unroll(…);
           }
         }
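
The recipe above combines a sequence, an OR block, and an optional statement. A minimal sketch of how such a recipe expands into concrete plans follows; the tuple-based tree encoding and the `expand` function are hypothetical stand-ins, not the Locus interpreter.

```python
def expand(node):
    """Return every concrete statement sequence described by a recipe tree."""
    if isinstance(node, str):  # a single transformation
        return [[node]]
    kind, body = node
    if kind == "seq":  # all children, in order
        plans = [[]]
        for child in body:
            plans = [p + q for p in plans for q in expand(child)]
        return plans
    if kind == "or":  # exactly one child
        return [p for child in body for p in expand(child)]
    if kind == "opt":  # child present or absent
        return expand(body) + [[]]
    raise ValueError(f"unknown node kind: {kind}")


RECIPE = ("seq", [
    "Interchange",
    ("or", [
        ("seq", ["Tiling", "Distribute", "Unroll"]),
        ("seq", ["Unroll-and-jam", ("opt", "Distribute"), "Unroll"]),
    ]),
])

for plan in expand(RECIPE):
    print(" -> ".join(plan))
```

The recipe expands to three plans: one through the tiling branch and two through the unroll-and-jam branch, with and without Distribute.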

  24. Modules Integration 1/3
     • Collaborative environment: reuse others' work
     • Locus defines an entire search space
     • Locus allows for multiple search modules and multiple transformation modules
     • Given the search space, one must:
       – decide which variants to evaluate (search module)
       – use tools to generate code that follows each variant's transformation plan (transformation module)

  35. Modules Integration 2/3
     • Search modules (OpenTuner, HyperOpt):
       – Convert the Locus space to the module's space
         • parameters, OR statements and blocks, conditionals
       – For each point, convert it back to the Locus representation and invoke the interpreter
       – Search: start the process
     [Diagram: Locus program → Locus's space → search module's opt space → select a point and convert → code generator → evaluate a variant → return a metric]
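
The select, generate, evaluate, return-a-metric loop in the diagram can be sketched with a plain random search. This deliberately does not use the OpenTuner or HyperOpt APIs; `SPACE`, `evaluate`, and `random_search` are hypothetical, and the metric is a toy stand-in for building and timing a variant.

```python
import random

# Hypothetical search space, loosely modeled on the matmul example.
SPACE = {"tiletype": ["2D", "3D"], "tiledim": [2, 4, 8, 16]}


def evaluate(point):
    """Toy metric (lower is better); a real run would compile and time the variant."""
    penalty = 0.0 if point["tiletype"] == "2D" else 1.0
    return abs(point["tiledim"] - 8) + penalty


def random_search(space, evaluate, budget, seed=0):
    rng = random.Random(seed)
    best_point, best_metric = None, float("inf")
    for _ in range(budget):
        # Select a point in the search module's space.
        point = {name: rng.choice(domain) for name, domain in space.items()}
        # Generate and evaluate the variant, then feed the metric back.
        metric = evaluate(point)
        if metric < best_metric:
            best_point, best_metric = point, metric
    return best_point, best_metric


best_point, best_metric = random_search(SPACE, evaluate, budget=50)
print(best_point, best_metric)
```

A real search module would replace the uniform sampling with its own strategy, while the surrounding loop keeps the same shape.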

  37. Modules Integration 3/3
     • Transformation modules (Pips, RoseLocus, Pragmas, BuiltIn):
       – Allow for fine-grained selection
         • Can pick a different module for each transformation (e.g., Interchange, Tiling)
       – Work at the code region level
       – Workflow:
         • Locus translates the code region into the module's notation
         • The module applies the optimization
         • Locus transforms the resulting code into its internal representation (AST and code region structure)
       – Flexible enough to integrate other transformations if needed
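
The per-transformation module selection can be pictured as a registry keyed by (module, transformation). The sketch below is an assumption-laden toy: the registry, the decorator, and the two transformations operating on lists of loop names are all hypothetical, and the pairing of Tiling with Pips is illustrative only.

```python
from typing import Callable, Dict, List, Tuple

Transform = Callable[[List[str]], List[str]]
REGISTRY: Dict[Tuple[str, str], Transform] = {}


def register(module: str, name: str):
    """Record a transformation under its (module, name) key."""
    def wrap(fn: Transform) -> Transform:
        REGISTRY[(module, name)] = fn
        return fn
    return wrap


@register("RoseLocus", "Interchange")
def toy_interchange(loops: List[str]) -> List[str]:
    return list(reversed(loops))  # toy behavior: reverse the loop order


@register("Pips", "Tiling")
def toy_tiling(loops: List[str]) -> List[str]:
    return [f"{name}_tile" for name in loops] + loops  # toy behavior: add tile loops


def apply_plan(loops: List[str], plan: List[Tuple[str, str]]) -> List[str]:
    """Apply each step with the module chosen for that step."""
    for module, name in plan:
        loops = REGISTRY[(module, name)](loops)
    return loops


result = apply_plan(["i", "j", "k"], [("RoseLocus", "Interchange"), ("Pips", "Tiling")])
print(result)  # ['k_tile', 'j_tile', 'i_tile', 'k', 'j', 'i']
```

Because each step names its module explicitly, adding another module only requires registering new entries.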

  43. Optimizations for Pruning
     • During conversion:
       – Dead code elimination
       – Constant folding
       – Constant propagation

         CodeReg test {
           perfect = IsPerfectLoopNest();
           if (perfect) {
             Interchange(…);
           }
           Tiling(…);
           Distribute(…);
           Unroll(…);
         }

     Slide annotation next to the perfect = IsPerfectLoopNest(); statement: False
     [Diagram: Locus program → Locus's space → search module's opt space → select a point and convert → code generator → evaluate a variant → return a metric]
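
The pruning example can be sketched as constant propagation followed by dead code elimination over a tiny statement list. The tuple encoding and the `prune` function are hypothetical, and the sketch assumes that the result of `IsPerfectLoopNest()` is already known at conversion time (False on the slide).

```python
def prune(stmts, env=None):
    """Propagate known constants and drop branches that can never execute."""
    env = dict(env or {})
    out = []
    for stmt in stmts:
        kind = stmt[0]
        if kind == "assign":  # ("assign", name, constant)
            _, name, value = stmt
            env[name] = value  # remember the constant for later conditions
        elif kind == "if":  # ("if", condition_name, body)
            _, cond, body = stmt
            if cond not in env:
                out.append(stmt)  # unknown condition: keep the conditional as is
            elif env[cond]:
                out.extend(prune(body, env))  # always taken: inline the body
            # known to be false: the branch is dead code and is dropped
        else:
            out.append(stmt)
    return out


PROGRAM = [
    ("assign", "perfect", False),  # assumed outcome of IsPerfectLoopNest()
    ("if", "perfect", [("call", "Interchange")]),
    ("call", "Tiling"),
    ("call", "Distribute"),
    ("call", "Unroll"),
]

print(prune(PROGRAM))
# [('call', 'Tiling'), ('call', 'Distribute'), ('call', 'Unroll')]
```

With the Interchange branch removed, any search parameters that appear only inside it no longer contribute to the space handed to the search module.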

  44. Experimental Results
     • Intel Xeon E5-2660 10-Core 2.60 GHz
     • Compared to Pluto and Intel MKL
       – Default values for parameters, no search
     • Examples:
       – Matrix-Matrix Multiplication
       – Stencil Kernels
       – Kripke
       – Arbitrary Loop Nests
     • Generic enough to be applied to known and unknown applications

  46. Matrix-Matrix Multiplication
     [Figure: speedup over sequential (y-axis, 0 to 600) versus CPU cores (1, 2, 4, 6, 8, 10) for Pluto, MKL, and Locus]
     • Empirical search could find very efficient variants
     • Comparable with Intel MKL performance

  56. Matrix-Matrix Multiplication
     [Diagram: Interchange → Tiling → Tiling → Parallel For → OR (Static + chunk, Dynamic + chunk)]
     • Large space of optimization
     • 34,012,224 possible variants
     • Average of ~450 variants evaluated per setup
     • 80 minutes of search per setup
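
The figures on this slide imply how sparse the search is. The short calculation below uses only the three numbers reported above; the extrapolation to an exhaustive search assumes the average time per variant would stay the same, which the slides do not state.

```python
TOTAL_VARIANTS = 34_012_224   # size of the space, from the slide
EVALUATED = 450               # approximate average per setup, from the slide
SEARCH_MINUTES = 80           # search time per setup, from the slide

fraction = EVALUATED / TOTAL_VARIANTS
seconds_per_variant = SEARCH_MINUTES * 60 / EVALUATED
# Assumption: every variant costs the same average time as the sampled ones.
exhaustive_years = TOTAL_VARIANTS * seconds_per_variant / (3600 * 24 * 365)

print(f"fraction explored: {fraction:.2e}")                # about 1.32e-05
print(f"seconds per variant: {seconds_per_variant:.2f}")   # about 10.67
print(f"exhaustive search: {exhaustive_years:.1f} years")  # about 11.5
```

Roughly one variant in 75,000 is evaluated, so the quality of the result depends on the search module picking points well rather than on coverage.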

  62. Stencils
     • 6 different stencils
     • Skew tiling across time-space
     • Found better tiling shapes
     [Figure: speedup (y-axis, 0 to 4) of Pluto and Locus for Jacobi 1d, Jacobi 2d, Heat 1d, Heat 2d, Seidel 1d, Seidel 2d]

  63. Kripke
     • Deterministic particle transport code and proxy-app for the Ardra project developed at LLNL
     • 5 kernels: LTimes, LPlusTimes, Scattering, Source, and Sweep
     • 6 hand-optimized versions (6 angular fluxes using a 3D array indexed by direction D, group G and zone Z)
     • From a single source code, generate the 6 hand-optimized versions using Locus

  64. Kripke
     [Figure: execution time in seconds (y-axis, 0 to 9) of the Hand-Optimized and Locus versions for data layouts DGZ, DZG, GDZ, GZD, ZDG, ZGD]

  65. Kripke - Scattering Kernel

         for(int nm = 0; nm < num_moments; ++nm)
           for(int g = 0; g < num_groups; ++g)
             for(int gp = 0; gp < num_groups; ++gp)
               for(int zone = 0; zone < num_zones; ++zone)
                 for(int mix = z_mixed[zone]; mix < z_mixed[zone]+num_mixed[zone]; ++mix)
                 {
                   int material = mixed_material[mix];
                   double fraction = mixed_fraction[mix];
                   int n = moment_to_coeff[nm];
                   #####
                   # Address calculation to be included here.
                   #####
                   *phi_out += *sigs * *phi * fraction;
                 }

  66. Kripke - Scattering Kernel
     Locus program for the scattering kernel:

         datalayout=enum("DZG","DGZ","GDZ","GZD","ZDG","ZGD");
         CodeReg Scattering {
           if (datalayout == "DGZ") {
             omploop="0.0.0.0";
           } elif (datalayout == "GDZ") {
             looporder=[1,2,0,3,4];
             omploop="0.0.0.0";
           } elif (datalayout == "GZD") {
             looporder=[1,2,3,4,0];
             omploop="0.0.0";
           } elif (datalayout == "ZGD") {
             looporder=[3,4,1,2,0];
             omploop="0";
           } elif (datalayout == "ZDG") {
             looporder=[3,4,0,1,2];
             omploop="0";
           } elif (datalayout == "DZG") {
             looporder=[0,3,4,1,2];
             omploop="0.0";
           }
           sourcepath="scatter_"+datalayout+".txt";
           BuiltIn.Altdesc(stmt="0.0.0.0.0.3", source=sourcepath);
           RoseLocus.Interchange(order=looporder);
           RoseLocus.LICM();
           RoseLocus.ScalarRepl();
           Pragma.OMPFor(loop=omploop);
         }
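
The `looporder` lists can be read as permutations of the five loops of the scattering kernel. The sketch below assumes that position `i` of the new nest takes the original loop with index `order[i]`; the slides do not spell this out, but the reading is consistent with the layout names (for example, GDZ puts the group loops outermost). The `interchange` helper is hypothetical and is not the RoseLocus implementation.

```python
# Loop variables of the scattering kernel, outermost first.
LOOPS = ["nm", "g", "gp", "zone", "mix"]

# Permutations copied from the Locus program (DGZ sets no looporder there).
LOOPORDER = {
    "GDZ": [1, 2, 0, 3, 4],
    "GZD": [1, 2, 3, 4, 0],
    "ZGD": [3, 4, 1, 2, 0],
    "ZDG": [3, 4, 0, 1, 2],
    "DZG": [0, 3, 4, 1, 2],
}


def interchange(loops, order):
    """Reorder loops so that new position i holds the original loop order[i]."""
    if sorted(order) != list(range(len(loops))):
        raise ValueError("order must be a permutation of the loop indices")
    return [loops[i] for i in order]


for layout, order in LOOPORDER.items():
    source_path = "scatter_" + layout + ".txt"  # same concatenation as in the program
    print(layout, interchange(LOOPS, order), source_path)
```

For GDZ this yields `g, gp, nm, zone, mix`, and for ZGD it yields `zone, mix, g, gp, nm`, matching the order of the letters in each layout name.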
