AlphaZ: A System for Design Space Exploration in the Polyhedral Model
Tomofumi Yuki, Gautam Gupta, DaeGon Kim, Tanveer Pathan, and Sanjay Rajopadhye
Polyhedral Compilation
- The Polyhedral Model
  - Now a well-established approach to automatic parallelization
  - Based on a mathematical formalism
  - Works well for regular/dense computations
- Many tools and compilers: PIPS, PLuTo, MMAlpha, RStream, GRAPHITE (gcc), Polly (LLVM), ...
Design Space (still a subset)
- Space-time mapping + tiling: schedule + parallel loops
  - Primary focus of existing tools
- Memory allocation
  - Most tools for general-purpose processors do not modify the original allocation
  - Complex interaction with the space-time mapping
- Higher-level optimizations
  - Reduction detection
  - Simplifying reductions (complexity reduction)
AlphaZ
- Tool for exploration
  - Provides a collection of analyses, transformations, and code generators
- Unique features
  - Memory allocation
  - Reductions
- Can be used as a push-button system
  - e.g., parallelization à la PLuTo is possible
  - Not our current focus
This Paper: Case Studies
- adi.c from PolyBench
  - Reconsidering the memory allocation allows the program to be fully tiled
  - Outperforms PLuTo, which tiles only the inner loops
- UNAfold (RNA folding application)
  - Complexity reduction from O(n⁴) to O(n³)
  - Application of the transformations is fully automatic
This Talk: Focus on Memory
- Tiling requires more memory
  - e.g., the Smith-Waterman dependence pattern
[Figure: Smith-Waterman dependences under sequential vs. tiled execution]
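To illustrate why a sequential schedule can get by with less memory than a tiled one, consider a DP table with Smith-Waterman-style dependences, where H[i][j] reads H[i-1][j-1], H[i-1][j], and H[i][j-1]. A row-by-row sequential schedule lets row storage be recycled, whereas a tiled schedule must keep tile-boundary values live until neighboring tiles execute. The sketch below is a simplified, hypothetical recurrence (no gap penalties or zero clamp, so it is not the real Smith-Waterman scoring), meant only to show the dependence pattern and the row-recycled allocation:

```c
#include <string.h>

#define N 6
#define M 6

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Full-table DP with Smith-Waterman-style dependences:
   H[i][j] reads H[i-1][j-1], H[i-1][j], and H[i][j-1]. */
int dp_full(int s[N][M]) {
    static int H[N + 1][M + 1];
    memset(H, 0, sizeof H);
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= M; j++)
            H[i][j] = max3(H[i-1][j-1] + s[i-1][j-1], H[i-1][j], H[i][j-1]);
    return H[N][M];
}

/* A sequential row order lets one row be recycled: O(M) memory.
   Under tiled execution this recycling is no longer legal, because
   a later tile may still need values that would be overwritten. */
int dp_two_rows(int s[N][M]) {
    int prev[M + 1] = {0}, cur[M + 1] = {0};
    for (int i = 1; i <= N; i++) {
        cur[0] = 0;
        for (int j = 1; j <= M; j++)
            cur[j] = max3(prev[j-1] + s[i-1][j-1], prev[j], cur[j-1]);
        memcpy(prev, cur, sizeof prev);
    }
    return prev[M];
}
```

Both versions compute the same table entries; only the amount of live storage differs.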
ADI-like Computation
- Updates a 2D grid within an outer time loop
- PLuTo tiles only the inner two dimensions
  - Due to a memory-based dependence
- With an extra scalar, it becomes tilable in all three dimensions
- The PolyBench implementation has a bug
  - It does not correctly implement ADI
  - ADI itself is not tilable in all dimensions
adi.c: Original Allocation

  for (t = 0; t < tsteps; t++) {
    for (i1 = 0; i1 < n; i1++)
      for (i2 = 0; i2 < n; i2++)
        X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], ...)
    ...
    for (i1 = 0; i1 < n; i1++)
      for (i2 = n-1; i2 >= 1; i2--)
        X[i1][i2] = bar(X[i1][i2], X[i1][i2-1], ...)
    ...
  }

- Not tilable because of the reverse loop
- Memory-based dependence: (i1,i2 -> i1,i2+1)
- Tiling requires all dependences to be non-negative
adi.c: Original Allocation

  for (i2 = 0; i2 < n; i2++)
    S1: X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], ...)
  ...
  for (i2 = n-1; i2 >= 1; i2--)
    S2: X[i1][i2] = bar(X[i1][i2], X[i1][i2-1], ...)
  ...

[Figure: row X[i1] is written by S1 and then overwritten in place by S2]
adi.c: With Extra Memory
- Once the two loops are fused:

  for (i2 = 0; i2 < n; i2++)
    S1: X[i1][i2] = foo(X[i1][i2], X[i1][i2-1], ...)
  ...
  for (i2 = 1; i2 < n; i2++)
    S2: X'[i1][i2] = bar(X[i1][i2], X[i1][i2-1], ...)
  ...

- The value of X only needs to be preserved for one iteration of i2
- We don't need a full array X', just a scalar

[Figure: rows X[i1] and X'[i1]]
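The fusion can be sketched in one dimension. In the sketch below, foo and bar are hypothetical stand-ins for the real ADI updates; the point is that once S1 and S2 are fused, S1's value of X[i2-1] must survive exactly one iteration, so the full copy X' shrinks to a single scalar:

```c
#include <string.h>

#define N 8

/* Hypothetical stand-ins for the real ADI stencil updates. */
static double foo(double a, double b) { return 0.5 * a + b; }
static double bar(double a, double b) { return a - 0.25 * b; }

/* Reference: the two sweeps kept separate, with a full copy Xp
   playing the role of X'. */
void sweep_two_arrays(double X[N]) {
    double Xp[N];
    for (int i2 = 1; i2 < N; i2++)        /* S1 */
        X[i2] = foo(X[i2], X[i2-1]);
    Xp[0] = X[0];
    for (int i2 = 1; i2 < N; i2++)        /* S2 reads S1's values */
        Xp[i2] = bar(X[i2], X[i2-1]);
    memcpy(X, Xp, sizeof(double) * N);
}

/* Fused: S1's value of X[i2-1] only needs to live one iteration,
   so the full array X' shrinks to the scalar `prev`. */
void sweep_scalar(double X[N]) {
    double prev = X[0];
    for (int i2 = 1; i2 < N; i2++) {
        double cur = foo(X[i2], prev);    /* S1, using S1's X[i2-1] */
        X[i2] = bar(cur, prev);           /* S2 overwrites X in place */
        prev = cur;                       /* preserve S1's value */
    }
}
```

Both versions perform the same floating-point operations in the same order, so they produce identical results; only the memory footprint differs.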
adi.c: Performance
[Figure: speedup of the optimized code over the original, for AlphaZ and PLuTo, vs. number of threads (cores): up to 24 threads on a Cray XT6m and up to 8 threads on a Xeon]
- PLuTo does not scale because the outer loop is not tiled
UNAfold
- UNAfold [Markham and Zuker 2008]
  - RNA secondary structure prediction algorithm
  - An O(n³) algorithm was known [Lyngso et al. 1999]
    - but too complicated to implement
    - a "good enough" workaround exists
- AlphaZ
  - Systematically transforms the O(n⁴) version into the O(n³) one
  - Most of the process can be automated
UNAfold: Optimization
- Key: Simplifying Reductions [POPL 2006]
  - Finds "hidden scans" in reductions
  - A rare case where the compiler can reduce complexity
- Almost automatic:
  - The O(n⁴) section must be separated out
    - many boundary cases
  - Functions must be inlined to expose reuse
  - Transformations to perform the above are available; no manual modification of code
UNAfold: Performance
[Figure: execution time of UNAfold (original vs. simplified) against sequence length N, shown both in seconds and as a log-log plot; the fitted log-log slopes are 4 for the original and 3 for the simplified version]
- The complexity reduction is empirically confirmed
AlphaZ System Overview
- Target Mapping: specifies the schedule, memory allocation, etc.
[Figure: C and Alpha programs enter the polyhedral representation; transformations and analyses operate on it, guided by the Target Mapping; code generators emit C+OpenMP, C+CUDA, and C+MPI]
Human-in-the-Loop
- Automatic parallelization: the "holy grail" goal
  - Current automatic tools are restrictive
    - A strategy that works well is hard-coded
    - Difficult to pass in domain-specific knowledge
- Human-in-the-loop
  - Provide full control to the user
  - Help find new "good" strategies
  - Guide the transformation with domain-specific knowledge
Conclusions
- There are more strategies worth exploring
  - some may currently be difficult to automate
- Case studies
  - adi.c: memory
  - UNAfold: reductions
- AlphaZ: a tool for trying out new ideas
Acknowledgements
- AlphaZ developers/users
  - Members of Mélange at CSU
  - Members of CAIRN at IRISA, Rennes
  - Dave Wonnacott at Haverford College and his students
Key: Simplifying Reductions
- Simplifying Reductions [POPL 2006]
  - Finds "hidden scans" in reductions
  - A rare case where the compiler can reduce complexity
- Main idea:

    X[i] = Σ_{k=0}^{i} A[k]                    O(n²)

  can be written as

    X[i] = { A[i]             if i = 0
           { X[i-1] + A[i]    if i > 0         O(n)
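The idea can be sketched directly in C (a hypothetical prefix-sum example, not AlphaZ output): the naive loop nest recomputes each sum from scratch, for O(n²) additions in total, while the hidden scan reuses X[i-1] and needs only O(n):

```c
#define N 64

/* Naive evaluation of X[i] = sum_{k=0..i} A[k]: every X[i] is
   recomputed from scratch, O(n^2) additions overall. */
void prefix_naive(const long A[N], long X[N]) {
    for (int i = 0; i < N; i++) {
        X[i] = 0;
        for (int k = 0; k <= i; k++)
            X[i] += A[k];
    }
}

/* The hidden scan: X[0] = A[0], X[i] = X[i-1] + A[i], O(n). */
void prefix_scan(const long A[N], long X[N]) {
    X[0] = A[0];
    for (int i = 1; i < N; i++)
        X[i] = X[i-1] + A[i];
}
```

Both routines produce identical results; the transformation changes only the operation count, which is exactly the O(n⁴) to O(n³) effect in UNAfold at a higher dimension.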