Polyhedral Loop Optimization (Part I) Armin Größlinger SPPEXA Doctoral Retreat 2015 September 14, 2015
Overview
● Monday: Basics of the polyhedral model
  – Modeling
  – Transformation
  – Code generation
● Wednesday: Practice
  – Available tools
  – Use LLVM+Polly to analyze and optimize codes
Polyhedral Compilation
Phases of the Model
1) Modeling
   • Describe loop iterations (= iteration domain)
   • Compute dependences between iterations
2) Transformation(s)
   • Reorder iterations to exhibit desired properties, e.g., parallelism, increased data locality, etc.
   • Must respect dependences
3) Code Generation
   • Turn transformed model into efficient code
Model
● For each statement:
  – iteration domain: (unordered) set of loop iterations
  – schedule: relation between iteration domain and (virtual) multi-dimensional execution time
  – all schedules have the same dimensionality
● For each array access:
  – access relation: maps iteration to memory location
● For each pair of statements:
  – dependences: relation between iteration domains
● All sets and relations are defined by Presburger formulas
Presburger Formulas I
● Modeling must be based on a decidable theory
● Loop iterations and memory locations are discrete units → must use integers (not rationals or reals)
● Theory of inequalities (and congruences) over the integers with addition is decidable.
● Theory of inequalities over the integers with addition and multiplication is not decidable.
● Consequence: use affine expressions (affine = linear + additive constant) and congruences, e.g.
Presburger Formulas II
● Presburger formulas allow existential and universal quantification
● Formulas with quantifiers are equivalent to formulas without quantifiers, e.g.,
● Geometrically, Presburger formulas describe the integer points in polyhedra (possibly with “holes” when quantifiers or congruences are used).
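A small illustration of the quantifier/congruence correspondence (not taken from the slide; the variable names i and k are chosen here): the set of even integers can be written with an existential quantifier or, equivalently, as a quantifier-free congruence. This is also the simplest case of a polyhedron with "holes".

```latex
\[
  \exists k \in \mathbb{Z}:\; i = 2k
  \quad\Longleftrightarrow\quad
  i \equiv 0 \pmod{2}
\]
```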
Modeling Example I
    for (i = 1; i <= n; ++i)
        for (j = 1; j <= i; ++j)
    S:      A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
● Iteration domain:
● Dependences:
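Read off the loop bounds and the array subscripts of S, the domain and the flow dependences of this example can be written as follows. This is a reconstruction from the code, not the slide's original notation; the name D_S is an assumption.

```latex
\begin{align*}
  D_S &= \{\, (i,j) \in \mathbb{Z}^2 \mid 1 \le j \le i \le n \,\} \\
  (i-1,\, j) &\to (i,\, j) \qquad \text{for } 2 \le i \le n,\; 1 \le j \le i-1 \\
  (i,\, j-1) &\to (i,\, j) \qquad \text{for } 2 \le j \le i \le n
\end{align*}
```

The side conditions only state that the source iteration must itself lie in D_S; reads of A[0][j], A[i][0] and A[i-1][i] have no writer inside the loop nest.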
Modeling Example II
    for (i = 2; i <= n; i += 2)
        for (j = 1 + (i/4)*4; j <= i; ++j)
    S:      A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
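The stride and the integer division can still be captured by a Presburger formula, for instance: there exist integers a, b with i = 2a, 4b <= i <= 4b+3, 2 <= i <= n and 4b+1 <= j <= i. The sketch below checks this formula against the loop nest by brute force. The function names `in_domain` and `count_mismatches` and the bound of 64 are illustrative assumptions, not part of the talk.

```c
#include <assert.h>
#include <stdbool.h>

/* Membership test written like the Presburger formula
 *   exists a, b: i = 2a, 4b <= i <= 4b+3, 2 <= i <= n, 4b+1 <= j <= i.
 * The existential quantifiers are resolved by bounded search. */
bool in_domain(int n, int i, int j) {
    if (i < 2 || i > n) return false;
    bool even = false;
    for (int a = 1; 2 * a <= i; ++a)
        if (2 * a == i) even = true;          /* witness for "exists a" */
    if (!even) return false;
    for (int b = 0; 4 * b <= i; ++b)
        if (i <= 4 * b + 3)                   /* witness for "exists b" */
            return 4 * b + 1 <= j && j <= i;
    return false;
}

/* Counts points of [0, n+1]^2 on which loop nest and formula disagree. */
int count_mismatches(int n) {
    assert(n >= 0 && n <= 64);
    bool visited[66][66] = {{false}};
    for (int i = 2; i <= n; i += 2)
        for (int j = 1 + (i / 4) * 4; j <= i; ++j)
            visited[i][j] = true;
    int mismatches = 0;
    for (int i = 0; i <= n + 1; ++i)
        for (int j = 0; j <= n + 1; ++j)
            if (visited[i][j] != in_domain(n, i, j)) ++mismatches;
    return mismatches;
}
```

Calling `count_mismatches(n)` for small n should return 0 if the formula describes exactly the executed iterations. Note that rows with i divisible by 4 are empty, because the lower bound 1 + i exceeds the upper bound i.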
Modeling Example III
    for (i = n; i >= 1; --i)
        for (j = 1; j <= i; ++j)
    S:      A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
Historical solution: loops must be normalized, i.e., loops run forward and have unit stride:
    for (i = -n; i <= -1; ++i)
        for (j = 1; j <= -i; ++j)
    S:      A[-i][j] = 0.5 * (A[-i-1][j] + A[-i][j-1])
Modeling with Domain and Schedule
    for (i = n; i >= 1; --i)
        for (j = 1; j <= i; ++j)
    S:      A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1])
● Use unordered iteration domain + schedule
● Execution order is defined by lexicographic order on schedule
● When no explicit schedule is given, an identity schedule is assumed.
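For the downward loop, one natural choice is the domain {(i, j) : 1 <= j <= i <= n} together with the schedule (i, j) → (-i, j). The sketch below enumerates the domain in an unrelated order, sorts it lexicographically by that schedule, and checks that the result is the execution order of the original loop nest. The type `Iter` and the function names are hypothetical, and the schedule is a reconstruction rather than the slide's own formula.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int i, j; } Iter;

/* Lexicographic comparison of two iterations by their schedule
 * vectors theta(i, j) = (-i, j). */
static int by_schedule(const void *pa, const void *pb) {
    const Iter *a = pa, *b = pb;
    int ta[2] = { -a->i, a->j };
    int tb[2] = { -b->i, b->j };
    for (int d = 0; d < 2; ++d)
        if (ta[d] != tb[d]) return ta[d] < tb[d] ? -1 : 1;
    return 0;
}

/* Collects the domain {(i, j) : 1 <= j <= i <= n} in ascending order of i,
 * then sorts it by schedule. Returns the number of iterations. */
int scheduled_order(int n, Iter *out) {
    int count = 0;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= i; ++j)
            out[count++] = (Iter){ i, j };
    qsort(out, (size_t)count, sizeof *out, by_schedule);
    return count;
}

/* Returns 1 if the scheduled order equals the order of the original loop. */
int matches_original_loop(int n) {
    assert(n >= 0 && n <= 32);
    Iter buf[32 * 33 / 2 + 1];
    int count = scheduled_order(n, buf), pos = 0;
    for (int i = n; i >= 1; --i)
        for (int j = 1; j <= i; ++j, ++pos)
            if (buf[pos].i != i || buf[pos].j != j) return 0;
    return pos == count;
}
```

The point of the separation is that the domain carries no order at all; reversing the loop changes only the schedule, not the set.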
Modeling Example IV
● Do not confuse loop and array dimensions
    for (i=0; i<=n; ++i)
        for (j=0; j<=n; ++j)
            A[i+j] = …
  – Two loops but only one array dimension!
  – Do not identify loop dimensions with array dimensions
● Exercise:
  – Draw the iteration space and the dependences between iterations
Model Extraction from Source Code
● Iteration domains extracted from loop bounds (directly)
● Access relations extracted from array accesses (directly)
● Schedule (for sequential program) can be constructed systematically: for n nested loops, construct a schedule with 2n+1 dimensions; every second dimension ensures the textual order of the statements.
    A;
    for (i=…) {
        B;
        for (j=…) {
            C;
            D;
        }
        E;
    }
    F;
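For the skeleton above (n = 2, hence 5 schedule dimensions), the usual construction alternates textual-position constants with loop counters. The concrete vectors below are a sketch of that construction; padding unused dimensions with 0 is an assumption.

```latex
\begin{align*}
  \theta_A      &= (0,\, 0,\, 0,\, 0,\, 0) \\
  \theta_B(i)   &= (1,\, i,\, 0,\, 0,\, 0) \\
  \theta_C(i,j) &= (1,\, i,\, 1,\, j,\, 0) \\
  \theta_D(i,j) &= (1,\, i,\, 1,\, j,\, 1) \\
  \theta_E(i)   &= (1,\, i,\, 2,\, 0,\, 0) \\
  \theta_F      &= (2,\, 0,\, 0,\, 0,\, 0)
\end{align*}
```

Dimensions 1, 3 and 5 encode the textual order, dimensions 2 and 4 carry the loop counters; comparing these vectors lexicographically reproduces the sequential execution order.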
Computing Dependences
● Computing the dependences is the main work
● There exists a dependence when
  – existence:
  – conflict:
  – order:
  all hold (in the integers!)
● Optimization: remove transitive dependences from the solution, e.g. keep only the last write before a read access → compute the lexicographic maximum of i as a function of j.
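In the usual formulation, the three conditions for a dependence from iteration i of statement S to iteration j of statement T read as follows. The symbols D (iteration domain), f (access function) and θ (schedule) are notation assumed here, since the slide's formulas are not preserved.

```latex
\begin{align*}
  \text{existence:} \quad & i \in D_S \;\wedge\; j \in D_T \\
  \text{conflict:}  \quad & f_S(i) = f_T(j) \\
  \text{order:}     \quad & \theta_S(i) \prec \theta_T(j)
\end{align*}
```

Here ≺ denotes lexicographic precedence, and at least one of the two accesses must be a write.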
Dependence Computation Example
    for (i = 1; i <= n; ++i)
        for (j = 0; j <= 1; ++j)
    S:      A[2*i+j] = …
    for (k = 1; k <= 2*n+1; ++k)
    T:  … = A[k]
● Existence:
● Conflict:
● (Order: S textually before T)
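For this example the existence conditions are the loop bounds (1 <= i <= n, 0 <= j <= 1, 1 <= k <= 2n+1) and the conflict condition is 2i + j = k. The sketch below solves the system by brute force rather than with a tool such as PIP or isl; `Source` and `source_of` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool exists; int i, j; } Source;

/* Finds the iteration (i, j) of S that writes the cell read by T at k. */
Source source_of(int n, int k) {
    assert(n >= 0);
    Source none = { false, 0, 0 };
    if (k < 1 || k > 2 * n + 1) return none;     /* existence: k in T's domain */
    for (int i = 1; i <= n; ++i)                 /* existence: (i, j) in S's domain */
        for (int j = 0; j <= 1; ++j)
            if (2 * i + j == k) {                /* conflict: same memory cell */
                Source found = { true, i, j };
                return found;
            }
    return none;                                 /* no writer, e.g. k == 1 */
}
```

Every cell is written at most once here, so no "last write" selection is needed; the unique solution is i = k / 2 (rounded down), j = k mod 2 for 2 <= k <= 2n+1, and the read of A[1] has no source in S.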
Tools and Compilers
● Tools to solve dependence systems (among others):
  – PIP (“parametric integer programming”, P. Feautrier)
  – isl (“integer set library”, S. Verdoolaege)
● Today part of production compilers
  – IBM XL
  – GCC (“Graphite”)
  – LLVM (“Polly”)
Practical Challenges in Modeling
● Additional complications in real codes:
  – Are “A” and “B” the same array or different arrays?
  – Aliasing of inner dimensions in arrays of arrays (or arrays represented by pointer to pointers), e.g.,
        float **A = …;
        A[3] = A[4];
    → not enough dependences calculated
Transformations
● Compute a new schedule for each statement
● Desired properties
  – maximal parallelism
  – minimal latency
  – locality improvement
Transformation Example
    for (k=1; k<=m; k++) {
        for (i=1; i<n; i++)
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
    }
[Figure: iteration space with axes k and i; arrows (k,i) → (k,i+1) and (k,i) → (k+1,i−1) are flow dependences, arrow (k,i) → (k+1,i) is an output dependence]
Which iterations can be executed in parallel?
Parallel Schedule
[Figure: iteration space with axes k and i; independent iterations highlighted]
● Parallel schedule:
● The second component (k) could be anything else as long as it is linearly independent from the first component.
Transformed Space
[Figure: transformed iteration space with axes p and t]
Transformation:
    t = i + 2k – 3
    p = k – 1
  ⇒ i = t – 2p + 1
    for (int t=0; t<=n+2*m-4; t++) {
        int lb = max(0, (t-n+3)/2);
        int ub = min(m-1, t/2);
        parfor (int p=lb; p<=ub; p++) {
            int i = t - 2*p + 1;
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
        }
    }
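A quick way to gain confidence in such a transformation is to run both loop nests on the same data and compare the results. In the sketch below, `parfor` is replaced by an ordinary loop that runs p in reverse, standing in for "any order within one time step t"; the helper names, the test data and the size limit are assumptions made for illustration.

```c
#include <assert.h>

enum { SKEW_MAX_N = 64 };

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Original loop nest; A holds the n+1 elements A[0..n]. */
void relax_original(float *A, int n, int m) {
    for (int k = 1; k <= m; k++)
        for (int i = 1; i < n; i++)
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
}

/* Transformed nest: t = i + 2k - 3 is sequential, p = k - 1 is parallel. */
void relax_skewed(float *A, int n, int m) {
    for (int t = 0; t <= n + 2*m - 4; t++) {
        int lb = imax(0, (t - n + 3) / 2);
        int ub = imin(m - 1, t / 2);
        /* Reverse order on purpose: the order within one t must not matter. */
        for (int p = ub; p >= lb; p--) {
            int i = t - 2*p + 1;
            A[i] = (A[i-1] + A[i+1]) * 0.5f;
        }
    }
}

/* Returns 1 if both versions produce identical arrays. */
int same_result(int n, int m) {
    assert(n >= 2 && n < SKEW_MAX_N && m >= 1);
    float a[SKEW_MAX_N], b[SKEW_MAX_N];
    for (int i = 0; i <= n; i++)
        a[i] = b[i] = (float)((i * 7) % 5);   /* arbitrary test data */
    relax_original(a, n, m);
    relax_skewed(b, n, m);
    for (int i = 0; i <= n; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

The check works because t grows by 1, 1 and 2 along the dependences (k,i) → (k,i+1), (k,i) → (k+1,i−1) and (k,i) → (k+1,i), so every dependence crosses from one time step to a later one, and iterations sharing a time step touch cells that are two apart.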
Schedules with Rational Coefficients
● Schedules with fractional coefficients make sense!
● Dependence:
● Schedule:
● Interpretation of the schedule:
● Rational schedules are implicitly “floored”.
Computing a new Schedule
● Correctness criterion for a schedule:
● Challenges:
  – How to solve the “implies” part?
  – How to treat lexicographic precedence?
Feautrier's Scheduling Algorithm I
● For each statement S, try to find a schedule where, for each dimension d, i.e. determine the coefficients (for each dimension) such that holds.
Feautrier's Scheduling Algorithm II
● Main idea of Feautrier Scheduler: A linear function is guaranteed to be non-negative over a polyhedron if it is a positive combination of the polyhedron's bounds. (Linear form of Farkas' Lemma)
● Try to solve (starting with ) where are the expressions defining the iteration domains of the involved statements.
● Technique: equate coefficients of the right and left side (the resulting system is affine!)
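The statement used here is commonly given as the affine form of Farkas' lemma. The version below is the standard textbook formulation, not a reconstruction of the slide's formula; the symbols P, a_k, b_k and λ_k are assumed notation.

```latex
\[
  P = \{\, x \mid a_k \cdot x + b_k \ge 0,\; k = 1, \dots, m \,\} \neq \emptyset
\]
\[
  \bigl(\forall x \in P:\ f(x) \ge 0\bigr)
  \quad\Longleftrightarrow\quad
  \exists\, \lambda_0, \dots, \lambda_m \ge 0:\;
  f(x) \equiv \lambda_0 + \sum_{k=1}^{m} \lambda_k \,(a_k \cdot x + b_k)
\]
```

Applied to scheduling, f is an affine expression in the unknown schedule coefficients (for example the schedule difference across a dependence), so equating the coefficients of x on both sides yields constraints that are affine in the schedule coefficients and the multipliers λ_k.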
Feautrier Scheduler III
● In general, not all dependences can be satisfied in the first schedule dimension → try to satisfy as many dependences as possible, satisfy remaining dependences in second dimension if possible, and so on.
● Every dependence is “carried” by a certain schedule dimension c. The schedule must ensure
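One common way to write this requirement for a dependence from iteration i of S to iteration j of T, carried at dimension c, is sketched below. The notation θ^d for the d-th schedule component is an assumption, and the slide's own formula may differ in detail.

```latex
\begin{align*}
  \theta_T^{\,d}(j) - \theta_S^{\,d}(i) &\ge 0 \qquad \text{for all } d < c \\
  \theta_T^{\,c}(j) - \theta_S^{\,c}(i) &\ge 1
\end{align*}
```

Together these two conditions make the schedule difference lexicographically positive, which is what the order condition on dependences asks for.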
Optimization Criteria
● Order in which dependences are satisfied not determined
● Different solutions possible even for a fixed order
● Select an additional target, e.g.,
  – minimize latency (shortest theoretical execution)
  – do not use more processors than necessary
  – improve locality (put source and target of dependences with high data volume on same processor)
● Problem: relation between theoretical optimum (e.g. minimal latency) and performance on real hardware unknown → room for exploration/autotuning/machine learning
Performance of Different Solutions
[Figure: 3D Jacobi Stencil, Experiment by Stefan Kronawitter]
Another Transformation: Tiling
● To improve data locality, tiling is often necessary to speed up a code.
    for (i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            ...

    for (iT=0; iT<n/B; ++iT)
        for (i=B*iT; i<B*iT+B; ++i)
            for (jT=0; jT<n/B; ++jT)
                for (j=B*jT; j<B*jT+B; ++j)
                    ...

    for (iT=0; iT<n/B; ++iT)
        for (jT=0; jT<n/B; ++jT)
            for (i=B*iT; i<B*iT+B; ++i)
                for (j=B*jT; j<B*jT+B; ++j)
                    ...
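The tiled loops on the slide cover the full iteration space only when B divides n. A minimal sketch of the general case clamps the inner bounds with a minimum and lets the tile loops run while B*iT < n; the function below counts visits to confirm that every iteration is executed exactly once. The names `min_int` and `tiled_visits_each_point_once` and the size limit are illustrative assumptions.

```c
#include <assert.h>

enum { TILE_MAX_N = 64 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Walks the n x n iteration space tile by tile (tile size B, which need
 * not divide n) and returns 1 if every point is visited exactly once. */
int tiled_visits_each_point_once(int n, int B) {
    assert(n >= 0 && n <= TILE_MAX_N && B >= 1);
    int visits[TILE_MAX_N][TILE_MAX_N] = {{0}};
    for (int iT = 0; iT * B < n; ++iT)
        for (int jT = 0; jT * B < n; ++jT)
            /* Partial tiles at the border are cut off at n. */
            for (int i = B * iT; i < min_int(B * iT + B, n); ++i)
                for (int j = B * jT; j < min_int(B * jT + B, n); ++j)
                    ++visits[i][j];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (visits[i][j] != 1) return 0;
    return 1;
}
```

Visiting each point once only shows that the tiled nest enumerates the same domain; whether the new order is legal is a separate question that depends on the dependences.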
Limitations with Tiling
● Tiling transformation is only affine for a given tile size
● Parametric tiling requires extensions of the polyhedral model which are not part of wide-spread tools (yet?)
● Tiling not legal in all cases; sufficient criterion: for all d.
Other Transformations
● Index Set Splitting: split iteration domains to allow for better schedules
● Memory Layout Transformations: more cache-friendly data layout
● Save Memory: compute minimum required working set size
● Communication Introduction: enumerate memory elements that must be transferred between tiles
● ...
Code Generation
● After transformation we must generate executable code for the transformed model
● Generated code must be efficient
● Generation of good code is non-trivial