Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time

Louis-Noël Pouchet (1), Cédric Bastoul (1), Albert Cohen (1), John Cavazos (2)

(1) ALCHEMY group, INRIA Saclay / University of Paris-Sud 11, France
(2) Dept. of Computer & Information Sciences, University of Delaware, USA

June 9, 2008
ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Tucson, Arizona
Introduction: Situation PLDI’08

Motivation
◮ New architecture → new high-performance libraries needed
◮ New architecture → new optimization flow needed
◮ Architecture complexity and diversity increase faster than optimization progress
◮ Traditional approaches lose performance portability...

We want a portable optimization process!

INRIA Saclay / U. of Delaware 2 / 18
Introduction: The Problem

The Optimization Problem

Three inputs feed the optimizing compilation process:
◮ Architectural characteristics: ALU, SIMD, caches, ...
◮ Compiler optimization interaction: GCC has 205 passes...
◮ Domain knowledge: linear algebra, FFT, ...

The optimizing compilation process produces code for architecture 1, code for architecture 2, ..., code for architecture N.

Successive views of the optimizing compilation process:
◮ Optimizing compilation process = locality improvement, vectorization, parallelization, etc.
◮ Optimizing compilation process = parameter tuning, phase ordering, etc.
◮ Optimizing compilation process = pattern recognition, hand-tuned kernel codes, etc.
◮ Optimizing compilation process = auto-tuning libraries

In reality, there is a complex interplay between all components.

Our approach: build an expressive set of program versions.
Generating Program Versions: Overview

Iterative Optimization Flow

[Figure: traditional flow. Input code, then the high-level transformations Optimization 1, Optimization 2, ..., Optimization N, then Compiler, then Target code.]

[Figure: proposed flow, with the stages Input code, Set of program versions, Space explorer, Compiler, Target code, Run, and Final code.]

Program version = result of a sequence of loop transformations
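The flow above can be read as a search loop: pick a version, compile and run it, keep the fastest. The sketch below is only an illustration under stated assumptions. `measure_fn`, `explore_versions`, and `toy_runtime` are hypothetical names, the exploration is exhaustive (the slides later describe heuristic and genetic traversals instead), and the measurement is a toy stand-in for a real compile-and-run step.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical callback: build, run, and time one program version. */
typedef double (*measure_fn)(int version, void *ctx);

/* Test versions 0..n_versions-1 one by one and return the index of the
 * fastest; its measured time is stored in *best_time when non-NULL. */
int explore_versions(int n_versions, measure_fn measure, void *ctx,
                     double *best_time)
{
    assert(n_versions > 0 && measure != NULL);
    int best = 0;
    double best_t = DBL_MAX;
    for (int v = 0; v < n_versions; ++v) {
        double t = measure(v, ctx);
        if (t < best_t) {
            best_t = t;
            best = v;
        }
    }
    if (best_time != NULL)
        *best_time = best_t;
    return best;
}

/* Toy stand-in for a real measurement (assumption): version 3 is fastest. */
double toy_runtime(int version, void *ctx)
{
    (void)ctx;
    double d = (double)(version - 3);
    return 1.0 + d * d;
}
```

Calling `explore_versions(10, toy_runtime, NULL, &t)` scans ten versions and reports the one with the smallest toy runtime. In a real setting the callback would generate code for the version, invoke the compiler, and time the resulting binary.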
Generating Program Versions: Properties

Set of Program Versions

What matters is the result of applying the optimizations, not the optimization sequence.

All-in-one approach:
◮ Legality: semantics are always preserved
◮ Uniqueness: all versions in the set are distinct
◮ Expressiveness: a version is the result of an arbitrarily complex sequence of loop transformations
Generating Program Versions: The Representation

The Polyhedral Model in a Nutshell
◮ Arbitrarily complex sequences of loop transformations are modeled in a single optimization step: a new scheduling matrix Θ
◮ Granularity: each executed instance of each statement

    for (i = ...; i < ...; ++i)
      S1(i);
    for (i = ...; i < ...; ++i)
      S2(i);

◮ First row of Θ → all outermost loops
    for (i = ...; i < ...; ++i)
      for (j = ...; j < ...; ++j)
        S1(i,j);
    for (i = ...; i < ...; ++i)
      for (j = ...; j < ...; ++j)
        S2(i,j);

◮ Second row of Θ → all next outermost loops
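One way to read "first row, second row" is that each row of Θ gives one component of a timestamp, and statement instances run in lexicographic order of those timestamps. The sketch below illustrates this for a single statement in a two-deep nest. The `Schedule` layout (coefficient of i, coefficient of j, constant) and the function names are assumptions made for the example, and parameter columns are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 2 /* number of schedule rows (time dimensions) */
#define NCOL 3  /* assumed columns: coefficient of i, of j, constant */

typedef struct {
    int row[DEPTH][NCOL];
} Schedule;

/* Timestamp of instance (i, j): one component per schedule row. */
void timestamp(const Schedule *s, int i, int j, int out[DEPTH])
{
    assert(s != NULL && out != NULL);
    for (int r = 0; r < DEPTH; ++r)
        out[r] = s->row[r][0] * i + s->row[r][1] * j + s->row[r][2];
}

/* Returns 1 when instance (i1, j1) runs strictly before (i2, j2), i.e.
 * when its timestamp is lexicographically smaller; 0 otherwise. */
int executes_before(const Schedule *s, int i1, int j1, int i2, int j2)
{
    int a[DEPTH], b[DEPTH];
    timestamp(s, i1, j1, a);
    timestamp(s, i2, j2, b);
    for (int r = 0; r < DEPTH; ++r) {
        if (a[r] != b[r])
            return a[r] < b[r]; /* the first differing row decides */
    }
    return 0; /* identical timestamps: no order between the two */
}
```

With the rows `{1,0,0}` and `{0,1,0}` the first row orders instances by `i`, so `i` is the outermost loop. Swapping the two rows makes `j` the outermost loop instead.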
    for (j = ...; j < ...; ++j)
      S2(...,j);
    for (i = ...; i < ...; ++i)
      for (j = ...; j < ...; ++j) {
        S1(i,j);
        S2(i,j);
      }

◮ A minor change in Θ → a significant impact on the code
Transformations expressed by each part of Θ (ı, p, c):

    Θ part   Transformation   Description
    ı        reversal         Changes the direction in which a loop traverses its iteration range
    ı        skewing          Makes the bounds of a given loop depend on an outer loop counter
    ı        interchange      Exchanges two loops in a perfectly nested loop, a.k.a. permutation
    p        fusion           Fuses two loops, a.k.a. jamming
    p        distribution     Splits a single loop nest into many, a.k.a. fission or splitting
    c        peeling          Extracts one iteration of a given loop
    c        shifting         Allows loops to be reordered
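To make the table more concrete, the sketch below writes a few of these transformations as affine maps on a two-deep nest, t = A·(i, j) + c. These are the usual textbook encodings, not the matrices from the slide (which did not survive extraction). The `Affine2` type and the constant names are hypothetical, and the parameter part p (fusion, distribution) is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Affine map on a 2-deep nest: t = A * (i, j) + c.
 * A holds the iterator coefficients, c the constant part. */
typedef struct {
    int A[2][2];
    int c[2];
} Affine2;

const Affine2 IDENTITY    = {{{ 1, 0}, {0, 1}}, {0, 0}};
const Affine2 INTERCHANGE = {{{ 0, 1}, {1, 0}}, {0, 0}}; /* swap i and j      */
const Affine2 REVERSAL    = {{{-1, 0}, {0, 1}}, {0, 0}}; /* run i backwards   */
const Affine2 SKEWING     = {{{ 1, 0}, {1, 1}}, {0, 0}}; /* inner time = i+j  */
const Affine2 SHIFTING    = {{{ 1, 0}, {0, 1}}, {1, 0}}; /* delay outer by 1  */

/* Compute the new time coordinates t[0], t[1] of instance (i, j). */
void apply(const Affine2 *m, int i, int j, int t[2])
{
    assert(m != NULL && t != NULL);
    t[0] = m->A[0][0] * i + m->A[0][1] * j + m->c[0];
    t[1] = m->A[1][0] * i + m->A[1][1] * j + m->c[1];
}
```

Each transformation differs from `IDENTITY` in only a few coefficients, which is consistent with the earlier remark that a minor change in Θ can have a significant impact on the generated code.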
Generating Program Versions: Contributions

Previous Contributions

Previous work (CGO’07, Part I, One-Dimensional Time):
◮ Focus on Static Control Parts (SCoP)
◮ SCoP: consecutive set of statements with affine control flow
◮ Complete framework for one-dimensional schedules
◮ Efficient search space construction, efficient traversal
◮ Drawbacks in applicability
◮ Drawbacks in expressiveness

We previously solved a simpler problem...
Generating Program Versions: Contributions

New Contributions

Dealing with multidimensional schedules:
◮ Applicability to any Static Control Part
◮ Increased expressiveness
◮ Design of scalable traversal methods
◮ Dedicated genetic algorithm
◮ Dedicated heuristic
Generating Program Versions: Looking Into Details

Deeper In The Method

Multidimensional schedules: high expressiveness, complex problem

[Figure: pipeline relating Set of program versions, Space construction, Distinct schedules, Space traversal, and Tested versions]

Space construction:
- combinatorial expression of legality
- heuristic needed: greedy selection of dependences + ordering (see Some Efficient Solutions to the Affine Scheduling Problem, Part II: Multidimensional Time, Feautrier, 1992)
- code-generation-friendly bounds on the schedule coefficients

Space traversal:
- multiple polytopes to traverse
- large and expressive spaces (up to 10^50)
- partial enumeration (mandatory): completion mechanism + subspace partitioning
- shape the space: optimized polytope projection (required) + constrained dynamic scan
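As a rough illustration of what legality means for a multidimensional schedule, the sketch below checks a schedule against an explicit list of dependent instance pairs: each target must receive a lexicographically larger timestamp than its source. This brute-force, point-wise check is an assumption made for the example and is not the method of the slides, which build the space of legal schedules directly from polytopes. The `Schedule` and `Dependence` types and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 2 /* number of schedule rows (time dimensions) */

typedef struct {
    int coef[DEPTH][2]; /* iterator coefficients for (i, j) */
    int shift[DEPTH];   /* constant part */
} Schedule;

typedef struct {
    int src[2]; /* instance (i, j) that must run first */
    int dst[2]; /* instance (i, j) that depends on src  */
} Dependence;

static void stamp(const Schedule *s, const int p[2], int out[DEPTH])
{
    for (int r = 0; r < DEPTH; ++r)
        out[r] = s->coef[r][0] * p[0] + s->coef[r][1] * p[1] + s->shift[r];
}

/* Strict lexicographic comparison of two timestamps. */
static int lex_less(const int a[DEPTH], const int b[DEPTH])
{
    for (int r = 0; r < DEPTH; ++r) {
        if (a[r] != b[r])
            return a[r] < b[r];
    }
    return 0;
}

/* Returns 1 when every dependence source is scheduled strictly before its
 * target, 0 as soon as one dependence is violated. */
int schedule_is_legal(const Schedule *s, const Dependence *deps, size_t n)
{
    assert(s != NULL && (deps != NULL || n == 0));
    for (size_t d = 0; d < n; ++d) {
        int ts[DEPTH], td[DEPTH];
        stamp(s, deps[d].src, ts);
        stamp(s, deps[d].dst, td);
        if (!lex_less(ts, td))
            return 0;
    }
    return 1;
}
```

For a dependence from (i, j) to (i, j+1), the identity schedule and a loop interchange both pass the check, while reversing the `j` loop fails it because the target would run before its source.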
Traversing the Search Space: Extensive Analysis

Observations on the Performance Distribution

    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++) {
        tmp[i][j] = 0.0;
        for (k = 0; k < M; k++)
          tmp[i][j] += block[i][k] * cos1[j][k];
      }
    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++) {
        sum2 = 0.0;
        for (k = 0; k < M; k++)
          sum2 += cos1[i][k] * tmp[k][j];
        block[i][j] = ROUND(sum2);
      }

[Figure: Performance distribution, 8x8 DCT. x-axis: point index for the first schedule row (0 to 60); y-axis: performance improvement (0 to 1.6); curves: Best, Average, Worst.]

◮ Extensive study of the 8x8 Discrete Cosine Transform (UTDSP)
◮ Search space analyzed: 66 × 19683 ≈ 1.3 × 10^6 different legal program versions
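The kernel on this slide omits its declarations. The sketch below wraps it in a compilable function so the two loop nests can be tried directly. Several details are assumptions and not taken from the slide or from UTDSP: the `float` element type, the function name `dct8x8`, the local `tmp` buffer, and the definition of `ROUND` (round half away from zero).

```c
#include <assert.h>
#include <stddef.h>

#define M 8

/* Assumed rounding macro: the slide does not show how ROUND is defined. */
#define ROUND(x) ((float)(int)((x) >= 0.0f ? (x) + 0.5f : (x) - 0.5f))

/* In-place 8x8 transform: block = ROUND(cos1 * (block * cos1^T)). */
void dct8x8(float block[M][M], float cos1[M][M])
{
    assert(block != NULL && cos1 != NULL);
    float tmp[M][M];
    int i, j, k;

    /* First nest: tmp = block * cos1^T */
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++) {
            tmp[i][j] = 0.0f;
            for (k = 0; k < M; k++)
                tmp[i][j] += block[i][k] * cos1[j][k];
        }

    /* Second nest: block = ROUND(cos1 * tmp) */
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++) {
            float sum2 = 0.0f;
            for (k = 0; k < M; k++)
                sum2 += cos1[i][k] * tmp[k][j];
            block[i][j] = ROUND(sum2);
        }
}
```

A quick sanity check is to pass the identity matrix as `cos1`: both products then leave an integer-valued `block` unchanged. The two nests communicate only through `tmp`, which is what gives the scheduler room to reorder and fuse them.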