A Note on the Performance Distribution of Affine Schedules

  1. A Note on the Performance Distribution of Affine Schedules

     Louis-Noël Pouchet (1), Cédric Bastoul (1), John Cavazos (2) and Albert Cohen (1)
     (1) ALCHEMY, INRIA Futurs / University of Paris-Sud XI, France
     (2) Computer and Information Sciences, University of Delaware, USA

     January 27, 2008
     2nd Workshop on Statistical and Machine learning approaches to ARchitectures and compilaTion, Göteborg, Sweden

  2. Outline (SMART'08)

     Motivation
     ◮ Automatic performance portability: iterative compilation
     ◮ Search space expressiveness → bring the iterative optimization problem into the polyhedral model
     ◮ Trade-off between expressiveness and ease of traversal
     ◮ Improve static characterization of the search space
     ◮ Highlight dynamic properties
     ◮ Validate a dedicated heuristic to traverse the space

  3. Building the Search Space: The Model

     Original Schedule

     Original code (the identity schedules below regenerate the same loop nest):

         for (i = 0; i < n; ++i)
           for (j = 0; j < n; ++j) {
         S1: C[i][j] = 0;
             for (k = 0; k < n; ++k)
         S2:   C[i][j] += A[i][k] * B[k][j];
           }

     Schedules:

         \[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

         \[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

     ◮ Represent Static Control Parts (control flow and dependences must be statically computable)
     ◮ Use a code generator (e.g. CLooG) to generate C code from the polyhedral representation (given iteration domains + schedules)
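
To make the notation concrete, the sketch below evaluates Θ · x for a single statement instance, which yields the instance's logical timestamp. The `Schedule` type, the `apply_schedule` function and the fixed array bounds are illustrative assumptions; they are not part of the slides or of CLooG.

```c
#include <assert.h>

#define MAX_ROWS 3   /* assumed bound: at most 3 schedule dimensions */
#define MAX_COLS 5   /* assumed bound: iterators, then n, then the constant 1 */

/* Hypothetical container for a schedule matrix Theta (rows x cols). */
typedef struct {
    int rows, cols;
    int coef[MAX_ROWS][MAX_COLS];
} Schedule;

/* Compute t = Theta . x, where x = (iterators..., n, 1)^T has s->cols entries
   and t receives s->rows entries. */
static void apply_schedule(const Schedule *s, const int *x, int *t)
{
    assert(s->rows <= MAX_ROWS && s->cols <= MAX_COLS);
    for (int r = 0; r < s->rows; ++r) {
        t[r] = 0;
        for (int c = 0; c < s->cols; ++c)
            t[r] += s->coef[r][c] * x[c];
    }
}

/* The identity schedules of the original code. */
static const Schedule theta_s1 = { 2, 4, { {1, 0, 0, 0}, {0, 1, 0, 0} } };
static const Schedule theta_s2 = { 3, 5, { {1, 0, 0, 0, 0},
                                           {0, 1, 0, 0, 0},
                                           {0, 0, 1, 0, 0} } };
```

Under these schedules the instance S2(i=2, j=3, k=4) maps to the timestamp (2, 3, 4) for any n. Instances are executed in lexicographic order of their timestamps.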

  6. Building the Search Space: The Model

     Distribute loops

     Schedules:

         \[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

         \[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

     Transformed code:

         for (i = 0; i < n; ++i)
           for (j = 0; j < n; ++j)
             C[i][j] = 0;
         for (i = n; i < 2*n; ++i)
           for (j = 0; j < n; ++j)
             for (k = 0; k < n; ++k)
               C[i-n][j] += A[i-n][k] * B[k][j];

     ◮ All instances of S1 are executed before the first S2 instance

  7. Building the Search Space: The Model

     Distribute loops + Interchange loops for S2

     Schedules:

         \[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

         \[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

     Transformed code:

         for (i = 0; i < n; ++i)
           for (j = 0; j < n; ++j)
             C[i][j] = 0;
         for (k = n; k < 2*n; ++k)
           for (j = 0; j < n; ++j)
             for (i = 0; i < n; ++i)
               C[i][j] += A[i][k-n] * B[k-n][j];

     ◮ The outer-most loop for S2 becomes k

  8. Building the Search Space: The Model

     Illegal schedule

     Schedules:

         \[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

         \[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

     Transformed code:

         for (k = 0; k < n; ++k)
           for (j = 0; j < n; ++j)
             for (i = 0; i < n; ++i)
               C[i][j] += A[i][k] * B[k][j];
         for (i = n; i < 2*n; ++i)
           for (j = 0; j < n; ++j)
             C[i-n][j] = 0;

     ◮ All instances of S1 are executed after the last S2 instance
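
Why this schedule is illegal can be checked by brute force: every initialization S1(i, j) must run before every accumulation S2(i, j, k) into the same cell. The sketch below tests only that one dependence, only for sampled sizes n, and assumes that instances with equal timestamps on the common dimensions run in textual order (S1 first). The `Schedule` type and all function names are hypothetical helpers. This is a sampled test, not the exact polyhedral legality condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for a schedule matrix (not a CLooG type). */
typedef struct {
    int rows, cols;      /* cols = number of iterators + 2 (for n and 1) */
    int coef[3][5];
} Schedule;

/* Row r of Theta applied to the vector x (s->cols entries). */
static int timestamp(const Schedule *s, int r, const int *x)
{
    int t = 0;
    for (int c = 0; c < s->cols; ++c)
        t += s->coef[r][c] * x[c];
    return t;
}

/* True if S2(i,j,k) is scheduled strictly before S1(i,j), comparing the
   timestamps lexicographically on the dimensions both schedules define. */
static bool s2_precedes_s1(const Schedule *s1, const Schedule *s2,
                           int i, int j, int k, int n)
{
    const int x1[4] = { i, j, n, 1 };
    const int x2[5] = { i, j, k, n, 1 };
    assert(s1->cols == 4 && s2->cols == 5);
    int common = s1->rows < s2->rows ? s1->rows : s2->rows;
    for (int r = 0; r < common; ++r) {
        int t1 = timestamp(s1, r, x1);
        int t2 = timestamp(s2, r, x2);
        if (t2 < t1) return true;
        if (t2 > t1) return false;
    }
    return false;        /* tie: assumed textual order, S1 first */
}

/* Sampled check of the S1 -> S2 dependence for n = 1 .. max_n. */
static bool respects_init_dependence(const Schedule *s1, const Schedule *s2,
                                     int max_n)
{
    for (int n = 1; n <= max_n; ++n)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    if (s2_precedes_s1(s1, s2, i, j, k, n))
                        return false;
    return true;
}
```

With the matrices of the "Distribute loops" slide the check passes for every sampled n. With the matrices of the illegal schedule it fails already at n = 1, where S2 gets timestamp 0 and S1 gets timestamp 1 on the first dimension.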

  9. Building the Search Space: The Model

     A legal schedule

     Schedules:

         \[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

         \[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

     Transformed code:

         for (i = n; i < 2*n; ++i)
           for (j = 0; j < n; ++j)
             C[i-n][j] = 0;
         for (k = n+1; k <= 2*n; ++k)
           for (j = 0; j < n; ++j)
             for (i = 0; i < n; ++i)
               C[i][j] += A[i][k-n-1] * B[k-n-1][j];

     ◮ Delay the S2 instances
     ◮ Constraints must be expressed between \Theta^{S_1} and \Theta^{S_2}

  10. Building the Search Space: The Model

      Implicit fine-grain parallelism

      Schedules:

          \[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

          \[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

      Transformed code:

          for (i = 0; i < n; ++i)
            pfor (j = 0; j < n; ++j)
              C[i][j] = 0;
          for (k = n; k < 2*n; ++k)
            pfor (j = 0; j < n; ++j)
              pfor (i = 0; i < n; ++i)
                C[i][j] += A[i][k-n] * B[k-n][j];

      ◮ Number of rows of Θ ↔ number of outer-most sequential loops
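
The `pfor` keyword above is slide notation for a loop whose iterations may run in parallel. One possible rendering in plain C is sketched below, assuming OpenMP as the parallel runtime (the slides do not name one) and a fixed illustrative size `N` in place of the symbolic n. The function name `matmul_fine_grain` is hypothetical.

```c
#define N 8   /* illustrative fixed size; the slides use a symbolic n */

/* One-row schedules: only the outer i loop (for S1) and the outer k loop
   (for S2) are sequential; the inner loops carry no dependence.
   Compiled without OpenMP support, the pragmas are ignored and the code
   runs sequentially with the same result. */
static void matmul_fine_grain(double C[N][N], double A[N][N], double B[N][N])
{
    for (int i = 0; i < N; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < N; ++j)
            C[i][j] = 0.0;
    }
    for (int k = N; k < 2 * N; ++k) {
        /* each (j, i) pair updates a distinct cell C[i][j] */
        #pragma omp parallel for collapse(2)
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                C[i][j] += A[i][k - N] * B[k - N][j];
    }
}
```

Opening a parallel region inside each sequential iteration keeps the sketch close to the slide's loop structure; a tuned version would likely hoist the region out of the outer loops.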

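  12. Building the Search Space: The Model

      Representing a schedule

      Per-statement schedules (same schedules and code as the legal schedule of slide 9):

          \[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

          \[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

      Both schedules gathered into a single matrix:

          \[ \Theta \cdot \vec{x} = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot (\, i \;\; j \;\; i \;\; j \;\; k \;\; n \;\; n \;\; 1 \;\; 1 \,)^T \]

      The columns are grouped as \vec{\imath} (iterator coefficients, columns 1 to 5), \vec{p} (parameter coefficients, columns 6 and 7) and c (constant coefficients, columns 8 and 9).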

  13. Building the Search Space: The Model

      Representing a schedule

      Schedules and code are those of the legal schedule of slide 9. Each group of coefficients of Θ expresses a family of transformations:

          Coefficients   Transformation   Description
          \vec{\imath}   reversal         Changes the direction in which a loop traverses its iteration range
                         skewing          Makes the bounds of a given loop depend on an outer loop counter
                         interchange      Exchanges two loops in a perfectly nested loop, a.k.a. permutation
          \vec{p}        fusion           Fuses two loops, a.k.a. jamming
                         distribution     Splits a single loop nest into many, a.k.a. fission or splitting
          c              peeling          Extracts one iteration of a given loop
                         shifting         Allows loops to be reordered

  14. Building the Search Space: The Search Space

      Challenges
      ◮ Completeness (combinatorial problem)
      ◮ Scalability (large integer polyhedra computation)

      Proposed solution
      ◮ Philosophically close to Feautrier's maximal fine-grain parallelism
      ◮ One point in the space ⇔ one distinct legal program version
      ◮ Bound schedule coefficients in [−1, 1] to limit control overhead
      ◮ No completeness, but decent scalability
      ◮ Deliver a mechanism to automatically complete / correct schedules
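
Bounding every coefficient to {−1, 0, 1} makes the candidate set finite: a row with m coefficients has 3^m candidates before any legality filtering. For the running example, a first row for S1 (4 coefficients) paired with a first row for S2 (5 coefficients) gives 3^9 = 19683 pairs. The sketch below enumerates them and keeps those that never place S2(i, j, k) before S1(i, j) on that first dimension. It is an enumerate-and-filter stand-in with a sampled check; it is not the construction used by the authors, and all names are hypothetical.

```c
#include <stdbool.h>

enum { S1_COLS = 4, S2_COLS = 5 };   /* columns: (i j n 1) and (i j k n 1) */

/* 3^e for small non-negative e. */
static long pow3(int e)
{
    long p = 1;
    while (e-- > 0)
        p *= 3;
    return p;
}

/* Decode idx in [0, 3^len) into len coefficients, each in {-1, 0, 1}. */
static void decode_coefficients(long idx, int *coef, int len)
{
    for (int c = 0; c < len; ++c) {
        coef[c] = (int)(idx % 3) - 1;
        idx /= 3;
    }
}

/* Sampled check for n = 1 .. max_n: on this schedule dimension, S2(i,j,k)
   never gets a smaller timestamp than S1(i,j). */
static bool row_keeps_s1_first(const int *r1, const int *r2, int max_n)
{
    for (int n = 1; n <= max_n; ++n)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k) {
                    int t1 = r1[0] * i + r1[1] * j + r1[2] * n + r1[3];
                    int t2 = r2[0] * i + r2[1] * j + r2[2] * k + r2[3] * n + r2[4];
                    if (t2 < t1)
                        return false;
                }
    return true;
}

/* Count the (S1 row, S2 row) pairs that pass the sampled check. */
static long count_candidate_first_rows(int max_n)
{
    long kept = 0;
    int r1[S1_COLS], r2[S2_COLS];
    for (long a = 0; a < pow3(S1_COLS); ++a) {
        decode_coefficients(a, r1, S1_COLS);
        for (long b = 0; b < pow3(S2_COLS); ++b) {
            decode_coefficients(b, r2, S2_COLS);
            if (row_keeps_s1_first(r1, r2, max_n))
                ++kept;
        }
    }
    return kept;
}
```

Passing this filter is necessary for legality with respect to the S1 → S2 dependence but not sufficient for a full schedule: pairs that tie on the first dimension still need later rows to order the instances, and other dependences are not examined.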
