A Note on the Performance Distribution of Affine Schedules

Louis-Noël Pouchet (1), Cédric Bastoul (1), John Cavazos (2) and Albert Cohen (1)
(1) ALCHEMY, INRIA Futurs / University of Paris-Sud XI, France
(2) Computer and Information Sciences, University of Delaware, USA

January 27, 2008
2nd Workshop on Statistical and Machine learning approaches to ARchitectures and compilaTion
Göteborg, Sweden

Outline: SMART’08

Motivation
◮ Automatic performance portability: iterative compilation
◮ Search space expressiveness → bring the iterative optimization problem into the polyhedral model
◮ Tradeoff expressiveness / traversal easiness
◮ Improve static characterization of the search space
◮ Highlight dynamic properties
◮ Validate a dedicated heuristic to traverse the space

Building the Search Space: The Model
Original Schedule

Original code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
    S1: C[i][j] = 0;
        for (k = 0; k < n; ++k)
    S2:   C[i][j] += A[i][k] * B[k][j];
      }

Schedules:

\[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

\[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

Generated code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
        C[i][j] = 0;
        for (k = 0; k < n; ++k)
          C[i][j] += A[i][k] * B[k][j];
      }

◮ Represent Static Control Parts (control flow and dependences must be statically computable)
◮ Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules)
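
As a minimal sketch of what the schedule equations on this slide compute: multiplying Θ by the vector (iterators, parameters, 1) yields the logical date at which a statement instance runs. The helper name `apply_schedule`, the array names and the row-major layout are assumptions made for illustration, not something taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Identity schedules of the original program, stored row-major.
   Theta_S1 is 2x4 and applies to (i, j, n, 1);
   Theta_S2 is 3x5 and applies to (i, j, k, n, 1). */
const int THETA_S1_ORIGINAL[2 * 4] = { 1, 0, 0, 0,
                                       0, 1, 0, 0 };
const int THETA_S2_ORIGINAL[3 * 5] = { 1, 0, 0, 0, 0,
                                       0, 1, 0, 0, 0,
                                       0, 0, 1, 0, 0 };

/* Compute out = theta * x, where theta has `rows` rows and `cols`
   columns. out[0..rows-1] is the logical date of the instance. */
void apply_schedule(const int *theta, size_t rows, size_t cols,
                    const int *x, int *out)
{
    assert(theta != NULL && x != NULL && out != NULL);
    for (size_t r = 0; r < rows; ++r) {
        out[r] = 0;
        for (size_t c = 0; c < cols; ++c)
            out[r] += theta[r * cols + c] * x[c];
    }
}
```

With the identity schedule, the instance S2(i=2, j=3, k=1) for n=8 is mapped to the date (2, 3, 1), i.e. the original execution order.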

Building the Search Space: The Model
Distribute loops

Original code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
    S1: C[i][j] = 0;
        for (k = 0; k < n; ++k)
    S2:   C[i][j] += A[i][k] * B[k][j];
      }

Schedules:

\[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

\[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

Generated code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j)
        C[i][j] = 0;
    for (i = n; i < 2*n; ++i)
      for (j = 0; j < n; ++j)
        for (k = 0; k < n; ++k)
          C[i-n][j] += A[i-n][k] * B[k][j];

◮ All instances of S1 are executed before the first S2 instance

Building the Search Space: The Model
Distribute loops + Interchange loops for S2

Original code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
    S1: C[i][j] = 0;
        for (k = 0; k < n; ++k)
    S2:   C[i][j] += A[i][k] * B[k][j];
      }

Schedules:

\[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

\[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

Generated code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j)
        C[i][j] = 0;
    for (k = n; k < 2*n; ++k)
      for (j = 0; j < n; ++j)
        for (i = 0; i < n; ++i)
          C[i][j] += A[i][k-n] * B[k-n][j];

◮ The outer-most loop for S2 becomes k
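
The two transformed nests above can be compared against the original one with a small sketch. The function names and the flat row-major storage (`C[i * n + j]` in place of `C[i][j]`) are assumptions made so that the code is self-contained; the loop structures follow the generated code on the slides.

```c
#include <assert.h>

/* Original nest: S1 and S2 interleaved. */
void matmul_original(int n, const double *A, const double *B, double *C)
{
    assert(n > 0 && A && B && C);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            C[i * n + j] = 0.0;
            for (int k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
        }
}

/* Distribute loops: S2 runs at dates i + n, after every S1 instance. */
void matmul_distributed(int n, const double *A, const double *B, double *C)
{
    assert(n > 0 && A && B && C);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i * n + j] = 0.0;
    for (int i = n; i < 2 * n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[(i - n) * n + j] += A[(i - n) * n + k] * B[k * n + j];
}

/* Distribute + interchange: the outer-most S2 loop is k, at dates k + n. */
void matmul_distributed_interchanged(int n, const double *A,
                                     const double *B, double *C)
{
    assert(n > 0 && A && B && C);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i * n + j] = 0.0;
    for (int k = n; k < 2 * n; ++k)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                C[i * n + j] += A[i * n + (k - n)] * B[(k - n) * n + j];
}
```

In all three versions each C[i][j] is zeroed before its first update and accumulates its products in increasing k order, so the three functions produce identical results on the same inputs.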
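
Building the Search Space: The Model
Illegal schedule

Original code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
    S1: C[i][j] = 0;
        for (k = 0; k < n; ++k)
    S2:   C[i][j] += A[i][k] * B[k][j];
      }

Schedules:

\[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

\[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

Generated code:

    for (k = 0; k < n; ++k)
      for (j = 0; j < n; ++j)
        for (i = 0; i < n; ++i)
          C[i][j] += A[i][k] * B[k][j];
    for (i = n; i < 2*n; ++i)
      for (j = 0; j < n; ++j)
        C[i-n][j] = 0;

◮ All instances of S1 are executed after the last S2 instance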
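
Why this schedule is rejected can be illustrated with a brute-force check, sketched below under stated assumptions. It only tests one dependence, namely that S1(i,j) must run before every S2(i,j,k) writing the same C[i][j], and it compares dates lexicographically on the first two schedule rows for one concrete value of n. The function and array names are hypothetical, and exhaustive enumeration is a stand-in for illustration, not the method used in the talk.

```c
#include <assert.h>

/* Schedules from the slides: rows apply to (i, j, n, 1) for S1
   and to (i, j, k, n, 1) for S2. */
const int DISTRIBUTED_S1[2][4] = { { 1, 0, 0, 0 }, { 0, 1, 0, 0 } };
const int DISTRIBUTED_S2[3][5] = { { 1, 0, 0, 1, 0 },
                                   { 0, 1, 0, 0, 0 },
                                   { 0, 0, 1, 0, 0 } };
const int ILLEGAL_S1[2][4] = { { 1, 0, 1, 0 }, { 0, 1, 0, 0 } };
const int ILLEGAL_S2[3][5] = { { 0, 0, 1, 0, 0 },
                               { 0, 1, 0, 0, 0 },
                               { 1, 0, 0, 0, 0 } };

/* Returns 1 if, for this n, every S1(i,j) gets a date that is strictly
   lexicographically smaller than the date of every S2(i,j,k), looking
   at the first two schedule rows only; returns 0 otherwise. */
int s1_before_s2(const int th1[2][4], const int th2[3][5], int n)
{
    assert(n > 0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) {
                int a[2], b[2];
                for (int r = 0; r < 2; ++r) {
                    a[r] = th1[r][0] * i + th1[r][1] * j
                         + th1[r][2] * n + th1[r][3];
                    b[r] = th2[r][0] * i + th2[r][1] * j + th2[r][2] * k
                         + th2[r][3] * n + th2[r][4];
                }
                int before = (a[0] < b[0])
                          || (a[0] == b[0] && a[1] < b[1]);
                if (!before)
                    return 0;
            }
    return 1;
}
```

For n = 4 the check accepts the distributed schedule (dates i versus i + n) and rejects the schedule of this slide, where S1(0,0) is dated n while S2(0,0,0) is dated 0.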

Building the Search Space: The Model
A legal schedule

Original code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
    S1: C[i][j] = 0;
        for (k = 0; k < n; ++k)
    S2:   C[i][j] += A[i][k] * B[k][j];
      }

Schedules:

\[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

\[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

Generated code:

    for (i = n; i < 2*n; ++i)
      for (j = 0; j < n; ++j)
        C[i-n][j] = 0;
    for (k = n+1; k <= 2*n; ++k)
      for (j = 0; j < n; ++j)
        for (i = 0; i < n; ++i)
          C[i][j] += A[i][k-n-1] * B[k-n-1][j];

◮ Delay the S2 instances
◮ Constraints must be expressed between Θ^{S1} and Θ^{S2}
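
Building the Search Space: The Model
Implicit fine-grain parallelism

Original code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
    S1: C[i][j] = 0;
        for (k = 0; k < n; ++k)
    S2:   C[i][j] += A[i][k] * B[k][j];
      }

Schedules:

\[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

\[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

Generated code (pfor denotes a parallel loop):

    for (i = 0; i < n; ++i)
      pfor (j = 0; j < n; ++j)
        C[i][j] = 0;
    for (k = n; k < 2*n; ++k)
      pfor (j = 0; j < n; ++j)
        pfor (i = 0; i < n; ++i)
          C[i][j] += A[i][k-n] * B[k-n][j];

◮ Number of rows of Θ ↔ number of outer-most sequential loops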
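
One possible way to realise the pfor loops above in plain C is OpenMP, sketched below. The slide does not say which parallel runtime is meant, so the use of OpenMP, the function name `matmul_one_row_schedule` and the flat row-major arrays are all assumptions. Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <assert.h>

/* One-row schedules: only the outer-most loop of each nest is
   sequential; the loops not covered by the schedule are parallel. */
void matmul_one_row_schedule(int n, const double *A, const double *B,
                             double *C)
{
    assert(n > 0 && A && B && C);

    /* S1 at date i: the j loop carries no dependence. */
    for (int i = 0; i < n; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < n; ++j)
            C[i * n + j] = 0.0;
    }

    /* S2 at date k + n: for a fixed k, every (i, j) pair updates a
       distinct element of C, so both inner loops are parallel. */
    for (int k = n; k < 2 * n; ++k) {
        #pragma omp parallel for collapse(2)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                C[i * n + j] += A[i * n + (k - n)] * B[(k - n) * n + j];
    }
}
```

The sequential k loop preserves the accumulation order of each C[i][j], which is why the parallel inner loops do not change the result.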

Building the Search Space: The Model
Representing a schedule

Original code:

    for (i = 0; i < n; ++i)
      for (j = 0; j < n; ++j) {
    S1: C[i][j] = 0;
        for (k = 0; k < n; ++k)
    S2:   C[i][j] += A[i][k] * B[k][j];
      }

Schedules:

\[ \Theta^{S_1} \cdot \vec{x}_{S_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \]

\[ \Theta^{S_2} \cdot \vec{x}_{S_2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} \]

Generated code:

    for (i = n; i < 2*n; ++i)
      for (j = 0; j < n; ++j)
        C[i-n][j] = 0;
    for (k = n+1; k <= 2*n; ++k)
      for (j = 0; j < n; ++j)
        for (i = 0; i < n; ++i)
          C[i][j] += A[i][k-n-1] * B[k-n-1][j];

Both schedules combined into a single matrix:

\[ \Theta \cdot \vec{x} = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i & j & i & j & k & n & n & 1 & 1 \end{pmatrix}^T \]

The first five columns hold the iterator coefficients \vec{\imath} (i, j of S1, then i, j, k of S2), the next two the parameter coefficients \vec{p} (n of S1, n of S2), and the last two the constant coefficients c (S1, S2).

Transformations enabled by each class of coefficients:

    Coefficients   Transformation   Description
    \vec{\imath}   reversal         Changes the direction in which a loop traverses its iteration range
                   skewing          Makes the bounds of a given loop depend on an outer loop counter
                   interchange      Exchanges two loops in a perfectly nested loop, a.k.a. permutation
    \vec{p}        fusion           Fuses two loops, a.k.a. jamming
                   distribution     Splits a single loop nest into many, a.k.a. fission or splitting
    c              peeling          Extracts one iteration of a given loop
                   shifting         Allows loops to be reordered
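
The column layout of the combined matrix can be made concrete with a short sketch that splits it back into per-statement schedules. The names `THETA`, `extract_s1` and `extract_s2` are hypothetical, and the column order (S1 iterators, S2 iterators, S1 parameter, S2 parameter, S1 constant, S2 constant) is read off the vector (i j i j k n n 1 1) on the slide. Since S1 has only two schedule rows, its third extracted row is all zeros.

```c
#include <assert.h>

enum { ROWS = 3, COLS = 9 };

/* Combined schedule; columns are
   (i_S1, j_S1, i_S2, j_S2, k_S2, n_S1, n_S2, 1_S1, 1_S2). */
const int THETA[ROWS][COLS] = {
    { 1, 0, 0, 0, 1, 1, 1, 0, 1 },
    { 0, 1, 0, 1, 0, 0, 0, 0, 0 },
    { 0, 0, 1, 0, 0, 0, 0, 0, 0 },
};

/* Gather the S1 coefficients, ordered as (i, j, n, 1), for each row. */
void extract_s1(const int theta[ROWS][COLS], int s1[ROWS][4])
{
    assert(theta && s1);
    for (int r = 0; r < ROWS; ++r) {
        s1[r][0] = theta[r][0];   /* iterator i  */
        s1[r][1] = theta[r][1];   /* iterator j  */
        s1[r][2] = theta[r][5];   /* parameter n */
        s1[r][3] = theta[r][7];   /* constant    */
    }
}

/* Gather the S2 coefficients, ordered as (i, j, k, n, 1), for each row. */
void extract_s2(const int theta[ROWS][COLS], int s2[ROWS][5])
{
    assert(theta && s2);
    for (int r = 0; r < ROWS; ++r) {
        s2[r][0] = theta[r][2];   /* iterator i  */
        s2[r][1] = theta[r][3];   /* iterator j  */
        s2[r][2] = theta[r][4];   /* iterator k  */
        s2[r][3] = theta[r][6];   /* parameter n */
        s2[r][4] = theta[r][8];   /* constant    */
    }
}
```

Applied to `THETA`, the first extracted rows are (1 0 1 0) for S1 and (0 0 1 1 1) for S2, matching the per-statement matrices on the slide.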

Building the Search Space: The Search Space

Challenges
◮ Completeness (combinatorial problem)
◮ Scalability (large integer polyhedra computation)

Proposed solution
◮ Philosophically close to Feautrier’s maximal fine-grain parallelism
◮ One point in the space ⇔ one distinct legal program version
◮ Bound schedule coefficients in [−1, 1] to limit control overhead
◮ No completeness, but decent scalability
◮ Deliver a mechanism to automatically complete / correct schedules
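
To give a feel for the effect of bounding coefficients in [−1, 1], the sketch below enumerates every candidate for a single schedule row of the matrix-multiply example: 4 coefficients for S1 plus 5 for S2, hence 3^9 = 19683 vectors. It then keeps those that date S1(i,j) strictly before S2(i,j,k) for all tested sizes. This filter is a brute-force stand-in that covers one dependence only and a few values of n; it is not the polyhedral construction the slide alludes to, and the function names are hypothetical.

```c
#include <assert.h>

enum { NCOEF = 9 };  /* S1: (i, j, n, 1) followed by S2: (i, j, k, n, 1) */

/* Returns 1 if the schedule row c dates S1(i,j) strictly before
   S2(i,j,k) for every instance with 1 <= n <= nmax, else 0. */
int row_orders_s1_first(const int c[NCOEF], int nmax)
{
    assert(c && nmax > 0);
    for (int n = 1; n <= nmax; ++n)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k) {
                    int t1 = c[0] * i + c[1] * j + c[2] * n + c[3];
                    int t2 = c[4] * i + c[5] * j + c[6] * k
                           + c[7] * n + c[8];
                    if (t1 >= t2)
                        return 0;
                }
    return 1;
}

/* Visits every vector of {-1, 0, 1}^9. Returns the number of vectors
   visited and stores in *kept how many pass row_orders_s1_first. */
long enumerate_bounded_rows(int nmax, long *kept)
{
    assert(kept && nmax > 0);
    long total = 0;
    *kept = 0;
    for (long code = 0; code < 19683; ++code) {   /* 3^9 candidates */
        int c[NCOEF];
        long rest = code;
        for (int d = 0; d < NCOEF; ++d) {         /* base-3 digits */
            c[d] = (int)(rest % 3) - 1;
            rest /= 3;
        }
        ++total;
        if (row_orders_s1_first(c, nmax))
            ++*kept;
    }
    return total;
}
```

The first row of the distributed schedule, (1 0 0 0) for S1 and (1 0 0 1 0) for S2, passes the filter, while the first row of the illegal schedule, (1 0 1 0) and (0 0 1 0 0), does not.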