Tiling: A Data Locality Optimizing Algorithm

CS553 Lecture Tiling

Previously
– Kelly & Pugh transformation framework
– Affine space partitions for parallelism

Today
– “Unroll and Jam” and Tiling
– Specifying tiling in the Kelly and Pugh transformation framework
– Status of code generation for tiling


Loop Unrolling

Motivation
– Reduces loop overhead
– Improves effectiveness of other transformations
    – Code scheduling
    – CSE

The Transformation
– Make n copies of the loop: n is the unrolling factor
– Adjust loop bounds accordingly


Loop Unrolling (cont)

Example

    do i=1,n
      A(i) = B(i) + C(i)
    enddo

becomes

    do i=1,n-1 by 2
      A(i) = B(i) + C(i)
      A(i+1) = B(i+1) + C(i+1)
    enddo
    if (i=n)
      A(i) = B(i) + C(i)

Details
– When is loop unrolling legal?
– Handle end cases with a cloned copy of the loop
– Enter this special case if the remaining number of iterations is less than the unrolling factor


Loop Balance

Problem
– We’d like to produce loops with the right balance of memory operations and floating point operations
– The ideal balance is machine-dependent
    – e.g. How many load-store units are connected to the L1 cache?
    – e.g. How many functional units are provided?

Example

    do j = 1,2*n
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
    enddo

– The inner loop has 1 memory operation per iteration and 1 floating point operation per iteration
– If our target machine can only support 1 memory operation for every two floating point operations, this loop will be memory bound

What can we do?
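The loop unrolling example above (two copies of the body, plus a cleanup step for the leftover iteration) can be sketched in Python. This is an illustrative sketch only: the function names `add_rolled` and `add_unrolled_by_2` are hypothetical, indices are 0-based rather than the 1-based Fortran style of the slides, and the unrolling factor is fixed at 2.

```python
def add_rolled(b, c):
    """Reference loop: one element per iteration."""
    a = [0] * len(b)
    for i in range(len(b)):
        a[i] = b[i] + c[i]
    return a


def add_unrolled_by_2(b, c):
    """Same loop unrolled by a factor of 2, with a cleanup step."""
    n = len(b)
    a = [0] * n
    i = 0
    # Main loop: two copies of the body, stepping by the unrolling factor.
    while i + 1 < n:
        a[i] = b[i] + c[i]
        a[i + 1] = b[i + 1] + c[i + 1]
        i += 2
    # Cleanup: fewer iterations remain than the unrolling factor (n is odd).
    if i < n:
        a[i] = b[i] + c[i]
    return a
```

Both versions should produce the same array for any length, including odd lengths and the empty case, which is what the cleanup step is there to guarantee.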
Unroll and Jam

Idea
– Restructure loops so that loaded values are used many times per iteration

Unroll and Jam
– Unroll the outer loop some number of times
– Fuse (Jam) the resulting inner loops

Example

    do j = 1,2*n
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
    enddo

Unroll the Outer Loop

    do j = 1,2*n by 2
      do i = 1,m
        A(j) = A(j) + B(i)
      enddo
      do i = 1,m
        A(j+1) = A(j+1) + B(i)
      enddo
    enddo


Unroll and Jam Example (cont)

Jam the inner loops

    do j = 1,2*n by 2
      do i = 1,m
        A(j) = A(j) + B(i)
        A(j+1) = A(j+1) + B(i)
      enddo
    enddo

– The inner loop has 1 load per iteration and 2 floating point operations per iteration
– We reuse the loaded value of B(i)
– The Loop Balance matches the machine balance


Unroll and Jam (cont)

Legality
– When is Unroll and Jam legal?

Disadvantages
– What limits the degree of unrolling?


Tiling

A non-unimodular transformation that ...
– groups iteration points into tiles that are executed atomically
– can improve spatial and temporal data locality
– can expose larger granularities of parallelism

[Figure: tiled iteration space, axes i and j]

    do ii = 1,6, by 2
      do jj = 1, 5, by 2
        do i = ii, ii+2-1
          do j = jj, min(jj+2-1,5)
            A(i,j) = ...

Implementing tiling
– how can we specify tiling?
– when is tiling legal?
– how do we generate tiled code?
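The tiled loop nest above can be sketched as a Python generator that yields iteration points in tile order. This is a sketch under stated assumptions: the name `tiled_iterations` and its parameters are hypothetical, the iteration space is a rectangle starting at 1 in both dimensions, and both point loops clamp their upper bound with `min` so that partial tiles at either edge stay inside the original space.

```python
def tiled_iterations(n_i, n_j, tile_i, tile_j):
    """Yield (i, j) over 1..n_i x 1..n_j in rectangular-tile order."""
    for ii in range(1, n_i + 1, tile_i):        # tile origins along i
        for jj in range(1, n_j + 1, tile_j):    # tile origins along j
            # Point loops: clamp upper bounds so partial tiles
            # do not step outside the original iteration space.
            for i in range(ii, min(ii + tile_i - 1, n_i) + 1):
                for j in range(jj, min(jj + tile_j - 1, n_j) + 1):
                    yield (i, j)


# The 6 x 5 space with 2 x 2 tiles from the slide.
order = list(tiled_iterations(6, 5, 2, 2))
```

Each point of the original space is visited exactly once; only the order changes, so that all points of one tile run before the next tile starts.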
Specifying Tiling

Rectangular tiling
– tile size vector
– tile offset

Possible Transformation Mappings
– creating a tile space
– keeping tile iterators in original iteration space

[Figures: iteration space with axes i and j; transformed space with axes i’ and j’]


Legality of Tiling

A legal rectangular tiling
– each tile executed atomically
– no dependence cycles between tiles
– Check legality by verifying that transformed data dependences are lexicographically positive

Fully permutable loops
– rectangular tiling is legal on fully permutable loops


Code Generation for Tiling

[Figure: tiled iteration space, axes i and j]

    do ii = 1,6, by 2
      do jj = 1, 5, by 2
        do i = ii, ii+2-1
          do j = jj, min(jj+2-1,5)
            A(i,j) = ...

Fixed-size Tiles
– Omega library
– Cloog
– for rectangular space and tiles, straightforward

Parameterized tile sizes
– Parameterized tiled loops for free, PLDI 2007
– TLOG - A Tiled Loop Generator, http://www.cs.colostate.edu/~ln/TLOG/

Overview of decoupled approach
– find polyhedron that may contain any loop origins
– generate code that traverses that polyhedron
– post process the code to start at tile origins and step by tile size
– generate loops over points in tile to stay within original iteration space and within tile


Unroll and Jam IS Tiling (followed by inner loop unrolling)

Original Loop

    do j = 1,2*n
      do i = 1,m
        A(j)= A(j) + B(i)
      enddo
    enddo

After Tiling

    do jj = 1,2*n by 2
      do i = 1,m
        do j = jj, jj+2-1
          A(j)= A(j)+B(i)
        enddo
      enddo
    enddo

After Unroll and Jam

    do jj = 1,2*n by 2
      do i = 1,m
        A(jj)= A(jj)+B(i)
        A(jj+1)= A(jj+1)+B(i)
      enddo
    enddo
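The equivalence between the three loop nests above can be checked with a small Python sketch. Several things here are assumptions for illustration: the function names are hypothetical, indices are 0-based, the length of `a` is assumed even (matching the `2*n` bound on the slides), and the `loads_b` counter is a simple source-level model of reads of B, not a measurement of real memory traffic.

```python
def original(a, b):
    """Original nest: one read of B per update of A."""
    a = list(a)
    loads_b = 0
    for j in range(len(a)):
        for i in range(len(b)):
            a[j] = a[j] + b[i]
            loads_b += 1
    return a, loads_b


def tiled(a, b):
    """Tile j by 2; the point loop over j becomes innermost."""
    a = list(a)
    for jj in range(0, len(a), 2):
        for i in range(len(b)):
            for j in range(jj, jj + 2):
                a[j] = a[j] + b[i]
    return a


def unroll_and_jam(a, b):
    """Tiled nest with the inner point loop fully unrolled."""
    a = list(a)
    loads_b = 0
    for jj in range(0, len(a), 2):
        for i in range(len(b)):
            bi = b[i]          # one read of B feeds two updates of A
            loads_b += 1
            a[jj] = a[jj] + bi
            a[jj + 1] = a[jj + 1] + bi
    return a, loads_b


a0, b0 = [1, 2, 3, 4], [10, 20, 30]
ref, loads_before = original(a0, b0)
jam, loads_after = unroll_and_jam(a0, b0)
```

In this model the unroll-and-jam version performs the same updates with half as many reads of B, which is the reuse the Loop Balance discussion is after.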
Concepts

Unroll and Jam is the same as Tiling with the inner loop unrolled

Tiling can improve ...
– loop balance
– spatial locality
– data locality
– computation to communication ratio

Implementing tiling
– specification
– checking legality
– code generation


Next Time

Lecture
– Run-time reordering transformations

Suggested Exercises
– after array expansion of the scalar T, is it legal to tile the three loops in Figure 11.23? write the tiled code for a block size of your choice.
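As a possible starting point for legality questions like the exercise above, the following Python sketch checks two properties of a set of dependence distance vectors. It assumes the dependences are already known and expressed as constant distance vectors (one entry per loop, outermost first); the function names are hypothetical. The test used for full permutability, that every component of every distance vector is nonnegative, is a standard sufficient condition, so a `False` result here does not by itself prove that no legal tiling exists.

```python
def is_lex_positive(d):
    """True if the first nonzero entry of d is positive."""
    for x in d:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # all-zero vector is not lexicographically positive


def is_fully_permutable(distance_vectors):
    """Sufficient check: no distance vector has a negative component."""
    return all(x >= 0 for d in distance_vectors for x in d)


# Both dependences point "forward" in every loop: rectangular tiling is legal.
ok = is_fully_permutable([(1, 0), (0, 1)])

# (1, -1) is a valid dependence (lexicographically positive), but the
# negative inner component means this nest is not fully permutable as written.
bad_vec = (1, -1)
bad = is_fully_permutable([bad_vec])
```

The second case illustrates why lexicographic positivity of the original dependences is not enough: the check has to hold for the dependences after the tiling transformation is applied.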