
263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors


  1. 263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors Thomas R. Gross Computer Science Department ETH Zurich, Switzerland

  2. Overview
     § 8.1 Instruction scheduling basics
     § 8.2 Scheduling for ILP processors

  3. 8.2 Scheduling for ILP processors
     § Introduction to ILP
     § Scheduling for acyclic regions
       § Types and shapes
       § Region formation
       § Schedule construction
       § Resource management
     § Scheduling for cyclic regions
       § Software pipelining
       § Modulo scheduling
     § Speculation and predication

  4. 8.2.3 Scheduling for cyclic regions
     § Scheduling loops
       § The majority of program execution time is spent in loops
       § We already know several techniques to speed up loop execution
         § Parallelizing loops
         § Loop unrolling
         § Loop fusion
         § …
       § All of these techniques have a scheduling barrier at the end of one (or several) iterations

  5. Increasing ILP w/ loop unrolling
     § Running example

         for (i=0; i<0x1000; i++) {
           b[i] = a[i] * 3;
         }

               mov r1 ← @a
               mov r2 ← @b
               add r5 ← r1, #0x4000
         loop: ld  r3 ← mem[r1]
               mul r4 ← r3, #3
               st  mem[r2] ← r4
               add r1 ← r1, #4
               add r2 ← r2, #4
               clt p1 ← r1, r5
               b   p1, @loop

     § Machine model
       § 4 issue
       § 1 control, 2 ALU, 1 memory
       § Latencies:
         § Add: 1 cycle
         § Mul: 3 cycles
         § Ld: 2 cycles
         § St: 1 cycle
         § Cmp: 1 cycle
         § Branch: 1 cycle

  6. Increasing ILP w/ loop unrolling
     § Scheduling the loop with list scheduling (baseline)

         1 ld  r3 ← mem[r1]
         2 mul r4 ← r3, #3
         3 st  mem[r2] ← r4
         4 add r1 ← r1, #4
         5 add r2 ← r2, #4
         6 clt p1 ← r1, r5
         7 b   p1, @loop

         cycle  ALU 1  ALU 2  MEM  control
         0      4             1
         1      6
         2      2
         3
         4
         5      5             3    7

         [Timeline: iterations 1, 2, 3, ..., 1000 execute one after another]

     § Throughput: 6 cycles / 1 iteration (100%)
     § Code size (schedule length): 6 (100%)
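The baseline schedule can be reproduced with a small greedy list scheduler. The following is a minimal sketch, not the lecture's implementation: the dictionaries `OPS`, `UNITS`, `LATENCY` and the edge list `DEPS` are hypothetical encodings of the running example, the dependence edges are one reading of the loop body, and the extra edges that keep every operation at or before the branch are an assumption.

```python
# Hypothetical encoding of the running example; op numbers follow the slide.
LATENCY = {"ld": 2, "mul": 3, "st": 1, "add": 1, "clt": 1, "b": 1}
UNITS = {"ALU": 2, "MEM": 1, "control": 1}  # 4-issue machine model

# op number -> (mnemonic, unit class)
OPS = {
    1: ("ld", "MEM"), 2: ("mul", "ALU"), 3: ("st", "MEM"),
    4: ("add", "ALU"), 5: ("add", "ALU"), 6: ("clt", "ALU"),
    7: ("b", "control"),
}

# (pred, succ, delay) edges within one iteration (assumed dependence graph).
DEPS = [
    (1, 2, LATENCY["ld"]),   # true dependence on r3
    (2, 3, LATENCY["mul"]),  # true dependence on r4
    (1, 4, 0),               # anti dependence: ld reads r1, add writes r1
    (3, 5, 0),               # anti dependence: st reads r2, add writes r2
    (4, 6, LATENCY["add"]),  # true dependence on r1
    (6, 7, LATENCY["clt"]),  # true dependence on p1
] + [(op, 7, 0) for op in (1, 2, 3, 4, 5)]  # assumption: nothing issues after the branch


def list_schedule(ops, deps, units):
    """Greedy cycle-by-cycle list scheduling, visiting ops in program order."""
    start = {}
    cycle = 0
    while len(start) < len(ops):
        free = dict(units)  # units available in this cycle
        for op in sorted(ops):
            if op in start:
                continue
            preds = [(p, d) for p, s, d in deps if s == op]
            ready = all(p in start and start[p] + d <= cycle for p, d in preds)
            unit = ops[op][1]
            if ready and free[unit] > 0:
                start[op] = cycle
                free[unit] -= 1
        cycle += 1
    return start


schedule = list_schedule(OPS, DEPS, UNITS)
length = max(schedule.values()) + 1
print(schedule, "length:", length)
```

Under these assumptions the scheduler places ops 1 and 4 in cycle 0, op 6 in cycle 1, op 2 in cycle 2, and ops 3, 5, 7 in cycle 5, for a schedule length of 6 cycles.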

  7. Increasing ILP w/ loop unrolling
     § Unrolling twice

         11 ld  r3 ← mem[r1]
         12 mul r4 ← r3, #3
         13 st  mem[r2] ← r4
         14 add r1 ← r1, #4
         15 add r2 ← r2, #4
         21 ld  r8 ← mem[r6]
         22 mul r9 ← r8, #3
         23 st  mem[r7] ← r9
         24 add r6 ← r6, #4
         25 add r7 ← r7, #4
         06 clt p1 ← r6, r5
         07 b   p1, @loop

         cycle  ALU 1  ALU 2  MEM  control
         0      14            11
         1      24            21
         2      12     06
         3      22
         4
         5      15            13
         6      25            23   07

         [Timeline: iteration pairs 1,2; 3,4; 5,6; ...; 999,1000 execute one after another]

     § Throughput: 7 cycles / 2 iterations (-42%)
     § Code size (schedule length): 7 (+17%)

  8. Increasing ILP w/ loop unrolling
     § Unrolling 4x

         11 ld  r3 ← mem[r1]
         12 mul r4 ← r3, #3
         13 st  mem[r2] ← r4
         14 add r1 ← r1, #4
         15 add r2 ← r2, #4
         21 ld  r8 ← mem[r6]
         22 mul r9 ← r8, #3
         23 st  mem[r7] ← r9
         24 add r6 ← r6, #4
         25 add r7 ← r7, #4
         31 ld  r12 ← mem[r10]
         32 mul r13 ← r12, #3
         33 st  mem[r11] ← r13
         34 add r10 ← r10, #4
         35 add r11 ← r11, #4
         41 ld  r16 ← mem[r14]
         42 mul r17 ← r16, #3
         43 st  mem[r15] ← r17
         44 add r14 ← r14, #4
         45 add r15 ← r15, #4
         06 clt p1 ← r6, r5
         07 b   p1, @loop

         cycle  ALU 1  ALU 2  MEM  control
         0      14            11
         1      24            21
         2      34     12     31
         3      44     22     41
         4             32
         5      15     42     13
         6      25            23
         7      35     06     33
         8      45            43   07

         [Timeline: iteration groups 1,2,3,4; 5,6,7,8; 9,10,11,12; ...; 997,998,999,1000 execute one after another]

     § Throughput: 9 cycles / 4 iterations (-63%)
     § Code size (scheduled instructions): 9 (+50%)
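The percentages quoted for the unrolled variants follow directly from the schedule lengths. As a quick check, here is a short sketch; the helper `relative_change` and the `variants` list are illustrative names, and the schedule lengths 6, 7 and 9 are taken from the slides.

```python
def relative_change(new, base):
    """Percentage change of `new` relative to `base`."""
    return (new / base - 1.0) * 100.0


# (unroll factor, schedule length in cycles), as given on the slides.
variants = [(1, 6), (2, 7), (4, 9)]
base_cycles_per_iter = 6 / 1  # baseline: 6 cycles per iteration
base_size = 6                 # baseline schedule length

results = {}
for factor, length in variants:
    cycles_per_iter = length / factor
    results[factor] = (
        cycles_per_iter,
        relative_change(cycles_per_iter, base_cycles_per_iter),
        relative_change(length, base_size),
    )
    print(f"unroll x{factor}: {cycles_per_iter:.2f} cycles/iteration, "
          f"cycles/iteration {results[factor][1]:+.1f}%, "
          f"code size {results[factor][2]:+.1f}%")
```

The computed changes in cycles per iteration are about -41.7% and -62.5%, which the slides round to -42% and -63%; the code size grows by about +16.7% and +50%.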

  9. Increasing ILP w/ loop unrolling
     § Scheduling loops
       § Unrolling: performance improvements, but
         § Scheduling barrier is still there; only as many loop bodies as the unroll factor can be overlapped
         § Increase in
           § Code size
           § Register pressure
       § Unrolling is useful for loops with
         § Lots of control flow within the loop body
         § Trace can find most likely path
       § Unrolling “ignores” loop structure

  10. Software pipelining
      § Exploit loop structure of program
      § Pipelining: overlap of stages


  12. Let’s try again
      § Scheduling the first iteration on our 4-issue machine

          1 ld  r3 ← mem[r1]
          2 mul r4 ← r3, #3
          3 st  mem[r2] ← r4
          4 add r1 ← r1, #4
          5 add r2 ← r2, #4
          6 clt p1 ← r1, r5
          7 b   p1, @loop

          iteration 1
          cycle  ALU 1  ALU 2  MEM  control
          0                    1
          1      4
          2      2
          3
          4             6
          5             5      3    7

  13. Let’s try again
      § Consider 8-issue machine
        § Duplicate 4-issue machine
        § Control 2nd group by predicate Q
          § Execute only if Q==true
          § “Predicated execution”

          [Table: empty schedule for cycles 0 to 9 with two column groups, each with ALU 1, ALU 2, MEM, control]

  14. Let’s try again
      § Scheduling the 2nd iteration on an 8-issue machine

          1 ld  r3 ← mem[r1]
          2 mul r4 ← r3, #3
          3 st  mem[r2] ← r4
          4 add r1 ← r1, #4
          5 add r2 ← r2, #4
          6 clt p1 ← r1, r5
          7 b   p1, @loop

          cycle | iteration 1                | iteration 2
                | ALU 1  ALU 2  MEM  control | ALU 1  ALU 2  MEM  control
          0     |               1            |
          1     | 4                          |
          2     | 2                          |               1
          3     |                            | 4
          4     |        6                   | 2
          5     |        5      3    7       |
          6     |                            |        6
          7     |                            |        5      3    7

  15. Let’s try again
      § Scheduling the 3rd iteration on a 12-issue machine

          1 ld  r3 ← mem[r1]
          2 mul r4 ← r3, #3
          3 st  mem[r2] ← r4
          4 add r1 ← r1, #4
          5 add r2 ← r2, #4
          6 clt p1 ← r1, r5
          7 b   p1, @loop

          cycle | iteration 1                | iteration 2                | iteration 3
                | ALU 1  ALU 2  MEM  control | ALU 1  ALU 2  MEM  control | ALU 1  ALU 2  MEM  control
          0     |               1            |                            |
          1     | 4                          |                            |
          2     | 2                          |               1            |
          3     |                            | 4                          |
          4     |        6                   | 2                          |               1
          5     |        5      3    7       |                            | 4
          6     |                            |        6                   | 2
          7     |                            |        5      3    7       |
          8     |                            |                            |        6
          9     |                            |                            |        5      3    7

  16. Let’s try again
      § Observation: schedules of different iterations are identical
        § except for the start-up cycles
        § except for the wind-down cycles
      § Schedules can be overlapped

          cycle  ALU 1  ALU 2  MEM  control
          0                    1
          1      4
          2      2             1
          3      4
          4      2      6      1
          5      4      5      3    7
          6      2      6      1
          7      4      5      3    7
          8      2      6      1
          9      4      5      3    7
          10     2      6      1
          …      …      …      …    …
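The overlapped table is just the one-iteration schedule replayed with a fixed offset per iteration. A minimal sketch of that construction follows; the name `ITERATION_SCHEDULE`, the function `overlap` and the choice of three iterations are illustrative, while the start cycles and the 2-cycle offset are read off the slides.

```python
from collections import defaultdict

# Start cycle of each op within one iteration (op number -> cycle).
ITERATION_SCHEDULE = {1: 0, 4: 1, 2: 2, 6: 4, 5: 5, 3: 5, 7: 5}
II = 2  # a new iteration starts every 2 cycles


def overlap(schedule, ii, iterations):
    """Return {cycle: [(iteration, op), ...]} for overlapped iterations."""
    timeline = defaultdict(list)
    for it in range(iterations):
        for op, t in schedule.items():
            # Iteration `it` (0-based) is shifted by it * II cycles.
            timeline[it * ii + t].append((it + 1, op))
    return dict(timeline)


timeline = overlap(ITERATION_SCHEDULE, II, iterations=3)
for cycle in sorted(timeline):
    print(cycle, sorted(timeline[cycle]))
```

From cycle 4 onward the same two rows repeat: even cycles hold ops 1, 2 and 6 of three different iterations, odd cycles hold ops 3, 4, 5 and 7.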

  17. Performance
      § Throughput
        § for each individual iteration: 6 cycles
        § 1st iteration: 6 cycles
        § each additional iteration: 2 cycles
        § n iterations: 2*n + 4 cycles
      § Code size (w/ predicated execution)
        § 7 instructions
        § Predicate Q: r1 < r5

          [Table: the overlapped 4-issue schedule from slide 16, cycles 0 to 10]

  18. Software pipelining
      § Standard techniques (region scheduling, loop fusion, loop unrolling) do not yield sufficient ILP
      § Method of choice: software pipelining
        § Overlap iterations of the loop body
        § Steady-state: kernel
        § Peak performance: 1 loop iteration/cycle
        § No scheduling barriers between iterations
        § No loop unrolling necessary
        § Requires sufficient resources

          [Figure: iterations 1, 2, 3, 4, ..., 997, 998, 999, 1000 along the time axis, each consisting of stages 1 to 4, with successive iterations overlapped]

  19. Software pipelining
      § Basic idea
        § Unroll the loop “completely”
        § Correctly schedule the loop under two constraints
          § All iteration bodies have identical schedules
          § Each new iteration starts exactly II (initiation interval) cycles after the previous iteration
      § Execution time in terms of stage count (SC)
        § One loop iteration: SC×II cycles
        § Prologue/epilogue: (SC-1)×II cycles
        § Kernel steady state: II cycles
      § Execution time of a software pipelined loop: II×(n+SC-1) cycles

          [Figure: iterations 1 to 4 along the time axis, each made of stages 1 to 4; consecutive iterations start II cycles apart, and one iteration spans SC stages]
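The execution-time formula can be checked against the running example. In this sketch the function names are illustrative, and SC = 3 is derived rather than stated on the slides: one iteration takes 6 cycles and II = 2, so SC×II = 6 gives SC = 3.

```python
def pipelined_cycles(n, ii, sc):
    """Execution time of a software-pipelined loop: II * (n + SC - 1)."""
    return ii * (n + sc - 1)


def sequential_cycles(n, iteration_length):
    """Execution time when iterations do not overlap at all."""
    return n * iteration_length


II, SC = 2, 3  # running example: 6-cycle iteration, II = 2 (SC derived)
n = 1000

print(pipelined_cycles(n, II, SC))    # 2004, i.e. 2*n + 4
print(sequential_cycles(n, SC * II))  # 6000
```

With II = 2 and SC = 3 the general formula reduces to 2*n + 4, the expression given for the overlapped schedule of the running example.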

  20. Modulo scheduling
      § Most common technique to find software pipelined schedules
      § Basic concept
        § Unroll the loop “completely”
        § Schedule the loop under two constraints
          § All iteration bodies have identical schedules
          § Each new iteration starts exactly II cycles after the previous iteration

  21. Modulo scheduling: problem formulation
      § Problem: find a schedule for one loop body iteration such that, when the schedule is repeated at intervals of II cycles,
        § No hardware resource conflict arises between operations of the same and successive iterations of the loop body
        § No intra/inter-loop dependences are violated

  22. Modulo scheduling: resource constraints
      § Handling resource constraints
        § No resource must be used by different operations at two points in time that are separated by an interval that is a multiple of the initiation interval
        § This requirement is identical to: within a single iteration, no resource is ever used more than once at the same time modulo II
      § Search for suitable initiation interval
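The "same time modulo II" condition can be tested by counting resource uses per slot t mod II. The sketch below searches for the smallest II at which the one-iteration schedule of the running example has no conflict. It is a simplification under stated assumptions: the names `USES`, `fits` and `smallest_ii` are hypothetical, every operation is assumed to occupy its unit for a single cycle, and the schedule is kept fixed while II varies, whereas an actual modulo scheduler would typically reschedule operations for each candidate II.

```python
from collections import Counter

UNITS = {"ALU": 2, "MEM": 1, "control": 1}  # 4-issue machine model

# (unit class, start cycle) for each op of one iteration of the running example.
USES = [("MEM", 0), ("ALU", 1), ("ALU", 2), ("ALU", 4),
        ("ALU", 5), ("MEM", 5), ("control", 5)]


def fits(uses, units, ii):
    """True if no unit class is oversubscribed in any slot t mod II."""
    demand = Counter((unit, t % ii) for unit, t in uses)
    return all(count <= units[unit] for (unit, _), count in demand.items())


def smallest_ii(uses, units, max_ii=16):
    """Smallest II for which this fixed schedule has no modulo conflict."""
    for ii in range(1, max_ii + 1):
        if fits(uses, units, ii):
            return ii
    return None


print(smallest_ii(USES, UNITS))  # 2
```

II = 1 fails because four ALU operations would share two ALUs in the single slot; II = 2 places two ALU operations and one memory operation in each slot, which the machine can issue.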

  23. Modulo scheduling: resource constraints
      § Modulo reservation tables
        § Table containing II rows and one column for each resource

          II = 2
          cycle  ALU 1  ALU 2  MEM  control
          0
          1

          II = 3
          cycle  ALU 1  ALU 2  MEM  control
          0
          1
          2

      § Scheduling op at time t on resource r
        § Entry for r at t mod II must be free
        § Mark t mod II busy for r

  24. Modulo scheduling: resource constraints
      § Modulo reservation tables
        § Table containing II rows and one column for each resource

          1 ld  r3 ← mem[r1]
          2 mul r4 ← r3, #3
          3 st  mem[r2] ← r4
          4 add r1 ← r1, #4
          5 add r2 ← r2, #4
          6 clt p1 ← r1, r5
          7 b   p1, @loop

          iteration 1
          cycle  ALU 1  ALU 2  MEM  control
          0                    1
          1      4
          2      2
          3
          4             6
          5             5      3    7

          II = 2
          cycle  ALU 1  ALU 2  MEM  control
          0      2      6      1
          1      4      5      3    7
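The reservation rule (entry for r at t mod II must be free, then mark it busy) maps onto a very small data structure. The following is a sketch only: the class `ModuloReservationTable`, its method `try_place` and the first-free-unit policy for choosing between the two ALUs are assumptions made for illustration, and the start cycles are those of the one-iteration schedule.

```python
class ModuloReservationTable:
    """II rows, one column per hardware unit; entries hold op numbers."""

    def __init__(self, ii, columns):
        self.ii = ii
        self.columns = columns  # column name -> unit class
        self.rows = [dict.fromkeys(columns) for _ in range(ii)]

    def try_place(self, op, t, unit_class):
        """Reserve a free unit of `unit_class` in row t mod II, if any."""
        row = self.rows[t % self.ii]
        for name, cls in self.columns.items():
            if cls == unit_class and row[name] is None:
                row[name] = op  # mark t mod II busy for this unit
                return name
        return None  # entry not free: resource conflict at this II


COLUMNS = {"ALU 1": "ALU", "ALU 2": "ALU", "MEM": "MEM", "control": "control"}
# op number -> (start cycle, unit class), from the one-iteration schedule
SCHEDULE = {1: (0, "MEM"), 2: (2, "ALU"), 3: (5, "MEM"), 4: (1, "ALU"),
            5: (5, "ALU"), 6: (4, "ALU"), 7: (5, "control")}

mrt = ModuloReservationTable(2, COLUMNS)
for op, (t, cls) in SCHEDULE.items():
    assert mrt.try_place(op, t, cls) is not None, f"conflict for op {op}"
for i, row in enumerate(mrt.rows):
    print(i, row)
```

All seven operations fit at II = 2; any further ALU operation at an even cycle would be rejected because both ALU entries of row 0 are already taken.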

  25. Modulo scheduling: dependence constraints
      § Dependence constraints
        § Both loop-independent and loop-carried dependences must be considered
        § Annotate each edge in the data dependence graph with a tuple t = <distance, delay>
          § Delay: minimum time interval between the start of operations

          Dependence  Delay                              Conservative delay
          True        latency(pred)                      latency(pred)
          Anti        1 − latency(succ)                  0
          Output      1 + latency(pred) − latency(succ)  latency(pred)
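The delay table translates directly into a lookup function. The sketch below also includes a check of one dependence edge against a schedule; that inequality, t(succ) + II × distance ≥ t(pred) + delay, is the usual modulo-scheduling formulation and is an assumption here, since the slides shown do not spell it out. The function names are illustrative.

```python
def dependence_delay(kind, lat_pred, lat_succ, conservative=False):
    """Delay for a dependence edge, following the table on the slide."""
    if kind == "true":
        return lat_pred
    if kind == "anti":
        return 0 if conservative else 1 - lat_succ
    if kind == "output":
        return lat_pred if conservative else 1 + lat_pred - lat_succ
    raise ValueError(f"unknown dependence kind: {kind}")


def edge_satisfied(t_pred, t_succ, distance, delay, ii):
    """Assumed constraint: t_succ + II * distance >= t_pred + delay."""
    return t_succ + ii * distance >= t_pred + delay


# ld (latency 2) -> mul: loop-independent true dependence, distance 0.
d = dependence_delay("true", lat_pred=2, lat_succ=3)
print(d, edge_satisfied(t_pred=0, t_succ=2, distance=0, delay=d, ii=2))

# add r1 (latency 1) -> ld of the next iteration: loop-carried, distance 1.
d = dependence_delay("true", lat_pred=1, lat_succ=2)
print(d, edge_satisfied(t_pred=1, t_succ=0, distance=1, delay=d, ii=2))
```

Both edges hold for the running example at II = 2 with ld at cycle 0, add r1 at cycle 1 and mul at cycle 2.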
