263-2810: Advanced Compiler Design
8.2 Scheduling for ILP processors
Thomas R. Gross, Computer Science Department, ETH Zurich, Switzerland
Overview § 8.1 Instruction scheduling basics § 8.2 Scheduling for ILP processors
8.2 Scheduling for ILP processors § Introduction to ILP § Scheduling for acyclic regions § Types and shapes § Region formation § Schedule construction § Resource management § Scheduling for cyclic regions § Software pipelining § Modulo scheduling § Speculation and predication
8.2.3 Scheduling for cyclic regions § Scheduling loops § The majority of program execution time is spent in loops § We already know several techniques to speed up loop execution § Parallelizing loops § Loop unrolling § Loop fusion § … § All of these techniques have a scheduling barrier at the end of one (or several) iterations
Increasing ILP w/ loop unrolling
§ Running example

for (i=0; i<0x1000; i++) {
    b[i] = a[i] * 3;
}

      mov r1 ← @a
      mov r2 ← @b
      add r5 ← r1, #0x4000
loop: ld  r3 ← mem[r1]
      mul r4 ← r3, #3
      st  mem[r2] ← r4
      add r1 ← r1, #4
      add r2 ← r2, #4
      clt p1 ← r1, r5
      b   p1, @loop

§ Machine model
§ 4 issue
§ 1 control, 2 ALU, 1 memory
§ Latencies:
§ Add: 1 cycle
§ Mul: 3 cycles
§ Ld: 2 cycles
§ St: 1 cycle
§ Cmp: 1 cycle
§ Branch: 1 cycle
Increasing ILP w/ loop unrolling
§ Scheduling the loop with list scheduling (baseline)

1 ld  r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st  mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b   p1, @loop

cycle | ALU 1    | ALU 2 | MEM  | control
  0   | 4 add r1 |       | 1 ld |
  1   | 6 clt p1 |       |      |
  2   | 2 mul r4 |       |      |
  3   |          |       |      |
  4   |          |       |      |
  5   | 5 add r2 |       | 3 st | 7 b

[diagram: iterations 1, 2, 3, …, 1000 execute back-to-back in time]

Throughput: 6 cycles / 1 iteration (100%)
Code size (schedule length): 6 (100%)
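The 6-cycle schedule length above follows from the longest dependence chain in the loop body. A minimal Python sketch (not from the slides) that computes this lower bound from the slide's latencies:

```python
# Hypothetical sketch: the per-iteration schedule length is bounded below by
# the longest dependence chain through the loop body.
# Latencies taken from the machine model on the running-example slide.
latency = {"ld": 2, "mul": 3, "st": 1, "add": 1, "clt": 1, "b": 1}

# ld -> mul -> st is the critical chain: each successor may issue only
# once its predecessor's result is ready.
chain = ["ld", "mul", "st"]
start = 0
for op in chain[:-1]:
    start += latency[op]          # successor's earliest issue cycle
schedule_length = start + 1       # the final op occupies one more cycle
print(schedule_length)            # 6 -> matches the 6-cycle list schedule
```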
Increasing ILP w/ loop unrolling
§ Unrolling twice (operation "ij" = operation j of unrolled copy i; 06/07 are the shared compare and branch)

11 ld  r3 ← mem[r1]      21 ld  r8 ← mem[r6]
12 mul r4 ← r3, #3       22 mul r9 ← r8, #3
13 st  mem[r2] ← r4      23 st  mem[r7] ← r9
14 add r1 ← r1, #4       24 add r6 ← r6, #4
15 add r2 ← r2, #4       25 add r7 ← r7, #4
06 clt p1 ← r6, r5
07 b   p1, @loop

cycle | ALU 1 | ALU 2 | MEM | control
  0   | 14    |       | 11  |
  1   | 24    |       | 21  |
  2   | 12    | 06    |     |
  3   | 22    |       |     |
  4   |       |       |     |
  5   | 15    |       | 13  |
  6   | 25    |       | 23  | 07

[diagram: iterations 1,2 | 3,4 | 5,6 | … | 999,1000 execute in time]

§ Throughput: 7 cycles / 2 iterations (−42%)
Code size (schedule length): 7 (+17%)
Increasing ILP w/ loop unrolling
§ Unrolling 4x

11 ld  r3 ← mem[r1]      21 ld  r8 ← mem[r6]
12 mul r4 ← r3, #3       22 mul r9 ← r8, #3
13 st  mem[r2] ← r4      23 st  mem[r7] ← r9
14 add r1 ← r1, #4       24 add r6 ← r6, #4
15 add r2 ← r2, #4       25 add r7 ← r7, #4

31 ld  r12 ← mem[r10]    41 ld  r16 ← mem[r14]
32 mul r13 ← r12, #3     42 mul r17 ← r16, #3
33 st  mem[r11] ← r13    43 st  mem[r15] ← r17
34 add r10 ← r10, #4     44 add r14 ← r14, #4
35 add r11 ← r11, #4     45 add r15 ← r15, #4
06 clt p1 ← r6, r5
07 b   p1, @loop

cycle | ALU 1 | ALU 2 | MEM | control
  0   | 14    |       | 11  |
  1   | 24    |       | 21  |
  2   | 34    | 12    | 31  |
  3   | 44    | 22    | 41  |
  4   |       | 32    |     |
  5   | 15    | 42    | 13  |
  6   | 25    | 06    | 23  |
  7   | 35    |       | 33  |
  8   | 45    |       | 43  | 07

[diagram: iterations 1,2,3,4 | 5,6,7,8 | 9,10,11,12 | … | 997,998,999,1000 execute in time]

Throughput: 9 cycles / 4 iterations (−63%)
Code size (scheduled instructions): 9 (+50%)
Increasing ILP w/ loop unrolling § Scheduling loops § Unrolling: performance improvements, but § The scheduling barrier is still there; only "unroll factor" loop bodies can be overlapped § Increase in § Code size § Register pressure § Unrolling is useful for loops with § Lots of control flow within the loop body § Trace scheduling can find the most likely path § Unrolling "ignores" loop structure
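The source-level effect of unrolling can be sketched in a few lines. A minimal Python version (hypothetical, not from the slides) of the running example unrolled 4x, assuming the trip count is a multiple of 4 so no clean-up loop is needed:

```python
# Hypothetical sketch of 4x unrolling for b[i] = a[i] * 3.
# Assumes len(a) is a multiple of 4 (true for the 0x1000-element example),
# so no remainder loop is required.
def scaled_copy_unrolled(a, b):
    i = 0
    n = len(a)
    while i < n:                  # one back-edge per 4 original iterations
        b[i]     = a[i]     * 3
        b[i + 1] = a[i + 1] * 3
        b[i + 2] = a[i + 2] * 3
        b[i + 3] = a[i + 3] * 3
        i += 4
    return b

print(scaled_copy_unrolled([1, 2, 3, 4], [0, 0, 0, 0]))  # [3, 6, 9, 12]
```

The four statement copies have no control flow between them, which is exactly what lets the scheduler overlap them; the cost is the code-size and register-pressure growth noted above.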
8.2.3.1 Software pipelining § Exploit loop structure of program § Pipelining: overlap of stages
Let’s try again
§ Scheduling the first iteration on our 4-issue machine

1 ld  r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st  mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b   p1, @loop

iteration 1
cycle | ALU 1    | ALU 2 | MEM  | control
  0   |          |       | 1 ld |
  1   | 4 add r1 |       |      |
  2   | 2 mul    |       |      |
  3   |          |       |      |
  4   | 6 clt    |       |      |
  5   | 5 add r2 |       | 3 st | 7 b
Let’s try again
§ Consider an 8-issue machine
§ Duplicate of the 4-issue machine
§ Control the 2nd group by predicate Q
§ Execute only if Q==true
§ "Predicated execution"
Let’s try again
§ Scheduling the 2nd iteration on an 8-issue machine

1 ld  r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st  mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b   p1, @loop

cycle | iteration 1          | iteration 2
  0   | 1 ld                 |
  1   | 4 add r1             |
  2   | 2 mul                | 1 ld
  3   |                      | 4 add r1
  4   | 6 clt                | 2 mul
  5   | 5 add r2, 3 st, 7 b  |
  6   |                      | 6 clt
  7   |                      | 5 add r2, 3 st, 7 b
Let’s try again
§ Scheduling the 3rd iteration on a 12-issue machine

1 ld  r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st  mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b   p1, @loop

cycle | iteration 1          | iteration 2          | iteration 3
  0   | 1 ld                 |                      |
  1   | 4 add r1             |                      |
  2   | 2 mul                | 1 ld                 |
  3   |                      | 4 add r1             |
  4   | 6 clt                | 2 mul                | 1 ld
  5   | 5 add r2, 3 st, 7 b  |                      | 4 add r1
  6   |                      | 6 clt                | 2 mul
  7   |                      | 5 add r2, 3 st, 7 b  |
  8   |                      |                      | 6 clt
  9   |                      |                      | 5 add r2, 3 st, 7 b
Let’s try again
§ Observation: schedules of different iterations are identical
§ except for the start-up cycles
§ except for the wind-down cycles
§ Schedules can be overlapped

cycle | ALU 1   | ALU 2   | MEM     | control
  0   |         |         | 1 (it1) |
  1   | 4 (it1) |         |         |
  2   | 2 (it1) |         | 1 (it2) |
  3   | 4 (it2) |         |         |
  4   | 6 (it1) | 2 (it2) | 1 (it3) |
  5   | 5 (it1) | 4 (it3) | 3 (it1) | 7 (it1)
  6   | 6 (it2) | 2 (it3) | 1 (it4) |
  7   | 5 (it2) | 4 (it4) | 3 (it2) | 7 (it2)
  8   | 6 (it3) | 2 (it4) | 1 (it5) |
  9   | 5 (it3) | 4 (it5) | 3 (it3) | 7 (it3)
  …   | …       | …       | …       | …
Performance
§ Throughput
§ for each individual iteration: 6 cycles
§ 1st iteration: 6 cycles
§ each additional iteration: 2 cycles
§ n iterations: 2*n + 4 cycles
§ Code size (w/ predicated execution)
§ 7 instructions
§ Predicate Q: r1 < r5

Kernel (steady state, repeats every 2 cycles):
cycle | ALU 1 | ALU 2 | MEM  | control
even  | 6 clt | 2 mul | 1 ld |
odd   | 5 add | 4 add | 3 st | 7 b
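The throughput claim above reduces to a one-line formula. A quick Python check (sketch, values from this example only):

```python
# Sketch: total cycles for n software-pipelined iterations of this example.
# The first iteration takes 6 cycles; each later iteration starts 2 cycles
# after its predecessor, so it adds only 2 cycles to the total.
def total_cycles(n):
    return 2 * n + 4

print(total_cycles(1))     # 6  (a single iteration still takes 6 cycles)
print(total_cycles(1000))  # 2004, vs. 6000 for the non-pipelined schedule
```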
Software pipelining
§ Standard techniques (region scheduling, loop fusion, loop unrolling) do not yield sufficient ILP
§ Method of choice: software pipelining
§ Overlap iterations of the loop body
§ Steady state: kernel
§ Peak performance: 1 loop iteration/cycle
§ No scheduling barriers between iterations
§ No loop unrolling necessary
§ Requires sufficient resources

[diagram: iterations 1 … 1000 overlapped in time; in the steady state four iterations are in flight at once]
Software pipelining
§ Basic idea
§ Unroll the loop "completely"
§ Correctly schedule the loop under two constraints
§ All iteration bodies have identical schedules
§ Each new iteration starts exactly II (initiation interval) cycles after the previous iteration
§ Execution time in terms of stage count (SC)
§ One loop iteration: SC×II cycles
§ Prologue/epilogue: (SC−1)×II cycles
§ Kernel steady state: II cycles
§ Execution time of a software-pipelined loop: II×(n+SC−1) cycles

[diagram: overlapped iterations 1 … 4 annotated with the initiation interval II and the stage count SC]
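The execution-time formula can be checked directly against the running example, where II = 2 and SC = 3 (a 6-cycle iteration split into 3 stages of 2 cycles each). A minimal Python sketch:

```python
# Sketch of the slide's execution-time formula for a software-pipelined loop:
# II * (n + SC - 1) cycles for n iterations.
def pipelined_cycles(n, ii, sc):
    return ii * (n + sc - 1)

# Running example: II = 2, SC = 3.
print(pipelined_cycles(1, 2, 3))     # 6: one iteration, prologue + epilogue
print(pipelined_cycles(1000, 2, 3))  # 2004, i.e. 2*n + 4 for this example
```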
Modulo scheduling § Most common technique to find software-pipelined schedules § Basic concept § Unroll the loop "completely" § Schedule the loop under two constraints § All iteration bodies have identical schedules § Each new iteration starts exactly II cycles after the previous iteration
Modulo scheduling: problem formulation § Problem: find a schedule for one loop body iteration such that, when the schedule is repeated at intervals of II cycles § No hardware resource conflict arises between operations of the same and successive iterations of the loop body § No intra-/inter-loop dependences are violated
Modulo scheduling: resource constraints § Handling resource constraints § No resource may be used by different operations at two points in time that are separated by an interval that is a multiple of the initiation interval § This requirement is identical to: within a single iteration, no resource is ever used more than once at the same time modulo II § Search for a suitable initiation interval
Modulo scheduling: resource constraints
§ Modulo reservation tables
§ Table containing II rows and one column for each resource

II = 2
cycle | ALU 1 | ALU 2 | MEM | control
  0   |       |       |     |
  1   |       |       |     |

II = 3
cycle | ALU 1 | ALU 2 | MEM | control
  0   |       |       |     |
  1   |       |       |     |
  2   |       |       |     |

§ Scheduling op at time t on resource r
§ Entry for r at t mod II must be free
§ Mark t mod II busy for r
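The two rules above translate directly into a small data structure. A minimal Python sketch of a modulo reservation table (class and method names are mine, not from the slides):

```python
# Hypothetical modulo reservation table: II rows, one column per resource.
# An op at time t on resource r is legal iff row (t mod II) is free for r.
class ModuloReservationTable:
    def __init__(self, ii, resources):
        self.ii = ii
        self.table = {r: [None] * ii for r in resources}

    def can_schedule(self, resource, t):
        return self.table[resource][t % self.ii] is None

    def reserve(self, resource, t, op):
        assert self.can_schedule(resource, t)
        self.table[resource][t % self.ii] = op

mrt = ModuloReservationTable(2, ["ALU1", "ALU2", "MEM", "control"])
mrt.reserve("MEM", 0, "ld")        # cycle 0 -> row 0
mrt.reserve("MEM", 5, "st")        # cycle 5 -> row 1
print(mrt.can_schedule("MEM", 2))  # False: row 0 already holds the load
```

With II = 2, the single memory unit offers exactly two slots per iteration (rows 0 and 1), which the running example's ld and st fill completely.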
Modulo scheduling: resource constraints
§ Modulo reservation tables
§ Table containing II rows and one column for each resource

iteration 1
1 ld  r3 ← mem[r1]
2 mul r4 ← r3, #3
3 st  mem[r2] ← r4
4 add r1 ← r1, #4
5 add r2 ← r2, #4
6 clt p1 ← r1, r5
7 b   p1, @loop

schedule: cycle 0: 1 ld; cycle 1: 4 add r1; cycle 2: 2 mul; cycle 4: 6 clt; cycle 5: 5 add r2, 3 st, 7 b

Modulo reservation table, II = 2:
cycle | ALU 1 | ALU 2 | MEM  | control
  0   | 2 mul | 6 clt | 1 ld |
  1   | 4 add | 5 add | 3 st | 7 b
Modulo scheduling: dependence constraints
§ Dependence constraints
§ Both loop-independent and loop-carried dependences must be considered
§ Annotate each edge in the data dependence graph with a tuple t = <distance, delay>
§ Delay: minimum time interval between the start of the operations

Dependence | Delay                             | Conservative delay
True       | latency(pred)                     | latency(pred)
Anti       | 1 − latency(succ)                 | 0
Output     | 1 + latency(pred) − latency(succ) | latency(pred)
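For a modulo schedule, a <distance, delay> edge is commonly checked with the constraint t(succ) − t(pred) ≥ delay − II × distance, since an iteration that is `distance` iterations later starts II × distance cycles later. A minimal Python sketch (the helper name is mine), using times from the running example's II = 2 schedule:

```python
# Sketch: a dependence edge <distance, delay> is respected by a modulo
# schedule iff  t(succ) - t(pred) >= delay - II * distance,
# where t() are issue cycles within one iteration's schedule.
def edge_ok(t_pred, t_succ, delay, distance, ii):
    return t_succ - t_pred >= delay - ii * distance

# Loop-independent true dependence ld -> mul (delay 2, distance 0):
# ld at cycle 0, mul at cycle 2.
print(edge_ok(0, 2, 2, 0, 2))   # True: mul issues 2 cycles after ld

# Loop-carried true dependence add r1 -> next iteration's ld
# (delay 1, distance 1): add at cycle 1, ld at cycle 0 of its iteration.
# The next iteration starts II cycles later, so the edge still holds.
print(edge_ok(1, 0, 1, 1, 2))   # True: -1 >= 1 - 2
```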