exploitation of instruction level parallelism



  1. Exploitation of instruction level parallelism
     Computer Architecture
     J. Daniel García Sánchez (coordinator), David Expósito Singh, Francisco Javier García Blas
     ARCOS Group, Computer Science and Engineering Department, University Carlos III of Madrid
     Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es

  2. Outline
     1. Compilation techniques and ILP
     2. Advanced branch prediction techniques
     3. Introduction to dynamic scheduling
     4. Speculation
     5. Multiple issue techniques
     6. ILP limits
     7. Thread level parallelism
     8. Conclusion

  3. Compilation techniques and ILP: Taking advantage of ILP
     - ILP is directly applicable to basic blocks.
       - Basic block: sequence of instructions without branching.
       - Typical program in MIPS: average basic block size from 3 to 6 instructions.
       - Low ILP exploitation within a block.
       - Need to exploit ILP across basic blocks.
     - Loop level parallelism.
       - Can be transformed to ILP.
       - By compiler or hardware.
     - Alternative:
       - Vector instructions.
       - SIMD instructions in processor.

     Example:

         for (i = 0; i < 1000; i++) {
             x[i] = x[i] + y[i];
         }

  4. Scheduling and loop unrolling
     - Parallelism exploitation:
       - Interleave execution of unrelated instructions.
       - Fill stalls with instructions.
       - Do not alter original program effects.
     - Compiler can do this with detailed knowledge of the architecture.

  5. ILP exploitation
     Example (each iteration body is independent):

         for (i = 999; i >= 0; i--) {
             x[i] = x[i] + s;
         }

     Latencies between instructions:

         Instruction producing result   Instruction using result   Latency (cycles)
         FP ALU operation               FP ALU operation           3
         FP ALU operation               Store double               2
         Load double                    FP ALU operation           1
         Load double                    Store double               0

  6. Compiled code
     - R1 → Last array element.
     - F2 → Scalar s.
     - R2 → Precomputed so that 8(R2) is the first element in array.

     Assembler code:

         Loop: L.D    F0, 0(R1)      ; F0 <- x[i]
               ADD.D  F4, F0, F2     ; F4 <- F0 + s
               S.D    F4, 0(R1)      ; x[i] <- F4
               DADDUI R1, R1, #-8    ; i--
               BNE    R1, R2, Loop   ; Branch if R1 != R2

  7. Stalls in execution

         Original                        Stalls
         Loop: L.D    F0, 0(R1)          Loop: L.D    F0, 0(R1)
               ADD.D  F4, F0, F2               <stall>
               S.D    F4, 0(R1)                ADD.D  F4, F0, F2
               DADDUI R1, R1, #-8              <stall>
               BNE    R1, R2, Loop             <stall>
                                               S.D    F4, 0(R1)
                                               DADDUI R1, R1, #-8
                                               <stall>
                                               BNE    R1, R2, Loop

  8. Loop scheduling

         Original                        Scheduled
         Loop: L.D    F0, 0(R1)          Loop: L.D    F0, 0(R1)
               <stall>                         DADDUI R1, R1, #-8
               ADD.D  F4, F0, F2               ADD.D  F4, F0, F2
               <stall>                         <stall>
               <stall>                         <stall>
               S.D    F4, 0(R1)                S.D    F4, 8(R1)
               DADDUI R1, R1, #-8              BNE    R1, R2, Loop
               <stall>
               BNE    R1, R2, Loop

         9 cycles per iteration.         7 cycles per iteration.
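The two cycle counts can be reproduced with a small model. The C sketch below is not part of the slides. It assumes an in-order, single-issue pipeline, takes its latencies from the table on slide 5, and adds one assumed stall between an integer ALU result and a dependent branch, which is what the `<stall>` before `BNE` in the listings implies. The names `Instr`, `latency`, `block_cycles`, `original_loop` and `scheduled_loop` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes used by the latency table. */
typedef enum { LOAD, FP_ALU, STORE, INT_ALU, BRANCH } Kind;

/* Register numbering is arbitrary; NONE marks an unused operand slot. */
enum { NONE = -1, R1 = 1, R2 = 2, F0 = 100, F2 = 102, F4 = 104 };

typedef struct {
    Kind kind;
    int dst;   /* register written, or NONE */
    int src1;  /* registers read, or NONE */
    int src2;
} Instr;

/* Stall cycles needed between a producer and a dependent consumer. */
static int latency(Kind producer, Kind consumer) {
    if (producer == FP_ALU && consumer == FP_ALU) return 3;
    if (producer == FP_ALU && consumer == STORE) return 2;
    if (producer == LOAD && consumer == FP_ALU) return 1;
    if (producer == INT_ALU && consumer == BRANCH) return 1; /* assumption */
    return 0; /* includes load -> store */
}

/* Cycle in which the last instruction issues; the first issues in cycle 1. */
int block_cycles(const Instr *code, size_t n) {
    int issue[64];
    assert(n > 0 && n <= 64);
    for (size_t i = 0; i < n; i++) {
        /* In-order issue: at best one cycle after the previous instruction. */
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;
        const int srcs[2] = { code[i].src1, code[i].src2 };
        for (int s = 0; s < 2; s++) {
            if (srcs[s] == NONE) continue;
            /* Find the most recent earlier writer of this source register. */
            for (size_t j = i; j-- > 0; ) {
                if (code[j].dst == srcs[s]) {
                    int ready = issue[j] + 1 + latency(code[j].kind, code[i].kind);
                    if (ready > cycle) cycle = ready;
                    break;
                }
            }
        }
        issue[i] = cycle;
    }
    return issue[n - 1];
}

static const Instr original_loop[] = {
    { LOAD,    F0,   R1, NONE },  /* L.D    F0, 0(R1)    */
    { FP_ALU,  F4,   F0, F2   },  /* ADD.D  F4, F0, F2   */
    { STORE,   NONE, F4, R1   },  /* S.D    F4, 0(R1)    */
    { INT_ALU, R1,   R1, NONE },  /* DADDUI R1, R1, #-8  */
    { BRANCH,  NONE, R1, R2   },  /* BNE    R1, R2, Loop */
};

static const Instr scheduled_loop[] = {
    { LOAD,    F0,   R1, NONE },  /* L.D    F0, 0(R1)    */
    { INT_ALU, R1,   R1, NONE },  /* DADDUI R1, R1, #-8  */
    { FP_ALU,  F4,   F0, F2   },  /* ADD.D  F4, F0, F2   */
    { STORE,   NONE, F4, R1   },  /* S.D    F4, 8(R1)    */
    { BRANCH,  NONE, R1, R2   },  /* BNE    R1, R2, Loop */
};
```

Under these assumptions, `block_cycles(original_loop, 5)` evaluates to 9 and `block_cycles(scheduled_loop, 5)` to 7, in agreement with the slide.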

  9. Loop unrolling
     - Idea:
       - Replicate loop body several times.
       - Adjust termination code.
       - Use different registers for each iteration replica to reduce dependencies.
     - Effect:
       - Increase basic block length.
       - Increase use of available ILP.

  10. Unrolling
      Unrolling (x4):

          Loop: L.D    F0, 0(R1)
                ADD.D  F4, F0, F2
                S.D    F4, 0(R1)
                L.D    F6, -8(R1)
                ADD.D  F8, F6, F2
                S.D    F8, -8(R1)
                L.D    F10, -16(R1)
                ADD.D  F12, F10, F2
                S.D    F12, -16(R1)
                L.D    F14, -24(R1)
                ADD.D  F16, F14, F2
                S.D    F16, -24(R1)
                DADDUI R1, R1, #-32
                BNE    R1, R2, Loop

      - 4 iterations require more registers.
      - This example assumes that array size is multiple of 4.
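At source level, the same transformation might look like the following C sketch, which is not taken from the slides. `add_scalar` mirrors the loop from slide 5 and `add_scalar_unrolled4` is its four-way unrolled form; both function names are hypothetical. The temporaries `t0` to `t3` play the role of the distinct registers F4, F8, F12 and F16, and, like the slide, the unrolled version assumes the array length is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: x[i] = x[i] + s, walking from the last element down. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i-- > 0; ) {
        x[i] = x[i] + s;
    }
}

/* Same computation with the body replicated four times per iteration. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);  /* assumption carried over from the slide */
    for (size_t i = n; i > 0; i -= 4) {
        /* Independent temporaries: no replica waits on another. */
        double t0 = x[i - 1] + s;
        double t1 = x[i - 2] + s;
        double t2 = x[i - 3] + s;
        double t3 = x[i - 4] + s;
        x[i - 1] = t0;
        x[i - 2] = t1;
        x[i - 3] = t2;
        x[i - 4] = t3;
    }
}
```

The loop counter is updated and tested once per four elements, which is the amortized overhead that unrolling is after. A compiler may well perform this transformation on its own, so the sketch is meant as an illustration rather than a recommended hand optimization.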

  11. Stalls and unrolling
      Unrolling (x4):

          Loop: L.D    F0, 0(R1)
                <stall>
                ADD.D  F4, F0, F2
                <stall>
                <stall>
                S.D    F4, 0(R1)
                L.D    F6, -8(R1)
                <stall>
                ADD.D  F8, F6, F2
                <stall>
                <stall>
                S.D    F8, -8(R1)
                L.D    F10, -16(R1)
                <stall>
                ADD.D  F12, F10, F2
                <stall>
                <stall>
                S.D    F12, -16(R1)
                L.D    F14, -24(R1)
                <stall>
                ADD.D  F16, F14, F2
                <stall>
                <stall>
                S.D    F16, -24(R1)
                DADDUI R1, R1, #-32
                <stall>
                BNE    R1, R2, Loop

      27 cycles for every 4 iterations → 6.75 cycles per iteration.

  12. Scheduling and unrolling
      - Code reorganization.
        - Preserve dependencies.
        - Semantically equivalent.
      - Goal: Make use of stalls.
      - Update of R1 at enough distance from BNE.
      - 14 cycles for every 4 iterations → 3.5 cycles per iteration.

      Unrolling (x4):

          Loop: L.D    F0, 0(R1)
                L.D    F6, -8(R1)
                L.D    F10, -16(R1)
                L.D    F14, -24(R1)
                ADD.D  F4, F0, F2
                ADD.D  F8, F6, F2
                ADD.D  F12, F10, F2
                ADD.D  F16, F14, F2
                S.D    F4, 0(R1)
                S.D    F8, -8(R1)
                S.D    F12, -16(R1)
                DADDUI R1, R1, #-32
                S.D    F16, 8(R1)
                BNE    R1, R2, Loop

  13. Limits of loop unrolling
      - Improvement decreases with each additional unrolling.
        - Improvement is limited to stall removal.
        - Overhead is amortized among iterations.
      - Increase in code size.
        - May affect the instruction cache miss rate.
      - Pressure on register file.
        - May cause a shortage of registers.
        - Advantages are lost if there are not enough available registers.
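The diminishing return can be made concrete with a back-of-the-envelope estimate. The C sketch below is an illustration, not a result from the slides. It assumes that the scheduler removes every stall from the unrolled body, as it does in the x4 example, so that each original iteration costs 3 instructions (L.D, ADD.D, S.D) plus a share of the 2 overhead instructions (DADDUI, BNE). The function name `cycles_per_iteration` is hypothetical.

```c
#include <assert.h>

/* Estimated cycles per original iteration for a stall-free body
   unrolled k times: (3k + 2) cycles cover k iterations. */
double cycles_per_iteration(int k) {
    assert(k >= 2);  /* assumption: stall-free scheduling needs some unrolling */
    return (3.0 * k + 2.0) / k;
}
```

For k = 4 this gives the 3.5 cycles per iteration of the scheduled x4 example; k = 8 gives 3.25 and k = 16 gives 3.125. Each doubling of the code size buys less, and the estimate never drops below 3 because only the loop overhead is being amortized.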

  14. Outline: section 2, Advanced branch prediction techniques

  15. Advanced branch prediction techniques: Branch prediction
      - High impact of branches on program performance.
      - To reduce impact:
        - Loop unrolling.
        - Branch prediction:
          - Compile time.
          - Each branch handled in isolation.
        - Advanced branch prediction:
          - Correlated predictors.
          - Tournament predictors.

  16. Dynamic scheduling
      - Hardware reorders instruction execution to reduce stalls while preserving data flow and exception behavior.
      - Able to handle cases that are unknown at compile time:
        - Cache misses/hits.
      - Code is less dependent on a concrete pipeline.
        - Simplifies the compiler.
      - Permits hardware speculation.

  17. Correlated prediction
      Example:

          if (a == 2) { a = 0; }
          if (b == 2) { b = 0; }
          if (a != b) { f(); }

      - If the first and second branches are taken (both conditions hold, so a and b are both set to 0), the third is NOT taken.
      - Maintains a history of the last branches to select among several predictors.
      - An (m, n) predictor:
        - Uses the result of the last m branches to select among 2^m predictors.
        - Each predictor has n bits.
      - Predictor (1, 2):
        - Uses the result of the last branch to select among 2 predictors.
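A (1, 2) predictor of this kind could be sketched in C as follows. This is an illustrative model, not the implementation of any real processor: the table size `ENTRIES`, the indexing by `pc % ENTRIES`, the use of a single global history, and the names `Predictor`, `predict` and `update` are all assumptions. Each table entry holds 2^m two-bit saturating counters, and the last m branch outcomes choose which one to use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 1   /* m: number of previous branch outcomes tracked */
#define ENTRIES 16       /* hypothetical number of branch-address entries */

typedef struct {
    /* n = 2: each counter saturates in the range 0..3. */
    uint8_t counter[ENTRIES][1 << HISTORY_BITS];
    unsigned history;    /* last m outcomes, 1 = taken */
} Predictor;

/* Predict taken when the selected counter is in its upper half. */
bool predict(const Predictor *p, unsigned pc) {
    return p->counter[pc % ENTRIES][p->history] >= 2;
}

/* Train the selected counter, then shift the outcome into the history. */
void update(Predictor *p, unsigned pc, bool taken) {
    uint8_t *c = &p->counter[pc % ENTRIES][p->history];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    p->history = ((p->history << 1) | (taken ? 1u : 0u))
                 & ((1u << HISTORY_BITS) - 1u);
}
```

Starting from a zero-initialized `Predictor`, a branch that strictly alternates between taken and not taken is predicted correctly after a short warm-up, because the previous outcome selects a different counter for each case. A single two-bit counter without history cannot learn that pattern.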

  18. Size of predictor
      - An (m, n) predictor has several entries for each branch address.
      - Total size: S = 2^m × n × entries_address
      - Examples:
        - (0, 2) with 4K entries → 8 Kbit
        - (2, 2) with 4K entries → 32 Kbit
        - (2, 2) with 1K entries → 8 Kbit
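The size formula is simple enough to check mechanically. The helper below is a sketch with a hypothetical name, `predictor_size_bits`; it returns the size in bits and takes 4K to mean 4096 entries.

```c
#include <assert.h>

/* Total storage of an (m, n) predictor in bits: 2^m * n * entries. */
unsigned long predictor_size_bits(unsigned m, unsigned n, unsigned long entries) {
    assert(m < 32);  /* keep the shift well defined */
    return (1UL << m) * n * entries;
}
```

With these definitions, (0, 2) with 4096 entries gives 8192 bits, (2, 2) with 4096 entries gives 32768 bits, and (2, 2) with 1024 entries gives 8192 bits, which are the 8, 32 and 8 Kbit figures listed in the examples.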
