



  1. Previous lecture
     • stalls
       – reduce performance
       – but are required to get correct results
     • compiler
       – arranges code to avoid hazards and stalls
       – requires knowledge of the pipeline structure
     dt10 2011 13.1

  2. Branch hazards
     • branch outcome is determined in MEM stage
     • [figure: pipeline datapath and PC; annotation: "Flush these instructions (set control values to 0)"]

  3. Reducing branch delay
     • move hardware to determine outcome to ID stage
       – target address adder
       – register comparator
     • example: branch taken
         36: sub $10, $4, $8
         40: beq $1, $3, 7
         44: and $12, $2, $5
         48: or  $13, $2, $6
         52: add $14, $4, $2
         56: slt $15, $6, $7
         ...
         ??: lw  $4, 50($7)

  4. Example: branch taken

  5. Example: branch taken

  6. Delay slots: clawing back the stalls
     • taken branch always means one stall cycle
       – nothing we can do to get rid of it
       – can we use the stall cycle to do something useful?
     • MIPS approach: change the ISA specification
       – instruction following branch is always executed
       – delay slot instruction: executed even when branch taken
     • example
         source                 taken: no delay slot     taken: with delay slot
         do {
           $2 = $2 * $3;        24 mul  $2, $2, $3       24 mul  $2, $2, $3
           $3 = $3 - 1;         28 addi $3, $3, -1       28 beq  $3, $0, -2
         } while ($3 == 0);     32 beq  $3, $0, -3       32 addi $3, $3, -1
                                   stall
         $3 = $2 + $4;          36 add  $3, $2, $4       36 add  $3, $2, $4
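
The branch offsets in the two listings differ (-3 and -2) because the beq moves up one slot. A minimal sketch of the arithmetic, assuming the usual MIPS rule that the offset counts words from the instruction after the branch; the function name `branch_target` is illustrative and not from the slides:

```python
def branch_target(branch_addr: int, offset_words: int) -> int:
    """Target of a MIPS beq: address of the next instruction plus offset * 4."""
    return branch_addr + 4 + 4 * offset_words


# No delay slot: beq sits at address 32 with offset -3.
print(branch_target(32, -3))  # → 24

# With delay slot: beq moves up to address 28, so the offset becomes -2.
print(branch_target(28, -2))  # → 24
```

Both versions branch back to the mul at address 24.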

  7. Data hazards for branches
     • if a comparison register is a destination of 2nd or 3rd preceding ALU instruction
         add $1, $2, $3        IF  ID  EX  MEM WB
         add $4, $5, $6            IF  ID  EX  MEM WB
         …                             IF  ID  EX  MEM WB
         beq $1, $4, target                IF  ID  EX  MEM WB
     • can resolve using forwarding

  8. Data hazards for branches
     • two data hazards that cause stall on branch
       – comparison reg. is destination of preceding ALU instr.
       – comparison reg. is destination of 2nd preceding load instr.
     • need 1 stall cycle
         lw  $1, addr          IF  ID  EX  MEM WB
         add $4, $5, $6            IF  ID  EX  MEM WB
         beq $1, $4, target            IF  ID  EX  MEM WB

  9. Data hazards for branches
     • two data hazards that cause stall on branch
       – comparison reg. is destination of preceding ALU instr.
       – comparison reg. is destination of 2nd preceding load instr.
     • need 1 stall cycle
         lw  $1, addr          IF  ID  EX  MEM WB
         add $4, $5, $6            IF  ID  EX  MEM WB
         beq $1, $4, target            IF  ID  ID  EX  MEM WB   (first ID: beq stalled)

  10. Data hazards for branches
      • if a comparison register is a destination of immediately preceding load instruction
        – need 2 stall cycles
          lw  $1, addr          IF  ID  EX  MEM WB
          beq $1, $0, target        IF  ID  ID  ID  EX  MEM WB   (first two ID: beq stalled)
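
The stall counts on the last four slides follow one pattern, which can be sketched as below. This is a summary under the slides' assumptions (branch resolved in ID, forwarding available), not a model of any particular hardware; the function name and the `"alu"`/`"load"` labels are illustrative.

```python
def branch_stall_cycles(producer: str, distance: int) -> int:
    """Stall cycles for a branch whose comparison register is written by an
    earlier instruction.

    producer: "alu" or "load", the kind of instruction writing the register.
    distance: 1 if it immediately precedes the branch, 2 for the 2nd
              preceding instruction, and so on.
    """
    if producer not in ("alu", "load"):
        raise ValueError("producer must be 'alu' or 'load'")
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # With forwarding, an ALU result reaches the ID-stage comparator without
    # stalling from distance 2 onwards; a loaded value from distance 3 onwards.
    ready_distance = {"alu": 2, "load": 3}[producer]
    return max(0, ready_distance - distance)


print(branch_stall_cycles("alu", 1))   # → 1
print(branch_stall_cycles("load", 2))  # → 1
print(branch_stall_cycles("load", 1))  # → 2
```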

  11. Dynamic branch prediction
      • deeper and superscalar pipelines
        – branch penalty is more significant
      • use dynamic prediction
        – branch prediction buffer (aka branch history table)
        – indexed by recent branch instruction addresses
        – stores outcome (taken/not taken)
      • dynamic prediction: execute a branch
        – check table, expect the same outcome
        – start fetching from fall-through or target
        – if wrong, flush pipeline and flip prediction

  12. 1-bit predictor: shortcoming
      • inner loop branches mispredicted twice!
          outer: …
                 …
          inner: …
                 …
                 beq …, …, inner
                 …
                 beq …, …, outer
        – mispredict as taken on last iteration of inner loop
        – then mispredict as not taken on first iteration of inner loop next time around
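
A small simulation makes the double misprediction concrete. This is a sketch only: the loop sizes are hypothetical, the predictor is assumed to start at "not taken", and only the inner-loop branch is modelled.

```python
def one_bit_mispredictions(outcomes, predict_taken=False):
    """Count mispredictions of a 1-bit predictor.

    outcomes: sequence of actual branch outcomes (True = taken).
    predict_taken: initial prediction (an assumption; the slides do not say).
    """
    misses = 0
    for taken in outcomes:
        if taken != predict_taken:
            misses += 1
            predict_taken = taken  # wrong: flip the stored prediction
    return misses


inner_iterations = 4  # hypothetical loop sizes
outer_iterations = 3

# The inner-loop branch is taken on every iteration except the last.
one_pass = [True] * (inner_iterations - 1) + [False]
print(one_bit_mispredictions(one_pass * outer_iterations))  # → 6
```

Six mispredictions over three passes is two per pass: one on the first iteration and one on the last.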

  13. 2-Bit predictor
      • only change prediction on two successive mispredictions
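
One common way to realise this is a 2-bit saturating counter. The sketch below assumes that variant (the slide's own state diagram is not in this transcript) and reuses a hypothetical inner loop of 4 iterations run 3 times.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating-counter predictor.

    counter: 0 or 1 predicts not taken, 2 or 3 predicts taken.
             The starting value 3 ("strongly taken") is an assumption.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if taken != predicted_taken:
            misses += 1
        # Move one step towards the actual outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Inner-loop branch: taken three times, then not taken, for three passes.
one_pass = [True, True, True, False]
print(two_bit_mispredictions(one_pass * 3))  # → 3
```

Only the final iteration of each pass is mispredicted, because a single not-taken outcome moves the counter from 3 to 2 and the prediction stays "taken".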

  14. Calculating the branch target
      • even with predictor, still need to calculate the target address
        – 1-cycle penalty for a taken branch
      • branch target buffer
        – cache of target addresses
        – indexed by PC when instruction fetched
        – if hit and instruction is branch predicted taken, can fetch target immediately
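
The lookup can be sketched as follows. This is a simplification: a real branch target buffer is a small fixed-size cache with tags, whereas this version uses an unbounded dictionary, and the class, method names and addresses are all hypothetical.

```python
class BranchTargetBuffer:
    """Toy branch target buffer: maps a branch PC to its last known target."""

    def __init__(self):
        # branch PC -> (target address, predicted taken?)
        self.entries = {}

    def update(self, pc, target, taken):
        """Record the resolved target and outcome of the branch at pc."""
        self.entries[pc] = (target, taken)

    def next_fetch(self, pc):
        """Address to fetch after pc: the cached target on a hit that is
        predicted taken, otherwise the fall-through address pc + 4."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]
        return pc + 4


btb = BranchTargetBuffer()
btb.update(0x100, 0x200, taken=True)  # hypothetical branch at 0x100
print(hex(btb.next_fetch(0x100)))     # → 0x200
print(hex(btb.next_fetch(0x104)))     # → 0x108
```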

  15. The role of the compiler
      • compilers can have a huge impact on performance
        – register allocation
        – instruction selection
        – data placement
      • optimisation is subordinate to correctness
        – must always compile against ISA specification
        – can try to optimise code according to architecture
      • CPU specific optimisation may reduce performance
        – optimise for P4 → might be slower than generic code on P3
      • ISA extensions remove backwards compatibility
        – optimise for P4 → SSE not available on P2

  16. Compiling to avoid hazards
      • data hazards
        – instruction scheduling: avoid load-use data hazard
        – register allocation: avoid immediate re-use of registers
        – MIPS: large number of registers to make this easier
      • structural hazards
        – instruction selection: select simple instructions
        – e.g.: sll $1,$2,1 vs. add $1,$2,$2
        – instruction scheduling: move instructions apart
      • control hazards
        – instruction selection: eliminate branches if possible
        – e.g.: cmov: conditional move
        – e.g.: predicated execution

  17. Exceptions and interrupts
      • unexpected events requiring change in flow of control
        – different ISAs use the terms differently
      • exception: internal signal, arises from within the CPU
        – e.g. undefined opcode, overflow, syscall, …
      • interrupt: external signal, source is outside CPU
        – e.g. external IO: hard disk saying “your data is ready now”!
      • must handle them without sacrificing performance
        – exceptions are... exceptional – not the common/expected case
        – interrupts are frequent, but not that frequent
      • CPU instruction rate: > 1 GHz; interrupt rate: < 10 kHz

  18. Handling exceptions in MIPS
      • exceptions managed by System Control Coprocessor
        – follows set of steps to record and handle exception
        1. save PC of offending (or interrupted) instruction
           – in Exception Program Counter, EPC
        2. save indication of the problem
           – in Cause Register
           – 0 for undefined opcode, 1 for overflow
        3. jump to handler at 8000 0180
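
The three steps can be sketched as a state update. The `CpuState` class and function name are hypothetical, the cause codes are the simplified ones given on the slide, and only the three registers involved are modelled.

```python
from dataclasses import dataclass

HANDLER_ADDRESS = 0x8000_0180  # single handler entry point from the slide
CAUSE_UNDEFINED_OPCODE = 0     # simplified cause codes as used on the slide
CAUSE_OVERFLOW = 1


@dataclass
class CpuState:
    """Hypothetical minimal CPU state: just the registers involved."""
    pc: int
    epc: int = 0
    cause: int = 0


def raise_exception(cpu: CpuState, cause: int) -> None:
    cpu.epc = cpu.pc          # 1. save PC of the offending instruction
    cpu.cause = cause         # 2. save indication of the problem
    cpu.pc = HANDLER_ADDRESS  # 3. jump to the handler


cpu = CpuState(pc=0x4C)
raise_exception(cpu, CAUSE_OVERFLOW)
print(hex(cpu.epc), cpu.cause, hex(cpu.pc))  # → 0x4c 1 0x80000180
```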

  19. An alternate mechanism
      • vectored interrupts
        – handler address determined by Cause Register
      • example:
        – undefined opcode: C000 0000
        – overflow: C000 0020
        – …: C000 0040
      • instructions either
        – deal with the interrupt, or jump to real handler
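
The example addresses are 0x20 bytes apart, which suggests a simple base-plus-offset calculation. The sketch below assumes that spacing and the cause numbering 0 for undefined opcode and 1 for overflow; both are inferred from the slides' examples rather than stated as a general rule.

```python
VECTOR_BASE = 0xC000_0000  # first handler address in the slide's example
VECTOR_SPACING = 0x20      # assumed: 32 bytes, i.e. 8 instructions per slot


def vectored_handler_address(cause: int) -> int:
    """Handler entry point for a given cause code (illustrative only)."""
    return VECTOR_BASE + cause * VECTOR_SPACING


print(hex(vectored_handler_address(0)))  # → 0xc0000000
print(hex(vectored_handler_address(1)))  # → 0xc0000020
```

With only 8 instruction slots per entry, anything longer has to jump to a full handler elsewhere.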

  20. Handler actions
      • read Cause Register, and transfer to relevant handler
      • determine action required
      • if restartable
        – take corrective action
        – use EPC to return to program
      • otherwise
        – terminate program
        – report error using EPC, Cause, …

  21. Exceptions in a pipeline
      • another form of control hazard
      • consider exception on add in EX stage
          add $1, $2, $1
        – prevent $1 from being clobbered
        – complete previous instructions
        – flush add and subsequent instructions
        – set Cause and EPC register values
        – transfer control to handler
      • similar to mispredicted branch
        – use much of the same hardware

  22. Pipeline with exceptions

  23. Exception properties
      • restartable exceptions
        – pipeline can flush the instruction
        – handler executes, then returns to the instruction
        – refetched and executed from scratch
      • PC saved in EPC register
        – identifies offending instruction
        – actually PC + 4 is saved, handler must adjust
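
The adjustment is a single subtraction. A sketch, with an illustrative function name and a hypothetical instruction address of 0x4C:

```python
def offending_instruction_address(epc: int) -> int:
    """Undo the PC + 4 that the pipeline saved in EPC."""
    return epc - 4


# Hypothetical case: the offending instruction is at 0x4C, so EPC holds 0x50.
saved_epc = 0x4C + 4
print(hex(offending_instruction_address(saved_epc)))  # → 0x4c
```

A restartable handler would return to this adjusted address so that the instruction is refetched and executed from scratch.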

  24. Exception example
      • exception on add in
          40 sub $11, $2, $4
          44 and $12, $2, $5
          48 or  $13, $2, $6
          4C add $1, $2, $1
          50 slt $15, $6, $7
          54 lw  $16, 50($7)
          …
      • handler
          80000180 sw $25, 1000($0)
          80000184 sw $26, 1004($0)
          …

  25. Exception example

  26. Exception example

  27. Multiple exceptions
      • pipelining overlaps multiple instructions
        – could have multiple exceptions at once
      • simple way: deal with exception from earliest instruction
        – flush subsequent instructions
        – “precise” exceptions
      • in complex pipelines
        – multiple instructions issued per cycle
        – out-of-order completion
        – maintaining precise exceptions is difficult!

  28. Pipelining: summary
      • ISA influences design of datapath and control
      • datapath and control influence design of ISA
      • pipelining improves instruction throughput
        – using parallelism
        – more instructions completed per second
        – but latency for each instruction not reduced
      • hazards: structural, data, control
      • advanced issues
        – instruction-level parallelism, system-on-chip
        – courses: custom computing, advanced architectures
