Previous lecture
• stalls
  – reduce performance
  – but are required to get correct results
• compiler
  – arranges code to avoid hazards and stalls
  – requires knowledge of the pipeline structure
dt10 2011 13.1
Branch hazards
• branch outcome is determined in MEM stage
[Figure: pipeline datapath; the instructions fetched after the branch are flushed (control values set to 0)]
Reducing branch delay
• move hardware to determine outcome to ID stage
  – target address adder
  – register comparator
• example: branch taken
    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
    ...
    ??: lw  $4, 50($7)
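
The address marked ?? in the listing can be worked out from the usual MIPS branch rule: the offset counts words relative to the instruction after the branch. A minimal Python sketch of that arithmetic follows; the function name branch_target is made up for illustration.

```python
def branch_target(branch_pc: int, word_offset: int) -> int:
    """Target of a MIPS PC-relative branch: (PC + 4) + (offset * 4)."""
    return branch_pc + 4 + word_offset * 4


# beq $1, $3, 7 sits at address 40 in the example listing
print(branch_target(40, 7))  # prints 72
```

So, under that rule, the taken branch at 40 continues at address 72.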
Example: branch taken
[two figure slides; diagrams not preserved]
Delay slots: clawing back the stalls
• taken branch always means one stall cycle
  – nothing we can do to get rid of it
  – can we use the stall cycle to do something useful?
• MIPS approach: change the ISA specification
  – instruction following branch is always executed
  – delay slot instruction: executed even when branch taken

    C code               taken: no delay slot     taken: with delay slot
    do {
      $2 = $2 * $3;      24 mul  $2, $2, $3       24 mul  $2, $2, $3
      $3 = $3 - 1;       28 addi $3, $3, -1       28 beq  $3, $0, -2
    } while ($3 == 0);   32 beq  $3, $0, -3       32 addi $3, $3, -1
                            stall
    $3 = $2 + $4;        36 add  $3, $2, $4       36 add  $3, $2, $4
Data hazards for branches
• if a comparison register is a destination of 2nd or 3rd preceding ALU instruction
    add $1, $2, $3        IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    …                             IF  ID  EX  MEM WB
    beq $1, $4, target                IF  ID  EX  MEM WB
• can resolve using forwarding
Data hazards for branches
• two data hazards that cause a stall on a branch
  – comparison reg. is destination of the immediately preceding ALU instr.
  – comparison reg. is destination of the 2nd preceding load instr.
• need 1 stall cycle
    lw  $1, addr          IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    beq stalled                   IF  ID
    beq $1, $4, target                    ID  EX  MEM WB
Data hazards for branches
• if a comparison register is a destination of the immediately preceding load instruction
  – need 2 stall cycles
    lw  $1, addr          IF  ID  EX  MEM WB
    beq stalled               IF  ID
    beq stalled                       ID
    beq $1, $0, target                    ID  EX  MEM WB
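
The stall counts on these slides follow one pattern, sketched below in Python. The timing model is an assumption made for illustration: a 5-stage pipeline with the branch comparison in ID, ALU results forwardable after EX, and load data forwardable after MEM. The names RESULT_READY_AFTER and branch_stall_cycles are hypothetical.

```python
# Cycle (counting the producer's IF as cycle 0) after which the producer's
# result can be forwarded: end of EX for ALU ops, end of MEM for loads.
# Assumed timing model, not taken from the slides.
RESULT_READY_AFTER = {"alu": 2, "load": 3}


def branch_stall_cycles(producer: str, distance: int) -> int:
    """Stall cycles for a branch that compares its registers in ID.

    producer: "alu" or "load", the kind of instruction writing the register
    distance: 1 if the producer immediately precedes the branch,
              2 if it is the 2nd preceding instruction, and so on
    """
    # Without stalls the branch is in ID during cycle distance + 1, and it
    # needs the value to be ready by the end of the cycle before that.
    return max(0, RESULT_READY_AFTER[producer] - distance)


for producer in ("alu", "load"):
    for distance in (1, 2, 3):
        print(producer, distance, branch_stall_cycles(producer, distance))
```

Under this model an immediately preceding ALU instruction costs 1 stall, an immediately preceding load costs 2, a load two instructions back costs 1, and anything further away is covered by forwarding.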
Dynamic branch prediction
• deeper and superscalar pipelines
  – branch penalty is more significant
• use dynamic prediction
  – branch prediction buffer (aka branch history table)
  – indexed by recent branch instruction addresses
  – stores outcome (taken/not taken)
• dynamic prediction: execute a branch
  – check table, expect the same outcome
  – start fetching from fall-through or target
  – if wrong, flush pipeline and flip prediction
1-bit predictor: shortcoming
• inner loop branches mispredicted twice!
    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer
  – mispredict as taken on last iteration of inner loop
  – then mispredict as not taken on first iteration of inner loop next time around
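
The double misprediction can be checked with a small simulation. This is a sketch under stated assumptions: a single 1-bit entry for the inner-loop branch, initialised to "not taken", and an inner loop of 5 iterations run 3 times. The function name and loop sizes are illustrative only.

```python
def simulate_one_bit(outcomes, initial_prediction=False):
    """Count mispredictions of a 1-bit predictor over a list of outcomes.

    outcomes: booleans, True = branch taken, False = not taken
    """
    prediction = initial_prediction
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken  # 1 bit: always remember the most recent outcome
    return mispredictions


# Inner-loop branch: taken 4 times, then falls through on the last iteration.
inner_loop = [True] * 4 + [False]
# The outer loop runs the inner loop 3 times.
print(simulate_one_bit(inner_loop * 3))  # prints 6
```

Each run of the inner loop costs two mispredictions: one on its first iteration and one on its last.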
2-bit predictor
• only change prediction on two successive mispredictions
[Figure: predictor state diagram not preserved]
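
One common way to get this behaviour is a 2-bit saturating counter, sketched below. The slide's own state diagram is not preserved, so the exact transitions here are an assumption; the class name TwoBitPredictor and the helper count_mispredictions are illustrative.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    mispredictions = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            mispredictions += 1
        predictor.update(taken)
    return mispredictions


# Inner-loop branch: taken 4 times, then not taken; the loop is run 3 times.
inner_loop = [True] * 4 + [False]
print(count_mispredictions(inner_loop * 3, TwoBitPredictor()))  # prints 5
```

After the two warm-up mispredictions, each run of the inner loop costs only one misprediction (on loop exit), because a single not-taken outcome no longer flips the prediction.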
Calculating the branch target
• even with predictor, still need to calculate the target address
  – 1-cycle penalty for a taken branch
• branch target buffer
  – cache of target addresses
  – indexed by PC when instruction fetched
  – if hit and instruction is branch predicted taken, can fetch target immediately
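
A branch target buffer can be pictured as a small direct-mapped cache. The sketch below is illustrative only: the table size, the indexing by word address, and the names BranchTargetBuffer, lookup and update are assumptions, not details from the slides.

```python
class BranchTargetBuffer:
    """Direct-mapped cache of branch targets, looked up with the fetch PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}  # index -> (branch_pc, target, predict_taken)

    def _index(self, pc):
        # Instructions are word aligned, so drop the two low bits first.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Return the address to fetch next for the instruction at pc."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]  # hit and predicted taken: fetch target at once
        return pc + 4        # miss or predicted not taken: fall through

    def update(self, pc, target, taken):
        """Record the resolved branch outcome and target."""
        self.table[self._index(pc)] = (pc, target, taken)


btb = BranchTargetBuffer()
print(btb.lookup(40))   # prints 44 (nothing known about this PC yet)
btb.update(40, 72, taken=True)
print(btb.lookup(40))   # prints 72
```

The stored branch PC acts as a tag, so a different instruction that maps to the same entry falls through to PC + 4 instead of using a stale target.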
The role of the compiler
• compilers can have a huge impact on performance
  – register allocation
  – instruction selection
  – data placement
• optimisation is subordinate to correctness
  – must always compile against ISA specification
  – can try to optimise code according to architecture
• CPU specific optimisation may reduce performance
  – optimise for P4 → might be slower than generic code on P3
• ISA extensions remove backwards compatibility
  – optimise for P4 → SSE not available on P2
Compiling to avoid hazards
• data hazards
  – instruction scheduling: avoid load-use data hazard
  – register allocation: avoid immediate re-use of registers
  – MIPS: large number of registers to make this easier
• structural hazards
  – instruction selection: select simple instructions
  – e.g.: sll $1,$2,1 vs. add $1,$2,$2
  – instruction scheduling: move instructions apart
• control hazards
  – instruction selection: eliminate branches if possible
  – e.g.: cmov: conditional move
  – e.g.: predicated execution
Exceptions and interrupts
• unexpected events requiring change in flow of control
  – different ISAs use the terms differently
• exception: internal signal, arises from within the CPU
  – e.g. undefined opcode, overflow, syscall, …
• interrupt: external signal, source is outside CPU
  – e.g. external IO: hard disk saying “your data is ready now!”
• must handle them without sacrificing performance
  – exceptions are... exceptional: not the common/expected case
  – interrupts are frequent, but not that frequent
    • CPU instruction rate: >1 GHz; interrupt rate <10 kHz
Handling exceptions in MIPS
• exceptions managed by System Control Coprocessor
  – follows set of steps to record and handle exception
  1. save PC of offending (or interrupted) instruction
     – in Exception Program Counter, EPC
  2. save indication of the problem
     – in Cause Register
     – 0 for undefined opcode, 1 for overflow
  3. jump to handler at 8000 0180
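
The three steps can be sketched as a tiny state update. This is an illustration, not the coprocessor's actual interface: the Cpu record and raise_exception are hypothetical names, the cause codes are the simplified 0/1 values from the slide, and the handler address is the 8000 0180 used in the later exception example.

```python
from dataclasses import dataclass

HANDLER_ADDRESS = 0x8000_0180   # single entry point for all exceptions
CAUSE_UNDEFINED_OPCODE = 0      # simplified cause codes from the slide
CAUSE_OVERFLOW = 1


@dataclass
class Cpu:
    pc: int = 0
    epc: int = 0
    cause: int = 0


def raise_exception(cpu: Cpu, offending_pc: int, cause: int) -> None:
    cpu.epc = offending_pc      # 1. save PC of the offending instruction
    cpu.cause = cause           # 2. save indication of the problem
    cpu.pc = HANDLER_ADDRESS    # 3. jump to the handler


cpu = Cpu(pc=0x50)
raise_exception(cpu, offending_pc=0x4C, cause=CAUSE_OVERFLOW)
print(hex(cpu.epc), cpu.cause, hex(cpu.pc))  # prints 0x4c 1 0x80000180
```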
An alternate mechanism
• vectored interrupts
  – handler address determined by Cause Register
• example:
  – undefined opcode: C000 0000
  – overflow: C000 0020
  – …: C000 0040
• instructions either
  – deal with the interrupt, or
  – jump to real handler
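
The example addresses are consistent with a base address plus a fixed spacing per cause. The sketch below assumes exactly that: consecutive cause codes 0, 1, 2, … and a spacing of 0x20 bytes (room for 8 instructions), both inferred from the three addresses on the slide rather than stated there. The constant and function names are illustrative.

```python
VECTOR_BASE = 0xC000_0000   # address for cause 0 in the slide's example
VECTOR_SPACING = 0x20       # inferred: 32 bytes, i.e. 8 instructions per slot


def handler_address(cause: int) -> int:
    """Entry address for a given cause code under the assumed layout."""
    return VECTOR_BASE + cause * VECTOR_SPACING


for cause, name in enumerate(["undefined opcode", "overflow", "..."]):
    print(f"{name}: {handler_address(cause):08X}")
```

With only 8 instruction slots per entry, a longer handler has to start with a jump to the real handler, which is the second option listed on the slide.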
Handler actions
• read Cause Register, and transfer to relevant handler
• determine action required
• if restartable
  – take corrective action
  – use EPC to return to program
• otherwise
  – terminate program
  – report error using EPC, Cause, …
Exceptions in a pipeline
• another form of control hazard
• consider exception on add in EX stage
    add $1, $2, $1
  – prevent $1 from being clobbered
  – complete previous instructions
  – flush add and subsequent instructions
  – set Cause and EPC register values
  – transfer control to handler
• similar to mispredicted branch
  – use much of the same hardware
Pipeline with exceptions
[Figure: datapath diagram not preserved]
Exception properties
• restartable exceptions
  – pipeline can flush the instruction
  – handler executes, then returns to the instruction
  – refetched and executed from scratch
• PC saved in EPC register
  – identifies offending instruction
  – actually PC + 4 is saved, handler must adjust
Exception example
• exception on add in
    40 sub $11, $2, $4
    44 and $12, $2, $5
    48 or  $13, $2, $6
    4C add $1, $2, $1
    50 slt $15, $6, $7
    54 lw  $16, 50($7)
    …
• handler
    80000180 sw $25, 1000($0)
    80000184 sw $26, 1004($0)
    …
Exception example
[two figure slides; diagrams not preserved]
Multiple exceptions
• pipelining overlaps multiple instructions
  – could have multiple exceptions at once
• simple way: deal with exception from earliest instruction
  – flush subsequent instructions
  – “precise” exceptions
• in complex pipelines
  – multiple instructions issued per cycle
  – out-of-order completion
  – maintaining precise exceptions is difficult!
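
The "simple way" can be sketched as a priority choice over pipeline stages. This assumes a simple in-order 5-stage pipeline, where the instruction in a later stage is always earlier in program order; the names STAGES_OLDEST_FIRST and select_exception are illustrative.

```python
# In an in-order pipeline the instruction furthest along is the oldest.
STAGES_OLDEST_FIRST = ["MEM", "EX", "ID", "IF"]


def select_exception(pending):
    """Pick the exception to handle when several are raised in one cycle.

    pending: dict mapping a stage name to the cause raised in that stage
    Returns (stage, cause, stages_to_flush), or None if nothing is pending.
    """
    for position, stage in enumerate(STAGES_OLDEST_FIRST):
        if stage in pending:
            # Handle the earliest instruction; flush it and everything younger.
            return stage, pending[stage], STAGES_OLDEST_FIRST[position:]
    return None


# An overflow in EX and an undefined opcode in ID during the same cycle:
print(select_exception({"ID": "undefined opcode", "EX": "overflow"}))
```

Here the overflow is handled first because the instruction in EX is earlier in program order; the undefined opcode is flushed and would be raised again if that instruction is re-executed.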
Pipelining: summary
• ISA influences design of datapath and control
• datapath and control influence design of ISA
• pipelining improves instruction throughput
  – using parallelism
  – more instructions completed per second
  – but latency for each instruction not reduced
• hazards: structural, data, control
• advanced issues
  – instruction-level parallelism, system-on-chip
  – courses: custom computing, advanced architectures
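
The throughput-versus-latency point can be made concrete with the usual idealised cycle count. This is a sketch under simplifying assumptions: no stalls or flushes, equal stage times, and an unpipelined baseline that spends one cycle per stage for every instruction. The function names are illustrative.

```python
def pipelined_cycles(stages: int, instructions: int) -> int:
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    return stages + instructions - 1


def unpipelined_cycles(stages: int, instructions: int) -> int:
    """Baseline: every instruction passes through all stages on its own."""
    return stages * instructions


def speedup(stages: int, instructions: int) -> float:
    return unpipelined_cycles(stages, instructions) / pipelined_cycles(stages, instructions)


print(pipelined_cycles(5, 1))      # prints 5 (latency of one instruction is unchanged)
print(pipelined_cycles(5, 1000))   # prints 1004
print(round(speedup(5, 1000), 2))  # prints 4.98
```

A single instruction still takes 5 cycles, but for long instruction streams the speedup approaches the number of stages; hazards reduce it in practice.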