EECS 252 Graduate Computer Architecture
Lec 5 – Out-of-Order Completion

David Culler
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~culler
http://www-inst.eecs.berkeley.edu/~cs252

2/1/2005 CS252 SP05, Lec 5 OOC

Review
• Data stationary pipeline control
  – Micro-instruction & PC track down the pipe
  – Accumulate state
• Implementing bubbles, stalls, forwarding, multicycle operations
• Branch prediction
  – Static vs dynamic
  – N-bit saturating counters
  – Local and global history
  – Correlated predictors, Tournament, GSHARE
  – Branch target buffers, return address predictors

Outline
• Relax pipeline design to allow out-of-order completions
  – Cray-1: register reservations
• Relax pipeline to allow out-of-order issue
  – CDC 6600: Scoreboard
• Compiler optimizations for ILP
• Superscalar issue
• Maybe go back and finish exceptions

Pipelining with Reg. Reservations
• Assumptions
  1. Multiple pipelined function units of different latency
    » able to accept operations at issue rate
    » may be exceptions (e.g., divide)
  2. Issue instructions in order
  3. Operand fetch in order
  4. Completion out of order
    » short ops may bypass long ones
  5. Some shared resources (e.g., reg write port)
• Implications
  – WAR hazard still resolved by pipeline flow (2 & 3)
  – RAW, WAW, and structural still present
• Design philosophy (ala Cray)
  – Resolve hazards as instruction is issued into pipeline
  – Pipeline is non-blocking

Resolving Structural Hazards
• With static pipeline flow, resource usage is known in advance
• Instruction requires X at t ticks after issue
• If reservation X[t] is clear, issue inst and set bit
• Otherwise, delay till clear
• At each tick the reservation X[] shifts by one, so will eventually clear
• Multiple resources? Range of delays?
[Figure: "shift reg." for resource X, marked with NOW, the delay till the resource is required, and the slot where the resource is used]

Basic Issue Model
• Issue unit checks for all hazards
  – Structural, RAW, WAW
• Holds issue while hazards exist
• Upon issue, register values provided to F.U
• Executes to completion without blocking
[Figure: pipeline blocks Instr. Fetch and Op Fetch & Issue; issued fields op, valA, valB, rD]
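The shift-register reservation scheme from "Resolving Structural Hazards" can be sketched in a few lines of Python. This is only an illustrative model, not the Cray-1 logic: the class name `ReservationShiftRegister`, the helper `issue_when_clear`, and the 8-slot depth are assumptions made for the example, and it handles a single resource X with one usage slot per instruction.

```python
class ReservationShiftRegister:
    """Reservation bits for one resource X (hypothetical model).

    bits[t] is set when X is already claimed t ticks from NOW.
    """

    def __init__(self, depth):
        self.bits = [False] * depth

    def is_clear(self, t):
        return not self.bits[t]

    def reserve(self, t):
        self.bits[t] = True

    def tick(self):
        # Shift by one: slot 0 (NOW) drops out, a clear slot enters at the far end.
        self.bits = self.bits[1:] + [False]


def issue_when_clear(resv, t):
    """Issue an instruction that needs X at t ticks after issue.

    Holds issue (ticking the clock) until X[t] is clear, then sets the bit.
    Returns the number of ticks the instruction was delayed.
    """
    delay = 0
    while not resv.is_clear(t):
        resv.tick()
        delay += 1
    resv.reserve(t)
    return delay


# Example: a shared register write port.
write_port = ReservationShiftRegister(depth=8)
delay_a = issue_when_clear(write_port, 3)   # op A writes 3 ticks after issue
write_port.tick()                           # one cycle passes
delay_b = issue_when_clear(write_port, 2)   # op B would collide with A, so it waits
```

Because the register shifts every tick, any reservation eventually falls off the end, which is why the delay loop always terminates. Handling several resources or a range of usage times per instruction (the slide's open questions) would need one such register per resource and a check over every slot the instruction uses.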
Hazard Resolution
• Structural
  – Op code => resource usage
  – Check resource resv
  – Set on issue
• Data
  – Add a reservation bit on each register
  – Check RegRsv for source and destination registers
  – Hold issue till clear
  – Set bit on destination register
  – Clear bit on dest reg. write
• Questions:
  – Forwarding?
Motorola 88000 “scoreboard” [sic]
[Figure: pipeline blocks Instr. Fetch and Op Fetch & Issue; issued fields op, valA, valB, rD]

Example
    Add r1 := r2 + r3
    Add r2 := r2 + 4
    Lod r5 := mem[r1+16]
    Lod r6 := mem[r1+32]
    Mul r7 := r5 * r6
    Bnz r1, foo
    Sub r7 := r0 - r0
[Figure: pipeline blocks Instr. Fetch and Op Fetch & Issue; issued fields op, valA, valB, rD]

Cray-1 Discussion
• Technological Assumptions
• Why no forwarding?
• Longevity of the ISA?
• Instruction cache?
  – Four blocks (RR) of 16x4 “parcels”
  – Issue delayed on miss
    » 2 CP for change of block
• Branch delays?
  – Branch op code delayed till second parcel is obtained
  – 5 clocks (reg zero, nz, pos, neg)
• I/O system?

Pipelining with Scoreboarding
• Assumptions
  1. Multiple function units of different latency
    – Especially non-pipelined units
  2. Issue instructions whenever FU available, unless it would cause multiple outstanding writes to the same register
    – Operand fetch out of order
    – Completion out of order
  3. Some shared resources (e.g., reg write port)
• Implications
  – Need to resolve RAW, WAR, WAW and structural
• Design philosophy (ala CDC 6600)
  – Issue unit tracks all outstanding dependences
  – Holds issue if structural or WAW hazard
  – Informs FUs when hazards resolved
  – FUs fetch operands from register file and proceed

Scoreboard Operation
• Issue
  – Hold while FU unavailable or destination register reserved (by FU f)
• Read operands
  – SB informs FU with all sources available to fetch & go
  – Limited by read ports
• Write back
  – SB schedules one FU to write
  – Waits until no FU is waiting to fetch the (old version) of reg
[Figure: Instr. Fetch, Scoreboard Issue & Resolve, and FUs; issued fields op, rA, rB, rD; each FU performs op fetch then ex]

Example
    Add r1 := r2 + r3
    Add r2 := r2 + 4
    Lod r5 := mem[r1+16]
    Lod r6 := mem[r1+32]
    Mul r7 := r5 * r6
    Bnz r1, foo
    Sub r7 := r0 - r0
[Figure: Instr. Fetch, Scoreboard Issue & Resolve, and FUs; issued fields op, rA, rB, rD; FU fields op, valA, valB, rD with op fetch and ex steps]
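The three scoreboard checks listed under "Scoreboard Operation" (issue, read operands, write back) can be sketched as predicates over a small table. This is a simplified model under stated assumptions, not the CDC 6600 design: the `Scoreboard` class, its method names, and the function-unit names `add1`, `add2`, `mem`, `mul` are all hypothetical; execution latency, read-port limits, and branches are not modelled.

```python
class Scoreboard:
    """Toy scoreboard: FU occupancy plus one pending writer per register."""

    def __init__(self, fu_names):
        self.fu = {name: None for name in fu_names}  # FU -> in-flight entry, None if free
        self.writer = {}                              # register -> FU reserved to write it

    def can_issue(self, fu, dest):
        # Hold while the FU is unavailable (structural) or dest is reserved (WAW).
        return self.fu[fu] is None and dest not in self.writer

    def issue(self, fu, dest, srcs):
        assert self.can_issue(fu, dest)
        # For each source, remember which FU will produce it (None = already in reg file).
        producers = {s: self.writer.get(s) for s in srcs}
        self.fu[fu] = {"dest": dest, "producers": producers, "read": False}
        self.writer[dest] = fu

    def can_read(self, fu):
        # RAW: operands may be fetched only when every source has no pending producer.
        entry = self.fu[fu]
        return not entry["read"] and all(p is None for p in entry["producers"].values())

    def read_operands(self, fu):
        assert self.can_read(fu)
        self.fu[fu]["read"] = True

    def can_write(self, fu):
        # WAR: wait until no other FU still has to fetch the old value of our dest.
        entry = self.fu[fu]
        if not entry["read"]:
            return False
        for other, o in self.fu.items():
            if other == fu or o is None or o["read"]:
                continue
            if entry["dest"] in o["producers"] and o["producers"][entry["dest"]] is None:
                return False
        return True

    def write_back(self, fu):
        assert self.can_write(fu)
        del self.writer[self.fu[fu]["dest"]]
        self.fu[fu] = None
        # Wake up FUs that were waiting on this result.
        for o in self.fu.values():
            if o is not None:
                for s in list(o["producers"]):
                    if o["producers"][s] == fu:
                        o["producers"][s] = None


# First instructions of the slide example, on hypothetical function units.
sb = Scoreboard(["add1", "add2", "mem", "mul"])
sb.issue("add1", "r1", ["r2", "r3"])             # Add r1 := r2 + r3
sb.issue("add2", "r2", ["r2"])                   # Add r2 := r2 + 4
sb.issue("mem", "r5", ["r1"])                    # Lod r5 := mem[r1+16]
structural_hold = not sb.can_issue("mem", "r6")  # second Lod: mem unit busy
raw_hold = not sb.can_read("mem")                # r1 not yet written by add1
sb.read_operands("add2")
war_hold = not sb.can_write("add2")              # add1 still needs the old r2
sb.read_operands("add1")
war_cleared = sb.can_write("add2")
sb.write_back("add1")
raw_cleared = sb.can_read("mem")
```

In this model, recording the producer of each source at issue time is what lets `Add r2 := r2 + 4` read the old `r2` while reserving the new one. A fuller treatment would add per-FU latencies and arbitration for the shared write port.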
Discussion
• Technological Assumptions
• Extend to allow forwarding?
• How do loads and stores work?
• Instruction cache?
• I/O system?

Case Study: MIPS R4000 (200 MHz)
[Figure: pipeline stages IF IS RF EX DF DS TC WB over datapath blocks instr mem, reg, ALU, data mem, reg]
• 8 Stage Pipeline:
  – IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
  – IS: second half of access to instruction cache.
  – RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
  – EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
  – DF: data fetch, first half of access to data cache.
  – DS: second half of access to data cache.
  – TC: tag check, determine whether the data cache access hit.
  – WB: write back for loads and register-register operations.
• 8 Stages: What is impact on Load delay? Branch delay? Why?

Case Study: MIPS R4000
TWO Cycle Load Latency
    IF IS RF EX DF DS TC WB
       IF IS RF EX DF DS TC
          IF IS RF EX DF DS
             IF IS RF EX DF
                IF IS RF EX
                   IF IS RF
                      IF IS
                         IF
THREE Cycle Branch Latency (conditions evaluated during EX phase)
    IF IS RF EX DF DS TC WB
       IF IS RF EX DF DS TC
          IF IS RF EX DF DS
             IF IS RF EX DF
                IF IS RF EX
                   IF IS RF
                      IF IS
                         IF
Delay slot plus two stalls
Branch likely cancels delay slot if not taken

MIPS R4000 Floating Point
• FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:

    Stage  Functional unit  Description
    A      FP adder         Mantissa ADD stage
    D      FP divider       Divide pipeline stage
    E      FP multiplier    Exception test stage
    M      FP multiplier    First stage of multiplier
    N      FP multiplier    Second stage of multiplier
    R      FP adder         Rounding stage
    S      FP adder         Operand shift stage
    U                       Unpack FP numbers

R4000 Performance
• Not ideal CPI of 1:
  – Load stalls (1 or 2 clock cycles)
  – Branch stalls (2 cycles + unfilled slots)
  – FP result stalls: RAW data hazard (latency)
  – FP structural stalls: Not enough FP hardware (parallelism)
[Figure: bar chart, vertical axis 0 to 4.5, for doduc, espresso, gcc, nasa7, ora, eqntott, li, spice2g6, su2cor, tomcatv; components: Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls]

MIPS FP Pipe Stages

    FP Instr        1  2    3          4     5  6    7    8  …
    Add, Subtract   U  S+A  A+R        R+S
    Multiply        U  E+M  M          M     M  N    N+A  R
    Divide          U  A    R          D^28  …  D+A  D+R, D+R, D+A, D+R, A, R
    Square root     U  E    (A+R)^108  …     A  R
    Negate          U  S
    Absolute value  U  S
    FP compare      U  A    R

Stages:
    M  First stage of multiplier     A  Mantissa ADD stage
    N  Second stage of multiplier    D  Divide pipeline stage
    R  Rounding stage                E  Exception test stage
    S  Operand shift stage           U  Unpack FP numbers
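The "FP structural stalls" item can be made concrete with the stage table from "MIPS FP Pipe Stages". The sketch below checks whether an FP add issued some cycles after an FP multiply needs a stage unit the multiply is using in the same cycle. It is an illustration only: the names `STAGES`, `conflicts`, and `structural_stalls` are hypothetical, it assumes exactly one copy of each stage unit, and it covers only the add and multiply rows.

```python
# Stage usage per cycle after issue, copied from the "MIPS FP Pipe Stages" table.
STAGES = {
    "add":      ["U", "S+A", "A+R", "R+S"],
    "multiply": ["U", "E+M", "M", "M", "M", "N", "N+A", "R"],
}


def usage(op):
    """Return a list where entry i is the set of stage units busy in cycle i."""
    return [set(step.split("+")) for step in STAGES[op]]


def conflicts(first, second, gap):
    """True if `second`, issued `gap` cycles after `first`, needs a busy unit."""
    a, b = usage(first), usage(second)
    for i, units in enumerate(b):
        j = i + gap  # cycle of `first` that overlaps cycle i of `second`
        if j < len(a) and a[j] & units:
            return True
    return False


def structural_stalls(first, second, gap):
    """Cycles `second` must be held at issue until no unit is shared."""
    stalls = 0
    while conflicts(first, second, gap + stalls):
        stalls += 1
    return stalls


stalls_by_gap = {gap: structural_stalls("multiply", "add", gap) for gap in range(1, 8)}
```

Under these assumptions an add issued 4 cycles after a multiply is held for 2 cycles and one issued 5 cycles after is held for 1 cycle, because the multiply uses the adder's A and R stages in its last two cycles. All other gaps from 1 to 7 issue without a structural stall.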